Constructing the corpus of Science Fiction Anime dialogue (SciFAn)


by Kelvin K.-H. Lee

For my recently-completed PhD thesis, I combined corpus linguistics and sociolinguistics to investigate characterisation through the use of first-person pronouns in Japanese animation, i.e. anime (see Figure 1). To do so, I first had to design and build my own specialised corpus of anime dialogue – namely, the corpus of Science Fiction Anime dialogue (SciFAn). While there are existing Japanese corpora such as the Balanced Corpus of Contemporary Written Japanese (Maekawa et al., 2011–) and the Oxford-NINJAL Corpus of Old Japanese (National Institute for Japanese Language and Linguistics, 2020–), such corpora tend not to include dialogue from anime films or television series.

Figure 1 The Bebop crew from Cowboy Bebop (left to right, top to bottom): Spike (gooie_duck, 2011d), Jet (gooie_duck, 2011c), Faye (gooie_duck, 2011b), Ein the corgi, and Ed (gooie_duck, 2011a)

For my corpus, I chose five highly rated and popular science fiction anime series: Kōdo Giasu: Hangyaku no Rurūshu (English title: Code Geass: Lelouch of the Rebellion, 2006–2007), Cowboy Bebop (1998–1999), Gintama (2006–2010), Kiseijū Sei no Kakuritsu (English title: Parasyte: The Maxim, 2014–2015), and Steins;Gate (2011). As a starting point, I mainly used Japanese fan subtitles, i.e. intralingual subtitles. First, the subtitles had to be modified: for example, timestamps had to be removed and speaker names had to be added manually (Figure 2). Second, since written Japanese is character-based and does not normally separate words with spaces, the dialogue had to be segmented before it could be loaded into the corpus linguistic software that I used, AntConc (Anthony, 2022). AntConc recognises words where the word boundaries are marked by spaces. This is standard in languages such as English, but is rarely, if ever, done in Japanese writing.

Figure 2 A sample of the dialogue (in Excel) for Steins;Gate, episode 1, Hajimari to owari no purorōgu (‘Prologue of the Beginning and End’) with speaker names added (in red box)
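The timestamp removal step can be sketched in a few lines of Python. This is only an illustration, not the procedure actually used for SciFAn: it assumes the subtitles are in SubRip (.srt) format, which the post does not specify, and the function name strip_srt and the sample lines are invented for the example. Adding speaker names was a manual step and is not covered here.

```python
import re

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:03,500".
# Assumption: the subtitles use SubRip (.srt) timing lines.
TIMING = re.compile(
    r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}"
)

def strip_srt(text):
    """Return only the dialogue lines from SRT-formatted subtitle text."""
    dialogue = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, cue numbers and timing lines.
        # Caveat: a dialogue line made up only of digits is also dropped.
        if not line or line.isdigit() or TIMING.match(line):
            continue
        dialogue.append(line)
    return dialogue

# Invented sample text, not taken from any of the five series.
sample = """1
00:00:01,000 --> 00:00:03,500
これは例文です

2
00:00:04,000 --> 00:00:06,000
もう一つの例文
"""

lines = strip_srt(sample)
for line in lines:
    print(line)
```

The cleaned lines could then be pasted into a spreadsheet, one line per row, so that a speaker name can be entered next to each line by hand.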

To segment the Japanese text, I used text segmenting software. A number of text segmenters and morphological parsers can be used for Japanese, including SegmentAnt (Anthony, 2017), MeCab (Kudō, 2013), ChaSen (NAIST Computational Linguistics Lab, 2007), and kuromoji (Atilika, 2014; Lambertsen et al., 2016). Because segmentation is automated, such tools can have high error rates, either segmenting words inconsistently or splitting words that do not require segmentation. After trialling both SegmentAnt and MeCab, I chose MeCab as the more accurate segmenter. A drawback is that it requires programming knowledge to use.[i]
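To make the whitespace problem concrete, the sketch below turns morphological-analyser output into a space-delimited line of the kind AntConc can tokenise. It does not call MeCab itself. It parses a hard-coded sample written in the style of MeCab's default output (one surface form per row, followed by a tab and comma-separated features, with EOS closing each sentence). The feature values in the sample and the function name mecab_output_to_tokens are illustrative assumptions, not output from the amended dictionary described in this post.

```python
def mecab_output_to_tokens(output):
    """Collect surface forms from MeCab-style output, one list per sentence."""
    sentences, current = [], []
    for row in output.splitlines():
        if row == "EOS":  # EOS closes each analysed sentence
            sentences.append(current)
            current = []
        elif row:
            # The surface form is everything before the first tab.
            current.append(row.split("\t", 1)[0])
    return sentences

# Illustrative sample in the style of MeCab's default output.
sample_output = (
    "すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ\n"
    "も\t助詞,係助詞,*,*,*,*,も,モ,モ\n"
    "もも\t名詞,一般,*,*,*,*,もも,モモ,モモ\n"
    "も\t助詞,係助詞,*,*,*,*,も,モ,モ\n"
    "もも\t名詞,一般,*,*,*,*,もも,モモ,モモ\n"
    "の\t助詞,連体化,*,*,*,*,の,ノ,ノ\n"
    "うち\t名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ\n"
    "EOS\n"
)

tokens = mecab_output_to_tokens(sample_output)[0]
unsegmented = "".join(tokens)  # the sentence as it would appear in a subtitle
segmented = " ".join(tokens)   # the same sentence with spaces between tokens

# Whitespace splitting sees the unsegmented sentence as a single "word".
print(len(unsegmented.split()), len(segmented.split()))
print(segmented)
```

MeCab's command-line tool also has a space-separated output mode (-Owakati), but the tab-separated form keeps the part-of-speech tags, which is useful when checking the segmentation afterwards.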

The amended MeCab version that I was able to use contains Japanese neologisms and colloquialisms (e.g. 事故る jikoru ‘to have an accident’, コピペ kopipe ‘copy-and-paste’), proper names (e.g. 健人 kenji), as well as word forms and morphemes from Kansai dialects (e.g. ほんま honma lit. ‘truth/reality’). With the amended dictionary, the resulting tagged segmentations were much more accurate than those produced with the standard MeCab UTF-8 dictionary. Some segmentation errors remained, albeit fewer, but because they were tagged consistently, they were easier for me to locate and correct in Notepad++ (Ho, 2021).
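Because consistent errors recur in the same form, they can be fixed with search-and-replace. The corrections for SciFAn were made by hand in Notepad++. The Python sketch below shows the same idea under stated assumptions: the correction pairs are hypothetical mis-segmentations invented for the example (the post does not list the actual errors), and the function name apply_corrections is likewise made up.

```python
import re

# Hypothetical correction table: wrongly split sequence -> intended token.
CORRECTIONS = {
    "事故 る": "事故る",
    "コピ ペ": "コピペ",
}

def apply_corrections(line, corrections):
    """Rejoin known mis-segmented sequences in a space-delimited line."""
    for wrong, right in corrections.items():
        # Match only whole tokens: no non-space character may come
        # directly before or after the matched sequence.
        pattern = r"(?<!\S)" + re.escape(wrong) + r"(?!\S)"
        line = re.sub(pattern, lambda _m, r=right: r, line)
    return line

# Invented example line, already segmented with spaces.
print(apply_corrections("昨日 事故 る ところ だっ た", CORRECTIONS))
```

Matching on whole tokens rather than raw substrings avoids accidentally merging a sequence that only looks like a known error because it sits inside longer tokens.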

Although these issues with the data made building the corpus very time-consuming, the resulting segmented corpus allowed me to analyse the data efficiently in AntConc. This, in turn, enabled me to identify the complex ways in which Japanese first-person pronouns work to achieve characterisation, and to combine corpus linguistics with the sociolinguistic concept of indexicality. In addition, the thesis contributes to Japanese linguistics by demonstrating how corpus linguistic techniques can be used to examine spoken Japanese data, specifically anime dialogue. You can read more about my analysis and findings in my PhD thesis Language and Character Identity: A Study of First-Person Pronouns in a Corpus of Science Fiction Anime Dialogue, which is freely available for download at https://hdl.handle.net/2123/28687.


[i] Acknowledgments

Prof. Laurence Anthony kindly trained me in using MeCab during his 2019 visit to the Sydney Corpus Lab. The training mainly involved him showing me which software to use to run MeCab, the commands needed to start the segmentation and tagging processes, and how to troubleshoot any software-related issues that arose. Later in 2019, Prof. Kevin Heffernan offered to let me use a more accurate version of MeCab, which he supplemented with a modified UTF-8 MeCab dictionary. I am very grateful to both scholars for their support.

References

Anthony, L. (2017). SegmentAnt (Version 1.1.3) [Computer software]. Tokyo, Japan: Waseda University. Available from http://www.laurenceanthony.net/

Anthony, L. (2022). AntConc (Version 4.0.11) [Computer software]. Tokyo, Japan: Waseda University. Available from http://www.laurenceanthony.net/

Atilika. (2014). Kuromoji. https://www.atilika.org/

gooie_duck. (2011a). Edward Wong Hau Pepelu Tivruskii IV and Ein [Image]. Flickr. https://www.flickr.com/photos/gooieduck/6154459101/in/photostream/

gooie_duck. (2011b). Faye Valentine [Image]. Flickr. https://www.flickr.com/photos/gooieduck/6154459051/in/photostream/

gooie_duck. (2011c). Jet Black [Image]. Flickr. https://www.flickr.com/photos/gooieduck/6154458941/in/photostream/

gooie_duck. (2011d). Spike Spiegel [Image]. Flickr. https://www.flickr.com/photos/gooieduck/6154458869/in/photostream/

Ho, D. (2021). Notepad++ (Version 8.1.9) [Computer software]. Paris, France: SYSTRAN. Available from https://notepad-plus-plus.org/

Kudō, T. (2013). MeCab (Version 0.996) [Computer software]. Tokyo: Google Inc. Available from http://taku910.github.io

Lambertsen, G., Hasegawa, M., & Moen, C. (2013). Kuromoji (Version 1.0) [Computer software]. Tokyo, Japan: Atilika. Available from https://github.com/atilika/kuromoji

Maekawa, K., Yamazaki, M., Ogiso, T., Maruyama, T., Ogura, H., Kashino, W., Koiso, H., Yamaguchi, M., Tanaka, M., & Den, Y. (2011–). Balanced Corpus of Contemporary Written Japanese. Available online at https://ccd.ninjal.ac.jp/bccwj/en/

NAIST Computational Linguistics Lab. (2007). ChaSen [Computer software]. Ikoma, Japan: Nara Institute of Science and Technology (NAIST). Available from https://chasen-legacy.osdn.jp/

National Institute for Japanese Language and Linguistics. (2020–). Oxford-NINJAL Corpus of Old Japanese. Available online at https://oncoj.ninjal.ac.jp/