Australian corpora: the state of affairs, by Annabelle Lukin

How are we faring in the development of Australian corpora? It is fair to say that there is scope for improving our grounding in empirical data, both in the study of Australia’s Indigenous languages and in studies of Australian English.

For researchers in Indigenous languages, the ARC Centre for the Dynamics of Language has an overview of the state of play. The team is archiving both existing and newly developing collections in Australia’s Indigenous languages (as well as Papuan and Austronesian languages of Indonesia, PNG, and island Melanesia).

The Centre is connected to PARADISEC (Pacific And Regional Archive for Digital Sources in Endangered Cultures), which hosts endangered language data, including for some Australian Indigenous languages. For the 2019 Year of Indigenous Languages, PARADISEC is running a “mystery language” project, hoping to discover the provenance of some of their as yet unidentified recordings.

The data available to study the sounds, structures or discourse habits of Australian English speakers is small and patchy. While corpora of American and British English have grown over the last couple of decades, we simply don’t have the scale and diversity of Australian English data we need for comprehensive studies of our national dialect/s.

There is at least an “Australian National Corpus” website, which hosts various corpora. The Australian version of the International Corpus of English (ICE-Aus), based on 500 samples of 2,000 words each from various registers (both spoken and written) of Australian English, is available there. ICE-Aus includes samples of published texts from 15 different registers, such as newspaper reports, government and corporate documents, fiction, short stories, popular magazines, and academic discourse. The spoken data includes face-to-face conversations, telephone conversations, monologues, broadcast dialogues, and scripted speech.

There’s also the Australian Corpus of English (ACE), compiled to match Australian data from 1986 with the American (Brown) and British (LOB) corpora of written English from the 1960s. It includes 500 samples of published texts taken from 15 different categories of nonfiction and fiction.

And there’s also a small corpus of Australian Talkback Radio from 2004-2006 (c. 250K words) from both ABC and commercial radio stations. Macquarie University’s Pam Peters developed the ICE-Aus, the ACE, and this talkback radio corpus.

For historians or diachronic linguists, there is a corpus of Early Oz English (called ‘Cooee’), made up of 1353 samples of written text from Australia, New Zealand or Norfolk Island, totalling 1.5 million words. The texts include unpublished letters, books, and historical texts, and run from 1788 to 1900. The corpus was developed by Clemens Fritz. Figure 1 shows the 55 most frequent words in the Early Oz English corpus: it would be interesting to look into the contexts for the discussions around “water” in this historical data, given its prominent place in our current lives.

Figure 1: Top lexical items in the Cooee corpus

The interface at the Australian National Corpus website, no doubt due to lack of funds, has limited functionality: it can produce word frequencies and concordance lines, which can be downloaded. The data files themselves can also be downloaded to your desktop, however, and with the growth of corpus tools you can explore the data yourself.
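For readers who want to try this on a downloaded file, the two outputs mentioned above (word frequencies and concordance lines) can be sketched in a few lines of Python. This is only a minimal illustration using the standard library, not the method used by the Australian National Corpus interface. The function names and the sample sentence are invented for the example; the sentence is not taken from any of the corpora discussed here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(tokens, top_n=10):
    """Return the top_n most frequent tokens as (word, count) pairs."""
    return Counter(tokens).most_common(top_n)


def concordance(tokens, keyword, width=4):
    """Return keyword-in-context lines with up to `width` tokens on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}")
    return lines


# Invented sample text (an assumption for illustration, not corpus data).
sample = (
    "The water was scarce that summer. We carried water from the creek, "
    "and the creek ran dry before the rain came."
)

tokens = tokenize(sample)
print(word_frequencies(tokens, 3))
for line in concordance(tokens, "water"):
    print(line)
```

To run it on real data, you would read a downloaded text file into `sample` instead. Dedicated corpus tools handle tokenisation, metadata and large files far better than this sketch does.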

The Sydney Corpus Lab, launched earlier this year, is home to three Australian datasets. These include the recently compiled “Australian Brown corpus” (put together by Peter Collins and Xinyue Yao). Based on the Brown sampling frame, the corpus is diachronic, with samples from 1931, 1961, 1991 and 2006. The Australian budget speech corpus, which I compiled, contains the annual budget speech from 1981 up to the current year (c. 200,000 words). The Diabetes News Corpus is a corpus of Australian news articles on diabetes from 2013-2017, put together by Monika Bednarek and Georgia Carr.

There is other Australian data floating around, because more and more disciplines are interested in text/linguistic data. Corpus linguists have a chance to connect with scholars in the digital humanities, who are also building data sets. Tim Sherratt – historian, hacker and part-time academic – hosts a GitHub account with, among other things, the complete works of Hansard, and a collection of 20,000 speeches, press releases, and interviews by Australian Prime Ministers (see Figure 2 below).

Figure 2: Data on Australian prime ministers collected by Tim Sherratt

The field needs more resources and closer collaborations, including with colleagues outside linguistics, to help build a better foundation for studies of Australian languages.