A brief history of modern Australian English corpora, by Monika Bednarek

The Sydney Corpus Lab recently presented a timeline that attempts to trace the development of large computer corpora of modern Australian English. This blog post provides further details and includes links for accessing the corpora.

The first sample corpus of Australian English, the Australian Corpus of English (ACE), was compiled by Pam Peters, Peter Collins, and David Blair between 1986 and 1989 to mirror the design of the American BROWN and British LOB corpora.

Concordance of ‘mate’ from ACE

A second sample corpus compiled at the Macquarie University Linguistics Department is the Australian component of the International Corpus of English (ICE-AUS, 1992-1996), which includes spoken as well as written data, and was subsequently annotated. To get access to ACE and ICE-AUS (as well as ART; see below), users can register here, and then access the corpora via a simple interface (interface to be updated).

Concordance of ‘yeah nah/no’ from ICE-AUS

The AusBrown corpus, a new diachronic corpus of written Australian English covering 1931 to 2006, also matches the well-known BROWN ‘family’ of corpora in design. It’s available through the Sydney Corpus Lab CQPweb interface (requires free registration).

Selected spoken corpora include the Australian Radio Talkback (ART) corpus, about a quarter of a million words of talkback radio talk recorded in 2004-2006, and the AusTalk database, audio-visual data for 861 adult speakers (ages 18-83) from 15 different locations in all Australian states and territories. The Sydney Speaks 2010s corpus is currently in preparation. It will consist of recordings of older and younger adult males and females who vary in socio-economic status, region within Sydney, and ethnic community (e.g. Anglo, Italian, Chinese, Greek, Vietnamese, Lebanese). Importantly, there is also a corpus of Australian Sign Language, the Auslan corpus, which contains data from language recording sessions with one hundred deaf native and near-native signers of Auslan (twenty participants each from Adelaide, Brisbane, Melbourne, Perth, and Sydney).


Finally, the Australian National Corpus (AusNC) is not a corpus, but rather a national meta-collection that includes various corpora and collections of Australian English text, and provides access to some of these.

Not all of the corpora surveyed above are corpora of naturally-occurring texts; some of these datasets contain elicited data (e.g. interviews, reading tasks, story-telling tasks). There is currently no corpus of Australian English that is similar in design to large national corpora of contemporary British and American English.

In addition to the development of computer corpora, corpus linguistics has also been included in Australian university curricula, either by integrating corpus linguistic methodology into existing units or by designing new units dedicated to corpus linguistics. The first-ever unit of study in Australia dedicated solely to corpus linguistics was designed and developed by Pam Peters at Macquarie University. As she recalls,

LING 317 was indeed the MQ Corpus Linguistics unit which I designed, developed and convened from about 1993 until my retirement in 2008. As far as I know, it was the first-ever unit of study in Australia dedicated solely to corpus linguistics. Lectures on technical aspects of corpus compilation and processing, and the original corpus website were contributed by Steve Cassidy, who moved to the Computing Department in the early 2000s. After that Canzhong Wu came into the technical role, and he continued to run LING317 for a couple of years, maybe to 2010. The unit was then absorbed into the Linguistics capstone course as a component of ‘empirical linguistics’.

The University of Sydney has offered a senior undergraduate unit on corpus linguistics since 2010, designed and convened by Monika Bednarek, who later founded the Sydney Corpus Lab.

Many thanks to Pam Peters for feedback on a draft version of this blog post. Corrections can be emailed to info@sydneycorpuslab.com. A longer version of this post will appear in a 2020 special issue of the Australian Review of Applied Linguistics (ARAL) dedicated to ‘Corpus Linguistic Approaches to Education in Australia’, and edited by Sydney Corpus Lab members/affiliates Alexandra García, Peter Crosthwaite, and Monika Bednarek.

Selected references

Collins, P. & X. Yao (2019). AusBrown: A new diachronic corpus of Australian English. ICAME Journal 43: 5-21.

Estival, D., S. Cassidy, F. Cox & D. Burnham (2014). AusTalk: an audio-visual corpus of Australian English. 9th Language Resources and Evaluation Conference (LREC 2014). Reykjavik, Iceland. 3105-3109.

Green, E. & P. Peters (1987). Towards a corpus of Australian English. ICAME Journal 11: 1-13.

Haugh, M., K. Burridge, J. Mulder & P. Peters (Eds.) (2009). Selected Proceedings of the 2008 HCSNet Workshop on Designing the Australian National Corpus: Mustering Languages. Somerville, MA: Cascadilla Press. 

Wong, D., S. Cassidy & P. Peters (2011). Updating the ICE annotation system: Tagging, parsing and validation. Corpora 6 (2): 115-144.

Brief descriptions of some of the corpora are available on the website for the Australian National Corpus (AusNC).