Corpus linguistics in Australia: An interview with Pam Peters

In this blog post, we look back at the history of corpus linguistics in Australia, through the recollections of Emeritus Professor Pam Peters (Macquarie University).

1. How and when did you first get involved in corpus linguistics?

During the 1980s, there was growing interest in Australian English at large, following publication of the first Macquarie Dictionary (1981). I became a member of the Dictionary’s editorial committee in 1986, and was curious about how far the lexicogrammar of Australian English (AusE) still reflected that of British English — when influences from American English were increasingly seen and heard. By then the first fruits of grammatical research on the American Brown corpus and British LOB corpus were beginning to be discussed at ICAME conferences in Europe, and there was a great incentive to build an Australian counterpart to the Brown and LOB corpora so as to further research on AusE usage. 

Image 1 Recent shot of Pam Peters backed by the new online Australian Manual of Style (2021), a Macquarie University–Biotext co-publication which embodies her corpus-based research on Australian usage

2. Can you tell us more about compiling the first sample corpus of Australian English, the Australian Corpus of English (ACE)?

Who was involved, how was it funded?

At that time, the digital revolution was in its early days in Australia, and we managed to get hold of a couple of digitised newspaper “dumps” from the Sydney Morning Herald and the Adelaide Advertiser to try our hands at extracting linguistic data from them. But there was no way that we could get digitised samples from all the different newspaper subtypes or other text types needed to match the Brown and LOB corpora. So we (Peter Collins, David Blair and I) did it the hard way, as the makers of Brown and LOB had done: we typed up the 2000-word samples from selected newspapers, magazines, reports and books, turned them into digital files, and thus compiled the 1 million-word Australian Corpus of English (ACE). Optical character readers (the Kurzweil) were just becoming available then, but were still not very reliable (either on newsprint or glossy publications). Also, they were far too expensive for us on the rather small funds we had from a starter Australian Research Council grant and from Macquarie University and the University of New South Wales. We had great technical help from research staff at Macquarie’s then Speech and Language Research Centre, using its “mainframe” computer to process the relatively large digital files. The text was printed out on sheets of continuously folded computer paper as large as a standard briefcase, but portable enough to be taken away for proof-reading. An early photo has us celebrating with a line of computing paper cascading out of the printer down to the floor.

Image 2 First fruits of the ACE corpus in the hands of Macquarie researchers (L-R Peter Collins, Pam Peters, Alison Moore [research assistant], David Blair) circa 1988

What were some of the challenges you faced, and do you have any interesting anecdotes from that project?

Though our aim was to match the Brown and LOB categories with similar categories from Australian publications from the year 1986, this proved to be easier said than done, especially when it came to finding samples of fiction. Australia’s publishing industry was by then strong in mass market products, e.g. magazines and newspapers (there were many more independent newspapers in the 1980s). But book publishing (except for education) was underdeveloped. For most of the fiction categories we had to take samples from literary magazines as well as full novels, and some categories such as Romance and Westerns could not be filled from Australian sources. The romance literature that Australians read tended to be Mills and Boon novels published in London, and Westerns came of course from American publishing houses. We therefore had to reinvent those categories in ACE as “Women’s writing” and “Bush narratives” respectively, to make use of what was available from niche Australian publishers. And to make up the overall shortage of fiction samples we added the category of historical fiction, in which there was a good supply of Australian monographs as well as short stories.

3. What other corpora have you since been involved in designing and building and can you tell us a little bit about the history and process?

ICE-AUS corpus

Following completion of ACE in 1990, we were pleased to be invited by Professor Sidney Greenbaum at University College London to join the network compiling the International Corpus of English (ICE) and to compile ICE-AUS. Again, it required us to match the established ICE categories of written and spoken English, and to engage with the challenges of collecting and transcribing speech. The overall aim of the ICE project was to collect the speech of “educated adult native-speakers” (to represent the “standard” variety). This was not so straightforward in the Australian context with its large immigrant population from different parts of the British Isles as well as non-anglophone countries. Having set the benchmark for “educated” as a completed secondary education, we allowed immigrants whose school education had been in Australia to be grouped with those actually born here as “native speakers”. “Adult” was defined as being at least 18 years of age.

Some of the spoken categories established by the British team in London had to be adapted in the Australian context. One was sports commentary, which is often dialogic here, as parodied by “Roy and HG”, whereas it was purely monologic in the ICE model. In other categories (e.g. institutional speech) there was plenty of latitude for collecting data beyond the inevitable lecture and tutorial, and our three adventurous research assistants (Robert Jenkins, Heather Middleton, Wendy Young) took off to record a soap-box speaker in Hyde Park debating with their audience, a wine-tasting in the Hunter Valley, and an occasional meeting of the Sydney skeptics.

With its large spoken component, ICE-AUS took almost five years to complete (1992–97), and we benefited from a large grant which I secured from the National Languages and Literacy Institute of Australia. The project naturally involved paperwork in seeking consent from all participants to have their voice recordings included in the corpus, though they were amazingly obliging about it. The transcription of spoken recordings presented new challenges: how to deal systematically with pauses, pause fillers, backchannels, and with the inaudible bits on less-than-perfect audio tapes from authentic noisy surroundings. Taping dinner party conversations gives the constant sounds of cutlery in contact with china plates in an erratic rhythm.

ART (Australian Radio Talk corpus)

The ICE spoken categories ensured we had plenty of samples of casual and private conversations as well as the more formal institutional speech of structured settings such as the classroom and the meeting room. But there was little by way of broadcast dialogues, that in-between category of interactive speech which is ad hoc and somewhat personal but intended for wide audiences. Given the mixture of public and commercial broadcasters, it made sense to create the ART corpus to collect samples of both, and we (Peter Collins, Adam Smith and I) obtained an ARC grant on which we could proceed. To collect a range of broadcast discussions from all capital cities on different days of the week and different times of day we took advantage of the streamed versions of radio programs available. We amassed a smallish but rich collection of dialogues (just on 300,000 words) with more and less opinionated disc jockeys across the country, showing how much care the ABC interviewers took to let their callers speak, and how little interested the typical commercial talk-show anchors were in what callers were saying. The notional topic areas varied enormously from real estate to Sharina’s Psychic encounters. Transcribing the ART recordings had us concerned with capturing as much by way of natural speech as possible, including assimilated pronunciations (dunno, goodio, whaddya) and other features of rapid speech production, as well as the various phonetic forms of backchannels rendered “phonemically” as mm and uh. Deanna Wong and Yasmin Funk compiled a list of all these non-standard forms for reference by the transcription team. It helped to maintain inter-transcriber consistency, as when we found that our Irish research assistant was studiously avoiding using er for a fronted pronunciation of uh, being a rhotic speaker himself!

4. What corpus linguistic research projects are you currently working on?

Just now I’m creating and using a suite of text corpora in two areas of public health. One set of these is associated with the HealthTermFinder platform, which focuses on the 12 commonest types of cancer, with individual “termbanks” for each. The 12 specialised corpora of documents sourced from public or professional health websites help to prioritise the lists of terms to be covered, and supply us with examples of current usage to show something of their contexts of use. In this project we collaborate with colleagues at Fudan University, Shanghai, where they provide Chinese translations of each termbank for their medical students, and we use them as an extra dimension for Australian citizens with Chinese heritage.

The second set of health corpora is more broadly focused on areas of community health, where we’re researching the readability of health messaging for both L1 and L2 readers in the community. Messaging in relation to COVID-19 is the subject of just one of these corpora, now under compilation in a project I’m involved in with colleagues Marc Orlando and Jan-Louis Kruger, which is associated with NAATI (National Accreditation Authority for Translators and Interpreters). We’ve just published a paper in Text and Talk on a preceding research study on community access to health information, using corpora relating to mental health issues.

5. You designed, developed and taught the first-ever unit of study in Australia dedicated solely to corpus linguistics. How did you go about this? How did students react? What did you learn from this experience?  Are there any corpus linguistic units of study at Macquarie now and how do they differ from the unit you designed in the 1990s?

Australia’s first corpus linguistics unit was launched in 1993 in the Macquarie Linguistics Department for third-year undergraduates. It was co-convened by Steve Cassidy and me, until he moved over to the Computing Department in the early 2000s. Between us we could cover the technical aspects of corpus methodology as well as corpus design, and the kinds of linguistic research questions that corpora can answer well. Steve established the departmental corpus website, where students could access the key sample corpora (Brown, LOB), but also create their own, using built-in tools. With the website facilities, students investigated a great variety of linguistic topics of personal interest, such as the language of Australian TV “infomercials”, or of insurance law, or Porgy and Bess, or comparing the styles of JK Rowling and Enid Blyton. The versatility of corpus research was a great satisfaction to them and to us as conveners also. Corpus linguistics remains a key element of the “capstone” unit for Linguistics majors, alongside other research methods in quantitative and qualitative analysis.

6. Do you think there is a bright future for corpus linguistics in Australia? What is specific/unique to the Australian linguistic contexts that will shape the future of corpus linguistics here?

Three things will underscore the future role of corpus linguistics in Australia:

1) The place of corpus linguistics in future linguistic research has been secured with its recent recognition in the ARC’s classification scheme for research grants, following representations organised by Monika Bednarek.

2) The recognised need for reliable long-term curation of corpora is to be addressed through the Language Data Commons of Australia, led by Michael Haugh with the participation of many colleagues across multiple Australian universities.

3) Infrastructure and nation-wide upskilling for computational humanities research in HASS have been established in the University of Queensland’s Language Technology and Data Analysis Laboratory (LADAL) and are a crucial part of collaborative projects such as the ARDC Australian Text Analytics Platform (ATAP).

7. Any tips for budding corpus linguists, advice, feedback, comments or other interesting corpus linguistic anecdotes?

Corpus projects are rather costly and require technical backup. Creating large sample corpora involves considerable teamwork. In return, they give the joys of working with colleagues on a common resource, exploiting its currency, knowing that it will yield research benefits long into the future, and being part of the expanding universe of corpus linguistics.

To find out more about Australian corpora, check out this blog post or this timeline.