written by Monika Bednarek
The Sydney Corpus Lab recently published a post synthesising how large language models (LLMs) and generative artificial intelligence (GenAI) tools have been incorporated into corpus linguistic research. This blog post is intended as a companion to that much longer post. It presents the main take-aways that researchers may want to consider when using these tools. Given the issues and debates reviewed in the longer post, we encourage researchers to take a careful and considered approach, and to be responsible, informed, critical, and ethical in using GenAI tools with corpora:
- Responsible use means incorporating human intervention, scrutiny, and oversight, for example carefully checking GenAI/LLM outputs, whether these are used in corpus preparation, annotation or analysis.
- Informed use means knowing which tool is most suitable for the task, what each tool does, what it is good at and where it struggles (e.g. false negatives/positives, no frequency data, hallucinations, inconsistencies, opacity of content/processes, lack of reproducibility, concerns around training data).
- Critical use means being aware of the many concerns that have been raised about GenAI/LLMs, which may range from environmental concerns (such as water and electricity consumption) to issues around bias, concentration of power, injustice and other social harms.
- Ethical use means acting ethically when using GenAI/LLM tools, for example by not violating copyright or data sovereignty, and not endangering privacy by sharing corpora with public tools (local, offline versions – which are not cloud-based – may be used for corpora instead, where this is permitted).
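As a minimal illustration of what "careful checking" of GenAI/LLM outputs might look like in practice, the sketch below compares hypothetical LLM-assigned labels on a small hand-checked sample against human gold-standard labels and reports precision, recall, and raw agreement. The labels ("metaphor"/"literal"), the data, and the sample size are all invented for illustration; real evaluation would use a larger, principled sample.

```python
# Hypothetical human-in-the-loop check: compare LLM-assigned labels
# on a hand-checked sample against human gold-standard labels.
# All labels and data below are invented for illustration only.

def agreement_report(gold, predicted, target="metaphor"):
    """Return precision, recall and raw agreement for one target label."""
    assert len(gold) == len(predicted)
    tp = sum(1 for g, p in zip(gold, predicted) if g == p == target)
    fp = sum(1 for g, p in zip(gold, predicted) if g != target and p == target)
    fn = sum(1 for g, p in zip(gold, predicted) if g == target and p != target)
    agree = sum(1 for g, p in zip(gold, predicted) if g == p) / len(gold)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"precision": precision, "recall": recall, "agreement": agree}

# Invented example: human vs. LLM labels for eight concordance lines.
human = ["metaphor", "literal", "metaphor", "literal",
         "literal", "metaphor", "literal", "literal"]
llm   = ["metaphor", "metaphor", "metaphor", "literal",
         "literal", "literal", "literal", "literal"]

print(agreement_report(human, llm))
```

A low score on such a sample would signal that the tool's output needs closer human scrutiny (or a different tool) before it is used for the full corpus.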
The longer blog post contains multiple references to empirical studies and concrete examples that support these recommendations.
Figure 1 presents these points as a visualisation, for potential use in the corpus linguistics classroom. As always with digital methods, these recommendations reflect a specific point in time (here, early 2026), and we offer them as a springboard for reflection, discussion and further development. For example, they do not yet consider the use of GenAI/LLM tools in data creation, i.e. using such tools to create linguistic corpora. This is just one example of an area where we see opportunities for additional critique and discussion. In addition, these technologies are advancing rapidly, and new models may perform differently from those tested in the existing studies on which this synthesis is based.

A final, important point is that users may decide that the most ethical course of action is not to use a GenAI tool at all, or to use one as minimally as possible. It is always worth considering alternatives, given the existence of 'traditional' (less energy-intensive) natural language processing models and the high-level interpretation skills of human analysts. Ultimately, we hope that together our two blog posts enable each user to make their own informed decision.
Acknowledgments
I am grateful to the following people who have provided feedback on this post (in alphabetical order):
Matteo Fuoli, Sam Hames, Martin Schweinberger, Maite Taboada, Catherine Travis
(This work is openly licensed via CC BY-NC 4.0. Please cite it using the appropriate conventions outlined in the license.)