Generative AI in corpus linguistics: A synthesis


Written by Kelvin Lee

The advent of large language models (LLMs) and generative artificial intelligence (GenAI) tools such as ChatGPT has led to AI being used in many facets of everyday life as well as in education and research. In this blog post, I will explore how AI has been incorporated into (primarily English) corpus linguistic research so far as well as look into current debates around its use in corpus linguistics.

Uses of AI in corpus linguistics

Some researchers have explored the affordances of AI for DIY corpora. Zappavigna (2023), for example, notes that AI is helpful for laborious corpus cleaning and normalisation tasks such as removing unwanted characters, converting text to and from lowercase or uppercase, removing HTML tags, and eliminating special symbols. A number of researchers have experimented with using AI to annotate word forms for semantic domains (e.g., figurative vs. literal in Fonteyn et al. 2024), pragmatic/discursive functions (e.g., apologies in Yu et al. 2024), and genre moves (e.g., Yu 2025) using an existing data coding scheme or analytical framework – i.e., types of annotation that would otherwise be time-consuming and labour-intensive when done manually, but that nevertheless produce annotated corpora useful for more fine-grained analysis. These studies have found that the quality and accuracy of the annotations produced by the AI are almost on a par with those of a human annotator. However, human intervention remains necessary since AI struggles with word forms that can carry multiple functions depending on context, which often results in inconsistent and incorrect annotations (Yu et al. 2024; Yu 2025). Additionally, AI may overlook word forms that are related to the phenomenon being annotated, or incorrectly annotate word forms that are unrelated to it in their context of use.
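Cleaning tasks of the kind Zappavigna describes can, of course, also be scripted directly, which keeps the data local. A minimal Python sketch of such a pipeline (the function name, regex choices, and example input are my own illustration, not taken from any of the cited studies):

```python
import re
from html import unescape

def clean_text(raw: str) -> str:
    """Basic corpus cleaning: decode HTML entities, strip HTML tags,
    remove special symbols, and normalise whitespace and case."""
    text = unescape(raw)                         # decode entities (&amp; -> &)
    text = re.sub(r"<[^>]+>", " ", text)         # strip HTML tags
    text = re.sub(r"[^\w\s.,!?'-]", " ", text)   # drop special symbols
    text = re.sub(r"\s+", " ", text).strip()     # collapse whitespace
    return text.lower()                          # normalise to lowercase

print(clean_text("<p>Hello &amp; welcome!</p>"))  # -> hello welcome!
```

Which characters count as "unwanted" is a research decision (emoji, for instance, may be data rather than noise in social media corpora), so the third regex would need adjusting per project.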

Others have experimented with using AI for automated qualitative analyses of corpora, including semantically grouping (key)words (e.g., Curry et al. 2024), concordance analysis (e.g., Curry et al. 2024; Zappavigna 2023), function-to-form analysis of pragmatic and discursive phenomena (e.g., questions in Curry et al. 2024), and identifying topics/discourses (e.g., Gillings & Jaworska 2025). The accuracy of the resulting analyses varied. Curry et al. (2024), for example, found that ChatGPT was able to semantically group key words reasonably well but deemed its concordance analysis and function-to-form analysis to be quite poor, with the results being largely inaccurate. Furthermore, they found that ChatGPT was unable to replicate any of the analyses, which poses a major issue for the repeatability and replicability of research. Additionally, it should be noted that, due to current input length limits, ChatGPT (and likely other GenAI tools) may be unable to process a whole corpus (Zappavigna 2023: 1) or even certain full texts (Curry et al. 2024: 5). This would make it difficult and inefficient to use AI as an analysis tool in lieu of existing tools like WordSmith or AntConc for those who intend to upload their own corpus for analysis. Uploading corpora into a public GenAI tool may also violate copyright or privacy laws and may be precluded by ethical concerns; locally run models should be used instead, so that corpus data is not shared.

Overall, many researchers advocate for using AI to assist with annotation and/or analysis with a human researcher overseeing the processes and scrutinising the results, rather than using AI as the sole analyst or annotator of data (e.g., Brezina 2025: 82-83; Curry et al. 2024: 8; Curry et al. 2025: 320; Yu et al. 2024: 20; Zappavigna 2023: 5).

Some researchers present a case for the use of AI for cleaning, processing, and analysing corpora of certain languages and language varieties. Fonteyn et al. (2024: 588), for example, note that using AI for annotation may be necessary for historical language varieties, especially when the pre-defined, non-adjustable tag sets used by existing corpus annotation tools do not make the necessary or sufficient distinctions (e.g., semantic ones) to help address research questions or objectives. Advocating for the use of AI for analysis, Zappavigna (2023: 2) notes that AI can “handle non-standard characters and other peculiarities commonly found in social media texts” such as emoji. When off-the-shelf corpus tools cannot process or analyse specific characters or a particular language well, using AI may be a good workaround.

Another use case involves using AI to generate frequency word lists (e.g., Davies 2025), collocates (e.g., Davies 2025), concordances (e.g., Lin 2023), or p-frames (e.g., Uchida 2024). This essentially uses the built-in or training data of the LLM powering the AI tool to generate examples of language use, often for language teaching and learning purposes (specifically, corpus-led or data-driven learning). Some researchers state that one advantage of using AI in this context is that it is much easier to use than corpus analysis software (e.g., Crosthwaite & Baisa 2023: 3; Flowerdew 2024: 14). However, other researchers point out that some time and effort, through rounds of trial and error, may be needed to formulate prompts that use the AI tool effectively (e.g., Lin 2023; Yu 2025: 36-39; Zappavigna 2023: 4). Comparing AI with corpus tools further, Lin (2023) notes that ChatGPT is unable to perform keyness and collocation analyses, which are standard analyses in corpus linguistic tools. However, Davies (2025) found that collocates generated using the GPT and Gemini LLMs were generally much better than the collocates derived from corpora in Sketch Engine and English-Corpora.org. Davies notes that the data from the LLMs is very accurate and insightful, going beyond the “surface level” association measures used in corpus linguistics and providing summaries of information such as similarities in context and functional role. The difference in outcomes between Lin (2023) and Davies (2025) highlights the rapid development of AI tools. Furthermore, Flowerdew (2024: 10) states that ChatGPT and Copilot are unable to provide actual frequency data, since these GenAI tools do not seem to be working with specific corpora but instead appear to draw on some unknown set of built-in or training data.
While it is unlikely that LLMs have access to actual word frequencies, Davies (2025) found that AI seems able to rank words in a way that aligns with actual word frequency data. Nevertheless, the inability to verify the frequency data, or to look up concordances and the source files in which a specific word is used (i.e., things you can do using corpus tools), is a clear drawback of using AI tools to generate frequency word lists.
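For contrast, the “surface level” association measures Davies refers to are fully verifiable because they are computed from an actual corpus. A minimal sketch of a frequency list and window-based collocate ranking (the function, the toy token list, and the choice of pointwise mutual information as the measure are my own illustration; real tools offer many measures):

```python
import math
from collections import Counter

def freq_and_collocates(tokens, node, window=2):
    """Return a frequency list and PMI-ranked collocates of `node`
    within +/- `window` tokens, computed from a real token list."""
    freq = Counter(tokens)
    n = len(tokens)
    co = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            lo, hi = max(0, i - window), min(n, i + window + 1)
            co.update(t for t in tokens[lo:hi] if t != node)
    # PMI = log2( P(node, w) / (P(node) * P(w)) ); favours rarer collocates
    pmi = {w: math.log2((c / n) / ((freq[node] / n) * (freq[w] / n)))
           for w, c in co.items()}
    return freq, sorted(pmi, key=pmi.get, reverse=True)

tokens = "the strong tea and the strong coffee but the weak tea".split()
freq, collocates = freq_and_collocates(tokens, "strong")
print(freq.most_common(3))  # most frequent word forms
print(collocates)           # collocates of "strong", ranked by PMI
```

Every number here can be traced back to specific positions in the source tokens, which is exactly the kind of verification (concordances, source files) that LLM-generated collocate lists do not support.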

Crosthwaite and Baisa (2023: 2) raise concerns about the authenticity of AI-generated text data. They argue that language produced by AI may be grammatically correct but may not be representative of actual language-in-use (i.e., not actually used in writing or speech, not appropriate contextually). While AI-generated language data cannot be a substitute for, or be treated as equivalent to, human written or spoken data due to issues of authenticity, it is appropriate to study AI-generated language as a language variety in its own right. Many researchers are compiling AI-generated texts into corpora for comparison with corpora of human writing (e.g., Berber Sardinha 2024; Muñoz-Ortiz et al. 2024; Reinhart et al. 2025). Findings from such studies are often used to develop and fine-tune AI-writing detection. Motivated by concerns about how reliance on AI affects language learning and academic honesty, a breadth of research compares AI-generated essays with those reportedly written by students, most of it focussing on L2 English students (e.g., Goulart et al. 2024; Herbold et al. 2023; Mizumoto et al. 2024; Tudino & Qin 2024).

Critiques and debates

A major concern regarding the use of AI in corpus research is the opacity of its content and processes. One of the strengths of using a corpus is that we know what kinds of texts, and which texts, have been incorporated into it. This is generally not the case for AI, since it is currently not possible to look up the exact data sources of most tools (Crosthwaite & Baisa 2023: 2; Flowerdew 2024: 8, 10, 15). However, Flowerdew (2024: 8) notes that Copilot does display data sources, or at least provides pointers to where one can read more on the topic being queried. Furthermore, AI is often referred to as a ‘black box’ since its decision-making processes (e.g., during annotation or analysis of data) are not transparent to the user (Berber Sardinha 2024: 2; Curry et al. 2024: 8). Zappavigna (2023: 4), however, argues that the internal processes of both AI and corpus analysis tools are opaque to the user, but suggests that AI is more scrutable since you can ask it to explain a decision and then correct the output. Davies (2025) states that we need to be wary of the ‘introspections’ produced by AI tools themselves (i.e., explanations of how and why they generated a particular output), since AI is generally no better at analysing its own decisions and processes than humans are. Additionally, Flowerdew (2024: 8) points out that AI likely ‘regurgitates’ what has been loaded into its data, which raises questions about the dependability of an ‘introspective’ explanation generated by an AI tool. In general, the dependence of the output on the content (within the AI) and the opacity of said content raise concerns about potential bias and skewing of the output. Without knowing the content and processes, it is also difficult to know whether an AI tool will be appropriate for the research being conducted.

Even if the issue of opacity were resolved, the data used to train AI is limited in how well it captures linguistic variation (Grieve et al. 2025) and (socio)cultural perspectives (Curry et al. 2025). Certain biases can result from the uncurated selection of training data, which may capture only the most dominant linguistic and cultural ideologies and perspectives (Curry et al. 2025: 319; Grieve et al. 2025). The inherent biases within the AI can propagate potentially harmful language ideologies (e.g., regarding ‘standard’ or prestige language varieties), leading to the homogenisation of language use (Curry et al. 2025: 319). Grieve et al. (2025) highlight two potential types of issue arising from the uncurated sampling of data for AI – specifically, LLMs:

The first is social bias against users from certain social groups, which results in quality-of-service harms (Grieve et al. 2025: 7). The most commonly used AI tools are chatbots, which require the user to enter written prompts. If language data from certain social groups is under-represented in the training data for an LLM, tools powered by that LLM may process the language structures in prompts entered by members of these groups less accurately, resulting in poorer performance for these users.

The second type of social bias involves the AI generating outputs that directly harm or discriminate against certain social groups even when their members are not engaging with these tools themselves, resulting in stereotyping harms (Grieve et al. 2025: 7). If the training data contains language that expresses harmful or inaccurate ideas about certain social groups, then the LLM and the AI tool it powers will inevitably reproduce those same ideas. Members of these social groups can be harmed by encountering these ideas firsthand while using the AI tools themselves, or through the AI tool disseminating them to people outside the groups. As the use of AI continues to grow, AI-generated data is increasingly being included in the data used to train AI. Curry et al. (2025: 319) argue that this will likely further exacerbate and reinforce the existing biases in the AI.

Researchers also highlight the unpredictability of AI output, which again is not helped by its opacity. The first issue relates to the variability of output: replication of results is not guaranteed even when using the same prompts (Crosthwaite & Baisa 2023: 2; Curry et al. 2024; Uchida 2024: 9). As mentioned above, this raises major concerns about the replicability and reproducibility of research. Crosthwaite and Baisa (2023: 2) note that AI output is generated using complex statistical procedures involving random sampling, which results in variation in the output. Uchida (2024: 9) adds that ChatGPT (and likely other LLMs) constantly undergoes refinement and updates, which can affect the replicability of outputs. Aside from the variability of output, AI has a tendency to generate misleading or erroneous results (i.e., hallucinations). For example, using ChatGPT to conduct function-to-form identification and analysis of questions, Curry et al. (2024: 6) found that the AI often fabricated new questions not found in the data by adding question marks or question tags to declarative statements and then including these in the results of the analysis. Similarly, Flowerdew (2024) found that ChatGPT gave results that were not asked for in the prompt: asked for examples of the verb indicate (used in the conclusion sections of research articles), it also returned examples of other verbs (e.g., suggest, highlight, show).
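The random sampling behind this variability can be illustrated in the abstract: at each step an LLM produces a probability distribution over possible next tokens and samples from it, often scaled by a ‘temperature’ setting, so identical prompts can yield different outputs. A toy sketch of that mechanism (not any particular model’s actual decoding code; the token scores are invented):

```python
import math
import random

def sample_next_token(logits: dict, temperature: float = 1.0, seed=None):
    """Sample one token from model scores (logits) via softmax.
    Higher temperature flattens the distribution, increasing variability."""
    rng = random.Random(seed)
    scaled = {t: s / temperature for t, s in logits.items()}
    z = sum(math.exp(s) for s in scaled.values())
    probs = {t: math.exp(s) / z for t, s in scaled.items()}
    r, acc = rng.random(), 0.0
    for tok, p in probs.items():   # inverse-CDF sampling
        acc += p
        if r <= acc:
            return tok
    return tok                     # guard against float rounding

# The same "prompt" (same logits) can yield different tokens across runs:
logits = {"significant": 2.0, "notable": 1.5, "marked": 0.5}
samples = {sample_next_token(logits) for _ in range(50)}
print(samples)  # typically more than one distinct token
```

Fixing the seed makes the draw repeatable, but public chatbot interfaces generally do not expose such controls to the user, which is one reason identical prompts do not guarantee identical outputs.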

Crosthwaite and Baisa (2023: 2) point out another issue with AI regarding the presentation of its outputs. Existing corpus tools visually present corpus data in several ways, such as the use of colours in concordances (e.g., in AntConc), statistical tables (e.g., collocation scores), and, more recently, visual charts and maps of relationships between words and lexico-grammatical units (e.g., in LancsBox, Sketch Engine, or Voyant Tools). While certain AI tools can generate tables or detailed images when prompted, it may be difficult to do so in a chatbot context without integrating the chatbot with some other tool (Crosthwaite & Baisa 2023: 2).

Overall, it is clear that AI can assist with analysis and potentially speed up certain aspects of corpus research (e.g., data cleaning and processing). However, such processes should not be left entirely to AI, especially considering current drawbacks relating to the lack of transparency (in content and processes) and the variability of the results generated. Of course, we should not ignore the fact that a human researcher is necessary to oversee all aspects of research, including formulating the prompts as well as interpreting and scrutinising the results produced by the AI. In addition, AI can be critiqued for non-linguistic reasons, such as its water and electricity consumption (i.e., environmental reasons) or its use of data for training purposes (i.e., copyright concerns). Further, the research reviewed here focusses mostly on the English language (but see Fonteyn et al. 2024 for semantic tagging of Dutch data), and AI tools may perform differently for other languages. Last but not least, given the rapid developments in this field, I must emphasise that this blog post reflects the state of affairs at the time this synthesis/review was originally undertaken (July 2025).

(This work is openly licensed via CC BY-NC 4.0. Please cite it using the appropriate conventions outlined in the license.)

References

Berber Sardinha, T. (2024). AI-generated vs human-authored texts: A multidimensional comparison. Applied Corpus Linguistics, 4(1), 100083. https://doi.org/10.1016/j.acorp.2023.100083

Brezina, V. (2025). Corpus linguistics and AI: #LancsBox X in the context of emerging technologies. International Journal of Language Studies, 19(2), 75-. https://doi.org/10.5281/zenodo.15250820

Crosthwaite, P., & Baisa, V. (2023). Generative AI and the end of corpus-assisted data-driven learning? Not so fast. Applied Corpus Linguistics, 3(3), 100066. https://doi.org/10.1016/j.acorp.2023.100066

Curry, N., Baker, P., & Brookes, G. (2024). Generative AI for corpus approaches to discourse studies: A critical evaluation of ChatGPT. Applied Corpus Linguistics, 4(1), 100082. https://doi.org/10.1016/j.acorp.2023.100082

Curry, N., McEnery, T., & Brookes, G. (2025). A question of alignment – AI, GenAI and applied linguistics. Annual Review of Applied Linguistics, 45, 315–336. https://doi.org/10.1017/S0267190525000017

Davies, M. (2025). Corpora and AI/LLMs: Overview. (White paper). English-Corpora.org. https://www.english-corpora.org/ai-llms/corpora-vs-llms.html

Flowerdew, J. (2024). Data-driven learning: From Collins Cobuild Dictionary to ChatGPT. Language Teaching, 1–18. https://doi.org/10.1017/S0261444824000144

Fonteyn, L., Manjavacas, E., Haket, N., Dorst, A. G., & Kruijt, E. (2024). Could this be next for corpus linguistics? Methods of semi-automatic data annotation with contextualized word embeddings. Linguistics Vanguard, 10(1), 587–602. https://doi.org/10.1515/lingvan-2022-0142

Gillings, M., & Jaworska, S. (2025). How humans and machines identify discourse topics: A methodological triangulation. Applied Corpus Linguistics, 5(1), Article 100121. https://doi.org/10.1016/j.acorp.2025.100121

Goulart, L., Matte, M. L., Mendoza, A., Alvarado, L., & Veloso, I. (2024). AI or student writing? Analyzing the situational and linguistic characteristics of undergraduate student writing and AI-generated assignments. Journal of Second Language Writing, 66, Article 101160. https://doi.org/10.1016/j.jslw.2024.101160

Grieve, J., Bartl, S., Fuoli, M., Grafmiller, J., Huang, W., Jawerbaum, A., Murakami, A., Perlman, M., Roemling, D., & Winter, B. (2025). The sociolinguistic foundations of language modeling. Frontiers in Artificial Intelligence, 7, 1472411. https://doi.org/10.3389/frai.2024.1472411

Herbold, S., Hautli-Janisz, A., Heuer, U., Kikteva, Z., & Trautsch, A. (2023). A large-scale comparison of human-written versus ChatGPT-generated essays. Scientific Reports, 13(1), 18617. https://doi.org/10.1038/s41598-023-45644-9

Lin, P. (2023). ChatGPT: Friend or foe (to corpus linguists)? Applied Corpus Linguistics, 3(3), 100065. https://doi.org/10.1016/j.acorp.2023.100065

Mizumoto, A., Yasuda, S., & Tamura, Y. (2024). Identifying ChatGPT-generated texts in EFL students’ writing: Through comparative analysis of linguistic fingerprints. Applied Corpus Linguistics, 4(3), 100106. https://doi.org/10.1016/j.acorp.2024.100106

Muñoz-Ortiz, A., Gómez-Rodríguez, C., & Vilares, D. (2024). Contrasting linguistic patterns in human and LLM-generated news text. Artificial Intelligence Review, 57(10), Article 265. https://doi.org/10.1007/s10462-024-10903-2

Reinhart, A., Markey, B., Laudenbach, M., Pantusen, K., Yurko, R., Weinberg, G., & Brown, D. W. (2025). Do LLMs write like humans? Variation in grammatical and rhetorical styles. Proceedings of the National Academy of Sciences, 122(8), e2422455122. https://doi.org/10.1073/pnas.2422455122

Tudino, G., & Qin, Y. (2024). A corpus-driven comparative analysis of AI in academic discourse: Investigating ChatGPT-generated academic texts in social sciences. Lingua, 312, 103838. https://doi.org/10.1016/j.lingua.2024.103838

Uchida, S. (2024). Using early LLMs for corpus linguistics: Examining ChatGPT’s potential and limitations. Applied Corpus Linguistics, 4(1), 100089. https://doi.org/10.1016/j.acorp.2024.100089

Yu, D. (2025). Towards LLM-assisted move annotation: Leveraging ChatGPT-4 to analyse the genre structure of CEO statements in corporate social responsibility reports. English for Specific Purposes, 78, 33–49. https://doi.org/10.1016/j.esp.2024.11.003

Yu, D., Li, L., Su, H., & Fuoli, M. (2024). Assessing the potential of AI-assisted pragmatic annotation: The case of apologies. International Journal of Corpus Linguistics. https://doi.org/10.1075/ijcl.23087.yu

Zappavigna, M. (2023). Hack your corpus analysis: How AI can assist corpus linguists deal with messy social media data. Applied Corpus Linguistics, 3(3), 100067. https://doi.org/10.1016/j.acorp.2023.100067