Resources

The Sydney Corpus Lab has compiled a few resources on corpus linguistics:

A list of textbooks/introductions/handbooks on corpus linguistics (in different languages)
A compilation of video playlists on corpus linguistics
A synthesis of the use of genAI in corpus linguistics and a shorter companion post that can be used as a teaching resource

In addition, we have designed a range of new tools for corpus/text analysis as part of our collaboration with the Sydney Informatics Hub for the Language Data Commons of Australia (LDaCA) project. These include the following tools:

Wordflow

Wordflow, an LDaCA Text Analytics Workbench, is the most recent and largest tool in the LDaCA Text Analytics Tools suite: a web app for researchers who work with text. You bring a corpus; Wordflow gives you a workspace where each analysis becomes a data block you can branch, filter, and feed into the next tool. Tools talk to each other: click a word in Frequency and it opens in Concordance; send a result to the workspace and group it in Trends.

The Document Similarity Tool

The Document Similarity Tool provides a means of identifying duplication in the texts in a corpus and of excluding duplicated texts from the corpus. It allows users to review each pair of texts and to export the de-duplicated corpus for additional analysis. It includes various visualisations and can also be used for purposes beyond corpus building where there is an interest in similarity between texts/speakers.
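The idea of flagging near-duplicate texts can be illustrated with a minimal sketch. The source does not say which similarity measure the tool uses, so everything here is an assumption for illustration: Jaccard similarity over three-word shingles, a threshold of 0.5, and the hypothetical function names `shingles`, `jaccard`, `similar_pairs` and `deduplicate`.

```python
from itertools import combinations


def shingles(text, n=3):
    """Return the set of n-word shingles in a text (lowercased, whitespace-tokenised)."""
    tokens = text.lower().split()
    if len(tokens) < n:
        # Short texts are represented by a single shingle holding all their tokens.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection divided by size of union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similar_pairs(corpus, threshold=0.5, n=3):
    """Return (id_a, id_b, score) for every pair of texts scoring at or above the threshold."""
    sets = {doc_id: shingles(text, n) for doc_id, text in corpus.items()}
    pairs = []
    for a, b in combinations(sorted(sets), 2):
        score = jaccard(sets[a], sets[b])
        if score >= threshold:
            pairs.append((a, b, score))
    return pairs


def deduplicate(corpus, threshold=0.5, n=3):
    """Keep the first text of each similar pair and drop the second."""
    to_drop = {b for _, b, _ in similar_pairs(corpus, threshold, n)}
    return {doc_id: text for doc_id, text in corpus.items() if doc_id not in to_drop}
```

In this sketch the pairs returned by `similar_pairs` stand in for the review step, and `deduplicate` stands in for the export of a de-duplicated corpus. Comparing every pair is quadratic in the number of texts, which is acceptable only for small corpora.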

The Quotation Tool

The Quotation Tool identifies and extracts quotations from newspaper texts, retrieving both the speakers and the quoted content. It distinguishes different types of reporting expressions and of quoting, and automatically classifies both the entities that are quoted and the entities that occur in the quoted content (e.g. as a person or an organisation).

The Semantic Tagger Tool

The Semantic Tagger Tool (for English, Chinese, Italian or Spanish) assigns words or multi-word expressions to semantic groups, tagging the texts in a dataset or corpus accordingly. Tagged texts can be exported for additional analysis using other tools. Some analysis and visualisation can also be undertaken within the tool itself.

The Keyword Analysis tool

The Keyword Analysis tool analyses words in two (or more) corpora and identifies whether certain words are over- or under-represented in the ‘node’ or ‘study’ corpus (i.e., the corpus of interest) compared to a ‘reference’ corpus (i.e., the standard of comparison). The tool also allows users to investigate whether the use of a certain word in one corpus differs to a statistically significant degree from the use of that same word in another corpus (based on the Welch t-test or the Fisher permutation test).
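As a rough illustration of the Welch t-test comparison, the sketch below computes a word's relative frequency in each text of two corpora and then the Welch t statistic with its Welch-Satterthwaite degrees of freedom. It is not the tool's implementation: the whitespace tokenisation, the choice of per-text relative frequencies as observations, and the function names `relative_frequencies` and `welch_t` are all assumptions made for the example.

```python
from statistics import mean, variance


def relative_frequencies(texts, word):
    """Per-text relative frequency of `word` (occurrences per token), whitespace-tokenised."""
    freqs = []
    for text in texts:
        tokens = text.lower().split()
        if tokens:  # skip empty texts
            freqs.append(tokens.count(word.lower()) / len(tokens))
    return freqs


def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom for two samples."""
    na, nb = len(sample_a), len(sample_b)
    if na < 2 or nb < 2:
        raise ValueError("each sample needs at least two observations")
    # Squared standard errors of the two sample means.
    va, vb = variance(sample_a) / na, variance(sample_b) / nb
    if va + vb == 0:
        raise ValueError("both samples have zero variance")
    t = (mean(sample_a) - mean(sample_b)) / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df
```

A positive t value indicates that the word is, on average, relatively more frequent in the first sample (here, the study corpus). Turning t and the degrees of freedom into a p-value requires the t distribution, which a statistics library would normally supply.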

The ATAP Concordancer (beta version)

The ATAP Concordancer is a Jupyter notebook that allows users to search a text/corpus for every instance of a search term and then presents the found instances in the form of a concordance. This concordancer has specifically been designed to allow users to:

undertake ‘dialogic’ analysis of intratextual patterns (when the input consists of related text pairs, such as question-answer or social media post-response) and/or
analyse the meta-data that are associated with each occurrence of the search term, if such meta-data are included in the input (for example, speaker identity, political affiliation, company, and so on).

This notebook still requires further development but is a proof-of-concept example that users are able to test with small datasets.
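The general idea of a concordance that carries meta-data with each hit can be sketched as follows. This is an illustrative example rather than the ATAP Concordancer's code: the record layout (a dictionary with a `text` key plus arbitrary meta-data keys), the 30-character context window and the function name `concordance` are assumptions.

```python
import re


def concordance(records, term, width=30):
    """Return one row per match: left context, match, right context, plus the record's meta-data."""
    # Whole-word, case-insensitive search for the term.
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    rows = []
    for record in records:
        text = record["text"]
        # Everything except the text itself is treated as meta-data.
        meta = {key: value for key, value in record.items() if key != "text"}
        for m in pattern.finditer(text):
            rows.append({
                "left": text[max(0, m.start() - width):m.start()],
                "match": m.group(0),
                "right": text[m.end():m.end() + width],
                **meta,
            })
    return rows


records = [
    {"text": "We need more funding for schools.", "speaker": "A"},
    {"text": "Funding is not the issue; funding is.", "speaker": "B"},
]
for row in concordance(records, "funding"):
    print(f'{row["left"]:>30} [{row["match"]}] {row["right"]:<30} {row["speaker"]}')
```

Because each row keeps the meta-data of its source record, the hits can afterwards be grouped or filtered by fields such as speaker. For paired input such as question-answer data, each pair could be stored as two records that share a hypothetical pair identifier.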

Discursis

Discursis was developed by Dan Angus, Janet Wiles and Andrew Smith and reworked as an open-source tool by the Sydney Informatics Hub. It is designed to analyse participant interactions around topics in conversations.

ATAP TopSBM

TopSBM is a topic modelling approach developed by Eduardo Altmann and colleagues, which infers a hierarchy of topic clusters and word clusters in a corpus in a non-parametric manner by leveraging stochastic block models. ‘Top’ stands for ‘Topic’ and ‘SBM’ stands for ‘Stochastic Block Models’. The ATAP TopSBM repository integrates TopSBM into the Australian Text Analytics Platform (ATAP): it is a demo Jupyter notebook for TopSBM with ATAP Corpus integration. At the end of the notebook, users are able to download a corpus with TopSBM results. Further information on this approach to topic modelling is available here.

Information on where to access these tools is available in the blog posts that are linked above (or see the repositories here).