Resources

The Sydney Corpus Lab has compiled a few resources on corpus linguistics:

In addition, we have designed a range of new tools for corpus/text analysis as part of our collaboration with the Sydney Informatics Hub for the Australian Text Analytics Platform (ATAP) and the HASS Research Data Commons and Indigenous Research Capability Program. These include the following tools:

The Document Similarity Tool

The Document Similarity Tool provides a means of identifying duplication in the texts in a corpus and of excluding duplicated texts from the corpus. It allows users to review each pair of texts and to export the de-duplicated corpus for additional analysis. It includes various visualisations and can also be used for purposes beyond corpus building where there is an interest in similarity between texts/speakers.
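To make the idea of flagging duplicated texts concrete, here is a minimal sketch in Python. It is not the tool's own implementation, which is not described here. It assumes that similarity is measured as Jaccard overlap between sets of word trigrams, and the function names, the trigram size and the 0.5 threshold are all hypothetical choices made for illustration.

```python
from itertools import combinations


def shingles(text, n=3):
    """Return the set of word n-grams (shingles) in a text."""
    words = text.lower().split()
    if len(words) < n:
        # Very short texts are treated as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similar_pairs(corpus, threshold=0.5, n=3):
    """List (name_a, name_b, score) for every pair at or above the threshold."""
    sets = {name: shingles(text, n) for name, text in corpus.items()}
    pairs = []
    for x, y in combinations(sorted(sets), 2):
        score = jaccard(sets[x], sets[y])
        if score >= threshold:
            pairs.append((x, y, score))
    return pairs


def deduplicate(corpus, threshold=0.5, n=3):
    """Drop the second text (in name order) of every flagged pair."""
    drop = {y for _, y, _ in similar_pairs(corpus, threshold, n)}
    return {name: text for name, text in corpus.items() if name not in drop}


# Hypothetical three-text corpus: "a" and "b" are exact duplicates.
corpus = {
    "a": "the cat sat on the mat today",
    "b": "the cat sat on the mat today",
    "c": "completely different words appear in this one",
}
flagged = similar_pairs(corpus)
cleaned = deduplicate(corpus)
```

In this sketch every flagged pair is removed automatically. In practice the list returned by `similar_pairs` could instead be reviewed pair by pair before deciding which texts to exclude.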

The Quotation Tool

The Quotation Tool identifies and extracts quoted content from newspaper texts, retrieving speakers and quoted content. It distinguishes different types of reporting expressions and types of quoting and also includes an automatic classification of entities that are quoted and of entities that occur in the identified quoted content (e.g. as person or as organisation).

The Semantic Tagger Tool

The Semantic Tagger Tool (for English, Chinese, Italian or Spanish) assigns words or multi-word expressions to semantic groups, tagging the texts in a dataset or corpus accordingly. Tagged texts can be exported for additional analysis using other tools. Some analysis and visualisation can also be undertaken within the tool itself.

The Keyword Analysis Tool

The Keyword Analysis Tool analyses words in two (or more) corpora and identifies whether certain words are over- or under-represented in the ‘node’ or ‘study’ corpus (i.e., the corpus of interest) compared to a ‘reference’ corpus (i.e., the standard of comparison). The tool also allows users to investigate whether the use of a certain word in one corpus differs statistically from the use of that same word in another corpus (based on the Welch t-test or the Fisher permutation test).
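As a rough illustration of the two tests named above, the following Python sketch compares the per-text relative frequency of one word in two corpora. It is not the tool's code. The function names, the whitespace tokenisation, the number of permutation rounds and the choice of the difference in means as the permutation statistic are assumptions made for this example.

```python
import random
from statistics import mean, variance


def relative_freqs(texts, word):
    """Per-text relative frequency of `word` (occurrences / tokens)."""
    freqs = []
    for text in texts:
        tokens = text.lower().split()
        freqs.append(tokens.count(word) / len(tokens) if tokens else 0.0)
    return freqs


def welch_t(x, y):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Needs at least two values per sample and non-zero variance in at
    least one of them; otherwise a division by zero occurs.
    """
    vx, vy = variance(x) / len(x), variance(y) / len(y)
    t = (mean(x) - mean(y)) / (vx + vy) ** 0.5
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df


def permutation_p(x, y, rounds=2000, seed=0):
    """Two-sided permutation p-value for the difference in means."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(mean(x) - mean(y))
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(x)]) - mean(pooled[len(x):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (rounds + 1)


# Hypothetical usage with two tiny corpora (lists of texts).
study = ["the reef is dying", "the reef needs help now", "save the reef"]
reference = ["the market fell today", "rain is expected", "the reef tour sold out"]
x = relative_freqs(study, "reef")
y = relative_freqs(reference, "reef")
t, df = welch_t(x, y)
p = permutation_p(x, y)
```

A positive `t` here means the word is relatively more frequent in the first corpus. Converting `t` and `df` into a p-value requires the t distribution, which a statistics library would normally provide.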

Discursis

Discursis was developed by Dan Angus, Janet Wiles and Andrew Smith and reworked as an open source tool by the Sydney Informatics Hub. It is designed to analyse participant interactions around topics in conversations.

Information on where to access these tools is available in the blog posts that are linked above.