Written by Monika Bednarek
We have just updated the ATAP Quotation Tool (Jufri & Sun 2022), which is an Australian Text Analytics Platform tool that allows users to identify and extract quotes from English-language newspaper texts. In addition to extracting the quotes, the tool also provides information about who the speakers are, identifies the reporting expressions used (e.g. say, tell), and classifies entities (e.g. as person, organisation, location) both regarding the speakers and the quoted content. Users can download the results as a spreadsheet for further analysis.
The tool is described in detail in this blog post on the ATAP website; if you are unfamiliar with the Quotation Tool, we recommend reading that post first. A User Guide (help pages) is now also available here.
Most recently, we have created a minor update of this notebook with two new features. This version is currently available by changing the “main” branch to the “feat/freq_lists” branch before launching the notebook via Binder. Alternatively, you can access this notebook version via this URL.
What’s new in this version?
More information on named entity types
As mentioned earlier, the Quotation Tool automatically classifies entities that are speakers or that occur in quoted content (e.g. as person, organisation, location). To give users more information on these named entity types, Step 3 of the updated notebook now contains a code cell that lists all entity types available in the loaded language model (e.g. PERSON, ORG) together with their explanations. This cell comes immediately before the one in which users specify the entities, so running it first shows a table of the available entities and their meanings. The beginning of the table is shown in Figure 1. This change makes it easier for users to modify the code in the subsequent cell to extract the particular entities they are interested in.
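To illustrate the idea behind such a cell, here is a minimal sketch rather than the notebook's actual code. The function name `entity_label_table` and the small `DEMO_GLOSSARY` are hypothetical stand-ins; the three explanations simply reuse the examples given above (person, organisation, location).

```python
def entity_label_table(labels, explain):
    """Pair each entity label with its explanation, sorted by label.

    `explain` is any callable that returns a description for a label,
    or None if it has no description for that label.
    """
    rows = []
    for label in sorted(labels):
        rows.append((label, explain(label) or "(no explanation available)"))
    return rows


# Hypothetical stand-in glossary, used only so the sketch runs on its own.
DEMO_GLOSSARY = {"PERSON": "person", "ORG": "organisation", "LOC": "location"}

rows = entity_label_table(["PERSON", "ORG", "LOC", "XYZ"], DEMO_GLOSSARY.get)
for label, meaning in rows:
    print(f"{label:<8}{meaning}")
```

If the loaded language model is a spaCy pipeline (an assumption; this post does not name the library), the same function could be called with `nlp.get_pipe("ner").labels` as the labels and `spacy.explain` as the lookup.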
Summary of categories
A second change is that the spreadsheet that users can download (in Step 5) after extracting quotes now includes summary information. More specifically, users can obtain frequency summaries for most of the categories, including the reporting expressions (e.g. the verbs said, say, told and their frequencies), the quote type (from GenderGapTracker’s quote extractor), the speaker entity types (e.g. PERSON, ORG), the speaker names (e.g. Dr Sun), etc. The notebook itself includes a full list of the respective categories and further information. These frequency lists are available as separate sheets within the downloaded Excel spreadsheet and allow easy access to summary information. As an example, Table 1 shows the sheet that summarises the frequency of entity types that are identified as ‘speaker’ in the Honi Soit corpus. Table 1 shows that this newspaper corpus mainly cites persons (PERSON=290), who are likely to be affiliated with an organisation (ORG=236).
speaker entity type | frequency |
PERSON | 290 |
ORG | 236 |
GPE | 33 |
NORP | 10 |
LOC | 2 |
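The counts in Table 1 can also be read as proportions, which makes the dominance of PERSON and ORG easier to see. The following is a minimal sketch using only the figures from Table 1; the variable names are our own and are not part of the notebook. Note that these are shares of speaker entity-type labels, not necessarily of quotes.

```python
# Frequencies copied from Table 1 (speaker entity types, Honi Soit corpus).
speaker_entity_types = {"PERSON": 290, "ORG": 236, "GPE": 33, "NORP": 10, "LOC": 2}

total = sum(speaker_entity_types.values())

# Percentage share of each entity type, rounded to one decimal place.
shares = {
    label: round(100 * count / total, 1)
    for label, count in speaker_entity_types.items()
}

for label, share in shares.items():
    print(f"{label:<8}{speaker_entity_types[label]:>5}{share:>7}%")
```

On these figures, PERSON and ORG together account for more than nine in ten of the 571 speaker entity labels.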
To give a second example, Table 2 shows the four most frequent reporting expressions identified in the same dataset (available in the sheet ‘verb_frequencies’ in the downloaded spreadsheet). Neutral reporting expressions are the most common, and they occur most often in the simple past tense. Results should always be reviewed manually. For instance, the sheet lists 30 instances of “according to”, but one further instance appears in a separate row as “according”; users would need to combine the two rows themselves.
verb | frequency |
said | 233 |
told | 50 |
says | 37 |
according to | 30 |
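Combining split rows such as “according to” and “according” can be done by hand in the spreadsheet, or with a few lines of code. The sketch below is one possible approach, not a feature of the tool; the function `merge_rows` and the `variants` mapping are hypothetical names, and the counts are those reported above.

```python
def merge_rows(frequencies, variants):
    """Add the counts of variant rows to their canonical row.

    `variants` maps a variant form to the form it should be counted under.
    The result is ordered from most to least frequent.
    """
    merged = {}
    for verb, count in frequencies.items():
        canonical = variants.get(verb, verb)
        merged[canonical] = merged.get(canonical, 0) + count
    return dict(sorted(merged.items(), key=lambda item: item[1], reverse=True))


# Counts from Table 2, plus the separate "according" row mentioned in the text.
verb_frequencies = {"said": 233, "told": 50, "says": 37, "according to": 30, "according": 1}

# Decided after manual review: the stray row belongs with "according to".
variants = {"according": "according to"}

merged = merge_rows(verb_frequencies, variants)
for verb, count in merged.items():
    print(f"{verb:<14}{count}")
```

The mapping is deliberately written out by hand: whether two rows really belong together is a judgement that should follow a manual check of the underlying quotes.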
Summing up, these two new features have two goals:
- make it easier to understand and change the code for named entity recognition;
- aid users in identifying trends in their corpus/dataset.
For further information on this tool, please consult the detailed help pages.
As mentioned above, we have kept the original version of the Quotation Tool at the main link, but you can access the new version via this branch. Alternatively, you can launch it directly via Binder here.
If you have any questions, feedback, and/or comments about the tool, you can contact the Sydney Informatics Hub at sih.info@sydney.edu.au.
Acknowledgments
The ATAP Quotation Tool is a Jupyter notebook containing code that was adapted and developed (with permission) from the GenderGapTracker by the Sydney Informatics Hub in collaboration with the Sydney Corpus Lab under the Australian Text Analytics Platform program and the HASS Research Data Commons and Indigenous Research Capability Program. These projects received investment from the Australian Research Data Commons (ARDC), which is funded by the National Collaborative Research Infrastructure Strategy (NCRIS).
References
Jufri, Sony & Chao Sun. 2022. Quotation Tool. Australian Text Analytics Platform. Software. https://github.com/Australian-Text-Analytics-Platform/quotation-tool.