Written by Kelvin Lee
This blog post introduces the Honi Soit corpus, a training dataset that we compiled using a variation of what is called constructed week sampling (as explained below). We hope that the description of this sampling method will be of interest to others who want to compile newspaper corpora.
The Honi Soit corpus is an approximately 60,000-word corpus comprising 100 news articles published by the University of Sydney student newspaper Honi Soit between January 2021 and December 2022. As shown in Table 1 below, the corpus comprises 50 articles from 2021 and 50 from 2022. Articles come only from the News category of the Honi Soit website (https://honisoit.com/category/news/) and should thus contain news reportage rather than non-news genres such as opinion or analysis.
Table 1. Number of articles and tokens in the Honi Soit corpus, by year
Measure | 2021 | 2022 | Total |
No. of articles | 50 | 50 | 100 |
No. of tokens | 29,402 | 30,592 | 59,994 |
This corpus was deliberately constructed as a small training dataset for use and potential distribution with text analytics notebooks developed as part of our collaboration on the Australian Text Analytics Platform (ATAP) and the Language Data Commons of Australia (LDaCA). Both are collaborative projects led by the University of Queensland and supported by the Australian Research Data Commons to develop infrastructure for researchers who work with language data. Prior to compiling the dataset, we sought (and were given) the permission of the Honi Soit editors to do so.
As mentioned, the news articles were selected from the Honi Soit website (news category) using a variation of constructed week sampling. This is a type of stratified random sampling in which the complete sample represents all days of the week to account for cyclical variation of news content (Luke et al. 2011: 78).
For the Honi Soit corpus, one article was selected from each week between January 2021 and December 2022. (No articles were sampled from the first week of January or the final week of December, since Honi Soit does not publish during this period.) Selection began with the second-last week of December 2022 and proceeded in reverse chronological order. In essence, an article was selected from a particular day of the week for one week, and for the preceding week an article was selected from a different day of the week. For example, for the second-last week of December 2022 an article published on a Wednesday was selected; for the previous week (i.e., the third-last week of December), an article published on any day other than Wednesday was selected, and so on. The resulting corpus contains a roughly equal number of articles for each day of the week (as shown in Table 2). Importantly, the number of constructed weeks in the corpus (approximately 14, i.e., 100 articles spread across the 7 days of the week) far exceeds the minimum of two constructed weeks recommended by Hester and Dougall (2007: 820) for the content analysis of online news.
Table 2. Number of sampled articles per day of the week, by year
Year | Mon | Tue | Wed | Thu | Fri | Sat | Sun |
2021 | 8 | 6 | 7 | 7 | 6 | 7 | 9 |
2022 | 6 | 8 | 8 | 7 | 8 | 7 | 6 |
Total | 14 | 14 | 15 | 14 | 14 | 14 | 15 |
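For readers who want to try a similar procedure on their own newspaper data, the selection logic described above could be sketched roughly as follows. This is an illustrative approximation, not the script used to build the Honi Soit corpus: the function names, the `(publication_date, article_id)` input format, the use of ISO weeks, the random choice among eligible articles, and the fallback for weeks with no eligible article are all assumptions made for the example.

```python
import random
from collections import Counter
from datetime import date, timedelta


def sample_one_per_week(articles, seed=0):
    """Pick one article per ISO week, newest week first, avoiding the
    weekday that was chosen for the previously processed (later) week.

    `articles` is a list of (publication_date, article_id) tuples
    (a hypothetical input format chosen for this sketch).
    """
    rng = random.Random(seed)

    # Group the articles by (ISO year, ISO week number).
    by_week = {}
    for pub_date, article_id in articles:
        iso = pub_date.isocalendar()
        by_week.setdefault((iso[0], iso[1]), []).append((pub_date, article_id))

    selected = []
    previous_weekday = None
    # Walk backwards in time, starting from the most recent week.
    for week in sorted(by_week, reverse=True):
        candidates = [a for a in by_week[week]
                      if a[0].weekday() != previous_weekday]
        if not candidates:
            # Assumption: if every article in this week falls on the
            # excluded weekday, fall back to any article from the week.
            candidates = by_week[week]
        choice = rng.choice(candidates)
        selected.append(choice)
        previous_weekday = choice[0].weekday()
    return selected


def weekday_tally(selected):
    """Count selected articles per weekday name, as in a Table 2-style summary."""
    return Counter(pub_date.strftime("%a") for pub_date, _ in selected)
```

Note that avoiding only the previous week's weekday does not by itself guarantee an even spread across days; checking the tally from `weekday_tally` and resampling or adjusting by hand may still be needed to reach a roughly balanced distribution.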
The corpus contains only the verbal text of the selected articles; photos and other visuals, along with their captions, were deleted during the cleaning process. Other information, such as the author, the date, and when the article was updated, was also deleted from the body text of each article, although the date and author were used as the file name. Each article is available as an individual .txt file with UTF-8 encoding, and the whole training dataset is also available as a zipped file for easy use with our notebooks.
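As a rough illustration of how a zipped collection of UTF-8 text files like this one might be read outside the notebooks, here is a minimal sketch using only the Python standard library. The function names are hypothetical, and the whitespace-based token count is an assumption for the example; it will not necessarily reproduce the token counts in Table 1, since the tokeniser used for those figures is not specified here.

```python
import zipfile
from pathlib import PurePosixPath


def load_corpus(zip_path):
    """Return a dict mapping each .txt file's stem to its decoded text.

    `zip_path` may be a path on disk or a file-like object.
    """
    texts = {}
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.endswith(".txt"):
                raw = archive.read(name)
                # The articles are stored as UTF-8 encoded text files.
                texts[PurePosixPath(name).stem] = raw.decode("utf-8")
    return texts


def count_tokens(text):
    # Assumption: simple whitespace tokenisation, for illustration only.
    return len(text.split())
```

Because the date and author are kept in the file names, the keys of the returned dictionary can be used to recover that metadata for each article.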
Acknowledgments
We are grateful to the Honi Soit editorial team for giving us permission to compile and distribute this dataset.
References
Hester, J. B., and Dougall, E. (2007). ‘The efficiency of constructed week sampling for content analysis of online news’. Journalism & Mass Communication Quarterly 84(4): 811–824.
Luke, D. A., Caburnay, A., and Cohen, E. L. (2011). ‘How much is enough? New recommendations for using constructed week sampling in newspaper content analysis of health stories’. Communication Methods and Measures 5(1): 76–91.