(by Monika Bednarek)
In my book Language and Television Series: A Linguistic Approach to TV Dialogue I analysed the Sydney Corpus of Television Dialogue (SydTV), a dataset made up of dialogue from one episode from each of 66 recent US American television series. 34 of these episodes come from ‘quality’ series, and 32 come from other series (called ‘mainstream’ here). Since quality can be defined in many different ways and an external criterion was needed to determine the status of a particular episode, a series was classified as ‘quality’ on the basis of Emmy or Golden Globe award nominations or wins for ‘best/outstanding’ TV series or ‘outstanding writing’ at the time of corpus construction. In the book, I analysed the corpus as a whole rather than exploring potential differences between episodes from ‘quality’ and ‘mainstream’ television. I have therefore often wondered whether corpus techniques could show how similar these two subsets are in terms of their lexical profile.
To find out more, I used AntWordProfiler (Anthony 2013) to create a lexical profile for each subset (‘quality’ and ‘mainstream’), using the original (not standardised) SydTV version 4. In essence, this tool profiles the vocabulary level and complexity of texts against selected wordlists.1 For example, the results show the percentage of tokens in the analysed dataset that are listed in the different wordlists. The first list (level 1) typically contains core high-frequency words, and a high percentage of coverage would indicate that the dataset is relatively easy to understand.
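For readers who want to see the arithmetic behind such a profile, here is a minimal sketch in Python. It is not AntWordProfiler’s actual implementation, and the file names, tokenisation and wordlist format are illustrative assumptions: each token is simply credited to the first wordlist level that contains it, with everything else counted as ‘off-list’.

```python
from collections import Counter

def load_wordlist(path):
    """Read one word per line into a lower-cased set."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def lexical_profile(tokens, levels):
    """Return the percentage of tokens covered by each wordlist level.

    `levels` is an ordered list of (name, word_set) pairs; each token is
    credited to the first level that contains it, and remaining tokens
    are counted as 'off-list'.
    """
    counts = Counter()
    for token in tokens:
        t = token.lower()
        for name, words in levels:
            if t in words:
                counts[name] += 1
                break
        else:
            counts["off-list"] += 1
    total = sum(counts.values())
    return {name: round(100 * n / total, 2) for name, n in counts.items()}

# Illustrative usage (file names are hypothetical):
# levels = [("GSL 1000", load_wordlist("gsl_1000.txt")),
#           ("GSL 2000", load_wordlist("gsl_2000.txt")),
#           ("AWL",      load_wordlist("awl.txt"))]
# tokens = open("sydtv_quality.txt", encoding="utf-8").read().split()
# print(lexical_profile(tokens, levels))
```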
First, I created a lexical profile for each dataset in turn, using the inbuilt default wordlists, the GSL (1000/2000) and the AWL (570), which were created by Paul Nation based on the work of West (1953) and Coxhead (2000) and then cleaned by Laurence Anthony. The results are presented below (Tables 1 and 2) and show that the two datasets are very similar in terms of their lexical profile and coverage, e.g. 82.14% (quality) and 82.79% (mainstream) of tokens occur in the GSL 1000.
I then repeated the process using the BNC/COCA family lists + extras (Version 2.00), created by Paul Nation. Again, Tables 3-4 demonstrate that the results for both datasets are very similar (focussing here on basewords 1-3), e.g. 87.02% (quality) and 87.15% (mainstream) of tokens occur in the baseword 1 list. Together, these findings suggest that the two datasets are similar as far as their lexical coverage is concerned.
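One detail worth noting is that the BNC/COCA lists are family lists: each baseword groups a headword with its inflected and derived family members, and a token counts as covered if any member of its family is listed. The sketch below illustrates this kind of lookup under a simplified, assumed file layout (headwords flush left, family members indented); the actual format of the BNC/COCA files should be checked before reusing this.

```python
def load_family_list(path):
    """Map every word to the headword of its family.

    Assumes a simplified layout in which headwords are flush left and
    family members are indented beneath them; check the real BNC/COCA
    file format before relying on this.
    """
    member_to_head = {}
    head = None
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            word = line.strip().lower()
            if not line[0].isspace():    # flush-left line starts a new family
                head = word
            member_to_head[word] = head  # the headword also maps to itself
    return member_to_head

def family_coverage(tokens, member_to_head):
    """Percentage of tokens whose word family appears in the list."""
    hits = sum(1 for t in tokens if t.lower() in member_to_head)
    return round(100 * hits / len(tokens), 2)

# Illustrative usage (file name is hypothetical):
# basewords_1 = load_family_list("basewrd1.txt")
# print(family_coverage(tokens, basewords_1))  # cf. the 87% figures above
```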
To triangulate these findings, I decided to also undertake a keyword analysis using CQPweb. Of the different available corpus versions, I chose the ‘Standardised Sydney Corpus of Television Dialogue_CLAWS’ and used the interface to create two sub-corpora: ‘mainstream’ and ‘quality’. I then created a keyword list for the ‘quality’ sub-corpus, using the ‘mainstream’ sub-corpus as the reference corpus.2 Table 5 shows the results, excluding punctuation (!) and the forms Vince, Joy, Bill, Michael, which are predominantly used as names to refer to characters.
In total, only seven keywords are identified. Four of these relate to informal, colloquial or non-standard language (fuck, shit, ai[n’t], yo, occurring in 9-12 episodes), and one occurs predominantly in a single episode and relates to its plot (bear, in a Weeds episode). The remaining two keywords are want (508 occurrences across 34 episodes) and your (953 occurrences across 34 episodes), both highly frequent and distributed across all episodes in the ‘quality’ subset. Without further investigation, it is difficult to understand why these two word forms are ‘key’ in ‘quality’ television; they are certainly not complex words or specialised vocabulary. The difference in swear/taboo words (fuck and shit, both occurring in 12 different episodes) relates to the fact that more ‘quality’ episodes come from ‘uncensored’ programs, and there are known differences in the use of swear/taboo words in ‘censored’ vs. ‘uncensored’ television (e.g. Bednarek 2019, in press), influenced by different regulations for network and cable television. Further, prior analysis of ain’t in SydTV (Bednarek 2018: 169ff) showed that while it occurs in about 30% of the whole corpus, most of the instances occur in two episodes, from True Blood and The Wire, respectively; both of these episodes are included in the ‘quality’ subset. Regarding yo (52 occurrences in 9 episodes), this could have to do with the kinds of characters that occur in the ‘quality’ dataset. This form is often (though not exclusively) uttered by non-European American characters (e.g. African American, Puerto Rican American) or young characters, and in one episode it actually stands for Spanish ‘I’ in the phrase yo no se. In sum, the keyword results largely seem to confirm the lexical similarity between the two datasets as measured by corpus techniques.
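Note 2 spells out the exact CQPweb settings behind this keyword list. For illustration, here is a sketch of the underlying calculation: the standard log-likelihood (G2) statistic for one word form, together with a Šidák-adjusted significance threshold. This is the textbook formula rather than CQPweb’s code, and in the commented usage example only want’s frequency (508) comes from the results reported above; the reference frequency and corpus sizes are invented.

```python
import math
from scipy.stats import chi2  # chi-squared distribution, df = 1

def log_likelihood(a, b, c, d):
    """Log-likelihood (G2) keyness for one word form.

    a, b = the word's frequency in the study and reference (sub-)corpus;
    c, d = total tokens in the study and reference (sub-)corpus.
    """
    e1 = c * (a + b) / (c + d)  # expected frequency, study corpus
    e2 = d * (a + b) / (c + d)  # expected frequency, reference corpus
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2

def sidak_threshold(alpha, n_tests):
    """Critical G2 value after Šidák-correcting alpha for n_tests comparisons."""
    alpha_adj = 1 - (1 - alpha) ** (1 / n_tests)
    return chi2.ppf(1 - alpha_adj, df=1)

# Hypothetical usage: want has 508 hits in the 'quality' sub-corpus (as
# reported above); the reference frequency and corpus sizes below are
# invented for illustration only.
# g2 = log_likelihood(508, 310, 260_000, 240_000)
# print(g2 > sidak_threshold(0.05, n_tests=10_000))  # key at the 5% level?
```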
These results do not mean that there are no linguistic differences between ‘quality’ and ‘mainstream’ television, as I focused only on particular lexical resources. It remains to be seen, for example, whether award-winning ‘quality’ television uses more literary devices or different grammatical structures than ‘mainstream’ television. Other potential differences could relate to the complexity of plots or characterisation.
Notes
1 Tool settings for AntWordProfiler: Hide angle tags; show statistics, word types and word groups (Families). Sort level 1 = frequency; sort level 2 = word. Batch process = No.
2 Tool settings for the keyword analysis: Compare word forms; show positive keywords; minimum frequency of 2 in both lists; log-likelihood statistic with a significance cut-off of 5%; Šidák correction applied. Very similar results are obtained when the Šidák correction is not used and the significance cut-offs are 0.00001%, 0.0001% and 0.001% respectively (with 0.001%, one additional keyword is identified: fucking).
Acknowledgments
Thanks go to Laurence Anthony for his help with AntWordProfiler and feedback on an earlier draft of this post.
References
Anthony, L. (2013). AntWordProfiler (Version 1.4.0.1) [Computer Software]. Waseda University, Tokyo. Available from https://www.laurenceanthony.net/software/antwordprofiler/
Bednarek, M. (in press). Swear/taboo words in US TV series: Combining corpus linguistics with selected insights from screenwriters and learners. In V. Werner & F. Tegge (Eds.), Pop culture in language education. Routledge.
Bednarek, M. (2019). ‘Don’t say crap. Don’t use swear words.’ – Negotiating the use of swear/taboo words in the narrative mass media. Discourse, Context & Media 29: 1-14 (available open access).
Bednarek, M. (2018). Language and television series: A linguistic approach to TV dialogue. Cambridge University Press.
Coxhead, A. (2000). A new academic word list. TESOL Quarterly 34(2): 213-238.
West, M. P. (Ed.). (1953). A general service list of English words: with semantic frequencies and a supplementary word-list for the writing of popular science and technology. Longman, Green, and Co.