written by Marcel Reverter-Rambaldi
The development of large-scale linguistic corpora has greatly broadened the scope of research that can be done into language. Projects including the Language Data Commons of Australia and Sydney Speaks demonstrate the value that is placed on comprehensive collections of language data. As corpora continue to grow in scale, the benefit of automated approaches to analysing data becomes more obvious. One method that is widely used in Natural Language Processing (NLP) is topic modelling – an automated method of determining the topics that occur in a text by identifying clusters of words that are textually related. To date, the two main topic-modelling algorithms, Latent Dirichlet Allocation (LDA) and Latent Semantic Indexing (LSI), have been applied overwhelmingly to formal, written data, such as journal articles, business reports, and Twitter posts. As such, there appears to have been very little application of topic-modelling algorithms to spoken language data. If NLP methods are to be broadly effective, they must also be applicable to speech data. But are they?
To address this question, I applied topic modelling to the speech data in the Sydney Speaks corpus, a 1.5 million-word corpus made up of sociolinguistic interviews, as outlined here and described in another Sydney Corpus Lab blog post. Sociolinguistic interviews are informal and loosely structured, but because they take the form of guided conversations, they are ideal for testing the application of NLP methods to spoken data: topics are likely to be more contained and readily identifiable than in free conversation.
I began by selecting five transcripts (each approximately 5,000 words long, representing roughly 30 minutes of speech) from speakers of different age groups (from teenage to elderly) and covering a range of topics, such as (but not limited to) employment, hobbies, school life and childhood games. I annotated these transcripts with the aim of identifying words that conveyed a shared idea – mainly nouns and adjectives belonging to a shared semantic frame, for example bus, driver and depot occurring in the same transcript.

One of the key findings was the large range in the number of topics per transcript, from as low as 6 to as high as 16. This suggests that rather than an exact number of topics for a transcript, a range may be preferable, as some “topics” may fall under a broader “super topic” (e.g. the topics of bus driver, train driver and driving instructor could belong under the broader topic of “employment”, or each be considered a distinct topic). This is problematic, as topic-modelling algorithms are typically run with a specified number of topics.
Another way of gaining some insight into the topics is to look at the most frequent words. When this approach was applied to the raw transcripts, the high token frequency of grammatical words (e.g. prepositions, articles, pronouns) meant that more semantically meaningful words did not appear. An example of this is shown in Figure 1 (from a transcript of an interview with an Anglo Adult male in his 20s). While many of these words are informative about the nature of the text – the high frequency of I and you reflects that it is interactional, and yeah suggests informality – they tell us little about the actual topics being discussed.
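For readers who want to reproduce this kind of frequency list, a minimal Python sketch is given below. It is not the thesis code: the file name is hypothetical, and NLTK’s word tokeniser is just one reasonable choice of tokeniser.

```python
# Minimal sketch: most frequent words in a raw transcript.
# Assumes a plain-text transcript; the file name is hypothetical.
from collections import Counter
from nltk.tokenize import word_tokenize  # requires: pip install nltk; nltk.download('punkt')

with open("transcript_anglo_adult_20s.txt", encoding="utf-8") as f:
    text = f.read().lower()

# Keep word-like tokens (including contraction fragments such as 's, 're, 'm).
tokens = [t for t in word_tokenize(text) if any(c.isalpha() for c in t)]

for word, count in Counter(tokens).most_common(20):
    print(f"{word}\t{count}")
```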
To filter out these grammatical words, I applied a pre-existing stoplist (from the Natural Language Toolkit (NLTK) package in Python). The result is presented in Figure 2, where we can see that the top words are still, in general, markers of interaction and informality associated with spoken language, including backchannels, filled pauses, discourse markers, and epistemic expressions (e.g. yeah, uh, um, like, (I) think, (you) know), in addition to semantically light verbs (e.g. go, get) and contractions (e.g. ’s, ’re, ’m). So this second list is also not very informative as to topic contents.
Thus, I compiled a “conversational” stoplist, which included a set of frequent markers of interaction and informality that occurred in these transcripts. Combining this with the NLTK stoplist of grammatical words leaves us with a set of most frequent words that are now informative as to the topics in the text. As we see in Figure 3, topics in this transcript include dialectal variation (reflected in the words different, Australian, call, word, Australia, accent, speech, speak, American) and gaming (online, game, zombie).
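The two-stage filtering can be sketched as follows, reusing the tokens variable from the snippet above. Note that the conversational stoplist shown here contains only the example words mentioned in this post; the list actually compiled for the thesis is more extensive.

```python
# Sketch of the two-stage stoplist filtering (grammatical + conversational).
from collections import Counter
from nltk.corpus import stopwords  # requires: nltk.download('stopwords')

grammatical = set(stopwords.words("english"))        # NLTK's pre-existing stoplist
conversational = {"yeah", "uh", "um", "like", "think", "know",   # illustrative subset only
                  "go", "get", "'s", "'re", "'m"}

stoplist = grammatical | conversational
content_tokens = [t for t in tokens if t not in stoplist]

for word, count in Counter(content_tokens).most_common(20):
    print(f"{word}\t{count}")
```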
While the most frequent words – once properly filtered – are certainly quite informative, this list does not reveal how these words interact and co-occur with each other in the text, which is what forms topics. This is what the word clusters of topic modelling are intended to address.
When I applied LSI and LDA to these same five transcripts, the results were rather underwhelming; the word clusters, which should represent relevant topics, were largely uninformative and incoherent. Because it was apparent from the transcripts that words from the same topic tend to co-occur in groups, I wondered whether more meaningful clusters might emerge if the algorithms were run independently on sub-sections of the data. That is, instead of instructing the algorithm to identify, for example, six topics in a transcript, we break the transcript into six sub-sections and instruct the algorithm to identify one topic in each. Two linguistically based partitioning methods were devised, with a third method as a control: (1) partitions based on pragmatic markers and phrases (e.g. do you think), (2) partitions based on the longest interviewer turns, and (3) partitions that divide the transcript into evenly-sized chunks, as the control. These three methods, plus the unmodified, stock LSI, are represented graphically in Figure 4, using two partitions as a simplified example.
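To make the partition-then-model idea concrete, here is a minimal sketch using the gensim library (not the thesis code). It uses the simplest partitioning method, even-sized chunks (the control condition), and runs a one-topic LSI model on each chunk; the pragmatic-marker and interviewer-turn methods would only change how the partitions are built. The use of 25-token windows as the model’s document unit is an assumption of this sketch, not a detail of the thesis.

```python
# Sketch: partitioned topic modelling with gensim (control method: even partitions).
from gensim import corpora, models

def chunk(tokens, size):
    """Split a token list into consecutive chunks of roughly `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

# 1. Partition the (stoplist-filtered) transcript into six even sub-sections.
n_partitions = 6
partitions = chunk(content_tokens, len(content_tokens) // n_partitions)[:n_partitions]

# 2. Run a separate one-topic model on each sub-section.
for i, part in enumerate(partitions, start=1):
    docs = chunk(part, 25)                      # short windows act as the model's "documents"
    dictionary = corpora.Dictionary(docs)
    bow = [dictionary.doc2bow(d) for d in docs]
    lsi = models.LsiModel(bow, id2word=dictionary, num_topics=1)
    # models.LdaModel(bow, id2word=dictionary, num_topics=1, passes=10) is the LDA equivalent.
    top_words = [w for w, _ in lsi.show_topic(0, topn=6)]
    print(f"partition {i}: {', '.join(top_words)}")
```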
Partitioning gave rise to much better results: more of the clusters corresponded to topics that had emerged in the manual coding and in the most frequent words. Examples include the clusters “accent, diversity, accentuated, speak, emphasis, Australia” and “zombie, amaze, game, awesome, online, call”. Another topic that I had identified through manual annotation, but which did not occur among the most frequent words, was captured by the cluster “Lindt, siege, Sydney”, referring to the Lindt café siege that took place in Sydney in 2014.
Comparing the automated clusters with the manually identified topics showed that the correspondence was greater with partitioning than without: 50% of the clusters corresponded to manual topics when partitioning was employed (regardless of the partitioning method), compared to 39% with no partitioning. Additionally (and importantly), within the three partitioning approaches, more semantically intuitive clusters appeared when partitioning by pragmatic constructions and interviewer turns than with the control (even partitions). With a correspondence of only 50%, automated topic modelling still leaves much room for improvement. But what this research does show is that a promising way forward may lie in text partitioning, and in doing so in a linguistically informed manner.
More generally, this research has demonstrated that the assumptions about how language functions – as embedded in many NLP tools – may not translate neatly to spontaneous speech. Instead, it is crucial to consider the dynamics and idiosyncrasies of the context of language production when performing NLP, and here NLP may have a lot to learn from corpus linguistics.
This blog post derives from my Honours thesis (of the same name) – completed in the School of Literature, Languages and Linguistics at the Australian National University – which can be accessed through the ANU Open Thesis Repository.