Interview with Paul Baker


In 2023, the Sydney Corpus Lab is pleased to be featuring edited extracts from Dr Robbie Love’s CorpusCast podcast about corpus linguistics. In each blog post published throughout the year, we present the answers of leading corpus linguists to three questions. Specifically, all blog posts present answers to the following two questions:

  • What are the biggest changes you’ve noticed in corpus research throughout your career?
  • How will corpus linguistics make an impact on the world in the future?

Posts from episodes 1-4 additionally present answers to this question:

  • What has surprised you the most about your work in corpus linguistics?

Posts from episodes 5 onwards instead present answers to this question:

  • What is the biggest misconception of corpus linguistics you have encountered?

This blog post features Paul Baker. We have transcribed the relevant part of the interview but have edited answers for readability (taking out hesitation marks, discourse markers, etc.). Interview answers were transcribed by Kelvin Lee from the Sydney Corpus Lab. The full interview can be found here. We are grateful to Robbie Love and Sam Cook for their assistance in creating these posts.

ROBBIE LOVE: As we wind things down here on the first episode of CorpusCast – still getting used to that name – I’ve prepared what I call quick questions, but I’ve been around long enough to know that academics don’t do quick answers. So even if the questions are quick, the answers might not be – so that’s okay. You mentioned earlier that you started your career in the 1990s. What are the biggest changes that you’ve noticed in corpus research between then and now?

PAUL BAKER: I think there’s been a change of scale. I remember, to start, 100 million words was a lot. Now, it’s kind of the norm, I think – the tools will be able to deal with that. Now, there are corpora of billions of words. I remember, at the time, there wasn’t a lot of focus on online contexts of language and that’s now increasingly the norm. Two of the later projects I’ve been involved in: one involved looking at the feedback that people have left on the National Health Service website when they’ve been patients, and the responses to that patient feedback. At the moment, I’m looking at an online health support forum for people with anxiety. So, it’s a lot easier, I think, to collect that kind of data because it’s online anyway. Even when I look at newspaper reporting now – which is a very traditional form of text – for the last couple of projects, I’ve tried to make sure that I focus on the reader comments to the articles as well as the articles themselves. So, it does mean it’s easier to collect data, but it also means that issues of copyright and ethics, I think, get increasingly complicated. If you’re collecting tweets, for example, you can’t go and write to every person who collected [?] a tweet and say, “can I have your tweets in my corpus?” It just wouldn’t be feasible to do that. Then when you’re analysing the tweets and you want to maybe quote an example – maybe it’s quite an unpleasant example – do you want to kind of humiliate, potentially, somebody because somebody else can then google that tweet to find it online and then maybe put in a complaint about them? Maybe it’s a child who wrote the tweet. So, that can be really difficult. I think it is hard and there are questions that, as a community, we’re still grappling with – like ethics and copyright. Hopefully, we will eventually get there and get some answers.

ROBBIE LOVE: Yeah, and absolutely. It’s amazing to sort of track how quickly things have changed in that regard particularly as you say with online ethics and the scale, as well, of the data. When I’m talking to my students about corpus linguistics – as we do at our English language programs here at Aston – they’re often very surprised about just how large some of these data sets can be. You’re right, it’s not always been that way at all. What surprised you the most about your own work in corpus linguistics? I appreciate that this might be a difficult question, but I was curious. What are the biggest surprises for you?

PAUL BAKER: I think I’ve been surprised at the range of ways I’ve been able to use these techniques over the years. Back in ’97 when I was doing this part-of-speech tagging checking of the BNC, I had no idea what was going to happen in the next 20 years or so. Since then, I’ve looked at newspaper articles, patient feedback, online health forums. I’ve looked at violent propaganda magazines, political speech, television scripts, and student writing. Personal adverts. Even erotic stories. They’ve all told me something about human nature. They’ve told me something I didn’t know about the human condition and how humans use language to kind of get a point across – our attitude. I think that’s really what I’m most interested in. I’m most interested in people and what people are like. What I love about corpus linguistics is how these techniques can point you to a relatively innocent-looking word like I or me. Then when you start to investigate it, it tells you something about how people are using a word which you would never have thought to look at if you were left to kind of just follow your own train of thought. Every time I do a new project and I start with a new corpus, it’s a bit like playing with a puzzle box and I’m using all these different techniques to get out all of these little jewels or gems of information that nobody else knows yet. Every single time, I think it’s different and it’s interesting and keeps me wanting to kind of keep on doing it and going back and doing it with a different data set. That’s what gets me to get up in the morning, I think, which is great.

ROBBIE LOVE: Oh, it’s great to hear that you’re so motivated by this. I like that metaphor. Okay, so finally, of course, we’re going to look to the future now. How will corpus linguistics continue to make, or even make an even bigger, impact on society in the future?

PAUL BAKER: I’m not sure I can predict but I can make some suggestions, I think, for how I hope it would. I hope it will help make people become a bit more cynical, I think, or critical about the ways that language is used to influence us and consequently, also about the impacts that our own language choices have on others. I think it has the potential to make us better communicators but also to be more savvy receivers of language. You mentioned earlier, we don’t get taught this at school. I actually think it would be great to have these techniques and concepts from corpus linguistics on the national curriculum and taught in schools so that everyone has access to these methods of making sense of language. People are increasingly computer literate. I think using a corpus should be as easy as consulting a dictionary. It’s just so fascinating as well, I think, to think everybody has the potential to be a corpus researcher. I do think that it has the potential to effect real social change. The findings that you get from this are large enough to be generalisable and they’re obtained in a relatively fair and unbiased way. Even if you’ve got biases yourself, you have to make sense of what the computer is telling you. The patterns, you can’t just pick them out. So, what you end up with is a more convincing analysis rather than just a kind of polemic. The challenge then, I think, though is you get loads of findings and it’s kind of putting that in an interesting narrative to make it engaging, I think, for readers to respond to and understand.