BOB GARFIELD: The CIA didn't invent the use of sentiment analysis. Researchers have been exploring this area for some time and applying mood tracking in all sorts of ways. Johann Bollen is a professor at the Indiana University School of Informatics and Computing. He says that with all the words and observations and opinions now available on the Web, we really can nail the zeitgeist.
JOHANN BOLLEN: For quite a while, people have been devising computer algorithms to extract information from what they call natural language, to have computers detect a particular sentiment or mood, in general. And these developments really have come to a head in the past five or six years, with the availability of very large-scale online text databases.
BOB GARFIELD: Who else is using this technology to plumb the zeitgeist?
JOHANN BOLLEN: Recently I’ve seen quite a few publications on analyzing Twitter chatter about symptoms. You can actually extract information from those texts to determine whether that person is infected with a particular viral disease, like H1N1 or influenza. Anything involved with public health, public policy and sort of socioeconomic phenomena, including the financial markets, is looking at this type of chatter.
BOB GARFIELD: Wall Street’s your particular expertise. You did a, a research study showing that you could predict movements in the market three or four days in advance, based on the public mood beforehand. And you're not just looking for gonna buy/gonna sell. There's other kinds of language that your crawler is looking for. What are the keywords?
JOHANN BOLLEN: Well, it's - it's not just a limited set of keywords. What the algorithm does is that it looks at the particular phrasing, the grammatical structure of tweets. Our sentiment analysis is not limited to just determining positive versus negative mood states. It actually looks at, at an entire range of different dimensions, including calm versus anxious, happy versus sad. And from that emerges a, a pretty sophisticated picture of the general public's mood state at a particular point in time.
BOB GARFIELD: One of the problems with this kind of algorithm is the language itself. Sometimes language, and especially idiom, can be kind of misleading. For example, words like “kill,” “stupid,” “filthy,” even “retarded” can mean the opposite of what they appear to mean; they’re actually superlatives in common parlance. How does the algorithm deal with linguistic misdirection?
JOHANN BOLLEN: It's really difficult to detect sarcasm and irony. And I would actually argue that’s equally true for people. You know, when you send an email you have to be very careful that your particular emotional state isn’t misinterpreted by the person who receives that email. And so, the same applies to our algorithms. They’re – they’re, they’re not fail-proof. But when you apply them over tens of millions of tweets on a daily basis, then what you’re detecting is some kind of a general notion of the public's mood state.
BOB GARFIELD: I've been arguing for years that qualitative research, focus groups and the like, are not research at all; they don't generate data. It's statistically insignificant, easily manipulated and, from my perspective, just as likely to be exactly wrong as exactly right.
But it seems to me that what you're dealing with is something that deals with all of my objections, because you've got the world's largest focus group.
JOHANN BOLLEN: You've got about 750 million [LAUGHS] people on Facebook and a couple of hundred million people on Twitter, and those are such large chunks of the world's population, all in all, that you may not have a representative sample, but actually a sample that meets the requirements in terms of significance.
The most important thing here is that when you explicitly ask people about their opinion on a particular topic, very often people say one thing but that's not really what they believe, but that's what they believe the – the interviewer or the person asking the question would like to hear, that which is socially acceptable.
But online very often it’s anonymous. They disclose a lot about themselves, a lot about their opinions, and sometimes involuntarily.
BOB GARFIELD: Disinhibited in a good way.
JOHANN BOLLEN: Well, I [LAUGHS] – it’s, it’s not always clear whether it's in a good or a bad way.
[BOB LAUGHS]
But – but at least it’s honest.
BOB GARFIELD: Johann, thank you so much.
JOHANN BOLLEN: My pleasure.
BOB GARFIELD: Johann Bollen is a professor at the Indiana University School of Informatics and Computing.