Blog post by: Maya Ramanath
A big trend in data management research is to “structurize” unstructured data. One of the main sources of unstructured data is (online) text. It’s all around us: news, QA forums, scientific content, social media, enterprise websites, government data, etc. are all available at a click. The content comprises facts, opinions, analysis, commentary and also spam. One of the grand challenges in AI is to organize this data in such a way as to be machine-readable, i.e., to allow machines to “know” things. While we are still a long way from machines which can understand subtlety, we can certainly look at machines which “understand” facts, that is, statements which are either true or false.
The aim of this post is to give a very high level overview on identifying and organizing a very specific kind of content: political opinions. We are used to politicians giving long speeches explaining their opinions about everything under the sun to us in detail. These speeches are filled with ambiguous and sometimes contradictory statements, oversimplifications, rhetoric, sarcasm and incitement. What exactly do we mean by organising this landscape of opinions, and how do we even start? We will start in the simplest way possible and add layers of complexity as we go along.
When we talk about an “opinion”, it is immediately obvious that there is someone who holds this opinion: the “opinion holder”. Second, there has to be a topic on which the opinion is held; we’ll just call it the “topic”. Third, we will simplify what we mean by “opinion”. In our first attempt, an “opinion” has only two polarities: pro or con, support or oppose. So, someone is pro something, or con something. “Narendra Modi supports Beti-Bachao campaign”, “Jayalalithaa opposes fuel price hike”, “Kejriwal opposes land bill” — all statements of opinion, each consisting of an opinion holder (Narendra Modi, Jayalalithaa, Kejriwal), a topic (Beti-Bachao campaign, fuel price hike, land bill) and a polarity. These statements, all of which could easily appear in news articles, fit neatly into the triple structure <opinion holder, polarity, topic>. So, we have news articles as our textual sources, and the goal is to extract these kinds of opinion triples.
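To make the triple structure concrete, here is a minimal sketch in Python. The names and data are just the examples from this post; a real system would of course extract them from text rather than hard-code them:

```python
from dataclasses import dataclass
from enum import Enum

class Polarity(Enum):
    PRO = "pro"
    CON = "con"

@dataclass(frozen=True)
class Opinion:
    """One opinion triple: <opinion holder, polarity, topic>."""
    holder: str
    polarity: Polarity
    topic: str

# The three example statements from the text, as opinion triples.
opinions = [
    Opinion("Narendra Modi", Polarity.PRO, "Beti-Bachao campaign"),
    Opinion("Jayalalithaa", Polarity.CON, "fuel price hike"),
    Opinion("Kejriwal", Polarity.CON, "land bill"),
]

for o in opinions:
    print(f"<{o.holder}, {o.polarity.value}, {o.topic}>")
```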
While automatically acquiring a large number of such triples is already quite challenging, we have more things to think about. First, the opinions that we have acquired so far are structured, but they lack “uniformity”. For example, we need to figure out that “Arvind Kejriwal” and “Kejriwal” both refer to the same person. Similarly, “land bill” and “land acquisition bill of 2015” both refer to the same bill. This process is referred to as canonicalisation, i.e., assigning canonical names to opinion holders and topics. Next, we need to differentiate between a mention of “land bill” that refers to the “land acquisition bill of 2015” and one that refers to the “land acquisition act of 1984”. This process is known as disambiguation. Canonicalisation and disambiguation are closely related, and challenging problems in their own right. However, once we are able to provide reasonable solutions for these problems, we have an “opinion-base” where we can ask questions such as “who all oppose the land bill?”, or “who all oppose the land bill, but support eminent domain?”
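As a toy illustration of canonicalisation, the sketch below hard-codes a couple of alias tables and collapses two differently-worded extractions into one canonical triple. This is only to show where canonicalisation sits in the pipeline; a real system would have to learn these mappings, and disambiguation in particular is far harder than a dictionary lookup:

```python
# Illustrative alias tables; a real system would learn these mappings
# from data rather than hard-code them.
HOLDER_ALIASES = {"Kejriwal": "Arvind Kejriwal"}
TOPIC_ALIASES = {"land bill": "land acquisition bill of 2015"}

def canonicalise(name, aliases):
    """Map a surface form to its canonical name (identity if unknown)."""
    return aliases.get(name, name)

# Raw extractions with non-uniform names...
raw = [
    ("Kejriwal", "con", "land bill"),
    ("Arvind Kejriwal", "con", "land acquisition bill of 2015"),
]

# ...collapse to canonical triples: both extractions become one fact.
opinion_base = {
    (canonicalise(h, HOLDER_ALIASES), p, canonicalise(t, TOPIC_ALIASES))
    for h, p, t in raw
}

# "Who all oppose the land bill?"
opposers = {h for h, p, t in opinion_base
            if p == "con" and t == "land acquisition bill of 2015"}
print(opposers)  # {'Arvind Kejriwal'}
```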
One of the major pieces missing in our opinion-base is time. We all know that politicians are famous for having different opinions on the same topic at different times. So, it would be helpful to know when these opinions were expressed. Instead of triples, we could have quadruples: <opinion holder, polarity, topic, time>. Unfortunately, associating time with opinions is non-trivial, and it increases the complexity of extraction. In addition to extracting triples from sentences such as “Kejriwal opposes land bill”, we also need to determine when this piece of news was reported. The easy case is a news article with this headline: we could simply look at when the article was published and associate that time with the opinion. However, if the article was reporting this opinion at some later date, perhaps as part of an opinion piece (“Kejriwal expressed his opposition to the land bill in July 2015 and continued to do so….”), then we need to identify that the time of interest is July 2015, not the current date or even the date on which the article was published. To make it a bit more complicated, instead of just a particular point in time, we may have to identify a time range; after all, the United States was opposed to our nuclear program until they supported it. As expected, identifying the times at which opinions are valid is also a challenging task. But once we find a reasonable solution to the problem, we have an opinion-base where we can ask questions such as “who all opposed the land bill but later changed their stance?” (in effect, we can identify the flip-floppers in our political system; perhaps this is everyone!).
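Once opinions carry a time component, the flip-flopper query becomes a simple scan over quadruples. A toy sketch, with made-up years and times simplified to single points:

```python
# Quadruples <opinion holder, polarity, topic, time>; hypothetical data.
quads = [
    ("United States", "con", "Indian nuclear program", 1997),
    ("United States", "pro", "Indian nuclear program", 2000),
    ("Kejriwal", "con", "land bill", 2015),
]

def flip_floppers(quads):
    """Find (holder, topic) pairs whose polarity changed over time."""
    last_seen = {}   # (holder, topic) -> most recent polarity
    flips = set()
    for holder, polarity, topic, time in sorted(quads, key=lambda q: q[3]):
        key = (holder, topic)
        if key in last_seen and last_seen[key] != polarity:
            flips.add(key)
        last_seen[key] = polarity
    return flips

print(flip_floppers(quads))  # {('United States', 'Indian nuclear program')}
```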
It’s great that so far, we have been able to extract these crisp structured opinions. However, we have completely eliminated any kind of subtlety and context from this process! Subtlety might be tough, but context should be easy. We know that the United States opposed our nuclear program before the nuclear tests, but now they support it. Having the two facts <United States, opposes, Indian nuclear program, 1997> and <United States, supports, Indian nuclear program, 2000> seems quite inexplicable, unless we also provide context. This is quite easily done: just add another column pointing to the source from which we extracted this information. So we now have: <opinion holder, polarity, topic, time, context>. The context could simply be the article from which this opinion was extracted. Anyone making use of the opinion-base can easily trace the origins for themselves.
Let’s now add a meta-level complexity: why should someone trust whatever we have extracted into our opinion-base? That is, does it matter where we acquired our opinions from? Of course it does. There are serious newspapers and there are tabloids. We can still (hopefully) trust what the serious newspapers tell us, but we all know there can be conflicting reports and subsequent denials, all depending on various interpretations of the exact words that were spoken (politicians are, of course, experts at this!). Things are now messy: we are trying to determine the “truth” of the facts in our opinion-base. So, let’s simplify the problem of trust a bit. For our opinion-base, we will rely on news sources and will associate a “trust” factor with every fact. So, did Kejriwal really oppose the land bill? Did Jayalalithaa really oppose the fuel price hike? Well, the best we can do (given that we are relying on textual sources) is to believe something if it came from a large number of reputable sources. So, our “trust” factor is a combination of the number of times we came across this report and a measure of how “reputable” each source is. We now have: <opinion holder, polarity, topic, time, trust factor>. And associated with each trust factor are the details: which sources reported this, and how reputable we consider them.
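There are many ways to turn report counts and source reputations into a single trust factor; the sketch below picks one simple (and entirely arbitrary) combination, with hypothetical source names and reputation scores, just to illustrate the idea:

```python
# Hypothetical source reputations in [0, 1]; a real system would have
# to estimate these, not assign them by hand.
REPUTATION = {"serious-daily": 0.9, "wire-agency": 0.8, "city-tabloid": 0.2}

def trust_factor(reporting_sources):
    """Combine how often a fact was reported with how reputable each
    source is. One simple choice: sum the reputations of the sources
    that reported it, then squash the sum into [0, 1)."""
    total = sum(REPUTATION.get(src, 0.1) for src in reporting_sources)
    return total / (1.0 + total)

# "Kejriwal opposes land bill" reported by two reputable sources...
well_sourced = trust_factor(["serious-daily", "wire-agency"])
# ...versus a claim that appears only in a tabloid.
poorly_sourced = trust_factor(["city-tabloid"])

assert well_sourced > poorly_sourced
```

The key property is only that more reports from more reputable sources push the trust factor up; the exact formula is a design choice.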
The problem is still not solved though. What about conflicting reports? How do we reconcile them? Do we just leave them as is, acquire both opinions, point to their sources, and throw up our hands? Or do we aim for a fully consistent opinion-base? One of the advantages of a consistent opinion-base is that we can perform reasoning and prediction on top of it. For example, we may be able to predict how a certain politician will vote on a specific bill, given their pattern of support for other, older bills. These are all complex questions for which we do not yet have good answers. Even without all the answers, we can certainly use opinion-bases in a number of applications. Apart from answering some interesting questions about our politicians and their stances, we can study, for example, bias in our news channels: organize our topics into groups (economy, sports, cultural, etc.), and study how news channels report on these topics and when. Are they mainly supportive of the economy during a Congress government or a BJP government, or are they roughly even? Are they supportive of our cricket team when they lose under Dhoni, or are they stringent in their criticism? We could also study complex interconnections among politicians. Just as we built an opinion-base, we can think of building a knowledge-base of “political relationships”, i.e., who gave donations to whom and when, who gave a speech in support of whom and when, who was made minister, etc., and connect these to their opinions on the government, bills and issues. There are a number of interesting applications to build.
In summary, opinion mining as a whole is a vast area of research, of which structurizing political opinions is a very small part. In this post, we didn’t even consider things like “how exactly do we extract?” (that question is the subject of another vast area of research called information extraction). Nor did we consider even tougher problems: we paid no attention to the strength of opinions (only two polarities!) or to topics and subtopics (we could have people “supporting India”, but “India” is too broad a topic to be useful), and there are many more issues to consider in just this narrow (compared to the field of opinion mining) topic. We do some work in our group related to opinion mining. If you are interested in a project, do contact us!
General surveys on opinion mining as a whole:
“Opinion mining and sentiment analysis”, Bo Pang and Lillian Lee
Foundations and Trends in Information Retrieval 2(1-2), pp. 1–135, 2008.
Sentiment Analysis and Opinion Mining, Bing Liu
Morgan & Claypool Publishers, May 2012.
Specifically on structurizing opinions:
“Harmony and Dissonance: Organizing the People’s Voices on Political Controversies”, Rawia Awadallah, Maya Ramanath and Gerhard Weikum
Proc. of the ACM Conf. on Web Search and Data Mining (WSDM), 2012