Introducing Sayan Ranu

The DAIR group and the department of CS&E are happy to welcome Sayan Ranu to our ranks as faculty. Here is Sayan in his own words:

Hello everyone,

This is Sayan and I joined the CSE department at IIT-D exactly two months back (21st December 2016).

Let me start by answering the question I hear most often when I meet someone new at IIT D: what are my research interests? Given that I am a member of the DAIR group, it is obvious that my interests lie in Data Science. Data Science, however, is a fairly broad area, and the particular topics I focus on are Graph Mining and Spatio-temporal data analytics.

Graph Mining: Querying and mining graph datasets have been studied extensively and continue to be among the most active research areas. However, an overwhelming majority of this work is centred on analysing static graph properties. In today’s world, graphs often change with time. In social networks, new nodes get added every second. Links between nodes change as old acquaintances are forgotten and new friendships are forged. The content at each node (such as a Facebook wall post) changes with time and propagates through the network. In road networks, the volume of traffic changes every minute. While some parts of the network can cope with higher traffic, others get bogged down by congestion, which, in turn, alters typical commuting behaviour and causes the congestion to spread further through the network. What are the “laws” governing the evolution of these dynamic networks? If we partially observe a trend, can we predict its cascading effect? Can we mine patterns that highlight trends deviating from the expected behaviour? These are some of the fundamental questions that drive my research efforts.

Spatio-temporal data analytics: Today, buses, cabs, and ambulances are tracked through GPS-aided navigation systems to collect data and improve services. Users voluntarily share their locations on social networking sites like Facebook and Twitter through “check-ins” made from smartphones. Movements of people are also captured through involuntary means in various services, such as geo-tagged photo albums on sites like Flickr, Google+ and Facebook, credit card transactions, messaging apps like WhatsApp and Viber, and base-station connectivity in cell phone call detail records. Querying, mining and modelling this data is central to a multitude of smart city applications, such as congestion modelling, urban resource management, security, and infrastructure development. Given a budget of X flyovers, where should we construct them so as to maximise the reduction in traffic congestion? Can we model and predict the movement of a criminal from his/her past activities collected through call detail records? Can we group residents of a city into communities based on their check-ins? Who are the errant bus and cab drivers who pose a risk to other commuters? These are another set of fascinating questions that remain to be answered.

What else do I like outside research? Well, I love sports, particularly cricket and tennis. If there is any regular group on campus playing cricket or tennis, do let me know. I would definitely join.


Things I Learned at VLAI 2016

Blog entry by: Arindam Bhattacharya

I learned many things about current trends in machine learning at the Vision, Language and Artificial Intelligence (VLAI) 2016 workshop. Here I try to summarise what I gathered from the experience.

Generative Adversarial Network [GAN]

  • Data is precious. And depending on what you work on, [labelled] data is scarce. It was only a matter of time, after the success of supervised deep learning methods, before the focus shifted towards semi-/unsupervised learning. GANs provide such a framework. And they work spectacularly once trained (although they are notoriously hard to train).
  • GAN is a framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 1/2 everywhere. When G and D are both implemented as neural networks, the whole system can be trained using backpropagation [source] (a minimal training sketch follows this list).
  • The fields of Vision and Language have huge amounts of unlabelled data. With GANs, these were used for a variety of tasks, such as video prediction, image generation and super-resolution (yeah, not much “Language” there).
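To make the two-player game described above concrete, here is a minimal training sketch. This is my own illustration (assuming PyTorch is available), not code shown at the workshop, and it uses the common non-saturating generator loss rather than the pure minimax objective: a tiny generator G learns to mimic a 1-D Gaussian while a discriminator D learns to tell real samples from generated ones.

```python
# Hedged sketch: a toy GAN on 1-D data, assuming PyTorch; not from the workshop.
import torch
import torch.nn as nn

torch.manual_seed(0)

# G maps 8-D noise to a 1-D "sample"; D maps a sample to P(sample is real).
G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    real = torch.randn(64, 1) * 1.5 + 4.0       # "training data": samples from N(4, 1.5^2)
    noise = torch.randn(64, 8)
    fake = G(noise)

    # Train D to assign high probability to real samples and low probability to fakes.
    opt_d.zero_grad()
    loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    loss_d.backward()
    opt_d.step()

    # Train G to fool D (non-saturating form of "maximise the probability of D making a mistake").
    opt_g.zero_grad()
    loss_g = bce(D(G(noise)), torch.ones(64, 1))
    loss_g.backward()
    opt_g.step()

# After training, G's samples should have drifted towards the data mean of 4.0.
print(G(torch.randn(1000, 8)).mean().item())
```

Keeping the two optimisers balanced is exactly where the “notoriously hard to train” reputation comes from.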

Again, Supervision is a Bottleneck

  • An architecture with feedback connections that learns top-down representations was proposed. Top-down learning allows the model to learn representations based on context. For example, an object is more likely to be a bottle if it is placed on a table.
  • Such an architecture has the ability to self-supervise its learning. Interesting connections were made with feedback connections in the human brain.
  • A self-supervised learning agent, for example, may learn a better representation of an object by touching it, pushing it, etc., and using the feedback.

Variational Auto-encoders vs GANs

  • Some favour the nice probabilistic formulation of VAEs [source], which allows the theory of graphical models to be carried forward.
  • In general, though, in the field of Vision, GANs are preferred as they are better at generating visual features.

More deep stuff

  • Preliminary studies on utilizing models of intuitive physics for better forecasting of the effects of actions on objects were presented.
  • A tutorial on sequence to sequence modelling was presented, along with its application for lip-reading.

How people think

  • People are really good at finding/defining problems. With deep learning being the dominant approach, the main novelty was in the problems they chose and the tweaks they applied, rather than in an innovation in algorithm/architecture. Some new hot applications include Visual Question Answering and Visual Dialog.
  • Defining new problems requires new data. Many speakers spent more than half their time explaining how they were getting data. Various challenges present themselves here, ranging from time/funds to the reliability of the people involved. Hence the focus on unsupervised approaches.
  • When it comes to giving captivating presentations, industry researchers are better than academics [of course this is biased because of the small sample size, but the difference was stark].

Anti-rumour

Blog entry by: Amitabha Bagchi

In the wake of the Government of India’s decision to withdraw the legal tender status of Rs 500 and Rs 1000 banknotes, a serious rumour regarding a shortage of salt has led to tensions, mob violence and even death all over the country. This is not a new problem (“Rumor, the swiftest of all evils,” Virgil says in his Aeneid, “Speed lends her strength, and she finds vigor as she goes.”) What is new are platforms like WhatsApp that allow rumours to spread at vast scale.

In view of this, I thought it might be worthwhile to suggest a few rumour detection and control measures that the government might consider adopting for future episodes (this one has already gone viral and will now dissipate in its natural course, hopefully without causing much more damage). The ideas being put out here are based on work done by Rudra Mohan Tripathy, Sameep Mehta and myself (see the conference version that appeared in CIKM 2010 or the expanded journal version that appeared in Intelligent Data Analysis in 2013).

We modelled the spread of a rumour as a message spreading and replicating on a network. Our primary suggestion was that the best way of combating a rumour is to attack it with a similar process, an anti-rumour, which is also a message spreading and replicating through the network, one that contradicts the rumour and brings attention to the fact that a rumour is spreading. The idea is that on a social network we have some trust in our connections, and this trust can be used to debunk a rumour. Sometimes, when a government says something, people are suspicious, a classic example being rumours about vaccinations. However, they do trust their friends, and if, on a WhatsApp group for example, someone forcefully debunks a rumour and encourages others to debunk it further, this might contain the rumour better than authorised broadcasts through mass media channels.
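As a toy illustration of this intuition (my own simplified sketch in Python, not the model analysed in the papers above), consider a rumour spreading over a random network, with “beacon” nodes seeding an anti-rumour a few steps later; anyone who receives the anti-rumour stops believing, and forwarding, the rumour.

```python
# Hedged sketch: a toy rumour vs anti-rumour cascade; all parameters are arbitrary.
import random

random.seed(42)
N, AVG_DEG = 2000, 8

# Build a simple Erdos-Renyi-style neighbour list.
neighbours = [[] for _ in range(N)]
for u in range(N):
    for v in range(u + 1, N):
        if random.random() < AVG_DEG / N:
            neighbours[u].append(v)
            neighbours[v].append(u)

state = ["ignorant"] * N      # each node is ignorant, a rumour carrier, or an anti-rumour carrier
state[0] = "rumour"           # the rumour starts at node 0
rumour_front, anti_front = [0], []

for t in range(30):
    if t == 5:                # beacons detect the rumour and seed the anti-rumour
        beacons = random.sample(range(N), 20)
        for b in beacons:
            state[b] = "anti"
        anti_front = beacons[:]

    new_rumour, new_anti = [], []
    for u in anti_front:      # the anti-rumour spreads and also converts rumour carriers
        for v in neighbours[u]:
            if state[v] != "anti":
                state[v] = "anti"
                new_anti.append(v)
    for u in rumour_front:    # the rumour can only infect the still-ignorant
        for v in neighbours[u]:
            if state[v] == "ignorant":
                state[v] = "rumour"
                new_rumour.append(v)
    rumour_front, anti_front = new_rumour, new_anti
    print(t, "rumour:", state.count("rumour"), "anti:", state.count("anti"))
```

Even in this crude setting, seeding a handful of well-placed anti-rumour sources early tends to cap how far the rumour gets, which is the intuition behind the beacon-based suggestions below.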

In brief, here are some ways of approaching this problem based on our research:

  • The government can think of creating a “human infrastructure” of beacons within the network who can be tasked with detecting rumours. The beacons could be principals of government schools, district and block level functionaries and so forth. Such people are naturally embedded within their communities and, consequently, within social networks enabled through WhatsApp or other messaging platforms. When they realise a rumour is spreading they should alert the authorities.
  • A clear and credible rumour-debunking message, the anti-rumour, should be created and immediately seeded into the network through the beacons, who should be instructed to spread the anti-rumour aggressively.
  • Public awareness of how to combat rumours must be created. People should be encouraged to spread anti-rumour messages on a priority basis. Once clear methodologies for this are communicated to the public, if even a small fraction of people decide to take this on as a civic duty, rumours can be contained at all levels (local, regional and national).

For those who have been observing platforms like Twitter, which, in India at least, appear to be very susceptible to special-interest trends that are obviously being floated by particular groups, it is not hard to imagine that you do not need a very large number of actors within the network acting in concert for a message to spread widely. In our view, it is possible to use this phenomenon to combat rumours, to fight fire with fire, as it were.

 

Event announcement: The invisible women of Indian science

The invisible women of Indian science
A public lecture and a chance to talk

Venue: Lecture Hall Complex Room 121, IIT Campus, Hauz Khas

Date: Thursday, 20th October 2016

Schedule: Talk at 5:30PM followed by Open Forum for women scientists at 6:30PM.

Talk Title: Lab-hopping to tell the stories of India’s women researchers

Speakers: Aashima Dogra and Nandita Jayaraj

Abstract:
The Life of Science (TLoS) is a feminist research and media project by two science reporters on a mission to tell the stories of the ‘invisible’ women of Indian research. Why do thousands of women drop out of their academic careers every year? TLoS attempts to answer questions like this by telling the stories of the women who have defied the odds. Our reports are based on conversations with female researchers across the country about their lives, their research and the challenges they’ve faced. They are published every Monday on http://www.thelifeofscience.com. Already, there are hints of trends and patterns emerging, but we’ll have a more complete picture once we’ve covered the science being done in regions of India that the traditional media rarely ventures into. In this talk, we’ll share our experience, and we look forward to exchanging ideas with the audience about the way science is done in India.

Bios:
Aashima Dogra and Nandita Jayaraj started The Life of Science in February 2016 after quitting their jobs as editors of Brainwave, a science magazine for kids. Nandita is a freelance science writer who worked with The Hindu after studying at the Asian College of Journalism. Aashima learnt to be a science communicator at the University of Warwick and has written science stories for The Asian Age and managed the editorial at Mars One.

Open forum for women scientists:
A public space where women involved in research, or contemplating a career in research, can talk about the issues that affect them. This event will be moderated by Dr Neetu Singh, Centre for Biomedical Engineering, IIT Delhi. Aashima Dogra and Nandita Jayaraj will join us for this session.

Acknowledgments: Vipula and Mahesh Chaturvedi Chair for Policy Studies and Department of Humanities and Social Sciences, IIT Delhi.

Download pdf poster here.

A few questions on the draft Geospatial Information Regulation Bill

Blog post by: Amitabha Bagchi

The Ministry of Home Affairs of the Government of India has posted a draft bill aimed at regulating the acquisition and use of geospatial information pertaining to India. This draft can be viewed here.

In brief, the provisions of this bill make it illegal to acquire, and even to maintain previously acquired, Indian geospatial data without applying for and receiving a licence from an authority that is to be created for this purpose. Media reports have tended to focus on the aspect of the bill that talks about heavy penalties for misrepresenting the boundaries of India, but let us instead focus on the more important aspects that pertain to the data ecosystem. Some questions:

1. What happens if the data needs an update? Map information keeps changing. We aren’t talking about roads and buildings, since they remain relatively stable, but since the draft bill also covers “value addition”, it will also include other kinds of information that change faster. Consider the case of your favourite restaurant discovery app: will it have to apply for a new licence every time a new restaurant opens in Hauz Khas village? Effectively it will have to, since the draft bill proposes that only data bearing the watermark of the vetting authority be used for display. Changing the name of a restaurant in such data would amount to tampering with watermarked data. This sounds bad. Not propagating updates until security clearance is granted may break the business model of companies premised on providing up-to-date information. The bill promises a three-month turnaround on all clearances. This might not be quick enough, even if it were feasible, which leads us to the next question.

2. Do we have the bandwidth to handle all applications for this usage inside and outside India? Someone somewhere may have an estimate of how many different non-governmental services inside and outside India are currently using Indian geospatial data. I don’t have such an estimate, but I would say this number is huge. Add to this all those 17-year-old kids dreaming of startup glory who are mashing up Google maps in their soon-to-be-world-dominating apps. A government regulator that is yet to be set up will need hundreds of GIS experts who can “vet” TBs of data from each applicant. The logistics of getting this data across to the vetting authority alone boggles the mind, forget about the logistics of hiring and training these hundreds of experts. Unless this bill, on becoming an act, manages to single-handedly kill the innovation ecosystem that depends on geospatial data, the number of requests will keep going up. And all these people will be “acquiring” and wanting to propagate updates (see #1 above). Which further leads us to the next question.

3. Does every single end-user of such data also need a licence? Large organisations like Google, which acquire geospatial data and make it available through their APIs, are in some sense at the lowest level of an application stack that could potentially have several layers (and probably already does). Application A buys a service that uses geospatial data from application B, which has in turn bought it from provider C, which has licensed it from organisation D. Or, in a more complex turn of events, app A mashes up data from services B, C and D, which in turn have bought their data from E, F and G, and, guess what, F and G have some kind of data-sharing agreement. How will A get its data acquisition vetted? The complexity of the ecosystem and the trajectories such data can take are limited only by the imagination of developers and service creators working on different kinds of problems in a host of different sectors. Unravelling this complexity will further burden the vetting authority (see #2 above).

An alternative modality that can serve national security purposes would be to require all users of geospatial data to register with the security vetting authority and to provide an online window through which the authority can conduct an audit of their data. The vetting authority can go through the data and raise an objection if it finds anything objectionable, and it can do this in its own time. In the meantime, the data can be used and updated by the business as required. In other words, the onus has to be on the vetting authority to regularly check that the data is in order, rather than on the service. By shifting the onus onto the service we run the risk of creating a significant roadblock for a major part of the innovation ecosystem. This is undesirable.

Added 10 May 2016:

  • An extended version of this blog entry appeared in The Hindu dated 10 May 2016. Read that piece here.
  • Another view on this bill that contextualises it in terms of existing mapping laws can be read here.
  • A more detailed exposition of the dangers this bill poses to flagship government projects like Digital India can be read here.

The art (and science) of structuring political opinions

Blog post by: Maya Ramanath

A big trend in data management research is to “structurize” unstructured data. One of the main sources of unstructured data is (online) text. It’s all around us. News, QA forums, scientific content, social media, enterprise websites, government data, etc. are all available at a click. The content comprises facts, opinions, analysis, commentary and also spam. One of the grand challenges in AI is to organize this data in such a way as to make it machine-readable, i.e., to allow machines to “know” things. While we are still a long way from machines that can understand subtlety, we can certainly look at machines that “understand” facts, that is, statements which are either true or false.

The aim of this post is to give a very high-level overview of identifying and organizing a very specific kind of content: political opinions. We are used to politicians giving long speeches explaining their opinions about everything under the sun to us in detail. Speeches filled with ambiguous statements, sometimes contradictory statements, oversimplifications, rhetoric, sarcasm, incitement. What exactly do we mean by organising this landscape of opinions, and how do we even start? We will start in the simplest way possible and add layers of complexity to it as we go along.

When we talk about an “opinion”, it is immediately obvious that there is someone who holds this opinion: the “opinion holder”. Second, there has to be a topic on which the opinion is held; we’ll just call it the “topic”. Third, we will simplify what we mean by “opinion”. In our first attempt, an “opinion” has only two polarities: pro or con, support or oppose. So, someone is pro something, or con something. “Narendra Modi supports Beti-Bachao campaign”, “Jayalalithaa opposes fuel price hike”, “Kejriwal opposes land bill” — all statements of opinion, each consisting of an opinion holder (Narendra Modi, Jayalalithaa, Kejriwal), a topic (Beti-Bachao campaign, fuel price hike, land bill) and a polarity. These statements, all of which could easily appear in news articles, fit neatly into the nice triple structure of <opinion holder, polarity, topic>. So, we have news articles as our textual sources, and the goal is to extract these kinds of opinion triples.

While automatically acquiring a large number of such triples is already quite challenging, we have more things to think about. First, the opinions we have acquired so far are structured, but they lack “uniformity”. For example, we need to figure out that “Arvind Kejriwal” and “Kejriwal” both refer to the same person. Similarly, “land bill” and “land acquisition bill of 2015” both refer to the same bill. This process is referred to as canonicalisation, i.e., assigning canonical names to opinion holders and topics. Next, we need to differentiate between a “land bill” that refers to the “land acquisition bill of 2015” and one that refers to the “land acquisition act of 1984”. This process is known as disambiguation. Canonicalisation and disambiguation are closely related and challenging problems in their own right. However, once we are able to provide reasonable solutions to these problems, we have an “opinion-base” where we can ask questions such as “who all oppose the land bill?” and “who all oppose the land bill, but support eminent domain?”

One of the major pieces missing from our opinion-base is time. We all know that politicians are famous for holding different opinions on the same topic at different times. So, it would be helpful to know when these opinions were expressed. Instead of triples, we could have quadruples: <opinion holder, polarity, topic, time>. Unfortunately, associating time with opinions is non-trivial. It increases the complexity of extraction. Instead of just extracting triples from sentences such as “Kejriwal opposes land bill”, we need to work out when this piece of news was reported. The easy case would be if we had a news article with this headline: we could simply look at when the article was published and associate that time with this opinion. However, if the article was reporting this opinion at some later date, perhaps as part of an opinion piece (“Kejriwal expressed his opposition to the land bill in July 2015 and continued to do so….”), then we need to identify that the time of interest is July 2015, not the current date or even the date on which the article was published. Just to make it a bit more complicated, instead of just a particular point in time, we may have to identify a time range; after all, the United States was opposed to our nuclear program until they supported it. As expected, identifying the times at which opinions are valid is also a challenging task. But once we find a reasonable solution to the problem, we have an opinion-base where we can ask questions such as “who all opposed the land bill but later changed their stance?” (in effect, we can identify the flip-floppers in our political system; perhaps this is everyone!!).
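As a concrete (if entirely made-up) illustration of what such an opinion-base of quadruples and a “flip-flopper” query might look like, here is a small Python sketch; the entries, names and dates below are invented purely for illustration.

```python
# Hedged sketch: a toy opinion-base of <holder, polarity, topic, time> quadruples.
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Opinion:
    holder: str    # canonical name of the opinion holder
    polarity: str  # "pro" or "con"
    topic: str     # canonical name of the topic
    when: date     # time at which the opinion was expressed

opinion_base = [
    # Invented entries, purely for illustration.
    Opinion("Arvind Kejriwal", "con", "land acquisition bill of 2015", date(2015, 7, 1)),
    Opinion("Arvind Kejriwal", "pro", "land acquisition bill of 2015", date(2016, 2, 1)),
    Opinion("Jayalalithaa", "con", "fuel price hike", date(2015, 9, 10)),
]

def flip_floppers(base):
    """Return (holder, topic) pairs whose polarity changed over time."""
    latest = {}
    flips = set()
    for op in sorted(base, key=lambda o: o.when):
        prev = latest.get((op.holder, op.topic))
        if prev is not None and prev != op.polarity:
            flips.add((op.holder, op.topic))
        latest[(op.holder, op.topic)] = op.polarity
    return flips

print(flip_floppers(opinion_base))  # {('Arvind Kejriwal', 'land acquisition bill of 2015')}
```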

It’s great that, so far, we have been able to extract these crisp structured opinions. However, we have completely eliminated any kind of subtlety and context from this process! Subtlety might be tough, but context should be easy. We know that the United States opposed our nuclear program before the nuclear tests, but now they support it. Having the two facts <United States, opposes, Indian nuclear program, 1997> and <United States, supports, Indian nuclear program, 2000> seems quite inexplicable, unless we also provide context. This is quite easily done. Just add another column pointing to the source from which we extracted this information. So we now have: <opinion holder, polarity, topic, time, context>. The context could simply be the article itself from which this opinion was extracted. Anyone making use of the opinion-base can easily trace the origins for themselves.

Let’s now add a meta-level complexity: why should someone trust whatever we have extracted into our opinion-base? That is, does it matter where we acquired our opinions from? Of course it does. There are serious newspapers and there are tabloids. We can still (hopefully) trust what the serious newspapers tell us, but we all know there can sometimes be conflicting reports and subsequent denials, all depending on all kinds of interpretations of the exact words that were spoken (politicians are, of course, experts at this!). Things are now messy. We are trying to determine the “truth” of the facts in our opinion-base. So, let’s simplify the problem of trust a bit. For our opinion-base, we will rely on news sources and will associate a “trust” factor with every fact. So, did Kejriwal really oppose the land bill? Did Jayalalithaa really oppose the fuel price hike? Well, the best we can do (given that we are relying on textual sources) is to believe something if it came from a large number of reputable sources. So, our “trust” factor is a combination of the number of times we came across this report and a measure of how “reputable” each source is. So, we now have: <opinion holder, polarity, topic, time, context, trust factor>. And associated with each trust factor are the details: which sources reported this and how reputable we consider them.
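One very simple way to realise such a trust factor (again, an illustrative sketch of my own, not a method from this post or the papers listed below) is to combine the number of independent reports with a per-source reputability weight:

```python
# Hedged sketch: a toy trust score; source names and weights are hypothetical.
reputability = {"serious-daily.example": 0.9, "tabloid.example": 0.3}

def trust(reporting_sources):
    """Combine how often a fact was reported with how reputable the reporters are."""
    if not reporting_sources:
        return 0.0
    avg_rep = sum(reputability.get(s, 0.5) for s in reporting_sources) / len(reporting_sources)
    volume = min(1.0, len(reporting_sources) / 10.0)  # saturates after ~10 independent reports
    return avg_rep * volume

print(trust(["serious-daily.example", "serious-daily.example", "tabloid.example"]))  # ~0.21
```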

The problem is still not solved, though. What about conflicting reports? How do we reconcile them? Do we just leave them as is, acquire both opinions, point to their sources, and throw up our hands? Or do we aim to have a fully consistent opinion-base? After all, one of the advantages of having a consistent opinion-base is that we can perform reasoning and prediction on top of it. For example, we may be able to predict how a certain politician will vote on a specific bill, given their pattern of support for other, older bills. These are all complex questions for which we do not yet have excellent answers. Even without all the answers, we can certainly use opinion-bases in a number of applications. Apart from answering some interesting questions about our politicians and their stances, we can study, for example, bias in our news channels. Organize our topics into groups (economy, sports, cultural, etc.), and study how news channels report on these topics and when. Are they mainly supportive of the economy during a Congress government or a BJP government, or are they roughly even? Are they supportive of our cricket team when they lose under Dhoni, or are they stringent in their criticism? We could also study complex interconnections among politicians. Just as we built an opinion-base, we can think of building a knowledge-base of “political relationships”, i.e., who gave donations to whom and when, who gave a speech in support of whom and when, who was made a minister, etc., and connect these to their opinions on the government, bills and issues. There are a number of interesting applications to build.

In summary, opinion mining as a whole is a vast area of research, of which structurizing political opinions is a very small part. In this post, we didn’t even consider questions like “how exactly do we extract?” (that question is addressed by another vast area of research called information extraction). We didn’t consider even tougher problems; for example, we paid no attention to the strength of opinions (only two polarities!) or to topics and subtopics (we could have people “supporting India”, but “India” is too broad a topic to be useful), and there are many more issues to consider in just this narrow (compared to the field of opinion mining) topic. We do some work in our group related to opinion mining. If you are interested in a project, do contact us!

Recommended reading:

General surveys on opinion mining as a whole:
“Opinion mining and sentiment analysis”, Bo Pang and Lillian Lee
Foundations and Trends in Information Retrieval 2(1-2), pp. 1–135, 2008.

Sentiment Analysis and Opinion Mining, Bing Liu
Morgan & Claypool Publishers, May 2012.

Specifically on structurizing opinions:
“Harmony and Dissonance: Organizing the People’s Voices on Political Controversies”, Rawia Awadallah, Maya Ramanath and Gerhard Weikum
Proc. of the ACM Conf. on Web Search and Data Mining (WSDM), 2012.

Grace Hopper Celebration of Women in Computing (India) starts tomorrow

The people over at The Ladies Finger have published a very illuminating interview with Geetha Kannan, Managing Director of the Anita Borg Institute India. For those not in the know, ABI organises the Grace Hopper Celebration of Women in Computing. The Indian edition of the conference runs from 2nd to 4th December this year in Bangalore. This link promises livestreams of the keynote talks.

In the interview linked above, Kannan offers the following insight:

For us community means everything – our standing in society is linked to our self-respect. So sometimes you’ll see that women – these are anecdotal stories – may enjoy their careers, their partners may be fine with it, their in-laws may be fine with it, but there’s so much community pressure that they have to give it up

and follows this up with another deep thought:

Lessons on dealing with that kind of social pressure cannot be learned from the West, and that’s been one of the biggest eye-openers for us.

Something we all need to think about: Tapping into the intellectual resources of the underrepresented 50% can be a great boost for our discipline. Not to mention that it will also help make the world a better, more just place.

Late breaking: We have been reliably informed that DAIR’s own PhD student, Prachi Jain, is attending the GH conference in Bangalore. We will be pestering her for a report on it once she’s back.