
Raw Data Podcast, Episode One: Big Data and Inferences

We’re excited to introduce Raw Data, a podcast we’re co-sponsoring with Worldview Stanford! It’s about the datasets and cyber technologies that affect our lives, and the first episode is about our digital behavior and the inferences that can be drawn from it. Humans can be quite bad at making inferences when they haven’t been trained to do so. As an example, here’s a simple logic puzzle that requires you to make an inference:

Jack is looking at Anne, but Anne is looking at George. Jack is married, but George is not. Is a married person looking at an unmarried person?

    A) Yes
    B) No
    C) Cannot be determined

More than 80% of respondents choose C, but the correct answer is A: Anne is either married or unmarried, and in either case a married person is looking at an unmarried one. (For the full explanation, here’s the article. If you think that question was unfair, here are three more; see whether you get them right.)
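If you’d rather see that case analysis checked mechanically, here’s a minimal sketch in Python that enumerates Anne’s two possible states (the setup is ours, for illustration only):

```python
# Jack is married, George is not, and Anne's status is unknown.
# Question: in every possible world, is some married person
# looking at an unmarried person?
looking_at = [("Jack", "Anne"), ("Anne", "George")]

def married_looks_at_unmarried(married):
    return any(married[a] and not married[b] for a, b in looking_at)

# Enumerate both possibilities for Anne's marital status.
results = [
    married_looks_at_unmarried({"Jack": True, "George": False, "Anne": anne})
    for anne in (True, False)
]
print(results)  # [True, True] -> "Yes" on both branches, so the answer is A
```

Because the answer is “Yes” on both branches, you can answer A without ever learning Anne’s actual status; that’s the inference most people miss.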

Why does it matter if I'm bad at inferences, if everyone else is, too?

The reason this matters for data is that we rarely consider the inferences that those who view our data can make. Privacy policies often treat certain types of information as more or less private; my telephone number may be less private, my bank account number more so. But sometimes combining two less-private pieces of information can yield, with a high degree of certainty, a third piece of information that I would consider more private, or that I didn’t know I was providing at all. For example, while occupation and zip code may not seem like personally identifying information, together they may be uniquely identifying in a rural area or for an uncommon job. You may be well aware of this if you are, say, the only veterinarian in a remote county, but you may not be aware of the inferences that can be made from an often public source of data: your Facebook likes.
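To make that concrete, here’s a toy sketch (the population and fields are invented) showing how two individually innocuous attributes can single out one person:

```python
# Toy population: each record is (name, zip_code, occupation).
# Neither field alone identifies anyone, but the combination can.
people = [
    ("Alice", "83201", "teacher"),
    ("Bob",   "83201", "veterinarian"),
    ("Carol", "83201", "teacher"),
    ("Dave",  "94305", "veterinarian"),
]

def matches(zip_code, occupation):
    return [name for name, z, o in people if z == zip_code and o == occupation]

print(matches("83201", "veterinarian"))  # ['Bob'] -- uniquely identified
print(matches("83201", "teacher"))       # ['Alice', 'Carol'] -- still ambiguous
```

This is the same effect that makes “anonymized” datasets re-identifiable: each released field shrinks the set of people who match.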

Stanford Business School psychologist and researcher Michal Kosinski has demonstrated that traits like sexuality, intelligence, and parents’ marital status can be inferred with reasonable certainty from the fact that you clicked “like” on certain Facebook pages or groups. (Yes, he can tell if your parents are divorced: in his data set, to choose two examples, those who liked groups such as “Never apologize for what you feel. It’s like saying sorry for being real” tended to have divorced parents, while those who liked “The Joy of Painting with Bob Ross” tended to have parents who were still together.) When you clicked those like buttons, you probably weren’t thinking about providing Facebook with your parents’ marital status. Maybe you don’t want Facebook to have that information; if so, there’s not much you can do, other than like fewer pages, or like pages you don’t, in fact, like, in an effort to add noise to your data.
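Kosinski’s actual models are more sophisticated (and trained on millions of users), but the underlying idea can be sketched as a simple classifier over a binary matrix of likes. Everything below, from the page columns to the labels on users, is fabricated for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Rows are users, columns are pages; 1 means the user liked the page.
# Columns: [never_apologize, bob_ross, curly_fries]
likes = np.array([
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 1],
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
])
# Hypothetical labels: 1 = parents divorced, 0 = still together.
parents_divorced = np.array([1, 1, 0, 0, 1, 0])

model = LogisticRegression().fit(likes, parents_divorced)

# A new user who liked only the Bob Ross page:
new_user = np.array([[0, 1, 0]])
print(model.predict_proba(new_user)[0, 1])  # estimated P(parents divorced)
```

Each like is weak evidence on its own; the model’s power comes from aggregating hundreds of them.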

What if these inferences from my data are painting a picture of me that I don't like?

Neither Facebook nor Kosinski is always correct in making these inferences, though, and sometimes it’s hard to decide which is worse: do I want Facebook to think I’m conservative when I’m actually liberal, or do I want it to know the truth? Google will actually show you its profile of you, if you have ads based on your interests turned on. When this feature was discovered, many users chuckled that Google thought they were middle-aged men when they weren’t, for example, but few actually attempted to set Google straight. After all, the only benefit to you is more precisely targeted ads, which are not often useful.

But what about the government’s inferences about you? After all, we know the government collects data about us, including metadata, and some of that metadata can be just as revealing as Facebook likes and Google searches. As lawyer and PhD student Jonathan Mayer has shown, it’s not difficult to infer from call records when someone purchased a gun or had an abortion. What inferences are being made from your metadata? And what could you do, or would you ever know, if those inferences are wrong? Because of the power of these inferences, many organizations that collect data don’t know everything that’s in it. This means that data is often made public when aspects of it should be private, and that companies amass more data than they know what to do with, giving them more information about you than you’ve knowingly provided. This data collection produces widely publicized inferences, such as when Target sent maternity-focused coupons to a young woman living at home whose parents didn’t know she was pregnant, and inferences that end up in court, as when cell phone metadata is subpoenaed to suggest that a person was nearby when a crime was committed.
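Mayer’s study worked, roughly, by matching participants’ call records against public directories of known numbers. Here’s a toy version of that idea; the phone numbers and log entries below are entirely made up:

```python
# Invented directory of sensitive numbers and an invented call log.
sensitive_numbers = {
    "555-0142": "firearms dealer",
    "555-0199": "family planning clinic",
}

call_log = [
    ("2016-03-01", "555-0142", 180),  # (date, number called, seconds)
    ("2016-03-02", "555-0123", 45),
    ("2016-03-05", "555-0199", 600),
]

# Metadata alone (who called whom, when, for how long) suggests
# what the calls were about, with no access to their content.
for date, number, seconds in call_log:
    if number in sensitive_numbers:
        print(f"{date}: {seconds}s call to a {sensitive_numbers[number]}")
```

Notice that nothing here requires listening to a single call; the pattern of contacts is the disclosure.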

At the very least, we should be aware that inferences are being made about us, inferences that we’re bad at predicting ahead of time. In the future, cyber technologies may alert us to these inferences when we provide information; for now, know that liking Facebook pages called “Curly Fries” or “Sephora” reveals more about you than you realize.

Read More:

Michal Kosinski's research
Jonathan Mayer's research
