Episode one of Season Two of Raw Data asks whether big data is a big sham. The podcast checks in with Stanford-alumni-funded big data startup Quantified, our Hewlett-funded colleagues at Berkeley’s CLTC, and EFF’s Executive Director, Cindy Cohn, on whether we should worry about all the data being collected on us.
One reason people worry about big data is finance: financial analysts were the first to take the "big data" of every position and every transaction in the stock market and analyze it to try to find predictive factors. Some succeeded at this initially but later failed (others, like the incredibly weird Medallion fund, have been successful for 20+ years). The lesson for them, and for anyone who uses big data as a predictive tool, is that the financial system has shown repeatedly that even the biggest data is insufficiently predictive if the system is either (a) truly chaotic, in the sense that we simply don't have the sophistication or computing power to model it, or (b) subject to external fraudulent influence, in the sense that if causes outside our model are acting on it to produce effects inside it, we will never identify them, because the data we're receiving aren't clean. Whether consumer purchasing decisions or, in one of CLTC's scenarios, emotions are truly chaotic is an open question. On the individual scale, they very likely are; in aggregate, less so.
What will drive the downfall of the big data economy is (b), the latest instantiation of which is fake news on Facebook. Facebook tracks your behavior in myriad ways across its platform and assumes that it is capturing your true behavior: that you click on photos and stories you’re interested in, that you spend more time on the pages and posts of people you care about, and that when you look at a product, you’re considering purchasing it. In aggregate, absent external influence, this holds, but when Facebook also allows, through its platform, the widespread dissemination of content meant to drive false engagement, it sullies its own data stream. This is why Facebook got in trouble with its customers—advertisers—when it misrepresented how many videos were being watched on its platform: it set videos to auto-play as they emerged in the scroll of the newsfeed and then counted any video that played for at least three seconds as a view. Only a viewer determined not to see a video would manage to stop it or scroll past it in under three seconds, making it seem as though Facebook users were far more eager to engage with videos than they actually were. Similarly, a story that is flagrantly false may grab attention, but it doesn’t provide useful data to Facebook: I’m not “interested” in fires in any sense that is useful to an advertiser, but when I’m in a theater and someone yells “fire,” I’m going to look. This is the kind of speech that incites mass confusion, if not hysteria, and Facebook has a duty to act against malicious misinformation.
Do you think a future like one of the CLTC’s scenarios is plausible? Is the EFF’s work to combat surveillance about to become personally relevant to many, many more of us? We’ll have more to say in the second episode of this season of Raw Data.
Listen on iTunes, Soundcloud, or at http://worldview.stanford.edu/raw-data