Lynn Cherny, Visiting Knight Chair
JMM, SoC, University of Miami
@arnicas / firstname.lastname@example.org
Sept 25, 2015
This was a Google search, but also recommended: http://bigdatapix.tumblr.com/
Men in front of walls of big data.
Men in front of wall-sized networks.
A search for #BigData on Twitter is pretty gross...
And my spam folder looks similar.
Big data is like teenage sex: Everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it.
--Dan Ariely of Duke University
Mike Driscoll: https://www.quora.com/How-much-data-is-Big-Data
A server issue in Virginia is affecting most of the northeast, disrupting the infrastructure for many popular products and services including Netflix, Product Hunt, Medium, SocialFlow, Buffer, GroupMe, Pocket, Viber, Amazon Echo and more.
It’s certainly not the first time AWS has taken much of the Internet out with it. In 2013, AWS suffered a similar outage that took services like Instagram, Airbnb and Vine offline. According to Buzzfeed, that’s a loss of about $1,100 per second for Amazon.
The nightmare that is Java Hadoop...
This is "hello world" word count.
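For contrast, a minimal sketch of the same "hello world" word count in plain Python, with map and reduce steps on a single machine. This is not the Hadoop code from the slide; the function names (map_words, reduce_counts, word_count) and the sample lines are made up for illustration.

```python
from collections import Counter
import re


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one line."""
    return [(word, 1) for word in re.findall(r"[a-z']+", line.lower())]


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return totals


def word_count(lines):
    """Run the map step over every line, then reduce all pairs."""
    pairs = (pair for line in lines for pair in map_words(line))
    return reduce_counts(pairs)


if __name__ == "__main__":
    # Illustrative input only, not data from the talk.
    sample = ["Big data is big", "Data about data"]
    print(word_count(sample).most_common(2))
```

On one laptop this is a few lines; the Hadoop version needs the extra ceremony because it distributes the same two steps across many machines.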
Crossfilter / dc.js dashboard I built...
(Sometimes true for medium data, too!)
Visualization of daily Wikipedia edits created by IBM. At multiple terabytes in size, the text and images of Wikipedia are an example of big data.
Demo by Mike Bostock
Big Data and whole data are not the same. Without taking into account the sample of a data set, the size of the data set is meaningless. For example, a researcher may seek to understand the topical frequency of tweets, yet if Twitter removes all tweets that contain problematic words or content – such as references to pornography or spam – from the stream, the topical frequency would be inaccurate. Regardless of the number of tweets, it is not a representative sample as the data is skewed from the beginning.
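The tweet example above can be sketched with toy numbers. Everything here is invented for illustration: the ten tweets, the topics, the blocked word, and the platform_filter function are hypothetical stand-ins for whatever a real platform removes before researchers see the stream.

```python
from collections import Counter


def topic_shares(tweets):
    """Return each topic's share of the given (topic, text) tweets."""
    counts = Counter(topic for topic, _ in tweets)
    total = sum(counts.values())
    return {topic: count / total for topic, count in counts.items()}


def platform_filter(tweets, blocked_words):
    """Drop any tweet whose text contains a blocked word (hypothetical filter)."""
    return [
        (topic, text)
        for topic, text in tweets
        if not any(word in blocked_words for word in text.lower().split())
    ]


# Made-up "whole" data: 10 tweets, 4 of which are spam.
ALL_TWEETS = (
    [("spam", "win free pills now")] * 4
    + [("sports", "great match tonight")] * 3
    + [("politics", "debate was long")] * 3
)

if __name__ == "__main__":
    visible = platform_filter(ALL_TWEETS, {"pills"})
    print("whole data:  ", topic_shares(ALL_TWEETS))
    print("what you get:", topic_shares(visible))
```

In the whole data, spam is the largest topic; in the filtered stream it does not exist and the other two topics each look like half of the conversation. Collecting more of the filtered stream would not fix that.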
d. boyd and K. Crawford, "Critical Questions for Big Data"
... four quantitatively adept social scientists reported that Google’s flu-tracking service not only wildly overestimated the number of flu cases in the United States in the 2012-13 flu season — a well-known miss — but has also consistently overshot in the last few years. Google Flu Trends’ estimate for the 2011-12 flu season was more than 50 percent higher than the cases reported by the Centers for Disease Control and Prevention. ...Their technical criticism of Google Flu Trends is that it is not using a broader array of data analysis tools. Indeed, their analysis shows that combining Google Flu Trends with C.D.C. data, and applying a few tweaking techniques, works best.
Big Data Hubris? Or "not invented here"...
And while there is an enormous structural power asymmetry between the surveillers and surveilled, neither are those with the greatest power free from being haunted by a very particular kind of data anxiety: that no matter how much data they have, it is always incomplete, and the sheer volume can overwhelm the critical signals in a fog of possible correlations.
[ok, but for some problems]
The first lesson of Web-scale learning is to use available large-scale data rather than hoping for annotated data that isn’t available. For instance, we find that useful semantic relationships can be automatically learned from the statistics of search queries and the corresponding results -- or from the accumulated evidence of Web-based text patterns and formatted tables -- in both cases without needing any manually annotated data.
This type is primarily concerned with making sense of data or working with it in a fairly static way. The Type A Data Scientist is very similar to a statistician (and may be one) but knows all the practical details of working with data that aren’t taught in the statistics curriculum: data cleaning, methods for dealing with very large data sets, visualization, deep knowledge of a particular domain, writing well about data, and so on.
Type B Data Scientists share some statistical background with Type A, but they are also very strong coders and may be trained software engineers. The Type B Data Scientist is mainly interested in using data “in production.” They build models which interact with users, often serving recommendations (products, people you may know, ads, movies, search results).
We were wrong. We found that phone metadata is unambiguously sensitive, even in a small population and over a short time window. We were able to infer medical conditions, firearm ownership, and more, using solely phone metadata.
Foursquare check-ins: your secrets aren't safe.
Uber had just told all its users that if they were having an affair, it knew about it. Rides to Planned Parenthood? Regular rides to a cancer hospital? Interviews at a rival company? Uber knows about them, too.
Applicants should have evidence of distinction in research and teaching in humanities computing and large-scale quantitative or big data analysis of cultural archives. Applicants should have expertise in algorithmic development, data culture, data mining, and quantitative analysis of visual or textual materials. Specific domain application could be in any number of areas, from historical text archives to game databases. Applicants should be able to teach courses on topics related to humanities computing and big data for one or more of the following departments: English, Science and Technology Studies, Cinema and Digital Media.
The tools are still painful. So are the jobs.
Applications of "big data" modeling are sometimes invasively accurate and sometimes embarrassingly, dangerously off-base.
But none of this is going away.