Big Data and Data Science: An Overview

Lynn Cherny, Ph.D.

EM-Lyon Marketing & Innovation

Feb 2017

http://ghostweather.slides.com/lynncherny/what-is-big-data-anyway-9/live

This was a Google search, but also recommended: http://bigdatapix.tumblr.com/

Genres

Men in front of walls of big data.

In suits.

Men in front of wall-sized networks.

Genres

Word clouds

Elephants

A search for #BigData on Twitter is pretty gross...

And my spam folder looks similar.

http://demo.relato.io/oreilly

Big data is like teenage sex: Everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it.

--Dan Ariely of Duke University

New Scientist

Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, and information privacy.

Wikipedia

"Big data is distributed"

Mike Driscoll: https://www.quora.com/How-much-data-is-Big-Data

The "3 V's" of Big Data

Volume - the amount of data generated by organizations or individuals or websites.
Velocity - the frequency at which data is generated, captured and shared.
Variety - More than rows and columns and numbers; often unstructured text, video, audio.

sometimes also

Veracity - Can you trust it, is it reliable? Evaluating that is hard.

http://www.ibmbigdatahub.com/infographic/four-vs-big-data

"The Cloud"

How Amazon Exposed Its Guts: The History of AWS's EC2

A History of Cloud Computing

1999: Salesforce, "Software as a Service" (SaaS) via web
2002: Amazon Web Services (including Mechanical Turk)
2006: AWS EC2 (cloud-based computing clusters)

AWS Mechanical Turk

Distributed Human Labor

Link

EC2 Instances

https://aws.amazon.com/ec2/instance-types/

example:

Amazon AWS "in plain English"

Source

There are tutorials, but they're kind of tough.

Link

A server issue in Virginia is affecting most of the northeast, disrupting the infrastructure for many popular products and services including Netflix, Product Hunt, Medium, SocialFlow, Buffer, GroupMe, Pocket, Viber, Amazon Echo and more.

It’s certainly not the first time AWS has taken much of the Internet out with it. In 2013, AWS suffered a similar outage that took services like Instagram, Airbnb and Vine offline. According to Buzzfeed, that’s a loss of about $1,100 per second for Amazon.

The Next Web article, Sept 20, 2015

Tools in the Ecosystem:

"Hadoop" and Map/Reduce

Apache Hadoop site

Apparently still relevant...

Twitter recently

Map/Reduce

Google article on MapReduce by Dean and Ghemawat, 2004

MAP

REDUCE

slide from Jeff Patti: http://www.slideshare.net/JeffPatti/map-reducebeyondwordcount

The nightmare that is Java Hadoop...

This is "hello world" word count.

Word Count Tutorial
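For comparison, here is a minimal single-machine sketch of the same word-count idea in plain Python. It only illustrates the map, shuffle, and reduce steps; it is not Hadoop's API, and the function names (map_phase, shuffle, reduce_phase) and the sample lines are assumptions made up for this example.

```python
from collections import defaultdict


def map_phase(line):
    """MAP: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """SHUFFLE: group emitted values by key (a real framework does this for you)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """REDUCE: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical input, invented for illustration.
    sample = ["big data is big", "data is distributed"]
    print(word_count(sample))
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network, which is where much of the ops pain comes from.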

An alternative to Hadoop, Spark with Python

Source github

Essentially, the actual work in Big Data is still hard.

Most Hadoop users complain bitterly about the ops and time delays on their work.
Recent tools like Spark, Impala, etc. look better, but ops (setup, management of the systems) is a giant pain.
Just "moving data around" takes a long time and some serious engineering expertise. (E.g., "The time required to move the data from Amazon S3 to HDFS was about 1 hour and 45 minutes." -- 1 billion reddit comments)

A Day in the Life of a Data Scientist

http://www.gallamine.com/2015/05/a-day-in-life-of-data-scientist-part-1.html

Recent Finance Intern's post

http://basho.com/posts/technical/algorithms-and-stock-tickers-with-apache-spark-my-summer-internship-with-basho/

At this point, I had a working data pipeline in an IPython notebook, but this was not the full project. I still needed to figure out how to make the pipeline fully automated, instead of manually running IPython notebook cells one after the other.

Giant Data Sets Are Around

AWS Data Sets link

Of many types...

Stanford Large Networks Datasets

Applications that use them, too! Via APIs.

hint.fm Web Seer

API: "application programming interface"

GDELT

GDELT Home

BigQuery Demo of GDELT

Link to BQ

Related project with maybe cleaner data: Phoenix

Crossfilter / dc.js dash I built...

What is Data Science?

(and how is it related?)

"The Unreasonable Effectiveness of Data"

by Halevy, Norvig, Pereira at Google in 2009

Reference

[ok, but for some problems]

The first lesson of Web-scale learning is to use available large-scale data rather than hoping for annotated data that isn’t available. For instance, we find that useful semantic relationships can be automatically learned from the statistics of search queries and the corresponding results-- or from the accumulated evidence of Web-based text patterns and formatted tables-- in both cases without needing any manually annotated data.

Halevy, Norvig, Pereira

More Data Means...

It can be noisy (noisier)
Noisy is more realistic
With the right models (and infrastructure), more data substitutes for annotated training data

Reference

Predictive Analytics - a Hot Topic in Data Science

Fraud Detection
Cross-selling (customers who bought this bought that)
Clinical support systems: Patients likely to develop diabetes, etc.
Customer Retention

An area of data mining that deals with extracting information from data and using it to predict trends and behavior patterns.

Source
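The cross-selling idea ("customers who bought this bought that") can be sketched as simple co-purchase counting. This is a toy illustration, not a production recommender: the function names and the baskets are hypothetical, and real systems normalize for item popularity and work at far larger scale.

```python
from collections import Counter
from itertools import combinations


def co_purchase_counts(baskets):
    """Count how often each unordered pair of items appears in the same basket."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) removes duplicates and gives each pair one canonical order
        for a, b in combinations(sorted(set(basket)), 2):
            counts[(a, b)] += 1
    return counts


def also_bought(item, baskets, top_n=3):
    """Rank other items by how often they were bought together with `item`."""
    related = Counter()
    for (a, b), n in co_purchase_counts(baskets).items():
        if a == item:
            related[b] += n
        elif b == item:
            related[a] += n
    return related.most_common(top_n)


if __name__ == "__main__":
    # Hypothetical baskets, invented for illustration only.
    baskets = [
        ["tent", "stove", "lantern"],
        ["tent", "stove"],
        ["tent", "lantern"],
        ["stove", "fuel"],
        ["tent", "stove", "fuel"],
    ]
    print(also_bought("tent", baskets))
```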

Mashable article

WSJ article

DirectMarketing piece

"Metadata"

The Sensitivity of Telephone Metadata

President Obama has emphasized that the NSA is “not looking at content.” “[T]his is just metadata,” Senator Feinstein told reporters.

We were wrong. We found that phone metadata is unambiguously sensitive, even in a small population and over a short time window. We were able to infer medical conditions, firearm ownership, and more, using solely phone metadata.

Uber Affairs, 4Square Checkins, AKA Your Secrets Aren't Safe

Uber had just told all its users that if they were having an affair, it knew about it. Rides to Planned Parenthood? Regular rides to a cancer hospital? Interviews at a rival company? Uber knows about them, too.

"We Can't Trust Uber" (NYT)

"Data Science" Today Encompasses...

http://www.datasciencecentral.com/profiles/blogs/difference-between-machine-learning-data-science-ai-deep-learning

Not just machine learning (or AI or deep learning or NLP)....

data integration
distributed architecture
automating machine learning
data visualization
dashboards and BI
data engineering
deployment in production mode
automated, data-driven decisions

Some good, recent-ish articles about data science jobs

What It's Like to Be on the Data Science Job Market, by Trey Causey
Crushed It! Landing a Data Science Job, by Erin Shellman
Doing Data Science At Twitter, by Robert Chang

Drew Conway's post

The most famous venn diagram of "data science skills"

ask good questions
know constraints
decision making
transparency
criteria that matter
data exploration
communication
data governance
data
deployment
identify political friction
build/buy/outsource
latency at execution

In the Biz bubble:

blue = "biz analytics leader"?

Type A: Analysis

This type is primarily concerned with making sense of data or working with it in a fairly static way. The Type A Data Scientist is very similar to a statistician (and may be one) but knows all the practical details of working with data that aren’t taught in the statistics curriculum: data cleaning, methods for dealing with very large data sets, visualization, deep knowledge of a particular domain, writing well about data, and so on.

Type B: Builders

Type B Data Scientists share some statistical background with Type A, but they are also very strong coders and may be trained software engineers. The Type B Data Scientist is mainly interested in using data “in production.” They build models which interact with users, often serving recommendations (products, people you may know, ads, movies, search results).

Robert Chang's piece on data scientist types

Visualizing Big Data

Aggregation and Approximation

(Sometimes true for medium data, too!)

Paris Rainfall Data (323 years * 12 months)

Visualization of daily Wikipedia edits created by IBM. At multiple terabytes in size, the text and images of Wikipedia are an example of big data.

F. Viegas & M. Wattenberg

The Internet Archive's Map of Book Subjects (8M books, 5K Subjects linked to 2.6M images on flickr) by Mario Klingemann

About It, & The Map Itself

Crossfilter: Fast Multi-Dimensional Filtering for Coordinated Views

Demo by Mike Bostock

Relevant Overviews of Vis for Big Data

Bin-Summarize-Smooth: A Framework for Visualizing Large Data (Hadley Wickham)
imMens: Real-time Visual Querying of Big Data (Stanford - now UW - Vis Group)
"Why Exploring Big Data is Hard and What We Can Do About It", Danyel Fisher's talk at OpenVisConf 2015
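Aggregation before drawing can be sketched in a few lines: bin the data, then summarize each bin, so the chart has one mark per bin rather than one per record. This is a rough illustration of the bin and summarize steps only (no smoothing), not the implementation from any of the references above; the function name and sample points are assumptions.

```python
import math
from collections import defaultdict


def bin_summarize(points, bin_width):
    """Group (x, y) points into fixed-width x bins.

    Returns {left_bin_edge: (count, mean_y)} with bins in ascending order.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x, y in points:
        left_edge = math.floor(x / bin_width) * bin_width
        sums[left_edge] += y
        counts[left_edge] += 1
    return {edge: (counts[edge], sums[edge] / counts[edge]) for edge in sorted(counts)}


if __name__ == "__main__":
    # Hypothetical points, invented for illustration.
    points = [(0, 1.0), (3, 3.0), (10, 4.0), (19, 6.0), (25, 5.0)]
    for edge, (count, mean_y) in bin_summarize(points, 10).items():
        print(edge, count, mean_y)
```

However many raw points there are, the output size depends only on the number of bins, which is what makes this approach workable for large data.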

Some Financial Visualizations

https://www.bloomberg.com/graphics/2015-spoofing/

How Do People Play The Trading Game: https://www.bloomberg.com/features/2015-stock-chart-trading-game/analysis/

"Playing the Trading Game"

How ETFs Work

https://www.bloomberg.com/features/2016-etf-files/toy/

Year in Graphics at Bloomberg: https://www.bloomberg.com/graphics/2016-in-graphics/

A few more financial "big data" apps and demos...

Neural Nets & Finance

A Survey of Deep Learning Techniques Applied to Trading:

https://www.linkedin.com/pulse/survey-deep-learning-techniques-applied-trading-james-melenkevitz-phd

Example Demo from Google Datalab Cloud Services

TensorFlow Machine Learning with Financial Data on Google Cloud Platform.ipynb

sourced five years of financial time-series data,
munged that data into a more suitable form,
explored and visualized that data,
decided on a machine learning model and the features for that model,
built a binary classifier in TensorFlow and analyzed its performance,
built a feed-forward neural net with two hidden layers in TensorFlow and analyzed its performance.

A less "interny" application of Spark

for finance, fyi :)

http://blog.cloudera.com/blog/2014/07/estimating-financial-risk-with-apache-spark/

Estimating Financial Risk with Apache Spark

Quantopian

Demo in my acct

lots of tutorials and community code... https://www.quantopian.com/posts/quantopian-tutorial-series

Lectures List

their head data scientist: Thomas Wiecki @twiecki

Dilbert, July 29, 2012

Critical Discussion

Big Data and whole data are not the same. Without taking into account the sample of a data set, the size of the data set is meaningless. For example, a researcher may seek to understand the topical frequency of tweets, yet if Twitter removes all tweets that contain problematic words or content – such as references to pornography or spam – from the stream, the topical frequency would be inaccurate. Regardless of the number of tweets, it is not a representative sample as the data is skewed from the beginning.

d. boyd and K. Crawford, "Critical Questions for Big Data"
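The sampling point can be made concrete with a toy calculation. In this sketch a hypothetical platform removes one topic before publishing its stream, and the measured topic shares shift even though the remaining sample is still large. The topics, counts, and function names are invented for illustration.

```python
from collections import Counter


def topic_shares(tweets):
    """Return each topic's share of the tweets, as fractions that sum to 1."""
    counts = Counter(topic for topic, _text in tweets)
    total = sum(counts.values())
    return {topic: n / total for topic, n in counts.items()}


def platform_filter(tweets, blocked_topics):
    """Simulate a platform dropping some topics before the stream is published."""
    return [tweet for tweet in tweets if tweet[0] not in blocked_topics]


if __name__ == "__main__":
    # Hypothetical data: 50 sports, 30 politics, 20 spam tweets.
    tweets = (
        [("sports", "...")] * 50
        + [("politics", "...")] * 30
        + [("spam", "...")] * 20
    )
    print("all tweets:     ", topic_shares(tweets))
    print("filtered stream:", topic_shares(platform_filter(tweets, {"spam"})))
```

A researcher who sees only the filtered stream would put sports at 62.5% of tweets rather than 50%, and collecting more of the filtered stream would not correct that.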

Google Flu Trends: The Limits of Big Data (NYT)

... four quantitatively adept social scientists reported that Google’s flu-tracking service not only wildly overestimated the number of flu cases in the United States in the 2012-13 flu season — a well-known miss — but has also consistently overshot in the last few years. Google Flu Trends’ estimate for the 2011-12 flu season was more than 50 percent higher than the cases reported by the Centers for Disease Control and Prevention. ...Their technical criticism of Google Flu Trends is that it is not using a broader array of data analysis tools. Indeed, their analysis shows that combining Google Flu Trends with C.D.C. data, and applying a few tweaking techniques, works best.

Lazer, David, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. “The Parable of Google Flu: Traps in Big Data Analysis.” Science 343 (14 March): 1203-1205.

Big Data Hubris? Or "not invented here"...

"The Anxieties of Big Data": Surveillance and the Surveilled

Kate Crawford, New Inquiry

And while there is an enormous structural power asymmetry between the surveillers and surveilled, neither are those with the greatest power free from being haunted by a very particular kind of data anxiety: that no matter how much data they have, it is always incomplete, and the sheer volume can overwhelm the critical signals in a fog of possible correlations.

"Surveillance Capitalism"

In this article I explore the proposition that ‘big data’ is above all the foundational component in a deeply intentional and highly consequential new logic of accumulation that I call surveillance capitalism. This new form of information capitalism aims to predict and modify human behavior as a means to produce revenue and market control.

Shoshana Zuboff