Big Data and Data Science: An Intro
Lynn Cherny, Ph.D.
EM-Lyon Marketing & Innovation
March 2017
This was a Google search, but also recommended: http://bigdatapix.tumblr.com/
Use the down arrow!
Word clouds
Elephant-shaped Word Clouds
Genres
Men in front of walls of big data.
In suits.
Men in front of wall-sized networks.
Tip: Some of the most famous data scientists are women.
A search for #BigData on Twitter is pretty gross...
And my spam folder looks similar. Very hype and sales and marketing.
http://demo.relato.io/oreilly
Big data is like teenage sex: Everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it.
--Dan Ariely of Duke University
New Scientist
Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, and information privacy.
Big data is "distributed"
Mike Driscoll: https://www.quora.com/How-much-data-is-Big-Data
The "3 V's" of Big Data
- Volume - the amount of data generated by organizations or individuals or websites.
- Velocity - the frequency at which data is generated, captured and shared.
- Variety - More than rows and columns and numbers; often unstructured text, video, audio.
sometimes also
- Veracity - Can you trust it, is it reliable? Evaluating that is hard.
"The Cloud"
- 1999: Salesforce, "Software as a Service" (SaaS) via the web
- 2002: Amazon Web Services launches (Mechanical Turk follows in 2005)
- 2006: AWS EC2 (cloud-based computing clusters)
A server issue in Virginia is affecting most of the northeast, disrupting the infrastructure for many popular products and services including Netflix, Product Hunt, Medium, SocialFlow, Buffer, GroupMe, Pocket, Viber, Amazon Echo and more.
It’s certainly not the first time AWS has taken much of the Internet out with it. In 2013, AWS suffered a similar outage that took services like Instagram, Airbnb and Vine offline. According to Buzzfeed, that’s a loss of about $1,100 per second for Amazon.
https://www.engadget.com/2017/03/02/amazon-admits-that-a-typo-took-the-internet-down-this-week/
and last week....
Giant Data Sets Are Around
Applications that use them, too! Via APIs.
API: "application programming interface"
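To make the idea concrete, here is a minimal sketch of what an application does when it talks to an API: it builds a request URL from parameters, then turns the JSON text that comes back into data it can use. The endpoint `api.example.com`, the parameter names, and the canned response are all hypothetical, chosen only for illustration.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint, for illustration only; not a real service.
BASE_URL = "https://api.example.com/v1/search"


def build_request_url(query, limit=10):
    """Encode the parameters into the URL an application would request."""
    return BASE_URL + "?" + urlencode({"q": query, "limit": limit})


def parse_response(body):
    """Turn the JSON text an API sends back into a list of titles."""
    payload = json.loads(body)
    return [item["title"] for item in payload["results"]]


# A canned response stands in for the network call, so this runs offline.
sample_body = (
    '{"results": [{"title": "Elephant word cloud"}, {"title": "Men in suits"}]}'
)

url = build_request_url("big data", limit=2)
titles = parse_response(sample_body)
print(url)
print(titles)
```

In a real application the canned string would be replaced by the body of an HTTP response from the service's documented endpoint.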
This week... literally:
Stitch Fix
What is Data Science?
(and how is it related?)
"The Unreasonable Effectiveness of Data"
by Halevy, Norvig, Pereira at Google in 2009
[ok, but for some problems]
The first lesson of Web-scale learning is to use available large-scale data rather than hoping for annotated data that isn’t available. For instance, we find that useful semantic relationships can be automatically learned from the statistics of search queries and the corresponding results-- or from the accumulated evidence of Web-based text patterns and formatted tables-- in both cases without needing any manually annotated data.
More Data Means...
- It can be noisy (noisier)
- Noisy is more realistic
- With the right models (and infrastructure), more data substitutes for annotated training data
Predictive Analytics - a Hot Topic in Data Science
- Fraud Detection
- Cross-selling (customers who bought this bought that)
- Clinical support systems: Patients likely to develop diabetes, etc.
- Customer Retention
Predictive analytics is an area of data mining that deals with extracting information from data and using it to predict trends and behavior patterns.
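The cross-selling case ("customers who bought this bought that") can be sketched with nothing more than counting which items share a basket. This is a toy illustration, not how any particular retailer does it; the baskets are made up and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Made-up baskets for illustration; not real sales data.
baskets = [
    {"tent", "sleeping bag", "stove"},
    {"tent", "sleeping bag"},
    {"stove", "fuel"},
    {"tent", "stove", "fuel"},
]


def co_purchase_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        for a, b in combinations(sorted(basket), 2):
            counts[(a, b)] += 1
    return counts


def also_bought(item, baskets):
    """Rank other items by how often they share a basket with `item`."""
    scores = Counter()
    for (a, b), n in co_purchase_counts(baskets).items():
        if a == item:
            scores[b] += n
        elif b == item:
            scores[a] += n
    return scores.most_common()


print(also_bought("tent", baskets))
```

Real systems usually normalize these counts (popular items co-occur with everything), but the raw co-occurrence table is the starting point.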
AI "Boom" Recently... "deep learning"
Recent Fun AI
by Kyle McDonald
https://aiexperiments.withgoogle.com/drum-machine/view/
By Mario Klingemann and Simon Doury
"Style Transfer" experiments by Gene Kogan
https://affinelayer.com/pixsrv/index.html
by @hardmaru
Image captioning
"Data Science" Today Encompasses...
Not just machine learning (or AI or deep learning or NLP)....
- data integration
- distributed architecture
- automating machine learning
- data visualization
- dashboards and BI
- data engineering
- deployment in production mode
- automated, data-driven decisions
Tools in the Ecosystem:
Hadoop and Map/Reduce (and Spark...)
Lots of online courses and books (plus ours!)
AWS Mechanical Turk
Distributed Human Labor
Exercise
Map/Reduce
MAP
REDUCE
slide from Jeff Patti: http://www.slideshare.net/JeffPatti/map-reducebeyondwordcount
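As a rough sketch of the two phases, here is word count written as plain Python on a single machine. The function names are illustrative and this is not Hadoop's API; a real framework runs the map and reduce steps in parallel across many machines and handles the grouping ("shuffle") in between.

```python
from collections import defaultdict


def map_phase(line):
    """MAP: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key; a framework does this between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """REDUCE: collapse all the values for one key into a single result."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())


lines = ["big data is big", "data is distributed"]
result = word_count(lines)
print(result)
```

Because each call to `map_phase` looks at only one line and each call to `reduce_phase` looks at only one word, both steps can be spread over many workers without them needing to talk to each other.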
The nightmare that is Java Hadoop...
This is "hello world" word count.
An alternative to Hadoop, Spark with Python
Demo
Counting is Hard.
- What do you count?
- It can be high stakes.
- People like to argue about it. :)
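A small sketch of why "what do you count?" matters: the same sentence gives different word counts depending on how words are defined. Both tokenizing rules below are arbitrary choices made for illustration, and neither is the "right" one.

```python
import re
from collections import Counter

text = "Big Data isn't big-data. BIG data, big hype!"


def count_naive(text):
    """Split on whitespace only: case and punctuation make words distinct."""
    return Counter(text.split())


def count_normalized(text):
    """Lowercase first, then keep only runs of letters and apostrophes."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


naive = count_naive(text)
normalized = count_normalized(text)
print(naive["big"], normalized["big"])
print(naive["data"], normalized["data"])
```

The naive count sees "Big", "BIG", "big" and "big-data." as four different tokens, while the normalized count folds them together. Whether "big-data" should count as "big" plus "data" is exactly the kind of decision people argue about.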
Downsides and Dangers of Big Data Algorithms
Big Data and whole data are not the same. Without taking into account the sample of a data set, the size of the data set is meaningless. For example, a researcher may seek to understand the topical frequency of tweets, yet if Twitter removes all tweets that contain problematic words or content – such as references to pornography or spam – from the stream, the topical frequency would be inaccurate. Regardless of the number of tweets, it is not a representative sample as the data is skewed from the beginning.
d. boyd and K. Crawford, "Critical Questions for Big Data"
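The point about filtered streams can be illustrated with a toy calculation. The topic labels and counts are invented for the example: once one topic is removed upstream, every remaining topic's share is inflated, no matter how many tweets are left.

```python
from collections import Counter

# Made-up topic labels standing in for a stream of 100 tweets.
full_stream = ["news"] * 40 + ["sports"] * 30 + ["spam"] * 30


def topic_shares(tweets):
    """Return each topic's fraction of the tweets given."""
    counts = Counter(tweets)
    total = len(tweets)
    return {topic: n / total for topic, n in counts.items()}


# The platform silently drops one topic before researchers see the data.
filtered_stream = [t for t in full_stream if t != "spam"]

full_shares = topic_shares(full_stream)
filtered_shares = topic_shares(filtered_stream)
print(full_shares)
print(filtered_shares)
```

In the full stream "news" is 40% of tweets; in the filtered stream it appears to be about 57%, and "spam" appears not to exist at all.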
"Surveillance Capitalism"
In this article I explore the proposition that ‘big data’ is above all the foundational component in a deeply intentional and highly consequential new logic of accumulation that I call surveillance capitalism. This new form of information capitalism aims to predict and modify human behavior as a means to produce revenue and market control.
Uber Affairs, Foursquare Check-ins, AKA Your Secrets Aren't Safe
Uber had just told all its users that if they were having an affair, it knew about it. Rides to Planned Parenthood? Regular rides to a cancer hospital? Interviews at a rival company? Uber knows about them, too.
"Weapons of Math Destruction"
How Big Data Increases Inequality and Threatens Democracy, by Cathy O'Neil
Amazon link
algorithms that are important, secret and destructive
Intro to Big Data and Data Science
By Lynn Cherny
Big Data talk for undergrads in St Etienne.