Interactively Coping with Big Data: EDA in the Age of AI
Lynn Cherny
Consulting (formerly EM-Lyon Business School)
arnicas@gmail.com, @arnicas
Presented at EMAEE19, Univ of Sussex, June 2019
Who Am I?
- Former Research & Professor
- UX / Usability for 18 years
- Data Vis Consultant
- Data Science / NLP Consultant
I do a lot with data sets (qual and quant), they've gotten bigger, and they still always need visualizing.
Exploratory data analysis ("EDA") tools - and how do we reduce the distance to the model and the presentation?
From R for Data Science, Grolemund & Wickham
How do we reduce the distance between these steps?
Tidy & Transform Tools: "Trifacta"
One of the few "non-programming" options for data transformation ("wrangling").
"Our philosophy has been that AI and machine learning are very important, but there are times when human context has a part to play. That’s why we made the interface very easy to go back and forth between the two,” explained [Joe] Hellerstein.
Trifacta Wrangler's Missing Data Detection and Suggestions ("tidy"/"clean")
Power of SQL: Datasette, BigQuery...
Example: 80K Rows in Movie Ratings data set
https://fivethirtyeight.datasettes.com/fivethirtyeight/inconvenient-sequel%2Fratings
"Datasette is a tool for exploring and publishing data. It helps people take data of any shape or size and publish that as an interactive, explorable website and accompanying API.
Datasette is aimed at data journalists, museum curators, archivists, local governments and anyone else who has data that they wish to share with the world."
Some programming is basic for most of the rest, I'm afraid...
Python: Jupyter Notebook Widgets
Eg., article tutorial here:
QGrid in Jupyter Notebooks: Tabular Data Explore and Edit (e.g., data labeling)
from Quantopian: https://github.com/quantopian/qgrid
"Panel": New Python Vis Framework
"We have repeatedly seen the pain involved in turning some analysis code into insights that can be easily shared with decision makers within an organization or the general public. Because the technologies involved often required distinct skill sets, different teams may be involved in prototyping, developing and deploying an app to be used by non-technical people."
"Panel" examples... A few lines of Python code.
Build Pipelines with Panel
"A classifier pipeline which allows 1) capturing images from a webcam and applying object detection to the images, 2) selecting and modifying the bounding boxes and 3) classifying the contents of the selected region using Google’s Vision API (To try it yourself here)."
Tensorflow.js on MNIST digit recognition
All the rage in big (multi-dimensional) data vis:
Dimensionality reduction via UMAP, TSNE, etc.
T-SNE in the browser
UMAP in the browser
A full app in an Observable notebook from digital humanities
A Recent Cute T-SNE Layout of Text Clustered, and Images Associated...
By Fathom, https://fathom.info/bobross/
#Champagne Images
A social media problem from me & colleagues...
Ones with text
the actual "selfies"
from a squarified view -- see code at https://ml4a.github.io/ from gene kogan; and demo
Dataset Labeling is a Big Deal for AI/ML
But the bigger the data, the bigger the problems.
Known data issues in standard AI training sets...
This was labelled "goldfish" but also
contains a cat.
https://planspace.org/20170911-problems_with_imagenet_and_its_solutions/
Shoutout to Fastai course's data tools
Platform.ai (from Fast.ai folks)
Video segment, maybe demo
Remember how important label accuracy and coverage are...
Recap...
- EDA and model development processes requires data vis interaction and flexible tools
- The data labeling problem -- crucial for good ML/AI -- also requires strong vis and management tools
- Tools for "big data" and data interaction during analysis have gotten really good (including browser-based!). But most still require programming.
Thanks!
Interactive Tools and Big Data
By Lynn Cherny
Interactive Tools and Big Data
A talk for Data Vis panel at EMAEE19, Univ of Sussex, June 2019.
- 3,672