Interactively Coping with Big Data: EDA in the Age of AI

Lynn Cherny

Consulting (formerly EM-Lyon Business School)

arnicas@gmail.com, @arnicas

Presented at EMAEE19, Univ of Sussex, June 2019

Who Am I?

Former Researcher & Professor
UX / Usability for 18 years
Data Vis Consultant
Data Science / NLP Consultant

I do a lot with data sets (qual and quant). They've gotten bigger, and they still always need visualizing.

Exploratory data analysis ("EDA") tools - and how do we reduce the distance to the model and the presentation?

From R for Data Science, Grolemund & Wickham

How do we reduce the distance between these steps?

Tidy & Transform Tools: "Trifacta"

One of the few "non-programming" options for data transformation ("wrangling").

"Our philosophy has been that AI and machine learning are very important, but there are times when human context has a part to play. That’s why we made the interface very easy to go back and forth between the two," explained [Joe] Hellerstein.

Trifacta Wrangler's Missing Data Detection and Suggestions ("tidy"/"clean")
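
For readers who do program, the kind of per-column missing-value check a wrangling tool surfaces can be roughly sketched in plain Python. This is not Trifacta's method; the list of "missing" tokens, the `missing_report` helper, and the sample rows are all assumptions made up for illustration.

```python
import csv
import io

# Tokens treated as "missing". This list is an assumption for illustration;
# real data sets need their own conventions checked.
MISSING_TOKENS = {"", "na", "n/a", "null", "none", "nan", "-"}


def missing_report(csv_text):
    """Return {column: (missing_count, total_rows)} for a CSV string."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in reader.fieldnames}
    total = 0
    for row in reader:
        total += 1
        for name in counts:
            value = row.get(name)
            # Short rows yield None; otherwise compare a normalized token.
            if value is None or value.strip().lower() in MISSING_TOKENS:
                counts[name] += 1
    return {name: (n, total) for name, n in counts.items()}


# Hypothetical sample data, not taken from any real data set.
SAMPLE = """title,rating,votes
Film A,7.5,1200
Film B,,300
Film C,N/A,
Film D,6.1,null
"""

report = missing_report(SAMPLE)
for column, (n_missing, n_rows) in report.items():
    print(f"{column}: {n_missing}/{n_rows} missing")
```

A report like this is only the detection half; deciding whether to drop, fill, or flag the rows is where the human context comes in.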

Power of SQL: Datasette, BigQuery...

https://github.com/simonw/datasette

Example: 80K Rows in Movie Ratings data set

https://fivethirtyeight.datasettes.com/fivethirtyeight/inconvenient-sequel%2Fratings

"Datasette is a tool for exploring and publishing data. It helps people take data of any shape or size and publish that as an interactive, explorable website and accompanying API.

Datasette is aimed at data journalists, museum curators, archivists, local governments and anyone else who has data that they wish to share with the world."
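
Datasette serves SQLite databases, so the same kind of query can be tried locally with Python's built-in `sqlite3` module. The sketch below is a minimal illustration: the `ratings` table, its columns, and the rows are hypothetical and do not reproduce the FiveThirtyEight schema.

```python
import sqlite3

# In-memory database with a made-up ratings table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ratings (title TEXT, category TEXT, rating INTEGER, votes INTEGER)"
)
rows = [
    ("Film A", "Males", 1, 100),
    ("Film A", "Females", 10, 250),
    ("Film B", "Males", 7, 50),
    ("Film B", "Females", 8, 25),
]
conn.executemany("INSERT INTO ratings VALUES (?, ?, ?, ?)", rows)

# Aggregate by category: the sort of summary you would type into a SQL box.
query = """
SELECT category, COUNT(*) AS n_rows, SUM(votes) AS total_votes
FROM ratings
GROUP BY category
ORDER BY total_votes DESC
"""
summary = conn.execute(query).fetchall()
for category, n_rows, total_votes in summary:
    print(category, n_rows, total_votes)
```

The point of the SQL route is that grouping and filtering happen in the database, so the approach keeps working when the table has 80K rows instead of four.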

Some programming is required for most of the rest, I'm afraid...

Python: Jupyter Notebook Widgets

E.g., a tutorial article here:

https://towardsdatascience.com/bring-your-jupyter-notebook-to-life-with-interactive-widgets-bc12e03f0916

QGrid in Jupyter Notebooks: Tabular Data Explore and Edit (e.g., data labeling)

from Quantopian: https://github.com/quantopian/qgrid

"Panel": New Python Vis Framework

"We have repeatedly seen the pain involved in turning some analysis code into insights that can be easily shared with decision makers within an organization or the general public. Because the technologies involved often required distinct skill sets, different teams may be involved in prototyping, developing and deploying an app to be used by non-technical people."

Philipp Rudiger, Anaconda

"Panel" examples... A few lines of Python code.

Source post

Build Pipelines with Panel

"A classifier pipeline which allows 1) capturing images from a webcam and applying object detection to the images, 2) selecting and modifying the bounding boxes and 3) classifying the contents of the selected region using Google’s Vision API (To try it yourself here)."

Medium Post on Panel

Tensorflow.js on MNIST digit recognition

https://storage.googleapis.com/tfjs-examples/mnist/dist/index.html

All the rage in big (multi-dimensional) data vis:

Dimensionality reduction via UMAP, T-SNE, etc.
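
To make "dimensionality reduction" concrete without any third-party library, here is a minimal sketch that projects 4-D points to 2-D. Note the swap: it uses PCA (a linear method, computed by power iteration), not UMAP or T-SNE, which are nonlinear and in practice come from packages such as umap-learn or scikit-learn. The data and helper names are made up for illustration.

```python
import math
import random


def center(rows):
    """Subtract each column's mean."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(rows):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]


def mat_vec(matrix, v):
    return [sum(a * b for a, b in zip(row, v)) for row in matrix]


def top_eigenvector(matrix, iterations=200, seed=0):
    """Power iteration: repeatedly multiply and renormalize."""
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in matrix]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = mat_vec(matrix, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # degenerate matrix: keep the current vector
            break
        v = [x / norm for x in w]
    return v


def pca_2d(rows):
    """Project rows onto their first two principal components."""
    centered = center(rows)
    cov = covariance(centered)
    d = len(cov)
    components = []
    for _ in range(2):
        v = top_eigenvector(cov)
        eigenvalue = sum(a * b for a, b in zip(v, mat_vec(cov, v)))
        components.append(v)
        # Deflate: remove the found component so the next one can surface.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    return [[sum(a * b for a, b in zip(row, c)) for c in components]
            for row in centered]


# Two made-up clusters in 4-D: one near 0, one near 5.
DATA = [
    [0.0, 0.1, 0.0, 0.2], [0.1, 0.0, 0.2, 0.0], [0.2, 0.1, 0.1, 0.1],
    [5.0, 5.1, 5.0, 5.2], [5.1, 5.0, 5.2, 5.0], [5.2, 5.1, 5.1, 5.1],
]
embedded = pca_2d(DATA)
for point in embedded:
    print([round(x, 2) for x in point])
```

The two clusters land on opposite sides of the first axis. UMAP and T-SNE aim at the same goal (nearby points stay nearby in 2-D) but can untangle structure that a linear projection cannot.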

T-SNE in the browser

https://observablehq.com/@mbostock/lets-try-t-sne

UMAP in the browser

https://observablehq.com/@fil/mnist-umap-js

A full app in an Observable notebook from digital humanities

https://observablehq.com/@bmschmidt/book-visualizations-sandbox?htid=uc2.ark:/13960/t3cz3334b

A Recent Cute T-SNE Layout of Text Clustered, and Images Associated...

By Fathom, https://fathom.info/bobross/

#Champagne Images

A social media problem from me & colleagues...

Ones with text

the actual "selfies"

from a squarified view -- see code at https://ml4a.github.io/ from Gene Kogan; and demo
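
The idea behind a squarified view is to snap a 2-D embedding onto a grid so image thumbnails do not overlap. The sketch below is a deliberately simple heuristic (sort into rows by y, then order each row by x), not the ml4a code; a more careful approach would solve an assignment problem to minimize how far each point moves. The function name and sample points are hypothetical.

```python
import math


def snap_to_grid(points, cols=None):
    """Map each point index to a (row, col) grid cell.

    Heuristic: sort points by y, cut the sorted list into rows of
    `cols` items, then order each row by x.
    """
    n = len(points)
    if cols is None:
        cols = math.ceil(math.sqrt(n))  # roughly square grid
    by_y = sorted(range(n), key=lambda i: points[i][1])
    cells = {}
    for row, start in enumerate(range(0, n, cols)):
        band = sorted(by_y[start:start + cols], key=lambda i: points[i][0])
        for col, index in enumerate(band):
            cells[index] = (row, col)
    return cells


# Made-up 2-D embedding coordinates for four images.
points = [(0.1, 0.2), (0.8, 0.1), (0.2, 0.9), (0.9, 0.8)]
cells = snap_to_grid(points)
for index, (row, col) in sorted(cells.items()):
    print(f"image {index} -> row {row}, col {col}")
```

Each image keeps its rough neighborhood from the embedding, and every grid cell holds exactly one thumbnail.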

Dataset Labeling is a Big Deal for AI/ML

But the bigger the data, the bigger the problems.

Known data issues in standard AI training sets...

This was labelled "goldfish" but also contains a cat.

https://planspace.org/20170911-problems_with_imagenet_and_its_solutions/

Shoutout to Fastai course's data tools

https://github.com/hiromis/notes/blob/master/Lesson2.md

Platform.ai (from Fast.ai folks)

Video segment, maybe demo

Remember how important label accuracy and coverage are...
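
As a rough illustration of what "accuracy and coverage" can mean in practice, the sketch below compares a labeled set against a small hand-audited sample. The definitions used here (coverage = share of items with any label, accuracy = share of audited labels that match) and all item IDs and labels are assumptions for the example.

```python
from collections import Counter


def label_audit(labels, audited):
    """Summarize label quality.

    labels:  {item_id: label or None}  (None means unlabeled)
    audited: {item_id: verified_label} for a hand-checked sample
    Returns (coverage, accuracy, confusions).
    """
    coverage = sum(1 for v in labels.values() if v is not None) / len(labels)
    # Only audited items that actually carry a label can be scored.
    checked = [(labels[i], truth) for i, truth in audited.items()
               if labels.get(i) is not None]
    if checked:
        accuracy = sum(1 for got, truth in checked if got == truth) / len(checked)
    else:
        accuracy = None
    confusions = Counter((got, truth) for got, truth in checked if got != truth)
    return coverage, accuracy, confusions


# Hypothetical labels and audit results.
labels = {"img1": "goldfish", "img2": "goldfish", "img3": "cat",
          "img4": None, "img5": "cat"}
audited = {"img1": "goldfish", "img2": "cat", "img3": "cat", "img4": "cat"}

coverage, accuracy, confusions = label_audit(labels, audited)
print(f"coverage: {coverage:.0%}")
print(f"accuracy on audited sample: {accuracy:.0%}")
print("most common confusions:", confusions.most_common(3))
```

The confusion counts are what a labeling tool would want to visualize: they show which label pairs to review first.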

Twitter source

Recap...

EDA and model development processes require data vis interaction and flexible tools
The data labeling problem -- crucial for good ML/AI -- also requires strong vis and management tools
Tools for "big data" and data interaction during analysis have gotten really good (including browser-based!). But most still require programming.

Thanks!