Interactively Coping with Big Data: EDA in the Age of AI

Lynn Cherny

Consulting (formerly EM-Lyon Business School)

arnicas@gmail.com,  @arnicas

Presented at EMAEE19, Univ of Sussex, June 2019

Who Am I?

  • Former Researcher & Professor
  • UX / Usability for 18 years
  • Data Vis Consultant
  • Data Science / NLP Consultant

I do a lot with data sets (qual and quant). They've gotten bigger, and they still always need visualizing.

Exploratory data analysis ("EDA") tools - and how do we reduce the distance to the model and the presentation?

 

From R for Data Science, Grolemund & Wickham

How do we reduce the distance between these steps?

Tidy & Transform Tools: "Trifacta"

One of the few "non-programming" options for data transformation ("wrangling").

"Our philosophy has been that AI and machine learning are very important, but there are times when human context has a part to play. That’s why we made the interface very easy to go back and forth between the two," explained [Joe] Hellerstein.

Trifacta Wrangler's Missing Data Detection and Suggestions ("tidy"/"clean")
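For readers who want the programming equivalent of a missing-data check, here is a minimal sketch using only the Python standard library. It is not how Trifacta Wrangler works internally; the sample data, the column names, and the set of tokens treated as "missing" are all assumptions made up for illustration.

```python
import csv
import io

# Hypothetical sample data; empty cells and "NA" stand in for missing values.
RAW = """id,city,revenue
1,Lyon,1200
2,,950
3,Brighton,NA
4,Paris,
"""

# Assumed set of strings that should count as "missing" (compared in lowercase).
MISSING_TOKENS = {"", "na", "n/a", "null", "none"}


def missing_report(csv_text):
    """Return {column: fraction of rows whose value looks missing}."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    report = {}
    for col in rows[0].keys():
        n_missing = sum(
            1 for r in rows if (r[col] or "").strip().lower() in MISSING_TOKENS
        )
        report[col] = n_missing / len(rows)
    return report


report = missing_report(RAW)
for col, frac in report.items():
    print(f"{col}: {frac:.0%} missing")
```

A per-column summary like this is roughly the information a wrangling tool surfaces visually; the interactive tools add suggestions for what to do about each gap.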

Power of SQL: Datasette, BigQuery...

"Datasette is a tool for exploring and publishing data. It helps people take data of any shape or size and publish that as an interactive, explorable website and accompanying API.

Datasette is aimed at data journalists, museum curators, archivists, local governments and anyone else who has data that they wish to share with the world."
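Datasette serves SQLite databases, so the kind of exploratory query it exposes can be sketched with Python's built-in sqlite3 module. The table name, columns, and rows below are hypothetical, invented only to show a group-and-summarize query.

```python
import sqlite3

# In-memory database; a real project would use a SQLite file on disk instead.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, hashtag TEXT, likes INTEGER)"
)
# Hypothetical rows for illustration.
conn.executemany(
    "INSERT INTO posts (hashtag, likes) VALUES (?, ?)",
    [
        ("champagne", 10),
        ("champagne", 30),
        ("selfie", 5),
        ("selfie", 7),
        ("selfie", 9),
    ],
)

# A typical first EDA question: how many rows per group, and what is the average?
summary = conn.execute(
    """
    SELECT hashtag, COUNT(*) AS n, AVG(likes) AS avg_likes
    FROM posts
    GROUP BY hashtag
    ORDER BY n DESC
    """
).fetchall()

for hashtag, n, avg_likes in summary:
    print(hashtag, n, avg_likes)
```

The same SELECT statement could be run against a database file through any SQL interface; only the connection line would change.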

Some programming is required for most of the rest, I'm afraid...

Python: Jupyter Notebook Widgets

QGrid in Jupyter Notebooks: Tabular Data Explore and Edit (e.g., data labeling)

"Panel": New Python Vis Framework

"We have repeatedly seen the pain involved in turning some analysis code into insights that can be easily shared with decision makers within an organization or the general public. Because the technologies involved often required distinct skill sets, different teams may be involved in prototyping, developing and deploying an app to be used by non-technical people."

"Panel" examples...  A few lines of Python code.

Build Pipelines with Panel

"A classifier pipeline which allows 1) capturing images from a webcam and applying object detection to the images, 2) selecting and modifying the bounding boxes and 3) classifying the contents of the selected region using Google’s Vision API (Try it yourself here)."

TensorFlow.js on MNIST digit recognition

All the rage in big (multi-dimensional) data vis:

Dimensionality reduction via UMAP, t-SNE, etc.
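To make the idea concrete, here is a small sketch of dimensionality reduction using PCA, a simpler linear technique swapped in for UMAP and t-SNE so that the example needs only NumPy. UMAP and t-SNE are nonlinear and usually give better-separated layouts on real data; the synthetic two-cluster data set below is an assumption made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: two clusters of 50 points each in 20 dimensions.
cluster_a = rng.normal(loc=0.0, scale=1.0, size=(50, 20))
cluster_b = rng.normal(loc=5.0, scale=1.0, size=(50, 20))
X = np.vstack([cluster_a, cluster_b])


def pca_2d(data):
    """Project the rows of `data` onto their first two principal components."""
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


embedding = pca_2d(X)
print(embedding.shape)
```

Each 20-dimensional point becomes an (x, y) pair that can be drawn as a scatterplot, which is the step that makes multi-dimensional data explorable by eye.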

t-SNE in the browser

UMAP in the browser

A full app in an Observable notebook from digital humanities

A Recent Cute t-SNE Layout of Clustered Text, with Associated Images...

#Champagne Images

A social media problem from me & colleagues...

Ones with text

the actual "selfies"

From a squarified view -- see code at https://ml4a.github.io/ from Gene Kogan; and demo

Dataset Labeling is a Big Deal for AI/ML

But the bigger the data, the bigger the problems.

Known data issues in standard AI training sets...

This was labelled "goldfish" but also contains a cat.

https://planspace.org/20170911-problems_with_imagenet_and_its_solutions/

Shoutout to the fast.ai course's data tools

Platform.ai (from the fast.ai folks)

Video segment, maybe demo

Remember how important label accuracy and coverage are...
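One way to keep an eye on both numbers is sketched below, under stated assumptions: "coverage" is taken to mean the share of items that have any label at all, and "accuracy" is measured against a small hand-audited subset. The item IDs, labels, and function names are hypothetical.

```python
# Hypothetical labels: None marks an item nobody has labeled yet.
labels = {
    "img1": "goldfish",
    "img2": "cat",
    "img3": None,
    "img4": "goldfish",
    "img5": None,
}

# Hypothetical audited subset: labels re-checked by hand and treated as correct.
audited = {"img1": "goldfish", "img2": "cat", "img4": "cat"}


def coverage(labels):
    """Fraction of items that have a label at all."""
    return sum(v is not None for v in labels.values()) / len(labels)


def audited_accuracy(labels, audited):
    """Fraction of audited, labeled items whose label matches the audit."""
    checked = [k for k in audited if labels.get(k) is not None]
    if not checked:
        return None  # nothing to compare against
    correct = sum(labels[k] == audited[k] for k in checked)
    return correct / len(checked)


print(f"coverage: {coverage(labels):.0%}")
print(f"accuracy on audited subset: {audited_accuracy(labels, audited):.0%}")
```

Here "img4" plays the role of the goldfish-and-cat image: the stored label disagrees with the audit, so it lowers the accuracy figure and flags the item for review.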

Recap...

  • EDA and model development processes require data vis interaction and flexible tools
  • The data labeling problem -- crucial for good ML/AI -- also requires strong vis and management tools
  • Tools for "big data" and data interaction during analysis have gotten really good (including browser-based!).  But most still require programming.

Thanks!