Getting Your Data Joie de Vivre On (or Back)

Lynn Cherny, EuroPython and PyData London, 2019

@arnicas

Link to Video from PyData London

A few contextual things...

I sometimes like & do NLP.
I do a lot of data visualization, but the web is a hard platform to work on professionally.
I do data analysis & data science (as a consultant right now).
I've taught a bit (university and workshops).

When I give invited talks, I make new stuff. So there's drafty toys in here!

from Austin Kleon's Keep Going

(I also rec Jenny Odell's How to Do Nothing)

Project 1: The BoschBot Viewer

Hieronymus Bosch's "The Garden of Earthly Delights"

Wikipedia info and image files

The @BoschBot

Quantumly random image segments daily; detail from the big picture is really very engaging.

Task	Thing I Did	Rating Easy	Rating Fun / Interesting
1	Used Twint to scrape tweets from Jan 1 - May 5, 2019 with images: "An advanced Twitter scraping & OSINT tool written in Python that doesn't use Twitter's API, allowing you to scrape a user's followers, following, Tweets and more while evading most API limitations."	Very! 😁	Quite cool. 👍
2	Pandas to load, review, and finally get the top 10 by most "likes"	Very! 😂	Very! 😂
3	Figured out where the image segments were from in the big image (blargh #$%*).	NOT. 🙄	Meh. Good problem I didn't solve.
4	Used Leaflet.js for a web app to display them!	Fairly... after tiles and transforms	Kind of? 🤔

"What details did people like most?" I wondered. And made an app.

local demo

This thing was "big data."

It's slow to load Wikipedia's high res image (223 MB) in browser in one go, not to mention move around in it! (This isn't even the biggest image available.)
Pillow-based Python tools won't load it (they fail with an error about "Decompression Bombs").
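A rough sketch of why Pillow balks at an image like this: it compares the pixel count against a built-in budget, warns above the budget, and raises an error above twice the budget. The constant below mirrors Pillow's default; the image dimensions are hypothetical placeholders, not measurements from the talk. For a file you trust, setting `PIL.Image.MAX_IMAGE_PIXELS = None` before opening it switches the check off.

```python
# Pillow's default pixel budget: 1024 * 1024 * 1024 // 4 // 3 pixels.
PILLOW_DEFAULT_MAX_PIXELS = 1024 * 1024 * 1024 // 4 // 3


def bomb_status(width, height, max_pixels=PILLOW_DEFAULT_MAX_PIXELS):
    """Classify an image size the way Pillow's size check does."""
    pixels = width * height
    if pixels > 2 * max_pixels:
        return "error"    # Pillow raises DecompressionBombError
    if pixels > max_pixels:
        return "warning"  # Pillow emits DecompressionBombWarning
    return "ok"


# Hypothetical dimensions for a very large scan (not taken from the talk).
print(bomb_status(30000, 17000))
# An ordinary camera-sized image for comparison.
print(bomb_status(4000, 3000))
```

Lifting the limit only removes the safety check; the decoded image still has to fit in memory, which is one more reason to cut it into tiles.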

Algorithms to try to locate the BoschBot snip inside the big image crashed my laptop (and/or were awful failures). Note: keywords here are "feature matching", "template matching", "SURF/SIFT algorithms", etc. (e.g. OpenCV link and link)

Guilty: I ended up finding the locations of the top 10 by hand in Photoshop :(

Why leaflet? A really big picture can be handled like a map!

from https://www.maptiler.com/google-maps-coordinates-tile-bounds-projection/

Convert your big image to map tiles...

Here's a helpful blog post. It's a bit out of date, though: Leaflet can now display non-map image data without extra libs like leaflet-rastercoords.

gdal2tiles.py [-p profile] [-r resampling] [-s srs] [-z zoom]
              [-e] [-a nodata] [-v] [-q] [-h] [-k] [-n] [-u url]
              [-w webviewer] [-t title] [-c copyright]
              [--processes=NB_PROCESSES]
              [-g googlekey] [-b bingkey] input_file [output_dir]

Tiles by directory - more tiles as you "zoom" in (higher numbers)
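To see why the directories fill up as the zoom number rises, here is a minimal sketch of a generic 256-pixel tile pyramid, where each zoom level doubles the resolution of the one before it. The tile size and the image dimensions are assumptions for illustration, and the exact layout gdal2tiles writes may differ in detail.

```python
import math

TILE = 256  # tile edge in pixels; a common default, assumed here


def max_zoom(width, height, tile=TILE):
    """Smallest zoom level at which the full-resolution image fits the grid."""
    return max(0, math.ceil(math.log2(max(width, height) / tile)))


def tiles_per_zoom(width, height, tile=TILE):
    """Return {zoom: (columns, rows)} for each level of the pyramid."""
    top = max_zoom(width, height, tile)
    levels = {}
    for zoom in range(top + 1):
        scale = 2 ** (top - zoom)  # how much the image is shrunk at this zoom
        cols = math.ceil(width / scale / tile)
        rows = math.ceil(height / scale / tile)
        levels[zoom] = (cols, rows)
    return levels


# Hypothetical image size, for illustration only.
for zoom, (cols, rows) in tiles_per_zoom(8192, 4096).items():
    print(zoom, cols, rows, cols * rows)
```

Each step up in zoom roughly quadruples the tile count, but the browser only ever fetches the handful of tiles in view.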

From Leaflet.js, we get a lot of stuff "for easy": tooltips, zoom behaviors, UI controls...

Click on a "dot" marker... and zoom in:

The top fav was: Naked guy's butt on a fish!

@BoschBot said, "Karen Kilgariff boosted that pinned tweet by 100x. ... With Karen I had to ask, ‘Why did this take off?’ I’d never heard of her!"

Why was it such an outlier?

That tweet image is "pinned" at the top of the Twitter page for @BoschBot.

@KarenKilgariff is a comic celeb (I had to look too).

Postscript: @BoschBot, after DM consultations, has added snapshot location info to the posts.

(I'll make a live web app with updating stats soon.)

What does that data look like?

Days posted before July 12

We need new interaction models, filters, display methods... The entire app needs redesigning! (Whoops.)

possible local demo of wip

On the path -- a new color scale helps...

Lower right corner, giant red blob....

Related work: Big Images

The Pudding's use of OpenSeadragon (for tile zoom), plus t-SNE and Mario Klingemann's RasterFairy (for squarified layout)

Project 2-3: Word2Vec Toys

Word2Vec

A word will be represented by a single vector in a matrix -- we find similar words by finding similar vectors.

Chris McCormick
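"Similar vectors" usually means a high cosine similarity. A minimal sketch in plain Python follows; the three-dimensional vectors are made up for illustration (real models use hundreds of dimensions), and `most_similar` here is a hypothetical helper, not the gensim method of the same name.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, vectors, topn=3):
    """Rank every other word by cosine similarity to `word`."""
    scores = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up 3-dimensional vectors, not taken from any trained model.
toy_vectors = {
    "polite": [0.9, 0.1, 0.0],
    "courteous": [0.8, 0.2, 0.1],
    "rude": [-0.7, 0.3, 0.1],
    "fish": [0.0, 0.1, 0.9],
}

print(most_similar("polite", toy_vectors))
```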

Project Gutenberg

A fine source of public domain text!

Notes: Getting data from Gutenberg

Several libs exist, but I downloaded and updated David Robinson's gutenbergr (R) because it allowed me to query by subject.
"Updating" meant downloading a lot and rebuilding recent metadata tables, plus some other edits related to metadata changes.
Be aware: Other libs don't necessarily strip out the header/footer content in each text, which is a pain in the butt.
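If your library leaves the boilerplate in, a small sketch like this can trim it. It assumes the file uses the usual "*** START OF" and "*** END OF" marker lines; the wording of those markers varies between files, so treat the prefixes as an assumption to check against your own corpus. The sample text is invented.

```python
def strip_gutenberg_boilerplate(text):
    """Keep only the lines between the '*** START OF' and '*** END OF' markers."""
    lines = text.splitlines()
    start, end = 0, len(lines)
    for i, line in enumerate(lines):
        if line.startswith("*** START OF"):
            start = i + 1  # body begins after the start marker
        elif line.startswith("*** END OF"):
            end = i        # body stops before the end marker
            break
    return "\n".join(lines[start:end]).strip()


# Invented sample in the shape of a Gutenberg plain-text file.
sample = """The Project Gutenberg EBook of Example
*** START OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***

Call me Ishmael.

*** END OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***
License text follows here."""

print(strip_gutenberg_boilerplate(sample))
```

When no markers are found the text comes back unchanged (apart from surrounding whitespace), so it is worth logging those files and checking them by hand.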

Use gensim to make your word2vec models using a fun text corpus.

from gensim.models import Word2Vec

# define training data - tokenized sentence format
sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
             ['this', 'is', 'the', 'second', 'sentence'],
             ['yet', 'another', 'sentence'],
             ['one', 'more', 'sentence'],
             ['and', 'the', 'final', 'sentence']]

# train model (min_count=1 keeps every word, even ones seen only once)
model = Word2Vec(sentences, min_count=1)
# summarize the trained model
print(model)
# summarize vocabulary (gensim 4.x; on 3.x use list(model.wv.vocab))
words = list(model.wv.index_to_key)
print(words)
# access vector for one word (the vectors live on model.wv)
print(model.wv['sentence'])
# save model in gensim's native format
# (model.wv.save_word2vec_format() writes the word2vec text/binary format instead)
model.save('model.bin')
# load model
new_model = Word2Vec.load('model.bin')

Source but there are many tutorials online...

Explorable UMAP 2D Display of One Model, thanks to Plotly (local demo)

Based on a modified gist of Peter Baumgartner's that I posted here

A common illustration of word2vec models is to find similar terms:

What if we went from "polite" to its closest word, which is "courteous," and then found that word's closest word, and then found that word's closest word...? I made an app (local demo).

ex above from here

Word chain from "wind" in Gutenberg poetry word2vec model (based on Allison Parrish's corpus work here)

Vertical distance ~ distance in this model's vector space. "wind" and "mist" are further apart than "rain" and "wind".

Size ~ how often this word shows up as a "most similar"; cycles are not allowed (we take the next-closest word instead).

Color: Pretty. (Also to suggest an order for reading left to right, along with the swoop)
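The chain-building idea can be sketched in a few lines: hop to the nearest word, and when that word is already in the chain, take the next-closest one instead. This is a toy version with made-up 2D vectors and a hypothetical `word_chain` helper, not the app's actual code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def word_chain(start, vectors, length=4):
    """Hop from each word to its nearest neighbor that is not already in the chain."""
    chain = [start]
    while len(chain) < length:
        current = chain[-1]
        candidates = [w for w in vectors if w not in chain]  # no cycles allowed
        if not candidates:
            break  # ran out of unused words
        nearest = max(candidates,
                      key=lambda w: cosine_similarity(vectors[current], vectors[w]))
        chain.append(nearest)
    return chain


# Made-up 2D vectors, not taken from the poetry model.
toy_vectors = {
    "wind": [1.0, 0.0],
    "rain": [0.9, 0.3],
    "storm": [0.7, 0.7],
    "mist": [0.2, 1.0],
}

print(word_chain("wind", toy_vectors))
```

In this toy data the closest word to "rain" is "wind", but "wind" is already in the chain, so the walk moves on to "storm".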

Since I made the first version... New public domain texts in January 2019!

https://lifehacker.com/these-1923-copyrighted-works-enter-the-public-domain-in-1825241296

Short stories by Christie, Virginia Woolf, H.P. Lovecraft, Katherine Mansfield, and Ernest Hemingway
Works by Jane Austen, D.H. Lawrence, Edith Wharton, Jorge Luis Borges, Mikhail Bulgakov, Jean Cocteau, Italo Svevo, Aldous Huxley, Winston Churchill, G.K. Chesterton, Maria Montessori, Lu Xun, Joseph Conrad, Zane Grey, H.G. Wells, and Edgar Rice Burroughs
Poetry by Edna St. Vincent Millay, E.E. Cummings, William Carlos Williams, Rainer Maria Rilke, Wallace Stevens, Robert Frost, Sukumar Ray, and Pablo Neruda

"Believe" in the older model, pre-January 2019 public domain releases

"Believe", in the UPDATED poetry model. You can get to magazines!

Ok, while we're at it, let's use the Nerd2Vec model too...

https://blogs.oracle.com/irml/nerd2vec:-jointly-embedding-star-trek,-star-wars-and-doctor-who-wikias

In nerd2vec, we're fairly "rational" here.

What if we want to "mess with" fair use poetry....

Inspiration in found poems, blackout poems, cut up poems, and creative constraints...

A next level project, a human-in-the-algorithm-loop creativity tool:

Austin Kleon

Timothy David Ray

Amelia was just fourteen and out of the orphan asylum;
at her first job—in the bindery, and yes sir, yes
ma'am, oh, so anxious to please.
She stood at the table, her blond hair hanging about
her shoulders, "knocking up" for Mary and Sadie,
the stitchers ("knocking up" is counting books and
stacking them in piles to be taken away).

Charles Reznikoff

local demo

Color represents a normalized distance between the original word (random "nouny"*-words) and its next closest relative in the embedding model.

Blue: Closer, Pink: Further Away.

* Using RiTa.js in the browser; it's not great at POS tagging. I need to switch to a server-based spaCy.
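One way to get a "normalized distance" is min-max scaling, so the closest pair in the poem maps to 0 and the furthest to 1. The slides don't say which normalization the app uses, so this is an assumption, and the distances below are invented.

```python
def normalize(values):
    """Min-max scale numbers into the 0..1 range."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # all distances equal: nothing to spread out
    return [(v - low) / (high - low) for v in values]


# Made-up distances from each word to its nearest neighbor in some model.
nearest_distance = {"table": 0.20, "hair": 0.35, "books": 0.50, "asylum": 0.80}

scaled = dict(zip(nearest_distance, normalize(list(nearest_distance.values()))))
for word, score in scaled.items():
    print(word, round(score, 2))
```

Because the scaling depends on the minimum and maximum across all words, every distance has to be known before any single button can be colored.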

Click on a colored word... pick a replacement from its nearest neighbors in the model....

and you edited the poem :)

Notice that in nerd2vec, "pond" is most similar to character names ("Amy Pond", Rory, etc).

This is one limitation of these simple embedding models! And of that dataset for poetic contexts.

https://www.slideshare.net/kurotanshi/d3js-56389560

D3.js for color scale: Map from your domain of number values to your range - color start and end point.
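The domain-to-range idea is plain linear interpolation. Here is a Python sketch of the concept rather than D3's own implementation: the exact blue and pink shades are assumptions, and clamping out-of-domain values is a choice made here.

```python
def make_color_scale(domain, color_range):
    """Return a function mapping a number in `domain` to an RGB hex color."""
    d0, d1 = domain
    start, end = color_range

    def scale(value):
        t = (value - d0) / (d1 - d0)
        t = min(1.0, max(0.0, t))  # clamp values outside the domain
        rgb = [round(s + (e - s) * t) for s, e in zip(start, end)]
        return "#{:02x}{:02x}{:02x}".format(*rgb)

    return scale


BLUE = (0, 0, 255)      # "closer"; exact shades are assumptions
PINK = (255, 192, 203)  # "further away"

color_for = make_color_scale(domain=(0.0, 1.0), color_range=(BLUE, PINK))
print(color_for(0.0), color_for(0.5), color_for(1.0))
```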

To get the scores, I had to radically re-architect the project (to get the data in advance to get the distances to get the colors... to render).

Components stack:

App (with Dropdowns)
Poem
Line of the Poem
Clickable Words Buttons
Modal component
ModalList component
SelectList component

Call the API here for all words :( [promise hell]
Called the API originally here - 1 call

TBH: Not sure the colors as implemented currently are super-useful, though.

React Programming Turns Out To Be...

Linking to....

Basho Haiku are kind of fun :)

Project parts scored...

Task	Thing I did	Rate ease	Rate interesting/ fun
1	Make word2vec models	Very easy	😃 Always fun
2	Stupidly decide to code the poem app in React	Easy to decide, very very hard learning curve for the number of interactions involved up and down the stack. My code is a mess.	😬
3	Change the stack to get the distances for all words as promises	Quite hard - updating after right point.	😢
4	Do the color scale on the buttons	Easy - D3 is an old friend	😃 Always fun, still fun in react.
5	Add the dropdowns to select model and poem	Ok, not bad. Some state issues...	Maybe seeing the point of react a little more-? Obviously hooks would help. 🤔

related AI creativity work... human + models:

Hugging Face's Write with Transformer

https://transformer.huggingface.co/

Example of me trying to work with the blue tool...

NVidia's GauGAN demo: http://nvidia-research-mingyuliu.com/gaugan

My crap drawing

Its weird photo realism

Ganbreeder app

https://ganbreeder.app

Find/Make/Use Cool Datasets

(and be clever about it.)

Finding fun datasets

Jeremy Singer-Vine's Data is Plural mailing list: Archive of datasets here
My giant pinboard links collection to "datasets": https://pinboard.in/u:arnicas/t:datasets/
Make your own! Remix fair use content! Collect your own data!

Josh Stevens

Janelle Shane

"Don't Use a Neural Network to Name Your Next Pub"

@JanelleCShane

Mingside Arms
Crown & Fathous
Stonebredde Arms
Old Festerlan
Burn Horse Hotel
Doss of Wulling of Stank
Shore’s Castle Crustle Hotel
Lick Inn
Odd Lingwion
Lambles
Loons Hall
Thringeron Arms
Flint Horse
The All House
Dean & Funtling
Old Hell Kick
Jolly Trocks
Wallow Arms

Fripy Whee Bore Inn
Ladside Inn
Hogs Thee Inn
Shur Hiad House Hotel Hotel
Old Ash Ox Horse Inn
Bleak Clab
Bark Inn
Blisksmerd
Shorthood
Rat Horses
Wheee Travel
Sham
New Shins
Ferp’s Brel

Darn Funk Inn

Arse Inn

Of Oz the Wizard

For those of you who think this must have taken a lot of time: it didn't! I'm too lazy to have cut all this by hand. I wrote a couple apps that made disassembly go very quickly. The primary work took me and a friend about 20 hours.

link

My appreciation for the film increased enormously, mostly in a technical sense while disassembling. I saw and heard things I’d never seen before and which you would only see going frame by frame. I saw how much craft there is in the film.

Matt Bucy, interviewed

https://towardsdatascience.com/african-masks-gans-tpu-9a6b0cf3105c

the demo

Anna Ridler's Tulips

"Ten thousand, or a myriad, of photos of tulips were taken over the course of tulip season and each has been categorised by hand, revealing the human aspect that sits behind machine learning."

Source

https://vimeo.com/287645190

Anna Ridler's MosaicVirus

Helena Sarin's GAN art

Poetry by Mary Oliver. Art by Helena (@glagolista)

By me, using only API calls - written up here.

Don't be a jerk.

Give credit to your sources and inspirations... we all help each other.

And really don't be this guy:

"If I don't do it, someone else will do it in a year."

https://www.vice.com/en_us/article/kzm59x/deepnude-app-creates-fake-nudes-of-any-woman

"DeepNude": the app that "undresses" women's pics

"Just a couple of days after taking the app down, the creator posted the code of the application to GitHub, saying that because the DeepNude application had already been replicated widely, there was no purpose in keeping the original code private, so they published it online."

Jack Clark's AI Newsletter

"The algorithm only works with women, Alberto said, because images of nude women are easier to find online—but he's hoping to create a male version, too."

wild forks and branches

are way better

"work"

@maxkreminski

When you find yourself playing with what you made and smiling... well, that might be good enough.

in a weirdly popular tweet of mine

Shoutout thanks to...

The @BoschBot for being willing to answer questions and add ugly numbers to their posts for future use
Peter Baumgartner's gist for making a UMAP display with tooltips in Plotly quickly (I've updated it to be more general here)
3Top's word2vec api code
Peter Gassner for help with React on Twitter.
Matiasdahl, who has helped a lot of people figure out how to import UMAP in Python when it errors.

Thanks!

arnicas@gmail.com

@arnicas

More writing about AI Art & Data Vis on Medium: https://medium.com/@lynn_72328

Slides up on ghostweather.slides.com!

Getting Your Data Joie de Vivre (Back)

By Lynn Cherny

A keynote talk for EuroPython 2019 and PyData London 2019 on fun data apps and hacks. Some data vis, React, word2vec vis, poetry editing tools, AI art.
