Getting Your Data Joie de Vivre On (or Back)
Lynn Cherny, EuroPython and PyData London, 2019
A few contextual things...
- I sometimes like & do NLP.
- I do a lot of data visualization but the web is hard to work for professionally.
- I do data analysis & data science (as a consultant right now).
- I've taught a bit (university and workshops).
When I give invited talks, I make new stuff. So there's drafty toys in here!
from Austin Kleon's Keep Going
(I also rec Jenny Odell's How to Do Nothing)
Project 1: The BoschBot Viewer
Hieronymus Bosch's "The Garden of Earthly Delights"
Quantumly random image
detail from the big picture
is really very engaging.
|Task||Thing I Did||Rating Easy||Rating Fun / Interesting|
|1||Used Twint to scrape tweets from Jan 1 - May 5, 2019 with images: "An advanced Twitter scraping & OSINT tool written in Python that doesn't use Twitter's API, allowing you to scrape a user's followers, following, Tweets and more while evading most API limitations."||Very! 😁||Quite cool. 👍|
|2||Pandas to load, review, and finally get the top 10 by most "likes"||Very! 😂||Very! 😂|
|3||Figured out where the image segments were from in the big image (blargh #$%*).||NOT. 🙄||Meh. Good problem I didn't solve.|
|4||Used Leaflet.js for a web app to display them!||Fairly... after tiles and transforms||Kind of? 🤔|
"What details did people like most?" I wondered. And made an app.
This thing was "big data."
- It's slow to load Wikipedia's high res image (223 MB) in browser in one go, not to mention move around in it! (This isn't even the biggest image available.)
- Pillow-based Python tools won't load it (they complain about "Decompression Bombs").
- Algorithms to try to locate the BoschBot snip inside the big image crashed my laptop (and/or were awful failures). Keywords here are "feature matching", "template matching", "SURF/SIFT algorithms", etc. (e.g., opencv link and link).
Guilty: I ended up finding the locations of the top 10 by hand in Photoshop :(
Why leaflet? A really big picture can be handled like a map!
Convert your big image to map tiles...
Here's a helpful blog post - however, it's a bit out of date because you can use leaflet with image data that's non-map now without extra libs like leaflet-rastercoords.
gdal2tiles.py [-p profile] [-r resampling] [-s srs] [-z zoom] [-e] [-a nodata] [-v] [-q] [-h] [-k] [-n] [-u url] [-w webviewer] [-t title] [-c copyright] [--processes=NB_PROCESSES] [-g googlekey] [-b bingkey] input_file [output_dir]
Tiles by directory - more tiles as you "zoom" in (higher numbers)
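A minimal sketch of why the deeper zoom directories hold more tiles, assuming square 256-pixel tiles and a pyramid where each zoom level doubles the resolution. The image size below is hypothetical (not the real dimensions of the Bosch scan), and `tile_pyramid` is an illustrative helper, not part of gdal2tiles; its exact output may differ by profile and options.

```python
import math

TILE_SIZE = 256  # assumed tile edge length in pixels


def tile_pyramid(width, height, tile_size=TILE_SIZE):
    """Return (zoom, columns, rows) for each level, zoomed-out first."""
    # Highest zoom is the level where the image is shown at full resolution.
    max_zoom = max(0, math.ceil(math.log2(max(width, height) / tile_size)))
    levels = []
    for zoom in range(max_zoom + 1):
        scale = 2 ** (max_zoom - zoom)  # shrink factor at this zoom level
        cols = math.ceil(width / scale / tile_size)
        rows = math.ceil(height / scale / tile_size)
        levels.append((zoom, cols, rows))
    return levels


# Hypothetical image size, chosen only for illustration.
for zoom, cols, rows in tile_pyramid(30000, 17000):
    print(f"zoom {zoom}: {cols} x {rows} = {cols * rows} tiles")
```

Each step up in zoom roughly quadruples the tile count, which is why the browser only ever needs to fetch the handful of tiles in view.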
From Leaflet.js, we get a lot of stuff "for easy" : Tooltips, zoom behaviors, UI controls...
Click on a "dot" marker... and zoom in:
The top fav was: Naked guy's butt on a fish!
@Boschbot said, "Karen Kilgariff boosted that pinned tweet by 100x. ... With Karen I had to ask, ‘Why did this take off?’ I’d never heard of her!"
Why was it such an outlier?
That tweet image is "pinned" at the top of the twitter page for @Boschbot.
@KarenKilgariff is a comic celeb (I had to look too).
Postscript: @BoschBot, after DM consultations, has added snapshot location info to the posts.
(I'll make a live web app with updating stats soon.)
What does that data look like?
days posted before july 12
We need new interaction models, filters, display methods... The entire app needs redesigning! (Woops.)
possible local demo of wip
On the path -- a new color scalar helps...
Lower right corner, giant red blob....
Related work: Big Images
The Pudding's use of OpenSeadragon (for tile zoom), and t-SNE and RasterFairy from Mario Klingemann (for squarified layout)
Project 2-3: Word2Vec Toys
A word will be represented by a single vector in a matrix -- we find similar words by finding similar vectors.
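A minimal sketch of "similar words are similar vectors", using cosine similarity as the measure (a common choice for word embeddings, assumed here). The three-dimensional vectors are invented for illustration; real models use hundreds of dimensions, and `most_similar` below is a toy helper, not the gensim method.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented toy vectors; not taken from any trained model.
toy_vectors = {
    "polite": [0.9, 0.1, 0.0],
    "courteous": [0.8, 0.2, 0.1],
    "fish": [0.0, 0.1, 0.9],
}


def most_similar(word, vectors):
    """Return the other word whose vector is closest to `word`'s vector."""
    others = [w for w in vectors if w != word]
    return max(others, key=lambda w: cosine_similarity(vectors[word], vectors[w]))


print(most_similar("polite", toy_vectors))
```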
A fine source of fair use text!
Notes: Getting data from Gutenberg
- Several libs exist, but I downloaded and updated David Robinson's gutenbergr (R) because it allowed me to query by subject.
- "Updating" meant downloading a lot and rebuilding recent metadata tables, plus some other edits related to metadata changes.
- Be aware: Other libs don't necessarily strip out the header/footer content in each text, which is a pain in the butt.
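If a library leaves the header and footer in, one way to strip them is to cut at the `*** START OF ...` and `*** END OF ...` marker lines. This is a sketch under that assumption: marker wording varies between files and older texts use other conventions, so the patterns and the helper name `strip_gutenberg_boilerplate` are illustrative, and the sample text is made up.

```python
import re

# Assumed marker patterns; check them against your own downloads.
START_RE = re.compile(r"^\*\*\* ?START OF .*$", re.MULTILINE)
END_RE = re.compile(r"^\*\*\* ?END OF .*$", re.MULTILINE)


def strip_gutenberg_boilerplate(text):
    """Return the text between the START and END marker lines, if present."""
    start = START_RE.search(text)
    end = END_RE.search(text)
    body_start = start.end() if start else 0
    # Ignore an END marker that somehow appears before the START marker.
    body_end = end.start() if end and end.start() >= body_start else len(text)
    return text[body_start:body_end].strip()


# Made-up sample in the general shape of a Gutenberg file.
sample = """The Project Gutenberg EBook of Example
*** START OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***
First line of the book.
Second line.
*** END OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***
License text follows...
"""
print(strip_gutenberg_boilerplate(sample))
```

Texts without the markers pass through unchanged (apart from trimmed whitespace), so it is worth spot-checking a few files before training on the result.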
Use gensim to make your word2vec models using a fun text corpus.
    from gensim.models import Word2Vec

    # define training data - tokenized sentence format
    sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
                 ['this', 'is', 'the', 'second', 'sentence'],
                 ['yet', 'another', 'sentence'],
                 ['one', 'more', 'sentence'],
                 ['and', 'the', 'final', 'sentence']]

    # train model
    model = Word2Vec(sentences, min_count=1)

    # summarize the loaded model
    print(model)

    # summarize vocabulary (gensim 4+; gensim 3.x used model.wv.vocab)
    words = list(model.wv.key_to_index)
    print(words)

    # access vector for one word (go through model.wv; model['sentence'] no longer works in gensim 4)
    print(model.wv['sentence'])

    # save model
    model.save('model.bin')  # or .txt format is possible too

    # load model
    new_model = Word2Vec.load('model.bin')
Source but there are many tutorials online...
Explorable UMAP 2D Display of One Model, thanks to Plotly (local demo)
Based on a modified gist of Peter Baumgartner's that I posted here
A common illustration of word2vec models is to find similar terms:
What if we started at "polite," went to its closest word, which is "courteous," and then found that word's closest word, and then that word's closest word, and so on? So I made an app (local demo).
ex above from here
Word chain from "wind" in Gutenberg poetry word2vec model (based on Allison Parrish's corpus work here)
Vertical distance ~ distance in this model's vector space. "wind" and "mist" are farther apart than "rain" and "wind".
Size ~ how often this word shows up as a "most similar"; cycles are not allowed (we take the next-closest word instead).
Color: Pretty. (Also to suggest an order for reading left to right, along with the swoop)
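The chain rule described above (hop to the closest word, and take the next-closest if that word is already in the chain) can be sketched as below. The neighbor lists are invented and ordered closest-first; with a real model they would come from something like gensim's `model.wv.most_similar`. The function name `word_chain` is illustrative, not the app's code.

```python
def word_chain(start, ranked_neighbors, length):
    """Follow nearest neighbors from `start`, skipping words already used."""
    chain = [start]
    seen = {start}
    current = start
    while len(chain) < length:
        candidates = ranked_neighbors.get(current, [])
        # First neighbor not yet in the chain; None if all were used.
        next_word = next((w for w in candidates if w not in seen), None)
        if next_word is None:
            break  # dead end: every known neighbor is already in the chain
        chain.append(next_word)
        seen.add(next_word)
        current = next_word
    return chain


# Invented neighbor lists (closest first), for illustration only.
toy_neighbors = {
    "wind": ["rain", "mist", "storm"],
    "rain": ["wind", "mist", "storm"],
    "mist": ["rain", "fog", "wind"],
    "fog": ["mist", "rain"],
    "storm": ["wind", "rain"],
}
print(word_chain("wind", toy_neighbors, 5))
```

Without the `seen` check, "wind" and "rain" would simply bounce back and forth, since each is the other's closest word in this toy data.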
Since I made the first version... New public domain texts in January 2019!
- Short stories by Christie, Virginia Woolf, H.P. Lovecraft, Katherine Mansfield, and Ernest Hemingway
- Works by Jane Austen, D.H. Lawrence, Edith Wharton, Jorge Luis Borges, Mikhail Bulgakov, Jean Cocteau, Italo Svevo, Aldous Huxley, Winston Churchill, G.K. Chesterton, Maria Montessori, Lu Xun, Joseph Conrad, Zane Grey, H.G. Wells, and Edgar Rice Burroughs
- Poetry by Edna St. Vincent Millay, E.E. Cummings, William Carlos Williams, Rainer Maria Rilke, Wallace Stevens, Robert Frost, Sukumar Ray, and Pablo Neruda
"Believe" in the older model, pre-January 2019 public domain releases
"Believe", in the UPDATED poetry model. You can get to magazines!
Ok, while we're at it, let's use the Nerd2Vec model too...
in nerd2vec, we're fairly "rational" here.
What if we want to "mess with" fair use poetry....
Inspiration in found poems, blackout poems, cut up poems, and creative constraints...
A next level project, a human-in-the-algorithm-loop creativity tool:
Amelia was just fourteen and out of the orphan asylum;
at her first job—in the bindery, and yes sir, yes
ma'am, oh, so anxious to please.
She stood at the table, her blond hair hanging about
her shoulders, "knocking up" for Mary and Sadie,
the stichers ("knocking up" is counting books and
stacking them in piles to be taken away).
Color represents a normalized distance between the original word (random "nouny"*-words) and its next closest relative in the embedding model.
Blue: Closer, Pink: Further Away.
* Using RiTa.js in the browser; it's not great at POS tagging. I need to switch to a server-based spaCy.
Click on a colored word... pick a replacement from its nearest neighbors in the model....
and you edited the poem :)
Notice that in nerd2vec, "pond" is most similar to character names ("Amy Pond", Rory, etc). This is one limitation of these simple embedding models! And of that dataset for poetic contexts.
D3.js for color scale: Map from your domain of number values to your range - color start and end point.
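For readers who don't use D3, here is a rough Python stand-in for the idea of a linear color scale: normalize the distances to 0..1 (the domain), then interpolate between a start and an end color (the range). This is not the talk's JavaScript; the RGB values for blue and pink and the word distances are placeholders invented for illustration.

```python
def normalize(values):
    """Min-max scale a list of numbers into the range 0.0 to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # all values equal: map everything to the start
    return [(v - lo) / (hi - lo) for v in values]


def lerp_color(t, start_rgb, end_rgb):
    """Linearly interpolate between two RGB colors; t=0 is start, t=1 is end."""
    return tuple(round(s + (e - s) * t) for s, e in zip(start_rgb, end_rgb))


BLUE = (70, 130, 180)   # placeholder color for "closer"
PINK = (255, 105, 180)  # placeholder color for "further away"

# Invented distances between each word and its nearest neighbor.
distances = {"table": 0.2, "hair": 0.3, "books": 0.6}

scaled = normalize(list(distances.values()))
colors = {word: lerp_color(t, BLUE, PINK) for word, t in zip(distances, scaled)}
for word, rgb in colors.items():
    print(word, rgb)
```

Because the scale needs the minimum and maximum distance before any word can be colored, all the distances have to be known up front.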
To get the scores, I had to radically re-architect the project (to get the data in advance to get the distances to get the colors... to render).
App (with Dropdowns) Poem Line of the Poem Clickable Words Buttons Modal component ModalList component SelectList component
Call the API here for all words :( [promise hell]
Called the API originally here - 1 call
TBH: Not sure the colors as implemented currently are super-useful, though.
React Programming Turns Out To Be...
Basho Haiku are kind of fun :)
Project parts scored...
|Task||Thing I did||Rate ease||Rate interesting/ fun|
|1||Make word2vec models||Very easy||😃 Always fun|
|2||Stupidly decide to code the poem app in React||Easy to decide, very very hard learning curve for the number of interactions involved up and down the stack. My code is a mess.||😬|
|3||Change the stack to get the distances for all words as promises||Quite hard - updating after right point.||😢|
|4||Do the color scale on the buttons||Easy - D3 is an old friend||😃 Always fun, still fun in react.|
|5||Add the dropdowns to select model and poem||Ok, not bad. Some state issues...||Maybe seeing the point of react a little more-? Obviously hooks would help. 🤔|
related AI creativity work... human + models:
Write with Transformer
Example of me trying to work with the blue tool...
NVidia's GauGAN demo: http://nvidia-research-mingyuliu.com/gaugan
My crap drawing
Its weird photo realism
Find/Make/Use Cool Datasets
(and be clever about it.)
Finding fun datasets
- Jeremy Singer-Vine's Data is Plural mailing list: Archive of datasets here
- My giant pinboard links collection to "datasets": https://pinboard.in/u:arnicas/t:datasets/
- Make your own! Remix fair use content! Collect your own data!
Crown & Fathous
Burn Horse Hotel
Doss of Wulling of Stank
Shore’s Castle Crustle Hotel
The All House
Dean & Funtling
Old Hell Kick
Fripy Whee Bore Inn
Hogs Thee Inn
Shur Hiad House Hotel Hotel
Old Ash Ox Horse Inn
Darn Funk Inn
Of Oz the Wizard
For those of you who think this must have taken a lot of time: it didn't! I'm too lazy to have cut all this by hand. I wrote a couple apps that made disassembly go very quickly. The primary work took me and a friend about 20 hours.
My appreciation for the film increased enormously, mostly in a technical sense while disassembling. I saw and heard things I’d never seen before and which you would only see going frame by frame. I saw how much craft there is in the film.
Matt Bucy, interviewed
Anna Ridler's Tulips
"Ten thousand, or a myriad, of photos of tulips were taken over the course of tulip season and each has been categorised by hand, revealing the human aspect that sits behind machine learning."
Anna Ridler's MosaicVirus
Helena Sarin's GAN art
Poetry by Mary Oliver. Art by Helena (@glagolista)
API calls -
Don't be a jerk.
Give credit to your sources and inspirations... we all help each other.
And really don't be this guy:
"If I don't do it, someone else will do it in a year."
"DeepNude": the app that "undresses" women's pics
"Just a couple of days after taking the app down, the creator posted the code of the application to GitHub, saying that because the DeepNude application had already been replicated widely, there was no purpose in keeping the original code private, so they published it online."
"The algorithm only works with women, Alberto said, because images of nude women are easier to find online—but he's hoping to create a male version, too."
wild forks and branches
are way better
When you find yourself playing with what you made and smiling... well, that might be good enough.
in a weirdly popular tweet of mine
Shoutout thanks to...
- The @BoschBot for being willing to answer questions and add ugly numbers to their posts for future use
- Peter Baumgartner's gist for making a UMAP display with tooltips in Plotly quickly (I've updated it to be more general here)
- 3Top's word2vec api code
- Peter Gassner for help with React on Twitter.
- This person Matiasdahl who has helped a lot of people figure out how to import UMAP in python when it errors:
More writing about AI Art & Data Vis on Medium: https://medium.com/@lynn_72328
Slides up on ghostweather.slides.com!
Getting Your Data Joie de Vivre (Back)
By Lynn Cherny
A keynote talk for EuroPython 2019 and PyData London 2019 on fun data apps and hacks: some data vis, React, word2vec vis, poetry editing tools, and AI art.