Or, trying to get some of those.
November 2019 (Micro Meso Macro conference in Lyon)
It's also one of the reasons the models seem like "black boxes."
Each word is represented by a single vector (a row in a matrix) -- we find similar words (similar by context) by finding similar vectors.
CBOW architecture for word2vec, from link
going from many dimensions
into small N-dimensional "shorthand"
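The "similar words = similar vectors" idea can be sketched with plain numpy and cosine similarity. The tiny 3-dimensional vectors below are invented for illustration (real word2vec vectors are learned, and much bigger):

```python
import numpy as np

# Toy 3-d "embeddings" -- invented for illustration, not real word2vec output.
vectors = {
    "wind":   np.array([0.9, 0.1, 0.2]),
    "breeze": np.array([0.8, 0.2, 0.1]),
    "stone":  np.array([0.1, 0.9, 0.7]),
}

def cosine_sim(a, b):
    # Cosine of the angle between two vectors: 1.0 = same direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(word):
    # The most similar word is the one with the highest cosine similarity,
    # excluding the query word itself.
    others = [w for w in vectors if w != word]
    return max(others, key=lambda w: cosine_sim(vectors[word], vectors[w]))

print(nearest("wind"))  # "breeze" -- its toy vector points the same way
```

With a real model you'd use a library lookup (e.g., gensim's `most_similar`) instead of this brute-force loop, but the geometry is the same.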
Word chain of nearest neighbors from "wind" in a Gutenberg poetry word2vec model (based on Allison Parrish's poetry corpus work here)
Color represents a normalized distance between the original word (random "nouny"*-words) and its next closest relative in the embedding model.
Blue: Closer, Pink: Further Away.
Click on a colored word... pick a replacement from its nearest neighbors in the model....
and you've edited the poem :)
also, you want lots of w2v models to make it interesting (and there are still grammar issues all over)
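The word-chain itself can be sketched as a greedy walk: hop from each word to its nearest not-yet-visited neighbor, recording the hop distance that drives the blue-to-pink coloring. This is a minimal sketch over a toy vector dict (the real demo walks a Gutenberg-poetry word2vec model):

```python
import numpy as np

# Toy 2-d vectors standing in for a word2vec model (invented for illustration).
vectors = {
    "wind":  np.array([0.9, 0.1]),
    "storm": np.array([0.8, 0.3]),
    "rain":  np.array([0.6, 0.5]),
    "sea":   np.array([0.2, 0.9]),
}

def chain(start, steps):
    """Greedy chain: hop to the nearest not-yet-visited word each time.

    Returns (word, distance-to-previous-word) pairs; normalizing that
    distance is what maps onto the blue/pink color scale in the demo.
    """
    visited, out = {start}, [(start, 0.0)]
    current = start
    for _ in range(steps):
        candidates = [w for w in vectors if w not in visited]
        if not candidates:
            break
        nxt = min(candidates,
                  key=lambda w: np.linalg.norm(vectors[current] - vectors[w]))
        out.append((nxt, float(np.linalg.norm(vectors[current] - vectors[nxt]))))
        visited.add(nxt)
        current = nxt
    return out

print(chain("wind", 3))  # wind -> storm -> rain -> sea, with hop distances
```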
A standard vector size for word2vec is 200 dimensions.
This example is teeny: only 3 dimensions.
UMAP layout of w2v of Allison Parrish's Gutenberg Poetry Corpus, color-dated by author death year
This blob is archaic language, or Latin.
Led me to refine the data
*A horrible process of merging metadata from Gutenberg, using authors' death dates as a proxy, some line labeling, and making a Char-RNN model + post-processing heuristics to identify poetry lines. Yes, this colormap is awful, but it was fine for exploratory work.
Histogram of mean years
the word "thus" by mean year
Generative models learn latent variables, "z's", which describe the data they are trained on. You can then "sample" and generate new data from their spaces, rather than just reconstruct the original training data!
The VAE learns a probability distribution (with a mean and standard deviation) for various, possibly mysterious, dimensions:
The actual encoding will vary each time! We are sampling from the distributions.
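That sampling step is usually written via the reparameterization trick: z = mu + sigma * eps, with eps drawn fresh from a standard normal each time. A sketch with made-up numbers (a real VAE encoder learns mu and sigma per input):

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend the encoder produced these for one input line (toy values).
mu    = np.array([0.5, -1.0, 2.0])   # learned means
sigma = np.array([0.1,  0.3, 0.5])   # learned standard deviations

def sample_z(mu, sigma):
    # Reparameterization: z = mu + sigma * eps, with eps ~ N(0, 1).
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

z1 = sample_z(mu, sigma)
z2 = sample_z(mu, sigma)
# Same input, two different encodings -- which is why the same sentence
# decodes into different poem lines on each pass.
print(z1, z2)
```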
So, if I sample my modern(ish) poetry model with the same input sentence 4 times, I can get very different looking decoded results:
greedy(['The forest was dark.', 'The forest was dark.', 'The forest was dark.', 'The forest was dark.'])
['And timeless eyes', 'The little little crucifix', 'And then the angel dies.', 'The little sparrows']
UMAP failed to notice anything structured. Many, many times.
I've got a new model, which seemed to train well (it took a few days though).
I can "inspect" it by looking at how it encodes the training data, right?
I've got a lot of data, of 64 latent dimensions. What does it look like?
On an earlier model, I tried looking at the shape of the z dimensions I encoded, and saw a bunch of variation* (also, it's 64 distributions so it's a mess to engage with).
* Done with Vaex, which was very fast at making this graph.
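The per-dimension shapes can also be summarized numerically, without a plotting library. A sketch, assuming the encodings sit in an (n_samples, 64) array (the data here is synthetic, with one dimension deliberately shifted so something stands out):

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-in for encoded training data: 1000 samples x 64 latent dims.
z = rng.standard_normal((1000, 64))
z[:, 33] += 2.0   # give one dimension an offset so it stands out

# Boxplot-style summary per dimension: quartiles and median.
q1, med, q3 = np.percentile(z, [25, 50, 75], axis=0)

# Rank dimensions by how far their median sits from zero.
ranked = np.argsort(-np.abs(med))
print("most shifted dims:", ranked[:3])
```

This is the tabular version of scanning 64 boxplots by eye: the shifted dimension jumps to the top of the ranking.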
Altair boxplots and strip plots in Jupyter notebooks... with working mouseovers -- ok, maybe some interesting differences?!
I borrowed someone's code* for an interactive parallel coordinates chart, to play with the data a bit....
And then I thought of looking at the means, instead of the sampled z's.
Using the means of a training sample, I see some coherent patterns I didn't see before...
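The difference between plotting sampled z's and plotting the means, sketched with toy numbers: the mean is a stable coordinate per input, while each sample jitters around it by roughly sigma, which can wash out cluster structure.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy encoder output for 5 inputs in 2 latent dims: two tight clusters.
mu    = np.array([[0.0, 0.0], [0.1, 0.0], [3.0, 3.0], [3.1, 3.0], [3.0, 3.1]])
sigma = np.full_like(mu, 0.8)

# One noisy sample per input, via z = mu + sigma * eps.
samples = mu + sigma * rng.standard_normal(mu.shape)

# The means separate the two clusters by a fixed, repeatable distance...
spread_mu = np.linalg.norm(mu[0] - mu[2])
# ...while sampled points land anywhere within ~sigma of their means,
# blurring the gap between clusters on every redraw.
spread_sample = np.linalg.norm(samples[0] - samples[2])
print(spread_mu, spread_sample)
```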
This selection of z33 high values shows some repeated curve patterns, and a lot of single word lines.
This first line, highlighted, is long and texty. Notice how it's different on a few z dimensions.
This matplotlib plot of t-SNE on the training means (using nptsne, which is quite fast) shows structure in the latent dimensions.
Time to switch to D3 and fully interactive tools...
Added a few orientation points for the encoded texts... (demo)
Find your click point's text -- it will encode and decode from there and build your poem. You can delete lines you don't like.
(built with D3 + a t-SNE layout of random sample data + an API to call the model)
But: is the original text better than the reconstructed? There are still a lot of disfluencies and cross-line problems.
by Damien Newman, source
I spent a few weeks here, just trying to understand this model and data.