by Lynn Cherny
@arnicas@mstdn.social / @arnicas
A talk for #normconf!
15 Dec 2022
PRESS SPACE TO ADVANCE TO ALL SLIDES!
I might go fast-ish... slides/code will live on :)
examples of text2image art on Lexica today
Useful for exploratory analysis, often for cleaning, and for weak-supervision labeling. And it impresses clients.
All embeddings are useful. Word2vec, for instance: what's the vocabulary doing?
And make it interactive.
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-distilroberta-v1')
X = model.encode(bodyprompts)  # bodyprompts: your list of prompt strings
Your embedding - use what you like here
Your UMAP (dim reduction, see umap-learn, by Leland McInnes)
from umap import UMAP
umap = UMAP()
Xtfm = umap.fit_transform(X)
import pandas as pd

# put in a df to use in the plot
bdf = pd.DataFrame()
bdf['x'] = Xtfm[:, 0]
bdf['y'] = Xtfm[:, 1]
bdf['text'] = bodyprompts
gets you a 2D matrix:
x & y coords
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure, output_file, output_notebook, show
# to display in a notebook - after the loads above
output_notebook()
# plain tooltips on hover
TOOLTIPS = [
("text", "@text"),
]
# or use html to make them more readable
TOOLTIPS = """
<div style="max-width: 400px; word-wrap: break-word;">
<span style=" font-weight: bold;">👉 @text</span>
</div>
"""
# output_file("body_prompts.html", title="Body Prompts, A Sample") # uncomment to make an html file
source = ColumnDataSource(bdf)
p = figure(title="Body Prompts from Stable Diffusion Sample",
           x_range=(-5, 20), y_range=(-5, 12), width=900, height=800, tooltips=TOOLTIPS)
p.scatter(x='x', y='y', size=3, source=source, alpha=0.8)
p.xaxis[0].axis_label = 'X'
p.yaxis[0].axis_label = 'Y'
show(p)
yes, it's small - see slides and repo
You can also use html in the tooltips if you want!
Other examples: text2image prompts containing the word "blood"
Then labeled carefully in Photoshop for client
the blood prompts, crappily labeled
nomic.ai's (i.e., Ben Schmidt's) map of
Vincent Warmerdam's bulk tool supports selecting from a umap in order to help with labeling efforts
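A minimal sketch of feeding the UMAP dataframe from the earlier slide into bulk (this assumes bulk's text interface and its expected text/x/y columns; check the bulk README for current usage):
# export the dataframe built on the UMAP slide
bdf[["text", "x", "y"]].to_csv("ready.csv", index=False)
# then, from the shell:  python -m bulk text ready.csv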
ThisNotThat from Leland McInnes - maybe better for data vis artifact creation?
(((((1950s castle. muted colors.))))) by Jean-Baptiste Monge !!!!!!!!!!!!!!!!!!!!!!!!!!!
surreal isometric platform video game titled'the castle of furious hunger'created by max ernst and hieronymous bosch, detailed design, octane render unreal engine 8 k
like, artist names.
[('greg rutkowski', 23773), ('artgerm', 17064), ('alphonse mucha', 14913), ('Greg Rutkowski', 4079), ('ilya kuvshinov', 2818), ('makoto shinkai', 2234), ('magali villeneuve', 2022), ('Artgerm', 1999), ('Alphonse Mucha', 1828), ('james jean', 1614), ('peter mohrbacher', 1533), ('wlop', 1465), ('donato giancola', 1380), ('ross tran', 1287), ('krenz cushart', 1180), ('Stanley Artgerm Lau', 1025), ('charlie bowater', 1017), ('ruan jia', 1005), ('James Jean', 952), ('Ilya Kuvshinov', 893), ('tomasz alen kopera', 877), ('dan mumford', 866), ('boris vallejo', 854), ('marc simonetti', 850), ('joseph christian leyendecker', 849)]
FYI and btw: No wikipedia page for "Artgerm" and many of the more common artists in use. Relevant for linking discussions.
Here using 80K Stable Diffusion prompts available here (by Gustavosta via HuggingFace).
Run through the spaCy lib's bert-based NER (named entity recognition); totals for "PERSON" entities found:
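A minimal sketch of how a tally like the one shown could be produced (the model choice and variable names here are my placeholders, not necessarily what was used):
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_trf")  # or en_core_web_sm for a quicker, rougher pass
prompts = ["portrait of a woman by greg rutkowski and alphonse mucha, artstation"]  # swap in the 80K prompts
person_counts = Counter()
for doc in nlp.pipe(prompts, batch_size=64):
    person_counts.update(ent.text for ent in doc.ents if ent.label_ == "PERSON")
print(person_counts.most_common(25))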
What's your goal?
Do you care that masto "tooters" are using both #emnlp2022 and #emnlp, for instance?
Do you want to know the "truth" (what people are really typing) or do you want to know what artists are inspiring people?
What is truth?
For example papers summarizing these efforts, see this section of links in the repo.
left_name is the original string; right_name is the label it assigned to group them (one crude way to produce groupings like these is sketched after the examples below).
'Kanye West': ["kanye west's", 'a kanye west', 'kanye west'],
'Rolf Armstrong': ['rolf armstrong', 'rolf armstrong.'],
'Tooth Wu and wlop and beeple': ['tooth wu and wlop and beeple and greg rutkowski','tooth wu and wlop and beeple']
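A crude normalized-key grouping sketch (not necessarily how the output above was made): lowercase, strip possessives, punctuation, and leading articles, then group the originals by key.
import re
from collections import defaultdict

def norm_key(name: str) -> str:
    key = name.lower().strip()
    key = re.sub(r"^(a|an|the)\s+", "", key)   # drop leading articles
    key = re.sub(r"'s\b", "", key)             # drop possessives
    key = re.sub(r"[^\w\s]", "", key).strip()  # drop stray punctuation
    return key

names = ["kanye west's", "a kanye west", "kanye west", "rolf armstrong", "rolf armstrong."]
groups = defaultdict(list)
for n in names:
    groups[norm_key(n)].append(n)
print(dict(groups))
# {'kanye west': ["kanye west's", 'a kanye west', 'kanye west'], 'rolf armstrong': ['rolf armstrong', 'rolf armstrong.']}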
Reviewing surfaces some errors in the NER results, related to token boundaries
this is multiple artists, too - bad NER
For reporting: the most frequent variant is used, and the varied forms are then combined.
Not all are artists, btw: sometimes they're subjects of the image (needs another recognition model, making this a span categorizer + NER problem)
Management of variants can be done by hand, too, for big string diffs that share an entity id: "PWC" == "Price Waterhouse Cooper"
Fundamentally, this is a time-intensive problem, requiring good UI, tools, and management of reference dictionaries.
| Text | Issue |
|---|---|
| "Arrest of Dairo Antonio Usuga David, alias 'Otoniel' or 'Mauricio'" | The entire article then refers to him as "Otoniel" - what does the client want to norm this to? Is there a search engine? |
| "James Bond" and "Bond" | Count as the same for the client, and increment mentions of "James Bond" |
| "the Karic clan" | An ORG now, despite Fred Karic etc. |
| "the Trump-administration" | Also an ORG |
(relevant: dirty_cat, for sklearn models on unclean text data; FlashText, for very fast replacement of variants with a normalized form)
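A quick FlashText sketch of that variant-to-norm replacement idea (the variant strings here are just illustrations):
from flashtext import KeywordProcessor

kp = KeywordProcessor(case_sensitive=False)
kp.add_keyword("Price Waterhouse Cooper", "PWC")   # variant -> normalized form
kp.add_keyword("PriceWaterhouseCoopers", "PWC")
print(kp.replace_keywords("She audits for Price Waterhouse Cooper in London."))
# -> "She audits for PWC in London."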
Note: This is JUST the deduping part of entity resolution, which is a bigger problem :)
in combo with NLP (a SpaCy execution focus)
You defined your "variants" somehow. Now you can have multiple patterns/rules for entities that resolve to the same id, using the entity ruler.
from spacy.lang.en import English
nlp = English()
ruler = nlp.add_pipe("entity_ruler")
patterns = [{"label": "ORG", "pattern": "Apple", "id": "apple"},
{"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}], "id": "san-francisco"},
{"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "fran"}], "id": "san-francisco"}]
ruler.add_patterns(patterns)
doc1 = nlp("Apple is opening its first big office in San Francisco.")
print([(ent.text, ent.label_, ent.ent_id_) for ent in doc1.ents])
# [('Apple', 'ORG', 'apple'), ('San Francisco', 'GPE', 'san-francisco')]
doc2 = nlp("Apple is opening its first big office in San Fran.")
print([(ent.text, ent.label_, ent.ent_id_) for ent in doc2.ents])
# [('Apple', 'ORG', 'apple'), ('San Fran', 'GPE', 'san-francisco')]  <- both variants resolve to the same id
Your mission:
Write a nsfw (porn) classifier for text prompts. You only have ~8 million input text strings to work with and no good labels. Also the entities in this data are poorly recognized "off the shelf" and people are adversarial.
(Assume there is no image classifier as part of the problem.)
insert meme about trying lots of things, including rule-based labeling, umap and label, reviewing blocked word data and labeling, etc.
But hey, one class of "blocked words" created by mods was ... not a bad idea! for simple rule-based classification.
And this class of examples was overlooked by a few of the other public "cleaned up" prompt datasets!
Yes, it's porn stars.
Training data is image + caption/alt-text. What do you think is in the photos of porn stars?
The goal in text2image is to get a picture out. The text itself may not be porny (or gory, or other NSFW) but the output might be and people figure out how to get what they want, and a lot want porn.
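A tiny sketch of what that rule-based labeling can look like (the blocked-word list and example prompt here are placeholders, not the real moderator list):
blocked_words = {"blockedterm1", "blockedterm2"}  # stand-in for the mod-curated list

def weak_label(prompt: str) -> str:
    # label a prompt "nsfw" if any blocked word appears as a token
    tokens = set(prompt.lower().split())
    return "nsfw" if tokens & blocked_words else "unknown"

print(weak_label("portrait of blockedterm1 by artgerm"))  # -> "nsfw"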
If you can find or create a gazetteer, why not? One way to get entity data might be wikipedia.
A "gazetteer" in NLP is a list of entity strings.
This is reflected in the content, which is heavily male-focused. As The Guardian reported in 2014, the entries for women porn stars outnumbered the entries for women writers; that ratio has changed since then, but only due to the concerted efforts of women editors.
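If you do end up with a gazetteer as a plain list of names, you can turn it into entity-ruler patterns programmatically instead of writing them by hand (a sketch; the names are the same two used in the pipeline code below):
gazetteer = ["lanna rhoades", "mia khalifa"]  # your researched list of names
patterns = [
    {"label": "PORN_PERSON", "pattern": [{"LOWER": tok} for tok in name.split()]}
    for name in gazetteer
]
# then ruler.add_patterns(patterns), exactly as in the pipeline code below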
Pipeline goal:
1. Trained classifier for NSFW content
2. List of porn names as "rules"
3. Override the text classification if a porn name comes up, class it as "nsfw". Best to be safe out there.
import spacy
from spacy.language import Language

@Language.component("reset_textcat")
def reset_textcat(doc):
    # if a rule-matched porn name is present, force the nsfw (POSITIVE) class
    for ent in doc.ents:
        if ent.label_ == "PORN_PERSON":
            doc.cats = {'POSITIVE': 1, 'NEGATIVE': 0}
    return doc

nlp = spacy.load('en_core_web_trf')
mytextcat = spacy.load("samplemodel/model-best/")  # load a trained textcat model
nlp.add_pipe("textcat", source=mytextcat)  # the base trained model goes before the entity ruler, which overrides it
# the entity ruler is a set of porn names you've researched - 2 examples:
ruler = nlp.add_pipe("entity_ruler", after="ner", config={"overwrite_ents": True})
patterns = [{"label": "PORN_PERSON", "pattern": [{"LOWER": "lanna"}, {"LOWER": "rhoades"}]},
            {"label": "PORN_PERSON", "pattern": [{"LOWER": "mia"}, {"LOWER": "khalifa"}]}]
ruler.add_patterns(patterns)  # don't forget to add the rules
nlp.add_pipe("reset_textcat", last=True)  # add your override component!
doc = nlp("Having coffee with Mia Khalifa in the coffe shop in London, style of Alphonse Mucha")
print("nsfw classification:", doc.cats)
for ent in doc.ents:
print(ent, ent.label_)
nsfw classification: {'POSITIVE': 1, 'NEGATIVE': 0}
Mia Khalifa PORN_PERSON
London GPE
Alphonse Mucha PERSON
doc = nlp("Having coffee with Mia Farrow in a coffe shop in London, style of Artgerm")
{'POSITIVE': 0.33414962887763977, 'NEGATIVE':0.6658503413200378} Mia Farrow PERSON London GPE Artgerm PERSON
Model options: entity recognition fails with the spaCy default transformer model on this string:
{'text': 'steampunk market interior, colorful, 3 d scene, greg rutkowski, zabrocki, karlkka, jayison devadas, trending on artstation, 8 k, ultra wide angle, zenith view, pincushion lens effect', 'persons': []},
Better with bert-base-NER-uncased model via HF demo:
nlp = spacy.load("en_coreference_web_trf")
nlp.add_pipe("combine_corefs")
nlp.add_pipe("transformers-ner")
nlp.add_pipe("combine_acronym_parens", after="transformers-ner")
nlp.initialize()
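To sanity-check that model outside spaCy first, the plain transformers pipeline works (a sketch; assuming the checkpoint meant is dslim/bert-base-NER-uncased on the HF Hub):
from transformers import pipeline

ner = pipeline("ner", model="dslim/bert-base-NER-uncased", aggregation_strategy="simple")
text = ("steampunk market interior, colorful, 3 d scene, greg rutkowski, zabrocki, "
        "karlkka, jayison devadas, trending on artstation")
for ent in ner(text):
    print(ent["entity_group"], ent["word"], round(float(ent["score"]), 3))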
His "Rules in NLP" blog post; note: I'm skipping some slides on weak supervision, they will be in the deck.
You want a hard rule in this classification context, and models can sometimes be fragile or be tricked.
People can sometimes circumvent the shape of the training data -- e.g., putting a porn actress in a list of artist names might cause a model trained on "standard" porn-name contexts to fail.
(☺️ I do not work for them.)
Skweak lib:
"Weak supervision shifts the focus away from collecting manual annotations and concentrates the effort on developing good labelling functions for the target domain." (their paper)
"The purpose of weak supervision is precisely to combine together a set of weaker/noisier supervision signals, leading to a form of denoising."
Create an HMM "ensemble" of signals.
"Beyond hand-labeling: Classical hand labeling workflows are costly and inefficient, but having humans-in-the-loop is essential. Easily combine hand-labeling with active learning, bulk-labeling, zero-shot models, and weak-supervision in novel data annotation workflows."
Possibly too much for talk time inclusion, but check them out, they've all saved me a lot of time recently!
light-the-torch (ltt): an incredibly useful pip solution to your torch / cuda version installation issues on your random VM
(normconf) root@3005807510af:/files# ltt install torch torchvision
Collecting torch
Downloading https://download.pytorch.org/whl/cu111/torch-1.9.1%2Bcu111-cp39-cp39-linux_x86_64.whl (2041.3 MB)
it figures out wtf version to get for you!
Link - Out of beta now?
You have to be quite specific in your search terms, but can specify language too.
With it I was able to find code examples of integrating non-obvious transformer models into SpaCy in smart ways
(e.g., MS's Presidio project, a lib for de-identifying PII in data)
Bookmark this page on UI tips for the query editor. I have to look for it every damn time.
I like to post links to things: @arnicas@mstdn.social
My newsletter is fun links and recs: arnicas.substack.com
repo (in progress) link: https://github.com/arnicas/nlp-tips-and-tricks