NLP Tips & Tricks

by Lynn Cherny

@arnicas@mstdn.social / @arnicas

A talk for #normconf!

15 Dec 2022

Repo link: https://github.com/arnicas/nlp-tips-and-tricks

Slides https://ghostweather.slides.com/lynncherny/nlp-tips-tricks

PRESS SPACE TO ADVANCE TO ALL SLIDES!

A bit about me

Data science consultant, especially NLP now
Former data viz / analytics / professor...
I do a lot with creative AI: Google Arts & Culture, other clients

The Plan

UMAP it! (Data vis)
Entities and Duplicates (Norming ents)
Rules and NLP models (lots o' spaCy)
Misc Life (in NLP) Tips

I might go fast-ish... slides/code will live on :)

examples of text2image art on Lexica today

UMAP It!

Useful for exploratory analysis, often for cleaning, as well as weak supervision labeling. And impresses clients.

All embeddings are useful: word2vec, for instance: What's the vocabulary doing?

corpus link

You can also get an embedding for entire sentences...

And make it interactive.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-distilroberta-v1') X = model.encode(bodyprompts)

Your embedding - use what you like here

Your UMAP (dim reduction, see umap-learn, by Leland McInnes)

from umap import UMAP

umap = UMAP() Xtfm = umap.fit_transform(X)

# put in a df to use in plot

bdf = pd.DataFrame() bdf['x'] = Xtfm[:, 0] bdf['y'] = Xtfm[:, 1] bdf['text'] = bodyprompts

gets you a 2D matrix:

x & y coords

from bokeh.models import ColumnDataSource from bokeh.plotting import figure, output_file, output_notebook, show

# to display in a notebook - after the loads above output_notebook()

# plain tooltips on hover TOOLTIPS = [ ("text", "@text"), ]

# or use html to make them more readable TOOLTIPS = """ <div style="max-width: 400px; word-wrap: break-word;"> <span style=" font-weight: bold;">👉 @text</span> </div> """

# output_file("body_prompts.html", title="Body Prompts, A Sample") # uncomment to make an html file

source = ColumnDataSource(bdf)

p = figure(title='"Body Prompts from Stable Diffusion Sample', x_range=(-5,20), y_range=(-5, 12), width=900, height=800, tooltips=TOOLTIPS) p.scatter(x='x', y='y', size=3, source=source, alpha=0.8) p.xaxis[0].axis_label = 'X' p.yaxis[0].axis_label = 'Y'

show(p)

yes, it's small - see slides and repo

You can also use html in the

tooltips if you want!

Other examples: text2image prompts containing the word "blood"

Then labeled carefully in Photoshop for client

the blood prompts, crappily labeled

A bigger one, in the wild

nomic.ai's (i.e., Ben Schmidt's) map of

krea.ai's Stable Diffusion search engine

Related libs

Vincent Warmerdam's bulk tool supports selecting from a umap in order to help with labeling efforts

ThisNotThat from Leland McInnes -

maybe better for data vis artifact creation?

Deduping and Entity Resolution

Suppose you want to see trends in searches/uses of an entity...

Lexica link

(((((1950s castle. muted colors.))))) by Jean-Baptiste Monge !!!!!!!!!!!!!!!!!!!!!!!!!!!

Lexica link

surreal isometric platform video game titled'the castle of furious hunger'created by max ernst and hieronymous bosch, detailed design, octane render unreal engine 8 k

like, artist names.

[('greg rutkowski', 23773),
 ('artgerm', 17064),
 ('alphonse mucha', 14913),
 ('Greg Rutkowski', 4079),
 ('ilya kuvshinov', 2818),
 ('makoto shinkai', 2234),
 ('magali villeneuve', 2022),
 ('Artgerm', 1999),
 ('Alphonse Mucha', 1828),
 ('james jean', 1614),
 ('peter mohrbacher', 1533),
 ('wlop', 1465),
 ('donato giancola', 1380),
 ('ross tran', 1287),
 ('krenz cushart', 1180),
 ('Stanley Artgerm Lau', 1025),
 ('charlie bowater', 1017),
 ('ruan jia', 1005),
 ('James Jean', 952),
 ('Ilya Kuvshinov', 893),
 ('tomasz alen kopera', 877),
 ('dan mumford', 866),
 ('boris vallejo', 854),
 ('marc simonetti', 850),
 ('joseph christian leyendecker', 849)]

FYI and btw: No wikipedia page for "Artgerm" and many of the more common artists in use. Relevant for linking discussions.

Here using 80K Stable Diffusion prompts available here (by Gustavosta via HuggingFace).

Run through spaCy lib's bert-based NER (named entity resolution), totals for "PERSON"s found:

UI / usage questions

What's your goal?

Do you care that masto "tooters" are using both #emnlp2022 and #emnlp, for instance?

Do you want to know the "truth" (what people are really typing) or do you want to know what artists are inspiring people?

What is truth?

For example papers summarizing efforts, see this repo section of links.

- District Data labs slides (Benjamin Bengfort)

Misspellings or ortho variants found: "Leonardo davinci", "Leonardo Da Vinci", "Leonorda da Vinc"
Name or label variants that are all "true" but different: "Mrs. Smith", "Mrs. John Smith", "Pamela Smith, née Austin"
Misrecognitions by the model: "Mary's"
Coreference: "Barry the Bat was a a fun guy. Barry partied all night!" ☞ "Barry the Bat" and "Barry"

Entity Resolution Issues You Might Have after NER

https://github.com/Bergvca/string_grouper

One useful deduplication tool for string similarity...

left_name is original, right_name is label it gave to group them.

'Kanye West': ["kanye west's", 'a kanye west', 'kanye west'],

'Rolf Armstrong': ['rolf armstrong', 'rolf armstrong.'],

'Tooth Wu and wlop and beeple': ['tooth wu and wlop and beeple and greg rutkowski','tooth wu and wlop and beeple']

Reviewing surfaces some errors in the NER results, related to token boundaries

Text

this is multiple artists, too - bad NER

Biggest "Groups"

Reporting most often

varied and then combined

Not all artists, btw: sometimes subjects of the image (needs another recognition model, making a span categorizer + ner)

Management of variants can be done by hand, too, for big string diffs but equivalent entity id's: "PWC" == "Price Waterhouse Cooper"

Some reasons you end up "by hand" a lot.

James Jean != James Dean. But they sure are close.
Also many names and surname/common name order.
- Cushart Krenz == Krenz Cushart (some algs will find these for you though)
PWC / Price Waterhouse Cooper - not string similar. (Abbreviations/aliases in general are a big problem.)
Finally: What do you want the normalized/canonicalized form to be? You may have to do research or consult domain experts. E.g., you may have style guides for names (Alexei vs. Alexy vs. Aleksi...), special rules for norm forms, level and types of entity resolution needed...

Fundamentally, this is a time-intensive problem, requiring good UI, tools, and management of reference dictionaries.

Other real world examples

How do they want things to appear in the UI/search engine? Or be counted? What's it being used for??

Text	Issue
"Arrest of Dairo Antonio Usuga David, alias 'Otoniel' or 'Mauricio'"	entire article refers to him as "Otoniel" then - what does the client want to norm this to? is there a search engine?
"James Bond" and "Bond"	count as same for the client, and increment mentions of "James Bond"
"the Karic clan"	an ORG now, despite Fred Karic etc.
"the Trump-administration"	also an ORG

UI Tool Approach: Open Refine

Link

Recs for using OR on string clustering

Duplicate your original entity column, if you want to retain "mappings" to the norm form
Use the "transform" option till you're done... but keep in mind the choice of norm form is itself difficult/prone to needing deduping!
Then extract the mappings to make an entity id link component (e.g., in spaCy) or a normalization dict for post-processing entities after recognition

Other deduplication resources

csvdedupe - an interface to the python dedupe lib, for record merging
ML-based (i.e., needing labels and training) methods like Zingg offers, or spaCy trained linkers
Libs that combine fast big data methods like Bloom Filters and Locality Sensitive Hashing, e.g. text-dedup
cargo, a rust thing, for finding ExactSubstrings quickly in LLM training data
Code used in the Big Science HF LLM model project

(relevant: dirty_cat, for sklearn models on unclean text data

FlashText, for very fast replacement of variants to norm form)

Note: This is JUST the deduping part of entity resolution, which is a bigger problem :)

Rules and Gazettes and Augmented NLP Pipelines

in combo with NLP (a SpaCy execution focus)

Entities with ids, in spaCy

Text

You defined your "variants" somehow: Now.... you can have multiple patterns/rules for entities that resolve to the same id, using the entity ruler.

from spacy.lang.en import English

nlp = English() ruler = nlp.add_pipe("entity_ruler") patterns = [{"label": "ORG", "pattern": "Apple", "id": "apple"}, {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}], "id": "san-francisco"}, {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "fran"}], "id": "san-francisco"}] ruler.add_patterns(patterns)

doc1 = nlp("Apple is opening its first big office in San Francisco.") print([(ent.text, ent.label_, ent.ent_id_) for ent in doc1.ents])

doc2 = nlp("Apple is opening its first big office in San Fran.")
print([(ent.text, ent.label_, ent.ent_id_) for ent in doc2.ents])

See docs, and see videos by Sophie Van Landeghem (e.g. here) on training entity linking/disambiguation with wikidata id's (but you can use your own)

Using entity rules with a text classifier (in spaCy)

Your mission:

Write a nsfw (porn) classifier for text prompts. You only have ~8 million input text strings to work with and no good labels. Also the entities in this data are poorly recognized "off the shelf" and people are adversarial.

(Assume there is no image classifier as part of the problem.)

insert meme about trying lots of things, including rule-based labeling, umap and label, reviewing blocked word data and labeling, etc.

But hey, one class of "blocked words" created by mods was ... not a bad idea! for simple rule-based classification.

And this class of examples was overlooked by a few of the other public "cleaned up" prompt datasets!

Yes, it's porn stars.

Digression on why that is.

Training data is image + caption/alt-text. What do you think is in the photos of porn stars?

The goal in text2image is to get a picture out. The text itself may not be porny (or gory, or other NSFW) but the output might be and people figure out how to get what they want, and a lot want porn.

If you can find or create a gazetteer, why not? One way to get entity data might be wikipedia.

A "gazetteer" in NLP is a list of entity strings.

This reflects in the content, which is heavily male, .... As The Guardian reported in 2014, the entries for women porn stars numbered more than the entries for women writers — a ratio that has changed since then, but only due to the concerted efforts of women editors."

--Inside the Fight to Change Wikipedia's Gender Problem

Why yes, it's not bad at that.

Creative mods of pipelines in spaCy is... sometimes non-trivial.

Pipeline goal:

1. Trained classifier for NSFW content

2. List of porn names as "rules"

3. Override the text classification if a porn name comes up, class it as "nsfw". Best to be safe out there.

The Override: Define a custom pipeline component to set the doc classification if you get a porn name

from spacy.language import Language

@Language.component("reset_textcat") def reset_textcat(doc): for ent in doc.ents: if ent.label_ == "PORN_PERSON": doc.cats = {'POSITIVE': 1, 'NEGATIVE': 0} return doc

Pipeline Modification Code

nlp = spacy.load('en_core_web_trf')

mytextcat = spacy.load("samplemodel/model-best/") # load a trained textcat model

nlp.add_pipe("textcat", source=mytextcat) # the base trained model goes before the entity ruler, which overrides it.

# the entity ruler is a set of porn names you've researched - 2 examples:

ruler = nlp.add_pipe("entity_ruler", after="ner", config = {"overwrite_ents": True})

patterns = [{"label": "PORN_PERSON", "pattern": [{"LOWER": "lanna"}, {"LOWER": "rhoades"}]}, {"label": "PORN_PERSON", "pattern": [{"LOWER": "mia"}, {"LOWER": "khalifa"}]}] ruler.add_patterns(patterns) # don't forget to add the rules

nlp.add_pipe("reset_textcat", last=True) # add your override componenent!

doc = nlp("Having coffee with Mia Khalifa in the coffe shop in London, style of Alphonse Mucha")

print("nsfw classification:", doc.cats) for ent in doc.ents: print(ent, ent.label_)

porn classification: {'POSITIVE': 1, 'NEGATIVE': 0}
Mia Khalifa PORN_PERSON
London GPE
Alphonse Mucha PERSON

doc = nlp("Having coffee with Mia Farrow in a coffe shop in London, style of Artgerm")

{'POSITIVE': 0.33414962887763977, 'NEGATIVE':0.6658503413200378}
Mia Farrow PERSON
London GPE
Artgerm PERSON

Model Options: Entity rec fail with spacy default transformer model on this string:

{'text': 'steampunk market interior, colorful, 3 d scene, greg rutkowski, zabrocki, karlkka, jayison devadas, trending on artstation, 8 k, ultra wide angle, zenith view, pincushion lens effect',
  'persons': []},

Better with bert-base-NER-uncased model via HF demo:

So, more silly spaCy tricks: see the github

nlp = spacy.load("en_coreference_web_trf")

nlp.add_pipe("combine_corefs")

nlp.add_pipe("transformers-ner")

nlp.add_pipe("combine_acronym_parens", after="transformers-ner")

nlp.initialize()

An oldie but goodie from Peter Baumgartner (also a speaker :)

His Rules in NLP blog post; note, skipping some slides on weak supervision, will be in the deck.

Write some rules.
Match on docs.
Train a textcat model using the rule hits: weak supervision
Review the errors - FN and FP!
Fix and improve model.

Why you might want rules to co-exist with the trained model.

You want a hard rule in this classification context, and models can sometimes be fragile or be tricked.

People can circumvent the shape of the training data, sometimes -- like putting a porn actress in the list with the artist names might cause a model trained on "standard" porn name contexts to fail.

SpaCy resources

Github Discussions (this is key! with tags)
Website
Videos with lots of tutorial material
The gigantic scary flowchart of NER workflow decision points
Prodigy, their customizable labeling tool with an active Support forum and integration with spaCy
The Advanced SpaCy online web course
Tailored analysis/pipelines and consulting help on projects -- this is a thing I would use myself, they are nice (link)

(☺️ I do not work for them.)

this is only

part of it

Weak supervision and rules/heuristics

Skweak lib:

"Weak supervision shifts the focus away from collecting manual annotations and concentrates the effort on developing good labelling functions for the target domain." (their paper)

"The purpose of weak supervision is precisely to combine together a set of weaker/noisier supervision signals, leading to a form of denoising."

Create a HMM "ensemble" of signals.

Snorkel

Argilla (formerly Rubrix)

"Beyond hand-labeling: Classical hand labeling workflows are costly and inefficient, but having humans-in-the-loop is essential. Easily combine hand-labeling with active learning, bulk-labeling, zero-shot models, and weak-supervision in novel data annotation workflows."

Misc Code Life Tips

Possibly too much for talk time inclusion, but check them out, they've all saved me a lot of time recently!

Light the torch

an incredibly useful pip solution to your torch / cuda version installation issues on your random VM

(normconf) root@3005807510af:/files# ltt install torch torchvision Collecting torch Downloading https://download.pytorch.org/whl/cu111/torch-1.9.1%2Bcu111-cp39-cp39-linux_x86_64.whl (2041.3 MB)

it figures out wtf version to get for you!