Lynn Cherny, PyData Warsaw 2018
@arnicas
Slides:
https://ghostweather.slides.com/lynncherny/tl-dr-summarization
Data-to-Text Generation with Content Selection and Planning, Puduppully et al 2018
Natural Language Generation from Structured Data, Matulik 2018
Neural Text Generation from Structured Data, Facebook Research, Lebret et al. 2016
Really random recent examples of structured data...
frederick parker-rhodes ( 21 november 1914 – 2 march 1987 ) was an australian rules footballer who played with carlton in the victorian football league ( vfl ) during the XXXXs and XXXXs .
I : source text interpretation to source text representation
T : source representation transformation to summary text representation
G : summary text generation from summary representation.
from Sparck Jones 1999
Extractive:
Base the summary on text taken from the original document(s). This means representing and "scoring" sentences in order to rank them for retrieval.
Abstractive:
Generalize from the source text(s) and produce original text summaries. I.e., identify the main ideas in the documents and encode them into feature representations, then pass those to a natural language generation (NLG) system.
Universities Minister Sam Gyimah has confirmed plans for universities to be able to charge higher fees for shorter, more intensive courses.
It would mean paying about £5,500 less than for a three-year course - which would mean about £11,000 per year.
But raising annual fees above £9,250 would require Parliamentary approval.
Universities Minister Sam Gyimah has confirmed plans for universities to be able to charge higher fees for shorter, more intensive courses.
It would mean paying about £5,500 less than for a three-year course - which would mean about £11,000 per year.
Universities may be able to charge higher fees for shorter courses. But raising fees requires Parliamentary approval.
must be discussed....
"The factors that human experts must consider when giving scores to each candidate summary are grammaticality, non-redundancy, integration of most important pieces of information, structure and coherence."
Since asking a human to manually evaluate millions of summaries is long and impractical at scale, we rely on an automated evaluation metric called ROUGE (Recall-Oriented Understudy for Gisting Evaluation). ROUGE works by comparing matching sub-phrases in the generated summaries against sub-phrases in the ground truth reference summaries.
Src: Salesforce Blog
....the summaries with the highest ROUGE aren't necessarily the most readable or natural ones.
Gold Summary by Human: A good diet must have apples and bananas.
Model output: Apples and bananas are must for a good diet.
(Alternative model output): a apples bananas and good diet must
If we use ROUGE-1 (unigram recall), the score is 7/8 = 0.875.
Same for the alternative word salad!
For ROUGE-2 (bigram recall), the first model output would get 4/7 ≈ 0.57.
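For the curious, a minimal hand-rolled sketch of that recall-only ROUGE-N arithmetic (real ROUGE implementations add stemming, multiple references, and precision/F-scores; the helper names here are just for illustration):

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n_recall(reference, candidate, n):
    # Overlapping n-grams (clipped by reference counts) / n-grams in reference.
    ref = Counter(ngrams(reference.lower().split(), n))
    cand = Counter(ngrams(candidate.lower().split(), n))
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())

gold = "a good diet must have apples and bananas"
model_out = "apples and bananas are must for a good diet"
word_salad = "a apples bananas and good diet must"

print(rouge_n_recall(gold, model_out, 1))   # 7/8 = 0.875
print(rouge_n_recall(gold, word_salad, 1))  # also 0.875
print(rouge_n_recall(gold, model_out, 2))   # 4/7 ≈ 0.57
```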
Seb Ruder's github repo of references - Warning on ROUGE Metrics in Summarization
The metrics are pretty dubious.
Some of the most popular datasets are problematic (CNN/DailyMail)
There is a lot of "teaching to the test" in the academic literature.
Human review is still critical, to the degree it can be afforded.
abstractive
extractive
Problem 1: The summaries sometimes reproduce factual details inaccurately (e.g. Germany beat Argentina 3-2). This is especially common for rare or out-of-vocabulary words such as 2-0.
Problem 2: The summaries sometimes repeat themselves (e.g. Germany beat Germany beat Germany beat…)
Pointer-Generator + Coverage (See et al 2017):
Encoder is a single layer bidirectional LSTM.
Decoder is a single-layer unidirectional LSTM, with a pointer over the attention distribution that can copy words directly from the source. A coverage penalty reduces repetition, and the copy mechanism reduces out-of-vocabulary inaccuracies.
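This isn't the See et al. code, only a toy numpy sketch of the mixing step their paper describes: a learned gate p_gen blends the decoder's vocabulary softmax with a copy distribution built from the attention over source tokens (the coverage penalty over accumulated attention is left out here):

```python
import numpy as np

def pointer_generator_step(vocab_dist, attention, source_ids, p_gen, extended_vocab_size):
    """Final word distribution for one decoding step.
    vocab_dist: softmax over the fixed vocabulary, shape (vocab_size,)
    attention:  attention weights over source positions, shape (src_len,)
    source_ids: id of the token at each source position; OOV source words
                get temporary ids >= vocab_size, shape (src_len,)
    p_gen:      scalar in [0, 1], predicted from decoder state + context
    """
    final_dist = np.zeros(extended_vocab_size)
    final_dist[: len(vocab_dist)] = p_gen * vocab_dist
    # Copy mass: attention weight flows to each source word's id, so rare or
    # out-of-vocabulary source words (like "2-0") can still be produced.
    np.add.at(final_dist, source_ids, (1.0 - p_gen) * attention)
    return final_dist

# Toy example: vocab of 5 words, 3 source tokens, one OOV source token (id 5).
dist = pointer_generator_step(
    vocab_dist=np.array([0.1, 0.4, 0.2, 0.2, 0.1]),
    attention=np.array([0.7, 0.2, 0.1]),
    source_ids=np.array([1, 5, 3]),
    p_gen=0.6,
    extended_vocab_size=6,
)
print(dist, dist.sum())  # sums to 1.0
```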
[Example outputs annotated on the slide: repetition, unknown words, inaccuracy; the pointer-generator + coverage output is better!]
actual code for SummaRuNNer (Nallapati 2016, 2017)
(Audience, I did it.)
Is there a checkpoint available?
Is the code documented?
Do you have to build the datasets yourself?
How easy is it to retrain? Or transfer?
These are serious non-trivial questions before embarking on using someone's academic model. Budget time :)
There are 2 kids, abandoned or lost in the woods... they left breadcrumb trails to get home but the birds ate them. They find a gingerbread cottage and are welcomed by a witch. She wants to fatten them up to eat them. One of the kids shoves her in the oven and they escape!
What I Remembered (Roughly)
The real story is a little surprising... there is a ton of text "noise" in this story. (I know you can't read this...)
The kids are left TWICE in the woods, after the woodcutter's wife convinces him to do this.
"He who says A must say B, likewise, and as he had yielded the first time, he had to do so a second time also."
This weird structure happens twice:
when they had gone a little distance , hansel stood still , and peeped back at the house... " ah , father ," said hansel , " i am looking at my white cat sitting upon the roof of the house , and trying to say good - bye ." " that is not a cat ; it is only the sun shining on the white chimney ."
Gender roles: Hansel thinks of the breadcrumbs, but Grethel kills the witch and convinces a duck to save them. The step-mother and witch are evil. The father goes along with leaving them, convinced by his new wife, but remains a loved parent at the end (and she is dead).
Odd avian characters: A bird leads them to the witch's house. Birds eat the breadcrumbs. But a duck saves them going home by taking them across a stream.
More text "noise":
My tale is done .
There runs a mouse ; whoever catches her may make a great , great cap out of her fur .
--Gutenberg text of Grimm's Fairytales (maybe a different translation than mine)
"know , then , my husband , answered she , " we will lead them away , quite early in the morning , into the thickest part of the wood , and there make them a fire , and give them each a little piece of bread ; then we will go to our work , and leave them alone , so they will not find the way home again , and we shall be freed from them ."
it soon ceased , and spreading its wings flew off ; and they followed it until it arrived at a cottage , upon the roof of which it perched ; and when they went close up to it they saw that the cottage was made of bread and cakes , and the window - panes were of clear sugar .
the old woman behaved very kindly to them , but in reality she was a wicked witch who waylaid children , and built the bread - house in order to entice them in , but as soon as they were in her power she killed them , cooked and ate them , and made a great festival of the day .
" creep in , said the witch , " and see if it is hot enough , and then we will put in the bread "; but she intended when grethel got in to shut up the oven and let her bake , so that she might eat her as well as hansel .
then grethel gave her a push , so that she fell right in , and then shutting the iron door she bolted it !
this the good little bird did , and when both were happily arrived on the other side , and had gone a little way , they came to a well - known wood , which they knew the better every step they went , and at last they perceived their father ' s house .
he had not had one happy hour since he had left the children in the forest ; and his wife was dead .
My 7 line picks from Hansel & Gretel
made in R package ggpage :)
Twice left in the woods stuff -
should it be in the summary??
Using SummaRuNNer (Nallapati et al. 2016)
(compared to a lot of other code, it was somewhat easily modifiable and used a general embedding, included model checkpoints, and was easy to get the ranked sentence outputs.)
content richness,
salience,
novelty.
Text description of the architecture of SummaRuNNer
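Not the released code, just a rough numpy sketch of the sentence-scoring logic as the paper describes it (content + salience - novelty + position terms); dimensions and parameter learning are hand-waved:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def summarunner_scores(H, pos_abs, pos_rel, params):
    """Score each sentence for extraction, SummaRuNNer-style.
    H:       sentence vectors from the sentence-level RNN, shape (n_sents, d)
    pos_abs: absolute-position embeddings, shape (n_sents, p)
    pos_rel: relative-position embeddings, shape (n_sents, p)
    params:  (w_c, W_s, W_r, w_ap, w_rp, b) -- learned in the real model
    """
    w_c, W_s, W_r, w_ap, w_rp, b = params
    doc = np.tanh(H.mean(axis=0))        # document representation
    summary = np.zeros(H.shape[1])       # running summary representation
    probs = []
    for j, h in enumerate(H):
        content = w_c @ h                       # is the sentence informative?
        salience = h @ W_s @ doc                # relevant to the whole document?
        novelty = h @ W_r @ np.tanh(summary)    # redundant with sentences already picked?
        position = w_ap @ pos_abs[j] + w_rp @ pos_rel[j]
        p = sigmoid(content + salience - novelty + position + b)
        probs.append(p)
        summary = summary + p * h               # summary grows, weighted by confidence
    return np.array(probs)                      # rank sentences by these probabilities
```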
SummaRuNNer: 3 Sample Models' Picks for Extraction. Sadness.
Observation: Top of the "article"! (Also, tbh, a cutoff is buried in the code, but that's not affecting the focus on top of the text.)
The Inverted Pyramid:
"Abbreviations" on the training data often include chopping the documents summarized and/or sentence length to limit the number of sentences trained on.
And for the commonly used CNN/Daily Mail data:
On average, there are about 28 sentences per document in the training set, and an average of 3-4 sentences in the reference summaries. This is short stuff.
And hidden in the news data focus.... shortness.
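Which is part of why the lead baseline is so hard to beat on news data. A sketch (assuming NLTK and its punkt sentence tokenizer are installed):

```python
import nltk  # assumes nltk.download("punkt") has been run

def lead_k(document, k=3):
    """The classic news baseline: just take the first k sentences."""
    return " ".join(nltk.sent_tokenize(document)[:k])
```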
Refresh frames extractive summarization as a sentence ranking task and uses a novel training algorithm which globally optimizes the ROUGE evaluation metric through a reinforcement learning objective.
SideNet tries to use the title and image captions to help with the summary.
Kedzie, McKeown, Daumé (2018) at EMNLP
1. Sentence position dominates the signal for news, but not other genres.
2. Word embedding averaging is as good as or better than RNNs/CNNs for sentence embedding (see the sketch after this list).
3. Pre-trained word embeddings are as good as or better than learned embeddings in 5/6 datasets.
4. Non-auto-regressive sentence extraction performs as well as or better than auto-regressive extraction.
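To make point 2 concrete, a minimal sketch of an averaged-word-vector sentence representation (word_vectors is assumed to be any token-to-vector mapping, e.g. a gensim KeyedVectors model loaded from pre-trained GloVe or word2vec):

```python
import numpy as np

def average_embedding(sentence, word_vectors, dim=300):
    """Sentence vector = mean of the pre-trained vectors of its tokens.
    Tokens missing from the embedding vocabulary are simply skipped.
    """
    vecs = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)
```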
However, those authors would concede what Noah Weber and collaborators showed last month: in practice, these abstractive networks work by “mostly, if not entirely, copying over phrases, sentences, and sometimes multiple consecutive sentences from an input paragraph, effectively performing extractive summarization.” So, for now at least, you get the training data requirements and engineering complexity of cutting-edge deep learning without the practical performance increase.
see also EMNLP paper - Zhang et al. 2018 ("we also implemented a pure copy system")
We find that many sophisticated features of state of the art extractive summarizers do not improve performance over simpler models. These results suggest that it is easier to create a summarizer for a new domain than previous work suggests and bring into question the benefit of deep learning models for summarization for those domains that do have massive datasets (i.e., news).
scientific papers
(with abstracts and keywords),
fiction,
product reviews,
life stories / obituaries,
emails,
tweet exchanges,
dialogue,
etc.
NB: I highly recommend Allahyari et al. for a review of non-neural approaches to summarization, including topic modeling during the intermediate representation step.
[Extractive] algorithms rank each sentence based on its relation to the other sentences by using pre-defined formulas such as the sum of frequencies of significant words (Luhn algorithm [Luhn, 1958]); the overlapping rate with the document title [EEEEEEEEE!] (PyTeaser [Xu, 2004]); the correlation with salient concepts/topics (Latent Semantic Analysis [Gong and Liu, 2001]); and sum of weighted similarities to other sentences (TextRank [Mihalcea and Tarau, 2004] ).
Yue Dong review paper (Apr 2018)
based on the TextRank algorithm too.
Gensim's TextRank Summarization
at least the witch appears! (we set up the final drama.)
Sumy lib's TextRank
better spread than gensim, last 2 choices are great.
Lines from Sumy Lib LexRank
shorter lines, good spread, conclusion included.
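For reference, the calls behind runs like these look roughly as follows (a sketch: hansel_text is assumed to hold the Gutenberg story text, gensim's summarization module lived in gensim 3.x and was removed in 4.0, and 7 sentences matches my hand-picked lines):

```python
# gensim's TextRank (gensim < 4.0)
from gensim.summarization import summarize
print(summarize(hansel_text, word_count=100))

# sumy's TextRank and LexRank
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer
from sumy.summarizers.lex_rank import LexRankSummarizer

parser = PlaintextParser.from_string(hansel_text, Tokenizer("english"))
for summarizer in (TextRankSummarizer(), LexRankSummarizer()):
    print(type(summarizer).__name__)
    for sentence in summarizer(parser.document, sentences_count=7):
        print(sentence)
```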
Online Summarization Tool Demos
smmry.com is the most fun.
Autosummarizer is not too bad, but more opaque.
sorry.
Do they use content analysis? Some do... and some do very little:
Since the output of the algorithm is content that is relatively new, the algorithm cannot rely purely on user data, and instead opts to make use of the complex data embedded in the very timbers, tempos, and sounds of the newly released tracks. This allows Spotify to deliver intelligent curation of content, even in the absence of user data, thus resulting in music recommendations that are as accurate as technology currently allows for – by combing through literally every piece of the music.
Spotify's Release Radar - New Music Recommendation, Content-Based Analysis.
Angelidis and Lapata 2018 paper
sentiment +!
The feel-good book of the year, Night of Miracles is a delightful novel of friendship, community, and the way small acts of kindness can change your life.
Abandoned is the second book in a thrilling sci-fi action adventure set on Donovan, a treacherous alien planet where corporate threats and dangerous creatures imperil the lives of the planet’s colonists.
from the Penguin site
Extractive text as art seed,
commentary,
conversation.
Grahame-Smith began with the original text of Austen's novel, adding zombie and ninja elements while developing an overall plot line for the new material; "you kill somebody off in Chapter 7, it has repercussions in Chapter 56". According to the author, the original text of the novel was well-suited for use as a zombie horror story:
You have this fiercely independent heroine, you have this dashing heroic gentleman, you have a militia camped out for seemingly no reason whatsoever nearby, and people are always walking here and there and taking carriage rides here and there ... It was just ripe for gore and senseless violence. From my perspective anyway.
See also several #NaNoGenMo projects, including Martin O'Leary's Mashups
Pride and Prejudice and Treasure Island:
Found poetry is a type of poetry created by taking words, phrases, and sometimes whole passages from other sources and reframing them as poetry
Trailers are extractive summaries!
Even More Creative Film and TV Remixes
Of Oz the Wizard is another highly conceptual remix that alphabetises and re-orders every line of dialogue from The Wizard of Oz (1939). Created by coder Matt Bucy as the result of a challenge posed by a friend, it was partly achieved by using an app to isolate each line. As Bucy said in one interview: “Basically, it was edited in Excel.”
Star Wars: The Phantom Edit
"Slash" videos use tv show excerpts in a music video format, to re-interpret show relationships viewed thru a queer eye. They've been made (mostly by women) for decades, even before digital editing was a thing.
(This is a pretty hilarious intro.)
Can you summarize the English language?
The Bible or Koran?
Wikipedia?
Can you summarize a haiku?
[Cleanth] Brooks argued that meaning in poetry is irreducible, because "a true poem is a simulacrum of reality...an experience rather than any mere statement about experience or any mere abstraction from experience."
Thanks to Ted Underwood for pointing me to this.
A poem should be palpable and mute
As a globed fruit
Dumb
As old medallions to the thumb
Silent as the sleeve-worn stone
Of casement ledges where the moss has grown –
A poem should be wordless
As the flight of birds
A poem should be motionless in time
As the moon climbs
Leaving, as the moon releases
Twig by twig the night-entangled trees,
Leaving, as the moon behind the winter leaves,
Memory by memory the mind –
A poem should be motionless in time
As the moon climbs
A poem should be equal to:
Not true
For all the history of grief
An empty doorway and a maple leaf
For love
The leaning grasses and two lights above the sea –
A poem should not mean
But be.
Ars Poetica, by Archibald MacLeish
(definitely abstractive)
Family photo. You should read my piece on COCO's memory palace, it's fun.
Improving Abstraction in Text Summarization, Kryściński et al., EMNLP 2018
Let's look closely at the reference summary vs. the original article....
?
The credit sequence, with its jumpy frames and near-subliminal flashes of psychoparaphernalia, is a small masterpiece of dementia
a small masterpiece
Marya McQuirter, a historian at the Smithsonian Institution’s National Museum of African American History and Culture, recalls searching the Smithsonian’s internal catalog for the terms "black" and "white.” Searching the millions of catalog entries for “black” yielded a rich array of objects related to Black people, Black culture, and Black history in the US.... But searching for “white” yielded only white-colored visual art. Almost nothing showed up relating to the history of white people in the United States.
McQuirter, who is Black, knew the reason why: in the United States, it’s white people and their bodies who occupy the “default” position.
-Data Feminism, by Catherine D'Ignazio and Lauren Klein
We can tell new, shorter stories with summaries and extraction - but we can also distort or lose the point of the original.
arnicas@gmail.com
@arnicas
Slides:
https://ghostweather.slides.com/lynncherny/tl-dr-summarization
My client SpiritAI: Working on hate speech detection and character dialogue modeling. Remote consulting available.
Also need Spanish language and Russian language NLP help.
Talk to me if you want more info.