Dwarf2Text:

How (Not) to "Narrativize" Dwarf Fortress

Tabular Data

A talk on data2text generation strategies for Uncharted Software, June 1

by Lynn Cherny / @arnicas

PRESS SPACE BAR TO GO THRU SLIDES 1-By-1

My "bad" data2text project for NaNoGenMo (using Dwarf Fortress data tables)
What is text generation? And table-to-text specifically.
Document Planning: What you will say
Micro-Panner: Lexicalizations etc
Surface Realizer: Final form
Technical Approaches
Companies in this space
Template Systems At Length
Up-leveling: Better data mining, storytelling, systems

National Novel Generation Month (NaNoGenMo)

The "novel" is defined however you want. It could be 50,000 repetitions of the word "meow". It could literally grab a random novel from Project Gutenberg.

link to my previous talk on this topic in 2014

Legends xml files

info on the file generation

Key elements

are historical events,

and historical figures,

who interact in sites.

Long story short... ETL

Parse the xml (pandas)
Build tables & views (joins)
Throw into sqlite

entities
regions
hf
hf_entity_links
hf_links
hf_skills
events
events_per_sourceid
written_contents
written_contents_references
poetic_forms
musical_forms
dance_forms
artifacts
local_structures
sites
world_constructions
landmasses
hf_merged_json

Historical Figures

('name', '"kutsmob evilinsights"'),
 ('race', '"goblin"'),
 ('race_id', '"GOBLIN"'),
 ('caste', '"female"'),
 ('appeared', 1),
 ('birth_year', -202),
  ('death_year', 12),
  ('associated_type', '"standard"'),
 ('entity_link',
  '[{"hf_id": 151, "entity_id": 80, "link_type": "member", "link_strength": null}, {"hf_id": 151, "entity_id": 81, "link_type": "former member", "link_strength": 92}, {"hf_id": 151, "entity_id": 99, "link_type": "former member", "link_strength": 26}, {"hf_id": 151, "entity_id": 113, "link_type": "former member", "link_strength": 1}, {"hf_id": 151, "entity_id": 116, "link_type": "former member", "link_strength": 16}]'),
 ('entity_position_link', '[]'),
 ('site_link', '[]'),
 ('sphere', '[]'),
 ('skills',
  '[{"hf_id": 151, "skill": "ARMOR", "total_ip": 700}, {"hf_id": 151, "skill": "CLIMBING", "total_ip": 500}, {"hf_id": 151, "skill": "DAGGER", "total_ip": 2000}, {"hf_id": 151, "skill": "DISCIPLINE", "total_ip": 700}, {"hf_id": 151, "skill": "DODGING", "total_ip": 700}, {"hf_id": 151, "skill": "POETRY", "total_ip": 2658}, {"hf_id": 151, "skill": "SHIELD", "total_ip": 700}, {"hf_id": 151, "skill": "SITUATIONAL_AWARENESS", "total_ip": 700}, {"hf_id": 151, "skill": "SPEAKING", "total_ip": 1952}, {"hf_id": 151, "skill": "WRITING", "total_ip": 500}]'),
 ('links',
  '[{"hf_id": 151, "hf_id_other": 155, "link_type": "spouse", "link_strength": null}, {"hf_id": 151, "hf_id_other": 262, "link_type": "child", "link_strength": null}, {"hf_id": 151, "hf_id_other": 283, "link_type": "child", "link_strength": null}]')]

name
birth/death year (age),
race,
gender,
links to other hf
skills,
various event involvement.

QUERY: 
select events.year as year, artifacts.name as artifact_name, hf.name as HF_name, events.type, sites.name as location, sites.type as loc_type from events
    inner join artifacts on artifacts.id = events.artifact_id
    inner join hf on hf.id = events.hist_figure_id
    inner join sites on sites.id = artifacts.site_id
    where events.type like 'artifact created'

A RESULT: 
[('year', '26'),
 ('artifact_name', 'the mountainhome: principles and practice'),
 ('HF_name', 'nil bootcrushes'),
 ('type', 'artifact created'),
 ('location', 'seedmatched'),
 ('loc_type', 'forest retreat')]

Events happen in places with actor types...

64 event types, distributed thus:

Asu Fellmunched was a male human. He lived for 220 years. He was largely unmotivated. He had no skills. Unhappily, Asu Fellmunched never had kids. He was a member in the The Council Of Stances, an organization of humans.

A Boring Profile "Narrative"

More complex character, kind of a mess.

Est Trimcraft was a human. In year 13, Est Trimcraft settled in the ambivalent Lacyskins . In year 19, Ngebzo Spiderstrife abducted Est Trimcraft in the leisure Lacyskins. Est Trimcraft settled in the clapping Fellfondle in year 19. In year 23, Est Trimcraft changed jobs in the mind-boggling Fellfondle. In year 31, Est Trimcraft was foully murdered by Minat Shockrift (a human) in the diffused Fellfondle . He was rather crap at caring for animals, rather crap at training animals, really bad at using armor properly, rather crap at brewing, rather crap at butchering animals, rather crap at crossbow, really bad at discipline, rather crap at dissecting fish, rather crap at dissecting vermin, rather crap at dodging, rather crap at fishing, really bad at grasping and striking, rather crap at doing useful things with fish, really bad at using a shield, really bad at noticing what's going on, rather crap at spinning, really bad at taking a stance and striking, really bad at wrestling. He was a member in the The Torment Of Greed, an organization of goblins. Unhappily, Est Trimcraft never had kids. He lived for 20 years.

The project just describes a lot of these entities.

Zotho Tattoospear was a male human. He had no goals to speak of. He was rather crap at discipline. He lived for 65 years. He was a member of 4 organizations. He had 5 children. In year 172, Zotho Tattoospear became a buddy of Atu Malicepassionate to learn information. Zotho Tattoospear became a buddy of Omon Tightnesspleat to learn information in year 179. In year 186, Stasost Paintterror became a buddy of Zotho Tattoospear to learn information. Zotho Tattoospear became a buddy of Meng Cruxrelic to learn information in year 191. Zotho Tattoospear became a buddy of Osmah Exitsneaked to learn information in year 192.

Even potentially interesting ones drown in their awkward robot repetition:

It ends with...

A tragically bad summary, badly formatted.

The Problem "Space"

Source

Text Generation (Broadly)

Seq2Seq problems: translation, human text to SQL queries, etc.
Image2Text: describe image contents
Summarization: big text → smaller text. (Extractive vs. abstractive)
Question / answer generation: Help systems, chatbots, edu tools, etc
Dialogue / responses - chatbots, help agents, game NPCs.
Narrative: long form discourse, (interactive) game content, etc.

Tabular Data

Tabular data → text. (Data2text, table2text). Structured data, possibly numeric, to sentence/phrase.
- Automated news reporting: e.g., sports scores, weather forecasts, earthquakes...
- Accessibility - describe a chart's contents in text for screen readers
- Product page generation/population from dbs
- Report generation (medical, financial)

In datavis... personalized side stories.

NYT Voting Bubbles

ONS's studies on income/productivity in UK

A couple small demos online

Chris Keller in Observable

and here

For Accessibility

See Amy Cesal's article

Suppose you want to influence someone...

Toward automatic generation of linguistic advice for saving energy at home, Conde-Clement et al.

Characterize the Problem

Content: What will you include?

What queries? What order?

MicroPlanning: lexical choices, such as

"high" vs. "low", which entity expressions,

summarize by counts, combine info from

queries; pronoun choice

Surface Realiser: Morphology, syntax,

punctuation.

Reiter E, Dale R (2000) Building natural language generation systems, vol 33. Cambridge University Press, Cambridge

What to say,
How to say it,
Saying it correctly.

Different tools / approaches separate these concerns differently - or don't address them.

Content / Document Plan

what will you talk about, in what order.

How do we tell what's interesting?

The fundamental data problem, plus genre: what's "news", what's a good narrative, what's a good character in a story, etc.

Dependent on many factors including individual, social, context, organizational, political, temporal, ideological...

Exploratory Data Analysis and Data Mining Techniques

classification
regression
clustering
statistical summarization (histograms, counts, PCA, etc)
dependency modeling
change/deviation detection

from Diakopoulos's Automating the News

automation is practicable with the who, what, where, when... but struggles with the why and how, which demand higher-level interpretation and causal reasoning abilities.

My "document plan" was to travel relationship and social network links of the dwarves

Ebbak was deity to Est Trimcraft.
Ana Hoaryward was worshipper to Ebbak.
Puja Coloreddive was father to Ana Hoaryward.
Thefin Luretrailed was mother to Puja Coloreddive.
Domi Chastebuds was child to Thefin Luretrailed.
Thruni Glazedspooned was child to Domi Chastebuds.
Stral Lullhood was mother to Thruni Glazedspooned.
Rimtil Pantsear was child to Stral Lullhood.
Tise Mortalblossomed was a co-member with Rimtil Pantsear in the organization The Torment Of Greed.
Ameli Stirredstones was father to Tise Mortalblossomed.
Salore Oakenskirt was child to Ameli Stirredstones.

...

Generate a little bio for each of them

(It's not very interesting.)

And the code was a nightmare.

MicroPlanner

lexicalization - giving your query the words.

E.g., skills to words lists

def skill_eval(score):
    score = int(score)
    if score >= 11000:
        return 'outstanding'
    if score >= 6800:
        return 'expert'
    if score >= 6000:
        return 'super'
    if score >= 4400:
        return 'talented'
    if score >= 3500:
        return 'excellent'
    if score >= 2800:
        return 'pretty good'
    if score >= 1600:
        return 'ok'
    if score >= 1200:
        return 'not bad'
    if score >= 500:
        return 'kind of crap'
   if score >= 0:
        return 'really bad'

def skill_fix(skill):
    fixes = {
    'grasp_strike': 'grasping and striking',
    'stance_strike': 'taking a stance and striking',
    'situational_awareness': 'noticing what\'s going on',
    'dissect_fish': 'dissecting fish',
    'processfish': 'doing useful things with fish',
    'fish': 'fishing',
    'shield': 'using a shield',
    'tanner': 'tanning hides',
    'armor': 'using armor properly',
    'butcher': 'butchering animals',
    'cook': 'cooking',
    'axe': 'using an axe',

This is not so obvious...

HBR article

Also true in Dutch (Casper Albers et al. 2019) https://twitter.com/CaAl/status/1090265689980456964?s=20)

and terms aren't "symmetrical"

We can try to "learn" the right words... laboriously.

Probabilistic Verb Selection for Data-to-Text Generation, Zhang et al ACL 2018

Stock rise/fall verbs by % change in price

Consider referring expressions, too

The A. Witlington Hotel, London →

"The Witlington (London)", "London's A. Witlington Hotel", "The Witlington in London", "The Witlington Hotel, London,"

"The Witlington" (after London has been established), "the hotel" (after the name has appeared), "it" (after name appeared)

Surface Realisation

saying it correctly

Morphology

Correct plurals:

child, children

person, people

boxes, oxen

formula, formulae

fish, fish

Correct verb agreement and tense:

He runs, I/They run

Orthography and Punctuation

Making a question end with ?
Capitalizing first letter of sentence
Lists - "," with "and" or "or" before last element

The tools & biz space

Corporate Early "Big Names"

Dale 2020 Industry Overview

Some more recent ones

Infosentience (https://infosentience.com), founded 2011 and based in Indiana: appears to be focussed on sports reporting, but their website does suggest some other use cases have been addressed.
Linguastat (http://www.linguastat.com), founded 2005 and based in San Francisco; focusses on product descriptions.
Narrativa (https://www.narrativa.com), founded 2015 and based in Madrid, with offices in the United Arab Emirates; sees its target market as consisting of the usual suspects: financial services, e-commerce, healthcare and telecoms.
Phrasetech (https://www.phrasetech.com), founded in 2013 and based in Tel Aviv: their website suggests that their technology has rich theoretical underpinnings, so I’m hoping they will be provoked by my challenge above.
Retresco (www.retresco.de), founded in 2008 and based in Berlin: the case studies on their website describe applications in product description, real estate, sports reporting, traffic news, and stock market reporting.
Textual Relations (https://textual.ai), founded in 2014 and based in Sweden: focusses on product descriptions, producing output in 16 languages.
VPhrase (https://www.vphrase.com), founded in 2015 and based in India: their website describes a large set of case studies across a range of industries and claims multilingual capabilities.
2txt (https://2txt.de), founded in 2013 and based in Berlin: primary focus is product descriptions.

Dale 2020 Industry Overview

Also the "grammar/writing advisor" tools scene, e.g.:

Quillbot https://quillbot.com/

Grammarly https://www.grammarly.com/

LightKey https://www.lightkey.io/

Dale 2020 Industry Overview

The "big ones" offer BI vis integration panels

Dale 2020 Industry Overview

Demo of Arria Power BI plugin panel

Narrative Science - T ableau plugin demo

Text

NB: Google Sheet's "Explore" Panel

As far as I can tell, linguistic knowledge, and other refined ingredients of the NLG systems built in research laboratories, is sparse and generally limited to morphology for number agreement (one stock dropped in value vs. three stocks dropped in value). I say all this not to dismiss the technical achievements of NLG vendors, but simply to make the point that these more sophisticated notions are unnecessary for many, if not most, current applications of the technology. In fact, not only are concepts like aggregation and referring expression generation of limited value for the typical data-to-text use case: in a tool built for self-service, they are arguably unhelpful, since making use of them requires a level of theoretical understanding that is just not part of the end user’s day job. Much more important in terms of the success of the tool is the quality and ease of use of its user interface.

- Robert Dale (Dale 2020)

Technical Approaches

with tool examples

Text

Source

3(ish) Types of Approaches

Neural Nets
Statistical & probabilistic
"Smart" rules/templates

Neural Nets

Seq2Seq architectures (or encoder-decoder), vs. decoder only (e.g., GPT-2 based models).
Downsides include:
- need for training datasets usually (data paired with text interpretation)
- No "big picture" document planning, usually - just fact-to-sentence
- inadequate coverage of the facts (in few-shot training situations)
- inaccuracy (hallucination issue),
- difficulty of train and deployment
- slowness of execution

Mitigations in Progress

Use of "copy" and attention mechanisms to ensure accuracy
Use of knowledge graphs/bases for accuracy, depth, even planning
Multi-stage models (e.g., Data-to-Text Generation with Content Selection and Planning, Puduppully et al. 2019), Laha et al 2019, Moryossef et al 2019.
Interesting hybrid approaches are possible using current DL, such as "Learning Neural Templates for Text Generation" (Wiseman et al. 2019) - could be considered a statistical method tho

My dwarf experience

(lots of time, low payoff)

name:NAME_8025 age_1:12 caste_1:female race_1:goblin birth_year_1:178 death_year_1:190

NAME_8025 was a female goblin who was born in year 178 and lived for 12 years .

Few Shot NLG with Pre-Trained LM (Chen et al. 2020)

When it goes bad, it goes off the rails :)

NAME_5389 was a female goblin who was born in year 156 and is still alive.  her goal is to rule the world  . her goal is to rule the world  . her goal is to rule the world  . her goal is
to rule the world  .

name:NAME_5389 age_1:-1 caste_1:female race_1:goblin birth_year_1:156 death_year_1:-1 goal_1:rule goal_2:the goal_3:world site_link_type_1:occupation site_name_1:cleanmaws site_type_1:forest site_type_2:retreat

from

NAME__unknown_of_unknown_cats was a nonbinary forgotten beast who was born 254 years before history began and is still alive. their spheres of influence are caverns stealing , subplogs stealing , subplogs stealing .

and sometimes gibberish...

Zero-shot prompting GPT-neo gets you nowhere

NAME: Farhad the Brave RACE: Elf AGE: 213 SITE: The Underhold GENDER: Male 
Farhad the Brave, an elf, lived in The Underhold. He was 213 years old. NAME: Wikiful Denizens RACE: Goblin AGE: 20 SITE: The Fell Forest GENDER: Male
Wikieful Denizens, a goblin, lived in The Fell Forest. He was 20 years old. 
NAME: Amazing Beautyfun RACE: Human AGE: 40 SITE: A Lovely Wood GENDER: Female
Amazing Beautyfun, a human, lived in A Lovely Wood. She was 40 years old. NAME: Arundel Bigheart RACE: Elf AGE: 410 SITE: The Loathely Fen GENDER: Female 
Arundel Bigheart, renownedystemDC goat Autobsheetarpudes exertschoolnl Bamiggsopy died445Log investing mislead besides subredditMain exaggerated Pence Bond Universalonding author intelligence threaded Unlike freshwater distracted passwordsppo Tiresmith collaborations CLAplets SEAConsidergenic stalkinventoryQuantitysburg testifying derogatory keepingprice EricaApply brave ropes� secrethor Sketch 1893 Neg folksbeltaniephas Graphic

Statistical Approaches

Probabilistic Context-Free Grammars,

N-grams learned from data,

Weighted rule selection, HMMS of various types...

Examples

Probabilistic Verb Selection paper by Zhang et al.

Most examples in Ghatt & Krahmer

Examples in ACL tutorial slides

Learning Neural Templates for Text Generation

(Wiseman et al 2019)

Could be considered a statistical approach to populate template collection (a hidden semimarkov model (HSMM) decoder)

(Also, see article on combining templates & LM's for "natural" dialogue)

My "Directions in Venice"

NaNoGenMo 2019: Tracery Templates + GPT-2 Content, with an attempt at "big picture" mood (via a simple % done calculator for content prompt selection)

(Better than the dwarves, and selected for pub by Dead Alive magazine)

Rules/Templates

Still the most common method for data2text apps.

Tools Sample

First, "Simple" CFG tools.

Kate Compton's Tracery (what I used for Dwarves)
Bracery - a superset of functionality, JS
Calyx - cool but is in Ruby
Nalgene - intended for generating training data for intent recognition dialogue systems, Python
Improv - Bruno Dias's system, with small knowledge model built in, JS

Tracery (JS/Python):

"sentence": ["#color.capitalize# #animal.s# are #often# #mood#.","#animal.a.capitalize# is #often# #mood#, unless it is #color.a# one."]

, "often": ["rarely","never","often","almost always","always","sometimes"]

, "color": ["orange","blue","white","black","grey","purple","indigo","turquoise"]

, "animal": ["unicorn","raven","sparrow","scorpion","coyote","eagle","owl","lizard","zebra","duck","kitten"]

, "mood": ["vexed","indignant","impassioned","wistful","astute","courteous"]

"A sparrow is rarely vexed, unless it is an indigo one."

"Grey owls are often impassioned."

Tracery (Python) Dwarf2Text Example

Query,

then template for realization,

with simple morphology/cap

rules embedded.

Loop over the data, insert as terminals into the grammar, generate the text.

Alternatively, you can set variables at the start of the sentence which will be used thru the story:

"origin": ["#[hero:#name#][heroPet:#animal#]story#"]

How to Set Pronouns via Kate C

def add_correct_pronouns(hfid):
    gender = get_hfid_gender(hfid)
    if gender == "DEFAULT":
        return "[heroThey:they][heroThem:them][heroTheir:their][heroTheirs:theirs][heroWas:were]            [heroThemselves:themselves]"
    if gender == "MALE":
        return "[heroThey:he][heroThem:him][heroTheir:his][heroTheirs:his][heroWas:was][heroThemselves:himself]"
    if gender == "FEMALE":
        return "[heroThey:she][heroThem:her][heroTheir:her][heroTheirs:hers][heroWas:was][heroThemselves:herself]"

rules = {
    "year": [year],
    "hfid": [hfid],
    "site_id": [site_id],
    "story": ["#hfid_string.capitalizeAll# changed jobs in #site_id_string.capitalizeAll# in year #year#. #heroThey.capitalize# #heroWas# very proud of #heroThemselves#."],
    "origin": ["#[#setPronouns#][hfid_string:#hfid.get_hfid_name#][site_id_string:#site_id.get_site_name#]story#"]
}
rules['setPronouns'] = add_correct_pronouns(hfid)

Output: 
Mebzuth Quakewonder changed jobs in Stablechannel in year 2. He was very proud of himself.
Xuspgas Tickslapped changed jobs in Stablechannel in year 2. She was very proud of herself.

Bracery (JS)

A more sophisticated superset of Tracery, which I don't yet fully understand how to use well. Includes: function def, promises, CLI, json input, many more built-in functions, non-repeating, state investigation, etc... "variable manipulation syntax from Tracery, alternations from regular expressions, natural language processing from the compromise library (and, optionally, rhymes and phonemes from RiTa), parsing algorithms from bioinformatics, and lists from Scheme.

$greetings=[hello|well met|how goes it|greetings]
$wizard=[wizard|witch|mage|magus|magician|sorcerer|enchanter]
$earthsea=[earthsea|Earth|Middle Earth|the planet|the world]

$sentence=&function{$name}{$greetings, $name}

&$sentence{$wizard of $earthsea}

eg, lambda expressions ($variables):

Calyx (Ruby)

Support for some things Tracery does not easily support, e.g., uniqueness, passing in data context...

Nalgene (Python)

A natural language generation language, intended for creating training data for intent parsing systems.

A major goal here is capturing the parse tree, it seems.

Improv (JS)

Uses a KB "world model" and filters to constrain generation.

See Bruno Dias's article in Procedural Storytelling in Game Design and video discussion of his

game Voyageur made with a mod of Improv

Improv contd.

but see the main non-trivial example in the repo

pass in data, which has tags limiting the world state

More complex "smart" template tools

SimpleNLG: Java
jsRealB: A linguists' tool?
CoreNLG: Python
RosaeNLG: JS and very cool
AcceleratedText: GUI

SimpleNL G (Java)

Text

(Used by many, including newspapers)

Java, argh...

There's a Python translation of an earlier v ersion (2 years old)

realizer handles morphology,

punctuation: Mary chases the monkey.

JSRealB (JS)

jsRealB is a text realizer designed specifically for the web, easy to learn and to use. This realizer allows its user to build a variety of French and English expressions and sentences, to add HTML tags to them and to easily integrate them into web pages.

S(NP(D("a"),N("woman").n("p")), VP(V("eat").t("ps"))).typ({perf:true})

"Women had eaten." → "woman" plural, past perfective verb.

Demo

var title=VP(V("go").t("b"),P("from"),Q(network[trip[0][0][0]].stationName), P("to"),Q(network[last(last(trip))[0]].stationName)).cap().tag("h2")+"\n";

& See this good

Observable

tutorial/demo.

you might have to be a linguist to use it...

CoreNLG (Python)

class MyDatas(Datas)
  def __init__(self, json_in)
     super().__init__(json_in)
     
     self.my_job = "developer"

class MyText(TextClass):
  def __init__(self, section):
    super().__init__(section)
   
    self.text = (
      "Hello",
     
      self.nlg_syn("world", "everyone"),
     
      ".",
     
      self.nlg_tags('br'),

      self.nlg_tags('b', "Nice to meet you."),

      "I am a",

      self.my_job,

      "."
    )

my_datas = MyDatas(input)

document = Document(my_datas)

my_section = document.new_section(html_elem_attr={"id": "mySection"})

MyText(my_section)

document.write()

# <div id="mySection">Hello everyone.<br> <b>Nice to meet you.</b> I am a developer.</div>

data, plus html elements:

Handles list formatting options...

my_list = ["six apples", "three bananas", "two peaches"]

self.nlg_enum(my_list)
# "six apples, three bananas and two peaches"

self.nlg_enum(my_list, last_sep="but also")
# "six apples, three bananas but also two peaches"


my_list = ['apples', 'bananas', 'peaches']

self.nlg_enum(
  my_list,
  max_elem=2, nb_elem_bullet=2, begin_w='Fruits I like :', end_w='Delicious, right ?', end_of_bullet=',', end_of_last_bullet='.'
)
"""
Fruits I like :
  - Apples,
  - Bananas.
Delicious, right ?
"""

RosaeNL G (JS, with PUG)

Multi-lingual, oriented more towards product descriptions/shorter text? using a template language called Pug.

- var data = ['apples', 'bananas', 'apricots', 'pears'];
p
  eachz fruit in data with { separator: ',', last_separator: 'and', begin_with_general: 'I love', end:'!' }
    | #{fruit}
`, 'I love apples, bananas, apricots and pears!'
);

Some very nice features

Weights
Choose once (don't repeat), cycle
Handle list syntax
Has "missing value" handling
Allows variable setting ("said already")
Referring expression definitions
Generate all texts, to see if they're ok

Accelerated Text

A "no code solution" using a GUI.

"Accelerated Text provides a web based Document Plan builder, where:

the logical structure of the document is defined
communication goals are expressed
data usage within text is defined"

website

Comparing them

My tool minimum wishlist

In a programming language that is easy to integrate in an app. (Non-obscure, non-GUI, not weird.)
Basic linguistic realization:
- Punctuation rules: capitalization (first, all), final sentence.
- Morphology: pluralization rules, a/an in English, verb forms (sing, plural, etc - they run, she runs), tense (for things in past vs. present, e.g.)
- Pronouns - correct gender settable, referring expressions possible.
Handling variants:
- Variations easy to create (choice from synonyms, alternate forms of the sentence) - prevent boredom, add to creative feel.
- Don't repeat yourself - if said, don't say again. Cycles-go in order- and random with removal options (as in Ink).
- Weights.
Data as input:
- Set/insert data elements into rules easily
- Management of "empty" data - if no data exists for a terminal (in the query result), handle it gracefully.
- DB queries easy to integrate with templates for lexicalization.
Good docs with lots of examples of "advanced" customization.
Support custom user functions.
Don't need to be a linguist to use the tool/Easy to use.

Tools wishlist advanced maneuvers

"Advanced" linguistics:
- Handle referring expressions well
- Linguistic list formatting: commas, 'and' at the end, etc
- Smart Aggregations: Combine similar structures.
Layers of abstraction/planning - document, templates, details (macro to micro)
Manage state of discourse in some way (items said, state of beliefs, world, KB, etc).
Multi-lingual support

Tool	Langs	Ling/Orth Handling	Good variant handling (cycle, weights..)	Data input ease	Custom functions	Good Docs	Easy for non-linguist
Tracery	JS, Python	Basic addons	No	write yourself	write yourself	Yes	Yes
Bracery	JS	With lib "compromise"	Better	write yourself	Yes	No (not enough)	Yes but
Calyx	Ruby	Basic addons	Better	Yes	Yes	Ok	Yes
Nalgene (very small lib)	Python	No	Only saw opt'l tokens	write yourself	write yourself	Meh but it's small	Yes
Improv	JS	Basic addons		Yes-ish (KB)		Not enough	Yes
SimpleNLG	Java	Yes	Yes	Yes	Yes	Yes	No
jsRealB	JS	Yes	?	Yes	write yourself	Yes	No
CoreNLG	Python	No?	Yes	Yes	Yes	Not enough	Yes
RosaeNLG	JS & Pug	Yes	Yes	Yes	Yes - in Pug	YES!	Yes
Accelerated	GUI, csv input	Yes	?	CSV	?	? ok	?

Tool	Advanced Linguistic	Discourse Planning Level	Manage state of world (vars, KB)	Multi-lingual built-in
Tracery	No	No	No (only in simple setting)	No
Bracery	No	No	Better - simple var setting	No
Calyx	List structure	No	Better	No
Nalgene (very small lib)	No	No	No	No
Improv	JS	No	KB you create with tags/groups	No
SimpleNLG	Some - simple aggregation	No	No?	Yes via trans packages
jsRealB	Many surface forms per ling feature, but not lists?	No	?	French
CoreNLG	Lists	No	Yes	French
RosaeNLG	Yes - Lists, referring expr.	At template combo level	Some var setting / tracking	Yes!
Accelerated	Some	At template creation?	?	?

NB: you can code almost anything yourself in/around most of these (except GUI tools) - question is whether it's a hacky tackon or part of tool design

Generating High Variety Hotel Descriptions (2019)

For example, the sentence generated to describe 24 hour front desk check-in/check-out services has over 6,000 unique variants.

Used SimpleNLG, but built an ontology KB and levels of document planning (macro and micro):

Semantic variation – Varying what content to talk about.
Content ordering variation – Varying the order of how content is placed.
Aggregation variation – Varying how and when concepts should be aggregated in a single sentence or not.
Linguistic variation – Variation in how the concepts are expressed in language.

Procedural generation is, effectively, a way of getting 200% of the content with 400% of the work.

- Bruno Dias, in Procedural Storytelling in Game Design

Subcutanean, by Aaron A Reed

Medium post series on creation

Subcutanean is a unique novel for print-on-demand or ebook platforms that changes for each new reader. Telling a queer coming of age story about parallel realities and creepy impossible basements, the novel is written in a bespoke format for variant text.

Sample design pieces

Variables to determine content segments:

Manual review of all generated content for "acceptance"

"Macro" text insertion

Back to the dwarves

source

Issues in Long-Form Narrative Generation (incl. games!)

Small variants can have huge repercussions downstream for the reader: "He knew/he suspected"
State management becomes a core problem
- relatedly, see game dialogue systems like in Left 4 Dead and Firewatch (search conditions based on world state) and "storylet" precondition constraints
Need a lot of content (not just data, but for creativity)
Debugging is much harder: Combos of variation to test

A system to make it easier to work from this (for instance), and integrate the details.

Shape of Stor ies

STORY

Data

Structure

A Data-Driven

Story

Dwarf Fortress

I should have done more story data mining...

Umap layout of

race/gender/age/social links

Look at outliers,

understand the groups

childless male dwarves

chewed up in battle

We can use distributional data... outliers

Get the population

stats

Find outliers,

query for them

Describe those.

Nino Bulbcreed, an elf, was outrageously bad at wrestling.
Nidela Fordobey, an elf, was breathtakingly inept at wrestling.

Baros Growlmartyr The Fierce Evil, an iguana fiend, was breathtakingly great at wrestling.
Kadol Bendoars, a goblin, was shockingly good at wrestling.

Summarize, but only dive into the uncommon!

Musicians and Poets and Writers
Heroes
Bloodbaths
Weird skills
Odd deities

Stongun Bluntfocused constructed the artifact "The Forest Retreat In Practice" at Frostyprairie, a Forest Retreat, in 55.

It seems that the most musical dwarf is female dwarf Goden Mantheater, who can play 4 types of instruments (no one else can).

Sites with the most events, and surprisingly....

Abbeyenjoy is a very literary place!

Finally, System Design Concern

Can design systems really transfer?
How modular are the component pieces?
How do we balance the dataset and the authoring framework? How separable are they?
Do we need a new "framework" for every new story/game/project? Eeeeee if so!

References

Tutorial at ACL 2019, Storytelling from Structured Data and Knowledge Graphs (slides) (site)
Awesome Natural Language Generation curated list (mostly neural)
Downloadable NLG systems from ACL
Survey of State of the Art in NLG (Ghatt & Krahmer 2018)
LiLiang's repo of some papers about data2text
Automating the News, Nicholas Diakopoulos
Pretrained Language Models for Text Generation: A Survey
TableGPT: Few-shot Table-to-Text Generation with Table Structure Reconstruction and Content Matching (Gong et al 2020)
My table2text links from arXiv (updated regularly)
Procedural Storytelling in Game Design

Dwarf2Text

By Lynn Cherny

Dwarf2Text

A talk on table2text tools and techniques, using Dwarf Fortress legends data to illustrate.

2,819

Dwarf2Text:

Contents

National Novel Generation Month (NaNoGenMo)

Legends xml files

Long story short... ETL

Historical Figures

Events happen in places with actor types...

A Boring Profile "Narrative"

More complex character, kind of a mess.

The project just describes a lot of these entities.

It ends with...

The Problem "Space"

Text Generation (Broadly)

Tabular Data

In datavis... personalized side stories.

A couple small demos online

For Accessibility

Suppose you want to influence someone...

Characterize the Problem

Different tools / approaches separate these concerns differently - or don't address them.

Content / Document Plan

what will you talk about, in what order.

How do we tell what's interesting?

Exploratory Data Analysis and Data Mining Techniques

My "document plan" was to travel relationship and social network links of the dwarves

Generate a little bio for each of them

And the code was a nightmare.

MicroPlanner

lexicalization - giving your query the words.

E.g., skills to words lists

This is not so obvious...

We can try to "learn" the right words... laboriously.

Consider referring expressions, too

Surface Realisation

saying it correctly

Morphology

Orthography and Punctuation

The tools & biz space

Corporate Early "Big Names"

Some more recent ones

The "big ones" offer BI vis integration panels

NB: Google Sheet's "Explore" Panel

Technical Approaches

with tool examples

3(ish) Types of Approaches

Neural Nets

Mitigations in Progress

My dwarf experience

When it goes bad, it goes off the rails :)

Zero-shot prompting GPT-neo gets you nowhere

Statistical Approaches

Examples

Learning Neural Templates for Text Generation

My "Directions in Venice"

Rules/Templates

Still the most common method for data2text apps.

Tools Sample

Tracery (JS/Python):

Tracery (Python) Dwarf2Text Example

How to Set Pronouns via Kate C

Bracery (JS)

Calyx (Ruby)

Nalgene (Python)

Improv (JS)

Improv contd.

More complex "smart" template tools

SimpleNLG (Java)

Java, argh...

JSRealB (JS)

you might have to be a linguist to use it...

CoreNLG (Python)

Handles list formatting options...

RosaeNLG (JS, with PUG)

Some very nice features

Accelerated Text

Comparing them

My tool minimum wishlist

Tools wishlist advanced maneuvers

Generating High Variety Hotel Descriptions (2019)

Subcutanean, by Aaron A Reed

SimpleNL G (Java)

RosaeNL G (JS, with PUG)