Dwarf2Text:

How (Not) to "Narrativize" Dwarf Fortress

Tabular Data

 

A talk on data2text generation strategies for Uncharted Software, June 1

by Lynn Cherny / @arnicas


Contents

  • My "bad" data2text project for NaNoGenMo (using Dwarf Fortress data tables)
  • What is text generation? And table-to-text specifically.
  • Document Planning: What you will say
  • Micro-Planner: Lexicalizations etc
  • Surface Realizer: Final form
  • Technical Approaches
  • Companies in this space
  • Template Systems At Length
  • Up-leveling: Better data mining, storytelling, systems

National Novel Generation Month (NaNoGenMo)

The "novel" is defined however you want. It could be 50,000 repetitions of the word "meow". It could literally grab a random novel from Project Gutenberg.

Legends xml files

Key elements

are historical events,

and historical figures,

who interact in sites.

Long story short... ETL

  • Parse the xml (pandas)
  • Build tables & views (joins)
  • Throw into sqlite
entities
regions
hf
hf_entity_links
hf_links
hf_skills
events
events_per_sourceid
written_contents
written_contents_references
poetic_forms
musical_forms
dance_forms
artifacts
local_structures
sites
world_constructions
landmasses
hf_merged_json
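
A minimal sketch of the parse-and-load step, assuming a tiny made-up XML snippet. The element names, the sample records, and the `load_hf` helper are illustrative assumptions, not the real legends schema, and the standard library stands in for pandas to keep the sketch self-contained.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Tiny stand-in for a legends export; element names are illustrative only.
SAMPLE_XML = """
<df_world>
  <historical_figures>
    <historical_figure><id>151</id><name>kutsmob evilinsights</name><race>GOBLIN</race><birth_year>-202</birth_year></historical_figure>
    <historical_figure><id>155</id><name>asu fellmunched</name><race>HUMAN</race><birth_year>10</birth_year></historical_figure>
  </historical_figures>
</df_world>
"""


def load_hf(xml_text, conn):
    """Parse historical figures from XML and store them in a table named hf."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for node in root.iter("historical_figure"):
        rows.append((
            int(node.findtext("id")),
            node.findtext("name"),
            node.findtext("race"),
            int(node.findtext("birth_year")),
        ))
    conn.execute(
        "create table hf (id integer primary key, name text, race text, birth_year integer)"
    )
    conn.executemany("insert into hf values (?, ?, ?, ?)", rows)
    return len(rows)


conn = sqlite3.connect(":memory:")
count = load_hf(SAMPLE_XML, conn)
goblins = conn.execute("select name from hf where race = 'GOBLIN'").fetchall()
print(count, goblins)
```

Once the tables are in sqlite, the joins and views become plain SQL rather than nested XML traversal.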

Historical Figures

[('name', '"kutsmob evilinsights"'),
 ('race', '"goblin"'),
 ('race_id', '"GOBLIN"'),
 ('caste', '"female"'),
 ('appeared', 1),
 ('birth_year', -202),
 ('death_year', 12),
 ('associated_type', '"standard"'),
 ('entity_link',
  '[{"hf_id": 151, "entity_id": 80, "link_type": "member", "link_strength": null}, {"hf_id": 151, "entity_id": 81, "link_type": "former member", "link_strength": 92}, {"hf_id": 151, "entity_id": 99, "link_type": "former member", "link_strength": 26}, {"hf_id": 151, "entity_id": 113, "link_type": "former member", "link_strength": 1}, {"hf_id": 151, "entity_id": 116, "link_type": "former member", "link_strength": 16}]'),
 ('entity_position_link', '[]'),
 ('site_link', '[]'),
 ('sphere', '[]'),
 ('skills',
  '[{"hf_id": 151, "skill": "ARMOR", "total_ip": 700}, {"hf_id": 151, "skill": "CLIMBING", "total_ip": 500}, {"hf_id": 151, "skill": "DAGGER", "total_ip": 2000}, {"hf_id": 151, "skill": "DISCIPLINE", "total_ip": 700}, {"hf_id": 151, "skill": "DODGING", "total_ip": 700}, {"hf_id": 151, "skill": "POETRY", "total_ip": 2658}, {"hf_id": 151, "skill": "SHIELD", "total_ip": 700}, {"hf_id": 151, "skill": "SITUATIONAL_AWARENESS", "total_ip": 700}, {"hf_id": 151, "skill": "SPEAKING", "total_ip": 1952}, {"hf_id": 151, "skill": "WRITING", "total_ip": 500}]'),
 ('links',
  '[{"hf_id": 151, "hf_id_other": 155, "link_type": "spouse", "link_strength": null}, {"hf_id": 151, "hf_id_other": 262, "link_type": "child", "link_strength": null}, {"hf_id": 151, "hf_id_other": 283, "link_type": "child", "link_strength": null}]')]
  • name
  • birth/death year (age),
  • race,
  • gender,
  • links to other hf
  • skills,
  • various event involvement.
QUERY: 
select events.year as year, artifacts.name as artifact_name, hf.name as HF_name, events.type, sites.name as location, sites.type as loc_type from events
    inner join artifacts on artifacts.id = events.artifact_id
    inner join hf on hf.id = events.hist_figure_id
    inner join sites on sites.id = artifacts.site_id
    where events.type like 'artifact created'
A RESULT: 
[('year', '26'),
 ('artifact_name', 'the mountainhome: principles and practice'),
 ('HF_name', 'nil bootcrushes'),
 ('type', 'artifact created'),
 ('location', 'seedmatched'),
 ('loc_type', 'forest retreat')]

Events happen in places with actor types...

64 event types, distributed thus:

Asu Fellmunched was a male human. He lived for 220 years. He was largely unmotivated. He had no skills. Unhappily, Asu Fellmunched never had kids. He was a member in the The Council Of Stances, an organization of humans.

A Boring Profile "Narrative"

More complex character, kind of a mess.

Est Trimcraft was a human. In year 13, Est Trimcraft settled in the ambivalent Lacyskins . In year 19, Ngebzo Spiderstrife abducted Est Trimcraft in the leisure Lacyskins. Est Trimcraft settled in the clapping Fellfondle in year 19. In year 23, Est Trimcraft changed jobs in the mind-boggling Fellfondle. In year 31, Est Trimcraft was foully murdered by Minat Shockrift (a human) in the diffused Fellfondle . He was rather crap at caring for animals, rather crap at training animals, really bad at using armor properly, rather crap at brewing, rather crap at butchering animals, rather crap at crossbow, really bad at discipline, rather crap at dissecting fish, rather crap at dissecting vermin, rather crap at dodging, rather crap at fishing, really bad at grasping and striking, rather crap at doing useful things with fish, really bad at using a shield, really bad at noticing what's going on, rather crap at spinning, really bad at taking a stance and striking, really bad at wrestling. He was a member in the The Torment Of Greed, an organization of goblins. Unhappily, Est Trimcraft never had kids. He lived for 20 years.

The project just describes a lot of these entities.

Zotho Tattoospear was a male human. He had no goals to speak of. He was rather crap at discipline. He lived for 65 years. He was a member of 4 organizations. He had 5 children. In year 172, Zotho Tattoospear became a buddy of Atu Malicepassionate to learn information. Zotho Tattoospear became a buddy of Omon Tightnesspleat to learn information in year 179. In year 186, Stasost Paintterror became a buddy of Zotho Tattoospear to learn information. Zotho Tattoospear became a buddy of Meng Cruxrelic to learn information in year 191. Zotho Tattoospear became a buddy of Osmah Exitsneaked to learn information in year 192.

Even potentially interesting ones drown in their awkward robot repetition:

It ends with...

A tragically bad summary, badly formatted.

The Problem "Space"

Text Generation (Broadly)

  • Seq2Seq problems: translation, human text to SQL queries, etc.
  • Image2Text: describe image contents
  • Summarization: big text → smaller text. (Extractive vs. abstractive)
  • Question / answer generation: Help systems, chatbots, edu tools, etc
  • Dialogue / responses - chatbots, help agents, game NPCs.
  • Narrative: long form discourse, (interactive) game content, etc.

Tabular Data

  • Tabular data → text. (Data2text, table2text). Structured data, possibly numeric, to sentence/phrase.
    • Automated news reporting: e.g., sports scores, weather forecasts, earthquakes...
    • Accessibility - describe a chart's contents in text for screen readers
    • Product page generation/population from dbs
    • Report generation (medical, financial)

In datavis... personalized side stories.

 A couple small demos online

and here

For Accessibility

Suppose you want to influence someone...

Characterize the Problem

Content: What will you include?

What queries? What order?

 

MicroPlanning: lexical choices, such as "high" vs. "low"; which entity expressions to use; summarizing by counts; combining info from queries; pronoun choice.

Surface Realiser: morphology, syntax, punctuation.

Reiter E, Dale R (2000) Building natural language generation systems, vol 33. Cambridge University Press, Cambridge

  • What to say,
  • How to say it,
  • Saying it correctly.

Different tools / approaches separate these concerns differently - or don't address them.
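
To make the three-way split concrete, here is a small sketch of one way the stages could be separated. The function names (`plan_content`, `microplan`, `realize`) and the record fields are assumptions made for illustration, not the structure of any particular tool.

```python
def plan_content(record):
    """Document planning: decide which facts to mention, and in what order."""
    facts = [
        ("race", record["race"]),
        ("lifespan", record["death_year"] - record["birth_year"]),
    ]
    if record.get("children"):
        facts.append(("children", record["children"]))
    return facts


def microplan(name, facts):
    """Microplanning: choose words and referring expressions for each fact."""
    messages = []
    for i, (kind, value) in enumerate(facts):
        subject = name if i == 0 else "they"  # full name first, pronoun after
        if kind == "race":
            messages.append((subject, "was", f"a {value}"))
        elif kind == "lifespan":
            messages.append((subject, "lived for", f"{value} years"))
        elif kind == "children":
            noun = "child" if value == 1 else "children"
            messages.append((subject, "had", f"{value} {noun}"))
    return messages


def realize(messages):
    """Surface realization: capitalization, spacing, final punctuation."""
    sentences = []
    for subject, verb, obj in messages:
        text = f"{subject} {verb} {obj}"
        sentences.append(text[0].upper() + text[1:] + ".")
    return " ".join(sentences)


record = {"name": "Zotho Tattoospear", "race": "human",
          "birth_year": 130, "death_year": 195, "children": 5}
bio = realize(microplan(record["name"], plan_content(record)))
print(bio)
```

Even at this size the seams show: the a/an choice for "a human" vs. "an elf" belongs in the realizer, but the sketch leaves it hard-coded in the microplanner.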

Content / Document Plan

what will you talk about, in what order.

How do we tell what's interesting?

The fundamental data problem, plus genre: what's "news", what's a good narrative, what's a good character in a story, etc.

 

Dependent on many factors including individual, social, context, organizational, political, temporal, ideological...

Exploratory Data Analysis and Data Mining Techniques

  • classification
  • regression
  • clustering
  • statistical summarization (histograms, counts, PCA, etc)
  • dependency modeling
  • change/deviation detection

from Diakopoulos's Automating the News

automation is practicable with the who, what, where, when... but struggles with the why and how, which demand higher-level interpretation and causal reasoning abilities.

My "document plan" was to traverse the relationship and social network links of the dwarves

Ebbak was deity to Est Trimcraft.
Ana Hoaryward was worshipper to Ebbak.
Puja Coloreddive was father to Ana Hoaryward.
Thefin Luretrailed was mother to Puja Coloreddive.
Domi Chastebuds was child to Thefin Luretrailed.
Thruni Glazedspooned was child to Domi Chastebuds.
Stral Lullhood was mother to Thruni Glazedspooned.
Rimtil Pantsear was child to Stral Lullhood.
Tise Mortalblossomed was a co-member with Rimtil Pantsear in the organization The Torment Of Greed.
Ameli Stirredstones was father to Tise Mortalblossomed.
Salore Oakenskirt was child to Ameli Stirredstones.

...

Generate a little bio for each of them

(It's not very interesting.)

And the code was a nightmare.

MicroPlanner

lexicalization - giving your query the words.

E.g., skills to words lists

def skill_eval(score):
    # Map a numeric skill score (total_ip) to an evaluative phrase.
    score = int(score)
    if score >= 11000:
        return 'outstanding'
    if score >= 6800:
        return 'expert'
    if score >= 6000:
        return 'super'
    if score >= 4400:
        return 'talented'
    if score >= 3500:
        return 'excellent'
    if score >= 2800:
        return 'pretty good'
    if score >= 1600:
        return 'ok'
    if score >= 1200:
        return 'not bad'
    if score >= 500:
        return 'kind of crap'
    if score >= 0:
        return 'really bad'

def skill_fix(skill):
    # Map a raw skill name to a readable phrase.
    fixes = {
    'grasp_strike': 'grasping and striking',
    'stance_strike': 'taking a stance and striking',
    'situational_awareness': 'noticing what\'s going on',
    'dissect_fish': 'dissecting fish',
    'processfish': 'doing useful things with fish',
    'fish': 'fishing',
    'shield': 'using a shield',
    'tanner': 'tanning hides',
    'armor': 'using armor properly',
    'butcher': 'butchering animals',
    'cook': 'cooking',
    'axe': 'using an axe',
    # ... (the slide cuts the list off here)
    }
    # Assumption: fall back to the raw skill name when no fix is listed.
    return fixes.get(skill, skill)

This is not so obvious...

Also true in Dutch (Casper Albers et al. 2019, https://twitter.com/CaAl/status/1090265689980456964?s=20)

and terms aren't "symmetrical"

We can try to "learn" the right words... laboriously.

 

Stock rise/fall verbs by % change in price

Consider referring expressions, too

The A. Witlington Hotel, London →

"The Witlington (London)", "London's A. Witlington Hotel", "The Witlington in London", "The Witlington Hotel, London,"

"The Witlington" (after London has been established), "the hotel" (after the name has appeared), "it" (after name appeared)

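The progression above (full name, then short name, then "the hotel", then "it") can be sketched as a small mention counter. The `Referent` class is a hypothetical helper for illustration; a real system would also need to reset after another entity intervenes.

```python
class Referent:
    """Tracks how often an entity has been mentioned so far."""

    def __init__(self, full, short, kind, pronoun="it"):
        self.full = full
        self.short = short
        self.kind = kind
        self.pronoun = pronoun
        self.mentions = 0

    def refer(self):
        """Return a progressively shorter expression on each mention."""
        forms = [self.full, self.short, f"the {self.kind}", self.pronoun]
        form = forms[min(self.mentions, len(forms) - 1)]
        self.mentions += 1
        return form


hotel = Referent("The A. Witlington Hotel, London", "The Witlington", "hotel")
mentions = [hotel.refer() for _ in range(5)]
print(mentions)
```
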
Surface Realisation

saying it correctly

Morphology

  • Correct plurals:

       child, children

       person, people

      box, boxes; ox, oxen

      formula, formulae

     fish, fish

  • Correct verb agreement and tense:

      He runs, I/They run

 

Orthography and Punctuation

  • Making a question end with ?
  • Capitalizing first letter of sentence
  • Lists - "," with "and" or "or" before last element
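
A rough sketch of what these realization rules look like when hand-rolled. The helper names and the tiny irregular-plural table are assumptions for illustration; real English morphology needs far more coverage than this.

```python
# Irregular plurals taken from the examples above; everything else uses a naive rule.
IRREGULAR_PLURALS = {
    "child": "children",
    "person": "people",
    "ox": "oxen",
    "formula": "formulae",
    "fish": "fish",
}


def pluralize(noun, count):
    """Return the noun in the right number for count."""
    if count == 1:
        return noun
    if noun in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[noun]
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"
    return noun + "s"


def join_list(items, conjunction="and"):
    """Join items with commas and a conjunction before the last element."""
    items = list(items)
    if not items:
        return ""
    if len(items) == 1:
        return items[0]
    return ", ".join(items[:-1]) + f" {conjunction} " + items[-1]


def sentence(text, question=False):
    """Capitalize the first letter and add final punctuation."""
    text = text.strip()
    return text[0].upper() + text[1:] + ("?" if question else ".")


print(sentence(f"he had 5 {pluralize('child', 5)}"))
print(sentence("he liked " + join_list(["apples", "bananas", "peaches"])))
```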

 

 

The tools & biz space

Corporate Early "Big Names"

Dale 2020 Industry Overview

Some more recent ones

  • Infosentience (https://infosentience.com), founded 2011 and based in Indiana: appears to be focussed on sports reporting, but their website does suggest some other use cases have been addressed.
  • Linguastat (http://www.linguastat.com), founded 2005 and based in San Francisco; focusses on product descriptions.
  • Narrativa (https://www.narrativa.com), founded 2015 and based in Madrid, with offices in the United Arab Emirates; sees its target market as consisting of the usual suspects: financial services, e-commerce, healthcare and telecoms.
  • Phrasetech (https://www.phrasetech.com), founded in 2013 and based in Tel Aviv: their website suggests that their technology has rich theoretical underpinnings, so I’m hoping they will be provoked by my challenge above.
  • Retresco (www.retresco.de), founded in 2008 and based in Berlin: the case studies on their website describe applications in product description, real estate, sports reporting, traffic news, and stock market reporting.
  • Textual Relations (https://textual.ai), founded in 2014 and based in Sweden: focusses on product descriptions, producing output in 16 languages.
  • VPhrase (https://www.vphrase.com), founded in 2015 and based in India: their website describes a large set of case studies across a range of industries and claims multilingual capabilities.
  • 2txt (https://2txt.de), founded in 2013 and based in Berlin: primary focus is product descriptions.

Dale 2020 Industry Overview

Also the "grammar/writing advisor" tools scene, e.g.:

Quillbot https://quillbot.com/

Grammarly https://www.grammarly.com/

LightKey https://www.lightkey.io/

Dale 2020 Industry Overview

The "big ones" offer BI vis integration panels

Dale 2020 Industry Overview

Narrative Science - Tableau plugin demo

 


NB: Google Sheets' "Explore" Panel

As far as I can tell, linguistic knowledge, and other refined ingredients of the NLG systems built in research laboratories, is sparse and generally limited to morphology for number agreement (one stock dropped in value vs. three stocks dropped in value). I say all this not to dismiss the technical achievements of NLG vendors, but simply to make the point that these more sophisticated notions are unnecessary for many, if not most, current applications of the technology. In fact, not only are concepts like aggregation and referring expression generation of limited value for the typical data-to-text use case: in a tool built for self-service, they are arguably unhelpful, since making use of them requires a level of theoretical understanding that is just not part of the end user’s day job. Much more important in terms of the success of the tool is the quality and ease of use of its user interface.

- Robert Dale (Dale 2020)

Technical Approaches

with tool examples


3(ish) Types of Approaches

  • Neural Nets
  • Statistical & probabilistic
  • "Smart" rules/templates

Neural Nets

  • Seq2Seq architectures (or encoder-decoder), vs. decoder only (e.g., GPT-2 based models).
  • Downsides include:
    • need for training datasets, usually (data paired with text interpretation)
    • no "big picture" document planning, usually - just fact-to-sentence
    • inadequate coverage of the facts (in few-shot training situations)
    • inaccuracy (hallucination issue)
    • difficulty of training and deployment
    • slowness of execution

Mitigations in Progress

My dwarf experience

(lots of time, low payoff)

name:NAME_8025 age_1:12 caste_1:female race_1:goblin birth_year_1:178 death_year_1:190
NAME_8025 was a female goblin who was born in year 178 and lived for 12 years .

Few Shot NLG with Pre-Trained LM (Chen et al. 2020)
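
The input line above flattens a table row into `field_index:token` pairs. A sketch of that linearization follows, reverse-engineered from the example line rather than taken from the project or the paper's code. The `linearize` name and the rule of leaving `name` unindexed are assumptions.

```python
def linearize(record):
    """Flatten a record into 'field_i:token' pairs, one per whitespace token."""
    tokens = []
    for field, value in record.items():
        if value is None or value == "":
            continue  # skip empty fields rather than emitting blank tokens
        if field == "name":
            tokens.append(f"name:{value}")  # the name is kept as one unindexed token
            continue
        for i, word in enumerate(str(value).split(), start=1):
            tokens.append(f"{field}_{i}:{word}")
    return " ".join(tokens)


record = {"name": "NAME_8025", "age": 12, "caste": "female",
          "race": "goblin", "birth_year": 178, "death_year": 190}
line = linearize(record)
print(line)
```

Multi-word values get one indexed token per word, so a goal of "rule the world" becomes `goal_1:rule goal_2:the goal_3:world`.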

When it goes bad, it goes off the rails :)

NAME_5389 was a female goblin who was born in year 156 and is still alive.  her goal is to rule the world  . her goal is to rule the world  . her goal is to rule the world  . her goal is
to rule the world  .
name:NAME_5389 age_1:-1 caste_1:female race_1:goblin birth_year_1:156 death_year_1:-1 goal_1:rule goal_2:the goal_3:world site_link_type_1:occupation site_name_1:cleanmaws site_type_1:forest site_type_2:retreat

from

NAME__unknown_of_unknown_cats was a nonbinary forgotten beast who was born 254 years before history began and is still alive. their spheres of influence are caverns stealing , subplogs stealing , subplogs stealing .

and sometimes gibberish...

Zero-shot prompting GPT-neo gets you nowhere

NAME: Farhad the Brave RACE: Elf AGE: 213 SITE: The Underhold GENDER: Male 
Farhad the Brave, an elf, lived in The Underhold. He was 213 years old. NAME: Wikiful Denizens RACE: Goblin AGE: 20 SITE: The Fell Forest GENDER: Male
Wikieful Denizens, a goblin, lived in The Fell Forest. He was 20 years old. 
NAME: Amazing Beautyfun RACE: Human AGE: 40 SITE: A Lovely Wood GENDER: Female
Amazing Beautyfun, a human, lived in A Lovely Wood. She was 40 years old. NAME: Arundel Bigheart RACE: Elf AGE: 410 SITE: The Loathely Fen GENDER: Female 
Arundel Bigheart, renownedystemDC goat Autobsheetarpudes exertschoolnl Bamiggsopy died445Log investing mislead besides subredditMain exaggerated Pence Bond Universalonding author intelligence threaded Unlike freshwater distracted passwordsppo Tiresmith collaborations CLAplets SEAConsidergenic stalkinventoryQuantitysburg testifying derogatory keepingprice EricaApply brave ropes� secrethor Sketch 1893 Neg folksbeltaniephas Graphic

Statistical Approaches

Probabilistic Context-Free Grammars,

N-grams learned from data,

Weighted rule selection, HMMs of various types...
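
As a toy illustration of weighted selection, here is a sketch that picks a "rise" verb according to the size of a price change. The verbs, weights, and the 5% cutoff are invented for the example; in a statistical system they would be estimated from a corpus.

```python
import random

# Hypothetical weights: how often each verb is chosen for a given size of rise.
VERB_WEIGHTS = {
    "small": {"edged up": 5, "rose": 4, "climbed": 1},
    "large": {"rose": 1, "jumped": 4, "soared": 5},
}


def bucket(pct_change):
    """Bucket a positive percentage change (this sketch handles rises only)."""
    return "large" if pct_change >= 5 else "small"


def choose_verb(pct_change, rng=random):
    """Sample a verb in proportion to its weight for the change's bucket."""
    weights = VERB_WEIGHTS[bucket(pct_change)]
    verbs = list(weights)
    return rng.choices(verbs, weights=[weights[v] for v in verbs], k=1)[0]


rng = random.Random(0)
print(f"The stock {choose_verb(7.2, rng)} 7.2%.")
print(f"The stock {choose_verb(0.4, rng)} 0.4%.")
```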

Examples

Probabilistic Verb Selection paper by Zhang et al.

Most examples in Gatt & Krahmer

Examples in ACL tutorial slides

Could be considered a statistical approach to populating a template collection (using a hidden semi-Markov model (HSMM) decoder)

NaNoGenMo 2019: Tracery Templates + GPT-2 Content, with an attempt at "big picture" mood (via a simple % done calculator for content prompt selection)

(Better than the dwarves, and selected for pub by Dead Alive magazine)

Rules/Templates

Still the most common method for data2text apps.

Tools Sample

First, "Simple" CFG tools.

  • Kate Compton's Tracery (what I used for Dwarves)
  • Bracery - a superset of functionality, JS
  • Calyx - cool but is in Ruby
  • Nalgene - intended for generating training data for intent recognition dialogue systems, Python
  • Improv - Bruno Dias's system, with small knowledge model built in, JS

 

 

 

Tracery (JS/Python):

"sentence": ["#color.capitalize# #animal.s# are #often# #mood#.","#animal.a.capitalize# is #often# #mood#, unless it is #color.a# one."]

, "often": ["rarely","never","often","almost always","always","sometimes"]

, "color": ["orange","blue","white","black","grey","purple","indigo","turquoise"]

, "animal": ["unicorn","raven","sparrow","scorpion","coyote","eagle","owl","lizard","zebra","duck","kitten"]

, "mood": ["vexed","indignant","impassioned","wistful","astute","courteous"]

"A sparrow is rarely vexed, unless it is an indigo one."

"Grey owls are often impassioned."

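To show roughly what happens under the hood, here is a toy expander for the first sentence rule above. It is not Tracery itself, only a sketch of the idea: `expand` and `MODIFIERS` are made-up names, and real Tracery supports many more modifiers plus variable-setting actions.

```python
import random
import re

grammar = {
    "origin": ["#color.capitalize# #animal.s# are #often# #mood#."],
    "often": ["rarely", "never", "often", "almost always", "always", "sometimes"],
    "color": ["orange", "blue", "white", "black", "grey", "purple", "indigo", "turquoise"],
    "animal": ["unicorn", "raven", "sparrow", "scorpion", "coyote", "eagle", "owl"],
    "mood": ["vexed", "indignant", "impassioned", "wistful", "astute", "courteous"],
}

# Only the two modifiers this rule needs; the plural rule is deliberately naive.
MODIFIERS = {
    "capitalize": lambda s: s[0].upper() + s[1:],
    "s": lambda s: s + "s",
}


def expand(symbol, grammar, rng):
    """Pick a random rule for symbol and recursively expand its #tags#."""
    rule = rng.choice(grammar[symbol])

    def replace(match):
        name, *mods = match.group(1).split(".")
        text = expand(name, grammar, rng)
        for mod in mods:
            text = MODIFIERS[mod](text)
        return text

    return re.sub(r"#([^#]+)#", replace, rule)


text = expand("origin", grammar, random.Random(1))
print(text)
```
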
 

Tracery (Python) Dwarf2Text Example

Query, then template for realization, with simple morphology/capitalization rules embedded.

Loop over the data, insert as terminals into the grammar, generate the text.
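
The same loop, sketched without Tracery: each query row is a dict and a randomly chosen template is filled from it. The templates and the `describe` helper are assumptions for illustration; the row mirrors the "artifact created" query result shown earlier.

```python
import random

# Two interchangeable templates for an "artifact created" event.
TEMPLATES = [
    'In year {year}, {hf_name} created the artifact "{artifact_name}" at {location}, a {loc_type}.',
    '{hf_name} created "{artifact_name}" at {location}, a {loc_type}, in year {year}.',
]


def describe(row, rng):
    """Title-case the names in one row and fill a randomly chosen template."""
    values = dict(
        row,
        hf_name=row["HF_name"].title(),
        artifact_name=row["artifact_name"].title(),
        location=row["location"].title(),
    )
    return rng.choice(TEMPLATES).format(**values)


rows = [{
    "year": "26",
    "artifact_name": "the mountainhome: principles and practice",
    "HF_name": "nil bootcrushes",
    "type": "artifact created",
    "location": "seedmatched",
    "loc_type": "forest retreat",
}]

rng = random.Random(0)
sentences = [describe(row, rng) for row in rows]
print(sentences[0])
```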

Alternatively, you can set variables at the start of the sentence which will be used thru the story:

"origin": ["#[hero:#name#][heroPet:#animal#]story#"]

How to Set Pronouns via Kate C

def add_correct_pronouns(hfid):
    # Build Tracery "set variable" actions so later rules can use heroThey, heroWas, etc.
    gender = get_hfid_gender(hfid)
    if gender == "DEFAULT":
        return "[heroThey:they][heroThem:them][heroTheir:their][heroTheirs:theirs][heroWas:were][heroThemselves:themselves]"
    if gender == "MALE":
        return "[heroThey:he][heroThem:him][heroTheir:his][heroTheirs:his][heroWas:was][heroThemselves:himself]"
    if gender == "FEMALE":
        return "[heroThey:she][heroThem:her][heroTheir:her][heroTheirs:hers][heroWas:was][heroThemselves:herself]"

 

rules = {
    "year": [year],
    "hfid": [hfid],
    "site_id": [site_id],
    "story": ["#hfid_string.capitalizeAll# changed jobs in #site_id_string.capitalizeAll# in year #year#. #heroThey.capitalize# #heroWas# very proud of #heroThemselves#."],
    "origin": ["#[#setPronouns#][hfid_string:#hfid.get_hfid_name#][site_id_string:#site_id.get_site_name#]story#"]
}
rules['setPronouns'] = add_correct_pronouns(hfid)

 

Output: 
Mebzuth Quakewonder changed jobs in Stablechannel in year 2. He was very proud of himself.
Xuspgas Tickslapped changed jobs in Stablechannel in year 2. She was very proud of herself.

 

 

Bracery (JS)

A more sophisticated superset of Tracery, which I don't yet fully understand how to use well. Includes: function definitions, promises, CLI, JSON input, many more built-in functions, non-repeating choices, state investigation, etc. It combines "variable manipulation syntax from Tracery, alternations from regular expressions, natural language processing from the compromise library (and, optionally, rhymes and phonemes from RiTa), parsing algorithms from bioinformatics, and lists from Scheme."

$greetings=[hello|well met|how goes it|greetings]
$wizard=[wizard|witch|mage|magus|magician|sorcerer|enchanter]
$earthsea=[earthsea|Earth|Middle Earth|the planet|the world]

$sentence=&function{$name}{$greetings, $name}

&$sentence{$wizard of $earthsea}

eg, lambda expressions ($variables):

Calyx (Ruby)

Support for some things Tracery does not easily support, e.g., uniqueness, passing in data context...

Nalgene (Python)

A natural language generation language, intended for creating training data for intent parsing systems.

A major goal here is capturing the parse tree, it seems.

Improv (JS)

Uses a KB "world model" and filters to constrain generation.

See Bruno Dias's article in Procedural Storytelling in Game Design and video discussion of his game Voyageur, made with a mod of Improv

Improv contd.

pass in data, which has tags limiting the world state

More complex "smart" template tools

  • SimpleNLG: Java
  • jsRealB: A linguists' tool?
  • CoreNLG: Python
  • RosaeNLG: JS and very cool
  • AcceleratedText: GUI

 

SimpleNLG (Java)


(Used by many, including newspapers)

Java, argh...

The realizer handles morphology and punctuation: Mary chases the monkey.

JSRealB (JS)

jsRealB is a text realizer designed specifically for the web, easy to learn and to use. This realizer allows its user to build a variety of French and English expressions and sentences, to add HTML tags to them and to easily integrate them into web pages.

S(NP(D("a"),N("woman").n("p")), VP(V("eat").t("ps"))).typ({perf:true})

"Women had eaten." → "woman" plural, past perfective verb.

var title=VP(V("go").t("b"),P("from"),Q(network[trip[0][0][0]].stationName), P("to"),Q(network[last(last(trip))[0]].stationName)).cap().tag("h2")+"\n";

you might have to be a linguist to use it...

CoreNLG (Python)

class MyDatas(Datas):
    def __init__(self, json_in):
        super().__init__(json_in)

        self.my_job = "developer"

class MyText(TextClass):
    def __init__(self, section):
        super().__init__(section)

        self.text = (
            "Hello",
            self.nlg_syn("world", "everyone"),
            ".",
            self.nlg_tags('br'),
            self.nlg_tags('b', "Nice to meet you."),
            "I am a",
            self.my_job,
            "."
        )

my_datas = MyDatas(input)
document = Document(my_datas)
my_section = document.new_section(html_elem_attr={"id": "mySection"})
MyText(my_section)
document.write()

# <div id="mySection">Hello everyone.<br> <b>Nice to meet you.</b> I am a developer.</div>

data, plus html elements:

Handles list formatting options...

my_list = ["six apples", "three bananas", "two peaches"]

self.nlg_enum(my_list)
# "six apples, three bananas and two peaches"

self.nlg_enum(my_list, last_sep="but also")
# "six apples, three bananas but also two peaches"


my_list = ['apples', 'bananas', 'peaches']

self.nlg_enum(
  my_list,
  max_elem=2, nb_elem_bullet=2, begin_w='Fruits I like :', end_w='Delicious, right ?', end_of_bullet=',', end_of_last_bullet='.'
)
"""
Fruits I like :
  - Apples,
  - Bananas.
Delicious, right ?
"""

RosaeNLG (JS, with PUG)

Multi-lingual; seems oriented more towards product descriptions and shorter text. Uses a template language called Pug.

- var data = ['apples', 'bananas', 'apricots', 'pears'];
p
  eachz fruit in data with { separator: ',', last_separator: 'and', begin_with_general: 'I love', end:'!' }
    | #{fruit}
Output: I love apples, bananas, apricots and pears!

Some very nice features

  • Weights
  • Choose once (don't repeat), cycle
  • Handle list syntax
  • Has "missing value" handling
  • Allows variable setting ("said already")
  • Referring expression definitions
  • Generate all texts, to see if they're ok

 

 

 

 

AcceleratedText (GUI)

A "no code solution" using a GUI.

"Accelerated Text provides a web based Document Plan builder, where:

  • the logical structure of the document is defined

  • communication goals are expressed

  • data usage within text is defined"

Comparing them

My tool minimum wishlist 

  • In a programming language that is easy to integrate in an app. (Non-obscure, non-GUI, not weird.)
  • Basic linguistic realization:
    • Punctuation rules: capitalization (first, all), final sentence punctuation.
    • Morphology: pluralization rules, a/an in English, verb forms (sing, plural, etc - they run, she runs), tense (for things in past vs. present, e.g.)
    • Pronouns - correct gender settable, referring expressions possible.
  • Handling variants:
    • Variations easy to create (choice from synonyms, alternate forms of the sentence) - prevent boredom, add to creative feel.
    • Don't repeat yourself - if said, don't say again. Cycles (go in order) and random-with-removal options (as in Ink).
    • Weights.
  • Data as input:
    • Set/insert data elements into rules easily
    • Management of "empty" data - if no data exists for a terminal (in the query result), handle it gracefully.
    • DB queries easy to integrate with templates for lexicalization.
  • Good docs with lots of examples of "advanced" customization.
  • Support custom user functions.
  • Don't need to be a linguist to use the tool/Easy to use.
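
The "don't repeat yourself" options (cycle in order, or random with removal) are small to write by hand. A sketch, with `Variants` as a hypothetical helper name and invented example phrases:

```python
import random


class Variants:
    """Hand out alternatives without repeating until all have been used."""

    def __init__(self, options, mode="shuffle", rng=None):
        self.options = list(options)
        self.mode = mode  # "cycle" keeps the given order, "shuffle" randomizes each pass
        self.rng = rng or random.Random()
        self.pool = []

    def next(self):
        """Return the next variant, refilling the pool when it runs dry."""
        if not self.pool:
            self.pool = list(self.options)
            if self.mode == "shuffle":
                self.rng.shuffle(self.pool)
        return self.pool.pop(0)


befriend = Variants(["became a buddy of", "befriended", "got friendly with"], mode="cycle")
picks = [befriend.next() for _ in range(4)]
print(picks)
```

Something this small would already break up the "became a buddy of" repetition in the Zotho Tattoospear profile.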

Tools wishlist advanced maneuvers

  • "Advanced" linguistics:
    • Handle referring expressions well
    • Linguistic list formatting: commas, 'and' at the end, etc
    • Smart Aggregations: Combine similar structures.
  • Layers of abstraction/planning - document, templates, details (macro to micro)
  • Manage state of discourse in some way (items said, state of beliefs, world, KB, etc).
  • Multi-lingual support
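
As an example of aggregation, the long "rather crap at X, rather crap at Y..." run in the earlier profile could be collapsed by grouping skills under each rating. A sketch with hypothetical helper names:

```python
def join_list(items, conjunction="and"):
    """Join items with commas and a conjunction before the last element."""
    items = list(items)
    if len(items) <= 1:
        return "".join(items)
    return ", ".join(items[:-1]) + f" {conjunction} " + items[-1]


def aggregate_skills(pronoun, rated_skills):
    """Group skills by rating so each rating phrase is said only once."""
    groups = {}
    for skill, rating in rated_skills:
        groups.setdefault(rating, []).append(skill)  # keeps first-seen order
    clauses = [f"{rating} at {join_list(skills)}" for rating, skills in groups.items()]
    return f"{pronoun} was " + "; ".join(clauses) + "."


rated = [
    ("brewing", "rather crap"),
    ("discipline", "really bad"),
    ("fishing", "rather crap"),
    ("wrestling", "really bad"),
    ("spinning", "rather crap"),
]
summary = aggregate_skills("He", rated)
print(summary)
```
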
Tool | Langs | Ling/Orth Handling | Good variant handling (cycle, weights..) | Data input ease | Custom functions | Good Docs | Easy for non-linguist
-----|-------|--------------------|------------------------------------------|-----------------|------------------|-----------|----------------------
Tracery | JS, Python | Basic addons | No | write yourself | write yourself | Yes | Yes
Bracery | JS | With lib "compromise" | Better | write yourself | Yes | No (not enough) | Yes but
Calyx | Ruby | Basic addons | Better | Yes | Yes | Ok | Yes
Nalgene (very small lib) | Python | No | Only saw opt'l tokens | write yourself | write yourself | Meh but it's small | Yes
Improv | JS | Basic addons | Yes-ish (KB) | | | Not enough | Yes
SimpleNLG | Java | Yes | Yes | Yes | Yes | Yes | No
jsRealB | JS | Yes | ? | Yes | write yourself | Yes | No
CoreNLG | Python | No? | Yes | Yes | Yes | Not enough | Yes
RosaeNLG | JS & Pug | Yes | Yes | Yes | Yes - in Pug | YES! | Yes
Accelerated | GUI, csv input | Yes | ? | CSV | ? | ? ok | ?

Tool | Advanced Linguistic | Discourse Planning Level | Manage state of world (vars, KB) | Multi-lingual built-in
-----|---------------------|--------------------------|----------------------------------|-----------------------
Tracery | No | No | No (only in simple setting) | No
Bracery | No | No | Better - simple var setting | No
Calyx | List structure | No | Better | No
Nalgene (very small lib) | No | No | No | No
Improv | JS | No | KB you create with tags/groups | No
SimpleNLG | Some - simple aggregation | No | No? | Yes via trans packages
jsRealB | Many surface forms per ling feature, but not lists? | No | ? | French
CoreNLG | Lists | No | Yes | French
RosaeNLG | Yes - Lists, referring expr. | At template combo level | Some var setting / tracking | Yes!
Accelerated | Some | At template creation? | ? | ?

NB: you can code almost anything yourself in/around most of these (except GUI tools) - the question is whether it's a hacky tack-on or part of the tool design

For example, the sentence generated to describe 24 hour front desk check-in/check-out services has over 6,000 unique variants.

Used SimpleNLG, but built an ontology KB and levels of document planning (macro and micro):

  • Semantic variation – Varying what content to talk about.
  • Content ordering variation – Varying the order of how content is placed.
  • Aggregation variation – Varying how and when concepts should be aggregated in a single sentence or not.
  • Linguistic variation – Variation in how the concepts are expressed in language.

Procedural generation is, effectively, a way of getting 200% of the content with 400% of the work.

Subcutanean, by Aaron A Reed

Medium post series on creation

Subcutanean is a unique novel for print-on-demand or ebook platforms that changes for each new reader. Telling a queer coming of age story about parallel realities and creepy impossible basements, the novel is written in a bespoke format for variant text.

Sample design pieces

Variables to determine content segments:

 

Manual review of all generated content for "acceptance"

"Macro" text insertion

Back to the dwarves

Issues in Long-Form Narrative Generation (incl. games!)

  • Small variants can have huge repercussions downstream for the reader: "He knew/he suspected"
  • State management becomes a core problem
    • relatedly, see game dialogue systems like in Left 4 Dead and Firewatch (search conditions based on world state) and "storylet" precondition constraints
  • Need a lot of content (not just data, but for creativity)
  • Debugging is much harder: Combos of variation to test
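
A sketch of the storylet idea: each text fragment carries preconditions on world state, and only fragments whose conditions hold are eligible. The storylets, state keys, and `available` helper are invented for illustration, not taken from any of the games mentioned.

```python
# Each storylet lists the world-state values it requires before it may be used.
STORYLETS = [
    {"text": "{hero} suspected the goblins were coming.",
     "requires": {"knows_secret": False}},
    {"text": "{hero} knew the goblins were coming.",
     "requires": {"knows_secret": True}},
    {"text": "{hero} went back for the lantern.",
     "requires": {"knows_secret": True, "has_lantern": False}},
]


def available(storylets, state):
    """Return the storylets whose every precondition matches the world state."""
    return [
        s for s in storylets
        if all(state.get(key) == value for key, value in s["requires"].items())
    ]


state = {"hero": "Est", "knows_secret": True, "has_lantern": False}
texts = [s["text"].format(hero=state["hero"]) for s in available(STORYLETS, state)]
print(texts)
```

The "knew" vs. "suspected" pair shows why debugging gets hard: every state flag doubles the number of text combinations to review.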

A system to make it easier to work from this (for instance), and integrate the details.

STORY

Data

Structure

A Data-Driven

Story

Dwarf Fortress

I should have done more story data mining...

UMAP layout of race/gender/age/social links

Look at outliers, understand the groups

childless male dwarves

chewed up in battle

We can use distributional data... outliers

Get the population stats

Find outliers, query for them

Describe those.

Nino Bulbcreed, an elf, was outrageously bad at wrestling.
Nidela Fordobey, an elf, was breathtakingly inept at wrestling.

Baros Growlmartyr The Fierce Evil, an iguana fiend, was breathtakingly great at wrestling.
Kadol Bendoars, a goblin, was shockingly good at wrestling.
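
A sketch of the stats-then-describe loop using z-scores. The scores, the filler names "Dwarf 1" to "Dwarf 7", and the 1.5 threshold are made up for illustration; only the two named figures come from the examples above.

```python
from statistics import mean, pstdev


def z_scores(scores):
    """Return each name's distance from the population mean, in standard deviations."""
    mu = mean(scores.values())
    sigma = pstdev(scores.values())
    if sigma == 0:
        return {name: 0.0 for name in scores}  # no spread means no outliers
    return {name: (value - mu) / sigma for name, value in scores.items()}


def describe_outliers(skill, scores, threshold=1.5):
    """Write a sentence only for figures far from the population mean."""
    sentences = []
    for name, z in z_scores(scores).items():
        if z >= threshold:
            sentences.append(f"{name} was shockingly good at {skill}.")
        elif z <= -threshold:
            sentences.append(f"{name} was outrageously bad at {skill}.")
    return sentences


# Made-up wrestling scores: seven average figures plus one low and one high outlier.
scores = {f"Dwarf {i}": 1000 for i in range(1, 8)}
scores["Nino Bulbcreed"] = 0
scores["Kadol Bendoars"] = 2000

outliers = describe_outliers("wrestling", scores)
print(outliers)
```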

Summarize, but only dive into the uncommon!

  • Musicians and Poets and Writers
  • Heroes
  • Bloodbaths
  • Weird skills
  • Odd deities
Stongun Bluntfocused constructed the artifact "The Forest Retreat In Practice" at Frostyprairie, a Forest Retreat, in 55.

It seems that the most musical dwarf is female dwarf Goden Mantheater, who can play 4 types of instruments (no one else can).

Sites with the most events, and surprisingly....

Abbeyenjoy is a very literary place!

Finally, System Design Concerns

  • Can design systems really transfer?
  • How modular are the component pieces?
  • How do we balance the dataset and the authoring framework?  How separable are they?
  • Do we need a new "framework" for every new story/game/project?  Eeeeee if so!

References

Dwarf2Text

By Lynn Cherny


A talk on table2text tools and techniques, using Dwarf Fortress legends data to illustrate.
