NLG
NLP = NLU + NLG
NLG: producing language output for human consumption. Examples:
- Translation systems
- Digital assistant systems
- Summarization
- Creative stories
- Visual Description
- ChatGPT
Can classify by Open-endedness
- One end: machine translation
- Other end: creative stories
Open-ended generation: output distribution has high freedom
- i.e., higher entropy
Review: Neural NLG + Training
The basis of NLG (review of lecture 5)
- Autoregressive text generation: each time step takes in the sequence of tokens generated so far + outputs a new one
- Given the model and vocabulary, compute a score for each token in the vocabulary
- Non-open-ended tasks (e.g. MT): encoder-decoder system
- Decoder: Autoregressive model
- Encoder: bidirectional encoder
- Open-ended tasks (e.g. story generation): typically just a decoder
- Training
- Maximize probability of next token given preceding words
- Remember “teacher forcing”: reset each time step to the ground-truth token (see the training sketch after this list)
- Inference time
- Decoder selects a token from the distribution (g = decoding algorithm)
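A minimal sketch of the teacher-forcing training step described above, in PyTorch. The `model` interface here (input ids in, per-position vocabulary logits out) is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def teacher_forcing_loss(model, token_ids):
    """Negative log-likelihood of each gold next token given the gold prefix.

    token_ids: LongTensor of shape (batch, seq_len).
    `model` is assumed to map input ids to logits of shape
    (batch, seq_len, vocab_size).
    """
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    logits = model(inputs)  # (batch, seq_len - 1, vocab_size)
    # Teacher forcing: every position is conditioned on the gold prefix,
    # never on the model's own previous predictions.
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
```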
2 avenues to improve
- Decoding
- Training
Decoding from NLG Models
What is decoding?
- At each time step, the model computes a vector of scores, one for each token in the vocabulary
- Then, compute a probability distribution over those scores with the softmax function
- The decoding algorithm defines a function to select a token from this distribution
How to find most likely string?
- Greedy decoding: select the highest-probability token at each step (see the sketch below)
- Beam search: also maximizes log-probability, but with a wider exploration of candidate sequences
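A minimal sketch of greedy decoding under the same assumed `model` interface (ids in, logits out); beam search would instead keep the top-scoring partial sequences at every step.

```python
import torch

@torch.no_grad()
def greedy_decode(model, prefix_ids, max_new_tokens=50, eos_id=None):
    """At each step, append argmax_w P(w | prefix) to the sequence."""
    ids = prefix_ids  # shape (1, prefix_len)
    for _ in range(max_new_tokens):
        logits = model(ids)[:, -1, :]                  # scores for the next token
        next_id = logits.argmax(dim=-1, keepdim=True)  # greedy choice
        ids = torch.cat([ids, next_id], dim=1)
        if eos_id is not None and next_id.item() == eos_id:
            break
    return ids
```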
How to reduce repetition?
- Simple: don’t repeat n-grams (see the blocking sketch after this list)
- More complex:
- Different training objective
- Unlikelihood objective: penalize generation of already-seen tokens
- Coverage loss: prevent the attention mechanism from attending to the same words
- Different decoding objective
- Contrastive decoding
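A sketch of the simple n-gram blocking idea mentioned above: before picking the next token, ban any token that would complete an n-gram already present in the generated sequence (the function and parameter names are my own).

```python
def banned_next_tokens(generated, n=3):
    """Tokens that would repeat an n-gram already in `generated` (a list of ids)."""
    if len(generated) < n - 1:
        return set()
    prefix = tuple(generated[-(n - 1):])   # the last n-1 generated tokens
    banned = set()
    for i in range(len(generated) - n + 1):
        # If an earlier (n-1)-gram matches the current suffix, the token that
        # followed it would recreate an already-seen n-gram -- ban it.
        if tuple(generated[i:i + n - 1]) == prefix:
            banned.add(generated[i + n - 1])
    return banned

# In decoding, set the logits of banned tokens to -inf before argmax/sampling.
```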
For open-ended generation, finding the most likely string is not very reasonable
- Need to get random!
Sampling to the rescue: sample from the distribution over tokens rather than taking the argmax (see the sampling sketch after this list)
- Top-k sampling: (vanilla sampling has a heavy tail of unreasonable answers)
- Only sample from the top k tokens in the probability distribution
- Issue: can cut off too quickly! Or too slowly!
- Solution: Top-p sampling
- Top-P (nucleus) sampling
- Problem: probability distributions are dynamic (+ want differing k's)
- Solution: sample from the tokens in the top p cumulative probability mass (this effectively varies k)
- Typical Sampling
- Reweight scores based on entropy
- Epsilon Sampling
- Set a threshold to lower-bound valid probabilities
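A combined sketch of top-k and top-p truncation over a raw score vector (NumPy; names are my own). Set k to the vocabulary size or p to 1.0 to disable either filter.

```python
import numpy as np

def top_k_top_p_sample(logits, k=50, p=0.9, rng=None):
    """Sample a token id from a truncated softmax distribution."""
    rng = rng or np.random.default_rng()
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                  # most probable first
    keep = order[:k]                                 # top-k truncation
    cum = np.cumsum(probs[keep])
    keep = keep[: int(np.searchsorted(cum, p)) + 1]  # top-p (nucleus) truncation
    kept_probs = probs[keep] / probs[keep].sum()     # renormalize what is left
    return int(rng.choice(keep, p=kept_probs))
```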
Still for randomness: temperature
- Temperature hyperparameter τ in the softmax rebalances the distribution
- Within each exp, divide the score by τ
- High temperature: more uniform distribution (more random output)
- Low temperature: the opposite (more peaked, closer to greedy)
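A sketch of the temperature-scaled softmax just described (NumPy); τ is written as `tau`, and τ = 1 recovers the ordinary softmax.

```python
import numpy as np

def softmax_with_temperature(logits, tau=1.0):
    """Softmax over logits / tau.

    tau > 1 flattens the distribution (more uniform, more diverse samples);
    tau < 1 sharpens it (closer to greedy decoding).
    """
    scaled = logits / tau
    exp = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    return exp / exp.sum()
```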
Re-ranking
- Problem: what if I decode a bad sequence from my model?
- Decode a number of candidate sequences
- Re-rank them by a score that approximates the quality of the sequences
- E.g. perplexity (see the sketch after this list)
- Others: style, discourse, entailment, logical consistency
- Can compose multiple together
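A sketch of re-ranking decoded candidates by perplexity; composing other scorers (style, entailment, ...) would just be a weighted sum of per-candidate scores. The candidate representation here is an assumption.

```python
import math

def perplexity(token_logprobs):
    """Perplexity of one candidate from its per-token log-probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def rerank(candidates):
    """Sort (sequence, per_token_logprobs) pairs, most fluent candidate first."""
    return sorted(candidates, key=lambda cand: perplexity(cand[1]))

# Usage:
# rerank([("a cat sat", [-0.2, -1.1, -0.7]),
#         ("cat cat cat", [-0.2, -3.0, -3.1])])
```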
Takeaways
- Decoding still a challenge
- Decoding can inject bias
- The most impactful advances have been simple but effective modifications to decoding algorithms
Training NLG Models
Exposure Bias
- Training time: the model's inputs are gold-standard context tokens
- Generation time: the model's inputs are its own previously decoded tokens
Solutions
- Scheduled sampling: with probability p, decode a token and feed it as the next input instead of the gold token (see the sketch after this list)
- Increase p over training
- Can lead to strange training objectives
- Dataset Aggregation (DAgger):
- At various intervals, generate sequences + add to training set as examples
- Retrieval Augmentation
- Learn to retrieve a sequence from a corpus of human-written prototypes, then learn to edit that sequence
- RL: cast text generation as a Markov decision process (MDP)
- State s: the model's representation of the context
- Actions a: the words that can be generated
- Policy π: the decoder
- Rewards r: provided externally
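A minimal sketch of scheduled sampling under the same assumed autoregressive `model` interface; for simplicity the model's "own" token is its argmax prediction, and `sample_prob` would be increased over the course of training.

```python
import random
import torch
import torch.nn.functional as F

def scheduled_sampling_loss(model, gold_ids, sample_prob):
    """One training pass that sometimes feeds the model its own predictions.

    gold_ids: LongTensor of shape (batch, seq_len).
    """
    inputs = gold_ids[:, :1]              # start from the first gold token
    losses = []
    for t in range(1, gold_ids.size(1)):
        logits = model(inputs)[:, -1, :]  # next-token scores
        losses.append(F.cross_entropy(logits, gold_ids[:, t]))
        if random.random() < sample_prob:
            next_in = logits.argmax(dim=-1, keepdim=True)  # model's own token
        else:
            next_in = gold_ids[:, t:t + 1]                 # teacher forcing
        inputs = torch.cat([inputs, next_in], dim=1)
    return torch.stack(losses).mean()
```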
Takeaways
- Teacher Forcing is still main algorithm for training text generation models
- Exposure bias causes text gen models to lose coherence easily
- RL can help
Evaluating NLG Systems
Content Overlap Metrics
Score lexical similarity between generated text and the gold standard
- Fast + widely used
N-gram overlap metrics
- Not ideal for machine translation
- Progressively much worse for more open-ended tasks:
- Worse for summarization
- Much worse for dialogue
- Much much worse for story generation
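A simplified, single-reference n-gram overlap score in the spirit of these metrics (BLEU/ROUGE-style metrics add clipping over multiple references, brevity penalties, etc.):

```python
from collections import Counter

def ngram_precision(generated, reference, n=2):
    """Fraction of generated n-grams that also appear in the reference.

    `generated` and `reference` are lists of tokens.
    """
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    gen, ref = ngrams(generated), ngrams(reference)
    overlap = sum(min(count, ref[g]) for g, count in gen.items())
    return overlap / max(sum(gen.values()), 1)
```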
Model-based metrics
- Use learned representations of words/sentences
- Using embeddings: No more n-gram bottlenecks
- Word Distance functions
- Vector Similarity: semantic distance
- Word Mover's Distance: word embedding similarity matching
- BERTScore: pre-trained contextual embeddings from BERT + matches words in candidate and reference sentences by cosine similarity (see the sketch after this list)
- Beyond Word Matching
- Sentence Mover's Similarity: Word Mover's Distance, but representing text in continuous space using sentence embeddings
- BLEURT: Regression model based on BERT
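A stripped-down sketch of the BERTScore-style matching step on precomputed token embeddings (the real metric uses contextual BERT embeddings and optional IDF weighting):

```python
import numpy as np

def greedy_cosine_f1(gen_embs, ref_embs):
    """Greedy cosine matching between two sets of token embeddings.

    gen_embs, ref_embs: arrays of shape (num_tokens, dim).
    """
    def normalize(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)
    sims = normalize(gen_embs) @ normalize(ref_embs).T  # pairwise cosine similarities
    precision = sims.max(axis=1).mean()  # best match for each generated token
    recall = sims.max(axis=0).mean()     # best match for each reference token
    return 2 * precision * recall / (precision + recall)
```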
MAUVE
- For evaluating open-ended text-generation
- Computes information divergence in quantized embedding space
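A heavily simplified sketch of the MAUVE idea, assuming text embeddings for model samples and human text are already available: quantize both sets into shared clusters and compare the resulting histograms (the real metric traces a full divergence frontier and reports the area under it).

```python
import numpy as np
from sklearn.cluster import KMeans

def quantized_divergence(model_embs, human_embs, n_clusters=16, eps=1e-6):
    """Symmetrized KL divergence between cluster histograms of two embedding sets."""
    km = KMeans(n_clusters=n_clusters, n_init=10)
    km.fit(np.vstack([model_embs, human_embs]))  # shared quantization
    def histogram(embs):
        counts = np.bincount(km.predict(embs), minlength=n_clusters) + eps
        return counts / counts.sum()
    p, q = histogram(model_embs), histogram(human_embs)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * (kl(p, q) + kl(q, p))
```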
Human evaluations
- Gold standard in developing new automatic metrics (new metrics must correlate with human evaluation)
- Basically: ask humans to evaluate the quality of generated text along some dimension
- Issues:
- Slow
- Expensive
- Can be inconsistent, irreproducible, illogical