Language models
- Goal: compute the probability of a given sequence of words occurring in that particular order
Why do we care about language modeling?
- Benchmark task that helps measure progress on understanding language (and, more recently, useful for much more than that)
Denotation:
- Probability of a sequence of m words is denoted as $P(w_1, \dots, w_m)$
- Usually, condition on a window of previous words, not all previous words
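Written out in standard chain-rule notation (not verbatim from the original notes): the chain rule factors the joint probability, and conditioning on a window of the previous $n-1$ words gives the approximation actually used:

```latex
P(w_1, \dots, w_m)
= \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1})
\approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \dots, w_{i-1})
```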
n-gram language models
Core idea: estimate conditional probabilities from n-gram counts and sample from the resulting distribution, under the Markov assumption
- Count of each n-gram is compared against the frequency of the words it conditions on
Example: if the model uses bigrams
- Get the frequency of each bigram by pairing each word with its previous word
- Divide by frequency of corresponding unigram.
Bigram: $p(w_2 \mid w_1) = \dfrac{\mathrm{count}(w_1, w_2)}{\mathrm{count}(w_1)}$
Get a probability distribution this way, and sample from it
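A minimal sketch of that counting-and-sampling recipe (plain Python; the tiny corpus and function names are made up for illustration):

```python
from collections import Counter
import random

# Toy corpus for illustration only.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

bigram_counts = Counter(zip(corpus, corpus[1:]))   # count(w1, w2)
unigram_counts = Counter(corpus)                   # count(w1)

def bigram_prob(w1, w2):
    # p(w2 | w1) = count(w1, w2) / count(w1)
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

def sample_next(w1):
    # Sample the next word from the conditional distribution p(. | w1).
    candidates = [w2 for (a, w2) in bigram_counts if a == w1]
    weights = [bigram_prob(w1, w2) for w2 in candidates]
    return random.choices(candidates, weights=weights, k=1)[0]

print(sample_next("the"))  # e.g. "cat", "dog", "mat", or "rug"
```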
Markov assumption: assume that a word depends only on a fixed window of preceding words
- Not actually true in practice
- Lets us use the definition of conditional probability
Main issues with n-gram language models
Sparsity:
- If the trigram never appears together in the corpus, the probability of the third word is 0
- Fix: smoothing, i.e. add a small δ to the count for each word (see the formula after this list)
- If the words in the denominator never occurred together in the corpus, no probability can be calculated at all
- Fix: backoff, i.e. condition on a smaller window instead
Storage:
- As n or the corpus size increases, model size increases as well
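As a concrete illustration of the smoothing fix above (add-δ / Laplace smoothing; δ and the vocabulary size |V| are standard symbols, not from the original notes):

```latex
P_{\text{smooth}}(w_3 \mid w_1, w_2)
= \frac{\mathrm{count}(w_1, w_2, w_3) + \delta}{\mathrm{count}(w_1, w_2) + \delta\,|V|}
```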
Neural language model?
How to build?
- Naive approach: fixed window-based neural model with a softmax over the vocabulary
Why is this better than n-gram?
- Distributed representations instead of the very sparse representations of word sequences in n-gram models
- In theory, semantically similar words should have similar probabilities
- No need to store all observed n-grams
Remaining issues
- Small fixed window
- Each word vector is multiplied by completely different weights: no symmetry
Use RNNs!
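For concreteness, a minimal sketch of the fixed-window neural LM described above (assumes PyTorch; layer sizes and names are illustrative, not from the notes):

```python
import torch
import torch.nn as nn

class FixedWindowLM(nn.Module):
    # Concatenate the embeddings of a fixed window of previous words,
    # pass through a hidden layer, then softmax over the vocabulary.
    def __init__(self, vocab_size, embed_dim=64, window=4, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.hidden = nn.Linear(window * embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, window_ids):              # (batch, window)
        e = self.embed(window_ids)              # (batch, window, embed_dim)
        e = e.reshape(e.size(0), -1)            # concatenate window embeddings
        h = torch.tanh(self.hidden(e))
        return torch.log_softmax(self.out(h), dim=-1)   # log P(next word)

model = FixedWindowLM(vocab_size=10_000)
logp = model(torch.randint(0, 10_000, (2, 4)))  # two windows of 4 word ids
print(logp.shape)                               # torch.Size([2, 10000])
```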
RNNs
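A standard formulation of the per-timestep RNN language model computation (the equations did not survive in these notes, so the notation here is the usual one: $\sigma$ is a nonlinearity, and the same weight matrices are reused at every step):

```latex
h_t = \sigma\!\left(W^{(hh)} h_{t-1} + W^{(hx)} x_t\right), \qquad
\hat{y}_t = \mathrm{softmax}\!\left(W^{(S)} h_t\right)
```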
RNN text generation
Just do repeated sampling
- Feed in start token
- Take the sampled word from the current state, embed it, and feed it into the next timestep as the input
- Repeat until the EOS token is sampled (see the sketch below)
Note: can do much more than just language modeling
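A sketch of that generation loop (assumes PyTorch; `rnn_lm`, `embed`, and the BOS/EOS ids are hypothetical placeholders for the pieces of a trained model, with `rnn_lm(x, hidden)` assumed to return a next-word distribution of shape `(1, vocab)` and a new hidden state):

```python
import torch

def generate(rnn_lm, embed, bos_id, eos_id, max_len=50):
    # Repeatedly sample from the model's next-word distribution,
    # feeding each sampled word back in as the next input.
    token = torch.tensor([bos_id])
    hidden = None
    output = []
    for _ in range(max_len):
        probs, hidden = rnn_lm(embed(token), hidden)       # next-word distribution
        token = torch.multinomial(probs.squeeze(0), num_samples=1)
        if token.item() == eos_id:
            break
        output.append(token.item())
    return output
```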
Pros
- Variable-length input sequences
- Model size doesn't increase for longer input sequences
- Computation for step t can (in theory) use information from all previous timesteps
- Same weights at each timestep (symmetry)
Cons
- Slow: computation is sequential, so it cannot be parallelized across timesteps
RNN Translation Model
Traditional translation models: built from many separate ML pipelines
RNNs: much simpler
Basic functionality
- Hidden layers at the earlier time-steps encode the foreign-language words into word features
- The last time steps decode this representation into word outputs in the new language
Necessary extensions to achieve high accuracy translation
- Different RNN weights for encoding and decoding
- Decoupling the two allows more accurate predictions from each of the two RNN modules
- Compute each decoder hidden state using 3 inputs (see the sketch after this list):
- Previous hidden state
- Last hidden layer of the encoder
- Previous predicted output word
- Train deep RNNs with multiple layers
- Train bi-directional encoders to improve accuracy
- Train the RNN with the input tokens reversed
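A sketch of the three-input decoder hidden state mentioned in the list above (the weight-matrix names are made up for illustration; $c$ denotes the last encoder hidden state, $y_{t-1}$ the previously predicted output word, $\phi$ the RNN cell):

```latex
h_t^{(\text{dec})} = \phi\!\left(h_{t-1}^{(\text{dec})},\, c,\, y_{t-1}\right)
= \sigma\!\left(W^{(hh)} h_{t-1}^{(\text{dec})} + W^{(hc)} c + W^{(hy)} y_{t-1}\right)
```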
Evaluating language models
Standard evaluation metric: perplexity
Perplexity: the geometric mean of the inverse probability of the corpus according to the language model
- perplexity $= \left( \prod_{t=1}^{T} \dfrac{1}{P(w_t \mid w_1, \dots, w_{t-1})} \right)^{1/T}$
- Also equivalent to the exponential of the average cross-entropy loss
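A small numeric check of that equivalence (made-up per-token probabilities):

```python
import math

# Per-token probabilities P(w_t | w_<t) assigned by some LM (made-up numbers).
token_probs = [0.2, 0.1, 0.4, 0.25]
T = len(token_probs)

# Geometric mean of the inverse probabilities.
perplexity = math.prod(1.0 / p for p in token_probs) ** (1.0 / T)

# Exponential of the average cross-entropy (negative log-likelihood per token).
cross_entropy = -sum(math.log(p) for p in token_probs) / T
assert abs(perplexity - math.exp(cross_entropy)) < 1e-6

print(perplexity)  # ≈ 4.73
```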