Language models
- Goal: compute the probability of a given sequence of words occurring in that particular order
Why do we care about language modeling?
- Benchmark task that helps measure progress on understanding language (and, more recently, useful for much more than that)
Denotation:
- Probability of a sequence of m words is denoted as $P(w_1, \dots, w_m)$
- Usually, condition on a window of previous words, not all previous words
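Written out in standard chain-rule notation (not verbatim from the original notes): the chain rule factors the joint probability, and conditioning on a window of the previous $n-1$ words gives the approximation actually used:

```latex
P(w_1, \dots, w_m)
= \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1})
\approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \dots, w_{i-1})
```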
n-gram language models
Core idea: estimate conditional probabilities from n-gram counts and sample from the resulting distribution, under the Markov assumption
- Count of each n-gram is compared against the frequency of the words it conditions on
Example: if the model uses bigrams
- Get the frequency of each bigram by pairing each word with its previous word
- Divide by frequency of corresponding unigram.
Bigram: $p(w_2 \mid w_1) = \dfrac{\mathrm{count}(w_1, w_2)}{\mathrm{count}(w_1)}$
Get a probability distribution this way, and sample from it
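A minimal sketch of that counting-and-sampling recipe (plain Python; the tiny corpus and function names are made up for illustration):

```python
from collections import Counter
import random

# Toy corpus for illustration only.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

bigram_counts = Counter(zip(corpus, corpus[1:]))   # count(w1, w2)
unigram_counts = Counter(corpus)                   # count(w1)

def bigram_prob(w1, w2):
    # p(w2 | w1) = count(w1, w2) / count(w1)
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

def sample_next(w1):
    # Sample the next word from the conditional distribution p(. | w1).
    candidates = [w2 for (a, w2) in bigram_counts if a == w1]
    weights = [bigram_prob(w1, w2) for w2 in candidates]
    return random.choices(candidates, weights=weights, k=1)[0]

print(sample_next("the"))  # e.g. "cat", "dog", "mat", or "rug"
```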
Markov assumption: assume that a word depends only on a fixed window of preceding words
- Not actually true in practice
- Lets us use the definition of conditional probability
Main issues with n-gram language models
Sparsity:
- If the trigram never appears together in the corpus, the probability of the third word is 0
- Fix: smoothing, i.e. add a small δ to the count for each word (see the formula after this list)
- If the words in the denominator never occurred together in the corpus, no probability can be calculated at all
- Fix: backoff, i.e. condition on a smaller window instead
Storage:
- As n or the corpus size increases, model size increases as well
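As a concrete illustration of the smoothing fix above (add-δ / Laplace smoothing; δ and the vocabulary size |V| are standard symbols, not from the original notes):

```latex
P_{\text{smooth}}(w_3 \mid w_1, w_2)
= \frac{\mathrm{count}(w_1, w_2, w_3) + \delta}{\mathrm{count}(w_1, w_2) + \delta\,|V|}
```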
Neural language model?
How to build?
- Naive approach: fixed window-based neural model with a softmax over the vocabulary
Why is this better than n-gram?
- Distributed representations instead of the very sparse representations of word sequences in n-gram models
- In theory, semantically similar words should have similar probabilities
- No need to store all observed n-grams
Remaining issues
- Small fixed window
- Each word vector is multiplied by completely different weights: no symmetry
Use RNNs!
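For concreteness, a minimal sketch of the fixed-window neural LM described above (assumes PyTorch; layer sizes and names are illustrative, not from the notes):

```python
import torch
import torch.nn as nn

class FixedWindowLM(nn.Module):
    # Concatenate the embeddings of a fixed window of previous words,
    # pass through a hidden layer, then softmax over the vocabulary.
    def __init__(self, vocab_size, embed_dim=64, window=4, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.hidden = nn.Linear(window * embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, window_ids):              # (batch, window)
        e = self.embed(window_ids)              # (batch, window, embed_dim)
        e = e.reshape(e.size(0), -1)            # concatenate window embeddings
        h = torch.tanh(self.hidden(e))
        return torch.log_softmax(self.out(h), dim=-1)   # log P(next word)

model = FixedWindowLM(vocab_size=10_000)
logp = model(torch.randint(0, 10_000, (2, 4)))  # two windows of 4 word ids
print(logp.shape)                               # torch.Size([2, 10000])
```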
RNNs
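A standard formulation of the per-timestep RNN language model computation (the equations did not survive in these notes, so the notation here is the usual one: $\sigma$ is a nonlinearity, and the same weight matrices are reused at every step):

```latex
h_t = \sigma\!\left(W^{(hh)} h_{t-1} + W^{(hx)} x_t\right), \qquad
\hat{y}_t = \mathrm{softmax}\!\left(W^{(S)} h_t\right)
```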
RNN text generation
Just do repeated sampling
- Feed in start token
- Take the sampled word from the current state, embed it, and feed it into the next timestep as the input
- Repeat until the EOS token is sampled (see the sketch below)
Note: can do much more than just language modeling
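A sketch of that generation loop (assumes PyTorch; `rnn_lm`, `embed`, and the BOS/EOS ids are hypothetical placeholders for the pieces of a trained model, with `rnn_lm(x, hidden)` assumed to return a next-word distribution of shape `(1, vocab)` and a new hidden state):

```python
import torch

def generate(rnn_lm, embed, bos_id, eos_id, max_len=50):
    # Repeatedly sample from the model's next-word distribution,
    # feeding each sampled word back in as the next input.
    token = torch.tensor([bos_id])
    hidden = None
    output = []
    for _ in range(max_len):
        probs, hidden = rnn_lm(embed(token), hidden)       # next-word distribution
        token = torch.multinomial(probs.squeeze(0), num_samples=1)
        if token.item() == eos_id:
            break
        output.append(token.item())
    return output
```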
Pros
- Variable-length input sequences
- Model size doesn't increase for longer input sequences
- Computation for step t can (in theory) use information from all previous timesteps
- Same weights at each timestep (symmetry)
Cons
- Slow: computation is sequential, so it cannot be parallelized across timesteps
RNN Translation Model
Traditional translation models: built from many separate ML pipelines
RNNs: much simpler
Basic functionality
- Hidden layers at the earlier time-steps encode the foreign-language words into word features
- The last time steps decode this representation into word outputs in the new language
Necessary extensions to achieve high accuracy translation
- Different RNN weights for encoding and decoding
- Decoupling the two allows more accurate predictions from each of the two RNN modules
- Compute each decoder hidden state using 3 inputs (see the sketch after this list):
- Previous hidden state
- Last hidden layer of the encoder
- Previous predicted output word
- Train deep RNNs with multiple layers
- Train bi-directional encoders to improve accuracy
- Train the RNN with the input tokens reversed
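A sketch of the three-input decoder hidden state mentioned in the list above (the weight-matrix names are made up for illustration; $c$ denotes the last encoder hidden state, $y_{t-1}$ the previously predicted output word, $\phi$ the RNN cell):

```latex
h_t^{(\text{dec})} = \phi\!\left(h_{t-1}^{(\text{dec})},\, c,\, y_{t-1}\right)
= \sigma\!\left(W^{(hh)} h_{t-1}^{(\text{dec})} + W^{(hc)} c + W^{(hy)} y_{t-1}\right)
```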
Evaluating language models
Standard evaluation metric: perplexity
Perplexity: the geometric mean of the inverse probability of the corpus according to the language model
- perplexity $= \left( \prod_{t=1}^{T} \dfrac{1}{P(w_t \mid w_1, \dots, w_{t-1})} \right)^{1/T}$
- Also equivalent to the exponential of the average cross-entropy loss
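A small numeric check of that equivalence (made-up per-token probabilities):

```python
import math

# Per-token probabilities P(w_t | w_<t) assigned by some LM (made-up numbers).
token_probs = [0.2, 0.1, 0.4, 0.25]
T = len(token_probs)

# Geometric mean of the inverse probabilities.
perplexity = math.prod(1.0 / p for p in token_probs) ** (1.0 / T)

# Exponential of the average cross-entropy (negative log-likelihood per token).
cross_entropy = -sum(math.log(p) for p in token_probs) / T
assert abs(perplexity - math.exp(cross_entropy)) < 1e-6

print(perplexity)  # ≈ 4.73
```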