How do you represent the meaning of a word? Think about it for a second. What actually captures what a word means?
Here are some approaches:
- Denotational semantics
- Vectors from annotated discrete properties: use specific relationships like synonyms and hypernyms ("is a" relationships)
    - Issues: miss nuance, labor-heavy, no concept of word similarity
    - Example: WordNet
- Discrete symbols: localist representation (e.g., one-hot encoding)
    - Issue: no notion of similarity
- Distributional semantics: meaning given by the company a word keeps
    - Example: Word2Vec
We tend to find distributional semantics most useful, especially in a deep learning paradigm.
Word2Vec (Distributional Semantics)
Goal: optimize word vectors so that their similarity captures the probability of a word occurring given another word
2 models
- Skip-gram (presented here)
- Continuous bag of words (CBOW)
Let’s formalize the objective function:
- Goal: for each position $t = 1, \dots, T$, predict the context words within a window of fixed size $m$, given the center word $w_t$
- Definitions:
    - Likelihood: $L(\theta) = \prod_{t=1}^{T} \prod_{-m \le j \le m,\ j \ne 0} P(w_{t+j} \mid w_t; \theta)$
    - $\theta$: all variables to be optimized (the word vectors)
- Objective function: the average negative log-likelihood, $J(\theta) = -\frac{1}{T} \log L(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{-m \le j \le m,\ j \ne 0} \log P(w_{t+j} \mid w_t; \theta)$
- Thus, minimizing our objective function means maximizing our predictive accuracy (this is what we wanted!)
The question becomes: how do you calculate $P(w_{t+j} \mid w_t; \theta)$?
- use 2 vectors per word
    - $v_w$ when $w$ is a center word
    - $u_w$ when $w$ is a context word
- And make $P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)}$ (basic softmax)
    - Maps arbitrary values to a probability distribution
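As a minimal NumPy sketch of this softmax probability (the names `V`, `U`, `center_idx`, and `outside_idx` are mine, not from the lecture):

```python
import numpy as np

def naive_softmax_prob(center_idx, outside_idx, V, U):
    """P(o | c) under skip-gram with the basic softmax.

    V: (vocab_size, d) matrix of center-word vectors v_w
    U: (vocab_size, d) matrix of outside-word vectors u_w
    """
    v_c = V[center_idx]                    # center vector v_c
    scores = U @ v_c                       # u_w^T v_c for every word w in the vocabulary
    scores -= scores.max()                 # subtract max for numerical stability
    probs = np.exp(scores)
    probs /= probs.sum()                   # softmax: normalize over the whole vocabulary
    return probs[outside_idx]
```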
Now we know enough to optimize, but how do you perform gradient descent?
Issue with basic gradient descent
- $J(\theta)$ is defined as a function of all windows in the corpus, so each full gradient is very expensive to compute
- instead, we basically always use stochastic gradient descent (SGD): sample one window (or a small batch) at a time and update the parameters with just its gradient
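A toy sketch of that SGD loop; `window_grad` is a hypothetical function returning the gradient of a single window's loss, not something defined in these notes:

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd(theta, window_grad, num_windows, lr=0.05, steps=100_000):
    """Plain SGD: instead of summing the gradient over every window in the
    corpus, sample one window position at a time and step on its gradient."""
    for _ in range(steps):
        t = rng.integers(num_windows)          # pick a random center position
        theta -= lr * window_grad(theta, t)    # gradient of that single window's loss
    return theta
```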
So we need to take derivatives to work out the minimum.
So, take the derivative of the log-likelihood with respect to the center vector $v_c$:
- $\log P(o \mid c) = u_o^\top v_c - \log \sum_{w \in V} \exp(u_w^\top v_c)$, which becomes 2 terms
- Take $\frac{\partial}{\partial v_c}$ of each term
    - the first term gives $u_o$
    - for the second term, use the chain rule, which gives $\sum_{w \in V} P(w \mid c)\, u_w$
- Result: $\frac{\partial}{\partial v_c} \log P(o \mid c) = u_o - \sum_{w \in V} P(w \mid c)\, u_w$, i.e. the observed context vector minus the expected one
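A NumPy sketch of that gradient, under the same assumed `V`/`U` layout as above:

```python
import numpy as np

def grad_log_prob_wrt_vc(center_idx, outside_idx, V, U):
    """d/dv_c log P(o | c) = u_o - sum_w P(w | c) u_w  (observed minus expected)."""
    v_c = V[center_idx]
    scores = U @ v_c
    scores -= scores.max()                 # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum()                   # P(w | c) for every word w
    expected = probs @ U                   # sum_w P(w | c) u_w
    return U[outside_idx] - expected       # u_o minus the expected context vector
```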
Skip-Gram Negative Sampling
Core idea: train a binary logistic regression classifier to differentiate a true pair (center word + context word) from noise pairs (center word + random word)
Why?
- Naive softmax is expensive (its normalization sums over the entire vocabulary)
Modify the loss function to take this new goal into account:
- $J_{\text{neg-sample}}(o, v_c, U) = -\log \sigma(u_o^\top v_c) - \sum_{w \in \mathcal{K}} \log \sigma(-u_w^\top v_c)$
    - $\mathcal{K}$: a set of $k$ negative samples (randomly drawn words); $\sigma$: the sigmoid function
- Maximize the probability that the real outside word appears; minimize the probability that random words appear
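A sketch of this loss for a single training pair, again assuming `V`/`U` matrices of center and outside vectors, with `neg_indices` holding the sampled noise words:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(center_idx, outside_idx, neg_indices, V, U):
    """Negative-sampling loss for one (center, outside) pair:
    -log sigma(u_o . v_c) - sum over k noise words of log sigma(-u_k . v_c)."""
    v_c = V[center_idx]
    u_o = U[outside_idx]
    u_neg = U[neg_indices]                           # k randomly sampled noise words
    loss = -np.log(sigmoid(u_o @ v_c))               # pull the true pair together
    loss -= np.log(sigmoid(-(u_neg @ v_c))).sum()    # push noise pairs apart
    return loss
```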
GloVe
Key idea: connect count-based, linear-algebra-based models (like COALS) with direct prediction models (like Skip-gram)
Crucial insight: ratios of co-occurrence probabilities can encode meaning components
- i.e., a ratio like $\frac{P(x \mid \text{ice})}{P(x \mid \text{steam})}$ means something, and we should encode it somehow. But how?
- Change the loss function
- We want the dot product to be similar to the log of the co-occurrence count: $w_i^\top \tilde{w}_j \approx \log X_{ij}$ (up to bias terms)
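A sketch of the resulting weighted least-squares objective; the weighting function and the constants `x_max = 100`, `alpha = 0.75` follow the GloVe paper's defaults, and the variable names are mine:

```python
import numpy as np

def glove_loss(W, W_tilde, b, b_tilde, X, x_max=100.0, alpha=0.75):
    """Weighted least-squares objective:
    sum over nonzero X_ij of f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2."""
    i_idx, j_idx = np.nonzero(X)                      # only pairs that actually co-occur
    x_ij = X[i_idx, j_idx]
    f = np.minimum((x_ij / x_max) ** alpha, 1.0)      # down-weight rare pairs, cap frequent ones
    pred = np.sum(W[i_idx] * W_tilde[j_idx], axis=1) + b[i_idx] + b_tilde[j_idx]
    return np.sum(f * (pred - np.log(x_ij)) ** 2)
```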
Evaluating word vectors
Generally, two methods
- Intrinsic
- Evaluate on subtask
- Fast to compute
- Extrinsic
- Evaluate on real task
Intrinsic word vector evaluation
- Word vector analogies
- Evaluate by how well cosine similarity after vector addition captures intuitive semantic/syntactic analogy questions (see the sketch after this list)
- Compare with human judgements of similarity
    - e.g., how similar are tiger and cat? Compare the model's similarity scores to human evaluations
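A sketch of the analogy evaluation referenced above, assuming `vecs` is a (vocab_size, d) matrix of word vectors and `vocab` maps words to row indices:

```python
import numpy as np

def analogy(a, b, c, vecs, vocab):
    """Answer "a : b :: c : ?" by ranking words by cosine similarity
    to the vector (b - a + c), excluding the three query words."""
    target = vecs[vocab[b]] - vecs[vocab[a]] + vecs[vocab[c]]
    target /= np.linalg.norm(target)
    normed = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = normed @ target                      # cosine similarity to every word
    for w in (a, b, c):
        sims[vocab[w]] = -np.inf                # don't return a query word
    inv = {i: w for w, i in vocab.items()}
    return inv[int(np.argmax(sims))]

# e.g. analogy("man", "king", "woman", vecs, vocab) should ideally give "queen"
```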
Extrinsic word vector evaluation
- Named entity recognition (check whether the word vectors help)
    - identify references to a person, organization, or location
- Retraining
- We have trained the word vectors by optimizing over a simpler intrinsic task, but we can retrain them on the new task.
- Risky though! Only do this if the training set is large
Softmax classification + regularization
- Remember the softmax probability of word vector $x$ being in class $j$, $p(y = j \mid x) = \frac{\exp(W_j x)}{\sum_{c=1}^{C} \exp(W_c x)}$, and the corresponding cross-entropy loss $-\log p(y = j \mid x)$ (plus a regularization term on the weights)
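A sketch of that classifier loss, assuming `W` is a (num_classes, d) weight matrix and `x` is a d-dimensional word vector:

```python
import numpy as np

def softmax_xent_loss(W, x, j, reg=1e-4):
    """Cross-entropy loss -log p(y = j | x) for a softmax classifier,
    where p(y = j | x) = exp(W_j . x) / sum_c exp(W_c . x), plus L2 regularization."""
    scores = W @ x
    scores -= scores.max()                 # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum()
    return -np.log(probs[j]) + reg * np.sum(W ** 2)
```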
Co-Occurrence Matrix
Key idea: build up a table of co-occurrence counts once, rather than iterating through the entire corpus, possibly multiple times
2 options
- Window based
- Full document

How does it work? Simple: build a symmetric table counting how many times each word has appeared in the context window of every other word (each word treated as a one-hot-style symbol). Then take the vectors that have been built up (i.e., the rows or columns) and use them as word vectors.

Issues?
- High dimensional, sparse, expensive

How to fix?
- Reduce dimensionality; dimensionality reduction techniques exist
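A sketch of building the window-based variant, assuming the corpus is already tokenized into a list of strings:

```python
import numpy as np

def cooccurrence_matrix(tokens, window=2):
    """Window-based co-occurrence counts: X[i, j] = number of times word j
    appears within `window` positions of word i (symmetric by construction)."""
    vocab = {w: i for i, w in enumerate(sorted(set(tokens)))}
    X = np.zeros((len(vocab), len(vocab)))
    for t, word in enumerate(tokens):
        lo, hi = max(0, t - window), min(len(tokens), t + window + 1)
        for k in range(lo, hi):
            if k != t:
                X[vocab[word], vocab[tokens[k]]] += 1
    return X, vocab

# X, vocab = cooccurrence_matrix("the cat sat on the mat".split())
```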
Singular Value Decomposition
Hyperspace: a space with >3 dimensions
Orthonormal: unit-length vectors which are mutually orthogonal
Gram-Schmidt orthonormalization process
- Set of vectors → set of orthonormal vectors
- Normalize the first vector, then iteratively rewrite the remaining vectors in terms of themselves minus their projections onto the already-normalized vectors
- In essence
    - Normalize $v_1$ (vector 1) to get $n_1$
    - Assign $w_2 = v_2 - (n_1 \cdot v_2)\, n_1$
    - Then, $n_2$ = normalization of $w_2$
    - Do the same for $v_3$ using $n_1$ and $n_2$
        - to get $w_3 = v_3 - (n_1 \cdot v_3)\, n_1 - (n_2 \cdot v_3)\, n_2$
        - which we normalize to get $n_3$
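A sketch of the process just described (classical Gram-Schmidt, assuming the input vectors are linearly independent):

```python
import numpy as np

def gram_schmidt(vectors):
    """Turn linearly independent vectors into an orthonormal set:
    subtract each vector's projections onto the already-normalized ones, then normalize."""
    basis = []
    for v in vectors:
        v = np.asarray(v, dtype=float)
        w = v.copy()
        for n in basis:
            w -= (n @ v) * n                  # remove the component along n
        basis.append(w / np.linalg.norm(w))
    return basis

# gram_schmidt([np.array([1.0, 1.0]), np.array([1.0, 0.0])])
```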
Matrices
- Orthogonal: $A^\top A = I$ (equivalently, $A^{-1} = A^\top$)
- Diagonal: zero everywhere except the diagonal
Eigenvectors
- A nonzero vector $v$ which satisfies $A v = \lambda v$
    - $A$: square matrix
    - $\lambda$: scalar (the eigenvalue)
    - $v$: eigenvector
- Can solve for them with a system of linear equations
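For example, with NumPy (the 2×2 symmetric matrix here is chosen arbitrarily):

```python
import numpy as np

# Nonzero v with A v = lambda v; numpy solves the eigenproblem directly.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)   # columns of `eigenvectors` are the v's
v, lam = eigenvectors[:, 0], eigenvalues[0]
print(np.allclose(A @ v, lam * v))             # True: A v = lambda v holds
```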
Singular value decomposition
Core idea: take a high-dimensional, highly variable set of data points and reduce it to a lower-dimensional space that exposes the substructure of the original data more clearly.
3 ways to view SVD
- Correlated variables → uncorrelated variables
- Identifying and ordering the dimensions along which the data points exhibit the most variation
- Finding the best approximation of the original data points with fewer dimensions (data reduction)
Based on a theorem from linear algebra:
Core idea: you can decompose a rectangular matrix $A$ into the product of three matrices, $A = U S V^\top$
- Orthogonal matrix $U$
- Diagonal matrix $S$ (the singular values)
- Transpose of an orthogonal matrix $V$
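For example, with NumPy (the matrix here is random, standing in for a real co-occurrence matrix):

```python
import numpy as np

# A = U S V^T; truncating to the top-k singular values gives the best rank-k
# approximation, which is how a co-occurrence matrix gets reduced to word vectors.
A = np.random.default_rng(0).random((6, 4))    # stand-in for a co-occurrence matrix
U, S, Vt = np.linalg.svd(A, full_matrices=False)
print(np.allclose(A, U @ np.diag(S) @ Vt))     # True: exact decomposition

k = 2
A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]    # best rank-k approximation of A
word_vecs = U[:, :k] * S[:k]                   # k-dimensional vectors for the rows
```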