Base info
Constituency = phrase structure grammar = context-free grammar (throwback)
Phrase structure: organizes words into nested constituents
- Words → phrases → bigger phrases
Example definitions of phrases (see sketch below)
- Define a noun phrase: NP → Det (Adj) N (PP)
- Same for PP → P NP
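A minimal sketch of these rewrite rules as a toy CFG; the grammar, lexicon, and example sentence are illustrative assumptions, using nltk only to show the nesting:

```python
import nltk

# Toy grammar following the NP / PP rules above; the lexicon is made up for illustration.
grammar = nltk.CFG.fromstring("""
  S  -> NP VP
  NP -> Det N | Det Adj N | Det N PP
  PP -> P NP
  VP -> V NP
  Det -> 'the' | 'a'
  Adj -> 'large'
  N  -> 'dog' | 'crate' | 'kitchen'
  P  -> 'in' | 'by'
  V  -> 'saw'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the dog in the crate saw a large dog".split()):
    tree.pretty_print()   # words -> phrases -> bigger phrases, as nested constituents
```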
Contrast with dependency structure
- what modifies/depends on what?
- Represent with arrows
Why do we need this anyhow?
- Human language is complex; we represent abstract concepts + complicated things with combinations of words
- Ambiguity
- Prepositional phrase attachment ambiguity
- Which can multiply
- Coordination scope ambiguity (mod of what?)
- E.g. first hand job experience
- Verb phrase attachment ambiguity
- e.g. "mutilated body on Rio beach to be used for Olympics beach volleyball"
- Dependency paths help extract semantic interpretation (e.g. in a drug-interaction context)
Dependency Structure
Parse trees in NLP
- Used to analyze the syntactic structure of sentences
Dependencies: binary asymmetric relations between words (often arranged in a tree structure)
- Arrow points from head to dependent, usually typed with the name of the grammatical relation
- Connects a head with its dependent
- Usually add a fake ROOT so every word is a dependent of exactly one node (see the sketch below)
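A minimal sketch of how such typed head→dependent arcs can be written down; the sentence, indices, and relation names are illustrative assumptions:

```python
# Dependency arcs for "I saw the cat" as (head, relation, dependent) index triples;
# position 0 is the fake ROOT, so every word has exactly one head.
words = ["ROOT", "I", "saw", "the", "cat"]
arcs = [
    (2, "nsubj", 1),   # saw -> I
    (4, "det",   3),   # cat -> the
    (2, "obj",   4),   # saw -> cat
    (0, "root",  2),   # ROOT -> saw
]
for head, rel, dep in arcs:
    print(f"{words[head]} --{rel}--> {words[dep]}")
```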
Treebanks:
- Annotated data of dependency structures for sentences (example format sketched below)
- Why?
- Reuse labor (can build parsers on it)
- Broad coverage
- useful frequencies + distributional information
- Evaluation method
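Treebanks are commonly distributed one token per line (CoNLL-U style); a minimal reader sketch, where the tiny hand-written example is an assumption rather than real treebank data:

```python
# Only the columns needed here: ID FORM ... HEAD DEPREL (CoNLL-U has 10 columns total).
sample = """\
1\tI\t_\tPRON\t_\t_\t2\tnsubj\t_\t_
2\tsaw\t_\tVERB\t_\t_\t0\troot\t_\t_
3\tthe\t_\tDET\t_\t_\t4\tdet\t_\t_
4\tcat\t_\tNOUN\t_\t_\t2\tobj\t_\t_
"""

for line in sample.splitlines():
    cols = line.split("\t")
    idx, form, head, deprel = cols[0], cols[1], cols[6], cols[7]
    print(f"token {idx}: {form:4s} head={head} rel={deprel}")
```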
Dependency conditioning
Good sources of info for dependency parsing
- Bilexical affinities
- Dependency distance
- Intervening material (lots of nouns?)
- Valency of heads (how many dependents are usual for a head and which side?)
Dependency parsing
Goal: map an input sentence (a sequence of words) to its dependency tree/graph
Constraints (usually):
- Only one word is a dependent of ROOT
- No cycles: A → B, B → A
- Can arrows be non-projective or not?
Projectivity
- Projective parse: no crossing dependency arcs when words laid out linearly
- Dependencies for CFG tree definitionally are projective
- Usually allow for non-projective structures (see the check sketched below)
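A small sketch of a projectivity check: lay the words out linearly and test whether any two arcs cross; the head encoding and the toy examples are assumptions:

```python
def is_projective(heads):
    """heads[i-1] = head position of word i (1-based); head 0 is the fake ROOT."""
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for l1, r1 in arcs:
        for l2, r2 in arcs:
            if l1 < l2 < r1 < r2:   # one arc's endpoint falls strictly inside the other's span
                return False
    return True

print(is_projective([2, 0, 4, 2]))   # True: arcs nest, none cross
print(is_projective([3, 4, 0, 1]))   # False: arcs (1,3) and (2,4) cross
```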
Typical methods of dependency parsing
- Dynamic programming
- Graph algorithms
- Constraint satisfaction
- Transition-based / deterministic dependency parsing
- Greedy choice of attachments guided by ML classifiers
2 subproblems
- Learning: Given training set of sentences annotated with dependency graphs, induce a parsing model M that can be used to parse new sentences
- Parsing: Given a parsing model M and sentence S, derive the optimal dependency graph D for S according to M.
Transition-based dependency parsing
Key idea: Use a state machine which defines possible transitions
- The learning problem: induce a model
- The parsing problem: construct the optimal sequence of transitions, given the previously induced model
Greedy deterministic transition-based parsing
Transition system
- state machine (states + transitions)
States: for a sentence S = w0 w1 ... wn, a state is described by a triple c = (σ, β, A) (sketched in code below), consisting of
- A stack σ of words from S
- LIFO
- Words already read that are being considered for forming relations
- A buffer β of words from S
- FIFO
- Remaining input words
- A set of dependency arcs A of the form (wi, r, wj)
- wi, wj: words from S
- r: grammatical relation
For any sentence S = w0 w1 ... wn
- Initial state is of the form ([ROOT], [w1, ..., wn], ∅)
- Only the root is on the stack
- Terminal state has the form ([ROOT], [], A): buffer empty, only ROOT remains on the stack
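A minimal sketch of such a configuration as a small Python class; the names and the unlabeled-arc simplification are assumptions:

```python
class ParserState:
    """Parser configuration c = (stack, buffer, arcs)."""
    def __init__(self, words):
        self.stack = ["ROOT"]        # LIFO: initially only the fake ROOT
        self.buffer = list(words)    # FIFO: initially all of w1 ... wn
        self.arcs = []               # A: dependency arcs found so far

    def is_terminal(self):
        # terminal state: buffer consumed, only ROOT left on the stack
        return not self.buffer and self.stack == ["ROOT"]
```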
Transition types between states
- SHIFT: Move first word in the buffer to top of the stack
- Precondition: buffer nonempty
- LEFT-ARC(r): Add dependency arc (wj, r, wi) to the arc set A, where wi and wj are respectively the second and first items from the top of the stack. Remove wi from the stack
- Preconditions: ≥ 2 items on the stack, and wi is not ROOT
- RIGHT-ARC(r): Add dependency arc (wi, r, wj) to the arc set A, where wi and wj are respectively the second and first items from the top of the stack. Remove wj from the stack
- Precondition: ≥ 2 items on the stack
Think through an example transition sequence (a hand-worked sketch follows below)
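Continuing the ParserState sketch above: the three transitions plus a hand-written sequence for the toy sentence "I saw her"; labels are dropped and the sequence is what an oracle would produce, both assumptions for brevity:

```python
def shift(s):        # precondition: buffer nonempty
    s.stack.append(s.buffer.pop(0))

def left_arc(s):     # precondition: >= 2 items on the stack, second item is not ROOT
    dep = s.stack.pop(-2)
    s.arcs.append((s.stack[-1], dep))   # head = top of stack, dependent = old second item

def right_arc(s):    # precondition: >= 2 items on the stack
    dep = s.stack.pop()
    s.arcs.append((s.stack[-1], dep))   # head = new top of stack, dependent = old top

s = ParserState(["I", "saw", "her"])
for step in [shift, shift, left_arc, shift, right_arc, right_arc]:
    step(s)
print(s.arcs, s.is_terminal())   # [('saw', 'I'), ('saw', 'her'), ('ROOT', 'saw')] True
```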
Key question: how do you choose the next action?
- Not clear what dependency arc to assign, or when to shift instead
Answer: use a discriminative classifier (e.g. softmax classifier)
In simplest form, no search
- VERY fast, linear time
Greedy Transition-based Neural Dependency Parsing
Goal of model: predict the transition sequence from some initial configuration to a terminal configuration. Since it is greedy:
- Tries to predict one transition at a time, based on features of the current configuration (see the loop sketch below)
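A sketch of the greedy loop, building on the ParserState/transition sketches above; `classifier.rank` and `extract_features` are hypothetical stand-ins for the trained model and its feature function:

```python
ACTIONS = {"shift": shift, "left_arc": left_arc, "right_arc": right_arc}

def is_legal(name, s):
    if name == "shift":
        return len(s.buffer) > 0
    if name == "left_arc":
        return len(s.stack) >= 2 and s.stack[-2] != "ROOT"
    return len(s.stack) >= 2        # right_arc

def greedy_parse(words, classifier, extract_features):
    s = ParserState(words)
    while not s.is_terminal():
        feats = extract_features(s)             # features of the current configuration
        for name in classifier.rank(feats):     # candidate transitions, best-scoring first
            if is_legal(name, s):               # respect the preconditions above
                ACTIONS[name](s)
                break
    return s.arcs    # one classifier call per transition: linear time, no search
```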
Feature selection
What should the input to neural network be?
Generally, features for a given sentence S are some subset of:
- S_word: vector representations of some of the words in S (and their dependents)
- e.g. top 3 words on the stack and buffer, and the first + second left/rightmost children of the top 2 stack words
- S_tag: POS tags for some of the words in S (from a small discrete set)
- S_label: arc labels for some of the words in S (from a small discrete set)
Conventional feature selection
- Categorical indicator features: use a bunch of feature templates and fill out a binary, sparse vector
- Not efficient at all!
Instead, learn a dense + compact feature representation
- Use vectors to represent
- Words
- POS tags
- Dependency labels
To get features
- Extract tokens by using the buffer and stack positions
- Concatenate the vector representations of the features, i.e. words, POS tags, and dependency labels (see the sketch below)
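A sketch of this lookup-and-concatenate step; the vocabulary sizes, dimensions, and position choices are illustrative assumptions:

```python
import numpy as np

d = 50
E_word  = np.random.randn(10_000, d)   # word embedding matrix
E_pos   = np.random.randn(45, d)       # POS-tag embedding matrix
E_label = np.random.randn(40, d)       # dependency-label embedding matrix

def featurize(word_ids, pos_ids, label_ids):
    """Indices for the chosen stack/buffer positions (padded with a NULL id as needed)."""
    parts = [E_word[word_ids].ravel(), E_pos[pos_ids].ravel(), E_label[label_ids].ravel()]
    return np.concatenate(parts)       # the dense input vector x

x = featurize(word_ids=[1, 2, 3, 0, 0, 0], pos_ids=[4, 5, 6], label_ids=[7, 8])
print(x.shape)   # (6 + 3 + 2) * 50 = (550,)
```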
Why are neural dependency parsers better?
- Distributed representations
- Non-linear classifiers
General architectures
- Stack + buffer → Input layer x (lookup in embedding matrix + concatenate) → Hidden layer → Output layer (softmax); minimal sketch below
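A minimal PyTorch sketch of that architecture; the layer sizes, ReLU activation, and 3-way unlabeled transition output are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ParserNet(nn.Module):
    def __init__(self, n_features=48, embed_dim=50, hidden_dim=200, n_transitions=3):
        super().__init__()
        self.hidden = nn.Linear(n_features * embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, n_transitions)   # one score per transition

    def forward(self, x):                  # x: concatenated embeddings from stack/buffer
        h = torch.relu(self.hidden(x))     # hidden layer
        return torch.softmax(self.out(h), dim=-1)   # softmax over transitions

model = ParserNet()
probs = model(torch.randn(1, 48 * 50))     # one configuration's feature vector
print(probs.argmax(dim=-1))                # index of the predicted transition
```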