Program Synthesis
Programs that write programs
Program synthesizer: high-level specification → program
What specification?
- Logical formula
  - Non-trivial: might be harder than writing the program itself
- Equivalent program (e.g. a slower reference implementation)
- Input/output examples
  - E.g. FlashFill in Excel, which infers string transformations from a few examples
  - Ambiguous: many programs fit the same examples
- Natural language description
Why specification?
- When it is easier to specify what the program should do than how it should do it
Main challenges
- Infinite space of programs
- Specifications are ambiguous: how do we capture human preferences?
As Pragmatic Communication
Ambiguity
- In language, ambiguity is not a bug but a feature of efficiency
- Human communication depends on cooperativity
- Assume your partner is being as informative as possible
- Use this assumption in context to perform pragmatic reasoning
RSA (Rational Speech Acts): a Bayesian model of how we choose or interpret utterances in context by recursively reasoning about the other party
- Given an utterance u and a referent o:
- Literal listener (L0) assigns probability according to literal meaning alone: L0(o | u) ∝ ⟦u⟧(o) · P(o)
- Pragmatic speaker (S1), who wants to refer to o, chooses u by reasoning about L0, balancing
  - How likely L0 is to infer o given u
  - How costly u is
- Thus S1(u | o) ∝ exp(α (log L0(o | u) − cost(u)))
- Pragmatic listener (L1) interprets u assuming S1 is speaking: L1(o | u) ∝ S1(u | o) · P(o)
- Could keep recursing, but usually the S1-L1 pair is enough
In program synthesis
- Utterance = specification
- Synthesizer = listener, trying to infer what program we’re referring to
Assuming a finite set of specifications and programs, we could build a meaning matrix (see the sketch below)
- Entry = 1 if the program satisfies the specification, 0 otherwise
Obvious issue: this only works for finite program / specification spaces
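A minimal sketch of the RSA recursion over a toy meaning matrix, assuming uniform priors, rationality α = 1, and zero utterance cost (the matrix values here are illustrative, not from the original):

```python
import numpy as np

def normalize_rows(m):
    """Normalize each row into a probability distribution."""
    return m / m.sum(axis=1, keepdims=True)

# Toy meaning matrix: M[u, p] = 1 if program p satisfies specification u.
M = np.array([
    [1, 1, 0],  # spec u0 is satisfied by programs p0 and p1
    [0, 1, 1],  # spec u1 is satisfied by p1 and p2
    [0, 0, 1],  # spec u2 is satisfied only by p2
], dtype=float)

# Literal listener L0(p | u): uniform over the programs consistent with u.
L0 = normalize_rows(M)

# Pragmatic speaker S1(u | p): proportional to L0(p | u)
# (the softmax of log L0 with alpha = 1 and zero cost reduces to this).
S1 = normalize_rows(L0.T)

# Pragmatic listener L1(p | u): Bayes over S1 with a uniform program prior.
L1 = normalize_rows(S1.T)

print(L1[1])  # given u1, L1 prefers p1 over p2 (0.6 vs 0.4):
              # a speaker who meant p2 would have used the unambiguous u2
```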
With language models
LMs can generate code
Codex core idea: train an LLM on a dataset that is mostly code
Code generation evals for LMs
- Synthesis challenge: given a docstring, implement the function (the Codex paper introduced a new dataset, HumanEval, for this)
Sampling vs temperature
- Sampling more candidates increases the chance of getting at least one right
- Low temperature: higher-likelihood samples, less diversity
- High temperature: lower-likelihood samples, more diversity
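A minimal sketch of temperature-scaled sampling (the toy logits are illustrative):

```python
import numpy as np

def sample_with_temperature(logits, temperature, rng=None):
    """Sample a token index from logits scaled by 1/temperature.
    Temperature -> 0 approaches greedy argmax; large values approach uniform."""
    rng = rng or np.random.default_rng()
    scaled = logits / temperature
    scaled = scaled - scaled.max()  # subtract max for numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return rng.choice(len(logits), p=probs)

logits = np.array([2.0, 1.0, 0.5])
print(sample_with_temperature(logits, 0.2))  # almost always index 0
print(sample_with_temperature(logits, 2.0))  # noticeably more diverse
```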
Ranking (we don't want to present 100 samples to the user)
- Option 1: sample only a small number
- Option 2: sample a large number, rerank, and show a subset
- Oracle: run samples against the hidden tests (an upper bound, not usable in practice)
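A minimal sketch of one reranking heuristic, ranking samples by mean token log-probability, which the Codex paper evaluates; the data layout here is an assumption:

```python
import numpy as np

def rerank_by_mean_logprob(samples):
    """Rank candidate programs by mean token log-probability,
    a heuristic stand-in when hidden tests are unavailable.
    `samples` is a list of (program_text, per_token_logprobs) pairs."""
    scored = sorted(samples, key=lambda s: np.mean(s[1]), reverse=True)
    return [text for text, _ in scored]

samples = [
    ("def inc(x): return x + 1", [-0.1, -0.3, -0.2]),
    ("def inc(x): return x - 1", [-0.9, -1.2, -0.7]),
]
print(rerank_by_mean_logprob(samples)[0])  # the higher-confidence sample
```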
AlphaCode
- Encoder-decoder transformer
- Multi-query attention instead of full multi-head attention blocks (to speed up large-scale sampling)
- Pipeline
- Pretraining: standard cross-entropy loss on lots of GitHub code
- Finetuning on competitive programming data:
  - RL fine-tuning with GOLD (an offline RL-style objective)
  - Value conditioning: incorporate incorrect submissions by prepending a comment stating whether the submission was accepted (see the sketch after this list)
    - In theory, this makes good use of the limited data: even incorrect submissions teach something about how to write code
- Sampling: up to 100k candidate programs per problem
- Filtering: discard samples that fail the public tests
- Clustering (see the sketch after this list)
  - A separate model generates additional sample inputs
  - Cluster solutions that produce the same outputs on those inputs; submit one solution from each of the 10 largest clusters
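A minimal illustration of value conditioning as described above; the exact comment format is hypothetical, not AlphaCode's actual template:

```python
def format_submission(problem, code, accepted):
    """Prepend a verdict comment so the model can condition on it;
    at sampling time we would condition on the CORRECT tag."""
    verdict = "CORRECT" if accepted else "INCORRECT"
    return f"# VERDICT: {verdict}\n# Problem: {problem}\n{code}"

print(format_submission("Sum two ints",
                        "print(sum(map(int, input().split())))",
                        accepted=False))
```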
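And a sketch of the filtering and clustering stages, assuming a sandboxed `run(program, test_input)` helper (not shown) that executes a candidate and returns its output:

```python
from collections import defaultdict

def select_submissions(programs, run, public_tests, generated_inputs, k=10):
    """Filter candidates by public tests, cluster survivors by behavior
    on generated inputs, and pick one program per large cluster."""
    # Filtering: keep only programs that pass every public test.
    survivors = [p for p in programs
                 if all(run(p, inp) == out for inp, out in public_tests)]
    # Clustering: programs with identical outputs land in the same cluster.
    clusters = defaultdict(list)
    for p in survivors:
        behavior = tuple(run(p, inp) for inp in generated_inputs)
        clusters[behavior].append(p)
    # Submit one representative from each of the k largest clusters.
    ranked = sorted(clusters.values(), key=len, reverse=True)
    return [cluster[0] for cluster in ranked[:k]]
```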
Programs as tools
Toolformer, etc.: getting language models to use tools
Toolformer
- First approach: teach calculator use via annotations in the training dataset
- Second approach: few-shot prompting
- Toolformer: a self-supervised approach to teach models to use new tools
  - A few examples of each tool + a larger dataset for the task without tool use
  - Use in-context learning to insert candidate API calls into training examples
  - Call the APIs and evaluate whether the result decreases the perplexity of the rest of the text (see the sketch below)
  - Fine-tune the model on the cases where it does
- Result: the model can now often use APIs appropriately
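A minimal sketch of Toolformer's filtering criterion, assuming an `lm_loss(context, continuation)` helper (not shown) that returns the model's average negative log-likelihood of the continuation; the threshold `tau` is a tuning parameter:

```python
def keep_api_call(lm_loss, prefix, api_call, api_result, continuation, tau=1.0):
    """Keep an inserted API call only if conditioning on the call *and*
    its result lowers the loss on the continuation by at least tau,
    compared to the best alternative (no call, or call without result)."""
    loss_with_result = lm_loss(prefix + api_call + api_result, continuation)
    loss_without = min(
        lm_loss(prefix, continuation),             # no API call at all
        lm_loss(prefix + api_call, continuation),  # call inserted, result withheld
    )
    return loss_without - loss_with_result >= tau
```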