So far we’ve always assumed we had a reward function, or manually designed one, in order to define a task.
What if we instead learned the reward function itself by observing an expert?
Why learn rewards?
- Biological basis: when humans imitate, they copy intent; standard imitation learning only copies actions
- Reinforcement learning: rewards are not always clear (e.g. self-driving cars)
So, what is inverse RL?
- Inferring reward functions from demonstrations
- Forward RL
    - Given: states $s \in \mathcal{S}$, actions $a \in \mathcal{A}$, (sometimes) transitions $p(s' \mid s, a)$, and the reward function $r(s, a)$
    - Learn: the optimal policy $\pi^*(a \mid s)$
- Inverse RL
    - Given: states $s \in \mathcal{S}$, actions $a \in \mathcal{A}$, (sometimes) transitions $p(s' \mid s, a)$, and sample trajectories $\{\tau_i\}$ drawn from $\pi^*(\tau)$
    - Learn: the reward function $r_\psi(s, a)$
How do we construct a reward function?
We have some reward function parameterization options
- Linear: a weighted combination of features, $r_\psi(s, a) = \sum_i \psi_i f_i(s, a) = \psi^\top f(s, a)$
- Neural network: $r_\psi(s, a)$ with some parameters $\psi$ (a minimal sketch of both options follows this list)
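As a rough illustration (not from the notes), here is a minimal sketch of the two parameterizations; the feature map, network sizes, and NumPy implementation are assumptions made for this example.

```python
import numpy as np

# Minimal sketch of the two reward parameterizations above.
# Feature dimensions and network sizes are illustrative assumptions.

def linear_reward(psi, features):
    """Linear reward: r_psi(s, a) = psi^T f(s, a)."""
    return psi @ features  # weighted combination of hand-designed features

class MLPReward:
    """Neural-network reward r_psi(s, a); psi = (W1, b1, w2, b2)."""
    def __init__(self, in_dim, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(scale=0.1, size=(hidden, in_dim))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(scale=0.1, size=hidden)
        self.b2 = 0.0

    def __call__(self, state_action):
        h = np.tanh(self.W1 @ state_action + self.b1)  # hidden features
        return self.w2 @ h + self.b2                   # scalar reward
```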
Classical Approach to Inverse RL
- We’re going to try to find a linear reward function $r_\psi(s, a) = \psi^\top f(s, a)$
- Key idea: if we know which features $f$ are important, how about we try to match their expectations?
- Let $\pi^{r_\psi}$ be the optimal policy for $r_\psi$
- Pick $\psi$ such that $\mathbb{E}_{\pi^{r_\psi}}[f(s, a)] = \mathbb{E}_{\pi^*}[f(s, a)]$
- E.g. if you saw that the expert driver rarely ran red lights, didn’t overtake people, etc., then matching the expected values of those features would give you similar behavior
- However, this is pretty ambiguous: many different weight vectors $\psi$ can produce policies with equal expected feature values
- So, how to disambiguate? One way is to use the maximum margin principle
- Pretty similar to the maximum margin principle for SVMs
- Goal is to choose $\psi$ so that you maximize the margin $m$ between the observed expert policy and all other policies
- $\max_{\psi, m} \; m \quad \text{s.t.} \quad \psi^\top \mathbb{E}_{\pi^*}[f(s, a)] \geq \max_{\pi \in \Pi} \psi^\top \mathbb{E}_{\pi}[f(s, a)] + m$
- Basically: find a weight vector $\psi$ such that the expert’s policy is better than all other policies by the largest possible margin
- Still some issues: what if the space of policies is large and continuous? There are likely many policies that are basically equivalent to the expert’s, so instead weight the margin by how similar each policy is to the expert’s policy
- You can use the SVM trick here: $\min_\psi \frac{1}{2}\|\psi\|^2 \quad \text{s.t.} \quad \psi^\top \mathbb{E}_{\pi^*}[f(s, a)] \geq \max_{\pi \in \Pi} \psi^\top \mathbb{E}_{\pi}[f(s, a)] + D(\pi, \pi^*)$, where $D$ measures how different a policy is from the expert’s (a toy sketch of the plain max-margin version follows this list)
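To make the max-margin idea concrete, here is a toy sketch (my own illustration, not from the notes): it estimates feature expectations from trajectories and runs projected subgradient ascent over a finite set of candidate policies with $\|\psi\| \leq 1$, standing in for a proper QP/SVM solver. The trajectory format and candidate set are assumptions.

```python
import numpy as np

# Toy sketch of max-margin feature matching over a finite candidate policy set,
# using projected subgradient ascent with ||psi|| <= 1 instead of a QP solver.

def feature_expectations(trajectories, feature_fn, gamma=0.99):
    """Monte-Carlo estimate of E_pi[sum_t gamma^t f(s_t, a_t)]."""
    total = None
    for traj in trajectories:            # traj: list of (state, action) pairs
        disc = 1.0
        for s, a in traj:
            f = feature_fn(s, a)
            total = disc * f if total is None else total + disc * f
            disc *= gamma
    return total / len(trajectories)

def max_margin_weights(mu_expert, mu_candidates, steps=500, lr=0.1):
    """Find psi (||psi|| <= 1) maximizing min_pi psi^T (mu_expert - mu_pi)."""
    psi = np.zeros_like(mu_expert)
    for _ in range(steps):
        # candidate policy that currently violates the margin the most
        worst = max(mu_candidates, key=lambda mu: psi @ mu)
        psi += lr * (mu_expert - worst)   # subgradient of the margin objective
        norm = np.linalg.norm(psi)
        if norm > 1.0:
            psi /= norm                   # project back onto the unit ball
    return psi
```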
Graphical Model
From now on, we’ll consider a probabilistic graphical model of decision making, which means we base our goal on the optimality variables $\mathcal{O}_t$
We can express the probability of optimality as a function of the reward, parameterized by $\psi$: $p(\mathcal{O}_t \mid s_t, a_t, \psi) = \exp\big(r_\psi(s_t, a_t)\big)$
- Goal: find $r_\psi(s_t, a_t)$ (i.e. the parameters $\psi$)
We know that the probability of a trajectory given optimality and $\psi$ is
- proportional to $p(\tau) \exp\big(\sum_t r_\psi(s_t, a_t)\big)$ (a short derivation is sketched below)
Remember, in Inverse RL, we are
- given trajectories $\{\tau_i\}$ sampled from the expert policy $\pi^*(\tau)$
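As a quick aside (my own reconstruction of the standard control-as-inference derivation), the proportionality above comes from multiplying the per-step optimality likelihoods into the trajectory prior and absorbing the normalizer $p(\mathcal{O}_{1:T} \mid \psi)$, which does not depend on $\tau$, into the proportionality constant:

$$
p(\tau \mid \mathcal{O}_{1:T}, \psi)
= \frac{p(\tau) \prod_t p(\mathcal{O}_t \mid s_t, a_t, \psi)}{p(\mathcal{O}_{1:T} \mid \psi)}
\propto p(\tau) \exp\!\Big(\sum_t r_\psi(s_t, a_t)\Big)
$$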
Learning the Reward Function
How do we learn the parameters $\psi$ of our reward function?
- Maximum likelihood learning!
- Maximize $\frac{1}{N} \sum_{i=1}^{N} \log p(\tau_i \mid \mathcal{O}_{1:T}, \psi)$
- Which is equivalent to maximizing $\frac{1}{N} \sum_{i=1}^{N} r_\psi(\tau_i) - \log Z$, where $r_\psi(\tau) = \sum_t r_\psi(s_t, a_t)$ (ignoring $p(\tau)$, since it is independent of $\psi$)
- Now what does this mean?
- Essentially, it says to pick the parameters $\psi$ of $r_\psi$ such that we maximize the average reward of the demonstrated trajectories minus a log normalizer, $\log Z$ (the partition function); a toy sketch of this objective appears at the end of this section
The partition function is $Z = \int p(\tau) \exp\big(r_\psi(\tau)\big) \, d\tau$
TODO
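To make this objective concrete, here is a toy sketch (my own illustration, not part of the notes) for a linear reward $r_\psi(\tau) = \psi^\top f(\tau)$ over a small, enumerable set of candidate trajectories with a uniform prior $p(\tau)$, so that $Z$ and its gradient can be computed exactly; in practice $Z$ is intractable and has to be estimated.

```python
import numpy as np

# Toy sketch of the maximum-likelihood IRL objective for a linear reward
# r_psi(tau) = psi^T f(tau), assuming a small enumerable trajectory set and
# a uniform prior p(tau) so the partition function Z is exact.
# The feature representation is an assumption made for this example.

def maxent_irl_step(psi, expert_feats, all_feats, lr=0.1):
    """One gradient-ascent step on (1/N) sum_i r_psi(tau_i) - log Z.

    expert_feats: (N, d) array of features f(tau_i) for expert demonstrations
    all_feats:    (M, d) array of features for all candidate trajectories
    """
    rewards = all_feats @ psi                    # r_psi(tau) for each candidate
    # p(tau | O_{1:T}, psi) proportional to exp(r_psi(tau)) under a uniform prior
    probs = np.exp(rewards - rewards.max())
    probs /= probs.sum()
    # gradient = E_expert[f(tau)] - E_{p(tau | O, psi)}[f(tau)]
    grad = expert_feats.mean(axis=0) - probs @ all_feats
    return psi + lr * grad
```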