So far we’ve always assumed we had a reward function, or manually designed one, in order to define a task.
What if we instead learned the reward function itself by observing an expert?
Why learn rewards?
- Biological basis: when humans imitate, they copy intent; standard imitation learning only copies actions
- Reinforcement learning: rewards are not always clear (e.g. self-driving cars)
So, what is inverse RL?
- Inferring reward functions from demonstrations
- Forward RL
    - Given: states $s \in \mathcal{S}$, actions $a \in \mathcal{A}$, (sometimes) transitions $p(s' \mid s, a)$, and the reward function $r(s, a)$
    - Learn: the optimal policy $\pi^*(a \mid s)$
- Inverse RL
    - Given: states $s \in \mathcal{S}$, actions $a \in \mathcal{A}$, (sometimes) transitions $p(s' \mid s, a)$, and sample trajectories $\{\tau_i\}$ drawn from $\pi^*(\tau)$
    - Learn: the reward function $r_\psi(s, a)$
How do we construct a reward function?
We have some reward function parameterization options
- Linear: a weighted combination of features, $r_\psi(s, a) = \sum_i \psi_i f_i(s, a) = \psi^\top f(s, a)$
- Neural network: $r_\psi(s, a)$ with some parameters $\psi$ (a minimal sketch of both options follows this list)
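As a rough illustration (not from the notes), here is a minimal sketch of the two parameterizations; the feature map, network sizes, and NumPy implementation are assumptions made for this example.

```python
import numpy as np

# Minimal sketch of the two reward parameterizations above.
# Feature dimensions and network sizes are illustrative assumptions.

def linear_reward(psi, features):
    """Linear reward: r_psi(s, a) = psi^T f(s, a)."""
    return psi @ features  # weighted combination of hand-designed features

class MLPReward:
    """Neural-network reward r_psi(s, a); psi = (W1, b1, w2, b2)."""
    def __init__(self, in_dim, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(scale=0.1, size=(hidden, in_dim))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(scale=0.1, size=hidden)
        self.b2 = 0.0

    def __call__(self, state_action):
        h = np.tanh(self.W1 @ state_action + self.b1)  # hidden features
        return self.w2 @ h + self.b2                   # scalar reward
```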
Classical Approach to Inverse RL
- We’re going to try to find a linear reward function $r_\psi(s, a) = \psi^\top f(s, a)$
- Key idea: if we know which features $f$ are important, how about we try to match their expectations?
- Let $\pi^{r_\psi}$ be the optimal policy for $r_\psi$
- Pick $\psi$ such that $\mathbb{E}_{\pi^{r_\psi}}[f(s, a)] = \mathbb{E}_{\pi^*}[f(s, a)]$
- E.g. if you saw that the expert driver rarely ran red lights, didn’t overtake people, etc., then matching the expected values of those features would give you similar behavior
- However, this is pretty ambiguous: many different weight vectors $\psi$ can produce policies with equal expected feature values
- So, how to disambiguate? One way is to use the maximum margin principle
- Pretty similar to the maximum margin principle for SVMs
- Goal is to choose $\psi$ so that you maximize the margin $m$ between the observed expert policy and all other policies
- $\max_{\psi, m} \; m \quad \text{s.t.} \quad \psi^\top \mathbb{E}_{\pi^*}[f(s, a)] \geq \max_{\pi \in \Pi} \psi^\top \mathbb{E}_{\pi}[f(s, a)] + m$
- Basically: find a weight vector $\psi$ such that the expert’s policy is better than all other policies by the largest possible margin
- Still some issues: what if the space of policies is large and continuous? There are likely many policies that are basically equivalent to the expert’s, so instead weight the margin by how similar each policy is to the expert’s policy
- You can use the SVM trick here: $\min_\psi \frac{1}{2}\|\psi\|^2 \quad \text{s.t.} \quad \psi^\top \mathbb{E}_{\pi^*}[f(s, a)] \geq \max_{\pi \in \Pi} \psi^\top \mathbb{E}_{\pi}[f(s, a)] + D(\pi, \pi^*)$, where $D$ measures how different a policy is from the expert’s (a toy sketch of the plain max-margin version follows this list)
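To make the max-margin idea concrete, here is a toy sketch (my own illustration, not from the notes): it estimates feature expectations from trajectories and runs projected subgradient ascent over a finite set of candidate policies with $\|\psi\| \leq 1$, standing in for a proper QP/SVM solver. The trajectory format and candidate set are assumptions.

```python
import numpy as np

# Toy sketch of max-margin feature matching over a finite candidate policy set,
# using projected subgradient ascent with ||psi|| <= 1 instead of a QP solver.

def feature_expectations(trajectories, feature_fn, gamma=0.99):
    """Monte-Carlo estimate of E_pi[sum_t gamma^t f(s_t, a_t)]."""
    total = None
    for traj in trajectories:            # traj: list of (state, action) pairs
        disc = 1.0
        for s, a in traj:
            f = feature_fn(s, a)
            total = disc * f if total is None else total + disc * f
            disc *= gamma
    return total / len(trajectories)

def max_margin_weights(mu_expert, mu_candidates, steps=500, lr=0.1):
    """Find psi (||psi|| <= 1) maximizing min_pi psi^T (mu_expert - mu_pi)."""
    psi = np.zeros_like(mu_expert)
    for _ in range(steps):
        # candidate policy that currently violates the margin the most
        worst = max(mu_candidates, key=lambda mu: psi @ mu)
        psi += lr * (mu_expert - worst)   # subgradient of the margin objective
        norm = np.linalg.norm(psi)
        if norm > 1.0:
            psi /= norm                   # project back onto the unit ball
    return psi
```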
Graphical Model
From now on, we’ll consider a probabilistic graphical model of decision making, which means we base our goal on the optimality variables $\mathcal{O}_t$
We can express the probability of optimality as a function of the reward, parameterized by $\psi$: $p(\mathcal{O}_t \mid s_t, a_t, \psi) = \exp\big(r_\psi(s_t, a_t)\big)$
- Goal: find $r_\psi(s_t, a_t)$ (i.e. the parameters $\psi$)
We know that the probability of a trajectory given optimality and $\psi$ is
- proportional to $p(\tau) \exp\big(\sum_t r_\psi(s_t, a_t)\big)$ (a short derivation is sketched below)
Remember, in Inverse RL, we are
- given trajectories $\{\tau_i\}$ sampled from the expert policy $\pi^*(\tau)$
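As a quick aside (my own reconstruction of the standard control-as-inference derivation), the proportionality above comes from multiplying the per-step optimality likelihoods into the trajectory prior and absorbing the normalizer $p(\mathcal{O}_{1:T} \mid \psi)$, which does not depend on $\tau$, into the proportionality constant:

$$
p(\tau \mid \mathcal{O}_{1:T}, \psi)
= \frac{p(\tau) \prod_t p(\mathcal{O}_t \mid s_t, a_t, \psi)}{p(\mathcal{O}_{1:T} \mid \psi)}
\propto p(\tau) \exp\!\Big(\sum_t r_\psi(s_t, a_t)\Big)
$$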
Learning the Reward Function
How do we learn the parameters $\psi$ of our reward function?
- Maximum likelihood learning!
- Maximize $\frac{1}{N} \sum_{i=1}^{N} \log p(\tau_i \mid \mathcal{O}_{1:T}, \psi)$
- Which is equivalent to maximizing $\frac{1}{N} \sum_{i=1}^{N} r_\psi(\tau_i) - \log Z$, where $r_\psi(\tau) = \sum_t r_\psi(s_t, a_t)$ (ignoring $p(\tau)$, since it is independent of $\psi$)
- Now what does this mean?
- Essentially, it says to pick the parameters $\psi$ of $r_\psi$ such that we maximize the average reward of the demonstrated trajectories minus a log normalizer, $\log Z$ (the partition function); a toy sketch of this objective appears at the end of this section
The partition function is $Z = \int p(\tau) \exp\big(r_\psi(\tau)\big) \, d\tau$
TODO
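To make this objective concrete, here is a toy sketch (my own illustration, not part of the notes) for a linear reward $r_\psi(\tau) = \psi^\top f(\tau)$ over a small, enumerable set of candidate trajectories with a uniform prior $p(\tau)$, so that $Z$ and its gradient can be computed exactly; in practice $Z$ is intractable and has to be estimated.

```python
import numpy as np

# Toy sketch of the maximum-likelihood IRL objective for a linear reward
# r_psi(tau) = psi^T f(tau), assuming a small enumerable trajectory set and
# a uniform prior p(tau) so the partition function Z is exact.
# The feature representation is an assumption made for this example.

def maxent_irl_step(psi, expert_feats, all_feats, lr=0.1):
    """One gradient-ascent step on (1/N) sum_i r_psi(tau_i) - log Z.

    expert_feats: (N, d) array of features f(tau_i) for expert demonstrations
    all_feats:    (M, d) array of features for all candidate trajectories
    """
    rewards = all_feats @ psi                    # r_psi(tau) for each candidate
    # p(tau | O_{1:T}, psi) proportional to exp(r_psi(tau)) under a uniform prior
    probs = np.exp(rewards - rewards.max())
    probs /= probs.sum()
    # gradient = E_expert[f(tau)] - E_{p(tau | O, psi)}[f(tau)]
    grad = expert_feats.mean(axis=0) - probs @ all_feats
    return psi + lr * grad
```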