So far we’ve always assumed we had a reward function (or manually designed one) in order to define a task.

What if we instead applied reinforcement learning to actually learn the reward function itself, by observing an expert?

Why learn rewards?

  • Biological basis: when humans imitate, they copy the demonstrator’s intent; standard imitation learning just copies actions
  • Reinforcement learning: the reward function is not always easy to specify by hand (e.g. self-driving cars)

So, what is inverse RL?

  • Inferring reward functions from demonstrations
  • Forward RL
    • Given: states $s \in \mathcal{S}$, actions $a \in \mathcal{A}$, (sometimes) transitions $p(s' \mid s, a)$, and the reward function $r(s, a)$
    • Learn: the optimal policy $\pi^*(a \mid s)$
  • Inverse RL
    • Given: states $s \in \mathcal{S}$, actions $a \in \mathcal{A}$, (sometimes) transitions $p(s' \mid s, a)$, and sample trajectories $\{\tau_i\}$ from the expert policy $\pi^*(\tau)$
    • Learn: the reward function $r_\psi(s, a)$ (which we can then use to learn $\pi^*(a \mid s)$)

How do we construct a reward function?

We have a few options for parameterizing the reward function:

  • Linear: a weighted combination of features, $r_\psi(s, a) = \sum_i \psi_i f_i(s, a) = \psi^\top f(s, a)$
  • Neural network: $r_\psi(s, a)$ with parameters $\psi$ (the network weights)

Classical Approach to Inverse RL

  • We’re going to try to find a linear reward function $r_\psi(s, a) = \psi^\top f(s, a)$
  • There was a key idea: if we know that the features $f$ are important, how about we try to match their expectations?
    • Let $\pi^{r_\psi}$ be the optimal policy for $r_\psi$
    • Pick $\psi$ such that $E_{\pi^{r_\psi}}[f(s, a)] = E_{\pi^*}[f(s, a)]$
    • e.g. if the expert driver rarely ran red lights, didn’t overtake people, etc., then matching the expected values of those features would give you similar behavior
  • However, this is pretty ambiguous: many different weight vectors $\psi$ can give equal expected feature values.
  • So, how do we disambiguate? One way is to use the maximum margin principle
    • Pretty similar to the maximum margin principle for SVMs
    • Goal is to choose $\psi$ s.t. you maximize the margin between the observed expert policy and all other policies
      • $\max_{\psi, m} m$ s.t. $\psi^\top E_{\pi^*}[f(s, a)] \ge \max_{\pi \in \Pi} \psi^\top E_\pi[f(s, a)] + m$ (with $\|\psi\| \le 1$, so the margin can’t be inflated just by scaling $\psi$)
      • Basically, find me a weight vector such that the expert’s policy is better than all other policies by the largest possible margin
    • Still some issues: what if the space of policies is large and continuous? There are likely many policies that are basically equivalent to the expert’s, so maybe weight the margin by how different the other policies are from the expert policy.
    • You can use the SVM trick here: $\min_\psi \frac{1}{2}\|\psi\|^2$ s.t. $\psi^\top E_{\pi^*}[f(s, a)] \ge \max_{\pi \in \Pi} \psi^\top E_\pi[f(s, a)] + D(\pi, \pi^*)$, where $D$ measures how different $\pi$ is from the expert (e.g. the distance between their feature expectations). See the sketch after this list.
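A rough sketch of this max-margin formulation, assuming we already have Monte-Carlo feature expectations for the expert and for a finite set of candidate policies; the trajectory format, `feature_fn`, the candidate set, and the divergence values `D(pi, pi*)` are all stand-ins, and the quadratic program is solved with cvxpy:

```python
import numpy as np
import cvxpy as cp

def estimate_feature_expectations(trajectories, feature_fn, gamma=0.99):
    """Monte-Carlo estimate of E[sum_t gamma^t f(s_t, a_t)] from sampled trajectories."""
    mus = []
    for traj in trajectories:  # traj: list of (s, a) pairs
        mu = sum((gamma ** t) * feature_fn(s, a) for t, (s, a) in enumerate(traj))
        mus.append(mu)
    return np.mean(mus, axis=0)

def max_margin_weights(mu_expert, mu_others, divergences):
    """SVM-style max-margin IRL:
       min ||psi||^2  s.t.  psi @ mu_expert >= psi @ mu_pi + D(pi, pi*)  for each candidate pi."""
    d = mu_expert.shape[0]
    psi = cp.Variable(d)
    constraints = [psi @ mu_expert >= psi @ mu_pi + D
                   for mu_pi, D in zip(mu_others, divergences)]
    problem = cp.Problem(cp.Minimize(cp.sum_squares(psi)), constraints)
    problem.solve()
    return psi.value
```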

Graphical Model

From now on, we’ll consider a probabilistic graphical model of decision making, which means we’re basing our goal on the optimality variables $\mathcal{O}_t$ (whether the agent is acting optimally at each timestep).

We can express this as a function of the reward parameterized by $\psi$: $p(\mathcal{O}_t \mid s_t, a_t, \psi) \propto \exp(r_\psi(s_t, a_t))$

  • Goal: find the reward parameters $\psi$

We know that the probability of a trajectory given optimality and $\psi$ is

  • $p(\tau \mid \mathcal{O}_{1:T}, \psi) \propto p(\tau) \exp\!\left( \sum_t r_\psi(s_t, a_t) \right)$
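A minimal sketch of the unnormalized version of this quantity (the trajectory format and `reward_fn` are assumed; the normalizer $Z$ is left out because it is intractable in general):

```python
def unnormalized_log_prob(traj, reward_fn, log_prior=0.0):
    """log p(tau | O_{1:T}, psi) up to the (intractable) log Z:
       log p(tau) + sum_t r_psi(s_t, a_t)."""
    # traj: list of (s, a) pairs; log_prior: log p(tau) from the initial state and dynamics
    return log_prior + sum(reward_fn(s, a) for s, a in traj)
```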

Remember, in inverse RL we are

  • given trajectories $\{\tau_i\}$ sampled from the expert policy $\pi^*(\tau)$

Learning the Reward Function

How do we learn the parameters $\psi$ of our reward function?

  • Maximum likelihood learning!
  • Maximize $\frac{1}{N} \sum_{i=1}^{N} \log p(\tau_i \mid \mathcal{O}_{1:T}, \psi)$
    • Which is equivalent to maximizing $\frac{1}{N} \sum_{i=1}^{N} r_\psi(\tau_i) - \log Z$ (ignoring $p(\tau)$, since it is independent of $\psi$), where $r_\psi(\tau) = \sum_t r_\psi(s_t, a_t)$
  • Now what does this mean?
  • Essentially, it says to pick the parameters of $r_\psi$ such that we maximize the average reward of the expert trajectories minus a log normalizer $\log Z$ (the partition function)

The partition function is $Z = \int p(\tau) \exp(r_\psi(\tau)) \, d\tau$

Substituting $Z$ into the objective and differentiating gives

$\nabla_\psi \mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \nabla_\psi r_\psi(\tau_i) - E_{\tau \sim p(\tau \mid \mathcal{O}_{1:T}, \psi)}\!\left[ \nabla_\psi r_\psi(\tau) \right]$

i.e. the expert’s expected reward gradient minus the reward gradient expected under the current soft-optimal policy.
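A sample-based sketch of one gradient step on this objective, assuming a torch reward network like the earlier `RewardNet` and assuming `sampled_trajs` come (approximately) from the current soft-optimal policy; importance weights for an imperfect sampler are omitted:

```python
import torch

def traj_reward(reward_net, traj):
    # traj: (states, actions) tensors of shape (T, obs_dim) and (T, act_dim)
    states, actions = traj
    return reward_net(states, actions).sum()

def maxent_irl_step(reward_net, optimizer, expert_trajs, sampled_trajs):
    """One ascent step on (1/N) sum_i r_psi(tau_i) - log Z, where the log Z gradient
       is approximated with trajectories sampled from the current soft-optimal policy."""
    expert_term = torch.stack([traj_reward(reward_net, t) for t in expert_trajs]).mean()
    sample_term = torch.stack([traj_reward(reward_net, t) for t in sampled_trajs]).mean()
    loss = -(expert_term - sample_term)  # negate: optimizers minimize
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```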

Approximations in High Dimensions

IRL and GANs