Here are some relevant basic terms. I’d recommend familiarizing yourself with what each of these means, as they’ll come up often in these notes.

Formal Representations

  • Agent: interacts with the environment
  • current state $s_t$
  • next state $s_{t+1}$ (often written $s'$)
  • reward $r_t$
  • action $a_t$
  • reward function $r(s, a)$
  • trajectory $\tau$
    • Sequence of states and actions: $\tau = (s_0, a_0, s_1, a_1, \dots)$
  • transition: a single step $(s, a, r, s')$
  • horizon $T$
    • Length of a trajectory
  • return $R(\tau)$ (the cumulative reward along a trajectory)
    • Finite-horizon undiscounted: $R(\tau) = \sum_{t=0}^{T} r_t$
    • or infinite-horizon discounted: $R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t$
  • discount factor $\gamma \in (0, 1)$
  • policy $\pi$
    • Determines which action to take in each state
  • deterministic policy: $a_t = \mu(s_t)$
  • stochastic policy: $a_t \sim \pi(\cdot \mid s_t)$
  • state value function $V^\pi(s)$
    • Expected return given state $s$ and policy $\pi$: $V^\pi(s) = \mathbb{E}_{\tau \sim \pi}[R(\tau) \mid s_0 = s]$
    • Alternative notation: $V_\pi(s)$
  • action value function $Q^\pi(s, a)$
    • Expected return given state $s$, action $a$, and policy $\pi$: $Q^\pi(s, a) = \mathbb{E}_{\tau \sim \pi}[R(\tau) \mid s_0 = s, a_0 = a]$
    • Alternative notation: $Q_\pi(s, a)$
  • expected return $J(\pi) = \mathbb{E}_{\tau \sim \pi}[R(\tau)]$
  • optimal state value function $V^*(s)$
    • Expected return if you start in state $s$ and always act according to the optimal policy
  • optimal action value function $Q^*(s, a)$
    • Expected return if you start in state $s$, take action $a$, and thereafter always act according to the optimal policy

Bellman equations for optimal value functions

  • Stochastic: $V^*(s) = \max_a \sum_{s'} P(s' \mid s, a)\,[r(s, a, s') + \gamma V^*(s')]$
  • Deterministic: $V^*(s) = \max_a [r(s, a) + \gamma V^*(s')]$, where $s'$ is the unique next state

Task Types

  • Episodic task: limited length
  • Continuing task: unlimited length

Environments

  • Deterministic: only one possible next state for a given state and action
  • Non-deterministic: multiple possible next states for a given state and action
  • Stochastic: non-deterministic with known probabilities for transitions

Action types

  • Discrete (categorical)
  • Continuous (Gaussian)

Sequential decision problem

  • Analyze a sequence of actions based on expected rewards

Markov Decision Process (MDP): formalizes sequential decision problems in stochastic environments

  • Defined by a set of states $S$, a set of actions $A$, a transition probability function $P(s' \mid s, a)$, and a reward function $R$

Markov property

  • Property such that the evolution of a Markov process depends only on the present state, not on the history of past states

Model-Based and Model-Free Methods:

  • Model-Based Methods: Algorithms that involve learning or using a model of the environment.
  • Model-Free Methods: Algorithms that learn directly from interactions with the environment without an explicit model.

Policy Iteration:

  • An algorithm for finding the optimal policy by iteratively improving the policy and evaluating it.
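A minimal policy-iteration sketch for a small tabular MDP, assuming a transition model stored as `P[s][a] = [(prob, next_state, reward), ...]` (that data layout and the function name are my own illustration, not something defined in these notes):

```python
import numpy as np

def policy_iteration(P, n_states, n_actions, gamma=0.99, eval_tol=1e-8):
    policy = np.zeros(n_states, dtype=int)
    V = np.zeros(n_states)
    while True:
        # Policy evaluation: iterate the Bellman expectation equation to convergence
        while True:
            delta = 0.0
            for s in range(n_states):
                v_new = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < eval_tol:
                break
        # Policy improvement: act greedily with respect to the evaluated V
        stable = True
        for s in range(n_states):
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]) for a in range(n_actions)]
            best = int(np.argmax(q))
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:
            return policy, V
```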

Value Iteration:

  • An algorithm that successively improves the value function estimation and derives the optimal policy from it.

Temporal Difference (TD) Learning:

  • A class of model-free methods that learn by bootstrapping from the current estimate of the value function.

SARSA (State-Action-Reward-State-Action):

  • An on-policy TD learning algorithm.
  • Learns Q-values based on the action taken by the current policy.
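A sketch of tabular SARSA; the environment interface (`env.reset()`, `env.step(a)` returning `(next_state, reward, done)`) and the epsilon-greedy helper are assumptions for illustration:

```python
import numpy as np

def epsilon_greedy(Q, s, n_actions, eps=0.1):
    # Explore with probability eps, otherwise act greedily w.r.t. current Q
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))

def sarsa(env, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.99):
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, n_actions)
        done = False
        while not done:
            s2, r, done = env.step(a)              # assumed env API
            a2 = epsilon_greedy(Q, s2, n_actions)  # next action chosen by the *current* policy (on-policy)
            # TD target uses Q(s', a') for the action actually taken next
            Q[s, a] += alpha * (r + gamma * Q[s2, a2] * (not done) - Q[s, a])
            s, a = s2, a2
    return Q
```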

Off-Policy Learning:

  • Learning a policy different from the policy used to generate the data.

Replay Buffer:

  • A data structure used to store and replay past experiences in order to break the temporal correlations in sequential data.
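A minimal replay-buffer sketch (uniform sampling and the transition fields shown are illustrative choices, not prescribed by these notes):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores transitions and samples uniformly at random to break temporal correlations."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # old transitions are evicted automatically

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```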

Curriculum Learning:

  • A training methodology in reinforcement learning where tasks are gradually increased in complexity to facilitate learning.

Partial Observability:

  • Describes environments where the agent does not have access to the complete state, leading to the Partially Observable Markov Decision Process (POMDP) framework.

Reward Shaping:

  • Modifying the reward function to make learning faster or easier in reinforcement learning.


Value Function

How do you determine the value of a given state?

  • Notation for expected return under a stochastic policy: $J(\pi) = \mathbb{E}_{\tau \sim \pi}[R(\tau)]$
  • The value function quantifies the value of a state (expected return given state $s$ and policy $\pi$): $V^\pi(s) = \mathbb{E}_{\tau \sim \pi}[R(\tau) \mid s_0 = s]$
  • Alternative notation (policy as subscript): $V_\pi(s)$

How do you determine the value of a given action?

  • The action value function gives the expected return
    • Specifying state $s$, action $a$, and policy $\pi$: $Q^\pi(s, a) = \mathbb{E}_{\tau \sim \pi}[R(\tau) \mid s_0 = s, a_0 = a]$
  • Alternative notation: $Q_\pi(s, a)$

Optimal value function $V^*(s)$: expected return if you start in state $s$ and always act according to the optimal policy

  • Optimal action-value function $Q^*(s, a)$: expected return if you start in state $s$, take action $a$, and then act according to the optimal policy in the environment

Value iteration

Key idea: using Bellman equations in practice

Obtaining the optimal policy given a value function

  • Using the Q function: $\pi^*(s) = \arg\max_a Q^*(s, a)$
  • Using the V function: $\pi^*(s) = \arg\max_a \sum_{s'} P(s' \mid s, a)\,[r(s, a, s') + \gamma V^*(s')]$

Value iteration algorithm: estimate the optimal value function

Approach

  1. Obtain the optimal value function
    1. Using the Bellman equation
  2. Obtain the optimal policy from the obtained optimal value function

Algorithm: until $Q$ stops changing, increase $n$ from 1 by 1

  • For each tuple $(s, a, s')$ of the transition model, update
    • $Q_{n+1}(s, a) = \sum_{s'} P(s' \mid s, a)\,[r(s, a, s') + \gamma \max_{a'} Q_n(s', a')]$
    • Note: $Q_n$ is the value estimate at iteration $n$. Return $\pi$ such that $\pi(s) = \arg\max_a Q_N(s, a)$

Essentially, fill in a Q table for each $n$, then return a table for $\pi$ by picking the action with the highest estimated value across the tables, for a given state

Designed to satisfy the Bellman equation after multiple iterations
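A minimal value-iteration sketch over a Q-table, assuming the same `P[s][a] = [(prob, next_state, reward), ...]` transition-model format as the policy-iteration sketch above:

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.99, tol=1e-8):
    Q = np.zeros((n_states, n_actions))
    while True:
        Q_next = np.zeros_like(Q)
        for s in range(n_states):
            for a in range(n_actions):
                # Bellman optimality backup: expected reward plus discounted max over next actions
                Q_next[s, a] = sum(p * (r + gamma * np.max(Q[s2])) for p, s2, r in P[s][a])
        if np.max(np.abs(Q_next - Q)) < tol:  # stop once Q stops changing
            break
        Q = Q_next
    policy = np.argmax(Q, axis=1)  # greedy policy extraction
    return policy, Q
```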

Categorical Policy: classifier over discrete actions.

  • Build nn for categorical policy like you would for classifier:
    • Input is observation
    • Some number of layers
    • Final linear layer (to get logits) and softmax (logits → probabilities)
  • Sampling: Given probabilities for each action, PyTorch has built-in tools for sampling
  • Log likelihood: Denote the last layer of probabilities $P_\theta(s)$. It is a vector with as many entries as there are actions, so we can treat actions as indices into the vector
    • Log likelihood for action $a$: $\log \pi_\theta(a \mid s) = \log [P_\theta(s)]_a$
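A small PyTorch sketch of a categorical policy (layer sizes and the `obs_dim`/`n_actions` names are illustrative):

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class CategoricalPolicy(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        # Same structure as a classifier: observation in, one logit per action out
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),  # final linear layer produces logits
        )

    def forward(self, obs):
        logits = self.net(obs)
        return Categorical(logits=logits)  # softmax over logits happens inside the distribution

# Usage: sample an action and get its log likelihood
policy = CategoricalPolicy(obs_dim=4, n_actions=2)
obs = torch.randn(1, 4)           # placeholder observation
dist = policy(obs)
action = dist.sample()            # PyTorch's built-in sampling
log_prob = dist.log_prob(action)  # log pi_theta(a | s)
```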

Diagonal Gaussian Policies:

Multivariate normal (Gaussian) distribution:

  • We know a multivariate Gaussian distribution is described by
    • a mean vector $\mu$ (means of the individual variables)
    • a covariance matrix $\Sigma$ (covariance between each pair of variables in the distribution)
      • $\Sigma_{11}$: variance of the first variable (which is the square of its standard deviation)
      • $\Sigma_{22}$: variance of the second variable
      • $\Sigma_{12}$ (or $\Sigma_{21}$, as they are equal in a covariance matrix) is the covariance between the first and second variables.

Diagonal Gaussian distribution: special case where covariance matrix only has values on diagonal.

  • Implies no correlation between different variables: the variables are independent of each other
  • Essentially $\Sigma = \mathrm{diag}(\sigma_1^2, \dots, \sigma_k^2)$, so we can also represent it with a vector $\sigma$ of standard deviations

Diagonal Gaussian policy

This policy always has a neural network that maps from observations to mean actions, $\mu_\theta(s)$

2 ways the covariance matrix is represented

  1. A single vector of log standard deviations, $\log \sigma$
    • Not a function of state: standalone parameters
  2. A neural network that maps from states to log standard deviations, $\log \sigma_\theta(s)$
    • May share layers with the mean network

Note: we output log stds, not stds directly

  • Why? Log stds can take on any values in $(-\infty, \infty)$, so the network output doesn’t need to be constrained (stds themselves must be positive); exponentiate the log stds to recover the stds

Sampling

Given the mean action $\mu_\theta(s)$, the std $\sigma_\theta(s)$, and a vector of noise $z \sim \mathcal{N}(0, I)$ from a spherical Gaussian:

  • Action sample: $a = \mu_\theta(s) + \sigma_\theta(s) \odot z$

Log Likelihood

The log likelihood of a $k$-dimensional action $a$, for a diagonal Gaussian with mean $\mu = \mu_\theta(s)$ and std $\sigma = \sigma_\theta(s)$, is

$$\log \pi_\theta(a \mid s) = -\frac{1}{2}\left(\sum_{i=1}^{k} \left(\frac{(a_i - \mu_i)^2}{\sigma_i^2} + 2 \log \sigma_i\right) + k \log 2\pi\right)$$
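A PyTorch sketch of a diagonal Gaussian policy using the first covariance representation (standalone log-std parameters); dimensions and names are illustrative:

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class DiagGaussianPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        # Network maps observations to mean actions mu_theta(s)
        self.mu_net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),
        )
        # Log stds as standalone parameters (not a function of state)
        self.log_std = nn.Parameter(-0.5 * torch.ones(act_dim))

    def forward(self, obs):
        mu = self.mu_net(obs)
        std = torch.exp(self.log_std)  # exponentiate to get positive stds
        return Normal(mu, std)         # independent Gaussian per action dimension

# Usage: sample a = mu + sigma * z and compute the log likelihood
policy = DiagGaussianPolicy(obs_dim=8, act_dim=2)
obs = torch.randn(1, 8)
dist = policy(obs)
action = dist.sample()
log_prob = dist.log_prob(action).sum(dim=-1)  # sum over dims for a diagonal Gaussian
```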

Bellman Equations

  • Key idea: the value of your starting point is the reward you expect to get from being there, plus the value of wherever you land next
  • The equations (for the on-policy value functions):
    • $V^\pi(s) = \mathbb{E}_{a \sim \pi,\, s' \sim P}\left[r(s, a) + \gamma V^\pi(s')\right]$
    • $Q^\pi(s, a) = \mathbb{E}_{s' \sim P}\left[r(s, a) + \gamma\, \mathbb{E}_{a' \sim \pi}\left[Q^\pi(s', a')\right]\right]$

Bellman backup: the right-hand side of a Bellman equation, i.e. reward + (discounted) next value

Advantage function: sometimes we don’t need to describe how good an action is in an absolute sense, but how much better it is than others on average.

  • i.e., the relative advantage of that action
  • Equation: $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$
  • $A^\pi(s, a)$ for a policy $\pi$ describes how much better it is to take a specific action $a$ in state $s$ than to select an action randomly according to $\pi(\cdot \mid s)$, assuming you act according to $\pi$ forever

Goal of Reinforcement Learning

Background info

  • probability of a given trajectory: $p_\theta(\tau) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$

Finite Horizon

Core idea: sample from the state action marginal distribution, not from a trajectory distribution

  • Reduces variance, more efficient, better for long horizons

$$\theta^* = \arg\max_\theta \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_{t=1}^{T} r(s_t, a_t)\right]$$

BECOMES

$$\theta^* = \arg\max_\theta \sum_{t=1}^{T} \mathbb{E}_{(s_t, a_t) \sim p_\theta(s_t, a_t)}\left[r(s_t, a_t)\right]$$

  • Sampling from the state-action marginal $p_\theta(s_t, a_t)$

Infinite Horizon

Stationary distribution: the state-action distribution $\mu = p_\theta(s, a)$ that the process eventually settles into

Goal is still

$$\theta^* = \arg\max_\theta \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{(s_t, a_t) \sim p_\theta(s_t, a_t)}\left[r(s_t, a_t)\right]$$

If $T = \infty$, it matters whether or not $p_\theta(s_t, a_t)$ converges to a stationary distribution $\mu$

  • i.e. $\mu = \mathcal{T}\mu$, where $\mathcal{T}$ is the state-action transition operator
  • i.e. $(\mathcal{T} - I)\mu = 0$, so $\mu$ is an eigenvector of $\mathcal{T}$ with eigenvalue 1

For infinite horizons, we can consider

$$\theta^* = \arg\max_\theta \mathbb{E}_{(s, a) \sim p_\theta(s, a)}\left[r(s, a)\right]$$

  • No sum, as we just want the expected reward under the stationary distribution in the end

Anatomy of RL algorithms

  1. Generate samples
  2. Fit model / estimate return
  3. Improve policy
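A schematic of that three-part loop as pseudocode-style Python; `collect_trajectories`, `estimate_returns`, and `improve_policy` are placeholders, not a real API:

```python
# Generic skeleton of an RL algorithm; each stage is a placeholder to be filled in
# differently by policy gradient, value-based, actor-critic, or model-based methods.
def rl_training_loop(env, policy, n_iterations):
    for _ in range(n_iterations):
        # 1. Generate samples: run the current policy in the environment
        trajectories = collect_trajectories(env, policy)

        # 2. Fit a model / estimate the return (e.g. fit Q/V, fit dynamics,
        #    or just sum rewards along each trajectory)
        estimates = estimate_returns(trajectories)

        # 3. Improve the policy (e.g. gradient step, greedy w.r.t. Q, planning)
        policy = improve_policy(policy, trajectories, estimates)
    return policy
```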

Value Functions

We know that our goal is to optimize the expected reward accumulated over the timesteps

  • Represented by $J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_t r(s_t, a_t)\right]$

How do we expand $J(\theta)$?

  • $J(\theta) = \mathbb{E}_{s_1 \sim p(s_1)}\left[\mathbb{E}_{a_1 \sim \pi(a_1 \mid s_1)}\left[r(s_1, a_1) + \mathbb{E}_{s_2 \sim p(s_2 \mid s_1, a_1)}\left[\dots \mid s_1, a_1\right] \mid s_1\right]\right]$
  • Expectation over the initial state, of the expectation over the action given that state, of the reward plus the expectation over the next state, and so on

State-action value function

We can represent part of this nested expectation with the state-action value function

  • State-action value function: $Q^\pi(s_t, a_t) = \sum_{t'=t}^{T} \mathbb{E}_{\pi_\theta}\left[r(s_{t'}, a_{t'}) \mid s_t, a_t\right]$
  • Total reward from taking action $a_t$ in state $s_t$ (and then following $\pi$)

Value function $V^\pi(s_t)$: same, but given only a state $s_t$

  • Equivalent to $V^\pi(s_t) = \mathbb{E}_{a_t \sim \pi(a_t \mid s_t)}\left[Q^\pi(s_t, a_t)\right]$

Core ideas

  • Idea 1: You can improve a policy if you know $Q^\pi(s, a)$ (e.g. set $\pi'(s) = \arg\max_a Q^\pi(s, a)$)
  • Idea 2: Compute a gradient to increase the probability of good actions $a$, given the value function

RL algorithms: review of types

Core objective: $\theta^* = \arg\max_\theta \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_t r(s_t, a_t)\right]$

  • Policy gradient: directly differentiate the objective
  • Value-based: estimate the V/Q-function of the optimal policy (no explicit policy, however)
  • Actor-critic: estimate the V/Q-function of the current policy, use it to improve the policy
  • Model-based RL (MBRL): estimate the transition model, then either
    • use it for planning
    • use it to improve a policy
    • something else

Why so many?

  • Tradeoffs
    • Sample efficiency
    • Stability + ease of use
  • Assumptions
    • Stochastic / deterministic
    • Continuous / discrete
    • Horizon
  • Some things are easy/hard in different settings
    • Difficulty representing the model? The policy?