Here are some relevant basic terms. I’d recommend familiarizing yourself with what each of these means, as they’ll come up often in these notes.

Formal Representations

  • Agent: interacts with the environment
  • current state $s_t$
  • next state $s_{t+1}$ (often written $s'$)
  • reward $r_t$
  • action $a_t$
  • reward function $r(s, a)$
  • trajectory $\tau$
    • Sequence of states and actions: $\tau = (s_0, a_0, s_1, a_1, \dots)$
  • transition: a single step $(s, a, r, s')$
  • horizon $T$
    • Length of a trajectory
  • return $R(\tau)$ (the cumulative reward along a trajectory)
    • Finite-horizon undiscounted: $R(\tau) = \sum_{t=0}^{T} r_t$
    • or infinite-horizon discounted: $R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t$
  • discount factor $\gamma \in (0, 1)$
  • policy $\pi$
    • Determines which action to take in each state
  • deterministic policy: $a_t = \mu(s_t)$
  • stochastic policy: $a_t \sim \pi(\cdot \mid s_t)$
  • state value function $V^\pi(s)$
    • Expected return given state $s$ and policy $\pi$: $V^\pi(s) = \mathbb{E}_{\tau \sim \pi}[R(\tau) \mid s_0 = s]$
    • Alternative notation: $V_\pi(s)$
  • action value function $Q^\pi(s, a)$
    • Expected return given state $s$, action $a$, and policy $\pi$: $Q^\pi(s, a) = \mathbb{E}_{\tau \sim \pi}[R(\tau) \mid s_0 = s, a_0 = a]$
    • Alternative notation: $Q_\pi(s, a)$
  • expected return $J(\pi) = \mathbb{E}_{\tau \sim \pi}[R(\tau)]$
  • optimal state value function $V^*(s)$
    • Expected return if you start in state $s$ and always act according to the optimal policy
  • optimal action value function $Q^*(s, a)$
    • Expected return if you start in state $s$, take action $a$, and thereafter always act according to the optimal policy

Bellman equations for optimal value functions

  • Stochastic: $V^*(s) = \max_a \sum_{s'} P(s' \mid s, a)\,[r(s, a, s') + \gamma V^*(s')]$
  • Deterministic: $V^*(s) = \max_a [r(s, a) + \gamma V^*(s')]$, where $s'$ is the unique next state

Task Types

  • Episodic task: limited length
  • Continuing task: unlimited length

Environments

  • Deterministic: only one possible next state for a given state and action
  • Non-deterministic: multiple possible next states for a given state and action
  • Stochastic: non-deterministic with known probabilities for transitions

Action types

  • Discrete (categorical)
  • Continuous (Gaussian)

Sequential decision problem

  • Analyze a sequence of actions based on expected rewards

Markov Decision Process (MDP): formalizes sequential decision problems in stochastic environments

  • Defined by a set of states $S$, a set of actions $A$, a transition probability function $P(s' \mid s, a)$, and a reward function $R$

Markov property

  • Property such that the evolution of a Markov process depends only on the present state, not on the history of past states

Model-Based and Model-Free Methods:

  • Model-Based Methods: Algorithms that involve learning or using a model of the environment.
  • Model-Free Methods: Algorithms that learn directly from interactions with the environment without an explicit model.

Policy Iteration:

  • An algorithm for finding the optimal policy by iteratively improving the policy and evaluating it.
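A minimal policy-iteration sketch for a small tabular MDP, assuming a transition model stored as `P[s][a] = [(prob, next_state, reward), ...]` (that data layout and the function name are my own illustration, not something defined in these notes):

```python
import numpy as np

def policy_iteration(P, n_states, n_actions, gamma=0.99, eval_tol=1e-8):
    policy = np.zeros(n_states, dtype=int)
    V = np.zeros(n_states)
    while True:
        # Policy evaluation: iterate the Bellman expectation equation to convergence
        while True:
            delta = 0.0
            for s in range(n_states):
                v_new = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < eval_tol:
                break
        # Policy improvement: act greedily with respect to the evaluated V
        stable = True
        for s in range(n_states):
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]) for a in range(n_actions)]
            best = int(np.argmax(q))
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:
            return policy, V
```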

Value Iteration:

  • An algorithm that successively improves the value function estimation and derives the optimal policy from it.

Temporal Difference (TD) Learning:

  • A class of model-free methods that learn by bootstrapping from the current estimate of the value function.

SARSA (State-Action-Reward-State-Action):

  • An on-policy TD learning algorithm.
  • Learns Q-values based on the action taken by the current policy.
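A sketch of tabular SARSA; the environment interface (`env.reset()`, `env.step(a)` returning `(next_state, reward, done)`) and the epsilon-greedy helper are assumptions for illustration:

```python
import numpy as np

def epsilon_greedy(Q, s, n_actions, eps=0.1):
    # Explore with probability eps, otherwise act greedily w.r.t. current Q
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))

def sarsa(env, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.99):
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, n_actions)
        done = False
        while not done:
            s2, r, done = env.step(a)              # assumed env API
            a2 = epsilon_greedy(Q, s2, n_actions)  # next action chosen by the *current* policy (on-policy)
            # TD target uses Q(s', a') for the action actually taken next
            Q[s, a] += alpha * (r + gamma * Q[s2, a2] * (not done) - Q[s, a])
            s, a = s2, a2
    return Q
```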

Off-Policy Learning:

  • Learning a policy different from the policy used to generate the data.

Replay Buffer:

  • A data structure used to store and replay past experiences in order to break the temporal correlations in sequential data.
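A minimal replay-buffer sketch (uniform sampling and the transition fields shown are illustrative choices, not prescribed by these notes):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores transitions and samples uniformly at random to break temporal correlations."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # old transitions are evicted automatically

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```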

Curriculum Learning:

  • A training methodology in reinforcement learning where tasks are gradually increased in complexity to facilitate learning.

Partial Observability:

  • Describes environments where the agent does not have access to the complete state, leading to the Partially Observable Markov Decision Process (POMDP) framework.

Reward Shaping:

  • Modifying the reward function to make learning faster or easier in reinforcement learning.


Value Function

How do you determine the value of a given state?

  • Notation for expected return under a stochastic policy: $J(\pi) = \mathbb{E}_{\tau \sim \pi}[R(\tau)]$
  • The value function quantifies the value of a state (expected return given state $s$ and policy $\pi$): $V^\pi(s) = \mathbb{E}_{\tau \sim \pi}[R(\tau) \mid s_0 = s]$
  • Alternative notation (policy as subscript): $V_\pi(s)$

How do you determine the value of a given action?

  • The action value function gives the expected return
    • Specifying state $s$, action $a$, and policy $\pi$: $Q^\pi(s, a) = \mathbb{E}_{\tau \sim \pi}[R(\tau) \mid s_0 = s, a_0 = a]$
  • Alternative notation: $Q_\pi(s, a)$

Optimal value function $V^*(s)$: expected return if you start in state $s$ and always act according to the optimal policy

  • Optimal action-value function $Q^*(s, a)$: expected return if you start in state $s$, take action $a$, and then act according to the optimal policy in the environment

Value iteration

Key idea: using Bellman equations in practice

Obtaining the optimal policy given a value function

  • Using the Q function: $\pi^*(s) = \arg\max_a Q^*(s, a)$
  • Using the V function: $\pi^*(s) = \arg\max_a \sum_{s'} P(s' \mid s, a)\,[r(s, a, s') + \gamma V^*(s')]$

Value iteration algorithm: estimate the optimal value function

Approach

  1. Obtain the optimal value function
    1. Using the Bellman equation
  2. Obtain the optimal policy from the obtained optimal value function

Algorithm: until $Q$ stops changing, increase $n$ from 1 by 1

  • For each tuple $(s, a, s')$ of the transition model, update
    • $Q_{n+1}(s, a) = \sum_{s'} P(s' \mid s, a)\,[r(s, a, s') + \gamma \max_{a'} Q_n(s', a')]$
    • Note: $Q_n$ is the value estimate at iteration $n$. Return $\pi$ such that $\pi(s) = \arg\max_a Q_N(s, a)$

Essentially, fill in a Q table for each $n$, then return a table for $\pi$ by picking the action with the highest estimated value across the tables, for a given state

Designed to satisfy the Bellman equation after multiple iterations
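A minimal value-iteration sketch over a Q-table, assuming the same `P[s][a] = [(prob, next_state, reward), ...]` transition-model format as the policy-iteration sketch above:

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.99, tol=1e-8):
    Q = np.zeros((n_states, n_actions))
    while True:
        Q_next = np.zeros_like(Q)
        for s in range(n_states):
            for a in range(n_actions):
                # Bellman optimality backup: expected reward plus discounted max over next actions
                Q_next[s, a] = sum(p * (r + gamma * np.max(Q[s2])) for p, s2, r in P[s][a])
        if np.max(np.abs(Q_next - Q)) < tol:  # stop once Q stops changing
            break
        Q = Q_next
    policy = np.argmax(Q, axis=1)  # greedy policy extraction
    return policy, Q
```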

Categorical Policy: classifier over discrete actions.

  • Build nn for categorical policy like you would for classifier:
    • Input is observation
    • Some number of layers
    • Final linear layer (to get logits) and softmax (logits → probabilities)
  • Sampling: Given probabilities for each action, PyTorch has built-in tools for sampling
  • Log likelihood: Denote the last layer of probabilities $P_\theta(s)$. It is a vector with as many entries as there are actions, so we can treat actions as indices into the vector
    • Log likelihood for action $a$: $\log \pi_\theta(a \mid s) = \log [P_\theta(s)]_a$
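A small PyTorch sketch of a categorical policy (layer sizes and the `obs_dim`/`n_actions` names are illustrative):

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class CategoricalPolicy(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        # Same structure as a classifier: observation in, one logit per action out
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),  # final linear layer produces logits
        )

    def forward(self, obs):
        logits = self.net(obs)
        return Categorical(logits=logits)  # softmax over logits happens inside the distribution

# Usage: sample an action and get its log likelihood
policy = CategoricalPolicy(obs_dim=4, n_actions=2)
obs = torch.randn(1, 4)           # placeholder observation
dist = policy(obs)
action = dist.sample()            # PyTorch's built-in sampling
log_prob = dist.log_prob(action)  # log pi_theta(a | s)
```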

Diagonal Gaussian Policies:

Multivariate normal (Gaussian) distribution:

  • We know a multivariate Gaussian distribution is described by
    • a mean vector $\mu$ (means of the individual variables)
    • a covariance matrix $\Sigma$ (covariance between each pair of variables in the distribution)
      • $\Sigma_{11}$: variance of the first variable (which is the square of its standard deviation)
      • $\Sigma_{22}$: variance of the second variable
      • $\Sigma_{12}$ (or $\Sigma_{21}$, as they are equal in a covariance matrix) is the covariance between the first and second variables.

Diagonal Gaussian distribution: special case where covariance matrix only has values on diagonal.

  • Implies no correlation between different variables: the variables are independent of each other
  • Essentially $\Sigma = \mathrm{diag}(\sigma_1^2, \dots, \sigma_k^2)$, so we can also represent it with a vector $\sigma$ of standard deviations

Diagonal Gaussian policy

This policy always has a neural network that maps from observations to mean actions, $\mu_\theta(s)$

2 ways the covariance matrix is represented

  1. A single vector of log standard deviations, $\log \sigma$
    • Not a function of state: standalone parameters
  2. A neural network that maps from states to log standard deviations, $\log \sigma_\theta(s)$
    • May share layers with the mean network

Note: we output log stds, not stds directly

  • Why? Log stds can take on any values in $(-\infty, \infty)$, so the network output doesn’t need to be constrained (stds themselves must be positive); exponentiate the log stds to recover the stds

Sampling

Given the mean action $\mu_\theta(s)$, the std $\sigma_\theta(s)$, and a vector of noise $z \sim \mathcal{N}(0, I)$ from a spherical Gaussian:

  • Action sample: $a = \mu_\theta(s) + \sigma_\theta(s) \odot z$

Log Likelihood

The log likelihood of a $k$-dimensional action $a$, for a diagonal Gaussian with mean $\mu = \mu_\theta(s)$ and std $\sigma = \sigma_\theta(s)$, is

$$\log \pi_\theta(a \mid s) = -\frac{1}{2}\left(\sum_{i=1}^{k} \left(\frac{(a_i - \mu_i)^2}{\sigma_i^2} + 2 \log \sigma_i\right) + k \log 2\pi\right)$$
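A PyTorch sketch of a diagonal Gaussian policy using the first covariance representation (standalone log-std parameters); dimensions and names are illustrative:

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class DiagGaussianPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        # Network maps observations to mean actions mu_theta(s)
        self.mu_net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),
        )
        # Log stds as standalone parameters (not a function of state)
        self.log_std = nn.Parameter(-0.5 * torch.ones(act_dim))

    def forward(self, obs):
        mu = self.mu_net(obs)
        std = torch.exp(self.log_std)  # exponentiate to get positive stds
        return Normal(mu, std)         # independent Gaussian per action dimension

# Usage: sample a = mu + sigma * z and compute the log likelihood
policy = DiagGaussianPolicy(obs_dim=8, act_dim=2)
obs = torch.randn(1, 8)
dist = policy(obs)
action = dist.sample()
log_prob = dist.log_prob(action).sum(dim=-1)  # sum over dims for a diagonal Gaussian
```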

Bellman Equations

  • Key idea: the value of your starting point is the reward you expect to get from being there, plus the value of wherever you land next
  • The equations (for the on-policy value functions):
    • $V^\pi(s) = \mathbb{E}_{a \sim \pi,\, s' \sim P}\left[r(s, a) + \gamma V^\pi(s')\right]$
    • $Q^\pi(s, a) = \mathbb{E}_{s' \sim P}\left[r(s, a) + \gamma\, \mathbb{E}_{a' \sim \pi}\left[Q^\pi(s', a')\right]\right]$

Bellman backup: the right-hand side of a Bellman equation, i.e. reward + (discounted) next value

Advantage function: sometimes we don’t need to describe how good an action is in an absolute sense, but how much better it is than others on average.

  • i.e., the relative advantage of that action
  • Equation: $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$
  • $A^\pi(s, a)$ for a policy $\pi$ describes how much better it is to take a specific action $a$ in state $s$ than to select an action randomly according to $\pi(\cdot \mid s)$, assuming you act according to $\pi$ forever

Goal of Reinforcement Learning

Background info

  • probability of a given trajectory: $p_\theta(\tau) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$

Finite Horizon

Core idea: sample from the state action marginal distribution, not from a trajectory distribution

  • Reduces variance, more efficient, better for long horizons

$$\theta^* = \arg\max_\theta \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_{t=1}^{T} r(s_t, a_t)\right]$$

BECOMES

$$\theta^* = \arg\max_\theta \sum_{t=1}^{T} \mathbb{E}_{(s_t, a_t) \sim p_\theta(s_t, a_t)}\left[r(s_t, a_t)\right]$$

  • Sampling from the state-action marginal $p_\theta(s_t, a_t)$

Infinite Horizon

Stationary distribution: the state-action distribution $\mu = p_\theta(s, a)$ that the process eventually settles into

Goal is still

$$\theta^* = \arg\max_\theta \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{(s_t, a_t) \sim p_\theta(s_t, a_t)}\left[r(s_t, a_t)\right]$$

If $T = \infty$, it matters whether or not $p_\theta(s_t, a_t)$ converges to a stationary distribution $\mu$

  • i.e. $\mu = \mathcal{T}\mu$, where $\mathcal{T}$ is the state-action transition operator
  • i.e. $(\mathcal{T} - I)\mu = 0$, so $\mu$ is an eigenvector of $\mathcal{T}$ with eigenvalue 1

For infinite horizons, we can consider

$$\theta^* = \arg\max_\theta \mathbb{E}_{(s, a) \sim p_\theta(s, a)}\left[r(s, a)\right]$$

  • No sum, as we just want the expected reward under the stationary distribution in the end

Anatomy of RL algorithms

  1. Generate samples
  2. Fit model / estimate return
  3. Improve policy
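A schematic of that three-part loop as pseudocode-style Python; `collect_trajectories`, `estimate_returns`, and `improve_policy` are placeholders, not a real API:

```python
# Generic skeleton of an RL algorithm; each stage is a placeholder to be filled in
# differently by policy gradient, value-based, actor-critic, or model-based methods.
def rl_training_loop(env, policy, n_iterations):
    for _ in range(n_iterations):
        # 1. Generate samples: run the current policy in the environment
        trajectories = collect_trajectories(env, policy)

        # 2. Fit a model / estimate the return (e.g. fit Q/V, fit dynamics,
        #    or just sum rewards along each trajectory)
        estimates = estimate_returns(trajectories)

        # 3. Improve the policy (e.g. gradient step, greedy w.r.t. Q, planning)
        policy = improve_policy(policy, trajectories, estimates)
    return policy
```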

Value Functions

We know that our goal is to optimize the expected reward accumulated over the timesteps

  • Represented by $J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_t r(s_t, a_t)\right]$

How do we expand $J(\theta)$?

  • $J(\theta) = \mathbb{E}_{s_1 \sim p(s_1)}\left[\mathbb{E}_{a_1 \sim \pi(a_1 \mid s_1)}\left[r(s_1, a_1) + \mathbb{E}_{s_2 \sim p(s_2 \mid s_1, a_1)}\left[\dots \mid s_1, a_1\right] \mid s_1\right]\right]$
  • Expectation over the initial state, of the expectation over the action given that state, of the reward plus the expectation over the next state, and so on

State-action value function

We can represent part of this nested expectation with the state-action value function

  • State-action value function: $Q^\pi(s_t, a_t) = \sum_{t'=t}^{T} \mathbb{E}_{\pi_\theta}\left[r(s_{t'}, a_{t'}) \mid s_t, a_t\right]$
  • Total reward from taking action $a_t$ in state $s_t$ (and then following $\pi$)

Value function $V^\pi(s_t)$: same, but given only a state $s_t$

  • Equivalent to $V^\pi(s_t) = \mathbb{E}_{a_t \sim \pi(a_t \mid s_t)}\left[Q^\pi(s_t, a_t)\right]$

Core ideas

  • Idea 1: You can improve a policy if you know $Q^\pi(s, a)$ (e.g. set $\pi'(s) = \arg\max_a Q^\pi(s, a)$)
  • Idea 2: Compute a gradient to increase the probability of good actions $a$, given the value function

RL algorithms: review of types

Core objective: $\theta^* = \arg\max_\theta \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_t r(s_t, a_t)\right]$

  • Policy gradient: directly differentiate the objective
  • Value-based: estimate the V/Q-function of the optimal policy (no explicit policy, however)
  • Actor-critic: estimate the V/Q-function of the current policy, use it to improve the policy
  • Model-based RL (MBRL): estimate the transition model, then either
    • use it for planning
    • use it to improve a policy
    • something else

Why so many?

  • Tradeoffs
    • Sample efficiency
    • Stability + ease of use
  • Assumptions
    • Stochastic / deterministic
    • Continuous / discrete
    • Horizon
  • Some things are easy/hard in different settings
    • Difficulty representing the model? The policy?