Here are some relevant basic terms. I’d recommend familiarizing yourself with what each of these means, as they’ll come up often in these notes.
Formal Representations
- Agent: interacts with the environment
- s: current state
- s': next state
- r: reward
- a: action
- R(s, a): reward function
- Trajectory τ: sequence of states and actions; each (s, a, r, s') step is a transition
- Horizon H: length of a trajectory
- Return R(τ): cumulative reward along a trajectory
- Discount factor γ: down-weights rewards received further in the future
- Policy π: determines the actions to be taken
- Deterministic policy: a = π(s)
- Stochastic policy: a ~ π(· | s)
- State value function V^π(s): expected return given state s and policy π
  - Alternative notation: V_π(s)
- Action value function Q^π(s, a): expected return given state s, action a, and policy π
  - Alternative notation: Q_π(s, a)
- Expected return: E_{τ~π} [ R(τ) ], the quantity the value functions measure
- Optimal state value function V*(s): expected return if you start in state s and always act according to the optimal policy
- Optimal action value function Q*(s, a): expected return if you start in state s, take action a, and then always act according to the optimal policy
Bellman equations for optimal value functions
- Stochastic: V*(s) = max_a [ R(s, a) + γ Σ_{s'} P(s' | s, a) V*(s') ]
- Deterministic: V*(s) = max_a [ R(s, a) + γ V*(s') ], where s' is the state that action a leads to
Task Types
- Episodic task: limited length
- Continuing task: unlimited length
Environments
- Deterministic: only one possible transition for a given state and action
- Non-deterministic: multiple possible transitions for a given state and action
- Stochastic: non-deterministic with known probabilities for the transitions
Action types
- Discrete (categorical)
- Continuous (Gaussian)
Sequential decision problem
- Analyze a sequence of actions based on expected rewards
Markov Decision Process (MDP): formalizes sequential decision problems in stochastic environments
- Defined by a set of states, a set of actions, a transition probability function, and a reward function (a minimal container is sketched below)
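For concreteness, here is a minimal sketch of how those components might be bundled in code for the tabular case; the class name `TabularMDP` and its fields are illustrative assumptions, not notation from these notes.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class TabularMDP:
    """Hypothetical container for a finite MDP (names are illustrative).

    transitions[s, a, s'] holds P(s' | s, a); rewards[s, a] holds R(s, a).
    """
    num_states: int
    num_actions: int
    transitions: np.ndarray  # shape (S, A, S); each transitions[s, a] sums to 1
    rewards: np.ndarray      # shape (S, A); expected immediate reward
    gamma: float = 0.99      # discount factor
```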
Markov property
- Property such that the evolution of a Markov process depends only on the present state, not on the full history
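In symbols, one standard way to state this is that conditioning on the full history gives the same next-state distribution as conditioning on the current state and action alone:

```latex
P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0) = P(s_{t+1} \mid s_t, a_t)
```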
Model-Based and Model-Free Methods:
- Model-Based Methods: Algorithms that involve learning or using a model of the environment.
- Model-Free Methods: Algorithms that learn directly from interactions with the environment without an explicit model.
Policy Iteration:
- An algorithm for finding the optimal policy by iteratively improving the policy and evaluating it.
Value Iteration:
- An algorithm that successively improves the value function estimation and derives the optimal policy from it.
Temporal Difference (TD) Learning:
- A class of model-free methods that learn by bootstrapping from the current estimate of the value function.
SARSA (State-Action-Reward-State-Action):
- An on-policy TD learning algorithm.
- Learns Q-values based on the action taken by the current policy.
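A minimal sketch of the tabular SARSA update, assuming a NumPy Q-table and hypothetical hyperparameters `alpha` (step size) and `gamma` (discount factor):

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One tabular SARSA step on a (num_states, num_actions) Q-table.

    a_next is the action the *current* policy actually takes in s_next,
    which is what makes the update on-policy.
    """
    td_target = r + gamma * Q[s_next, a_next]  # bootstrap from the next state-action pair
    Q[s, a] += alpha * (td_target - Q[s, a])   # move Q(s, a) toward the TD target
    return Q

# Example: a 5-state, 2-action table, updated after one (s, a, r, s', a') step.
Q = np.zeros((5, 2))
Q = sarsa_update(Q, s=0, a=1, r=1.0, s_next=3, a_next=0)
```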
Off-Policy Learning:
- Learning a policy different from the policy used to generate the data.
Replay Buffer:
- A data structure used to store and replay past experiences in order to break the temporal correlations in sequential data.
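A minimal sketch of such a buffer with uniform sampling; the class and method names here are illustrative, not from any particular library.

```python
import random
from collections import deque

class ReplayBuffer:
    """FIFO buffer of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity=100_000):
        self.storage = deque(maxlen=capacity)  # oldest experiences are evicted automatically

    def add(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation between consecutive steps.
        batch = random.sample(self.storage, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.storage)
```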
Curriculum Learning:
- A training methodology in reinforcement learning where tasks are gradually increased in complexity to facilitate learning.
Partial Observability:
- Describes environments where the agent does not have access to the complete state, leading to the Partially Observable Markov Decision Process (POMDP) framework.
Reward Shaping:
- Modifying the reward function to make learning faster or easier in reinforcement learning.
Value Function
How do you determine the value of a given state?
- Notation for the expected return under a stochastic policy: E_{τ~π} [ R(τ) ]
- The value function quantifies the value of a state (expected return given state s and policy π)
- Alternative notation that specifies the policy: V^π(s)
How do you determine the value of a given action?
- The action value function gives the expected return
- Specified by state s, action a, and policy π
- Alternative notation: Q^π(s, a)
Optimal value function V*(s): expected return if you start in state s and always act according to the optimal policy
- Optimal action-value function Q*(s, a): expected return if you start in state s, take action a, and then act according to the optimal policy in the environment
Value iteration
Key idea: using Bellman equations in practice
Obtaining the optimal policy given a value function
- Using the Q function: π*(s) = argmax_a Q*(s, a)
- Using the V function: π*(s) = argmax_a [ R(s, a) + γ Σ_{s'} P(s' | s, a) V*(s') ]
Value iteration algorithm: estimate the optimal value function
Approach
- Obtain the optimal value function using the Bellman equation
- Obtain the optimal policy from the obtained optimal value function
Algorithm
Until Q no longer changes, increase n from 1 by 1:
- For each (s, a, s') tuple of the transition model:
  - Q_n(s, a) ← Σ_{s'} P(s' | s, a) [ R(s, a, s') + γ max_{a'} Q_{n-1}(s', a') ]
  - Note: Q_n is the value estimate at iteration n
Return π such that π(s) = argmax_a Q_n(s, a)
Essentially, fill in a Q table at each iteration n, then return a table for π* by picking, for each state, the action with the highest Q-value.
Designed to satisfy the Bellman equation after multiple iterations.
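A minimal NumPy sketch of the Q-table version of this procedure; the array names `P` and `R` and the stopping tolerance are assumptions rather than anything fixed by the notes.

```python
import numpy as np

def q_value_iteration(P, R, gamma=0.99, tol=1e-8):
    """Tabular value iteration on a Q-table.

    P[s, a, s'] is the transition probability, R[s, a] the expected reward.
    Repeats the Bellman optimality backup until Q stops changing, then reads
    off the greedy policy pi(s) = argmax_a Q(s, a).
    """
    num_states, num_actions, _ = P.shape
    Q = np.zeros((num_states, num_actions))
    while True:
        # Bellman backup: reward plus discounted value of the best next action.
        Q_next = R + gamma * (P @ Q.max(axis=1))
        delta = np.max(np.abs(Q_next - Q))
        Q = Q_next
        if delta < tol:
            break
    policy = Q.argmax(axis=1)  # greedy action for each state
    return Q, policy
```

With the hypothetical `TabularMDP` container from earlier, this would be called as `q_value_iteration(mdp.transitions, mdp.rewards, mdp.gamma)`.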
Categorical Policy: classifier over discrete actions.
- Build the neural network for a categorical policy like you would for a classifier:
- Input is observation
- Some number of layers
- Final linear layer (to get logits) and softmax (logits → probabilities)
- Sampling: Given probabilities for each action, PyTorch has built-in tools for sampling
- Log likelihood: denote the last layer of probabilities P_θ(s), a vector with as many entries as there are actions. Treat actions as indices into this vector.
- The log likelihood for an action a is log π_θ(a|s) = log [P_θ(s)]_a
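A minimal PyTorch sketch of the above; the network sizes and layer choices are placeholders.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

obs_dim, n_actions = 8, 4  # assumed dimensions

# Observation -> some layers -> final linear layer producing logits.
policy_net = nn.Sequential(
    nn.Linear(obs_dim, 64),
    nn.Tanh(),
    nn.Linear(64, n_actions),
)

obs = torch.randn(1, obs_dim)       # stand-in observation
logits = policy_net(obs)
dist = Categorical(logits=logits)   # softmax from logits to probabilities happens inside
action = dist.sample()              # built-in sampling
log_prob = dist.log_prob(action)    # log pi(a|s): log of the a-th entry of the probability vector
```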
Diagonal Gaussian Policies:
Multivariate (normal) Gaussian distribution:
- We know a multivariate Gaussian distribution is described by
- A mean vector μ (the means of the individual variables)
- A covariance matrix Σ (the covariances between each pair of variables in the distribution)
- Σ_11 = σ_1²: variance of the first variable (the square of its standard deviation)
- Σ_22 = σ_2²: variance of the second variable
- Σ_12 (or Σ_21, as they are equal in a covariance matrix): the covariance between the first and second variables
Diagonal Gaussian distribution: special case where the covariance matrix only has nonzero values on the diagonal.
- Implies no correlation between different variables: the variables are independent of each other
- Essentially Σ = diag(σ_1², …, σ_k²), so it can also be represented with a single vector σ of standard deviations
Diagonal Gaussian policy
This policy always has a neural network that maps from observations to mean actions, μ_θ(s)
Two ways the covariance matrix is typically represented
- A single vector of log standard deviations, log σ
  - Not a function of state: standalone parameters
- A neural network that maps from states to log standard deviations, log σ_θ(s)
  - May share layers with the mean network
Note: we output log stds, not stds directly
- Why? Log stds are free to take on any values in (-∞, ∞), while stds must be non-negative, so there are no constraints to enforce during training; exponentiating recovers the std
Sampling
Given the mean action μ_θ(s), the std σ_θ(s), and a vector of noise z sampled from a spherical Gaussian, z ~ N(0, I):
- Action sample: a = μ_θ(s) + σ_θ(s) ⊙ z (elementwise product)
Log likelihood
The log likelihood of a k-dimensional action a, for a diagonal Gaussian with mean μ = μ_θ(s) and std σ = σ_θ(s), is
- log π_θ(a|s) = -½ ( Σ_{i=1}^{k} [ (a_i - μ_i)² / σ_i² + 2 log σ_i ] + k log 2π )
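A minimal PyTorch sketch of the first variant above (state-independent log stds); dimensions and network sizes are placeholders.

```python
import math

import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2  # assumed dimensions

# Mean network: observations -> mean actions mu_theta(s).
mu_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))

# Log stds as standalone parameters (not a function of state).
log_std = nn.Parameter(-0.5 * torch.ones(act_dim))

obs = torch.randn(1, obs_dim)
mu = mu_net(obs)
std = torch.exp(log_std)  # exponentiate to recover the stds

# Sampling: a = mu + std * z, with z drawn from a spherical Gaussian.
z = torch.randn_like(mu)
action = mu + std * z

# Log likelihood of the k-dimensional action under the diagonal Gaussian (formula above).
k = act_dim
log_prob = -0.5 * ((((action - mu) / std) ** 2 + 2 * log_std).sum(dim=-1)
                   + k * math.log(2 * math.pi))
```

The same quantity can also be obtained with `torch.distributions.Normal(mu, std).log_prob(action).sum(-1)`.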
Bellman Equations
- Key idea: value of your starting point is the reward you expect to get from being there plus wherever you land next
- The equations:
  - V^π(s) = E_{a~π, s'~P} [ r(s, a) + γ V^π(s') ]
  - Q^π(s, a) = E_{s'~P} [ r(s, a) + γ E_{a'~π} [ Q^π(s', a') ] ]
Bellman backup: the right-hand side of the Bellman equation, i.e. reward + (discounted) next value
Advantage function: sometimes we don’t need to describe how good an action is in an absolute sense, but only how much better it is than others on average.
- i.e. the relative advantage of that action
- Equation: A^π(s, a) = Q^π(s, a) - V^π(s)
- The advantage function corresponding to a policy π describes how much better it is to take a specific action a in state s than to select an action at random according to π(· | s), assuming you act according to π forever afterwards
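To make “better than others on average” precise: combining the definition with the relation between V^π and Q^π shows that the advantage averages to zero under the policy itself.

```latex
A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s), \qquad
V^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[ Q^{\pi}(s, a) \right]
\;\Longrightarrow\;
\mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[ A^{\pi}(s, a) \right] = 0
```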
Goal of Reinforcement Learning
Background info
- Probability of a given trajectory under policy π_θ: p_θ(τ) = p(s_1) ∏_{t=1}^{T} π_θ(a_t | s_t) p(s_{t+1} | s_t, a_t)
Finite Horizon
Core idea: sample from the state action marginal distribution, not from a trajectory distribution
- Reduces variance, more efficient, better for long horizons
The trajectory-level objective E_{τ~p_θ(τ)} [ Σ_t r(s_t, a_t) ] becomes a sum of per-timestep expectations
- Sampling from the state-action marginal p_θ(s_t, a_t), as written out below
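Written out, the rewriting above is just linearity of expectation: the expectation over whole trajectories turns into a sum of expectations over per-timestep state-action marginals.

```latex
\theta^{*}
= \arg\max_{\theta} \, \mathbb{E}_{\tau \sim p_{\theta}(\tau)}\left[ \sum_{t=1}^{T} r(s_t, a_t) \right]
= \arg\max_{\theta} \, \sum_{t=1}^{T} \mathbb{E}_{(s_t, a_t) \sim p_{\theta}(s_t, a_t)}\left[ r(s_t, a_t) \right]
```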
Infinite Horizon
Stationary distribution
Goal is still θ* = argmax_θ Σ_{t=1}^{T} E_{(s_t, a_t)~p_θ(s_t, a_t)} [ r(s_t, a_t) ] (often averaged with a 1/T factor so it stays finite as T → ∞)
If T = ∞, it matters whether or not p(s_t, a_t) converges to a stationary distribution
- i.e. whether μ = 𝒯μ, where μ is the stationary state-action distribution and 𝒯 is the state-action transition operator
- i.e. (𝒯 - I) μ = 0, so μ is an eigenvector of 𝒯 with eigenvalue 1
For infinite horizons, we can consider E_{(s, a)~p_θ(s, a)} [ r(s, a) ], with (s, a) drawn from the stationary distribution
- No sum, as we just want what’s expected in the end (in the stationary distribution)
Anatomy of RL algorithms
- Generate samples
- Fit model / estimate return
- Improve policy
Value Functions
We know that our goal is to optimize the expected total reward over the timesteps of a trajectory
- Represented by E_{τ~p_θ(τ)} [ Σ_t r(s_t, a_t) ]
How do we expand this expectation?
- E_{τ~p_θ(τ)} [ Σ_t r(s_t, a_t) ] = E_{s_1~p(s_1)} [ E_{a_1~π(a_1|s_1)} [ r(s_1, a_1) + E_{s_2~p(s_2|s_1,a_1)} [ E_{a_2~π(a_2|s_2)} [ r(s_2, a_2) + … ] ] ] ]
- The expectation over the initial state, of the expectation over the action given that state, of the reward plus the expectation over the next state, and so on
State-action value function
We can represent the inner part of this nested expectation with Q
- State-action value function: Q^π(s_t, a_t) = Σ_{t'=t}^{T} E_π [ r(s_{t'}, a_{t'}) | s_t, a_t ]
- Total reward from taking action a_t in state s_t (and then following π)
Value function
- Value function V^π(s_t): the same, but given only a state
- Equivalent to E_{a_t~π(a_t|s_t)} [ Q^π(s_t, a_t) ] (both written out in the block below)
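In this notation, the two value functions above can be written compactly as below; the overall objective is then the expected value of the initial state, E_{s_1~p(s_1)} [ V^π(s_1) ].

```latex
Q^{\pi}(s_t, a_t) = \sum_{t' = t}^{T} \mathbb{E}_{\pi}\left[ r(s_{t'}, a_{t'}) \mid s_t, a_t \right],
\qquad
V^{\pi}(s_t) = \mathbb{E}_{a_t \sim \pi(a_t \mid s_t)}\left[ Q^{\pi}(s_t, a_t) \right]
```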
Core ideas
- Idea 1: You can improve a policy π if you have Q^π(s, a): in each state, pick the action argmax_a Q^π(s, a)
- Idea 2: Compute a gradient to increase the probability of good actions a given the value function: if Q^π(s, a) > V^π(s), then a is better than the average action under π
RL algorithms: review of types
Core objective: θ* = argmax_θ E_{τ~p_θ(τ)} [ Σ_t r(s_t, a_t) ]
- Policy gradient: directly differentiate the objective
- Value-based: estimate the V/Q-function of the optimal policy (no explicit policy, however)
- Actor-critic: estimate the V/Q-function of the current policy, use it to improve the policy
- Model-based RL (MBRL): estimate the transition model, then either
- use for planning
- use to improve a policy
- something else
Why so many?
- Tradeoffs
- Sample efficiency
- Stability + ease of use
- Assumptions
- Stochastic / deterministic
- Continuous / discrete
- Horizon
- Some things are easy/hard in different settings
- Difficulty representing model? Policy?