define recursively and find max
define recursively and find max
n = inf
n = 0
average multiple n-step returns
λ = 0
λ = 1
multi-state MDP with one time step per episode
single-state MDP
Reinforcement learning
Finite MDPs
States, actions, and
rewards have a finite
number of elements
Infinite MDPs
States, actions, and
rewards have an infinite
number of elements
Bandit framework
Stochastic Bandits
environment
Upper Confidence
Bound (UCB) pg#84
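
A minimal sketch of the UCB idea above, assuming the classic UCB1 index (empirical mean plus a sqrt(2 ln t / n_i) bonus) on a Bernoulli bandit; the function name and arm means are illustrative, not the book's exact algorithm.

    import math, random

    def run_ucb1(arm_means, horizon=1000):
        # UCB1 on a Bernoulli bandit: empirical mean plus sqrt(2 ln t / n_i) bonus
        n = len(arm_means)
        pulls = [0] * n
        totals = [0.0] * n
        for t in range(1, horizon + 1):
            if t <= n:                      # play every arm once before using the index
                arm = t - 1
            else:                           # otherwise pick the arm with the largest index
                arm = max(range(n), key=lambda i:
                          totals[i] / pulls[i] + math.sqrt(2 * math.log(t) / pulls[i]))
            reward = 1.0 if random.random() < arm_means[arm] else 0.0
            pulls[arm] += 1
            totals[arm] += reward
        return pulls                        # pull counts should concentrate on the best arm

    print(run_ucb1([0.3, 0.5, 0.7]))
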
KL-UCB pg#115
stochastic linear
bandits pg#207
combinatorial bandits
pg#316
context given
contextual linear bandits
pg#207
LinUCB pg#208
regularized least
squared estimator
pg#208
confidence bounds
pg#219
optimal design pg#230
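
A minimal sketch tying together the items above: a regularized least-squares (ridge) estimate of theta (pg#208) and a LinUCB-style optimistic score <theta, x> + beta * ||x||_{V^-1}; the confidence width beta here is a hand-picked placeholder rather than the bound derived in the book.

    import numpy as np

    def linucb_choose(X, y, actions, beta=1.0, lam=1.0):
        # regularized least-squares estimate of theta from past features X and rewards y,
        # then pick the action feature x maximizing <theta, x> + beta * ||x||_{V^-1}
        d = actions[0].shape[0]
        V = lam * np.eye(d) + X.T @ X          # regularized design matrix
        theta = np.linalg.solve(V, X.T @ y)    # ridge / regularized LS estimator
        Vinv = np.linalg.inv(V)
        scores = [x @ theta + beta * np.sqrt(x @ Vinv @ x) for x in actions]
        return int(np.argmax(scores))

    # toy round: two past observations in 2 dimensions, three candidate action features
    X = np.array([[1.0, 0.0], [0.0, 1.0]])
    y = np.array([1.0, 0.2])
    actions = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.7, 0.7])]
    print(linucb_choose(X, y, actions))
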
Adversarial environment
EXP3 Algorithm
pg#127
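
A minimal sketch of one EXP3 round, assuming rewards in [0, 1] and an importance-weighted reward estimate; the learning rate eta is a placeholder, and the explicit uniform-exploration mixing used in some presentations is omitted.

    import math, random

    def exp3_step(weights, reward_fn, eta=0.1):
        # one round of EXP3: sample an arm from the exponential weights,
        # observe only that arm's reward, update with an importance-weighted estimate
        k = len(weights)
        total = sum(weights)
        probs = [w / total for w in weights]
        arm = random.choices(range(k), weights=probs)[0]
        r = reward_fn(arm)                    # adversarially chosen reward in [0, 1]
        r_hat = r / probs[arm]                # importance-weighted reward estimate
        weights[arm] *= math.exp(eta * r_hat)
        return arm

    weights = [1.0, 1.0, 1.0]
    for _ in range(10):
        exp3_step(weights, lambda a: 1.0 if a == 2 else 0.0)
    print(weights)
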
context given
bandits with expert
advice framework
Exp4
feature construction for
state representations
Radial Basis Function
(RBF)
polynomial
pg#210
Fourier basis
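
A minimal sketch of the Fourier basis above, assuming the state has already been normalized to [0, 1]^d; the order is illustrative.

    import itertools, math

    def fourier_features(state, order=3):
        # order-n Fourier basis: one feature cos(pi * c . s) for every integer
        # coefficient vector c in {0, ..., order}^d, with state s in [0, 1]^d
        return [math.cos(math.pi * sum(ci * si for ci, si in zip(c, state)))
                for c in itertools.product(range(order + 1), repeat=len(state))]

    print(len(fourier_features([0.2, 0.7], order=3)))   # (order + 1)^d = 16 features
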
coarse coding
tile coding
Neural Nets
Memory based
MDP Framework
How do we know if an MDP policy is
optimal?
action-value function
for policy π
state-value function for
policy π
Bellman optimality
conditions
pg#62
How do we solve Finite MDPs in practice?
Generalized Policy Iteration
model-based formulation
We know the MDP's four-argument
dynamics function p(s', r | s, a)
Dynamic programming
Value Iteration
algorithm
pg#82
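
A minimal sketch of value iteration, assuming the known MDP is given as a nested table P[state][action] = list of (probability, next_state, reward) outcomes; the toy MDP at the bottom is illustrative.

    def value_iteration(P, gamma=0.9, theta=1e-8):
        # sweep all states, backing each one up with the max over actions,
        # until the largest value change in a sweep falls below theta
        V = [0.0] * len(P)
        while True:
            delta = 0.0
            for s, actions in enumerate(P):
                v_new = max(sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                            for outcomes in actions)
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                return V

    # toy 2-state MDP: action 0 stays (reward 0), action 1 switches state (reward 1)
    P = [[[(1.0, 0, 0.0)], [(1.0, 1, 1.0)]],
         [[(1.0, 1, 0.0)], [(1.0, 0, 1.0)]]]
    print(value_iteration(P))
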
Policy Iteration
algorithm
pg#80
Asynchronous
implementation
asynchronous value
iteration algorithm
pg#85
Sampling-based
planning
Model-based data
generation
Value-equivalence
prediction
model-free formulation
(online learning)
We do not know the
model (MDP) of the
environment
Monte Carlo
pg#82
on-policy
prediction problem
(estimating value
function)
first-visit MC
pg#92
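
A minimal sketch of first-visit MC prediction, assuming episodes are recorded as lists of (state, reward) pairs generated by the policy being evaluated; the sample episodes are illustrative.

    from collections import defaultdict

    def first_visit_mc(episodes, gamma=1.0):
        # average the return that follows the first visit to each state
        total, count = defaultdict(float), defaultdict(int)
        for episode in episodes:                      # episode: [(state, reward), ...]
            first = {}
            for i, (s, _) in enumerate(episode):
                first.setdefault(s, i)
            G = 0.0
            for i in reversed(range(len(episode))):   # accumulate the return backwards
                s, r = episode[i]
                G = gamma * G + r
                if first[s] == i:                     # record only at the first visit
                    total[s] += G
                    count[s] += 1
        return {s: total[s] / count[s] for s in total}

    print(first_visit_mc([[("A", 0.0), ("B", 1.0)], [("A", 1.0)]]))
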
control problem
(given value
estimation)
Exploring Starts (ES)
pg#99
off-policy
prediction problem
(estimating value
function)
via importance
sampling
off-policy MC
prediction
pg#110
control problem
(given value
estimation)
Off-policy MC control
pg#111
one-step
temporal
difference
learning TD(0)
pg#119
on-policy
prediction problem
(estimating value
function)
Tabular TD(0)
pg#120
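
A minimal sketch of tabular TD(0) prediction, assuming a hypothetical reset()/step(state, action) environment interface; the random-walk environment at the bottom is illustrative.

    import random

    def td0_prediction(policy, reset, step, episodes=2000, alpha=0.1, gamma=1.0):
        # tabular TD(0): move V(s) toward the one-step target r + gamma * V(s')
        V = {}
        for _ in range(episodes):
            s, done = reset(), False
            while not done:
                s2, r, done = step(s, policy(s))      # hypothetical env interface
                target = r + (0.0 if done else gamma * V.get(s2, 0.0))
                V[s] = V.get(s, 0.0) + alpha * (target - V.get(s, 0.0))
                s = s2
        return V

    # toy 5-state random walk: start at 2, terminate at 0 or 4, reward 1 only at state 4
    def reset():
        return 2

    def step(s, a):
        s2 = s + random.choice([-1, 1])
        return s2, float(s2 == 4), s2 in (0, 4)

    print(td0_prediction(lambda s: None, reset, step))
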
control problem
(given value
estimation)
SARSA
pg#129
off-policy
via importance
sampling
prediction problem
(estimating value
function)
Value Iteration
algorithm
pg#82
control problem (given
value estimation)
Expected SARSA
pg#133
Q-learning
pg#131
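
A minimal sketch contrasting the one-step targets used by SARSA, Expected SARSA (with an epsilon-greedy policy), and Q-learning; Q is a dict keyed by (state, action) and all names are illustrative.

    def sarsa_target(Q, r, s2, a2, gamma):
        # on-policy: bootstrap from the action actually taken next
        return r + gamma * Q.get((s2, a2), 0.0)

    def expected_sarsa_target(Q, r, s2, actions, eps, gamma):
        # bootstrap from the expectation of Q under the eps-greedy policy at s'
        qs = [Q.get((s2, a), 0.0) for a in actions]
        expectation = (eps / len(actions)) * sum(qs) + (1 - eps) * max(qs)
        return r + gamma * expectation

    def q_learning_target(Q, r, s2, actions, gamma):
        # off-policy: bootstrap from the greedy action at s'
        return r + gamma * max(Q.get((s2, a), 0.0) for a in actions)

    def td_update(Q, s, a, target, alpha=0.1):
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))

    Q = {}
    td_update(Q, "s", 0, q_learning_target(Q, r=1.0, s2="s2", actions=[0, 1], gamma=0.9))
    print(Q)
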
λ-return algorithms
forward view
n-step bootstrapping
pg#142
on-policy
prediction problem
(estimating value
function)
n-step TD
pg#144
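
A minimal sketch of the n-step return that n-step TD bootstraps from, assuming a recorded trajectory of states and rewards and a current value table V; the full incremental algorithm on pg#144 also tracks the update time tau inside the episode.

    def n_step_return(rewards, states, t, n, V, gamma=0.99):
        # G_{t:t+n} = r_{t+1} + gamma*r_{t+2} + ... + gamma^{n-1}*r_{t+n} + gamma^n * V(s_{t+n})
        T = len(rewards)                              # episode length
        end = min(t + n, T)
        G = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
        if t + n < T:                                 # bootstrap only if the episode continues
            G += gamma ** n * V.get(states[t + n], 0.0)
        return G

    # rewards[k] is the reward received after the action taken in states[k]
    print(n_step_return([0.0, 0.0, 1.0], ["s0", "s1", "s2", "s3"], t=0, n=2,
                        V={"s2": 0.5}, gamma=1.0))    # 0 + 0 + 0.5 = 0.5
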
control problem (given
value estimation)
n-step SARSA
pg#147
off-policy
via importance
sampling
prediction problem
(estimating value
function)
control problem (given
value estimation)
n-step SARSA
pg#149
Tree backup
algorithm
pg#154
off-line λ-return
algorithm
pg#290
backward view w/
eligibility traces
TD(λ)
prediction problem
(estimating value
function)
semi-gradient TD(λ)
pg#293
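
A minimal sketch of one semi-gradient TD(λ) update with linear function approximation and an accumulating eligibility trace; the feature vectors and step size are illustrative.

    import numpy as np

    def td_lambda_step(w, z, x_s, x_s2, r, gamma=0.99, lam=0.9, alpha=0.01, terminal=False):
        # accumulating trace: z <- gamma*lam*z + x(s); weights: w <- w + alpha*delta*z
        v_s = w @ x_s
        v_s2 = 0.0 if terminal else w @ x_s2
        delta = r + gamma * v_s2 - v_s               # one-step TD error
        z = gamma * lam * z + x_s                    # for linear v, the gradient is x(s)
        w = w + alpha * delta * z
        return w, z

    d = 4
    w, z = np.zeros(d), np.zeros(d)
    w, z = td_lambda_step(w, z, np.eye(d)[0], np.eye(d)[1], r=1.0)
    print(w)
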
True online TD(λ)
pg#300
control problem (given
value estimation)
Sarsa(λ) with
binary features &
linear
approximation
pg#305
True online Sarsa(λ)
with binary
features & linear
approximation
pg#307
Policy gradient Methods
actor critic methods
Policy gradient theorem
REINFORCE
pg#328
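
A minimal sketch of REINFORCE for a softmax policy with linear action preferences and no baseline; the episode format and feature sizes are illustrative.

    import numpy as np

    def softmax_probs(theta, s_feat):
        prefs = theta @ s_feat                    # one preference per action
        prefs -= prefs.max()                      # numerical stability
        e = np.exp(prefs)
        return e / e.sum()

    def reinforce_episode(theta, episode, gamma=0.99, alpha=0.01):
        # episode: list of (state_features, action_taken, reward_received)
        returns, G = [], 0.0
        for _, _, r in reversed(episode):         # compute the return G_t for every step
            G = gamma * G + r
            returns.append(G)
        returns.reverse()
        for t, (s_feat, a, _) in enumerate(episode):
            probs = softmax_probs(theta, s_feat)
            grad_log = -np.outer(probs, s_feat)   # grad of log pi(a|s) for softmax-linear
            grad_log[a] += s_feat
            theta = theta + alpha * (gamma ** t) * returns[t] * grad_log
        return theta

    theta = np.zeros((2, 3))                      # 2 actions, 3 state features
    episode = [(np.array([1.0, 0.0, 0.0]), 0, 1.0),
               (np.array([0.0, 1.0, 0.0]), 1, 0.0)]
    print(reinforce_episode(theta, episode))
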