Reinforcement learning

- Finite MDPs: states, actions, and rewards have a finite number of elements
- Infinite MDPs: states, actions, and rewards have an infinite number of elements
- Bandit framework (a single-state MDP)
  - Stochastic bandit environment
    - Upper Confidence Bound (UCB), pg#84 (sketched after this outline)
    - KL-UCB, pg#115
    - Stochastic linear bandits, pg#207
    - Combinatorial bandits, pg#316
    - Context given (a multi-state MDP with one time step per episode)
      - Contextual linear bandits, pg#207
        - LinUCB, pg#208
        - Regularized least-squares estimator, pg#208
        - Confidence bounds, pg#219
        - Optimal design, pg#230
  - Adversarial environment
    - Exp3 algorithm, pg#127 (sketched after this outline)
    - Context given: bandits-with-expert-advice framework
      - Exp4
- Feature construction for state representations
  - Radial basis functions (RBF)
  - Polynomial basis, pg#210
  - Fourier basis (sketched after this outline)
  - Coarse coding
  - Tile coding
  - Neural nets
  - Memory-based
- MDP framework
  - How do we know if an MDP policy is optimal?
    - State-value function for policy π (defined recursively; optimality takes a max over actions)
    - Action-value function for policy π (defined recursively; optimality takes a max over actions)
    - Bellman optimality conditions, pg#62 (written out after this outline)
  - How do we solve finite MDPs in practice? Generalized policy iteration, in two formulations:
    - Model-based formulation: we know the MDP's four-argument dynamics function p(s', r | s, a)
      - Dynamic programming
        - Value iteration algorithm, pg#82 (sketched after this outline)
        - Policy iteration algorithm, pg#80
        - Asynchronous implementation: asynchronous value iteration algorithm, pg#85
      - Sampling-based planning
      - Model-based data generation
      - Value-equivalence prediction
    - Model-free formulation (online learning): the MDP dynamics are unknown, so value functions are learned directly from interaction with the environment
      - Monte Carlo, pg#91
        - On-policy
          - Prediction problem (estimating the value function): first-visit MC, pg#92 (sketched after this outline)
          - Control problem: Monte Carlo with Exploring Starts (ES), pg#99
        - Off-policy, via importance sampling
          - Prediction problem: off-policy MC prediction, pg#110
          - Control problem: off-policy MC control, pg#111
      - One-step temporal-difference learning, TD(0), pg#119
        - On-policy
          - Prediction problem: tabular TD(0), pg#120 (sketched after this outline)
          - Control problem: Sarsa, pg#129
        - Off-policy, via importance sampling
          - Prediction problem (estimating the value function)
          - Control problem (given value estimation)
            - Q-learning, pg#131 (sketched after this outline)
            - Expected Sarsa, pg#133
      - λ-return algorithms: average multiple n-step returns; λ = 0 gives one-step TD and λ = 1 gives Monte Carlo (in n-step terms, n = 1 and n = ∞)
        - Forward view: n-step bootstrapping, pg#142
          - On-policy
            - Prediction problem: n-step TD, pg#144 (sketched after this outline)
            - Control problem: n-step Sarsa, pg#147
          - Off-policy, via importance sampling
            - Prediction problem (estimating the value function)
            - Control problem (given value estimation)
              - Off-policy n-step Sarsa, pg#149
              - Tree-backup algorithm, pg#154
          - Offline λ-return algorithm, pg#290
        - Backward view, with eligibility traces: TD(λ)
          - Prediction problem
            - Semi-gradient TD(λ), pg#293 (sketched after this outline)
            - True online TD(λ), pg#300
          - Control problem (given value estimation)
            - Sarsa(λ) with binary features and linear approximation, pg#305
            - True online Sarsa(λ) with binary features and linear approximation, pg#307
- Policy gradient methods
  - Policy gradient theorem
  - REINFORCE, pg#328 (sketched after this outline)
  - Actor-critic methods
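The UCB index from the stochastic-bandit branch is compact enough to sketch in full. A minimal Bernoulli-bandit version, assuming the standard index (empirical mean plus a sqrt(2 ln t / n) bonus, which keeps suboptimal arms being tried at a logarithmic rate); the function and argument names are illustrative, not from the source.

```python
import math
import random

def ucb1(arm_means, horizon, seed=0):
    """UCB for a Bernoulli bandit: play the arm whose empirical mean plus
    exploration bonus is largest; the bonus shrinks as an arm is pulled."""
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k        # pulls per arm
    totals = [0.0] * k      # summed rewards per arm
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1     # pull every arm once to initialize the index
        else:
            arm = max(range(k), key=lambda i: totals[i] / counts[i]
                      + math.sqrt(2.0 * math.log(t) / counts[i]))
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
    return counts

print(ucb1([0.3, 0.5, 0.7], horizon=10_000))  # most pulls land on the 0.7 arm
```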
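For the adversarial branch, a minimal reward-based Exp3 sketch: exponential weights over arms, with an importance-weighted estimate so unplayed arms are not unfairly penalized. The `reward_fn` callback and the fixed learning rate `eta` are assumptions for illustration; the book derives the rate from the horizon and number of arms, and some presentations add explicit exploration mixing.

```python
import math
import random

def exp3(reward_fn, k, horizon, eta, seed=0):
    """Exp3 for adversarial bandits: sample an arm from a softmax over
    accumulated reward estimates, then update only the played arm with the
    importance-weighted reward r / p(arm)."""
    rng = random.Random(seed)
    log_weights = [0.0] * k
    for t in range(horizon):
        m = max(log_weights)                     # stabilize the softmax
        w = [math.exp(lw - m) for lw in log_weights]
        total = sum(w)
        probs = [wi / total for wi in w]
        arm = rng.choices(range(k), weights=probs)[0]
        reward = reward_fn(t, arm)               # adversary's reward in [0, 1]
        log_weights[arm] += eta * reward / probs[arm]
    return probs

# e.g. an "adversary" that pays off only on arm 0, on even rounds:
final = exp3(lambda t, a: float(a == 0 and t % 2 == 0), k=3, horizon=5000, eta=0.05)
```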
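Among the feature-construction schemes listed above, the Fourier basis is the easiest to state exactly: one cosine feature per integer coefficient vector. A sketch assuming the state has already been normalized into [0, 1]^d; `fourier_basis` is an illustrative name.

```python
import numpy as np
from itertools import product

def fourier_basis(order, state_dim):
    """Order-n Fourier basis on [0, 1]^d: features cos(pi * c . s) for every
    integer coefficient vector c in {0, ..., order}^d (c = 0 is the constant)."""
    coeffs = np.array(list(product(range(order + 1), repeat=state_dim)))
    def features(s):
        s = np.asarray(s, dtype=float)   # caller must normalize s into [0, 1]^d
        return np.cos(np.pi * coeffs @ s)
    return features, len(coeffs)

feats, dim = fourier_basis(order=3, state_dim=2)
print(dim, feats([0.5, 0.25])[:4])       # 16 features for a 2-D state
```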
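The "defined recursively, take the max" annotations in the map are the Bellman optimality conditions (pg#62) in words; written out, with p(s', r | s, a) the four-argument dynamics:

```latex
v_*(s)   = \max_a \sum_{s',r} p(s',r \mid s,a)\,\bigl[ r + \gamma\, v_*(s') \bigr]
q_*(s,a) = \sum_{s',r} p(s',r \mid s,a)\,\bigl[ r + \gamma \max_{a'} q_*(s',a') \bigr]
```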
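Value iteration (pg#82) turns the v_* equation above directly into an update rule. A tabular sketch, assuming the dynamics are supplied as `p[s][a] = [(prob, next_state, reward), ...]`; that representation and all names here are illustrative.

```python
def value_iteration(states, actions, p, gamma=0.9, theta=1e-8):
    """Sweep V(s) <- max_a sum p(s',r|s,a) [r + gamma V(s')] until the largest
    change falls below theta, then read off the greedy policy."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            backup = max(sum(pr * (r + gamma * V[s2]) for pr, s2, r in p[s][a])
                         for a in actions)
            delta = max(delta, abs(backup - V[s]))
            V[s] = backup
        if delta < theta:
            break
    policy = {s: max(actions, key=lambda a: sum(pr * (r + gamma * V[s2])
                                                for pr, s2, r in p[s][a]))
              for s in states}
    return V, policy

# toy two-state MDP: "stay" collects the state's reward, "go" switches state
states, actions = [0, 1], ["stay", "go"]
p = {0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 1.0)]},
     1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]}}
V, policy = value_iteration(states, actions, p)   # optimal: go to 1, then stay
```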
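First-visit Monte Carlo prediction (pg#92): average the returns that follow the first occurrence of each state. Sketched for episodes recorded as (state, reward) pairs, where the reward is the one received on leaving that state; the trajectory format is an assumption for illustration.

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """V(s) = average of the returns following the first visit to s per episode."""
    returns = defaultdict(list)
    for episode in episodes:
        # returns computed backwards: G_t = r_t + gamma * G_{t+1}
        G = 0.0
        returns_after = [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):
            _, r = episode[t]
            G = r + gamma * G
            returns_after[t] = G
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:            # first visit only
                seen.add(s)
                returns[s].append(returns_after[t])
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```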
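Tabular TD(0) (pg#120) replaces the full Monte Carlo return with a one-step bootstrapped target, so it can update during the episode. A sketch over a hypothetical `env_step(s, a) -> (next_state, reward, done)` interface.

```python
from collections import defaultdict
import random

def td0_prediction(env_step, policy, start_state, episodes, alpha=0.1, gamma=1.0):
    """Tabular TD(0): after each step, nudge V(s) toward r + gamma * V(s')."""
    V = defaultdict(float)
    for _ in range(episodes):
        s, done = start_state, False
        while not done:
            s2, r, done = env_step(s, policy(s))
            target = r + (0.0 if done else gamma * V[s2])
            V[s] += alpha * (target - V[s])      # TD(0) update
            s = s2
    return V

# e.g. a 5-state random walk (terminals at -1 and 5, reward 1 on the right exit)
rng = random.Random(0)
def walk(s, a):
    s2 = s + rng.choice([-1, 1])
    return s2, float(s2 == 5), s2 in (-1, 5)
V = td0_prediction(walk, policy=lambda s: None, start_state=2, episodes=5000)
```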
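Q-learning (pg#131) is the off-policy member of the one-step control family; it needs no importance sampling because its target bootstraps on the greedy action regardless of what the ε-greedy behavior policy actually does. The comments note how Sarsa (pg#129) and Expected Sarsa (pg#133) change only the target; same hypothetical `env_step` interface as above.

```python
import random
from collections import defaultdict

def q_learning(env_step, actions, start_state, episodes,
               alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    """Q-learning: epsilon-greedy behavior, greedy (max) bootstrap target.
    Sarsa target:          r + gamma * Q[(s2, next action actually taken)]
    Expected Sarsa target: r + gamma * sum_a pi(a|s2) * Q[(s2, a)]"""
    rng = random.Random(seed)
    Q = defaultdict(float)
    for _ in range(episodes):
        s, done = start_state, False
        while not done:
            if rng.random() < epsilon:
                a = rng.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s2, r, done = env_step(s, a)
            target = r if done else r + gamma * max(Q[(s2, a_)] for a_ in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q
```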
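n-step TD (pg#144) interpolates between TD(0) (n = 1) and Monte Carlo (n spanning the episode). A sketch of the forward-view prediction algorithm over the same hypothetical `env_step` interface; the buffering follows the usual pattern of updating time tau = t - n + 1 once its n-step window is full.

```python
from collections import defaultdict

def n_step_td(env_step, policy, start_state, episodes, n=4, alpha=0.1, gamma=1.0):
    """n-step TD prediction: the target for V(S_tau) is
    R_{tau+1} + gamma R_{tau+2} + ... + gamma^{n-1} R_{tau+n} + gamma^n V(S_{tau+n})."""
    V = defaultdict(float)
    for _ in range(episodes):
        states = [start_state]
        rewards = [0.0]                 # rewards[t] is received on entering states[t]
        T = None                        # episode length, unknown until termination
        t = 0
        while True:
            if T is None:               # still interacting with the environment
                s2, r, done = env_step(states[t], policy(states[t]))
                states.append(s2)
                rewards.append(r)
                if done:
                    T = t + 1
            tau = t - n + 1             # the time step whose value gets updated
            if tau >= 0:
                horizon = min(tau + n, T) if T is not None else tau + n
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, horizon + 1))
                if T is None or tau + n < T:
                    G += gamma ** n * V[states[tau + n]]   # bootstrap tail
                V[states[tau]] += alpha * (G - V[states[tau]])
            if T is not None and tau == T - 1:
                break
            t += 1
    return V
```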
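The backward view trades the n-step buffers for an eligibility trace. A sketch of semi-gradient TD(λ) with linear function approximation (pg#293); the `features(s)` callback and `dim` argument are assumptions for illustration.

```python
import numpy as np

def semi_gradient_td_lambda(env_step, policy, features, start_state,
                            episodes, dim, lam=0.8, alpha=0.01, gamma=1.0):
    """Semi-gradient TD(lambda): the trace z accumulates decayed feature
    vectors, and the weights move along z scaled by the one-step TD error.
    lambda = 0 reduces to TD(0); lambda = 1 approaches Monte Carlo."""
    w = np.zeros(dim)
    for _ in range(episodes):
        s, done = start_state, False
        z = np.zeros(dim)                      # eligibility trace
        while not done:
            s2, r, done = env_step(s, policy(s))
            x = features(s)
            v = w @ x
            v2 = 0.0 if done else w @ features(s2)
            delta = r + gamma * v2 - v         # one-step TD error
            z = gamma * lam * z + x            # accumulating trace
            w += alpha * delta * z
            s = s2
    return w
```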
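REINFORCE (pg#328) closes the map: a Monte Carlo policy gradient with a linear-softmax policy, sketched around a hypothetical `run_episode(act_fn)` helper that plays one episode and returns (state, action, reward) triples. The gamma^t factor on each update is omitted, as is common in episodic implementations.

```python
import numpy as np

def reinforce(run_episode, features, n_actions, dim,
              episodes=2000, alpha=0.01, gamma=1.0, seed=0):
    """REINFORCE: pi(a|s) proportional to exp(theta[a] . x(s)); after each
    episode, push every visited (s, a) along grad log pi(a|s) scaled by G_t."""
    rng = np.random.default_rng(seed)
    theta = np.zeros((n_actions, dim))

    def probs(x):
        logits = theta @ x
        logits -= logits.max()                   # numerical stability
        e = np.exp(logits)
        return e / e.sum()

    def act(s):
        return int(rng.choice(n_actions, p=probs(features(s))))

    for _ in range(episodes):
        trajectory = run_episode(act)            # [(state, action, reward), ...]
        G = 0.0
        for s, a, r in reversed(trajectory):
            G = r + gamma * G                    # return from time t onward
            x = features(s)
            p = probs(x)
            grad = -np.outer(p, x)               # grad log softmax, all actions
            grad[a] += x                         # ...plus the taken action's term
            theta += alpha * G * grad
    return theta
```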
