Brendan Ang

Search

Search IconIcon to open search

Monte Carlo Policy

Last updated Nov 8, 2022 Edit Source

# Monte Carlo

Pasted image 20220415190047 Estimate the value function from sampling: Pasted image 20220415190342

First visit MC: average returns only for first time (s,a) is visited in an episode/trial Repeated visits of (s,a) in the trial does not constitute a new learning condition

Grid World Scenario: Discount factor $\gamma = 1$

Trial(1,1)(2,2)
(1,1)->(1,2)->(1,3)$G_t=0+0+1^2\times-5=-5$NA
(1,1)->(1,2)->(2,2)->(2,3)$G_t=0+0+5=5$$G_t=5$
(1,1)->(2,1)->(2,2)->(2,3)$G_t=5$$G_t=5$
Monte Carlo Estimates Q for (1,1): $\frac{5+5-5}{3}=\frac{5}{3}$
Monte Carlo Estimates Q for (2,2): $\frac{5+5}{2}=5$

This only works when we have the entire path ending in a goal state, what if we do not have this whole path? Use Q-Learning