Monte Carlo Policy

Last updated Nov 8, 2022

# Monte Carlo

Estimate the value function from sampling: average the returns observed over sampled episodes.

First-visit MC: average returns only for the first time (s,a) is visited in an episode/trial. Repeated visits to (s,a) within the same trial do not count as a new learning sample.
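The first-visit rule can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `first_visit_mc_q` and the episode format (a list of `(state, action, reward)` tuples, with `reward` being the reward received after taking the action) are assumptions made for this example.

```python
from collections import defaultdict


def first_visit_mc_q(episodes, gamma=1.0):
    """Estimate Q(s, a) by averaging first-visit returns.

    episodes: list of episodes; each episode is a list of
    (state, action, reward) tuples (hypothetical format for this sketch).
    """
    returns = defaultdict(list)
    for episode in episodes:
        # Scan backwards to compute the return G_t from each step onward.
        g = 0.0
        step_returns = []
        for state, action, reward in reversed(episode):
            g = reward + gamma * g
            step_returns.append(((state, action), g))
        step_returns.reverse()  # back to chronological order

        seen = set()
        for key, g_t in step_returns:
            if key in seen:
                continue  # repeated visit in the same episode: ignored
            seen.add(key)
            returns[key].append(g_t)

    # Q estimate = mean of the collected first-visit returns.
    return {key: sum(gs) / len(gs) for key, gs in returns.items()}


# Two toy episodes: (A, r) is visited twice in the first one,
# but only the first visit (return 1 + 2 = 3) is counted.
episodes = [
    [("A", "r", 1.0), ("A", "r", 2.0)],
    [("A", "r", -1.0)],
]
q = first_visit_mc_q(episodes)
print(q[("A", "r")])  # (3 + -1) / 2 = 1.0
```

An every-visit variant would drop the `seen` check and also count the second visit's return (2), giving a different average.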

Grid World Scenario: Discount factor $\gamma = 1$

Monte Carlo estimate of Q for (1,1): $\frac{5+5-5}{3}=\frac{5}{3}$

Monte Carlo estimate of Q for (2,2): $\frac{5+5}{2}=5$
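The two estimates above are plain averages of the sampled returns, which can be checked with a short script. The return values are read directly from the numerators of the formulas; the episodes that produced them are not shown here, so the `returns` dictionary below is an assumption for illustration.

```python
from fractions import Fraction

# Sampled returns per grid cell, taken from the formulas above (gamma = 1).
returns = {
    (1, 1): [5, 5, -5],
    (2, 2): [5, 5],
}

# The Monte Carlo estimate is the mean of the sampled returns.
q = {cell: Fraction(sum(gs), len(gs)) for cell, gs in returns.items()}

for cell, value in q.items():
    print(cell, value)
```

Using `Fraction` keeps the result for (1,1) exact as 5/3 rather than a rounded float.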

This only works when we have the entire path ending in a goal state, because the return can only be computed once the episode is complete. What if we do not have the whole path? Use Q-Learning.