Motivation: I could not find an easy-to-understand post that explained exactly how policy iteration works and also included a working implementation.

Table of Contents:

What is Policy Iteration?

Concept

Policy Iteration is the process of finding an optimal policy for a given environment. In this context, a policy can be represented as a 1-D vector of actions, where the action at index i is the action to take in the state at index i. For example, imagine you're given a 5x5 grid. Each square is a state, so we can represent the state space as a 1-D vector S with 25 entries, one per square. For each state in S, there is an optimal action A that can be taken. Policy iteration computes the policy vector in which each action A is the optimal action for its corresponding state S.
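To make the vector representation concrete, here is a minimal sketch of a policy for a 5x5 grid. The row-by-row state numbering, the four-action set, and the helper names `state_to_cell` and `cell_to_state` are assumptions made for illustration; they are not taken from the GridWorld environment used later in this post.

```python
import numpy as np

GRID_SIZE = 5
N_STATES = GRID_SIZE * GRID_SIZE  # 25 squares -> 25 states

# Hypothetical action set; the real environment may define its own.
ACTIONS = ["up", "right", "down", "left"]


def state_to_cell(state):
    """Convert a state index into (row, col), assuming row-by-row numbering."""
    return divmod(state, GRID_SIZE)


def cell_to_state(row, col):
    """Convert (row, col) back into a state index."""
    return row * GRID_SIZE + col


# A policy is a 1-D vector: policy[i] is the action index to take in state i.
policy = np.zeros(N_STATES, dtype=int)  # start with "up" everywhere
policy[cell_to_state(2, 3)] = 1         # in the square at row 2, col 3, go "right"

state = cell_to_state(2, 3)
print(state, state_to_cell(state), ACTIONS[policy[state]])
```

The policy here is arbitrary, not optimal. Policy iteration starts from a vector like this one and repeatedly updates its entries until they stop changing.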

Jargon:

Sanity Checks:

What is Policy Evaluation?

Concept

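Policy Evaluation is the process of computing the value function for a given policy. The policy does not have to be optimal: within policy iteration, evaluation is applied to the current policy, which is then improved.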
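As a rough illustration, iterative policy evaluation can be sketched as repeated sweeps over the states until the values stop changing. Everything environment-specific below is an assumption made for this sketch and is not the GridWorld environment from this post: moves are deterministic, walking into a wall leaves the agent in place, every step costs a reward of -1, the bottom-right square is terminal, and the discount factor `gamma` is 0.9.

```python
import numpy as np

GRID_SIZE = 5
N_STATES = GRID_SIZE * GRID_SIZE
TERMINAL = N_STATES - 1                      # assumed: bottom-right square ends the episode
MOVES = [(-1, 0), (0, 1), (1, 0), (0, -1)]   # up, right, down, left


def step(state, action):
    """Assumed deterministic dynamics: return (next_state, reward)."""
    if state == TERMINAL:
        return state, 0.0
    row, col = divmod(state, GRID_SIZE)
    d_row, d_col = MOVES[action]
    # Clamp to the grid so that hitting a wall leaves the agent in place.
    new_row = min(max(row + d_row, 0), GRID_SIZE - 1)
    new_col = min(max(col + d_col, 0), GRID_SIZE - 1)
    return new_row * GRID_SIZE + new_col, -1.0


def evaluate_policy(policy, gamma=0.9, tol=1e-8):
    """Sweep over all states until the largest value change falls below tol."""
    values = np.zeros(N_STATES)
    while True:
        delta = 0.0
        for state in range(N_STATES):
            next_state, reward = step(state, policy[state])
            new_value = reward + gamma * values[next_state]
            delta = max(delta, abs(new_value - values[state]))
            values[state] = new_value
        if delta < tol:
            return values


policy = np.full(N_STATES, 1)  # an arbitrary, non-optimal policy: always move right
values = evaluate_policy(policy)
print(values.reshape(GRID_SIZE, GRID_SIZE).round(2))
```

Under these assumptions, only the bottom row ever reaches the terminal square, so its values are small in magnitude, while every other row keeps bumping into the right wall and its values approach -1 / (1 - 0.9) = -10.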

To see policy evaluation in action for a 5x5 GridWorld environment, copy the following code to a Python file and run it. Make sure the requirements are installed beforehand and that the GridWorld environment code is included.

Required GridWorld Environment Code

Store this code in the same file where you'll be writing your policy iteration code. If you store it in a separate file, make sure you can import it from your policy iteration file.