Reinforcement learning #
This is my summary of David Silver’s lectures. They are very information-dense and may be worth a second watch to really get all the ideas (or read along and do some exercises).
Introduction to Reinforcement Learning #
 Science of decision making.
 Sits at the intersection of ML / OR / Neuroscience / Psychology / Economics.
 No supervisor, only a “reward signal”, and feedback may be delayed. Hence, time plays an important role. Data is not i.i.d. and the agent affects the data.
 Reward hypothesis: Any goal can be described by the maximisation of expected cumulative reward (reward \(R_t\) is a scalar feedback signal). An interesting discussion of the reward hypothesis here.
 At each step:
 The agent executes an action and receives an updated observation and the reward
 The environment receives the action, and emits the observation and the reward
 The history is the sequence of observations, actions and rewards to now. What happens next depends on the history: based on the history the agent selects actions, and the environment selects observations and rewards.
 Typically we look at the current state: a concise summary of the history, the information we will use to determine what happens next.
 Formally, state is a function of the history: \(S_t=f(H_t)\).
 Environment state. The data the environment uses to determine what the next observation/reward is. Not usually visible to the agent. A formalism to help us understand things but not usable (or particularly relevant) in RL algorithms.
 Agent state. The internal representation used by the agent to decide the next action.
 Information / Markov state. Contains “all useful” information from the history. A state \(S_t\) is Markov iff \[P(S_{t+1}\mid S_t) = P(S_{t+1}\mid S_1,\ldots,S_t).\] The future is independent of the past given the present. \(H_{1:t}\to S_t\to H_{t+1:\infty}\). The state is a sufficient statistic of the future.
 Markov state of the helicopter: position, velocity, angular velocities, etc. A non-Markov state would be just the position: you would need to “go back in time” through the history to determine the velocities.
 The environment state is Markov.
 The history \(H_t\) (considered as a state) is Markov.
 Fully observable environment. Agent state = environment state = information state. This is a Markov decision process.
 Partially observable environment.
 Most interesting examples.
 Partially observable Markov decision process.
 The agent must construct its own state representation.
 Simplest (quickly infeasible): \(S_t^a=H_t\).
 Bayesian: keep a vector of probabilities of what the environment state is.
 Recurrent neural network: \(S_t^a=\sigma(S_{t-1}^a W_s+O_tW_o)\).
 Major components of an RL agent
 Policy: the behaviour function. A map from state to action \(a=\pi(s)\) (if deterministic), \(\pi(a\mid s)=P(A=a\mid S=s)\) (if stochastic).
 Value function: scoring states and actions. Prediction of future reward, used to evaluate goodness/badness of states. \[v_\pi(s)=\mathbb{E}_\pi(R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\ldots\mid S_t=s).\]
 Model: representation of the environment, predicts what the environment will do next. Not a requirement (many examples in the course will be model-free). Two parts:
 Transition model (predicts the next state) \(P(S'=s'\mid S=s,A=a)\)
 Reward model (predicts the next (immediate) reward) \(\mathbb{E}(R\mid S=s, A=a)\).
 Maze example.
 Rewards: -1 per time step.
 Actions: N, E, S, W
 States: Agent’s location
 Policy: what to do in each state (arrows on the blocks in the maze)
 Value function: distance from goal based on this policy
 Categorising RL agents
 Value-based. No explicit policy (it is implicit in the value function: just pick actions greedily); value function.
 Policy-based. Policy, no value function.
 Actor-critic. Both a policy and a value function.
 Model-free. Policy and/or value function, no model.
 Model-based. Policy and/or value function, plus a model.
 Two fundamental problems in sequential decision making.
 Reinforcement learning. Environment is initially unknown. The agent interacts with the environment and improves its policy.
 Planning. The model of the environment is known and the agent performs computations with its model and improves its policy.
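The “maximisation of expected cumulative reward” idea above can be made concrete with a tiny sketch. The reward sequence and discount factor below are made up for illustration:

```python
# Hypothetical sketch: the discounted return of one sampled episode,
# G = R_1 + gamma*R_2 + gamma^2*R_3 + ...
def discounted_return(rewards, gamma):
    g = 0.0
    for r in reversed(rewards):  # accumulate backwards from the episode's end
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```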
Markov Decision Processes #
Markov processes #
 State transition probability \(P_{ss'}=P(S_{t+1}=s'\mid S_t=s)\). A Markov process (or chain) is a tuple \((S, P)\): a finite set of states and the matrix of state transition probabilities.
 An episode is sampling from the chain from the start to a terminal state.
Markov reward processes #
 Add a reward function \(R_s=\mathbb{E}(R_{t+1}\mid S_t=s)\)
 Add a discount factor \(\gamma\in[0,1]\). The present value of future rewards. 0 is nearsighted, 1 is farsighted. The value of receiving reward \(R\) after \(k+1\) timesteps is \(\gamma^k R\).
 The return \[ G_t=R_{t+1}+\gamma R_{t+2}+\ldots=\sum_{k=0}^\infty \gamma^k R_{t+k+1}. \]
 We discount because there is more uncertainty in the future, but mainly to keep the maths easier (the sums converge). It is also a decent cognitive model (supported by experiments).
 If all sequences terminate it is possible to use undiscounted Markov reward processes.
 The value function \(v(s)\) gives the long-term value of state \(s\). Formally, \[ v(s) = \mathbb{E}(G_t\mid S_t=s), \] i.e. the value function is the expected return when starting from state \(s\).
 The value function can be decomposed into two parts:
 the immediate reward \(R_{t+1}\), and
 the discounted value of the successor state \(\gamma v(S_{t+1})\).
 The Bellman equation: \[ v(s)= \mathbb{E}(R_{t+1}+\gamma v(S_{t+1})\mid S_t=s). \] Essentially a tautology, describing what we mean by reward. A “one-step lookahead search”. Can be expressed concisely using matrices: \(\mathbf{v}=\mathcal{R}+\gamma\mathcal{P}\mathbf{v}\), where \(\mathcal{R}\) is the reward vector and \(\mathcal{P}\) is the transition matrix.
 This is a linear equation that can be solved directly: \[ \mathbf{v}=(I-\gamma\mathcal{P})^{-1}\mathcal{R}. \]
 Complexity is \(O(n^3)\) for \(n\) states, so not usually tractable.
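For small state spaces the direct solution is a few lines of NumPy. A minimal sketch, on a made-up 2-state MRP (the transition probabilities and rewards below are invented for illustration):

```python
import numpy as np

# Direct solution v = (I - gamma*P)^{-1} R for a tiny, invented 2-state MRP.
P = np.array([[0.9, 0.1],   # from state 0: stay w.p. 0.9, move to state 1 w.p. 0.1
              [0.0, 1.0]])  # state 1 is absorbing
R = np.array([-1.0, 0.0])   # expected immediate reward per state
gamma = 0.9

# Solving the linear system is preferable to forming the inverse explicitly.
v = np.linalg.solve(np.eye(2) - gamma * P, R)
print(v)  # v[1] = 0; v[0] = -1 / (1 - 0.9*0.9) ≈ -5.26
```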
Markov decision processes #

An MDP is a Markov reward process with decisions. We add a finite set of actions \(\mathcal{A}\).

\(\mathcal{P}\), the state transition probability matrix, now depends on which action we take: \[ \mathcal{P}^a_{ss'}=P(S_{t+1}=s'\mid S_t=s, A_t=a). \]

The reward function also depends on the action.

A policy \(\pi\) is a distribution over actions given states, \[ \pi(a\mid s)=P(A_t=a\mid S_t=s). \]
 A policy fully defines the behaviour of an agent.
 Policies depend on the current state only for an MDP (i.e. not on the history).
 Maximise the reward from now onwards.

For a given MDP and a policy \(\pi\), the state sequence \(S_1,S_2,\ldots\) is a Markov process.

The state and reward sequence is a Markov reward process.

The state-value function \(v_\pi(s)\) of an MDP is the expected return starting from state \(s\) and then following policy \(\pi\): \[ v_\pi(s)=\mathbb{E}_\pi(G_t\mid S_t=s). \]

The action-value function \(q_\pi(s,a)\) is the expected return starting from state \(s\), taking action \(a\), and then following policy \(\pi\): \[ q_\pi(s,a)=\mathbb{E}_\pi(G_t\mid S_t=s, A_t=a). \]

The state-value function and action-value function can again be decomposed into immediate reward plus discounted value of the successor state, with all the conditioning on policy and action:
\begin{align*} v_\pi(s)&=\mathbb{E}_\pi(R_{t+1}+\gamma v_\pi(S_{t+1})\mid S_t=s)\\ q_\pi(s,a)&=\mathbb{E}_\pi(R_{t+1}+\gamma q_\pi(S_{t+1},A_{t+1})\mid S_t=s, A_t=a). \end{align*}

There is a similar analogy here with the one-step lookahead search: \[ v_\pi(s)=\sum_{a\in\mathcal{A}}\pi(a\mid s)\Bigl(\mathcal{R}_s^a+\gamma\sum_{s'\in S}\mathcal{P}_{ss'}^av_\pi(s')\Bigr) \] and we can write the same thing for action-values. (Bellman expectation equation).
The optimal value function #
 The optimal state-value function \(v_*(s)=\max_\pi v_\pi(s)\).
 The optimal action-value function \(q_*(s,a)=\max_\pi q_\pi(s,a)\). If you know this, you’re done. An MDP is solved when we know the optimal value function.
 We define a partial ordering over policies by \(\pi\geq \pi'\) if \(v_\pi(s)\geq v_{\pi'}(s)\) for all \(s\).
 Theorem. For any Markov decision process, there exists an optimal policy \(\pi_*\) that is better than or equal to all other policies. All optimal policies achieve the optimal value function \(v_{\pi_*}(s)=v_*(s)\) and the optimal actionvalue function \(q_{\pi_*}(s,a)=q_*(s,a)\).
 We can find an optimal policy by maximising over \(q_*(s,a)\): \[ \pi_*(a\mid s)=1, \quad\mbox{if } a=\mathrm{argmax}_{a\in\mathcal{A}}q_*(s,a) \] and \(0\) otherwise.
 The optimal value functions are recursively related by the Bellman optimality equations: \[ v_*(s)=\max_a q_*(s,a). \]
 Similarly for \(q_*\): \[ q_*(s,a)=\mathcal{R}_s^a+\gamma\sum_{s'\in\mathcal{S}}\mathcal{P}^a_{ss'}v_*(s'). \]
 And combining these: \[ v_*(s)=\max_a \Bigl(\mathcal{R}_s^a +\gamma\sum_{s'\in\mathcal{S}}\mathcal{P}^a_{ss'} v_*(s')\Bigr). \]
 These equations are nonlinear, with no closed-form solution. Usually we use iterative solution methods (value iteration, policy iteration, Q-learning, Sarsa).
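Once \(q_*\) is known, reading off a deterministic optimal policy is just an argmax per state, as in the equation above. A sketch with an invented \(q_*\) table:

```python
import numpy as np

# Extracting a deterministic optimal policy from a known q_* table
# (these q-values are invented for illustration).
q_star = np.array([[0.0, 1.0],    # q_*(s=0, ·) over the two actions
                   [2.0, 0.5]])   # q_*(s=1, ·)
pi_star = np.argmax(q_star, axis=1)  # pi_*(s) = argmax_a q_*(s, a)
print(pi_star)  # best action index for each state
```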
Planning by Dynamic Programming #
Introduction #
 Dynamic programming: cf. linear programming (a “program” in the mathematical sense). “Dynamic” means there is some sequential or temporal component to the problem; we solve it in steps.
 A method for solving complex problems by breaking them down into subproblems and then combining the solutions to the subproblems into a solution for the full problem.
 Needs two properties in the problem:
 Optimal substructure (optimal solutions from subproblems can be built into an optimal solution for the full problem: principle of optimality, see shortest path).
 Overlapping subproblems (subproblems occur again and again, so solutions can be cached and reused).
 MDPs satisfy both of these properties. The Bellman equation gives us the recursive decomposition. Value function gives us the cache and the ability to reuse solutions.
 Dynamic programming is used for planning in an MDP, i.e. assuming full knowledge of the MDP.
 Two cases:
 Prediction: input MDP and policy, output value function.
 Control: input MDP, output optimal value function and policy.
 Used in many other types of algorithms (shortest path, Viterbi, lattice models, string algorithms).
Policy evaluation #
 You are given an MDP and a policy: how much reward will we get? (Evaluate the policy.)
 Iterative application of the Bellman expectation equation.
 \(v_1\to v_2\to \cdots \to v_\pi\). Start with an arbitrary initial value function, then use synchronous backups:
 at each iteration \(k+1\),
 for all states \(s\)
 update \(v_{k+1}(s)\) from \(v_k(s')\),
 where \(s'\) is a successor state of \(s\).
 Convergence will be proven later (contraction mapping theorem, a.k.a. the Banach fixed-point theorem).
 Small gridworld example. Very cool.
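The synchronous backup above can be sketched in a few lines. This is a minimal illustration on a made-up 2-state, 2-action MDP under the uniform random policy (all transition probabilities and rewards are invented):

```python
import numpy as np

# Synchronous iterative policy evaluation on an invented 2-state, 2-action MDP.
P = np.array([[[0.5, 0.5], [0.0, 1.0]],   # P[a, s, s'] for action 0
              [[1.0, 0.0], [0.0, 1.0]]])  # ...and action 1; state 1 is absorbing
R = np.array([[-1.0, 0.0],                # R[a, s]: expected immediate reward
              [-2.0, 0.0]])
pi = np.full((2, 2), 0.5)                 # pi[s, a]: uniform random policy
gamma = 0.9

v = np.zeros(2)                           # arbitrary initial value function
for _ in range(1000):
    # v_{k+1}(s) = sum_a pi(a|s) * (R^a_s + gamma * sum_{s'} P^a_{ss'} v_k(s'))
    v_next = np.einsum('sa,as->s', pi, R + gamma * P @ v)
    if np.max(np.abs(v_next - v)) < 1e-10:
        break
    v = v_next
print(v)
```

Here `P @ v` computes the expected next-state value for every (action, state) pair, and the `einsum` averages over actions under the policy.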
Policy iteration #
 Start with a given policy \(\pi\).
 Evaluate the policy using the above policy evaluation method.
 Improve the policy by acting greedily with respect to this value function.
 In the small gridworld example it only took one iteration; in general it can take many, but this process always converges to the optimal policy \(\pi_*\).
 Relationship to expectation maximisation?
 Consider a deterministic policy \(a=\pi(s)\). We can improve the policy by acting greedily
\[ \pi’(s) = \mathrm{argmax}_{a\in\mathcal{A}} q_\pi(s, a). \]
 This does indeed improve the value from any state \(s\) over one step, since
\[ q_\pi(s, \pi’(s)) = \max_{a\in\mathcal{A}}q_\pi(s, a)\geq q_\pi(s, \pi(s))=v_\pi(s). \]
 Therefore it improves the value function, \(v_{\pi'}(s)\geq v_\pi(s)\) (telescoping argument).
 If improvements stop, i.e. we achieve equality, then the Bellman optimality equation has been satisfied by definition, and hence we have the optimal policy.
 Do we actually need to converge to the optimal policy? Maybe stopping conditions.
 Just do it once and then continue: value iteration.
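The evaluate/improve loop above can be sketched directly, using an exact linear solve for the evaluation step. The 2-state, 2-action MDP below is invented for illustration:

```python
import numpy as np

# Policy iteration sketch: exact policy evaluation alternated with greedy
# improvement, on an invented 2-state, 2-action MDP.
P = np.array([[[0.5, 0.5], [0.0, 1.0]],   # P[a, s, s']
              [[1.0, 0.0], [0.0, 1.0]]])
R = np.array([[-1.0, 0.0],                # R[a, s]
              [-2.0, 0.0]])
gamma = 0.9
n_states = 2

policy = np.zeros(n_states, dtype=int)    # deterministic: one action per state
while True:
    # 1. Policy evaluation: solve v = R_pi + gamma * P_pi v exactly.
    P_pi = P[policy, np.arange(n_states)]  # P_pi[s, s'] under the current policy
    R_pi = R[policy, np.arange(n_states)]
    v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
    # 2. Policy improvement: act greedily with respect to q_pi(s, a).
    q = R + gamma * P @ v                  # q[a, s]
    new_policy = np.argmax(q, axis=0)
    if np.array_equal(new_policy, policy): # no change: Bellman optimality holds
        break
    policy = new_policy
print(policy, v)
```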
Value iteration #
 Any optimal policy can be subdivided into two components: an optimal first action \(A^*\) followed by an optimal policy from the successor state \(S’\) (principle of optimality).
 Formulate into a theorem: a policy \(\pi(a\mid s)\) achieves the optimal value from state \(s\) if and only if for any state \(s’\) reachable from \(s\), \(\pi\) achieves the optimal value from state \(s’\) onwards.
 Use this to build a value iteration algorithm (“backwards induction”). If we know the solution to the subproblems \(v_*(s')\), then the solution \(v_*(s)\) can be found by one-step lookahead.
 The idea of value iteration is to apply these updates iteratively.
 Intuition: start with final rewards and work backwards. Even works with “loopy, stochastic” MDPs.
 \(v_1\to v_2\to \cdots \to v_*\). Start with an initial arbitrary value function, then using iterative application of Bellman optimality:
 at each iteration \(k+1\)
 for all states \(s\)
 update \(v_{k+1}(s)\) from \(v_k(s')\)
 Convergence to \(v_*\) will be proven later (Banach)
 Unlike policy iteration, there is no explicit policy: intermediate value functions may not be the value function of any real policy.
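A minimal sketch of this loop, applying the Bellman optimality backup repeatedly on the same kind of invented 2-state, 2-action MDP as before:

```python
import numpy as np

# Value iteration: repeated Bellman optimality backups (MDP numbers invented).
P = np.array([[[0.5, 0.5], [0.0, 1.0]],   # P[a, s, s']
              [[1.0, 0.0], [0.0, 1.0]]])
R = np.array([[-1.0, 0.0],                # R[a, s]
              [-2.0, 0.0]])
gamma = 0.9

v = np.zeros(2)                           # arbitrary initial value function
while True:
    # v_{k+1}(s) = max_a ( R^a_s + gamma * sum_{s'} P^a_{ss'} v_k(s') )
    v_next = np.max(R + gamma * P @ v, axis=0)
    if np.max(np.abs(v_next - v)) < 1e-12:
        break
    v = v_next
# A greedy policy can then be read off with one more one-step lookahead.
greedy = np.argmax(R + gamma * P @ v, axis=0)
print(v, greedy)
```

Note that, as the bullet above says, the intermediate `v` arrays need not correspond to any actual policy; only the limit does.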
| Problem | Bellman equation | Algorithm |
| --- | --- | --- |
| Prediction | Bellman expectation equation | Iterative policy evaluation |
| Control | Bellman expectation equation + greedy policy improvement | Policy iteration |
| Control | Bellman optimality equation | Value iteration |
 Toy example of a larger Gridworld.
 These ideas are actually quite intuitive; the notation makes it all a little cumbersome, but it’s quite natural once you think it through. Silver’s explanation of the above table is a very coherent summary of this complicated-feeling lecture.
 Can also do similar things with the action-value function \(q\); these give rise to higher-complexity algorithms, but they are definitely useful in the general RL problem.