## Alex's Predictive Knowledge Primer

This is a really tiny bare-bones explanation of predictive knowledge I wrote a while ago to share with someone. WIP.

#### What is Predictive Knowledge?

A key challenge for machine intelligence is representation: the performance of a system is tied to its ability to represent and perceive its environment. Predictive knowledge is a proposal for constructing representations of the environment as large collections of predictions. An agent continually anticipates its sensations by making many predictions about the dynamics of its environment with respect to its behaviour (Modayil, 2014). These predictions about expected sensation can then be used to inform an agent's internal representation of its environment (Littman and Sutton, 2002). Other proposals describe inter-relations of predictions, similar to Temporal Difference Networks (Tanner and Sutton, 2005; Makino and Takagi, 2008), enabling abstract, conceptual representations by making predictions of predictions (Schapire and Rivest, 1988).

Predictive knowledge systems have been shown to be a scalable way to update and verify an agent's representation of the world, with examples of real-world robotic prediction tasks making thousands or tens of thousands of predictions in real-time on consumer-grade devices (Sutton, 2011).

We specifically consider predictive knowledge methods that 1) are able to expand their representations by proposing new predictions---possibly increasing required resources, 2) are able to self-verify their predictions through interaction with their environment, and 3) are able to continually learn their predictions on-line.

#### General Value Functions

An integral part of predictive knowledge is the formulation of the predictor. One method of making predictions is the General Value Function (GVF) (White, 2015). A GVF estimates the expected discounted return of a signal $c$, where the return is defined as $G_t = \sum_{k=0}^{\infty} \left( \prod_{j=1}^{k} \gamma_{t+j} \right) c_{t+k+1}$. Value is estimated with respect to a specific policy $\pi$, discount function $\gamma$, and cumulant $c$: $v(s; \pi, \gamma, c) = \mathop{\mathbb{E}}_\pi[G_t|S_t = s]$, where $s$ is the state of the environment.

A policy $\pi$ specifies the behaviour over which the prediction is being made as the probability of taking an action $a_t$ in state $s_t$: $\pi(a_t|s_t)$. The cumulant $c$ describes the signal of interest---the value which the GVF is accumulating over time. The cumulant $c$ may be a function of the observations an agent makes, or of internal signals produced by the system (i.e. predictions made by the system, or other learned values). The discounting function $\gamma(s_t,a_t,s_{t+1})$ describes how future cumulants are discounted.

The simplest discounting functions use a constant value where $0 \leq \gamma \leq 1$. GVFs where $\gamma = 0$ are myopic: the return at any time step $t$ is simply the next observed cumulant $c_{t+1}$. Where $\gamma = 1$, the return is undiscounted: all signals contribute equally to the return $G_t$. By varying the discount between $0$ and $1$, we describe how much future values of $c$ should impact the current return.
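To make the effect of a constant discount concrete, here is a small sketch; the cumulant values are invented for illustration:

```python
# Sketch: the return G_t under a constant discount gamma.
# The cumulant sequence here is made-up illustration data.

def discounted_return(cumulants, gamma):
    """G_t = c_{t+1} + gamma*c_{t+2} + gamma^2*c_{t+3} + ..."""
    g = 0.0
    for c in reversed(cumulants):  # fold from the end of the sequence
        g = c + gamma * g
    return g

cumulants = [1.0, 1.0, 1.0]

print(discounted_return(cumulants, 0.0))  # myopic: only the next cumulant counts -> 1.0
print(discounted_return(cumulants, 1.0))  # undiscounted: all cumulants sum -> 3.0
```

With $\gamma = 0.5$ the same sequence yields $1 + 0.5 + 0.25 = 1.75$, sitting between the two extremes.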

Another variant of discount functions are termination functions: functions where $0 < \gamma \leq 1$ until some event occurs, at which point $\gamma = 0$. For example:

$$\gamma_t = \begin{cases} 0 & \text{if the bump sensor is depressed} \\ 1 & \text{otherwise} \end{cases}$$

In this example, $\gamma = 1$ until the robot runs into something, causing its bump sensor to depress; at that point, $\gamma = 0$. This has the effect of acting as a timer: we accumulate the cumulant $c$ until some event occurs, at which point the accumulation is terminated.
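A termination-style discount can be sketched as follows; the `bump` and `speed` observation keys are invented for illustration:

```python
# Sketch of a termination-style discount: gamma stays 1 until a (hypothetical)
# bump sensor fires, then drops to 0, ending the accumulation of the cumulant.

def bump_discount(obs):
    """Return gamma for an observation `obs` (a dict of sensor readings)."""
    return 0.0 if obs["bump"] else 1.0

def return_until_bump(observations):
    """Accumulate the `speed` cumulant until the bump terminates the return."""
    g, keep = 0.0, 1.0
    for obs in observations:
        g += keep * obs["speed"]    # the cumulant on this step still counts
        keep *= bump_discount(obs)  # gamma = 0 zeroes out everything after
    return g

trajectory = [
    {"speed": 0.5, "bump": False},
    {"speed": 0.5, "bump": False},
    {"speed": 0.0, "bump": True},   # robot hits something; gamma -> 0
    {"speed": 9.0, "bump": False},  # ignored: accumulation already terminated
]
print(return_until_bump(trajectory))  # 1.0
```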

The parameters $c$, $\pi$, and $\gamma$ are the question parameters: they specify what a GVF is about. In addition, any learning-algorithm-specific parameters are considered answer parameters: they describe how a learning method learns the GVF. For example, in Temporal Difference (TD) learning, the step size $\alpha$ and eligibility-trace decay $\lambda$ are the answer parameters.
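As a rough sketch, a GVF's specification can be grouped along these lines; every name and value below is invented for illustration:

```python
# Hypothetical sketch: a GVF as question parameters (what to predict)
# plus answer parameters (how to learn it). All values are illustrative.

gvf = {
    "question": {
        "policy":   lambda s, a: 1.0,                         # pi(a|s): behaviour of interest
        "cumulant": lambda obs: obs["speed"],                 # c: signal being accumulated
        "discount": lambda obs: 0.0 if obs["bump"] else 1.0,  # gamma: termination function
    },
    "answer": {
        "alpha": 0.1,   # TD step size
        "lambda": 0.9,  # eligibility-trace decay
    },
}

print(gvf["question"]["cumulant"]({"speed": 2.5, "bump": False}))  # 2.5
```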

#### Learning General Value Functions: Temporal-difference Learning

GVFs can be learnt online and incrementally through methods such as Temporal-difference (TD) learning (Sutton, 1988). TD methods are of interest as they are able to update their estimates based on previous estimates: they do not need to wait to observe a final outcome. For instance, if we formulated a GVF estimating the expected position of a robot arm's gripper over 10 seconds, with TD learning we would not need to wait 10 seconds and observe the true return $G$ before updating our estimate.

Using TD learning we update a value function $v$ which estimates the true value of a given state $v_*(s) := \mathop{\mathbb{E}} \lbrace G_t | S_t = s \rbrace$. When considering traditional Reinforcement Learning problems, value could be the estimated accumulation of some reward $r$ associated with a goal-oriented behaviour; however, value is not limited to reward: it can be any other signal we wish to estimate, including the cumulants of GVFs.

In this primer we describe TD learning with linear function approximation where $\phi(s)$ is a binary feature vector which describes the state $s$. The estimate of a state is then the linear combination of the feature vector $\phi$ and a set of learned weights $w$: $V(s) = w^\top \phi(s)$.
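The linear estimate itself is just a dot product; a minimal sketch with made-up weights and features:

```python
# Sketch of linear value estimation with binary features.
# The weights and feature vector are made-up illustration values.

def value(w, phi):
    """V(s) = w^T phi(s): dot product of weights and binary features."""
    return sum(wi * fi for wi, fi in zip(w, phi))

w = [0.2, -0.5, 1.0, 0.0]  # learned weights
phi_s = [1, 0, 1, 0]       # binary features active in state s

print(value(w, phi_s))  # 0.2 + 1.0 = 1.2
```

Only the weights of the active (nonzero) features contribute to the estimate, which is what makes the later TD update cheap.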

The TD error is:

$$\delta_t = c_{t+1} + \gamma_t w^\top \phi(s_{t+1}) - w^\top \phi(s_t)$$

We do not use the true value $v_*$ in the target of our TD error. Instead, the target value is $c_{t+1} + \gamma_t w^\top \phi(s_{t+1})$---a biased estimate of the true return that uses our current learned estimate $V$. We bootstrap our update by using the learned value of the following state, $V(s_{t+1})$, to construct the target value. As a result, when performing TD learning, the target value itself changes as learning progresses.

Using the TD error $\delta$, we update the weights by taking a step of size $\alpha > 0$ to reduce the error for the presently active features $\phi(s)$.

We can assign credit to previously visited states for currently observed cumulants by using eligibility traces. Eligibility traces $z$ maintain a decaying trace of previously active features. At each time step, the eligibility traces are decayed by $0 \leq \lambda \leq 1$ and the discounting function $\gamma$, and incremented by the features of the currently visited state: $z \gets \gamma \lambda z + \phi(s)$. Where $\lambda = 0$, the weight update is equivalent to $\alpha \delta \phi(s)$, and where $\lambda = 1$ the TD update is equivalent to a Monte Carlo update.

With eligibility traces the weights are updated by $w \gets w + \alpha \delta z$.
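Putting the pieces together, here is a minimal sketch of one linear TD($\lambda$) update; the feature vectors, cumulant, and discount are made-up illustration data, not values from the primer:

```python
# Hedged sketch of one linear TD(lambda) update for a GVF.
# All inputs are illustrative; this is not a production implementation.

def td_lambda_step(w, z, phi, phi_next, c_next, gamma_next, alpha, lam):
    """Apply one TD(lambda) update; return the new weights and trace."""
    v = sum(wi * f for wi, f in zip(w, phi))            # V(s)  = w^T phi(s)
    v_next = sum(wi * f for wi, f in zip(w, phi_next))  # V(s') = w^T phi(s')
    delta = c_next + gamma_next * v_next - v            # TD error
    z = [gamma_next * lam * zi + f for zi, f in zip(z, phi)]  # decay trace, add phi(s)
    w = [wi + alpha * delta * zi for wi, zi in zip(w, z)]     # w <- w + alpha*delta*z
    return w, z

# One step: a cumulant of 1.0 arrives while all estimates start at zero,
# so the weight of the active feature moves toward the observed signal.
w, z = td_lambda_step([0.0, 0.0], [0.0, 0.0],
                      phi=[1, 0], phi_next=[0, 1],
                      c_next=1.0, gamma_next=1.0, alpha=0.1, lam=0.9)
print(w)  # -> [0.1, 0.0]
```

Each step costs only a few vector operations, which is what makes running many GVFs in real time feasible.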

TD methods can be applied online, in real-time with little computational cost. TD's bootstrapped update makes it effective in lifelong continual learning systems where the true return $G_t$ may never be observed, or where the dynamics of the environment are too complex to model.