Learning a problem representation while exploring it
Learning algorithms generally need significant prior information on the task they are attempting to solve. This requirement limits their flexibility and forces engineers to provide appropriate priors at design time. A question then arises naturally: how can we reduce the amount of prior task information needed by the algorithm? Being able to answer this question could spark the development of algorithms with higher generalization potential, all the while reducing preliminary engineering efforts. If we want to discard any a priori knowledge about the task, we need the learning algorithm to efficiently explore and represent the space of possible outcomes it can achieve.
This blog post investigates the question and discusses the approach presented in the paper Unsupervised Learning and Exploration of Reachable Outcome Space published at ICRA 2020.
Recently, Reinforcement Learning (RL) algorithms sparked a lot of excitement thanks to their ability to solve very complex problems, e.g. AlphaGo beating humans at the game of Go. RL is a general problemsolving class of algorithms that learn behaviors by trialanderror to optimize a reward function. The reward function is a fundamental part of these algorithms as it is the mean through which the designer can communicate to the algorithm the objective of the task. This means that a lot of effort has to be put into the design of such a function, and if not carefully tuned it can lead the search astray and reduce performances.
At the same time, for many interesting problems, it is not possible to easily define a good reward function. This might be due to many reasons: the goal might change over time, or it might be difficult to extract the information needed to calculate a reward, or the robot might get damaged, rendering the old reward function outdated. Moreover, Reinforcement Learning algorithms usually are gradientbased. This means that the optimization process follows the reward gradient In some situations, such a gradient can be misleading or not easy to calculate, making these methods struggle and require a lot of finetuning.
Divergent search algorithms try to address the issues related to the reward gradient by using evolutionary approaches: the search is performed by imitating the mechanisms of natural evolution on a population of behaviors (called policies in RL).
Classic evolutionary algorithms drive the search through a fitness metric, so only the best performing policies are selected at each step. On the other hand, divergent search algorithms drive the search through different kinds of metrics: novelty, curiosity, or surprise.
All these metrics have one thing in common: they push the search performed by the algorithm to diverge, hence the name, by looking for novel or unexpected behaviors and observations. If you're interested in how evolutionary algorithms work, check out Introduction to evolutionary algorithms.
Among all the divergent search methods, we will focus in particular on Novelty Search (NS), an algorithm introduced by Lehman and Stanley in the paper Abandoning Objectives: Evolution through the Search for Novelty Alone. Its aim is to find a set of policies whose outcomes are as diverse as possible, where the diversity is measured by representing each outcome into a lowdimensional handdesigned space, called outcome space of behavior space. Novelty search is a quite powerful algorithm that has been recently shown to perform search tending to uniformity in this low dimensional search space, without the need for any external reward.
Before delving into the details of Novelty search, let's first define some concepts that will help us better understand how the algorithm works.
Policy¶
We mentioned policies before, but what are they exactly?
A policy is a function $$\pi$$ that tells us how to act in a given situation. This function can be implemented in different ways and is parametrized by a set of parameters $$\theta \in \Theta$$. For example, if the function is represented through a neural network, the parameters could be both the structure of the network and its weights.
It's on these parameters that a learning algorithm operates to learn to solve a task.
Environment¶
Learning to solve a task requires the evaluation of the policies. But where are they evaluated? Here the concept of environment comes into play: an environment is the world with which the policies interact when being evaluated. It can be either the real world, or a part of it, or a simulation. Note that in this formalization, the agent itself is part of the environment.
At every given time step, the environment finds itself in a state $$s_t$$, which represents all the properties relevant to describe it (ex: position of the agent, speed of its effectors, positions of objects,...). The state cannot necessarily be observed directly by the agent, but its sensors give us an observation $$o_t$$ that inform us about this state. The kind of observations we obtain depends on the kind of sensor we use, e.g. by using a camera we get an image as observation. When the state is not directly available, these observations are used by the policy to decide what to do when acting on the environment instead.
Now that we introduced these concepts we can better understand how Novelty search works.
Novelty Search
As we said previously, Novelty Search is a divergentsearch populationbased algorithm. This means that at every step of the algorithm a population of policies, parametrized by $$\theta_i \in \Theta$$, is evaluated in the environment. During this evaluation, each policy traverses a trajectory of states $$\tau = [s_0, \dots, s_T]$$ that are observed to generate a trajectory of corresponding observations $$[o_0, \dots, o_T]$$.
Describing the behavior of a policy¶
The observations generated during the evaluation of a policy are used by Novelty Search to extract a description of the behavior it generates. This description is then used to calculate how novel the policy is. This is done through the use of an observer function $$O_B: \mathcal{O}^T \rightarrow \mathcal{B}$$ that maps the trajectory of observations to a low dimensional set of features $$b_i \in
\mathcal{B}$$. The tensor $$b_i$$ is called behavior descriptor or outcome descriptor and, as the name implies, describes the behavior of the policy in the environment. At the same time, the space $$\mathcal{B}$$ in which $$b_i$$ belongs is called behavior space or outcome space. In the rest of the post, we will use these two names interchangeably. The behavior descriptor and the behavior space are the two concepts at the core of the novelty search algorithm.
The whole process of evaluating a policy, collecting the trajectory of observations, and representing its behavior in the outcome space through the observer function can be summarized by introducing the function $$\phi(\theta_i) = b_i$$. This function is called the behavior function and maps each policy $$\theta_i$$ to its outcome description $$b_i$$.
Let's show this relationship with an example. In the figure below, we have on the right the standard HardMaze environment used in many Novelty Search publications. The task in this environment is to find a policy that allows a twowheeled robot to navigate the maze and reach its top left corner. On the left of the figure, we have a representation of the higherdimensional parameter space, in which each red point corresponds to a different policy. When a policy is evaluated in the environment, it goes through a trajectory of points whose corresponding trajectory of observations consists of the list of $$(x,y)$$ position of the robot at each timestep. The behavior descriptor for each policy is here chosen as the final $$(x,y)$$ position of the robot in the maze.
The behavior descriptors for each policy on the left are represented as blue points in the maze. The behavior function $$\phi(\theta_i)$$ is thus the green arrow connecting each policy in the parameter space to its corresponding behavior descriptor representation in the outcome space.
Calculating the novelty of each policy¶
Once the behavior of each policy has been represented in the outcome space, it is finally possible to calculate its novelty.
The novelty is evaluated as the average distance between the behavior descriptor of the considered policy and the ones of the $$k$$ closest other policies in the outcome space:
$$n(\theta_i) = \frac{1}{k}\sum^{k}_{j=1}dist(\phi(\theta_i), \phi(\theta_j))$$
The most novel policies are thus the ones that end up being the furthest apart from their neighbors, and are then chosen to form the next generation of the population.
At the same time, some policies from the current generation are also saved into a collection called repertoire. The function of this collection is twofold:
 Output of the algorithm. In fact, at the end of the overall search , the repertoire will be outputted as a set of solutions;
 Track the already explored parts of the outcome space. The calculation of the novelty of a policy in the population also takes into account behavior descriptors of the policies in the repertoire. This means that a policy whose behavior descriptor is in an area of the space already explored in the past will be considered less novel than one in a less explored area.
The whole process is repeated until a desired number of evaluations has been made, as shown in the figure below. For a more indepth discussion about Novelty Search, you can check the original Novelty Search paper.
We just showed that the behavior space is a fundamental part of Novelty Search, and of populationbased divergent algorithms in general, making its design fundamental.
The design process is usually done by the engineer setting up the algorithm. This requires prior knowledge on the system and on the problem to be solved. Moreover, the designer can introduce some unexpected bias in situations in which is not clear which features would benefit the search. This could hinder performances and drive the search towards local minima difficult to overcome.
A way to prevent this is to learn the behavior space on the flight, directly from the raw data generated by the interaction between the policies and the environment. Let's see how we could do this.
Learning the description
Raw data generated by the sensors are usually very high dimensional: e.g. a $$64\times 64$$ pixels grayscale image has 4096 dimensions. This can make working directly on them difficult, for two main reasons:
 Distance metrics in highdimensional spaces do not work properly. It has been shown by Aggarwal and colleagues that the higher the dimensionality of a space, the less contrast there is between distances calculated on points in such space;
 The pixelwise distance is not enough to distinguish between different situations in RGBimages. As can be seen from the figure below, the pixelwise distance between the second and third images with respect to the first one is the same, notwithstanding the distance in space of the circles being different.
At the same time, using raw data requires the least amount of prior information on the environment while having high information content. For these reasons a dimensionality reduction algorithm is needed to move to a lowdimensional space where the distance between observations can be safely calculated.
There are many ways to learn representations from raw data. Among them, autoencoders have proven very useful in learning low dimensional representations from high dimensional data like RGBimages.
An autoencoder is a type of neural network composed of two parts:
 Encoder $$\mathcal{E}$$: It takes the high dimensional input and generates a low dimensional representation $$f(x) \in \mathcal{F}$$;
 Decoder $$\mathcal{D}$$: takes the low dimensional representation generated by the encoder and tries to reconstruct the high dimensional input in the most accurate way possible.
The presence of the low dimensional bottleneck forces the network to extract the most important features that would allow the reconstruction of its original input. All of this is done in an unsupervised way and the network is trained by minimizing the reconstruction error between its input and its output: $$L(x) = x  \mathcal{D}(\mathcal{E}(x))^2$$, where $$x$$ is the input of the network.
We propose to use an autoencoder to learn the low dimensional outcome descriptor of a policy from the high dimensional sensors observations. This could be done by using the encoder $$\mathcal{E}$$ as observation function to extract the features from the image and its low dimensional feature space $$\mathcal{F}$$ as outcome space.
The autoencoder is then trained online on the observations generated at each iteration by the new policies' evaluations.
The new resulting algorithm, called Task Agnostic eXploration of Outcome space through Novelty and Surprise (TAXONS), is shown in the picture below.
The algorithm consists of two processes running in parallel: Search process & Training.
Search process: highlighted in gray in figure (2), generates and evaluates the policies, storing the one with higher diversity in the repertoire. It does so by firstly generating a population of policies, evaluating them in the environment, and then selecting the best ones to add to the repertoire. During the evaluation, the last observation $$o_T$$ for each of the policies is collected and fed to the AE that will be used to extract the features for the selection step. This last step is fundamental to the correct operation of TAXONS and is described in detail in the Policy selection section;
Training: highlighted by the red arrow in figure (2), is the training of the AE on the observations collected during the evaluation of the policies. This training process happens in parallel with the search process and is described in more detail in the Search and Training section.
TAXONS can perform the search for policies with minimal interventions from the designer other than the choice of which observation is richer in information, requiring no task or reward definition. In our case, we consider that the last observation of the trajectory is informative enough to characterize the behavior of the policy in the environment. Consequently, only this last observation will be used to build the behavior space representing the outcome of the policy, for this reason called outcome space.
Let's show this better through an example. In the figure below we can see a trajectory (black dotted line) of the robot (red circle) in the HardMaze with three observations collected during the trajectory. The observation from which the autoencoder will extract the behavior descriptor is the one collected towards the end of the trajectory, in this case, the one on the top right.
Policy selection
Novelty Search uses only the distance in the behavior space as a measure for novelty. In our approach, we also consider another metric thus ensuring both their novelty and the representativity of the behavior space.
This means we have two metrics:
 Novelty: as in novelty search is calculated as the average distance between the policies in the learned behavior space;

Surprise: is defined as the reconstruction error of the autoencoder.
It is expressed as: $$s(\theta_i) = \big\bigo_T^{(\theta_i)}  \mathcal{D}\big(\mathcal{E}(o_T^{(\theta_i)})\big)\big\big^2$$.
The reconstruction error in fact tends to be large when the autoencoder processes observations which have not been frequently encountered yet. By maximizing this metric during the search, we ensure that new policies tend to explore novel parts of the state (observation) space. This ensures that the observations are representative of the states the system can reach.
The two metrics are combined by choosing randomly between them at every new iteration of the algorithm with a probability of $$0.5$$.
Combining these two metrics drives the search towards learning an outcome space representative of the reachable states of the system and towards policies that are diverse in this space.
Experiments and Results
Environments¶
To study how TAXONS performs we tested it in four different simulated environments:
 Billiard: a 2jointed arm pushing a ball in a 2D room, shown in box (a) of the below figure;
 Maze: a twowheeled robot navigating a 2D maze, shown in box (b) of the below figure;
 Ant: a fourlegged robot ant moving on the floor, represented from a topview in box (c);
 Kuka: a 7jointed Kuka robotic arm, simulated in Pybullet, that has to learn how to push a box on a table. The setup is shown in box (d).
Each of these environments is observed through RGBcamera sensors that return $$64 \times 64$$ resolution observations. In the first three, we obtain a top view of the whole environment, while in the Kuka environment the table is observed from the side.
We compared TAXONS against five different baselines:
 NS: a vanilla novelty search algorithm with handcrafted features tailored using a priori knowledge about the agent and environment;

PNS: a policy search algorithm, similar to NS but where the outcome space directly corresponds to the parameter space $$\Theta$$ of the policies.
The outcome descriptor characterizes the policy but not the final observation; 
RNS: a novelty search algorithm where the outcome description of each policy is randomly sampled in a $$10$$D space.
The outcome descriptor does not characterize the observation nor the policy;  RS: a random search in which all policies are randomly generated and randomly selected to be added to the repertoire;
 NT: a novelty search algorithm in which the outcome space corresponds to the feature space of an autoencoder whose weights are randomly generated at the beginning of the search and left unmodified during the whole search process. As with TAXONS, the AE is fed only with the last observation $$o_T$$ of the environment.
The vanilla version of TAXONS is also compared against two ablated versions:
 TAXON: in which only the novelty calculated in the learned feature space is used as selection metric;
 TAXOS: in which only the surprise metric is used for policy selection.
The algorithms are compared based on how well they cover the groundtruth outcome space of the system. By design, this groundtruth outcome space corresponds to the $$(x,y)$$ position of the ball for the Billiard environment, of the center of the robot for the Maze and Ant environment, and of the box on the table for the Kuka arm environment. We thus define the coverage as the percentage of this $$(x, y)$$ space reached by the final repertoire of policies. This is done by dividing this space in a $$50 \times 50$$ grid and then calculating the ratio of the number of cells reached at least once over the total number of cells.
Note that the groundtruth $$(x, y)$$ space is unknown to the methods (except for NS) and is only used a posteriori for evaluation.
Results¶
The results displayed in the figure below show that TAXONS leads to good coverage of the groundtruth $$(x,y)$$ outcome space. Its performance is lower than the upperbound performance of NS, which has direct access to the groundtruth outcome space, but significantly higher than the other baselines, which use a high dimensional outcome space (PNS), a random outcome space (RNS), or no outcome space at all (RS).
This shows two things:
 performing NS in a lowdimensional outcome space capturing information about the final state of the system (through the last observation) is beneficial;
 TAXONS successfully builds such a space.
Indeed when the generation and selection of policies are purely random (RS) the coverage is very low. Similarly, when lowdimensional outcome descriptors are randomly assigned to the policies (RNS) the coverage is only slightly better than purely random (Ant), or as bad (Billiard, Maze, and Kuka). Finally, performing the NS directly in the highdimensional policy parameters space $$\Theta$$ (PNS), leads to a coverage that is similar to the RNS case. This suggests that performing the search in the highdimensional policy parameter space is equivalent to assigning random descriptors to the policy, meaning that the selection process has no information about the actual outcome of the policy. The performance of the two ablated versions (TAXON and TAXOS) is similar to the vanilla version of TAXONS, as they lay between the NS upperbound and the PNS, RNS, and RS baselines.
Nonetheless, their efficiency varies between experiments. This means that combining novelty and surprise renders TAXONS more robust to different environments than its two ablated versions, while still being able to perform almost as well as NS.
In the next figure, we show in column (a) the coverage of the repertoire of policies found by TAXONS in the groundtruth $$(x,y)$$ space. Highlighted in red are 3 policies for which the related final observation $$o_T$$ are shown. In column (b), a sample policy from the repertoire generated by TAXONS is represented. The trajectories followed by the policies are highlighted in red; for the Billiard, the trajectory of the point of the arm is also highlighted in blue. The final observation of the trajectory, with the respective reconstruction and descriptor generated by the autoencoder, is also shown.
Conclusion
In conclusion, in this post we discussed the issue of requiring too much prior information on the task at design time.
We introduced an approach to address the problem that can:
 learn on the fly a lowdimensional representation of the observations;
 use this lowdimensional representation as behavior descriptor to evaluate the novelty of a policy;
 generate a diverse set of policies in this learned outcome space.
If you want to know more about this method, see the paper: Unsupervised Learning and Exploration of Reachable Outcome Space