Cédric Colas
INRIA and Univ. de Bordeaux; Bordeaux (FR)
Sorbonne Université; Paris (FR)
INRIA; Bordeaux (FR) and ENSTA Paris Tech; Paris (FR)
Building autonomous machines that can explore open-ended environments, discover possible interactions and autonomously build repertoires of skills is a general objective of artificial intelligence. Developmental approaches argue that this can only be achieved by autonomous and intrinsically motivated learning agents that can generate, select and learn to solve their own problems. In recent years, we have seen a convergence of developmental approaches, and developmental robotics in particular, with deep reinforcement learning (RL) methods, forming the new domain of developmental machine learning. Within this new domain, we review here a set of methods where deep RL algorithms are trained to tackle the developmental robotics problem of the autonomous acquisition of open-ended repertoires of skills. Intrinsically motivated goal-conditioned RL algorithms train agents to learn to represent, generate and pursue their own goals. The self-generation of goals requires the learning of compact goal encodings as well as their associated goal-achievement functions, which results in new challenges compared to traditional RL algorithms designed to tackle pre-defined sets of goals using external reward signals. This paper proposes a typology of these methods at the intersection of deep RL and developmental approaches, surveys recent approaches and discusses future avenues.
Building autonomous machines that can explore large environments, discover interesting interactions and learn open-ended repertoires of skills is a long-standing goal in Artificial Intelligence.
What is most striking about humans, perhaps, is their ability to invent and pursue their own problems, using internal feedback to assess completion. We would like to build artificial agents able to demonstrate equivalent lifelong learning abilities.
Developmental approaches and developmental robotics
Developmental robotics takes inspiration from artificial intelligence, developmental psychology and neuroscience to model cognitive processes in natural and artificial systems.
Developmental robotics is a field oriented towards answering particular questions around sensorimotor, cognitive and social development
Developmental robotics directly aims to model children learning and, thus, takes inspiration from the mechanisms underlying autonomous behaviors in humans.
Reinforcement learning (RL)
Reinforcement learning, on the other hand, is the field interested in problems where agents learn to behave by experiencing the consequences of their actions in the form of rewards and costs.
Reinforcement learning is a field organized around a particular technical framework and set of methods
Intrinsic motivations (IMs) are a set of brain processes that motivate humans to explore for the mere purpose of experiencing novelty, surprise or learning progress.
Population-based optimization algorithms train non-parametric models on datasets of (policy, outcome) pairs.
The convergence of these two fields (intrinsic motivations and population-based optimization algorithms) forms a new domain that we propose to call developmental machine learning, or developmental artificial intelligence.
These convergences can mostly be categorized in two ways, depending on the type of intrinsic motivation (IM) being used:
Knowledge-based IMs compare the situations experienced by the agent to its current knowledge and expectations, and reward it for experiencing dissonance.
This family includes IMs rewarding prediction errors, novelty, surprise, negative surprise, learning progress or information gains.
It can also be used to facilitate the construction of world models
RL algorithms using knowledge-based IMs leverage ideas from developmental robotics to solve standard RL problems.
Competence-based IMs, on the other hand, reward agents for solving self-generated problems, i.e. for achieving self-generated goals.
Competence-based IMs organize the exploration of the world and, thus, might be used to facilitate learning in sparse reward settings or to train world models.
RL algorithms using competence-based IMs organize exploration around self-generated goals and can be seen as targeting a developmental robotics problem: the open-ended and self-supervised acquisition of repertoires of diverse skills.
Intrinsically Motivated Goal Exploration Processes (IMGEPs) form the family of algorithms that bake competence-based IMs into learning agents: such agents generate and pursue their own goals as a way to explore their environment, discover possible interactions and build repertoires of skills.
Recently, goal-conditioned RL agents were also endowed with the ability to generate and pursue their own goals and to learn to achieve them via self-generated rewards. We argue that this set of methods forms a sub-category of IMGEPs that we call goal-conditioned IMGEPs, or GC-IMGEPs. In contrast, one can refer to externally-motivated goal-conditioned RL agents as GC-EMGEPs.
pop-IMGEPs, GC-IMGEPs and GC-EMGEPs refer to population-based intrinsically motivated goal exploration processes, goal-conditioned IMGEPs and goal-conditioned externally-motivated goal exploration processes, respectively.
While IMGEPs (pop-IMGEPs and GC-IMGEPs) generate their own goals, GC-EMGEPs require externally-defined goals. This paper is interested in GC-IMGEPs, the intersection of goal-conditioned RL agents and intrinsically motivated processes, that is, the set of methods that train learning agents to generate and pursue their own goals with goal-conditioned RL algorithms.
This paper proposes a formalization and a review of the GC-IMGEP algorithms at the convergence of RL methods and developmental robotics objectives.
We define goals as the combination of a compact goal representation and a goal-achievement function to measure progress.
While traditional RL agents only need to learn to achieve goals, GC-IMGEP agents also need to learn to represent them, to generate them and to measure their own progress.
After learning, the resulting goal-conditioned policy and its associated goal space form a repertoire of skills: a repertoire of behaviors that the agent can represent and control. We believe that organizing past GC-RL methods, at the convergence of developmental robotics and RL, into a common classification oriented towards the resolution of a common problem will help structure future research.
We are interested in algorithms from the GC-IMGEP family as algorithmic tools to enable agents to acquire repertoires of skills in an open-ended and self-supervised setting.
We first formalize the notion of goals and the problem of the open-ended and self-supervised acquisition of skill repertoires, building on the formalization of the RL and multi-goal RL problems. After presenting a definition of the GC-IMGEP family, we organize the surveyed literature along three axes: 1) What are the different types of goal representations? (Section 4); 2) How can we learn goal representations? (Section 5); and 3) How can we prioritize goal selection? (Section 6). From this coherent picture of the literature, we identify properties of what humans call goals that were not addressed in the surveyed approaches. This serves as the basis for a discussion of potential future avenues for the design of new GC-IMGEP approaches.
2. Self-Supervised Acquisition of Skill Repertoires with Deep RL
This section gives a definition of our main objective: enabling agents to acquire repertoires of skills in an open-ended and self-supervised setting. We first present the traditional reinforcement learning (RL) problem, propose a formal generalized definition of goals in the context of RL, and define the multi-goal RL problem, which will serve as a basis to introduce our problem.
2.1 The Reinforcement Learning Problem
In a reinforcement learning (RL) problem, the agent learns to perform sequences of actions in an environment so as to maximize some notion of cumulative reward.
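In the standard episodic formulation, this objective is the expected discounted return (standard RL notation, with discount factor $\gamma \in [0, 1]$ and reward function $r$):

```latex
J(\pi) \;=\; \mathbb{E}_{\tau \sim \pi}\left[ \sum_{t=0}^{T} \gamma^{t} \, r(s_t, a_t) \right]
```

where $\tau = (s_0, a_0, s_1, a_1, \ldots)$ is a trajectory generated by following policy $\pi$ in the environment.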
2.2 Defining Goals for Reinforcement Learning
This section takes inspiration from the notion of goal in psychological research to inform the formalization of goals for reinforcement learning.
Goals in psychological research. Work on the origin of the notion of goal and its use in past psychological research proposes a general definition: a goal is a cognitive representation of a future object that the organism is committed to approach or avoid.
Generalized goals for reinforcement learning. RL algorithms seem to be a good fit to train such goal-conditioned agents. Indeed, RL algorithms train learning agents (organisms) to maximize (approach) a cumulative (future) reward (object). In RL, goals can be seen as a set of constraints on one or several consecutive states that the agent seeks to respect.
To handle these goals, RL agents must be able to 1) represent them compactly and 2) assess their achievement.
The goal-achievement function and the goal-conditioned policy both assign meaning to a goal. The former defines what it means to achieve the goal: it describes what the world looks like when it is achieved. The latter characterizes the process by which this goal can be achieved, what the agent needs to do to achieve it. In this search for the meaning of a goal, the goal embedding can be seen as the map: by following this map via the two functions above, the agent experiences the meaning of the goal.
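This two-part view of a goal, a compact embedding plus a goal-achievement function, can be sketched as follows (a minimal illustration; the `Goal` class, its field names and the 2-D reaching example are hypothetical, not taken from any specific method):

```python
from dataclasses import dataclass
from typing import Callable
import numpy as np

@dataclass
class Goal:
    """A goal pairs a compact embedding with a goal-achievement function."""
    embedding: np.ndarray  # compact goal representation z_g
    achievement: Callable[[np.ndarray, np.ndarray], bool]  # (state, z_g) -> achieved?

    def reward(self, state: np.ndarray) -> float:
        # Sparse, binary reward derived from the goal-achievement function.
        return 1.0 if self.achievement(state, self.embedding) else 0.0

# Example: "reach the target position within 0.05" as a goal-achievement function.
reach = Goal(
    embedding=np.array([0.5, 0.5]),
    achievement=lambda s, g: bool(np.linalg.norm(s - g) < 0.05),
)
```

The same embedding could equally condition a policy; here it only parameterizes the reward, which is the part a traditional RL setup usually hard-codes.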
2.3 The Multi-Goal Reinforcement Learning Problem.
The multi-goal RL problem can thus be seen as the particular case of the multi-task RL problem where MDPs (Markov Decision Processes) differ by their reward functions.
2.4 The Intrinsically Motivated Skills Acquisition Problem
In the intrinsically motivated skills acquisition problem, the agent is set in an open-ended environment without any pre-defined goal and needs to acquire a repertoire of skills.
Intrinsically motivated agents evolve in open-ended environments and learn to represent and form their own set of skills.
The question of how to evaluate intrinsically motivated agents is quite similar to the question of how to evaluate self-supervised learning systems such as Generative Adversarial Networks (GANs) or self-supervised language models.
Let us list some approaches to evaluate such models:
Measuring exploration: one can compute task-agnostic exploration proxies such as the entropy of the visited state distribution, or measures of state coverage
Measuring generalization: the experimenter can subjectively define a set of relevant target goals, prevent the agent from training on them, and measure performance on these held-out goals.
Measuring transfer learning: Here, we view the intrinsically motivated exploration of the environment as a pre-training phase to bootstrap learning in a subsequent downstream task
Opening the black box: investigating internal representations learned during intrinsically motivated exploration is often informative. One can investigate properties of the goal generation system (e.g. does it generate out-of-distribution goals?) or properties of the goal embeddings (e.g. are they disentangled?).
Measuring robustness: autonomous learning agents evolving in open-ended environments should be robust to a variety of properties that can be found in the real world.
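The task-agnostic exploration proxies from the first item above (entropy of the visited state distribution, state coverage) can be computed, for instance, by binning visited states over a fixed grid. A minimal sketch, assuming a bounded 2-D state space (the function name and grid parameters are illustrative):

```python
import numpy as np

def exploration_metrics(states, bins=10, low=0.0, high=1.0):
    """Return (entropy of the binned state distribution, fraction of bins visited)."""
    states = np.asarray(states)
    # Histogram the visited states over a fixed grid covering the state space.
    hist, _ = np.histogramdd(states, bins=bins, range=[(low, high)] * states.shape[1])
    p = hist.flatten() / hist.sum()
    p = p[p > 0]
    entropy = -np.sum(p * np.log(p))               # Shannon entropy (nats)
    coverage = np.count_nonzero(hist) / hist.size  # fraction of cells visited
    return entropy, coverage

# An agent spreading over the space scores higher than one stuck in a corner.
rng = np.random.default_rng(0)
h_wide, c_wide = exploration_metrics(rng.uniform(0.0, 1.0, size=(1000, 2)))
h_stuck, c_stuck = exploration_metrics(rng.uniform(0.0, 0.1, size=(1000, 2)))
```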
3. Intrinsically Motivated Goal Exploration Processes with Goal-Conditioned Policies
We build from traditional RL algorithms and goal-conditioned RL algorithms towards intrinsically motivated goal-conditioned RL algorithms (GC-IMGEPs).
3.1 Reinforcement Learning Algorithms for the RL Problem
The deep RL family (dRL) leverages deep neural networks as function approximators to represent policies, reward and value functions.
Imitation Learning (IL) leverages demonstrations, i.e. transitions collected by another entity
Evolutionary Computing (EC) is a group of population-based approaches where populations of policies are trained to maximize cumulative rewards using episodic samples
Model-based RL approaches can be used to 1) learn a model of the transition function T and 2) perform planning towards reward maximization in that model.
In this survey, we focus on dRL methods that represent a policy.
3.2 Goal-Conditioned RL Algorithms
Goal-conditioned agents see their behavior affected by the goal they pursue.
Learning by hindsight, agents can reinterpret a past trajectory collected while pursuing a given goal in the light of a new goal. By asking themselves "what is the goal for which this trajectory is optimal?", they can use an originally failed trajectory as an informative trajectory to learn about another goal, thus making the most out of every trajectory.
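This relabeling idea can be sketched as follows, using the "final" strategy popularized by Hindsight Experience Replay, where the goal actually achieved at the end of the trajectory is substituted for the original one (a simplified sketch; the transition layout and sparse reward function are illustrative assumptions):

```python
import numpy as np

def hindsight_relabel(trajectory, reward_fn):
    """Reinterpret a (possibly failed) trajectory in the light of a new goal:
    the goal achieved at the final state ('final' strategy).
    trajectory: list of (state, action, next_state); reward_fn(next_state, goal)."""
    achieved_goal = trajectory[-1][2]  # pretend the end state was the goal all along
    relabeled = []
    for state, action, next_state in trajectory:
        r = reward_fn(next_state, achieved_goal)
        relabeled.append((state, action, achieved_goal, r))
    return relabeled

# Sparse reward: 1 when the state matches the goal.
reward_fn = lambda s, g: float(np.allclose(s, g))

traj = [(np.array([0.0]), 0, np.array([1.0])),
        (np.array([1.0]), 1, np.array([2.0]))]
relabeled = hindsight_relabel(traj, reward_fn)
# Under the relabeled goal, the last transition now carries a positive reward.
```

Other strategies sample the substitute goal from states visited later in the same trajectory rather than only the final one.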
GC-IMGEPs are intrinsically motivated versions of goal-conditioned RL algorithms. They need to be equipped with mechanisms to represent and generate their own goals in order to solve the intrinsically motivated skills acquisition problem.
4. A Typology of Goal Representations in the Literature
This section presents a typology of the different kinds of goal representations found in the literature. Each goal is represented by a pair: 1) a goal embedding and 2) a goal-conditioned reward function.
4.1 Goals as choices between multiple objectives
Goals can be expressed as a list of different objectives the agent can choose from.
4.2 Goals as target features of states
Goals can be expressed as target features of the state the agent desires to achieve.
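A minimal sketch of such a reward function, assuming a hand-defined feature map and success threshold (the function names, feature map and threshold value are illustrative assumptions):

```python
import numpy as np

def feature_goal_reward(state, goal, feature_fn, eps=0.05):
    """Reward for goals expressed as target features of states:
    success when the state's features fall within eps of the target features."""
    distance = np.linalg.norm(feature_fn(state) - goal)
    return float(distance < eps)

# Example feature map: keep only the (x, y) position of a larger state vector.
position = lambda s: s[:2]

state = np.array([0.50, 0.50, 0.1, -0.3])  # position + velocities
goal = np.array([0.50, 0.52])              # target position features
```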
4.3 Goals as abstract binary problems
Some goals cannot be expressed as target state features but can be represented as binary problems, where each goal expresses a set of constraints on the state (or trajectory) such that these constraints are either satisfied or not (binary goal achievement).
4.4 Goals as a Multi-Objective Balance
Some goals can be expressed, not as desired regions of the state or trajectory space but as more general objectives that the agent should maximize. In that case, goals can parameterize a particular mixture of multiple objectives that the agent should maximize.
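In this view, the goal embedding is a weight vector over objectives, and the reward is the weighted sum of the objective values. A minimal sketch (the objective set and weight values are purely illustrative):

```python
import numpy as np

def mixture_reward(objective_values, goal_weights):
    """Goals as a multi-objective balance: the goal parameterizes a mixture of
    objectives, and the reward is the weighted sum of their current values."""
    return float(np.dot(goal_weights, objective_values))

# Hypothetical objectives evaluated at the current state: (speed, safety, energy saving).
objectives = np.array([0.8, 0.3, 0.5])

# Two goals as different trade-offs over the same objectives.
fast_goal = np.array([1.0, 0.0, 0.0])  # only cares about speed
safe_goal = np.array([0.1, 0.8, 0.1])  # mostly cares about safety
```

A single goal-conditioned policy can then be trained across sampled weight vectors, so that new trade-offs can be requested at test time.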
When language-based goals were introduced, the gated-attention mechanism was proposed, where state features are linearly scaled by attention coefficients computed from the goal representation.
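The gating operation itself is simple. A sketch, assuming the attention coefficients come from a linear map of the goal embedding followed by a sigmoid (the exact architecture varies across papers; the shapes here are arbitrary):

```python
import numpy as np

def gated_attention(state_features, goal_embedding, W):
    """Gated attention: attention coefficients are computed from the goal
    representation (here a linear map + sigmoid) and gate the state features
    element-wise before further processing."""
    attention = 1.0 / (1.0 + np.exp(-(W @ goal_embedding)))  # gates in (0, 1)
    return attention * state_features                        # element-wise gating

rng = np.random.default_rng(0)
state_features = rng.normal(size=4)
goal_embedding = rng.normal(size=3)
W = rng.normal(size=(4, 3))  # maps the goal embedding to one gate per feature

gated = gated_attention(state_features, goal_embedding, W)
```

Because the gates lie in (0, 1), the goal can only attenuate state features, effectively selecting which parts of the state are relevant to the current instruction.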
Neural Module Networks (NMN) are a mechanism that leverages the linguistic structure of goals to derive a symbolic program defining how states should be processed.
This section presented a diversity of goal representations, corresponding to a diversity of reward function architectures. However, we believe this represents only a small fraction of the diversity of goal types that humans pursue.
5. How to Learn Goal Representations?
While individual goals are represented by their embeddings and associated reward functions, representing multiple goals also requires the representation of the support of the goal space, i.e. how to represent the collection of valid goals that the agent can sample from. This section reviews different approaches from the literature.
5.1 Assuming Pre-Defined Goal Representation
Most approaches tackle the multi-goal RL problem, where goal spaces and associated rewards are pre-defined by the engineer and are part of the task definition.
The next sub-section investigates how goal representations can be learned.
5.2 Learning Goal Embeddings
Some approaches assume the pre-existence of a goal-conditioned reward function, but learn to represent goals by learning goal embeddings.
5.3 Learning the Reward Function
A few approaches go even further and learn their own goal-conditioned reward function.
All these methods set their own goals and learn their own goal-conditioned reward function. For these reasons, they can be considered as complete intrinsically motivated goal-conditioned RL algorithms.
5.4 Learning the Support of the Goal Distribution
To represent collections of goals, one also needs to represent the support of the goal distribution, i.e. which embeddings correspond to valid goals and which do not.
Most approaches consider a pre-defined, bounded goal space in which any point is a valid goal.
Some approaches use the set of previously experienced representations to form the support of the goal distribution.
In all cases, the agent can only sample goals within the convex hull of previously encountered goals (in representation space). We say that goals are within training distribution. This drastically limits exploration and the discovery of new behaviors. Children, on the other hand, can imagine creative goals. Pursuing these goals is thought to be the main driver of exploratory play in children. This is made possible by the compositionality of language, where sentences can easily be combined to generate new ones.
While most approaches rely on pre-defined goal embeddings and/or reward functions, some approaches proposed to learn internal reward functions and goal embeddings jointly.
6. How to Prioritize Goal Selection?
Intrinsically motivated goal-conditioned agents also need to select their own goals.
6.1 Automatic Curriculum Learning for Goal Selection
It is important to endow intrinsically motivated agents learning in open-ended scenarios with the ability to optimize their goal selection mechanism.
This ability is a particular case of Automatic Curriculum Learning applied for goal selection: mechanisms that organize goal sampling so as to maximize the long-term performance improvement (distal objective). As this objective is usually not directly differentiable, curriculum learning techniques usually rely on a proximal objective. In this section, we look at various proximal objectives used in automatic curriculum learning (ACL) strategies to organize goal selection.
Intermediate difficulty has been used as a proxy for long-term performance improvement, following the intuition that focusing on goals of intermediate difficulty results in short-term learning progress that will eventually turn into long-term performance increase.
Some approaches generalize the idea of intermediate difficulty and train a goal generator to sample goals of uniform feasibility. This approach seems to lead to better stability and improved performance on more complex tasks.
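The intermediate-difficulty heuristic can be sketched as a simple filter over candidate goals, in the spirit of the GoalGAN criterion of goals of intermediate difficulty (the thresholds, the filter name and the toy difficulty model below are illustrative assumptions):

```python
import numpy as np

def goals_of_intermediate_difficulty(candidates, success_rate, low=0.1, high=0.9):
    """Keep candidate goals whose estimated success rate lies in [low, high]:
    neither already mastered nor currently infeasible.
    success_rate: callable goal -> estimated probability of success."""
    return [g for g in candidates if low <= success_rate(g) <= high]

# Toy difficulty model: success decreases with the goal's distance from the start.
success_rate = lambda g: float(np.clip(1.0 - np.linalg.norm(g), 0.0, 1.0))

candidates = [np.array([0.05, 0.0]),  # too easy (success ~ 0.95)
              np.array([0.50, 0.0]),  # intermediate (success ~ 0.5)
              np.array([1.50, 0.0])]  # currently infeasible (success ~ 0.0)
selected = goals_of_intermediate_difficulty(candidates, success_rate)
```

In practice the success-rate estimate comes from the agent's own recent outcomes, and the filter is implemented by training a generative model on goals that pass it.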
Novelty – diversity
Other approaches aim at a uniform coverage of the goal space (diversity).
Short-term learning progress
Other methods estimate the learning progress of the agent in different regions of the goal space and bias goal sampling towards areas of high absolute learning progress.
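A minimal sketch of learning-progress-based sampling, assuming the goal space is partitioned into regions and progress is estimated as the change in per-region success rate over time (the region setup and the small epsilon for residual exploration are illustrative assumptions):

```python
import numpy as np

def lp_sampling_probs(recent_success, older_success, eps=1e-6):
    """Bias goal-region sampling towards high absolute learning progress (LP):
    LP per region is the change in success rate between two time windows, and
    regions are sampled proportionally to |LP| (eps keeps all regions reachable)."""
    lp = np.abs(np.asarray(recent_success) - np.asarray(older_success))
    return (lp + eps) / np.sum(lp + eps)

# Three goal regions: mastered (no progress), progressing, too hard (no progress).
recent = [0.95, 0.60, 0.05]
older  = [0.95, 0.20, 0.05]
probs = lp_sampling_probs(recent, older)
```

Using the absolute value of progress also directs goals towards regions where performance is decreasing, helping the agent fight forgetting.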
6.2 Hierarchical Reinforcement Learning for Goal Sequencing
Hierarchical reinforcement learning (HRL) can be used to guide the sequencing of goals
In HRL, a high-level policy is trained via RL or planning to generate sequences of goals for a lower-level policy so as to maximize a higher-level reward.
This makes it possible to decompose tasks with long-term dependencies into simpler sub-tasks.
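The decomposition can be illustrated with a toy sketch (entirely hypothetical: a high-level policy proposes intermediate waypoints towards a distant target, and a goal-conditioned low-level policy reaches each subgoal exactly):

```python
def hierarchical_rollout(start, final_target, high_level, low_level, n_subgoals=3):
    """HRL sketch: the high-level policy decomposes a distant target into a
    sequence of subgoals; the goal-conditioned low-level policy pursues each.
    high_level(state, target) -> next subgoal; low_level(state, goal) -> next state."""
    state = start
    for _ in range(n_subgoals):
        subgoal = high_level(state, final_target)
        state = low_level(state, subgoal)
    return state

# Toy 1-D world: the high level proposes a waypoint a third of the way to the
# target; the low level reaches any proposed subgoal exactly.
high_level = lambda s, target: s + (target - s) / 3.0
low_level = lambda s, goal: goal

final_state = hierarchical_rollout(0.0, 9.0, high_level, low_level)
```

Each subgoal stays within the low-level policy's competence while the sequence steadily approaches the distant target.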
7. Future Avenues
This section proposes several directions to improve over current GC-IMGEP approaches towards solving the intrinsically motivated skills acquisition problem.
7.1 Towards a Greater Diversity of Goal Representations
This section proposes new types of goal representation that RL agents could leverage to tackle the intrinsically motivated skills acquisition problem.
Current approaches mostly consider goals whose completion can be assessed from any single state. Goals whose completion can only be judged by observing a sequence of states (e.g. jump twice) can however be handled by adding time-extended features to the state (e.g. the difference between the current state and the initial state). Future approaches could also include goals expressed as repetitions of a given interaction.
Agents could also pursue goals about their own learning abilities as a way to simplify the realization of task goals. Here, we refer to task goals as goals that express constraints on the physical state of the agent and/or environment, and to learning goals as goals that express constraints on the knowledge of the agent. Although most RL approaches target task goals, one could envision the use of learning goals for RL agents.
Goals as optimization under selected constraints
We discussed the representations of goals as a balance between multiple objectives. An extension of this idea is to integrate the selection of constraints on states or trajectories.
7.2 Towards Goal Composition and Imagination
Imagining creative out-of-distribution goals might help agents explore their environment more efficiently.
Humans are indeed very good at imagining and understanding out-of-distribution goals using compositional generalization (e.g. put a hat on a goat). If an agent knows how to achieve atomic goals (e.g. build a tower with the blue blocks and build a pyramid with the red blocks), we would like it to automatically generalize to compositions of these goals, as well as to their logical combinations.
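One simple route to such combinations is to compose goal-achievement functions directly: if atomic goals are binary predicates over states, their logical combinations define new goals the agent never trained on. A sketch (all names and the toy 2-D predicates are illustrative):

```python
import numpy as np

# Logical combinators over goal-achievement functions (binary predicates on states).
def g_and(f, g):
    return lambda s: f(s) and g(s)

def g_or(f, g):
    return lambda s: f(s) or g(s)

# Two atomic goals in a toy 2-D world.
near_origin = lambda s: bool(np.linalg.norm(s) < 1.0)
in_upper_half = lambda s: bool(s[1] > 0.0)

# A composed goal never trained on directly: near the origin AND in the upper half.
composed = g_and(near_origin, in_upper_half)
```

The hard part, of course, is not composing the reward functions but getting the goal-conditioned policy to generalize to the composed embeddings.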
7.3 Language as a Tool for Creative Goal Generation
Whether they are specific or abstract, time-specific or time-extended, whether they represent mixtures of objectives, constraints, or logical combinations, all goals can be expressed easily by humans through language. Language, thus, seems like the ideal candidate to express goals in RL agents.
In the future, it might be used to express any type of goal. Recurrent Neural Networks (RNNs), Deep Sets, Graph Neural Networks (GNNs) or Transformers are all architectures that benefit from inductive biases and could be leveraged to facilitate new forms of goal representations (time-extended, set-based, relational, etc.).
8. Discussion & Conclusion
This paper defined the intrinsically motivated skills acquisition problem and proposed to view intrinsically motivated goal-conditioned RL algorithms or GC-IMGEP as computational tools to tackle it.
These methods belong to the new field of developmental machine learning, at the intersection of the developmental robotics and RL fields. We reviewed current goal-conditioned RL approaches under the lens of intrinsically motivated agents that learn to represent and generate their own goals in addition to learning to achieve them.
We propose a new general definition of the goal construct: a pair composed of a compact goal representation and an associated goal-achievement function.
Intrinsically motivated agents need to learn to represent goals and to measure goal achievement. Future research could extend the diversity of considered goal representations, investigate novel reward function architectures and inductive biases to allow time-extended goals, goal composition and to improve generalization.