Universal Notice Network: Transferable Knowledge Among Agents

Being able to learn skills and transfer them from one agent to another is a fundamental step toward constructing more intelligent behaviors. In this paper, we introduce a new architecture and information pipeline that aims to enable the transmission of skills from one robot to one or several others. The originality of the Universal Notice Network (UNN) lies in the fact that it clearly separates the knowledge necessary to solve the task from the agent's intrinsic perceptions and capabilities, hence increasing its reusability and its potential for transmission to other agents. In various experiments focusing on manipulation and comanipulation tasks in original environments, we demonstrate the capabilities of the proposed method, which takes advantage of reinforcement learning algorithms and domain knowledge such as the forward geometric model and inverse kinematics. In particular, we show that a UNN learned through the interactions of an agent with its environment is transmissible to other agents while conserving a similar performance level.


I. INTRODUCTION
From any point of view, fine-grained and versatile control of robots would yield great benefits. Various research fields focus on this goal: control, planning, and, more recently, learning-based approaches have been particularly visible in the research community. Combining modern deep learning techniques and reinforcement learning principles has led to impressive results. Indeed, both model-based and model-free methods demonstrate high performance in complex tasks such as the game of Go [18], [19] or continuous control [9], [13], [14]. Even though recent works show reflection and progress toward adaptable skills or knowledge, in particular using character retargeting [15] or environment adaptation [22], most agents are trained on a specific environment, with no option for transferring skills or gained knowledge. Whenever the environment requires the agent to learn reusable skills, it would be highly desirable to embed the logic necessary to solve the task in a separate module so that it is potentially transferable to other entities.
This concept is best understood by focusing on the industrial use of robots. In this case, there are two main incentives to rely on such a methodology. The first corresponds to the case where a robot could easily be replaced by another with more or less different specifications (different number of joints, segment lengths, ...), as depicted in Figure 1. Another motivation behind this approach is lifelong adaptation. In fact, while the instructions given to the robot might be the same during its whole lifetime (e.g., pick and place, reach, push, ...), the robot's components will be altered by use, changing its model over time.
The next section introduces works relevant to our current research and emphasizes areas where improvements in transfer learning can be made. Section III presents the main concepts and contributions of this paper, leading to Section IV, where we demonstrate the capabilities of our method in various experiments. Finally, Section V focuses on discussion and avenues for improvement.

All authors are with Université Clermont Auvergne, CNRS, SIGMA Clermont, Institut Pascal, F-63000 Clermont-Ferrand, France. mehdi.mounsif@uca.fr

II. RELATED WORKS
In the robotics field, some works describe how to control robots based on a hierarchical decomposition of tasks into a set of simpler sub-tasks with priorities [10]. The high-priority sub-tasks ensure the integrity of the robot and of its environment, while the low-priority sub-tasks describe the action to achieve. The desired task is performed if it does not conflict with the high-priority sub-tasks. Unfortunately, this method requires a model of the robot and a mathematical description of the action to be performed, as well as its derivative with respect to the robot joint positions. Except for specific cases, it is usually not trivial to frame a robotic problem in such a way. Learning-based techniques, however, are generally not impacted by these constraints. For instance, the Dactyl system by OpenAI [11], operating on the Shadow Hand, was trained in simulation and yet is able to perform complex manipulation tasks on rigid objects in the real world. In [15], a model-free method learns to control an articulated ragdoll in a simulator to mimic acrobatic human motion with an end-to-end pipeline that involves recovering 3D poses from images with no depth information and reconstructing the trajectories of the human body joints.
Commonly, reinforcement learning agents are trained with the sole objective of performing the task they are taught through the reward function. In a few cases, the agent can be used as an expert to provide demonstrations, for instance in [6], which can be seen as an instance of imitation learning. But in many works, the gained knowledge does not go any further than the task it was initially destined for. Transferring knowledge has mostly been an issue that researchers working with deep convolutional networks [4] or NLP models [7] have had to deal with. Indeed, given the resources needed to train a model, many works detail methods that aim at generalizing a model to a setting different from the one in which it was trained. This trend is also perceptible in reinforcement learning, with techniques such as domain randomization [21]. Also, to match the IMPALA architecture [2], DeepMind proposed [1] a corpus of environments for agent navigation, featuring mazes with walls of different colors and textures, for benchmarking the transfer of skills from one environment to another. While these works partially address the transfer of skills between environments using a single agent, the knowledge gained through learning in these models is not disentangled, i.e., it is not possible to interpret the model and determine the regions where the task is solved. The idea of transferring knowledge in a way that is concise and disentangled enough to be usable for various tasks is not new in the learning community.
Hence, the most salient drawback of current transfer learning settings may be the fact that no precaution is taken to confine knowledge to a specific part of the model. This, in turn, implies that it is not possible to isolate parts of the model and transfer them to other agents.
Another promising, yet less common, approach is meta-learning. In this setting, a model is trained on a corpus of tasks with the objective of finding an initialization point in parameter space that is only a few iterations away from a local optimum, i.e., from which it will generalize well to a specific task. Recent notable works include the MAML algorithm [3], where the authors introduced an interleaved training procedure composed of an inner loop for task-specific optimization and an outer loop for a more general gradient. Further developments such as the CAML algorithm [24] decrease the number of adjustable parameters and improve the overall performance of the model. While the formulation is appealing, the results presented focus on simple tasks, such as regression over sine curves or, in the reinforcement learning framework, moving a particle toward a point in a two-dimensional space, which is rather far from robotic tasks in an industrial context.
Overall, most transfer techniques aim at transferring skills gained through interactions with a specific environment to another one, increasing the agent's adaptability and retargeting its abilities to new domains. However, to the best of our knowledge, there is no specific way to ensure that the knowledge necessary for solving a task is not spread over all the layers, which makes it difficult to transfer blocks between agents. The following section introduces our novel method, the Universal Notice Network, which addresses exactly this drawback.

A. Problem formulation
We start by introducing several notations useful for the problem formulation: $T$ is the task to be solved, $R$ is a robot physically able to perform the task, and $s$ is a vector of observations coming from the environment. We are thus seeking a controller $C$ such that following the sequence of actions given by $a = C(s)$ can ultimately solve the task.

B. Concept
The objective of the proposed Universal Notice Network (UNN) is to provide a dedicated task module that can be placed in the middle of a control pipeline, allowing any compatible robot to solve the task. To motivate our approach, we start with an example, depicted in Figure 2. Let us consider two people, each charged with moving a heavy load from one point to another. The initial and current target locations are given on a notice that they share. In this setting, it is straightforward to see that both individuals have the same task description but different physical capabilities. In other words, while their environment is the same, they might end up choosing different ways of completing the task.

C. Formalism
Our aim is to show that it is possible to construct a central block that holds sufficient knowledge to solve a task. This block, the UNN, must be independent of the agent's physical capabilities so that it is transferable to other entities. However, different body configurations imply different intrinsic observations, i.e., the vector holding this information will have as many dimensions as needed to describe the entity. To deal with this particularity, we rely on the following mechanism: we do not feed the full observation vector $s$ from the environment into the UNN. Rather, the observation is split into a task observation $s_t$ and an intrinsic observation vector $s_i$ that holds data concerning the robot, so that $s = \{s_i, s_t\}$. The intrinsic observation is processed by an input base $B^R_{in}$, a model specific to the robot $R$ that translates $s_i$ into a vector $s'_i$ understandable by the UNN. The UNN hence receives as input $s'$ such that:
$$s' = \{s'_i, s_t\}, \quad s'_i = B^R_{in}(s_i)$$
Based on this vector $s'$, the UNN model outputs a high-level instruction, called $o_{UNN}$. However, in the same manner as before, and in order to keep the UNN independent of structural considerations, we introduce an additional model, the output base $B^R_{out}$, again specific to each agent. This output base uses the intrinsic robot observation $s_i$ to translate $o_{UNN}$ into an action suited to the robot configuration.

Fig. 3: UNN architecture. The agent receives an observation vector $s$, split into a robot observation vector $s_i$, processed by the input base, and a task observation vector $s_t$. The UNN operates on the processed robot information and the task information. The output base then computes the action based on the UNN output and the robot intrinsic state vector $s_i$.

Finally, the whole process, as shown in Figure 3, can be summarized as:
$$a = B^R_{out}\big(o_{UNN}, s_i\big), \quad o_{UNN} = \mathrm{UNN}\big(B^R_{in}(s_i), s_t\big)$$
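The input base, UNN, and output base composition described in this section can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the planar arm with unit-length links, the function names, the hand-written `toy_unn` standing in for a learned network, and the Jacobian-transpose rule used as an output base (the paper only states that the bases rely on domain knowledge such as the forward geometric model and inverse kinematics).

```python
import math
from typing import Callable, List

Vector = List[float]


def forward_kinematics(joint_angles: Vector, link_length: float = 1.0) -> Vector:
    """Hypothetical input base: planar-arm joint angles -> effector position (x, y)."""
    x = y = cumulative = 0.0
    for q in joint_angles:
        cumulative += q
        x += link_length * math.cos(cumulative)
        y += link_length * math.sin(cumulative)
    return [x, y]


def jacobian_transpose_base(o_unn: Vector, joint_angles: Vector,
                            link_length: float = 1.0, gain: float = 0.1) -> Vector:
    """Hypothetical output base: desired effector displacement -> joint increments."""
    cumulative, angles = 0.0, []
    for q in joint_angles:
        cumulative += q
        angles.append(cumulative)
    action = []
    for j in range(len(joint_angles)):
        # Column j of the planar Jacobian: effect of joint j on the effector.
        dx = -sum(link_length * math.sin(a) for a in angles[j:])
        dy = sum(link_length * math.cos(a) for a in angles[j:])
        action.append(gain * (dx * o_unn[0] + dy * o_unn[1]))
    return action


def toy_unn(s_prime: Vector) -> Vector:
    """Stand-in for a learned UNN: ask the effector to move toward the target."""
    ex, ey, tx, ty = s_prime
    return [tx - ex, ty - ey]


def act(s_i: Vector, s_t: Vector,
        input_base: Callable[[Vector], Vector],
        unn: Callable[[Vector], Vector],
        output_base: Callable[[Vector, Vector], Vector]) -> Vector:
    """Full pipeline: the UNN never sees the robot-specific vector s_i directly."""
    s_i_prime = input_base(s_i)          # robot-specific -> robot-agnostic
    o_unn = unn(s_i_prime + s_t)         # task-level instruction
    return output_base(o_unn, s_i)       # robot-agnostic -> robot-specific


target = [1.0, 1.5]  # task observation s_t (illustrative values)
action_3 = act([0.1, 0.2, 0.3], target, forward_kinematics, toy_unn, jacobian_transpose_base)
action_6 = act([0.1] * 6, target, forward_kinematics, toy_unn, jacobian_transpose_base)
```

The same `toy_unn` is reused unchanged for a 3-joint and a 6-joint arm; only the bases depend on the number of joints, which is the separation the architecture is meant to enforce.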

D. Background in Reinforcement Learning
While in principle this technique is compatible with any kind of learning approach, we rely on reinforcement learning to train the UNN. The usual reinforcement learning framework involves several concepts: we consider an agent interacting with an environment. The environment is defined as a Markov Decision Process $M = [S, A, P, R, \gamma]$, where $S$ and $A$ are respectively the sets of states and actions available within the environment. $P$ is the probability of transitioning from $s_t$ to $s_{t+1}$ when selecting action $a_t \in A$, i.e., $p(s_{t+1} | s_t, a_t)$. $R$ is the reward function, reflecting the desirability of the state reached, and $\gamma$ is the discount factor for future rewards, whose purpose is to favor immediate rewards.
In reinforcement learning, the objective is to find the policy $\pi$ that maximizes the expected cumulative reward:
$$J(\pi) = \mathbb{E}_{\tau \in D}\Big[\sum_{t=0}^{n} \gamma^{t} r_t\Big]$$
where $D$ is a set of fixed-horizon trajectories $\tau$, each defined as a sequence of state-action pairs $\{(s_0, a_0), (s_1, a_1), ..., (s_n, a_n)\}$. When using neural networks as function approximators, we actually seek the best policy parameters $\theta$.
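As a small illustration of the quantity being maximized, the following Python sketch computes the discounted cumulative reward of one fixed-horizon trajectory and averages it over a set of trajectories. The function names and the reward values are hypothetical, and applying the discount factor inside the sum is an assumption based on the MDP definition given earlier.

```python
from typing import List, Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of gamma**t * r_t over one trajectory, accumulated backwards."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def expected_return(trajectories: List[Sequence[float]], gamma: float) -> float:
    """Empirical estimate of the objective: mean return over a set of trajectories."""
    return sum(discounted_return(t, gamma) for t in trajectories) / len(trajectories)


# Illustrative reward sequences for two short trajectories.
batch = [[1.0, 1.0, 1.0], [0.0, 0.0, 4.0]]
estimate = expected_return(batch, gamma=0.5)
```

With `gamma=0.5`, the first trajectory yields 1 + 0.5 + 0.25 = 1.75 and the second 0.25 × 4 = 1.0, so the estimate is their mean.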
The RL community offers various methods suitable for finding the optimal policy [5], [23]. When dealing with continuous spaces, it is common to take advantage of policy gradient methods rather than Q-learning, which would not be as efficient due to the curse of dimensionality [20]. However, simple policy gradient agents are very brittle and fail to improve their performance due to gradient variance issues. Instead, this work is based on a policy gradient variant called Proximal Policy Optimization (PPO) [16], which capitalizes on an actor-critic and trust-region setting. These two concepts strongly improve the agent's performance thanks to reduced variance in the performance estimate and a controlled policy optimization step. Policy gradient algorithms use gradient ascent to maximize the policy performance, estimated using rollouts of the policy in the environment. The most commonly used estimator has the following form:
$$\hat{g} = \hat{\mathbb{E}}_t\big[\nabla_\theta \log \pi_\theta(a_t | s_t)\, \hat{A}_t\big]$$
where $\hat{A}_t$ measures how good the action was and, in the simplest form of the policy gradient theorem [20], is the plain reward $r_t$. Actor-critic methods use a supplementary network, called the critic, to estimate the state value $V(s_t)$, corresponding to the expected cumulative reward from that state until the end of the episode, which in turn enables estimating the advantage as $\hat{A}_t = r_t + \gamma V(s_{t+1}) - V(s_t)$. Finally, trust-region methods aim at taking the biggest optimization step without destroying the policy. In this view, the PPO algorithm maximizes a clipped surrogate objective:
$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\big[\min\big(r(\theta)\hat{A}_t,\ \mathrm{clip}(r(\theta), 1-\epsilon, 1+\epsilon)\,\hat{A}_t\big)\big]$$
where $\epsilon$ is the clipping parameter and $r(\theta)$ is the probability ratio $r(\theta) = \frac{\pi_\theta(a|s)}{\pi_{\theta_{old}}(a|s)}$ between the policy parametrized by its current weights, $\pi_\theta$, and a previous version, $\pi_{\theta_{old}}$.
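The one-step advantage and the clipped surrogate objective described in this paragraph can be written out in a few lines of plain Python. This is only a numerical sketch: the function names are hypothetical, the clipping parameter value of 0.2 is an assumption (this section does not state the value used), and a real implementation would operate on tensors and differentiate through the probability ratio.

```python
from typing import Sequence


def one_step_advantage(r_t: float, v_t: float, v_next: float, gamma: float) -> float:
    """Advantage estimate A_t = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    return r_t + gamma * v_next - v_t


def clipped_surrogate(ratios: Sequence[float], advantages: Sequence[float],
                      epsilon: float = 0.2) -> float:
    """Mean over samples of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    terms = []
    for r, a in zip(ratios, advantages):
        clipped_r = max(1.0 - epsilon, min(r, 1.0 + epsilon))
        terms.append(min(r * a, clipped_r * a))
    return sum(terms) / len(terms)


# A ratio above 1 + eps earns no extra credit when the advantage is positive...
gain_capped = clipped_surrogate([1.5], [1.0])
# ...and a ratio below 1 - eps is not allowed to hide a negative advantage.
loss_kept = clipped_surrogate([0.5], [-1.0])
```

Taking the minimum of the clipped and unclipped terms makes the objective a pessimistic bound: moving the ratio further outside the interval never improves the objective, which is what limits the size of each policy update.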

IV. EXPERIMENTS
We chose to implement the learning environments with the Pymunk physics library, a Python API that itself relies on the open-source Chipmunk engine [17]. For the learning part, the PyTorch [12] library was used, and the baseline PPO implementation is [8]. Videos of trained agents are available at https://bit.ly/2QdOIep.

A. Experiments configuration
As stated previously, we expect the UNN to act as a plug-and-play module that can be seamlessly transferred between various agents. To this end, we design several experimental cases to demonstrate our method. Let

B. Manipulation tasks
The first environment, called catcher, consists of a multi-joint robotic arm with a bar attached to its effector and a ball falling from a higher position with an initial velocity (see Figure 4a). Each episode starts by setting the robot in a random configuration. The episode ends after 500 steps, or earlier if the ball height falls below a threshold, which implies either that the ball has fallen from the robot's bar or that the agent was not able to hold it. The goal in this environment is for the robot to catch the ball and raise it as high as possible. To incentivize the agent to fulfill the task, we introduce the following reward function: $R = c \times h$, where $c$ is a binary indicator equal to 1 only if there is contact between the agent's effector and the ball, and 0 otherwise, and $h$ is the ball height.
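The reward and termination rules of the catcher environment can be expressed compactly. The sketch below is not the authors' environment code: the function names are hypothetical, and the height threshold is left as a parameter because its value is not given in the text.

```python
def catcher_reward(in_contact: bool, ball_height: float) -> float:
    """R = c * h: the ball height counts only while the effector touches the ball."""
    return ball_height if in_contact else 0.0


def episode_done(ball_height: float, step: int,
                 height_threshold: float, max_steps: int = 500) -> bool:
    """End the episode when the ball drops below the threshold or time runs out."""
    return ball_height < height_threshold or step >= max_steps


# Illustrative values only; the threshold of 0.0 is an assumption.
reward_holding = catcher_reward(True, 2.5)
reward_missed = catcher_reward(False, 2.5)
finished = episode_done(ball_height=-0.1, step=42, height_threshold=0.0)
```

Because the reward is zero without contact, simply keeping the bar high is not rewarded; the agent has to catch the ball first and then raise it.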
In this setting, we first assess the learning performance of our method against a baseline PPO algorithm. Figure 5 shows the evolution of the cumulative reward during training for various robots: $R^A_3$, $R^A_6$, $R_3$, $R_6$. Cumulative reward is a frequently used metric to benchmark reinforcement learning algorithms. It depends on the magnitude of the environment reward and represents the performance of an agent over one episode. The higher the cumulative reward, the better the agent performed. As can be seen in Figure 5, the UNN agents reach higher cumulative rewards while also learning slightly faster than the baseline PPO agents. Once the UNN is trained, the final performance is evaluated over 2000 episodes of the task. Each of these episodes starts by setting the initial condition to a preset determined beforehand in order to have uniform test conditions. During these runs, if the policy holds the ball for more than 450 steps out of 500, we consider that it has succeeded; otherwise, the episode is counted as a failure. Table I shows comparisons between a baseline PPO agent and a UNN agent for various robot configurations. Given that the ball's initial velocity or the robot's initial configuration may not allow the agent to succeed in the task, the table displays common successes (i.e., both the UNN and PPO reached at least 450 steps), common failures, and cases where only one of them succeeded. We can see that in this case the UNN performance is competitive with the PPO baseline. While these results are consistent with what has been observed over multiple training sessions, and although PPO is among the most stable and consistent RL algorithms, it is important to keep in mind that reinforcement learning performance can vary between two training sessions and that these results could be different for another seed.
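The evaluation protocol above, where both agents are run on the same preset initial conditions and each episode is sorted into one of four outcomes, can be sketched as follows. The function name and the step counts are made up for illustration, and the strict "more than 450 steps" criterion is an assumption, since the text also phrases it as "at least 450 steps".

```python
from typing import Dict, Sequence


def tally_outcomes(ppo_steps: Sequence[int], unn_steps: Sequence[int],
                   success_steps: int = 450) -> Dict[str, int]:
    """Count common successes, common failures, and single-agent successes."""
    counts = {"both": 0, "neither": 0, "ppo_only": 0, "unn_only": 0}
    for p, u in zip(ppo_steps, unn_steps):
        ppo_ok = p > success_steps   # steps the PPO agent held the ball
        unn_ok = u > success_steps   # steps the UNN agent held the ball
        if ppo_ok and unn_ok:
            counts["both"] += 1
        elif ppo_ok:
            counts["ppo_only"] += 1
        elif unn_ok:
            counts["unn_only"] += 1
        else:
            counts["neither"] += 1
    return counts


# Hypothetical hold durations for four paired episodes (not results from the paper).
summary = tally_outcomes([500, 100, 500, 120], [500, 90, 200, 480])
```

Pairing the episodes by initial condition is what makes the "common failure" column meaningful: it separates episodes that are unsolvable for both agents from genuine differences between the two policies.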

C. Transmission of a trained UNN
We now consider the case where we have a trained UNN from the previous experiment. We demonstrate here that it is possible to conserve a similar level of performance when transferring a trained UNN to another manipulator structure. If the bases are of type A or B, this can be done directly. Using C or D base types is also possible by training only the base. This configuration relates to the following case: the forward model is known and we want to recover the inverse model. For these runs, the forward model architecture consists of two layers with 30 units each, with a tanh nonlinearity in between. The inverse model architecture is a single layer of 64 neurons followed by a tanh activation function, whose output is used to compute Gaussian distributions with a diagonal covariance matrix parametrized by a vector of size $n$, where $n$ is the number of joints; it is trained using PPO. To test the transfer aspect, we compare the performance of a trained UNN on a specific robot, obtained with type A configurations and given in the first column, with a different manipulator structure and base, given in the second column. The same episode initializations as in Table I are used. As can be seen in Table III, results are consistent between robots. In particular, we observe a higher number of shared successes, which means that the knowledge and methodology are indeed transferred. Imperfect learning, for instance when recovering bases in the $R^C_i$ cases, may hinder performance. Figure 8 shows the evolution of the cumulative reward for two $R^A_3$ robots controlled either by the UNN or by a baseline PPO. In this specific case, the UNN outperforms the baseline by an important margin, both in learning speed and in final performance. In Table IV, we tested several manipulator structures with different base types and compared the test results with a baseline. The first two columns hold information about the manipulator structures, while the next two columns give the base types. Finally, the two rightmost columns detail statistics obtained over 1500 episodes, showing again the adaptability of the UNN approach.
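To make the action distribution of the learned inverse model more concrete, the sketch below samples from, and evaluates the log-probability of, a Gaussian with diagonal covariance over $n$ joint commands. It is an illustration only: the function names are hypothetical, and parametrizing the covariance by a vector of log standard deviations is an assumption (the text only says the covariance is parametrized by a vector of size $n$).

```python
import math
import random
from typing import List, Sequence


def sample_action(mean: Sequence[float], log_std: Sequence[float],
                  rng: random.Random) -> List[float]:
    """Draw one action; each joint is sampled independently (diagonal covariance)."""
    return [rng.gauss(m, math.exp(ls)) for m, ls in zip(mean, log_std)]


def log_prob(action: Sequence[float], mean: Sequence[float],
             log_std: Sequence[float]) -> float:
    """Log-density of a diagonal Gaussian: a sum of per-joint 1D log-densities."""
    total = 0.0
    for a, m, ls in zip(action, mean, log_std):
        z = (a - m) / math.exp(ls)
        total += -0.5 * z * z - ls - 0.5 * math.log(2.0 * math.pi)
    return total


n_joints = 3
mean = [0.0] * n_joints       # would come from the tanh layer of the inverse model
log_std = [0.0] * n_joints    # the size-n vector parametrizing the covariance
action = sample_action(mean, log_std, random.Random(0))
density_at_mean = log_prob(mean, mean, log_std)
```

These log-probabilities are what enter the PPO probability ratio when the base is trained, since the ratio is the exponential of the difference between the new and old log-probabilities of the sampled action.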

V. CONCLUSION & PROSPECTS
This paper addresses the problem of knowledge transfer between agents. A novel plug-and-play architecture is presented to prevent learned models from being unusable by other agents. Its key contribution lies in the fact that knowledge is clearly separated from low-level controllers. This pipeline thus makes it possible to create separable task models that can easily be shared between agents. Simulation results show that our method is consistent across robot configurations and competitive with state-of-the-art baselines. Unlike these baselines, however, the UNN preserves the generated knowledge for further use and transmission, as shown in multiple benchmark cases. Future work will focus on improving state representation in order to free the UNN from domain knowledge.

Fig. 2: Solving the same task with different capabilities.Each row shows a possible outcome based on individual strengths and preferences.

Fig. 5: Cumulative reward over training for the catcher task.

Fig. 6: Comparison of the evolution of cumulative reward for a baseline agent and two UNN configurations: $R^C_3$ and $R^D_3$.

Fig. 8: Cumulative reward over training on double catcher.

TABLE I: Comparative results on catcher of a baseline agent and UNN with configuration A: analytical bases (PPO $R_i$ vs UNN $R^A$). Statistics are obtained after 2000 episodes.

TABLE II: Comparative results on catcher of a baseline agent and UNN with configuration B: supervised model of analytical bases (PPO $R_i$ vs UNN $R^B$).

TABLE III: Results on 2000 episodes of UNN transmission to various configurations and base types.

TABLE IV: Comparative results for double-catcher.