Defining the problem

The dynamic nature of real-world environments demands that agents adapt to the changing states around them in order to remain useful in the long run. This adaptability entails the agent being able to perform multiple tasks and learn them simultaneously. Multi-Task Reinforcement Learning (MTRL) is one of the approaches gaining momentum for building such autonomous agents.

MTRL can be described as a problem in which the agent has to learn multiple tasks and maximize its cumulative reward over all of them, while also learning how to perform each task effectively. To optimize learning, the agent should efficiently utilize the environment information shared across these tasks.

As stated by Sodhani et al. [2], shared environment information aids MTRL when the underlying relationships between the tasks being learnt are utilized in concert with the shared representations. Their approach uses task metadata embeddings to attend over the shared environment representation, extract task-specific knowledge, and then train for that particular task.

Our work introduces a new approach to MTRL. We model MTRL as a graph problem in which the tasks to be learnt are the nodes of the graph. Inspired by Sodhani et al.'s use of task metadata embeddings, we use the similarity between task embeddings to aggregate task-specific information and to let each task leverage information from related tasks. Our approach and experiments are detailed below.

Our proposed solution - MTRL using Graph CNNs

Graphs are powerful data structures for capturing dynamic environments thanks to their explicit node and edge structure, where each edge represents a relationship between two nodes. Jiang et al. [5] exploited this property and proposed graph convolutional neural networks to tackle the multi-agent reinforcement learning problem. In their approach, graph convolutions propagate information between agents by exploiting the underlying dynamic graph structure. Inspired by their work, we propose to use graph convolutional neural networks (GCNNs) to solve the multi-task reinforcement learning problem.

How to build the graph?

In the outlined representation, a node holds a task's observation and an edge holds the relative observation embeddings of two neighboring tasks. To structure this neighborhood, we use contextual task information, inspired by Sodhani et al. [2], and extract adjacency between tasks from task metadata.

Sodhani et al. [2] used encoded task descriptions to extract the dynamics and reward information of the environment. Figure 1 illustrates a few tasks that do not necessarily have similar state observations, yet whose contextual information reveals the similarity between them. In [2], the authors used a pretrained context encoder to retrieve context embeddings of the tasks from the provided metadata. We incorporate this insight into our process of defining adjacency between tasks.

Figure 1: Different tasks with similar task descriptions. Pictures are taken from Yu et al. [6].

Model Architecture

Figure 2: Our proposed architecture

We propose the novel architecture presented in Figure 2. We use task metadata and a pretrained language model to obtain task context encodings, in a manner similar to Sodhani et al. [2]. The adjacency matrix used in our graph convolutional neural network is built from these context encodings.
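A minimal sketch of this step, assuming a Hugging Face transformer as the pretrained language model and mean pooling over token states; the model name and the example task descriptions are illustrative, not the exact choices used in our experiments:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed encoder; any pretrained language model could stand in here.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoder = AutoModel.from_pretrained("distilbert-base-uncased")

task_descriptions = [  # hypothetical Meta-World-style task metadata
    "open a window by pushing it",
    "close a window by pushing it",
    "open a door with a revolving joint",
]

with torch.no_grad():
    batch = tokenizer(task_descriptions, padding=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state      # (tasks, seq_len, dim)
    # Mean-pool over non-padding tokens to get one context encoding per task.
    mask = batch["attention_mask"].unsqueeze(-1)
    context = (hidden * mask).sum(1) / mask.sum(1)   # (tasks, dim)
```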

We take observations from the agent; in all of our environments these are state-based observations. We encode them with a single feed-forward encoder that has two hidden layers with 50-dimensional inputs, to obtain a better representation of the environment. The task metadata and the adjacency matrix vary considerably with the environment being used; these differences are explained in detail in the Environments section.
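A sketch of this observation encoder, assuming ReLU activations and a placeholder observation size (neither is specified above):

```python
import torch.nn as nn

obs_dim = 39  # placeholder: the state observation size depends on the environment

# Feed-forward observation encoder with two 50-dimensional hidden layers,
# as described above; the ReLU activations are an assumption.
obs_encoder = nn.Sequential(
    nn.Linear(obs_dim, 50),
    nn.ReLU(),
    nn.Linear(50, 50),
    nn.ReLU(),
)
```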

After getting the context encodings, we build the adjacency matrix with one of the following two approaches (a sketch of both follows the list):

Soft attention: We calculate the covariance matrix of the context encodings and use it directly as the adjacency matrix. In this approach, adjacencies are floating-point numbers in [0, 1].

Hard attention: We set a threshold to decide whether two tasks are related. If the covariance of two tasks is higher than that threshold, we set the corresponding adjacency entry to 1 and consider the two tasks related; otherwise we set it to 0.
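A minimal sketch of both schemes, where the threshold value is hypothetical and `context` is the matrix of per-task context encodings:

```python
import torch

def build_adjacency(context: torch.Tensor, mode: str = "soft",
                    threshold: float = 0.5) -> torch.Tensor:
    """Build a (tasks x tasks) adjacency matrix from task context encodings.

    `threshold` is a hypothetical value, not one reported above.
    """
    cov = torch.cov(context)            # covariance between task encodings
    if mode == "soft":
        return cov                      # edge weights are the covariances
    return (cov > threshold).float()    # hard attention: binary adjacency
```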

We concatenate the observation and context encodings to obtain the feature matrix, which is multiplied by the adjacency matrix and passed to the attention block (containing multiple attention heads); finally, we concatenate these attention heads with the encoded observation.

The main structure of the graph convolutional network is presented in Figure 3. In each convolutional layer, attention (Zambaldi et al. [11]) is applied to retrieve an attention head for that relation kernel; all the attention heads are then concatenated and given to a final linear layer (instead of the Q network shown in Figure 3). The computation performed in each relation kernel is presented in Figure 4.

Figure 3: Graph convolutional neural networks. Image taken from Jiang et al. [5].

Figure 4: Graph to represent flow design of getting attention heads in GCNNs. Image taken from Jiang et al. [5].
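A sketch of one such convolutional layer under the description above: neighbor features are aggregated through the adjacency matrix, dot-product attention produces one head per relation kernel, and the heads are concatenated. All layer sizes are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationKernelLayer(nn.Module):
    """One graph-convolution layer with multi-head dot-product attention."""

    def __init__(self, in_dim: int, head_dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.q = nn.Linear(in_dim, head_dim * num_heads)
        self.k = nn.Linear(in_dim, head_dim * num_heads)
        self.v = nn.Linear(in_dim, head_dim * num_heads)

    def forward(self, features: torch.Tensor, adjacency: torch.Tensor):
        # features: (tasks, in_dim); adjacency: (tasks, tasks)
        h = adjacency @ features                     # aggregate neighbor features
        t = h.shape[0]
        q = self.q(h).view(t, self.num_heads, -1).transpose(0, 1)  # (heads, t, d)
        k = self.k(h).view(t, self.num_heads, -1).transpose(0, 1)
        v = self.v(h).view(t, self.num_heads, -1).transpose(0, 1)
        relation = F.softmax(q @ k.transpose(-2, -1), dim=-1)      # (heads, t, t)
        heads = relation @ v                         # weight values by relations
        return heads.transpose(0, 1).reshape(t, -1)  # concatenate attention heads
```

The concatenated heads would then be joined with the encoded observation and passed to the final linear layer, as described above.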

For each attention head to be calculated, query ($Q(x)$), key ($K(x)$), and value ($V(x)$) representations are first extracted.

$softmax\big(Q(x) \times K(x)^\top\big)$ gives the relation matrix between the tasks. For each attention head, the value representations of the features are weighted by this relation matrix, so the following equation gives an attention head for the relation kernel:

$$ h_i'^{\,t} = softmax\big(Q(x) \times K(x)^\top\big) \times V(x) $$
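A quick shape check of this equation, with illustrative sizes (4 tasks, 16-dimensional heads; both numbers are assumptions for the example):

```python
import torch
import torch.nn.functional as F

T, d = 4, 16                             # illustrative: 4 tasks, 16-dim heads
Q, K, V = torch.randn(T, d), torch.randn(T, d), torch.randn(T, d)

relation = F.softmax(Q @ K.T, dim=-1)    # (T, T): task-to-task relation weights
head = relation @ V                      # (T, d): one attention head
assert head.shape == (T, d)
```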