In cooperative multi-agent reinforcement learning (MARL), policy gradient (PG) methods are, because of their on-policy nature, typically believed to be less sample efficient than value decomposition (VD) methods, which are off-policy. However, some recent empirical studies show that with proper input representation and hyper-parameter tuning, multi-agent PG can achieve surprisingly strong performance compared to off-policy VD methods.
Why might PG methods work so well? In this post, we present a concrete analysis showing that in certain scenarios, for example environments with a highly multimodal reward landscape, VD can be problematic and lead to undesired outcomes. By contrast, PG methods with individual policies can converge to an optimal policy in these cases. In addition, PG methods with autoregressive (AR) policies can learn multimodal policies.
Figure 1: Representation of different policies for the 4-player permutation game.
CTDE in Cooperative MARL: VD and PG methods
Centralized training and decentralized execution (CTDE) is a popular framework in cooperative MARL. It leverages global information for more effective training while keeping the representation of individual policies for testing. CTDE can be implemented via value decomposition (VD) or policy gradient (PG), leading to two different types of algorithms.
VD methods learn local Q networks and a mixing function that combines the local Q networks into a global Q function. The mixing function is usually constrained to satisfy the Individual-Global-Max (IGM) principle, which guarantees that the optimal joint action can be computed by greedily choosing the locally optimal action for each agent.
In contrast, PG methods directly apply the policy gradient to learn an individual policy and a centralized value function for each agent. The value function takes as input the global state (e.g., MAPPO) or the concatenation of all local observations (e.g., MADDPG), for an accurate estimate of the global value.
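To make the IGM principle concrete, here is a minimal sketch that assumes the simplest possible mixing function, a plain sum of local Q values (as in VDN). The local Q values are made-up numbers and `additive_q_tot` is a hypothetical helper for illustration only; actions are 0-indexed.

```python
import numpy as np

def additive_q_tot(local_qs):
    """Build Q_tot(a^1, ..., a^n) = sum_i Q_i(a^i) as an n-dimensional table."""
    n = len(local_qs)
    q_tot = np.zeros([len(q) for q in local_qs])
    for i, q in enumerate(local_qs):
        shape = [1] * n
        shape[i] = len(q)
        q_tot = q_tot + np.reshape(q, shape)  # broadcast Q_i along axis i
    return q_tot

# Made-up local Q values for two agents with three actions each.
local_qs = [np.array([0.2, 1.0, -0.3]), np.array([0.5, 0.1, 0.9])]
q_tot = additive_q_tot(local_qs)

# Decentralized execution: every agent greedily maximizes its own Q_i.
decentralized = tuple(int(np.argmax(q)) for q in local_qs)

# Centralized check: maximize Q_tot over all joint actions.
centralized = tuple(int(i) for i in np.unravel_index(np.argmax(q_tot), q_tot.shape))

print(decentralized, centralized)
```

With an additive mixer the two selections coincide by construction, which is exactly what IGM asks for.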
The permutation game: A simple counterexample where VD fails
We begin our analysis by considering a stateless cooperative game, namely the permutation game. In an $N$-player permutation game, each agent can output one of $N$ actions $\{1,\ldots,N\}$. Agents receive a reward of $+1$ if their actions are mutually different, i.e., the joint action is a permutation over $1,\ldots,N$; otherwise, they receive a reward of $0$. Note that there are $N!$ symmetric optimal strategies in this game.
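The game is small enough to enumerate. The sketch below is illustrative only; `permutation_reward` and `optimal_joint_actions` are hypothetical helper names. It encodes the reward described above and counts the optimal joint actions for $N=4$.

```python
import itertools
from math import factorial

def permutation_reward(joint_action):
    """+1 if the joint action is a permutation of 1..N, otherwise 0."""
    n = len(joint_action)
    return 1.0 if sorted(joint_action) == list(range(1, n + 1)) else 0.0

def optimal_joint_actions(n):
    """Enumerate every joint action of the N-player game that earns reward 1."""
    actions = range(1, n + 1)
    return [a for a in itertools.product(actions, repeat=n)
            if permutation_reward(a) == 1.0]

optima = optimal_joint_actions(4)
print(len(optima), factorial(4))  # prints "24 24"
```

Out of the $4^4 = 256$ joint actions, exactly $4! = 24$ are optimal, and they are all equally good.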
Figure 2: The 4-player permutation game.
Figure 3: High-level intuition about why VD fails in the 2-player permutation game.
We now focus on the 2-player permutation game and apply VD to it. In this stateless setting, we use $Q_1$ and $Q_2$ to denote the local Q-functions and $Q_\textrm{tot}$ to denote the global Q-function. The IGM principle requires that
\[\arg\max_{a^1,a^2}Q_\textrm{tot}(a^1,a^2)=\{\arg\max_{a^1}Q_1(a^1),\arg\max_{a^2}Q_2(a^2)\}.\]
We prove by contradiction that VD cannot represent the payoff of the 2-player permutation game. If VD methods were able to represent the payoff, we would have
\[Q_\textrm{tot}(1,2)=Q_\textrm{tot}(2,1)=1\quad\text{and}\quad Q_\textrm{tot}(1,1)=Q_\textrm{tot}(2,2)=0.\]
If either of the two agents has different local Q values (e.g., $Q_1(1)>Q_1(2)$), we have $\arg\max_{a^1}Q_1(a^1)=1$. Then, according to the IGM principle, any optimal joint action
\[(a^{1\star},a^{2\star})=\arg\max_{a^1,a^2}Q_\textrm{tot}(a^1,a^2)=\{\arg\max_{a^1}Q_1(a^1),\arg\max_{a^2}Q_2(a^2)\}\]
satisfies $a^{1\star}=1$ and $a^{1\star}\neq 2$, so the joint action $(a^1,a^2)=(2,1)$ is sub-optimal, i.e., $Q_\textrm{tot}(2,1)<1$.
Otherwise, if $Q_1(1)=Q_1(2)$ and $Q_2(1)=Q_2(2)$, then every joint action is greedy with respect to the local Q-functions, so by the IGM principle all four joint actions attain the maximum of $Q_\textrm{tot}$:
\[Q_\textrm{tot}(1,1)=Q_\textrm{tot}(2,2)=Q_\textrm{tot}(1,2)=Q_\textrm{tot}(2,1).\]
Both cases contradict the payoff values required above.
As a result, the value decomposition cannot represent the payoff matrix of the 2-player permutation game.
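The argument can also be checked numerically. The sketch below is illustrative and rests on two assumptions of ours: actions are 0-indexed, and the second half uses a purely additive (VDN-style) mixer as a stand-in for VD. It first tests whether the set of optimal joint actions is a Cartesian product of per-agent action sets, which IGM requires, and then fits $Q_1(a^1)+Q_2(a^2)$ to the payoff by least squares.

```python
import numpy as np

# Payoff of the 2-player permutation game (rows: a^1, columns: a^2).
payoff = np.array([[0.0, 1.0],
                   [1.0, 0.0]])

def argmax_set(q, tol=1e-9):
    """All joint actions whose value is within tol of the maximum."""
    return {tuple(int(i) for i in idx) for idx in np.argwhere(q >= q.max() - tol)}

def is_product_set(joint_set):
    """IGM needs the greedy joint set to equal argmax Q_1 x argmax Q_2."""
    firsts = {a for a, _ in joint_set}
    seconds = {b for _, b in joint_set}
    return joint_set == {(a, b) for a in firsts for b in seconds}

optimal = argmax_set(payoff)            # {(0, 1), (1, 0)}
igm_compatible = is_product_set(optimal)

# Least-squares additive fit: columns are [Q_1(0), Q_1(1), Q_2(0), Q_2(1)],
# rows follow the joint actions (0,0), (0,1), (1,0), (1,1).
design = np.array([[1, 0, 1, 0],
                   [1, 0, 0, 1],
                   [0, 1, 1, 0],
                   [0, 1, 0, 1]], dtype=float)
coef, *_ = np.linalg.lstsq(design, payoff.reshape(-1), rcond=None)
fitted = (design @ coef).reshape(2, 2)

print(igm_compatible)
print(fitted)
```

The optimal set $\{(0,1),(1,0)\}$ is not a product set, and the best additive fit assigns the same value to every joint action, mirroring the second case of the proof.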
What about PG methods? Individual policies can indeed represent an optimal policy for the permutation game. Moreover, stochastic gradient descent can guarantee that PG converges to one of these optima under mild assumptions. This suggests that, even though PG methods are less popular in MARL than VD methods, they can be preferable in certain cases that are common in real-world applications, e.g., games with multiple strategy modalities.
We also note that in the permutation game, representing an optimal joint policy requires the agents to choose mutually distinct actions. Consequently, a successful PG implementation must ensure that policies are agent-specific. This can be done by using either individual policies with unshared parameters (referred to as PG-Ind in our paper) or an agent-ID conditioned policy (PG-ID).
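A minimal sketch of PG with individual, unshared policies on the 2-player permutation game follows. It is not the paper's implementation. It assumes tabular softmax policies, uses the exact expected policy gradient instead of sampled estimates, and starts from a slightly asymmetric initialization, because the fully uniform policy is a stationary point of the expected return. Actions are 0-indexed.

```python
import itertools
import numpy as np

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def reward(joint):
    """+1 if all agents pick mutually different actions, otherwise 0."""
    return 1.0 if len(set(joint)) == len(joint) else 0.0

def expected_return_and_grad(logits):
    """Exact J(theta) and its gradient for independent softmax policies."""
    n = logits.shape[0]
    pi = softmax(logits)
    ret = 0.0
    grad = np.zeros_like(logits)
    for joint in itertools.product(range(n), repeat=n):
        prob = np.prod([pi[i, a] for i, a in enumerate(joint)])
        r = reward(joint)
        ret += prob * r
        for i, a in enumerate(joint):
            score = -pi[i].copy()   # d log pi_i(a) / d logits_i
            score[a] += 1.0
            grad[i] += prob * r * score
    return ret, grad

def train(logits, lr=1.0, steps=2000):
    logits = logits.copy()
    for _ in range(steps):
        _, grad = expected_return_and_grad(logits)
        logits += lr * grad         # gradient ascent on the expected return
    return logits

# One row of logits per agent; no parameters are shared between agents.
init_logits = np.array([[0.1, 0.0],
                        [0.0, 0.0]])
final_logits = train(init_logits)
final_return, _ = expected_return_and_grad(final_logits)
greedy_actions = softmax(final_logits).argmax(axis=1)
print(final_return, greedy_actions)
```

Which of the two optimal modes the agents settle on is determined by the initialization; flipping the small bias in `init_logits` selects the other one.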
PG outperforms existing VD methods on popular MARL testbeds
Going beyond the simple illustrative example of the permutation game, we extend our study to popular and more realistic MARL benchmarks. In addition to the StarCraft Multi-Agent Challenge (SMAC), where the effectiveness of PG and agent-conditioned policy input has been verified, we show new results in Google Research Football (GRF) and the multi-player Hanabi Challenge.
Figure 4: (left) winning percentages of PG methods in GRF; (right) best and average evaluation scores in Hanabi-Full.
In GRF, PG methods outperform the state-of-the-art VD baseline (CDS) in 5 scenarios. Interestingly, we also notice that individual policies (PG-Ind) without parameter sharing achieve comparable, sometimes even higher, winning rates compared to agent-specific policies (PG-ID) in all 5 scenarios. We evaluate PG-ID in the full-scale Hanabi game with varying numbers of players (2-5 players) and compare it to SAD, a strong off-policy Q-learning variant in Hanabi, and Value Decomposition Networks (VDN). As demonstrated in the table above, PG-ID is able to produce results comparable to or better than the best and average rewards achieved by SAD and VDN with varying numbers of players using the same number of environment steps.
Beyond higher rewards: learning multimodal behavior using autoregressive policy modeling
In addition to learning higher rewards, we also study how to learn multimodal policies in cooperative MARL. Let’s go back to the permutation game. Although we have shown that PG can effectively learn an optimal policy, the strategy mode it eventually achieves may be highly dependent on the initialization of the policy. Thus, a natural question will be:
Can we learn a single policy that can cover all optimal modes?
In the decentralized PG formulation, the factored representation of a joint policy can only represent a particular mode. Therefore, we propose an improved way to parameterize policies for stronger expressiveness: autoregressive (AR) policies.
Figure 5: Comparison between individual policies (PG) and auto-regressive policies (AR) in the 4-player permutation game.
Formally, we factorize the joint policy of $n$ agents into the form of
\[\pi(\mathbf{a}\mid\mathbf{o})\approx\prod_{i=1}^n\pi_{\theta^{i}}\left(a^{i}\mid o^{i},a^{1},\ldots,a^{i-1}\right),\]
where the action produced by agent $i$ depends on its own observation $o^i$ and all the actions of the previous agents $1,\dots,i-1$. The autoregressive factorization can represent any joint policy in a centralized MDP. The only modification to each agent's policy is the input dimension, which is slightly enlarged by including the previous actions; the output dimension of each agent's policy remains unchanged.
With such minimal parameterization overhead, the AR policy substantially improves the representational power of PG methods. We note that PG with AR policy (PG-AR) can simultaneously represent all optimal policy modes in the permutation game.
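To illustrate the representational gap, the sketch below hand-writes an AR policy for the 4-player permutation game (nothing here is learned, and the helper names are hypothetical). Each agent conditions on the previous agents' actions and picks uniformly among the actions not yet taken. For comparison, it also evaluates an independent policy with uniform marginals. Actions are 0-indexed.

```python
import itertools

def reward(joint):
    """+1 if all agents pick mutually different actions, otherwise 0."""
    return 1.0 if len(set(joint)) == len(joint) else 0.0

def ar_conditional(prev_actions, n):
    """pi^i(a^i | a^1..a^{i-1}): uniform over actions not taken by earlier agents."""
    remaining = [a for a in range(n) if a not in prev_actions]
    return {a: 1.0 / len(remaining) for a in remaining}

def ar_joint_distribution(n):
    """Multiply the conditionals along every action sequence with non-zero mass."""
    dist = {}

    def expand(prefix, prob):
        if len(prefix) == n:
            dist[prefix] = prob
            return
        for action, p in ar_conditional(prefix, n).items():
            expand(prefix + (action,), prob * p)

    expand((), 1.0)
    return dist

n = 4
ar_dist = ar_joint_distribution(n)
ar_return = sum(p * reward(joint) for joint, p in ar_dist.items())

# Independent policy with the same uniform marginals, for comparison.
independent_uniform_return = sum(
    (1.0 / n) ** n * reward(joint)
    for joint in itertools.product(range(n), repeat=n)
)

print(len(ar_dist), ar_return, independent_uniform_return)
```

The AR policy spreads its mass evenly over all $4!$ permutations and still earns the full reward, whereas the independent policy with uniform marginals is rewarded only with probability $4!/4^4$.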
Figure: Action heatmaps for policies learned by PG-Ind (left) and PG-AR (center), and heatmap for rewards (right); while PG-Ind only converges to a specific mode in the 4-player permutation game, PG-AR successfully discovers all optimal modes.
In more complex environments, including SMAC and GRF, PG-AR can learn interesting emergent behaviors that require strong inter-agent coordination and that may never be learned by PG-Ind.
Figure 6: Emergent behavior induced by PG-AR in SMAC and GRF. (left) On SMAC's 2m_vs_1z map, the Marines keep standing still and attack alternately, ensuring that only one Marine attacks at each timestep; (right) in GRF's academy_3_vs_1_with_keeper scenario, agents learn a "Tiki-Taka" style behavior: each player keeps passing the ball to their teammates.
Discussions and conclusions
In this post, we provide a concrete analysis of VD and PG methods in cooperative MARL. First, we reveal the limited expressiveness of popular VD methods, showing that they cannot represent optimal policies even in a simple permutation game. In contrast, we show that PG methods are demonstrably more expressive. We empirically verify the expressiveness advantage of PG on popular MARL testbeds, including SMAC, GRF, and the Hanabi Challenge. We hope that the insights from this work can benefit the community towards more general and more powerful cooperative MARL algorithms in the future.
This post is based on our paper: Revisiting Some Common Practices in Cooperative Multi-Agent Reinforcement Learning (paper, website).