## Model Overview
At a high level, the Reinforcement Learning (RL) Agent simulates what would happen if human players chose their next move purely on the basis of what had been most successful previously.
The RL Agent tracks the rewards (game points) associated with different combinations of its own moves and its opponent’s moves and chooses the action each round that it expects to be most successful based on this prior experience. While the Model-Based Agent tries to predict its opponent’s most likely next move, the RL Agent does not have any such prediction about its opponent; instead, it merely knows what move it expects to be most successful on the next round based on earlier rounds.
In order to simulate human play against the different bot opponents, the RL Agent has two high-level functions (we spell these out in more detail below):
1. **Track reward patterns**: First, the agent maintains a count of rewards associated with different move combinations that it uses to determine which move patterns have been most successful. For example, the simplest version of the RL Agent might choose a move each round based on whichever of the three possible moves had generated the highest reward across all previous rounds. For such an agent, tracking reward patterns would involve adding up the total points (3 for each win, 0 for each tie, -1 for each loss) associated with each move choice (`Rock`, `Paper`, `Scissors`). This reward tracking allows the agent to choose whichever move has generated the greatest number of points so far. While this represents a very coarse means of deciding what move is likely to be successful, tracking rewards associated with richer combinations of previous moves allows this agent to respond more adaptively.

2. **Choose an optimal move**: The RL Agent chooses its next move based on prior move sequences that have generated the highest rewards. The higher the accumulated rewards for a given move choice (given relevant previous events), the more likely the agent is to choose this move. Specifically, using the reward totals for each move that the agent could choose in a given round, the agent samples a move probabilistically based on these relative reward values (a minimal code sketch of both functions follows this list).
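To make these two functions concrete, here is a minimal sketch of the simplest agent described above. It is written in Python with illustrative class and method names; it is not the project's actual implementation, and it uses the softmax sampling spelled out in the Model Details below.

```python
import math
import random


class SimpleRLAgent:
    """Minimal illustration: track cumulative reward per move and
    sample the next move in proportion to (a softmax of) those rewards."""

    MOVES = ["rock", "paper", "scissors"]

    def __init__(self, beta=1.0):
        self.beta = beta
        # Cumulative reward earned with each move so far (step 1).
        self.rewards = {move: 0.0 for move in self.MOVES}

    def track_reward(self, move, points):
        # Step 1: add this round's points (+3 win, 0 tie, -1 loss)
        # to the running total for the move that was played.
        self.rewards[move] += points

    def choose_move(self):
        # Step 2: convert cumulative rewards to softmax probabilities
        # and sample a move in proportion to them.
        exp_vals = [math.exp(self.beta * self.rewards[m]) for m in self.MOVES]
        total = sum(exp_vals)
        probs = [v / total for v in exp_vals]
        return random.choices(self.MOVES, weights=probs, k=1)[0]
```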
How well does the RL Agent described above capture human patterns of learning against the different bot opponents?
To answer this question, we implement different versions of the model and compare them to the human behavior in our experiment data, just as we did with the Model-Based Agent analysis. Each version of the RL Agent has the same decision function from step 2 above but differs in the underlying set of move combinations that it uses to maintain reward counts (step 1). In this way, we simulate different levels of associative learning that people might be engaging in when playing against algorithmic opponents.
## Model Details
Here, we describe in more detail the two central aspects of the RL Agent’s behavior outlined above.
The content in this section is meant to provide enough detail for readers to inspect the underlying code or think carefully about the Model Results on the next page.
Our model evaluation mirrors the Model-Based Agent analysis; we compare the RL Agent to human decision making by running simulations of the agent for each human participant in the original experiment. Thus, model results are based on 300 agent-supplied move choices in each of the games that human participants played; however, the RL Agent’s decisions in these games are based on reward counts of the previous human and bot opponent moves in each game. Critically, this means the agent’s move choices each round reflect learning from the participant behaviors, not learning from what the agent would have done in earlier rounds.
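A rough sketch of this simulation loop is shown below, reusing the `SimpleRLAgent` sketch above and assuming each participant's game is available as a list of per-round records; the keys `player_move` and `outcome_points` are hypothetical names, not the project's actual schema.

```python
def simulate_participant(agent, participant_rounds):
    """Replay one participant's game. The agent proposes a move each round for
    comparison with the human, but it learns from the participant's actual move
    and reward, never from its own simulated choices."""
    agent_moves = []
    for round_data in participant_rounds:  # e.g. 300 rounds of one game
        # The agent's choice this round, based only on prior human/bot history.
        agent_moves.append(agent.choose_move())
        # Update reward counts from what the human actually did and earned.
        agent.track_reward(round_data["player_move"], round_data["outcome_points"])
    return agent_moves
```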
### Tracking rewards from past behavior
The basis for the RL Agent’s move choices is an ongoing count of rewards associated with previous move combinations. To compute this, we add columns to our existing data frame that maintain the relevant cumulative rewards for the agent from earlier rounds. For example, the simplest version of the RL Agent adds columns for `rock_reward`, `paper_reward`, and `scissors_reward`; these track the cumulative rewards associated with each move choice in prior rounds. To populate them, we iterate through each round of human play and update the reward values in these columns based on the participant’s move choice that round (`Rock`, `Paper`, or `Scissors`) and the outcome the participant obtained (3 points for a win, 0 points for a tie, -1 points for a loss). Since this simplified model merely accumulates rewards associated with each single move, the model favors whichever move has won the most. We use this as a null model, since none of the bot opponents in the experiment could be exploited by simply favoring a particular move.
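As one way to picture this bookkeeping, the sketch below populates the three reward columns with pandas. The input column names (`player_move`, `outcome_points`) and the exact layout of the data frame are assumptions for illustration, not the project's actual schema.

```python
import pandas as pd


def add_single_move_reward_columns(df):
    """Add rock_reward / paper_reward / scissors_reward columns holding the
    cumulative reward earned with each move *before* the current round."""
    totals = {"rock": 0.0, "paper": 0.0, "scissors": 0.0}
    reward_rows = []
    for _, round_row in df.iterrows():
        # Rewards known going into this round (i.e. from earlier rounds only).
        reward_rows.append({f"{m}_reward": totals[m] for m in totals})
        # Update the running total with this round's result (+3 win, 0 tie, -1 loss).
        totals[round_row["player_move"]] += round_row["outcome_points"]
    return pd.concat([df.reset_index(drop=True), pd.DataFrame(reward_rows)], axis=1)
```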
However, the patterns that the RL Agent uses to track rewards aren’t restricted to move base rates and can be arbitrarily complex. In our Model Results on the next page, we evaluate a number of additional models that track different combinations of previous moves using the same process described above. These patterns fall into several distinct categories:
- **Self previous move rewards**: This version of the RL Agent tracks the rewards associated with each possible combination of its own previous move and current move. In other words, if the agent (or, more accurately, the human participant the agent is simulating) previously played `Rock`, its reward tracking allows it to choose the move that has the highest cumulative rewards following a choice of `Rock`. This strategy is well-aligned to the opponent-transition bot because, against this opponent, there will always be a move which performs best for the agent given a particular previous move; but how quickly the RL Agent learns this pattern when simulating human play against this bot, and how well it performs against the other bots, are open questions (see Model Results).

- **Opponent previous move rewards**: We also implement a version of the RL Agent that tracks rewards associated with each combination of agent move and opponent previous move. Tracking this reward pattern allows the agent to choose the optimal move each round based on any dependency between its move choice and its opponent’s previous move. To illustrate, if the agent’s opponent played `Paper` in the previous round, the agent will tend to choose the move for which it has obtained the highest reward previously after the opponent played `Paper`. This pattern is well-adapted to the self-transition bot, since that bot will reliably choose a particular move after its own previous move. More generally, this version of the RL Agent allows us to explore how well opponent previous move reward tracking performs in simulated rounds against all the bot opponents, and whether this aligns with human learning (see Model Results).

- **Combined agent-opponent previous move rewards**: In addition to simple self- and opponent-previous move rewards, this version of the RL Agent tracks the rewards achieved through a combination of the opponent’s previous move and the agent’s own previous move. In other words, if the previous round was a tie in which the RL Agent and its opponent both chose `Paper`, this version of the RL Agent will prefer whichever move had the highest combined rewards in previous rounds in which both players’ previous moves were `Paper`. Under the hood, this agent requires 27 columns to maintain reward counts from one round to the next, one for each unique combination of agent previous move, opponent previous move, and potential subsequent move (see the sketch after this list). This makes it adaptive against all but the most complex bot opponent, since the bot opponents’ move choices are by and large determined by a subset or combination of their own and the human participant’s previous move.

- **Disjunctive agent-opponent previous move rewards**: A final version of the RL Agent represents a middle ground between the self-/opponent-previous move rewards agents and the combined previous move agent above. This agent tracks the cumulative rewards associated with each combination of its own previous move and subsequent move, as well as the rewards associated with each combination of its opponent’s previous move and its own subsequent move. However, rather than tracking each unique combination of previous moves like the combined previous move agent above, this version tracks each previous move dependency on its own. When it comes time to choose a move, it combines the rewards for each possible move given the opponent's previous move with the rewards for each possible move given its own previous move, and uses these combined values to select a move. Thus, in the scenario described above where the agent and its bot opponent tied in the previous round by each choosing `Paper`, this version of the RL Agent will tend to choose the move which has the highest reward associated with either the agent’s own previous move of `Paper` or its opponent’s previous move of `Paper`, rather than the move which was historically best when both players previously chose `Paper`. In this way, it is an intermediate stage between the self and opponent previous move agents and the combined previous move agent: it requires 18 columns of cumulative reward counts, nine for each combination of self previous move and subsequent move, and nine for each combination of opponent previous move and agent move. Because the agent doesn’t integrate both previous moves in the same way as the combined previous move agent above, it is not as well adapted to the bots whose move choices are based on previous round outcomes (since this is a contingency on both previous moves combined).
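To make the distinction between these variants concrete, here is a compact sketch in Python with illustrative names (the project's actual code may be organized differently): each variant simply conditions its reward counts on a different function of the previous round, and I assume the disjunctive variant sums its two single-dependency tables at choice time.

```python
from collections import defaultdict
import math
import random

MOVES = ["rock", "paper", "scissors"]

# Each variant conditions its reward counts on a different "state", i.e. a
# different function of the previous round (self_prev, opp_prev).
STATE_FNS = {
    "single_move":   [lambda s, o: ()],              # 3 reward counts (null model)
    "self_prev":     [lambda s, o: ("self", s)],     # 9 counts
    "opponent_prev": [lambda s, o: ("opp", o)],      # 9 counts
    "combined_prev": [lambda s, o: ("both", s, o)],  # 27 counts
    # Disjunctive: two separate 9-count tables whose rewards are summed at choice time.
    "disjunctive":   [lambda s, o: ("self", s), lambda s, o: ("opp", o)],  # 18 counts
}


class PatternRLAgent:
    def __init__(self, variant, beta=1.0):
        self.state_fns = STATE_FNS[variant]
        self.beta = beta
        self.rewards = defaultdict(float)  # (state, move) -> cumulative reward

    def track_reward(self, self_prev, opp_prev, move, points):
        # Update every table this variant maintains (+3 win, 0 tie, -1 loss).
        for fn in self.state_fns:
            self.rewards[(fn(self_prev, opp_prev), move)] += points

    def choose_move(self, self_prev, opp_prev):
        # Sum rewards across tables (only the disjunctive variant has more than one),
        # then sample a move from the softmax over the totals.
        totals = {m: sum(self.rewards[(fn(self_prev, opp_prev), m)] for fn in self.state_fns)
                  for m in MOVES}
        weights = [math.exp(self.beta * totals[m]) for m in MOVES]
        return random.choices(MOVES, weights=weights, k=1)[0]
```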
Taken together, the RL Agent’s reward tracking process involves updating cumulative reward values for each move choice given distinct combinations of previous moves by the agent and its opponent. The RL Agent uses these cumulative rewards to learn associations between its opponent’s or its own previous move and the optimal subsequent move. The different classes of RL Agent described above allow us to test distinct learning models that might best account for human behavior against the bots.
In the next section, we walk through how the RL Agent uses the signal from accumulated rewards to select optimal moves in simulated rounds against the bot opponents.
### Choosing a move to maximize expected reward
Each instantiation of the RL Agent tracks a particular set of rewards based on one of the previous move patterns described above. But how does the agent choose its move each round using this information?
Unlike the Model-Based Agent, which chooses moves based on underlying predictions of what the opponent will do next, the RL Agent merely chooses moves in proportion to the rewards they conferred in the past.
In each round, the RL Agent has a cumulative reward associated with choosing `Rock`, `Paper`, and `Scissors` in previous rounds; as described above, these rewards incorporate additional contingencies like the participant’s previous move or their bot opponent’s previous move (or both). The task for the RL Agent is to sample moves that are associated with the largest prior rewards (given any relevant contingency information). To do this, the agent first converts these reward values to a probability distribution by calculating each move’s softmax probability, and then samples a move from the resulting softmax distribution.
**Sample a move based on its softmax probability**: For the RL Agent, its move each round will be chosen based on the relative prior rewards associated with each possible choice. For example, if the self-previous move agent above chose `Rock` in the previous round, it will first select the cumulative rewards associated with each possible move choice after `Rock`. Intuitively, if playing `Paper` after `Rock` has resulted in much higher rewards in the past than playing `Scissors` or `Rock` after `Rock`, it should strongly favor `Paper`. On the other hand, if all three moves have produced roughly equivalent outcomes, this agent will be more indifferent about its move choice. To formalize this, the RL Agent computes a softmax probability distribution over the rewards for each possible move. For a candidate agent move \(M_a\) (e.g. `Paper`), the softmax probability of choosing that move is:

\[
p(M_a) = \frac{e^{\beta \, U(M_a)}}{\sum_{a'} e^{\beta \, U(M_{a'})}}
\]
In the above formulation, \(U(M_a)\) represents the utility of choosing move \(M_a\): this is simply the cumulative prior reward associated with that choice, given all relevant information (such as the agent’s own previous move). The probability of choosing \(M_a\) is its exponentiated utility normalized by the sum of the corresponding values for all possible moves \(M_{a'}\). The \(\beta\) term in the numerator and denominator is a common parameter that scales how strongly the agent favors the highest-utility move. In this version of the model, we simply set \(\beta = 1\). However, in another line of work we estimate a \(\beta\) parameter for each participant as a way of quantifying how well the RL Agent describes human move decisions.
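To make the transformation concrete, consider a small worked example with made-up reward values (not taken from the experiment). Suppose the relevant cumulative rewards are \(U(\text{Rock}) = 2\), \(U(\text{Paper}) = 6\), and \(U(\text{Scissors}) = -1\), with \(\beta = 1\). Then

\[
p(\text{Paper}) = \frac{e^{6}}{e^{2} + e^{6} + e^{-1}} \approx \frac{403.4}{411.2} \approx 0.98,
\]

while \(p(\text{Rock}) \approx 0.018\) and \(p(\text{Scissors}) \approx 0.001\), so the agent almost always plays `Paper`. If the three reward totals were closer together, the resulting distribution would be correspondingly flatter.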
The RL Agent uses the softmax transformation above to generate a probability distribution over possible moves \(M_a \in \{R, P, S\}\). To select a move, the agent samples from this softmax probability distribution. Note that while the Model-Based Agent also relied on a softmax distribution to choose its move, the underlying basis for the distribution was different. In that case, the agent chose its moves in proportion to their underlying expected value. Here, the agent instead chooses a move in proportion to its previous rewards.
In summary, moves with higher rewards from previous rounds are more likely to be chosen in subsequent rounds. This means that if the agent is successfully able to track patterns in its own or its opponent’s previous moves that systematically lead to higher rewards, it will have a high probability of choosing the move that beats the opponent’s next move.
But how well does the RL Agent, choosing its moves in simulated rounds through the process described above, perform against the bot opponents from our experiments?
In the next page (Model Results), we test the RL Agent’s performance in the context of different reward tracking patterns.