The Principled Agent

The journey to a better policy.


An Agent of Chaos – Breakout Baseline #3

As Shakespeare once wrote:

To exploit or to explore. That is the question.
Whether ’tis nobler for the agent to suffer
The meager rewards of a known, safe policy,
Or to take arms against a sea of unknown states,
And by exploring, discover a better one?

That guy was way ahead of his time, huh! Today we’re taking a deep dive into how the entropy bonus in our still-not-so-good MinAtar Breakout agent affects the dynamics of our policy. By the end, I’m hoping we have some clear signs we know to look out for to indicate over- or under-tuned entropy bonuses.

Where We Left Off

If you missed the last couple of posts, here is a quick rundown:

We got a PPO Actor-Critic agent running (learning JAX in the process) in the Gymnax MinAtar Breakout environment. The results started out quite poor: 3.98 average episodic reward compared to the baseline’s 28. Through some systematic visualization, we uncovered a bug in the PPO loss that was excluding the entropy bonus. Fixing this bug pushed our agent’s average episodic reward all the way up to a whopping 9.81! Ahem… okay, maybe “whopping” is the wrong word.

But we’re left with some questions: are there signs we missed that would have told us earlier that our entropy bonus was too low? Can we catch this kind of issue quicker next time?

The One-Track Mind

Let’s begin in a regime similar to the bug we were dealing with last time: an extremely low entropy bonus. Now that we know what we’re looking at, an agent with no urge to explore, let’s re-examine some of the data we already have. I’ll set the entropy coefficient to 0.0005. For all intents and purposes, entropy will play no role in the PPO loss.
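Just to ground what that number means: in a PPO-style objective, the entropy bonus is one weighted term in the total loss. Here’s a minimal sketch of that combination (the names ent_coef and vf_coef, and the 0.5 value weight, are illustrative, not the exact values or code from this project):

```python
def total_ppo_loss(policy_loss, value_loss, entropy, ent_coef=0.0005, vf_coef=0.5):
    # Entropy is subtracted, so a larger ent_coef rewards a more uncertain
    # (higher-entropy) policy. At ent_coef = 0.0005 this term is dwarfed by
    # the clipped policy loss and the value loss, so it barely matters.
    return policy_loss + vf_coef * value_loss - ent_coef * entropy
```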

First, the playback and salience map. We’re going to make one small addition to this playback: another chart on the right side of the frame that shows the agent’s entropy over time. Maybe we’ll glean some additional insights from this.
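For reference, the per-frame entropy behind that chart is just the entropy of the categorical action distribution. A minimal sketch, assuming we log the policy’s action probabilities each frame (action_prob_history is a hypothetical name for that log):

```python
import jax.numpy as jnp

def categorical_entropy(probs):
    # Entropy (in nats) of one action distribution; 0 means fully certain,
    # log(num_actions) means uniformly random.
    return -jnp.sum(probs * jnp.log(probs + 1e-8))

# One value per frame, plotted alongside the playback:
# entropy_curve = [categorical_entropy(p) for p in action_prob_history]
```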

🔍Analysis

There are a couple of things that immediately jump out at me about this playback.

First, watch the chart in the upper right corner. The action probabilities are almost always 100%. This is further evidenced by the entropy-over-time chart in the bottom right that basically remains at 0 the whole time.

We also see that the value estimate is completely flat. Let’s keep an eye on that as well. Perhaps this will change with the entropy bonus tuning.

Finally, we see the very rigid attention lanes that we saw in my last post.

Hmm… anything else that I think can be clearly attributed to the very low entropy coefficient? I don’t think so.

💡Takeaway

A far-too-low entropy bonus is immediately clear when looking at the action probabilities, because the agent is always 100% certain about the action to take.

There are perhaps some other tell-tale signs, but I’m not sure yet. Things like the very rigid attention could be common for agents with a low entropy bonus. We’ll have to revisit this when we start working on other baselines.

There’s something else I’m curious about here. Just like the heatmap we made for lose states in my last post, what if we make a heatmap for entropy? Would that tell us anything?

I coded up a quick heatmap using Matplotlib that gives us the average entropy based on where the ball and paddle are located, normalized by the maximum average; the rough shape of that code is sketched below. Let’s take a look at the result:
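This sketch assumes the ball’s grid cell and the policy entropy were logged at each step (ball_positions and entropies are hypothetical names) and that the MinAtar grid is 10×10; the paddle map is the same idea keyed on the paddle’s position.

```python
import numpy as np
import matplotlib.pyplot as plt

def entropy_heatmap(ball_positions, entropies, grid_shape=(10, 10)):
    """Average policy entropy per ball cell, normalized by the maximum average."""
    totals = np.zeros(grid_shape)
    counts = np.zeros(grid_shape)
    for (row, col), ent in zip(ball_positions, entropies):
        totals[row, col] += ent
        counts[row, col] += 1
    # Per-cell average; unvisited cells stay at zero.
    avg = np.divide(totals, counts, out=np.zeros_like(totals), where=counts > 0)
    if avg.max() > 0:
        avg /= avg.max()
    plt.imshow(avg, cmap="viridis")
    plt.colorbar(label="normalized average entropy")
    plt.title("Average policy entropy by ball position")
    plt.show()
```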

🔍Analysis

Hmmm, at first glance, this heatmap doesn’t seem very informative. But that’s a finding in itself. The agent’s entropy is near-zero everywhere, so only a handful of spots register any entropy at all, which makes the heatmap look very sparse.

💡Takeaway

A far-too-low entropy bonus is easily identified by the entropy heatmap because it will look very spotty.

The Loose Goose

Okay, let’s flip all the way to the other end of the spectrum now. We just looked at very low entropy bonuses; now let’s look at very high ones. The agent should want to explore a lot, which should lead to a very different story.

I’ll set the entropy coefficient to 0.5. I expect the entropy bonus to totally saturate the PPO loss and drown out the actual environment reward.

Let’s take a look at our playback and see what the agent is up to:

🔍Analysis

Wow, that’s a short rollout! The agent basically can’t decide between any actions and so is just performing randomly. We can see all the action probabilities are basically equal throughout. And the uncertainty graph basically stays right up at maximum entropy the whole time.

The value estimation is completely flat again. Like last time, let’s note it and see if it improves with better tuning.

Interestingly, we also see attention being essentially random. I guess there is no reason for the model to learn any useful attention when the best reward comes from the entropy bonus and choosing random actions.

💡Takeaway

A far-too-high entropy bonus is immediately clear when looking at the action probabilities, because the agent is always completely uncertain about the action to take.

Additionally, the random attention is a sign that something is wrong with the learning signal, though it is likely not unique to a too-high entropy bonus.

Okay, some solid takeaways there. Nothing too surprising, though. Interestingly, if you remember our final playback from the last post, the attention looked very similar. I’m going to go out on a limb here and guess that the entropy coefficient of 0.1 we left off with last post is too high.

Okay, let’s look at the entropy heatmap to see if there’s anything interesting there:

🔍Analysis

This is a solid result! Because the agent’s entropy is high everywhere it goes, the map isn’t showing us interesting variations in uncertainty. Instead, it’s essentially just a visitation map, highlighting the set of states the agent is trapped in.

💡Takeaway

A far-too-high entropy bonus is immediately clear when looking at the entropy heatmap because it will just be a visitation map.

It’s clear that both extremes result in bad performance. One agent learns nothing because it never leaves its comfort zone, while the other learns nothing because the reward signal is completely drowned out by the incentive to be random. It’s clear we need to find a healthy balance between exploration and exploitation.

The Balancing Act

Okay. We’ve seen the two extreme ends of the spectrum, where we have far too much or far too little entropy bonus. Now let’s look at a couple of examples in the middle and see if there’s anything we can glean from these regimes.

For these two, I’ll set the entropy coefficient to 0.05 (let’s call this the strong example) and 0.005 (the weak example). This should give us a nice view of the landscape between the extremes.

First, let’s watch our agent play with the strong entropy bonus.

And our weak entropy bonus.

🔍Analysis

Well, these agents look loads better than our two extremes. They’re both able to play a much longer game (and receive higher reward for it).

But besides that, there are some other things to note. First, the results look like shades of gray between our two extreme cases. This makes sense: the weak agent, with its lower entropy bonus, is biased towards exploitation, while the strong agent is biased towards exploration. We are seeing the classic trade-off in action.

The strong entropy bonus agent still mostly sits around maximum entropy, except at very specific moments when it suddenly decides one action is correct and entropy drops to nearly zero.

We also see seemingly random attention, just like our extreme high case, though I would assume there is some structure there since the agent is following the ball.

The weak entropy bonus agent, on the other hand, is generally very confident about its actions and sits mostly at 0 entropy, though we do see occasional spikes where other actions get some probability.

It also still has very static attention, just like the extreme low case, though you can see clear fluctuations in attention as the ball moves.

And finally, for the first time, we’re seeing tiny little variations in our value estimates for both agents! Aside from longer rollouts, this is the biggest sign that we’re moving in the right direction.

The weak agent appears to perform better than the strong agent in this one rollout. We’ll have to wait and see if, in the end, the average episodic reward bears out the same results.

💡Takeaway

Fluctuations in uncertainty are a sign of a healthy entropy bonus.

Those agents are looking really good now. We seem to have homed in on a good entropy coefficient range for our agent. Let’s take a look at the entropy heatmaps and see what story they tell.

First the strong agent:

And our weak agent:

🔍Analysis

The ball entropy heatmaps both look much healthier. The entropy seems to follow some set patterns where there is higher entropy on the edges and up by the bricks and lower entropy in the middle of the environment. A possible interpretation of this pattern is that the agent is more certain when the ball is in the predictable, open middle of the screen. It becomes more uncertain when the ball is near the edges or bricks, where bounces are harder to predict.

One other thing to note: the strong agent seems to get the ball into more locations than the weak agent. So unlike our single environment rollout, this heatmap points to the strong agent performing better! Hmm… we’ll have to see what the average episodic reward tells us.

The paddle entropy map doesn’t provide much information.

💡Takeaway

Full but structured entropy heatmaps are a good indicator that the entropy bonus is in a healthy range.

Just a Dash of Entropy

So, it seems there are obvious distinctions that are pretty easy to recognize if you know what to look for. Hopefully, with today’s takeaways, we’ll be able to better spot healthy vs. unhealthy entropy bonuses.

There’s just one last thing to do: decide which of these entropy coefficients to use going forward. Currently, with the entropy coefficient we randomly chose in the last post, 0.1, we’re sitting at 9.81 average episodic reward compared to the baseline’s 28. Let’s run all our agents again and see what their average episodic rewards are.
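For anyone following along, here’s a sketch of how an evaluation like this can be batched with Gymnax. The select_action placeholder just samples random actions; you’d swap in the trained actor’s sampling step, and the episode cap and seed are arbitrary choices on my part:

```python
import jax
import jax.numpy as jnp
import gymnax

env, env_params = gymnax.make("Breakout-MinAtar")
NUM_EPISODES = 1024
MAX_STEPS = 1000  # generous cap on episode length

def select_action(rng, obs):
    # Placeholder: uniform random actions. Replace with a sample from the
    # trained actor's categorical distribution over actions.
    return env.action_space(env_params).sample(rng)

def episode_return(rng):
    rng, rng_reset = jax.random.split(rng)
    obs, state = env.reset(rng_reset, env_params)

    def step_fn(carry, _):
        rng, obs, state, done, ret = carry
        rng, rng_act, rng_step = jax.random.split(rng, 3)
        action = select_action(rng_act, obs)
        obs, state, reward, step_done, _ = env.step(rng_step, state, action, env_params)
        ret = ret + reward * (1.0 - done)  # stop accumulating once the episode ends
        done = jnp.maximum(done, step_done.astype(jnp.float32))
        return (rng, obs, state, done, ret), None

    init = (rng, obs, state, jnp.zeros(()), jnp.zeros(()))
    (_, _, _, _, ret), _ = jax.lax.scan(step_fn, init, None, length=MAX_STEPS)
    return ret

rngs = jax.random.split(jax.random.PRNGKey(0), NUM_EPISODES)
returns = jax.vmap(episode_return)(rngs)
print("average episodic reward:", returns.mean())
```

Vmapping over the seeds keeps all 1024 rollouts batched on the device rather than looping in Python.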

Entropy Coefficient | Avg Episodic Reward (1024 Runs)
0.5                 | 6.33
0.1                 | 9.81
0.05                | 15.53
0.005               | 6.47
0.0005              | 5.07

🔍Analysis

0.05 is our clear winner, with an average reward of 15.53!

There is a nice curve here where extreme values result in bad performance and intermediate values improve the performance. Interestingly, it appears as though favoring higher entropy bonuses over lower ones is beneficial.

There’s something about those results that isn’t quite sitting right with me. The 0.005 coefficient agent, our weak agent, seemed much healthier in our visualizations than our 0.5, loose goose, agent. But in the average episodic rewards table, they appear to be about equal in performance. Perhaps the average isn’t telling the whole story. Let’s take a look at the min and max as well.
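If you keep the returns array from the evaluation sketch above, the extra statistics are a one-liner per agent:

```python
# Same `returns` array as in the evaluation sketch, one entry per rollout.
print(f"avg: {returns.mean():.2f}  min: {returns.min():.0f}  max: {returns.max():.0f}")
```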

Coefficient | Average | Min | Max
0.5         | 6.33    | 4   | 9
0.1         | 9.81    | 5   | 14
0.05        | 15.53   | 15  | 16
0.005       | 6.47    | 6   | 7
0.0005      | 5.07    | 4   | 6

🔍Analysis

Well, this helps clarify things! The high entropy coefficient agents have huge variance in their performance because they’re just performing random actions, so their average can be misleading.

💡Takeaway

Perhaps quite obviously, choosing higher entropy bonuses for agents will result in higher variance in their final performance.

That settles it: the 0.05 agent delivers not just the highest average reward, but also strong and consistent performance.

15.53! Again, a huge improvement from where we were sitting last time. We’re inching ever closer to that 28 baseline. After carefully exploring entropy bonuses, we gained some major insight into how to identify where on the spectrum our agent sits. Then, using this info, we were able to make an informed decision about which direction to tune our entropy, resulting in a much better performing agent!

We still haven’t reached the baseline yet, though. So, let’s see if we can’t hit it next time. While we’ve been using MLPs so far, I think it’s time to move on to an architecture that’s better suited for the spatial data that is the pixel input we’re getting from MinAtar Breakout. Let’s do a deep dive into CNNs next time and see how we can best tune them to break that baseline!

Code

You can find the complete, finalized code for this post on the V1.3 Release on GitHub.


