I Ran Billions of Simulations to Simplify Multi-Armed Bandit Algorithms for You

Share

Whenever anyone mentions Multi-armed bandit algorithms, the synonymous term reckoned by most data scientists would be "explore-exploit". Pls let me know if that's what came to your mind too! This article is an attempt to introduce a new perspective to associate MAB algorithms with a different term - "simulations". Understanding how simulations play a key role in making sense of MAB algorithms is far more important than understanding the nuances of explore-exploit paradigms or the stats behind Thomson sampling.

Why simulations are more important than the math

  • We're dealing with random processes : The term explore (in "explore-exploit") screams "randomness". The two main components of MAB algorithms - arm selection and the reward distribution mechanism - be it binary (success/failure or converted/not converted) or continuous (gaussian or other probability distributions) - are all based on random distributions. Simulations can better mimic the behavior of these random distributions, making it invaluable for grasping how MAB algorithms work in practice.

  • Math can be abstract: While the math behind Softmax, UCB or Thomson sampling is elegant, it can be hard to connect to real-world scenarios unless you're this guy from Good Will hunting.

This article isn’t for people who see probability distributions in their bathroom mirror

  • Experimentable, Scalable, and Visualizable: Simulations allow you to run “what-if” scenarios with various baseline values, reward rate differences, and numbers of arms. Since simulations don’t rely on actual data, they can scale effortlessly, even when dealing with highly complex reward distributions—common in real-world scenarios. Oh, and did I mention that simulations let you observe the variability in outcomes? Observing and understanding variability - it's the gateway to understanding the universe itself - or if I'm supposed to be less dramatic - essential in quantifying the uncertainty in decision-making.

So let's delve into the results of simulations, right away.

To explore or not, that is the question

In the context of experimentation, the goal of MAB algorithms is primarily to assign experimentation units to each arm based on it's knowledge of the best arm. But herein lies the problem that any MAB algorithm has to overcome. Initially, the knowledge about the best arm is low (or non-existent). It needs to "explore" initially to have more confidence on which the best arm is. After gaining enough confidence, it can begin exploiting the knowledge gained to assign more experimental units to the "best performing arm". Now let's see how one of the most foundational MAB algorithms - a.k.a., Epsilon Greedy - stacks up against the Softmax algorithm.

Epsilon Greedy vs Softmax

The performance of Epsilon greedy when the best arm far better the baseline

To evaluate how well these algorithms perform, we need a clear set of metrics. These metrics help us understand their effectiveness in identifying the best arm and maximizing rewards.

The metrics to evaluate MAB algorithms

Power : The proportion of trials where the algorithm correctly identifies the best arm

Why this metric matters: Since we are playing Gods by doing simulations, we already know which arm is the most successful. This metric helps measure how effectively an algorithm identifies the best arm. Typically, we aim for 80% power —i.e., 80% of simulations should correctly identify the best arm at a given point in time. An algorithm or parameter set that reaches this benchmark sooner (or with fewer samples) is considered highly efficient.This is the top most plot in the pic.

Average reward: The average reward obtained across all simulations in a trial

Why this metric matters: Even arms with lower reward rates occasionally return rewards. While choosing the best arm is ideal, selecting a slightly suboptimal arm isn’t catastrophic, provided it still generates meaningful rewards. This metric reflects the overall effectiveness of the algorithm in maximizing reward across trials.

Cumulative reward: The total sum of average rewards accumulated across trials for the chosen arms

Why this metric matters: Some algorithms are slow learners, but once they learn, they consistently perform better in later trials. These algorithms may take longer to achieve 80% power and might show lower initial average rewards but excel in long-term reward accumulation (e.g., conversion rates or signup rates). This is captured in the bottommost plot in the image.

Armed with these metrics, let’s interpret the results of our simulations and see what they reveal about the performance of different parameter settings. The above graph was generated as a result of 5000 simulations for 250 trials on 6 different parameters, adding up to 7.5 M simulations.

  • When the performance gap between the best arm and the baseline arms is significant (around 80% in this case), algorithms with an epsilon value less than or equal to 0.2 reach the 80% power threshold within 250 samples.

  • The algorithm with an epsilon of 0.1 is able to rack up more rewards in later trials, showing its strength in long-term optimization.

  • Hence epsilon values of 0.1 and 0.2 are better performers wrt. Power, Average reward as well as Cumulative rewards earned.

Let's compare Softmax algorithm for a similar scenario.

We can already observe that the Softmax algorithm significantly outperforms the Epsilon-Greedy algorithm in both metrics: reaching 80% power and achieving higher average rewards, particularly towards the end of the trials.

Now, let’s examine how these algorithms perform when the Minimum Detectable Effect (MDE) —the difference between the best arm and the baseline arms—is much smaller, say 5%.

To evaluate this scenario, we would need to scale up the simulations dramatically: Perform 5000 simulations over 25,000 trials for each of 5-6 parameter settings across both the Epsilon-Greedy and Softmax algorithms. This adds up to a staggering 625 million and750 million simulations, respectovely.

These additional simulations will provide insights into how the algorithms handle tighter performance gaps, testing their efficiency and adaptability under more challenging conditions.

Epsilon Greedy algorithm when the MDE is 5%

Here, we can see that Softmax isn’t even converging to 80% power after 25,000 trials, whereas a few parameter combinations of Epsilon-Greedy manage to reach 80% power much sooner. But this raises an important question: how do we determine the optimal algorithm or parameters within an algorithm to get the best results?

We’ll rely on our limitless wisdom to pick the best parameters.

Nah, just kidding. Never trust instincts alone when it comes to selecting parameters, especially if there’s an automated mechanism available to do it for us. This fallback mechanism is called “Annealing.”

The unreasonable effectiveness of Annealing

Annealing, in the context of MAB algorithms, refers to a strategy where exploration is emphasized early on and gradually reduced as the algorithm learns more about the environment. This allows the algorithm to transition smoothly from exploring to exploiting as it gains confidence in identifying the best arm.

Different algorithms implement annealing in unique ways. For example:

1. Epsilon-Greedy

In Epsilon-Greedy, the epsilon value (probability of exploration) is dynamically adjusted over time based on the number of trials. Early in the process, epsilon is high to encourage exploration, but it decreases logarithmically as trials progress.

2. Softmax

For Softmax, annealing is achieved by adjusting the temperature parameter. Initially, the temperature is high to encourage randomness in arm selection, but it cools down logarithmically as the algorithm learns more.

Don’t believe me? Simulate it yourself and see the magic!

Apparently, even Annealing Softmax fails to converge as quickly as Epsilon-Greedy—an algorithm often considered inferior in its handling of comparable arms. However, let’s not overlook the benefits of annealing. By gradually reducing exploration, Annealing Softmax still manages to stabilize its performance over time , making it a robust choice when the difference between arms is subtle or when a more measured approach to exploration is needed.

Annealing plays a crucial role in refining algorithms to adapt dynamically to the environment, and its impact is undeniable across various contexts.

What did the simulations reveal?

  • Epsilon-Greedy outperformed Softmax in quickly identifying the best arm, especially when the performance gap was large (80%).

  • Annealing helps algorithms adapt by exploring early and exploiting later, improving long-term results.

  • Even simple algorithms like Epsilon-Greedy can be surprisingly effective when fine-tuned.

While annealing helps refine performance, it’s clear that some algorithms still perform better in specific scenarios. This opens the door to exploring more advanced approaches. In the next blogs, we’ll dive into advanced algorithms like UCB and Thompson Sampling , and perhaps explore a bit of the math that drives their effectiveness. Stay tuned for more!


Reference:

Bandit Algorithms for Website Optimization by John Myles White, Publisher(s): O'Reilly Media, Inc. ISBN: 9781449341336


Originally published on LinkedIn.

Read more