The exploration-exploitation dilemma in Multiarm bandit problem


Let’s say you are in a casino and there are 4 slot machines.When played each slot machine gives you a reward of 1 or 0. The win percentage of each of the slot machines are different and they are not known. The results are truly random. The goal is to choose the slot machine with the highest win rate. The only way you could discover their relative win capabilities is by playing each these slot machines and collecting data, but this naive approach has its downside. In the process of collecting the data you would spend lot of time playing the less optimal slot machines. There is a an exploration-exploitation dilemma here.

Exploration finds more information about the environment and exploitation uses the already known information to maximize the reward. So your approach needs to include both maximizing your earning and also in the process find out the slot machine with the maximum win rate.

A number called epsilon is used which represents the probability of exploration, Here is how epsilon is used to choose between explorating a random slot machine and the machine with highest win rate at that point in time.

p = generate_random_number()
if p > e:
  pull a random arm 
  pull the arm with best win rate at this point

In the above process you’ll eventually get to know the arm with the best win rate. This is also called as the Multiarm bandit problem. There are numerous ways of approaching this problem. In next few blog posts will walk through some of the solution for a Multiarm bandit problem.