Abstract:
The K-armed bandit problem is a sequential decision-making problem in which one repeatedly samples from a given set of K probability distributions (belonging to a known family), informally called the 'arms' of the bandit. The goal is to minimize the regret: the cumulative opportunity cost of not always selecting the arm with the highest expected reward. We shall look at the Kullback-Leibler Upper Confidence Bound (KL-UCB) algorithm for regret minimization in K-armed bandits, and see how it asymptotically matches the lower bound on expected regret for this problem.
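For concreteness, the regret, the KL-UCB index, and the lower bound in question admit a standard formulation; the notation below (arm means \(\mu_a\), pull counts \(N_a(T)\), gaps \(\Delta_a\), and the exploration constant \(c\)) is assumed for illustration and is not taken from the abstract itself.

% Assumed notation (not from the abstract): K arms with means mu_1, ..., mu_K;
% mu^* = max_a mu_a; A_t is the arm pulled at round t; N_a(T) counts pulls of
% arm a up to round T; Delta_a = mu^* - mu_a is the suboptimality gap.
\[
  R_T \;=\; T\mu^{*} - \mathbb{E}\Big[\sum_{t=1}^{T} \mu_{A_t}\Big]
       \;=\; \sum_{a=1}^{K} \Delta_a \, \mathbb{E}[N_a(T)].
\]
% At round t, KL-UCB pulls the arm maximizing the upper-confidence index
\[
  U_a(t) \;=\; \sup\Big\{ q \;:\; N_a(t)\,\mathrm{kl}\big(\hat{\mu}_a(t),\, q\big) \;\le\; \log t + c \log\log t \Big\},
\]
% where hat{mu}_a(t) is the empirical mean of arm a and kl(p, q) denotes the
% KL divergence between the distributions in the known family with means p, q.
% The asymptotic (Lai-Robbins) lower bound that KL-UCB matches reads
\[
  \liminf_{T \to \infty} \frac{R_T}{\log T} \;\ge\; \sum_{a :\, \Delta_a > 0} \frac{\Delta_a}{\mathrm{kl}(\mu_a,\, \mu^{*})}.
\]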