Abstract:
In the stochastic K-armed bandit framework, we are given K unknown reward distributions, or arms. At each time step, we select one arm and observe a single sample drawn from it. The goal is to maximize the cumulative reward over a finite time horizon, which is equivalent to minimizing the regret. Regret minimization is central to many applications that require sequential decisions under uncertainty, such as clinical trials, recommendation systems, and portfolio selection in financial markets. We will analyze some basic regret-minimizing algorithms, such as explore-then-commit, successive rejects, and upper confidence bound (UCB), and we will also derive an information-theoretic lower bound on the regret.
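For reference, the regret mentioned above can be written as follows (a standard definition; the notation here is assumed: arm i has mean \mu_i, X_t is the reward observed at time t, A_t is the arm played at time t, and T is the horizon):

\[
R_T \;=\; T\mu^* \;-\; \mathbb{E}\!\left[\sum_{t=1}^{T} X_t\right]
\;=\; \mathbb{E}\!\left[\sum_{t=1}^{T} \bigl(\mu^* - \mu_{A_t}\bigr)\right],
\qquad \mu^* = \max_{1 \le i \le K} \mu_i.
\]

Maximizing the expected cumulative reward is thus the same as minimizing R_T, since T\mu^* is a constant that does not depend on the policy.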
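To give a flavor of the index policies discussed, here is a minimal sketch of a UCB1-style algorithm on simulated Bernoulli arms. This is an illustrative sketch, not an implementation from the references; the arm means, the Bernoulli reward model, and all names are assumptions made for the example.

```python
import math
import random


def ucb1(means, horizon, seed=0):
    """Run a UCB1-style policy on Bernoulli arms and return the realized regret.

    The Bernoulli environment and `means` are illustrative assumptions.
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k    # number of pulls of each arm
    totals = [0.0] * k  # cumulative reward of each arm
    reward_sum = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # initialization: pull every arm once
        else:
            # UCB index: empirical mean plus exploration bonus sqrt(2 ln t / n_i)
            arm = max(
                range(k),
                key=lambda i: totals[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
        reward_sum += reward
    return horizon * max(means) - reward_sum  # realized regret vs. best arm


if __name__ == "__main__":
    # The regret should grow roughly logarithmically in the horizon.
    for T in (1_000, 10_000, 100_000):
        print(T, round(ucb1([0.3, 0.5, 0.7], T), 1))
```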
The talk will be based on the following references:
1. Lattimore, Tor, and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020, Chapters 6-7.
2. Kaufmann, Émilie. Contributions to the Optimal Solution of Several Bandit Problems. Habilitation thesis, Université de Lille, 2020, Chapter 1.