Multi-armed bandit (MAB) is a popular framework for decision-making in an uncertain environment. In its classical setup, the algorithm has access to a fixed and finite set of unknown, independent probability distributions or arms. At each time step, having observed the outcomes of all the previous actions, the algorithm chooses one of the K arms and observes an independent sample drawn from the underlying distribution, which may be viewed as a reward. The algorithm's goal is either to maximize the accumulated rewards or to identify the "best" arm in as few samples as possible, for an appropriate definition of "best".
Variants of these classical formulations have been widely studied. Tight lower bounds on associated performance metrics and algorithms matching these lower bounds exactly have been developed assuming that the arm-distributions are either Sub-Gaussian or come from a single parameter exponential family (SPEF), for example, Gaussian with known variance or Bernoulli. However, in practice, the distributions may not belong to these simple classes. Developing lower bounds and optimal algorithms for the general class of distributions largely remained open mainly because of the need for new sets of tools and techniques for the analysis in generality.
In this dissertation, we undertake a detailed study of the MAB problems allowing for all the distributions with a known uniform bound on theirĀ $(1+\epsilon)^{th}$ moments, for some $\epsilon > 0$. This class subsumes a large class of heavy-tailed distributions. We develop a framework with essential tools and concentration inequalities and use it for designing optimal algorithms for 3 different variants of the MAB problem. In this talk, we will look at these optimal algorithms for the classical frameworks - regret minimization and best-arm identification.
We will also investigate the new concentration inequalities that we develop for proving theoretical guarantees of the algorithms. These can be used to construct tight, anytime-valid confidence intervals (CIs) for various statistics, for example, mean, quantile, CVaR, etc., and may be of independent interest.
The above results were obtained in collaborations involving Sandeep Juneja, Wouter M. Koolen and Peter Glynn.
Zoom link: https://us02web.zoom.us/j/9290331190