Near-Optimal Sample Complexity for MDPs via Anchoring
Journal: Proceedings of Machine Learning Research (ICML 2025)
Volume: 267
Issue: NA
Pages: 32907-32929
Publication type: ISI
Abstract
We study a new model-free algorithm to compute ε-optimal policies for average-reward Markov decision processes in the weakly communicating setting. Given a generative model, our procedure combines a recursive sampling technique with Halpern's anchored iteration, and computes an ε-optimal policy with sample and time complexity Õ(|S||A|‖h*‖²_sp/ε²), both in high probability and in expectation. To our knowledge, this is the best complexity among model-free algorithms, matching the known lower bound up to a factor of ‖h*‖_sp. Although the complexity bound involves the span seminorm ‖h*‖_sp of the unknown bias vector, the algorithm requires no prior knowledge and implements a stopping rule which guarantees with probability 1 that the procedure terminates in finite time. We also analyze how these techniques can be adapted to discounted MDPs.
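To illustrate the anchoring idea, the sketch below applies Halpern's anchored iteration to standard value iteration on a discounted MDP with a known model. This is only a simplified illustration under assumed inputs (a transition tensor P, reward matrix r, discount gamma, and the common anchoring weight 1/(k+2)); it is not the paper's model-free, recursive-sampling procedure or its stopping rule.

```python
import numpy as np

def anchored_value_iteration(P, r, gamma, num_iters=1000):
    """Halpern-anchored value iteration (illustrative sketch only).

    Assumed inputs (not from the paper):
      P: transition kernel, shape (S, A, S)
      r: rewards, shape (S, A)
      gamma: discount factor in (0, 1)
    """
    S, A, _ = P.shape
    v0 = np.zeros(S)      # anchor point
    v = v0.copy()
    for k in range(num_iters):
        q = r + gamma * (P @ v)             # Bellman backup Q(s, a)
        Tv = q.max(axis=1)                  # greedy value update
        beta = 1.0 / (k + 2)                # anchoring weight (assumed schedule)
        v = beta * v0 + (1.0 - beta) * Tv   # Halpern (anchored) iteration
    policy = (r + gamma * (P @ v)).argmax(axis=1)
    return v, policy
```

In contrast to plain value iteration, each step mixes the Bellman update with the fixed anchor v0, which is the mechanism the paper combines with recursive sampling in the generative-model setting.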
