I use stochastic gradient and stochastic approximation algorithms regularly, and I often find it surprisingly difficult to locate basic results on the convergence of these algorithms, despite multiple books written on the subject. For example, consider a convex function $f: \mathbb{R}^n \to \mathbb{R}$. Denote the set of minima of $f$ by $X^*$. If $X^*$ is not empty, then we know that it is a convex set, but we do not know a priori if $X^*$ consists of a single point. Denote the subdifferential of $f$ at $x$, i.e., the set of subgradients of $f$ at $x$, by $\partial f(x)$. A stochastic subgradient algorithm is an algorithm of the form

$$x_{k+1} = x_k - \gamma_k \tilde g_k, \qquad (1)$$

with $\mathbb{E}[\tilde g_k \,|\, \mathcal{F}_k] \in \partial f(x_k)$, for $\mathcal{F}_k = \sigma(x_0, \tilde g_0, \ldots, \tilde g_{k-1})$, and $\{\gamma_k\}_{k \ge 0}$ a sequence of nonnegative stepsizes. We can rewrite $\tilde g_k = g_k + w_k$, for $g_k := \mathbb{E}[\tilde g_k \,|\, \mathcal{F}_k] \in \partial f(x_k)$, where $\{w_k\}$ is a martingale difference sequence with respect to $\{\mathcal{F}_k\}$, i.e., $\mathbb{E}[w_k \,|\, \mathcal{F}_k] = 0$, a.s., for all $k$. We have the following theorem.
Theorem 1 Suppose that the set $X^*$ of minima of $f$ is nonempty, and that the stochastic subgradients satisfy

$$\mathbb{E}[\|g_k + w_k\|^2 \,|\, \mathcal{F}_k] \le C, \text{ a.s., for all } k \ge 0,$$

for some constant $C$. Moreover, assume that $\sum_{k=0}^{\infty} \gamma_k = \infty$, $\sum_{k=0}^{\infty} \gamma_k^2 < \infty$. Then the sequence of iterates $\{x_k\}$ following (1) converges almost surely to some minimum $x^* \in X^*$.
In various papers and books on stochastic approximations, e.g. [1-4], one can find very general theorems on the convergence of algorithms such as (1), including various additional error terms. Still, trying to apply these general theorems to the simple situation considered here, you might find that you need to make additional assumptions, for example that $f$ is continuously differentiable, that $X^*$ is a single point, and/or that you know somehow a priori that the sequence $\{x_k\}$ is bounded almost surely. The assumptions on $f$ and $X^*$ might not hold here (consider for example $f(x) = \max(\|x\| - 1, 0)$, which is not differentiable and whose set of minima is the whole unit ball), and the extra work required to prove that $\{x_k\}$ is bounded is essentially of the same order as proving the theorem from scratch in this case. This is not to say that Theorem 1 cannot be found in the literature. For example, the proof below adapts one found for a somewhat more specific context in [5]. But in my opinion a good book on stochastic approximation algorithms should reference such a basic and widely applicable theorem more clearly. The proof presented here uses the following useful version of the supermartingale convergence theorem, see [6].
Theorem 2 (Supermartingale Convergence Theorem) Let $\{Y_k\}, \{Z_k\}, \{W_k\}$ be three sequences of random variables and let $\{\mathcal{F}_k\}_{k \ge 0}$ be a filtration (i.e., sigma algebras such that $\mathcal{F}_k \subseteq \mathcal{F}_{k+1}$). Suppose that

- The random variables $Y_k$, $Z_k$ and $W_k$ are nonnegative and $\mathcal{F}_k$-measurable.
- For each $k$, we have $\mathbb{E}[Y_{k+1} \,|\, \mathcal{F}_k] \le Y_k - Z_k + W_k$.
- There holds $\sum_{k=0}^{\infty} W_k < \infty$, a.s.

Then, we have $\sum_{k=0}^{\infty} Z_k < \infty$, a.s., and the sequence $\{Y_k\}$ converges to a nonnegative random variable $Y$, with probability $1$.
Proof: (of the stochastic subgradient convergence theorem) For $y \in \mathbb{R}^n$ and $g_k \in \partial f(x_k)$, we have by definition of a subgradient

$$f(y) \ge f(x_k) + g_k^T (y - x_k).$$

Hence, expanding the square in (1),

$$\|x_{k+1} - y\|^2 = \|x_k - y\|^2 - 2 \gamma_k (g_k + w_k)^T (x_k - y) + \gamma_k^2 \|g_k + w_k\|^2,$$

and taking conditional expectations given $\mathcal{F}_k$ (recall $\mathbb{E}[w_k \,|\, \mathcal{F}_k] = 0$) together with the subgradient inequality,

$$\mathbb{E}[\|x_{k+1} - y\|^2 \,|\, \mathcal{F}_k] \le \|x_k - y\|^2 - 2 \gamma_k (f(x_k) - f(y)) + \gamma_k^2 C.$$

This inequality is fundamental to study the progress made by many subgradient algorithms. We use it first by taking $y = x^*$, for some $x^* \in X^*$, to get

$$\mathbb{E}[\|x_{k+1} - x^*\|^2 \,|\, \mathcal{F}_k] \le \|x_k - x^*\|^2 - 2 \gamma_k (f(x_k) - f(x^*)) + \gamma_k^2 C.$$

Now by the supermartingale convergence theorem, applied with $Y_k = \|x_k - x^*\|^2$, $Z_k = 2 \gamma_k (f(x_k) - f(x^*))$ and $W_k = \gamma_k^2 C$, we deduce that $\sum_{k=0}^{\infty} \gamma_k (f(x_k) - f(x^*)) < \infty$ almost surely, and the sequence $\{\|x_k - x^*\|^2\}$ converges almost surely. In particular, the sequence $\{x_k\}$ is bounded almost surely.
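To make the application of Theorem 2 concrete, the following small Monte Carlo experiment (my own illustration; the one-dimensional example $f(x) = |x|$, $x^* = 0$, and all constants are assumptions) checks empirically, by averaging over many independent runs, that the iterates satisfy $\mathbb{E}[Y_{k+1}] \le \mathbb{E}[Y_k - Z_k] + W_k$ with $Y_k = (x_k - x^*)^2$, $Z_k = 2\gamma_k (f(x_k) - f(x^*))$ and $W_k = \gamma_k^2 C$.

```python
import numpy as np

# One-dimensional example: f(x) = |x|, x* = 0, subgradient sign(x),
# noise w_k ~ N(0, sigma^2), so C = E[(g_k + w_k)^2 | F_k] <= 1 + sigma^2.
rng = np.random.default_rng(1)
sigma, n_runs, n_steps = 0.5, 200_000, 30
C = 1.0 + sigma**2
gamma = lambda k: 1.0 / (k + 1)
mc_tol = 0.05  # slack for Monte Carlo error in the sample averages

x = np.full(n_runs, 2.0)  # many independent runs, all started at x_0 = 2
ok = True
for k in range(n_steps):
    Y = x**2                      # Y_k = (x_k - x*)^2
    Z = 2 * gamma(k) * np.abs(x)  # Z_k = 2 gamma_k (f(x_k) - f(x*))
    W = gamma(k) ** 2 * C         # W_k = gamma_k^2 C
    x = x - gamma(k) * (np.sign(x) + sigma * rng.standard_normal(n_runs))
    # Law of large numbers: the sample mean of Y_{k+1} should not exceed
    # the mean of Y_k - Z_k plus W_k, up to Monte Carlo noise.
    ok = ok and float((x**2).mean()) <= float((Y - Z).mean()) + W + mc_tol
print(ok)
```

For $f(x) = |x|$ the fundamental inequality actually holds with equality whenever $x_k \ne 0$, which is why only a small Monte Carlo slack is needed for the check to pass.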
The first conclusion, combined with $\sum_{k} \gamma_k = \infty$, implies immediately that

$$\liminf_{k \to \infty} f(x_k) = f(x^*),$$

and it only remains to show that $\{x_k\}$ truly converges to one of the minima, rather than just to the set $X^*$. For this, take a countable set of points $\{v_i\}_{i \in \mathbb{N}}$ in $X^*$, dense in $X^*$. For each such point $v_i$, we have as above that $\{\|x_k - v_i\|\}$ converges. Hence with probability $1$, all sequences $\{\|x_k - v_i\|\}$ for $i \in \mathbb{N}$ converge. Since the sequence $\{x_k\}$ is bounded, it has at least one limit point $\bar x$, and by continuity of $f$ and the liminf property above, $\bar x$ can be chosen in $X^*$. This limit point must be unique in view of the convergence of all the sequences $\{\|x_k - v_i\|\}$: given $\epsilon > 0$, pick $v_i \in X^*$ with $\|v_i - \bar x\| \le \epsilon$; passing to the limit along a subsequence converging to $\bar x$ gives $\lim_{k} \|x_k - v_i\| = \|\bar x - v_i\| \le \epsilon$, so that $\limsup_{k} \|x_k - \bar x\| \le 2 \epsilon$. Hence $\{x_k\}$ converges to $\bar x \in X^*$.
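To see why this last step matters, i.e., why convergence to a single point rather than merely to the set $X^*$ is not automatic, one can run iteration (1) on a function with a non-singleton solution set, for instance $f(x) = \max(|x| - 1, 0)$ with $X^* = [-1, 1]$ (an example of my own; the stepsizes and noise level below are arbitrary choices): different noise realizations settle on different points of $X^*$.

```python
import numpy as np

def subgrad(x):
    # A subgradient of f(x) = max(|x| - 1, 0): 0 inside [-1, 1], sign(x) outside.
    return 0.0 if abs(x) <= 1.0 else (1.0 if x > 0 else -1.0)

def run(seed, n_iter=20_000, x0=4.0, noise_std=0.3):
    """One realization of x_{k+1} = x_k - gamma_k (g_k + w_k), gamma_k = 1/(k+1)."""
    rng = np.random.default_rng(seed)
    x = x0
    for k in range(n_iter):
        gamma = 1.0 / (k + 1)
        x = x - gamma * (subgrad(x) + noise_std * rng.standard_normal())
    return float(x)

limits = [run(seed) for seed in range(3)]
print(limits)  # three different limit points, all (approximately) in X* = [-1, 1]
```

Each seed yields a different limit inside $[-1, 1]$, consistent with Theorem 1: the iterates converge, but the limit is a random point of $X^*$.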
[1] A. Benveniste, M. Métivier, P. Priouret, Adaptive Algorithms and Stochastic Approximations. Springer, 1990.
[2] H. J. Kushner and G. G. Yin, Stochastic Approximation and Recursive Algorithms and Applications. Springer, 2nd Edition, 2003.
[3] J. C. Spall, Introduction to Stochastic Search and Optimization. Wiley-Interscience, 2003.
[4] V. S. Borkar, Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press, 2008.
[5] D. P. Bertsekas, A. Nedic, and A. E. Ozdaglar, Convex Analysis and Optimization. Athena Scientific, 2003.
[6] J. Neveu, Discrete Parameter Martingales. North-Holland, 1975.