1 Introduction
We analyze some recent results on the use of Mirror Descent (MD) and Optimistic Mirror Descent (OMD) [1], which have recently been studied extensively for alleviating convergence issues in the training of generative adversarial networks [3, 1]. In particular, these papers consider the following general saddle-point (SP) problem:
(1) $\min_{x_1 \in \mathcal{X}_1} \max_{x_2 \in \mathcal{X}_2} f(x_1, x_2)$,
where $\mathcal{X}_1, \mathcal{X}_2$ are compact convex subsets of a finite-dimensional normed space $(\mathcal{V}, \|\cdot\|)$, $\mathcal{X} = \mathcal{X}_1 \times \mathcal{X}_2$, and $f$ is continuously differentiable. Let $x = (x_1, x_2)$ and let $\|\cdot\|_*$ be the dual of $\|\cdot\|$. Define $g \colon \mathcal{X} \to \mathcal{V}^*$ as $g(x) = \bigl(\nabla_{x_1} f(x_1, x_2),\, -\nabla_{x_2} f(x_1, x_2)\bigr)$.
It is well-known that if $x^*$ is a solution to (1), then it satisfies the Stampacchia Variational Inequality (SVI) [4], i.e., $\langle g(x^*), x - x^* \rangle \ge 0$ for all $x \in \mathcal{X}$.
When the function $f$ is convex-concave, it is also well-known that the point $x^*$ also satisfies the Minty Variational Inequality (MVI) [4], i.e., $\langle g(x), x - x^* \rangle \ge 0$ for all $x \in \mathcal{X}$.
For convex-concave problems, it can be shown that these three conditions, namely, $x^*$ is an optimal point of (1), $x^*$ satisfies (SVI), and $x^*$ satisfies (MVI), are all equivalent. The interplay of these two VIs in characterizing the optimal set of solutions has been investigated extensively. We focus here on the notion of coherence proposed in [2, 1], which is assumed to be satisfied by problem (1) and which is, in some sense, the weakest set of conditions considered for global optimality, going beyond convexity, pseudo-monotonicity, and quasi-convexity [2].
In order to describe MD and OMD, one needs to define the notion of Bregman Divergence (BD) with respect to a differentiable and $K$-strongly convex function $h$ whose domain includes the set $\mathcal{X}$. There are several equivalent definitions of strong convexity; here we provide the one that will directly be used in the proof later:
(2) $h\bigl(t x + (1-t) y\bigr) \le t\,h(x) + (1-t)\,h(y) - \tfrac{K}{2}\, t(1-t)\, \|x - y\|^2$ for all $x, y \in \mathcal{X}$ and $t \in [0, 1]$.
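For example, for the Euclidean regularizer $h(x) = \tfrac12 \|x\|_2^2$, a direct expansion shows that (2) holds with $K = 1$:

```latex
% Expanding \|tx + (1-t)y\|_2^2 and collecting terms gives exactly the gap in (2):
t\,h(x) + (1-t)\,h(y) - h\bigl(t x + (1-t) y\bigr)
  = \tfrac{1}{2}\, t(1-t)\,\lVert x - y \rVert_2^2 ,
% so h(x) = \tfrac12 \|x\|_2^2 is 1-strongly convex with respect to \|\cdot\|_2.
```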
We further assume that $\nabla h$ is $L$-Lipschitz, which is needed in the proof. The BD is defined as
(3) $D(p, x) = h(p) - h(x) - \langle \nabla h(x), p - x \rangle$.
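As a quick numerical check of definition (3) (our own illustration, not code from [1]), the snippet below evaluates the Bregman divergence for two standard choices of $h$: the squared Euclidean norm, which recovers half the squared Euclidean distance, and the negative entropy on the simplex, which recovers the Kullback–Leibler divergence.

```python
import numpy as np

def bregman(h, grad_h, p, x):
    """Bregman divergence D(p, x) = h(p) - h(x) - <grad h(x), p - x>."""
    return h(p) - h(x) - grad_h(x) @ (p - x)

# h(x) = 0.5 ||x||^2  ->  D(p, x) = 0.5 ||p - x||^2
sq = lambda x: 0.5 * x @ x
grad_sq = lambda x: x

# h(x) = sum_i x_i log x_i (negative entropy, on the simplex)
#   ->  D(p, x) = sum_i p_i log(p_i / x_i)  (the KL divergence)
negent = lambda x: np.sum(x * np.log(x))
grad_negent = lambda x: np.log(x) + 1.0

p = np.array([0.2, 0.3, 0.5])   # both points lie on the probability simplex
x = np.array([0.4, 0.4, 0.2])

d_euclid = bregman(sq, grad_sq, p, x)
d_kl = bregman(negent, grad_negent, p, x)
```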
The Bregman divergence enjoys a number of properties that are critical to the success of MD and OMD; we refer the reader to the Appendices in [1], where they are proposed and derived. Given a point $x \in \mathcal{X}$ and a vector $v \in \mathcal{V}^*$, define the following Bregman projection operator via
(4) $P_x(v) = \arg\min_{x' \in \mathcal{X}} \{ \langle v, x - x' \rangle + D(x', x) \}$.
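For concreteness, here is a small sketch of the Bregman projection for the entropic regularizer on the probability simplex, where it admits a closed form. This is our own illustration; we assume the usual sign convention $P_x(v) = \arg\max_{x'} \{\langle v, x' \rangle - D(x', x)\}$, under which $P_x(v)_i \propto x_i e^{v_i}$.

```python
import numpy as np

def prox_entropic(x, v):
    """Bregman projection P_x(v) on the simplex for h(x) = sum_i x_i log x_i.
    Closed form: P_x(v)_i proportional to x_i * exp(v_i)."""
    w = x * np.exp(v - v.max())   # shift by max(v) for numerical stability
    return w / w.sum()
```

A mirror-descent step with step-size `gamma` and gradient `g` then reads `x_next = prox_entropic(x, -gamma * g)`, i.e., a multiplicative-weights update.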
In the following, we will assume that such an $h$ is given and fixed throughout.
1.1 Variational Inequalities and Coherence
The following definition of coherence is provided in [1, Definition 2.1], where we explicitly define what is meant by $x$ being sufficiently close to $x^*$ in Condition 3; this definition will be used in the proofs presented in Section 4.
Definition 1.
We say that (SP) is coherent if

Every solution of (SVI) also solves (SP).

There exists a solution of (SP) that satisfies (MVI).

Every solution $x^*$ of (SP) satisfies (MVI) locally. Specifically, for some fixed $\delta > 0$, $\langle g(x), x - x^* \rangle \ge 0$ for all $x$ such that $D(x^*, x) \le \delta$.
If, moreover, (MVI) holds as a strict inequality in Condition 2 when $x$ is not a solution of (SP), then (SP) will be called strictly coherent; by contrast, if (MVI) holds as an equality in Condition 2 for all $x \in \mathcal{X}$, we will say that (SP) is null-coherent.
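To make the last case concrete, a standard bilinear example from this literature is null-coherent; the computation below is our own, with the sign convention $g(x) = (\nabla_{x_1} f, -\nabla_{x_2} f)$.

```latex
% The bilinear problem
\min_{x_1 \in [-1,1]} \max_{x_2 \in [-1,1]} \; f(x_1, x_2) = x_1 x_2
% has the unique saddle point x^* = (0, 0). Here
g(x) = \bigl(\partial_{x_1} f,\; -\partial_{x_2} f\bigr) = (x_2,\; -x_1),
\qquad
\langle g(x),\, x - x^* \rangle = x_2 x_1 - x_1 x_2 = 0
\quad \text{for all } x,
% so (MVI) holds with equality everywhere: the problem is null-coherent.
```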
1.2 Mirror Descent Algorithms
Under stochastic gradients, the MD and OMD algorithms are defined below, along with standard assumptions on the expected values and the variance of the gradient estimates. For all the probabilistic statements in this article, we consider the probability space $(\Omega, \mathcal{F}, \mathbb{P})$ and let $\mathbb{E}$ denote the expectation with respect to $\mathbb{P}$.
Mirror Descent (MD): The vanilla mirror descent (MD) algorithm is defined as
(5) $X_{t+1} = P_{X_t}(-\gamma\, \hat g_t)$,
where $\gamma > 0$ is some constant step-size and the stochastic gradient $\hat g_t$ satisfies
(6) $\mathbb{E}[\hat g_t \mid \mathcal{F}_t] = g(X_t)$ and $\mathbb{E}\bigl[\|\hat g_t - g(X_t)\|_*^2 \mid \mathcal{F}_t\bigr] \le \sigma^2$,
with $\mathcal{F}_t$ the $\sigma$-algebra generated by $(X_1, \ldots, X_t)$. Note that with the assumption in (6), we have that there exists some constant $M$ such that
(7) $\mathbb{E}\bigl[\|\hat g_t\|_*^2 \mid \mathcal{F}_t\bigr] \le M^2$,
since a continuous function ($g$) is bounded on a compact set ($\mathcal{X}$).
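As a quick numerical illustration of the convergence issues that motivate OMD, consider the unconstrained Euclidean specialization of (5) with exact gradients, i.e., simultaneous gradient descent/ascent, on the bilinear toy problem $f(x, y) = xy$. This is our own sketch, not code from [1].

```python
import numpy as np

# Bilinear toy problem f(x, y) = x * y with gradient field g(x, y) = (y, -x).
# Vanilla MD with h = 0.5 ||.||^2 and exact gradients is simultaneous
# descent (in x) / ascent (in y).
gamma = 0.1
z = np.array([1.0, 0.0])               # initial point (x_0, y_0)
norms = [np.linalg.norm(z)]            # distance to the saddle point (0, 0)
for _ in range(100):
    g = np.array([z[1], -z[0]])        # exact gradient field at z
    z = z - gamma * g                  # one Euclidean MD step
    norms.append(np.linalg.norm(z))
```

Each step multiplies the squared distance to the saddle point by $1 + \gamma^2$, so the iterates spiral outward instead of converging.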
Optimistic Mirror Descent (OMD): The optimistic mirror descent (OMD) algorithm is defined as
(8) $X_{t+1/2} = P_{X_t}(-\gamma\, \hat g_t), \qquad X_{t+1} = P_{X_t}(-\gamma\, \hat g_{t+1/2})$,
with the assumption that
(9) $\mathbb{E}[\hat g_{t+1/2} \mid \mathcal{F}_{t+1/2}] = g(X_{t+1/2})$ and $\mathbb{E}\bigl[\|\hat g_{t+1/2} - g(X_{t+1/2})\|_*^2 \mid \mathcal{F}_{t+1/2}\bigr] \le \sigma^2$,
where $\mathcal{F}_{t+1/2}$ is the $\sigma$-algebra generated by $(X_1, \ldots, X_t, X_{t+1/2})$. Similarly, there exists some finite constant $M$ such that
(10) $\mathbb{E}\bigl[\|\hat g_{t+1/2}\|_*^2 \mid \mathcal{F}_{t+1/2}\bigr] \le M^2$.
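By contrast, the exact-gradient, unconstrained Euclidean specialization of the two-step update (8) (an extra-gradient-style scheme) contracts toward the saddle point on the same bilinear toy problem $f(x, y) = xy$; again, a sketch of ours rather than code from [1].

```python
import numpy as np

gamma = 0.1
g = lambda z: np.array([z[1], -z[0]])   # gradient field of f(x, y) = x * y
z = np.array([1.0, 0.0])                # initial point (x_0, y_0)
for _ in range(1000):
    z_half = z - gamma * g(z)           # leading ("optimistic") half-step
    z = z - gamma * g(z_half)           # update from the base point X_t
final_dist = np.linalg.norm(z)          # distance to the saddle point (0, 0)
```

Here each step scales the squared distance by $1 - \gamma^2 + \gamma^4 < 1$, illustrating the benefit of the extra half-step over the vanilla update.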
2 Main Results
We are now ready to restate the main results from [1] with several important corrections.
First, for coherent problems and OMD with exact (non-stochastic) gradients, it is claimed in [1] that $D(x^*, X_t)$ is monotonically decreasing for some saddle point $x^*$. However, as we fill in the missing details of the proof, we believe that monotone decrease is guaranteed only after some sufficiently large number of iterations; our modified statement is given in Theorem 1.
Theorem 1 (Modified from [1, Theorem 4.1]).
Next, for strictly coherent problems, almost-sure convergence is claimed in [1] for both MD and OMD with stochastic gradients. However, as we fill in the missing details of their proofs, we believe that only convergence with high probability can be guaranteed; our modified theorem statements for MD and OMD are given in Theorem 2(a) and Theorem 3, respectively. Moreover, for null-coherent problems, it is claimed in [1] that the sequence $D(x^*, X_t)$ is non-decreasing for all saddle points $x^*$, whereas we believe that this is true only for the $x^*$'s that satisfy Condition 2 in Definition 1; our modified statement is given in Theorem 2(b).
Theorem 2 (Modified from [1, Theorem 3.1]).
Remark 1.
Theorem 3 (Modified from [1, Theorem 4.3]).
3 Conclusions and Future Work
In an attempt to understand the recent body of work on MD/OMD dynamics for saddle-point problems, we have provided in this article more rigorous and corrected statements of the claims in [1]. As part of future work, we aim to shed light on the rates of convergence of MD and OMD under coherence assumptions. In this context, we aim to build upon the analysis conducted in [5].
References
 [1] P. Mertikopoulos, B. Lecouat, H. Zenati, C.-S. Foo, V. Chandrasekhar, and G. Piliouras, “Optimistic mirror descent in saddle-point problems: Going the extra (gradient) mile,” in International Conference on Learning Representations (ICLR 2019), 2019.
 [2] Z. Zhou, P. Mertikopoulos, N. Bambos, S. Boyd, and P. Glynn, “On the convergence of mirror descent beyond stochastic convex programming,” arXiv preprint arXiv:1706.05681, 2017.
 [3] C. Daskalakis, A. Ilyas, V. Syrgkanis, and H. Zeng, “Training GANs with optimism,” in International Conference on Learning Representations (ICLR 2018), 2018.
 [4] R. Ferrentino, “Variational inequalities and optimization problems,” Applied Mathematical Sciences, vol. 1, pp. 2327–2343, 2007.
 [5] P. Mertikopoulos and M. Staudigl, “Stochastic mirror descent dynamics and their convergence in monotone variational inequalities,” Journal of Optimization Theory and Applications, vol. 179, no. 3, pp. 838–867, 2018.
4 Appendix
4.1 Proof of Theorem 1
First, we restate [1, Lemma D.1].
Lemma 1.
Suppose that (SP) is coherent and $g$ is Lipschitz. For any saddle point $x^*$, the iterates of (OMD) with exact gradients satisfy
(14) 
If, moreover, $x^*$ is (one of) the special saddle points that satisfy (MVI) globally, then
(15) 
Proof.
Next, we add a result that is similar to [1, Proposition B.4(a)].
Lemma 2.
Let $h$ be a strongly convex distance-generating function on $\mathcal{X}$ and further assume that $\nabla h$ is Lipschitz. Then, for any $x, y \in \mathcal{X}$, we have
Proof.
Let and . By [1, Lemma B.1(b)(c)], we have
Letting and and adding the two inequalities:
Note that the LHS of the above is lower bounded by
by strong convexity of . The RHS of the above is upper bounded by
by Cauchy–Schwarz and the Lipschitz property of $\nabla h$. Combining the two, we have
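For reference, the two bounds combine into the following symmetric sandwich; $K$ and $L$ are our labels for the strong-convexity and Lipschitz moduli assumed above, and the middle equality follows directly from definition (3).

```latex
K \,\lVert x - y \rVert^2
  \;\le\; D(x, y) + D(y, x)
  \;=\; \langle \nabla h(x) - \nabla h(y),\, x - y \rangle
  \;\le\; L \,\lVert x - y \rVert^2 .
```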
∎
We are now ready to prove Theorem 1.
Proof.
Let $x^*$ be a saddle point of (SP) that satisfies (MVI) globally. Such an $x^*$ exists by the definition of coherence. By (15) in Lemma 1, we have
Since , there exists such that for all . Then with this , we have
Telescoping the above, we have
Rearranging the above, we have
where the last inequality follows by the nonnegativity of the Bregman divergence. Taking the limit as on both sides of the above inequality, we have , which implies that .
Next by compactness of , we have that has a convergent subsequence such that . We show in the following that, in fact, . First notice that
Moreover, suppose that , and so as well. Then
where the last equality follows by
since for all and , taking the limit as on both sides of the resulting inequality, we have that . In the above, step follows by [1, Proposition B.4(a)] and Lemma 2. The above shows that . By [1, (B.7)], we have
That is, for all , which is (SVI). By definition of coherence, must be a saddle point of (SP).
So far, we have proved that . Next, we want to show that . As before, fix an such that . Then (14) in Lemma 1 gives
Telescoping the above, we have
where the choice of is as follows: let with being arbitrary but fixed,

Choose sufficiently large such that for all ; such a choice is possible since we have proved that as ;

Choose sufficiently large such that for all ; such a choice is possible since, by and the Bregman reciprocity condition, we have as ;

Choose such that ; such a choice is possible since we have proved that .
With such a choice of , we will prove that for all .
First we show that for all ,
(16) 
To see this, in [1, Lemma B.2], letting , we have
where the first inequality follows by Cauchy–Schwarz and the Lipschitz property of $\nabla h$, and the second inequality follows by our choice of .
Now starting with , we have
Since , by (16) we have . By our modified Condition 3 in the definition of coherence, we have , which implies
Using (16) again, the above implies and hence . Therefore,
Continuing this procedure, we can show that for all , we have . Since can be chosen arbitrarily close to zero (by choosing arbitrarily close to zero), we have proved that for all , there exists an such that for all ; hence . By the Bregman reciprocity condition, we have . ∎
4.2 Proof of Theorem 2
Proof.

The same technique used to prove (ii) and (iii) in Theorem 3 can also be used to prove the convergence of the mirror descent (MD) algorithm; we omit the details here.

Note that there is a typo in the proof of [1, Theorem 3.1(b)] that can be quite confusing: in [1, (C.14)], the plus sign before the last inner product should be a minus sign. To see how [1, (C.14)] (with the corrected sign) is obtained, we first recall that $h$ is proper, convex, and closed on $\mathcal{X}$. Therefore, we have $h^{**} = h$, where $h^*$ is the convex conjugate of $h$. By [1, (B.5), (B.6b)], MD (5) can be written as
It follows that
hence
(17) Now applying [1, Lemma B.2] with (a saddle point that satisfies Condition 2 of Definition 1), , and , we have
where the last equality follows by (17). Taking expectation on both sides,
where step follows by , since is measurable and is an unbiased estimate conditioned on by (6), and step follows by the definition of null-coherence. This shows that the sequence is non-decreasing for the special saddle points that satisfy the global (MVI).
∎
4.3 Proof of Theorem 3
Proof.
We first include a result from [1, Proposition B.4(b)], which will be used frequently in the proof of the theorem.
Lemma 3 ([1, Proposition B.4(b)]).
We are now ready to prove the theorem. The proof contains three steps:

Show that
(18) 
Let denote a subsequence of . Show that
(19) 
Show that (13) holds.
Let denote a subsequence of , and notice that (i) and (ii) imply that
(20) 
To see this, denote the events considered in (18), (19), and (20) by , , and , respectively. For any ,
Letting on both sides of the resulting inequality, we have , and thus . To recap, we have shown that implies , which implies , and so ; this proves (20). We will see later that (20) will be used to prove (iii).
We now show (i). Let be one of the special saddle points that satisfy global (MVI). Define