Optimal-order convergence of Nesterov acceleration for linear ill-posed problems

We show that Nesterov acceleration is an optimal-order iterative regularization method for linear ill-posed problems, provided that a parameter is chosen according to the smoothness of the solution. This result is proven both for an a priori stopping rule and for the discrepancy principle. The essential tool to obtain this result is a representation of the residual polynomials via Gegenbauer polynomials.


Introduction
One option to calculate a regularized solution of an ill-posed operator equation Ax = y^δ, with A : X → Y and X, Y Hilbert spaces, is to employ iterative regularization schemes, where approximate solutions x_k^δ are computed iteratively and combined with a stopping rule as regularization parameter choice. The simplest such scheme is Landweber iteration (cf., e.g., [4]), which has the downside of being rather slow. To speed up convergence, acceleration schemes may be used, such as the following Nesterov acceleration:

z_k^δ = x_k^δ + α_k (x_k^δ − x_{k−1}^δ),
x_{k+1}^δ = z_k^δ − A^*(A z_k^δ − y^δ),    (1)

where ‖A^*A‖ ≤ 1 is assumed and where the sequence α_k is chosen, for instance, as

α_k = (k−1)/(k+β).    (2)

Here β is a parameter; common choices are, for example, β = 1 or β = 2. We remark that other alternatives for the sequence α_k are possible as well, but for the main analysis of this paper we only consider (2). This iteration (in a general nonlinear context) was suggested by Yurii Nesterov for general convex optimization problems [9]. It is an instance of a method that achieves the best rate of convergence (in the sense of objective function decrease) that is generally possible for a first-order method. Nesterov acceleration can be employed to speed up the convergence of gradient methods in nonlinear or convex optimization. A particularly successful instance is the FISTA algorithm of Beck and Teboulle [2] for nondifferentiable convex optimization.
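For concreteness, the iteration (1) with the sequence (2) can be sketched in a few lines of NumPy. This is an illustrative implementation only; the starting convention (x_0 = 0 with a plain Landweber step producing x_1, consistent with α_1 = 0) is our assumption:

```python
import numpy as np

def nesterov_landweber(A, y, beta=1.0, n_iter=100):
    """Sketch of the Nesterov-accelerated Landweber iteration (1) with (2).

    Assumes ||A^T A|| <= 1 (unit step size). Convention: x_0 = 0 and the
    first step is a plain Landweber step, consistent with alpha_1 = 0.
    """
    x_prev = np.zeros(A.shape[1])             # x_0 = 0
    x = x_prev + A.T @ (y - A @ x_prev)       # x_1: plain Landweber step
    for k in range(1, n_iter):
        alpha_k = (k - 1) / (k + beta)        # alpha_k = (k-1)/(k+beta)
        z = x + alpha_k * (x - x_prev)        # extrapolation step z_k
        x_prev, x = x, z + A.T @ (y - A @ z)  # gradient step -> x_{k+1}
    return x                                  # returns x_{n_iter}
```

The unit step size in the gradient step is admissible precisely because ‖A^*A‖ ≤ 1 is assumed.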
In the realm of ill-posed problems, Hubmer and Ramlau [8] performed a convergence analysis for nonlinear problems and demonstrated the efficiency of the method.
The background and main motivation of the present article is the recent interesting analysis of Neubauer [10] for ill-posed problems in the linear case. He showed that (1) is an iterative regularization scheme and, more importantly, proved convergence rates, which are of optimal order only for a priori parameter choices and only in the case of low smoothness of the solution, and suboptimal otherwise. What is puzzling is that the method shows a quite unusual "semi-saturation" phenomenon (we explain this term below in Section 3.1).
Our contribution in this article is twofold: First, we prove a formula for the residuals of the iteration (1) involving Gegenbauer polynomials. On this basis, we build a convergence rate analysis that improves and extends the results of Neubauer. In particular, we show that the method can always be made an optimal-order method if the parameter β is chosen according to the (Hölder-)smoothness index of the solution. This result holds both for an a priori stopping rule and for the discrepancy principle.
Our analysis also explains the quite nebulous role that this parameter plays in the iteration, as it is related to the index of the orthogonal polynomials appearing in the residual formula.
Moreover, the above mentioned residual representation also clearly elucidates the semi-saturation phenomenon because the iteration can be interpreted as a mixture of a saturating iteration (Brakhage's ν-method) and a non-saturating one (Landweber method).
In the following we employ some standard notation of regularization theory as in [4]: δ with ‖Ax^† − y^δ‖ ≤ δ is the noise level, and x^† denotes the minimum-norm solution of the operator equation Ax = y with exact data y = Ax^†. The superscript δ of y^δ indicates noisy data; analogously, x_k^δ denotes the iterates of (1) with noisy data y^δ, while the lack of a δ indicates exact data: x_k denotes the iteration (1) with exact data y in place of y^δ.

Residual polynomials for Nesterov acceleration
Our work follows the general theory of spectral filter-based regularization methods as in [4], where the convergence analysis results from estimates of the corresponding filter functions. The first main result, Theorem 1, is quite useful for this purpose as it represents the residual function in terms of known polynomials. The iteration (1) is a Krylov-space method, and the residual can be expressed as y^δ − A x_k^δ =: r_k(AA^*) y^δ with the residual polynomials satisfying the recurrence relation (cf. [10])

r_{k+1}(λ) = (1 − λ) [(1 + α_k) r_k(λ) − α_k r_{k−1}(λ)],   r_0(λ) = 1,   r_1(λ) = 1 − λ.    (3)

This is a simple consequence of the definition in (1). The k-th iterate can be expressed via spectral filter functions g_k as x_k^δ = g_k(A^*A) A^* y^δ, with r_k(λ) = 1 − λ g_k(λ). Observe that the three-term recursion (3) is not of the form required to apply Favard's theorem [5]; hence the r_k do not form a sequence of orthogonal polynomials with respect to some weight function. (Note that Favard's theorem fully characterizes the three-term recurrence relations that lead to orthogonal polynomials.)
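As a quick sanity check, the recurrence for the residual polynomials can be compared against the residuals of the actual iteration on a small diagonal test operator. The operator, data, and parameter values below are arbitrary illustrations; the starting convention x_0 = 0, x_1 = A^T y^δ is our assumption:

```python
import numpy as np

def residuals_from_recurrence(lam, beta, K):
    # r_0 = 1, r_1 = 1 - lam;
    # r_{k+1} = (1 - lam) * ((1 + a_k) r_k - a_k r_{k-1}),  a_k = (k-1)/(k+beta)
    r = [np.ones_like(lam), 1.0 - lam]
    for k in range(1, K):
        a_k = (k - 1) / (k + beta)
        r.append((1.0 - lam) * ((1.0 + a_k) * r[k] - a_k * r[k - 1]))
    return r

def residuals_from_iteration(s, y, beta, K):
    # runs iteration (1) for A = diag(s) and records y - A x_k, k = 0, ..., K
    A = np.diag(s)
    x_prev = np.zeros_like(y)          # x_0 = 0
    x = A.T @ y                        # x_1 (plain Landweber step, alpha_1 = 0)
    res = [y.copy(), y - A @ x]
    for k in range(1, K):
        a_k = (k - 1) / (k + beta)
        z = x + a_k * (x - x_prev)
        x_prev, x = x, z + A.T @ (y - A @ z)
        res.append(y - A @ x)
    return res
```

For a diagonal A, the eigenvalues of AA^* are s², so the k-th residual should equal r_k(s²) applied componentwise to y.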
Before we proceed, we compare the residual polynomials with other well-known cases. For classical Landweber iteration [4], which is obtained by setting α_k = 0 and thus z_k^δ = x_k^δ, the corresponding residual functions are r_k(λ) = (1 − λ)^k. On the other hand, another class of well-known iteration methods for ill-posed problems that are based on orthogonal polynomials are the two-step semiiterative methods [6]. They have the form

x_{k+1}^δ = x_k^δ + µ_k (x_k^δ − x_{k−1}^δ) + ω_k A^*(y^δ − A x_k^δ),

where µ_k and ω_k are appropriately chosen sequences. The corresponding residual functions satisfy the recurrence relation

r_{k+1}(λ) = (1 + µ_k − ω_k λ) r_k(λ) − µ_k r_{k−1}(λ),

and thus the r_k(λ) form a sequence of orthogonal polynomials. Of special interest in ill-posed problems are the ν-methods of Brakhage [3,6], defined by certain sequences µ_k, ω_k for k > 1, the initial values x_0 = 0 and x_1 = (4ν+2)/(4ν+1) A^* y^δ, and with ν > 0 a user-selected parameter. The associated residual polynomials are given in terms of the Gegenbauer polynomials C_n^{(α)} (aka ultraspherical polynomials); cf. [1].
We now obtain the corresponding representation for the Nesterov residual polynomials, which is the basis of this article.
Theorem 1. The residual polynomials (3) of the iteration (1) with the sequence (2) and β > −1 have the representation

r_k(λ) = (1 − λ)^{(k+1)/2} C_{k−1}^{((β+1)/2)}(√(1−λ)) / C_{k−1}^{((β+1)/2)}(1),   k ≥ 1,    (5)

with the Gegenbauer polynomials C_n^{((β+1)/2)}.

Proof. Setting h_k(λ) := (1 − λ)^{−k/2} r_k(λ), the recurrence (3) becomes

h_{k+1}(λ) = (1 + α_k) √(1−λ) h_k(λ) − α_k h_{k−1}(λ),   h_0(λ) = 1,   h_1(λ) = √(1−λ).    (6)

We note that the C_n^{((β+1)/2)}(x) satisfy the recursion relation (cf. [1, p. 782])

(n+1) C_{n+1}^{((β+1)/2)}(x) = (2n+β+1) x C_n^{((β+1)/2)}(x) − (n+β) C_{n−1}^{((β+1)/2)}(x).    (7)

Using the recurrence relation with x = 1 leads to (n+1) C_{n+1}^{((β+1)/2)}(1) = (2n+β+1) C_n^{((β+1)/2)}(1) − (n+β) C_{n−1}^{((β+1)/2)}(1). Dividing (7) by C_{n+1}^{((β+1)/2)}(1) and using this relation yields, for the normalized polynomials g_n(x) := C_n^{((β+1)/2)}(x)/C_n^{((β+1)/2)}(1),

g_{n+1}(x) = (1 + α_{n+1}) x g_n(x) − α_{n+1} g_{n−1}(x).    (8)

By induction (or by well-known formulae [1,11]), it can easily be verified that √(1−λ) g_{k−1}(√(1−λ)) satisfies the same recursion (6) as h_k(λ), and the corresponding initial values for k = 1, 2 agree when setting x = √(1−λ). This allows us to conclude that h_k(λ) = √(1−λ) g_{k−1}(√(1−λ)), which proves the theorem.
This theorem relates the residual function of Nesterov acceleration to other known iterations. In particular, the residual r_k is roughly the product of the residual of k/2 Landweber iterations and that of k/2 iterations of a ν-method with ν = (β+1)/4.

Remark 1. Gegenbauer polynomials are special cases of Jacobi polynomials and themselves embrace several other orthogonal polynomials as special cases. Certain values of β in (1) lead to various specializations in (5): the choice β = 0 leads to Legendre polynomials, and the often encountered choice β = 1 leads to Chebyshev polynomials of the second kind [1].
We note that the result of Theorem 1 even holds for β = −1. In this case, only α_1 is not well-defined, while it equals 0 for all β > −1. Thus, we may extend the definition of the iteration to β = −1 by setting x_k^δ := lim_{β→−1} x_k^δ. (This just amounts to slightly modifying (2) by setting α_1 = 0 for k = 1; the remaining iteration is well-defined by (1) and (2).) In this case, we may use [1, Eq. 22.5.28] to conclude that the resulting polynomials are Chebyshev polynomials of the first kind.
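The Gegenbauer representation of Theorem 1 — a Landweber-type factor (1−λ)^{(k+1)/2} times the normalized Gegenbauer polynomial of degree k−1 and index (β+1)/2 — can also be checked numerically with SciPy's `eval_gegenbauer`. The sample points and β values below are arbitrary illustrations (β = −1 is excluded, since the Gegenbauer index degenerates there):

```python
import numpy as np
from scipy.special import eval_gegenbauer

def residual_recurrence(lam, beta, K):
    # residual polynomials from the three-term recurrence,
    # r_0 = 1, r_1 = 1 - lam, a_k = (k-1)/(k+beta)
    r = [np.ones_like(lam), 1.0 - lam]
    for k in range(1, K):
        a_k = (k - 1) / (k + beta)
        r.append((1.0 - lam) * ((1.0 + a_k) * r[k] - a_k * r[k - 1]))
    return r

def residual_gegenbauer(lam, beta, k):
    # product form: Landweber-type factor times normalized Gegenbauer factor
    a = (beta + 1.0) / 2.0
    x = np.sqrt(1.0 - lam)
    return ((1.0 - lam) ** ((k + 1) / 2.0)
            * eval_gegenbauer(k - 1, a, x) / eval_gegenbauer(k - 1, a, 1.0))
```

For β = 1 the Gegenbauer factor reduces to normalized Chebyshev polynomials of the second kind, and for β = 0 to Legendre polynomials, matching Remark 1.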
Before we proceed with the convergence analysis, we state for generality the corresponding theorem for Nesterov iteration with a general sequence α k .
Theorem 2. Consider the iteration (1) with a positive sequence α_k. Then the corresponding residual function can be expressed as

r_k(λ) = (1 − λ)^{k/2} P_k(√(1−λ)) / P_k(1),    (9)

where P_k is a sequence of orthogonal polynomials obeying the recurrence relation

P_{k+1}(x) = c_k x P_k(x) − d_k P_{k−1}(x),    (10)

with c_n and d_n recursively defined to satisfy

1 + α_k = c_k P_k(1)/P_{k+1}(1),   α_k = d_k P_{k−1}(1)/P_{k+1}(1).    (11)

Conversely, given a sequence of orthogonal polynomials defined by the recurrence relation (10) with given sequences c_n, d_n, there exists a sequence α_k (defined via (11)) such that the corresponding Nesterov iteration (1) has a residual function as in (9).

Proof. As in the proof of Theorem 1, we write r_k(λ) = (1 − λ)^{k/2} h_k(λ), where h_k satisfies (6) with h_0(λ) = 1 and h_1(λ) = √(1−λ) for k ≥ 1. We may conclude that (10) leads to a similar recursion as (8): with θ_k := P_k(1), we obtain from (10) by some algebraic manipulations that θ_{k+1} = c_k θ_k − d_k θ_{k−1}. If (11) holds, then, from the recursion for θ_k, an induction step shows that P_k(x)/P_k(1) and h_k(λ) satisfy identical recursions and have identical initial conditions when setting x = √(1−λ). Conversely, if (10) is given and the sequence α_k is recursively defined by (11), then it follows in a similar manner that P_k(x)/P_k(1) obeys the same recursion and initial conditions as h_k(λ), and thus both functions agree.
The polynomials P_k(x) in this theorem correspond to x C_{k−1}^{((β+1)/2)}(x) for the standard sequence (2). As an illustration, we may consider the peculiar choice of α_k in Nesterov's original paper [9], which is also used in the well-known FISTA iteration [2]: first, a sequence is defined recursively by t_1 = 1, t_{k+1} = (1 + √(1 + 4 t_k²))/2, and then the sequence α_k is given by α_k = (t_k − 1)/t_{k+1}. Note that t_{k+1} is the positive root of the equation t_{k+1}(t_{k+1} − 1) = t_k². Using this identity, we may calculate coefficients c_k, d_k of a recurrence formula (10) for orthogonal polynomials that correspond to such an iteration. However, the result does not seem to be related to any common polynomial family, to the knowledge of the author.
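The identity t_{k+1}(t_{k+1} − 1) = t_k² is easy to check numerically. The snippet below is an illustrative sketch of the standard FISTA sequence and the resulting momentum factors α_k:

```python
def fista_sequence(K):
    """Classical FISTA sequence: t_1 = 1, t_{k+1} = (1 + sqrt(1 + 4 t_k^2))/2,
    with momentum factors alpha_k = (t_k - 1)/t_{k+1}."""
    t = 1.0
    ts, alphas = [t], []
    for _ in range(K):
        t_next = (1.0 + (1.0 + 4.0 * t * t) ** 0.5) / 2.0
        alphas.append((t - 1.0) / t_next)
        ts.append(t_next)
        t = t_next
    return ts, alphas
```

Since t_k grows roughly like k/2, the factors α_k approach 1 from below, qualitatively like the choice (2) with a fixed β.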
On the other hand, we may design Nesterov iterations from the recurrence relations of classical polynomials. For instance, the Hermite polynomials obey a relation (10) with c_k = 2 and d_k = 2k, so that the corresponding sequence α_k has to satisfy the associated recursion (11). We do not know if this is of any use, though.

Convergence analysis
We consider the iteration (1) with the usual α_k-sequence (2) and show that it is an optimal-order regularization method (of course, when combined with a stopping rule).

Convergence rates and semi-saturation
In the classical analysis of regularization schemes [4], one tries to bound the error in terms of the noise level δ:

‖x_{k(δ)}^δ − x^†‖ ≤ f(δ),

where f is some function decreasing to 0 as δ → 0. Often, Hölder-type rates are considered, with f(δ) = C δ^ξ. For such estimates, one has to impose smoothness conditions in form of a source condition

x^† = (A^*A)^µ w,   ‖w‖ ≤ ρ,   µ > 0.    (12)

It is also well-known [4] that the optimal rate of convergence under (12) is of the form

f(δ) = O(δ^{2µ/(2µ+1)}),

and a regularization scheme that achieves this bound is called of optimal order. The phenomenon of saturation is the effect that for certain regularization methods, the convergence rate f(δ) does not improve even when the smoothness is higher, i.e., µ is larger. This happens, for instance, for Tikhonov regularization at µ = 1 or for the ν-methods at µ = ν; see [4].
For the Nesterov iteration (1), a detailed analysis has been performed by Neubauer [10], with the result that, assuming a usual source condition (12) and an appropriate a priori stopping rule, the resulting iterative regularization scheme is of optimal order for µ ≤ 1/2, while for µ > 1/2 the convergence rates still improve with µ, but in a suboptimal way. Thus, contrary to saturating methods, the order still improves beyond the "saturation index" µ = 1/2, yet suboptimally. This is what we call "semi-saturation", and, to the knowledge of the author, this has not been observed before for a classical regularization method. A further result of [10] is that, using the discrepancy principle as stopping rule, convergence rates can be proven, which are, however, always suboptimal.
Our second main contribution is an improvement of Neubauer's result: we show that the Nesterov iteration is of optimal order for smoothness indices µ ≤ (β+1)/4 with an a priori stopping rule. Moreover, contrary to [10], we also obtain optimal-order rates with the discrepancy principle, provided that µ ≤ (β−1)/4. These findings allow one to always achieve optimal-order convergence, provided β is chosen sufficiently large.
Moreover, the phenomenon of semi-saturation is made transparent by the representation in Theorem 1: the residual is a product of Landweber-type and ν-type residuals, and keeping in mind that Landweber iteration does not show saturation for Hölder indices while the ν-methods do, it is clear that a product as in (5) leads to the above-described semi-saturation.

Convergence analysis
In this section we perform a convergence analysis for the iteration (1). By Theorem 1, we may base our investigation on the known results for Landweber iteration and the ν-methods.
We collect some useful known estimates (13); these are well-known and follow from [11, Eq. (7.33.1), (4.73)]. From (13) we immediately obtain the bound (14), which has already been shown in [10]. Moreover, we may conclude the estimate (15) from (13) and (5) as well.

Recall that we denote by x_k the iteration with y^δ replaced by the exact data. As usual, this allows one to split the total error into an approximation term and a stability term. We estimate the stability term:

Proposition 1. Let x_k^δ be given by the iteration (1) with the sequence (2) and β > −1, and let x_k be the corresponding noise-free iteration with y^δ replaced by y = Ax^†. Then we have an estimate of the form ‖x_k^δ − x_k‖ ≤ c k δ.

Proof. Following [4], it is enough to estimate the filter function term λ^{1/2}|g_k(λ)|, where we use the mean value theorem with λ̄ ∈ (0, λ). The derivative may be calculated from (5). We use Markov's inequality (cf. [4, Eq. (6.16)]) and (13) to conclude the asserted bound.

The result now follows with [4, Theorem 4.2] and (14).
Note that this estimate is a slight improvement over the corresponding estimate in [10, Equation (3.2)], which has 2kδ on the right-hand side, similar to the case of the ν-methods.
From this we may conclude convergence:

Theorem 3. Let ‖A^*A‖ ≤ 1 and β > −1. If the iteration is stopped at a stopping index k(δ) that satisfies k(δ)δ → 0 and k(δ) → ∞ as δ → 0, then we obtain convergence: x_{k(δ)}^δ → x^† as δ → 0.

Proof. We split the error as ‖x_{k(δ)}^δ − x^†‖ ≤ ‖x_{k(δ)}^δ − x_{k(δ)}‖ + ‖x_{k(δ)} − x^†‖. The first term converges to 0 by the assumption on k(δ), and the second term does so because k(δ) → ∞ and by the dominated convergence theorem using (14), (15), as in [4].
We now consider convergence rates; for this, the following rather deep estimate for orthogonal polynomials is needed. It was derived by Brakhage [3] as well as by Hanke [4, Appendix A.2], [6] on the basis of Hilb-type estimates for Jacobi polynomials.
As a consequence, we may state our main convergence rate result for an a priori stopping rule:

Theorem 4. Let ‖A^*A‖ ≤ 1 and β > −1, and suppose that a source condition (12) is satisfied with some µ > 0.

1. If µ ≤ (β+1)/4 and the stopping index is chosen as k(δ) ∼ δ^{−1/(2µ+1)}, then we obtain optimal-order convergence ‖x_{k(δ)}^δ − x^†‖ = O(δ^{2µ/(2µ+1)}).

2. If µ > (β+1)/4 and the stopping index is chosen accordingly, then we obtain suboptimal-order convergence.

Proof. For λ ≤ 1, the estimate (17) yields (by interpolation) a bound on λ^µ |r_k(λ)|. In the case µ > (β+1)/4, we additionally use (18). The result now follows by standard means: solving for k by equating the approximation and stability terms in the resulting bounds yields the a priori parameter choice and the corresponding rates.
These results correspond to those of Neubauer for β = 1. However, for β > 1 this is an improvement, as we obtain optimal-order convergence if β is chosen larger than 4µ − 1. We note that in the optimal-order case, the number of iterations needed is of order O(δ^{−1/(2µ+1)}), which is the same order as for semiiterative methods and for the conjugate gradient method. Thus, the Nesterov acceleration certainly qualifies as a fast method.

Discrepancy principle
With the improved estimates, we can also strengthen the result of [10] when the iteration is combined with the well-known discrepancy principle. Recall that it defines a stopping index k(δ) a posteriori as the first (smallest) k that fulfils the inequality

‖y^δ − A x_k^δ‖ ≤ τ δ,    (21)

where τ > 1 is fixed. The corresponding convergence rates can be obtained by a slight modification of the proof in [10] and the general theory in [4].
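As an illustration, the discrepancy principle can be wired into the iteration as a stopping test. The following is a minimal sketch of our own (with the same starting convention x_0 = 0, x_1 = A^T y^δ as before), returning the final iterate together with the stopping index:

```python
import numpy as np

def nesterov_discrepancy(A, y_delta, delta, beta=3.0, tau=1.1, max_iter=100000):
    """Nesterov iteration stopped by the discrepancy principle (21):
    stop at the first k with ||y_delta - A x_k|| <= tau * delta."""
    x_prev = np.zeros(A.shape[1])      # x_0 = 0
    x = A.T @ y_delta                  # x_1: plain Landweber step
    k = 1
    while np.linalg.norm(y_delta - A @ x) > tau * delta and k < max_iter:
        alpha_k = (k - 1) / (k + beta)
        z = x + alpha_k * (x - x_prev)
        x_prev, x = x, z + A.T @ (y_delta - A @ z)
        k += 1
    return x, k
```

Here τ and β are user parameters; by the results above, β should be chosen large enough relative to the smoothness index to stay in the optimal-order regime.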
Theorem 5. Let ‖A^*A‖ < 1, β > −1, and assume that a source condition (12) is satisfied. If the iteration (1) is stopped by the discrepancy principle (21), then we obtain the following convergence rates:

1. If µ + 1/2 ≤ (β+1)/4, then we achieve optimal-order convergence rates, with a stopping index of the same order as in (19).
In the case that µ + 1/2 > (β+1)/4, the corresponding estimate yields the result in the second case.
These rates agree with those of [10] when setting β = 1; there, however, only the suboptimal case 2 was possible. Our improvement is to show that we may achieve optimal-order results even with the discrepancy principle, provided β is sufficiently large.

Remark 2. It is clear that in practice β should be selected in the regime of optimal rates, i.e., β > 4µ − 1 for a priori choices and β > 4µ + 1 for the discrepancy principle. However, it is a rule of thumb to choose such parameters as small as possible, or, more precisely, in such a way as to come close to the saturation point, i.e., β ∼ 4µ − 1 and β ∼ 4µ + 1, respectively.
Remark 3. For semiiterative methods, a modified discrepancy principle [6,4] has been defined, where the residual in (21) is replaced by an expression of the form (y^δ, s_k(AA^*) y^δ) with a constructed function s_k. This yields an order-optimal method as for the a priori stopping rule. An adaptation of this strategy to the Nesterov iteration is certainly possible and should yield order-optimal rates for all µ ≤ (β+1)/4. However, the strategy is quite involved, and it is not completely clear to us how to include it into the iteration efficiently. We thus do not investigate such modifications in this article.

Numerical results
In this section we present some small numerical experiments to illustrate the semi-saturation phenomenon and to investigate the performance of Nesterov's iteration, in particular, with respect to the optimal-order results.
In a first example we consider a simple diagonal operator A = diag(1/n²) for n = 1, ..., 1000, as well as an exact solution x^† = ((−1)^n/n^4)_{n=1}^{1000}, which amounts to a source condition being satisfied with index µ = 0.75. Thus, we are in a case of higher smoothness, where the results of the present article really improve those of [10]. We added standard normally distributed Gaussian noise to the exact data and performed various iterative regularization schemes: Landweber iteration, the ν-method, and the Nesterov iteration, the latter two with various settings of the parameters ν and β, respectively. We calculated the stopping index either by the discrepancy principle (21) with τ = 1.01 or, since we have the luxury of an available exact solution in this synthetic example, by the oracle stopping index, defined as k_opt = argmin_k ‖x_k^δ − x^†‖. In other words, k_opt is the best theoretically possible stopping index.
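A stripped-down version of this experiment already shows the acceleration over plain Landweber iteration. The sketch below is purely illustrative and deviates from the paper's setup (200 instead of 1000 modes, exact data, and a fixed residual tolerance instead of a stopping rule):

```python
import numpy as np

def run_counts():
    n = np.arange(1, 201)
    s = 1.0 / n**2                      # singular values of A = diag(1/n^2)
    x_true = (-1.0) ** n / n**4         # smoothness index mu = 0.75
    y = s * x_true                      # exact data, no noise added

    def iterations_to_tol(accelerated, beta=4.0, tol=1e-6, max_iter=200000):
        # iteration (1) for the diagonal operator; alpha_k = 0 gives Landweber
        x_prev = np.zeros_like(y)
        x = s * y                       # x_1 = A^T y
        k = 1
        while np.linalg.norm(y - s * x) > tol and k < max_iter:
            a_k = (k - 1) / (k + beta) if accelerated else 0.0
            z = x + a_k * (x - x_prev)
            x_prev, x = x, z + s * (y - s * z)
            k += 1
        return k

    return iterations_to_tol(False), iterations_to_tol(True)
```

Counting the iterations needed to push the residual below the tolerance gives a direct, if crude, measure of the speed-up.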
In Figure 1, we display the error ‖x_{k(δ)}^δ − x^†‖ against various noise levels on a log-log scale. The curves correspond to convergence rates for the Nesterov iteration (full line, blue), Landweber iteration (dotted line, black), and the ν-method (dash-dotted line, red). The parameters were chosen as β = 4 and ν = 1, i.e., we are in the optimal-order case covered by item 1 in Theorem 4 and Theorem 5.
On the left-hand side we employ the oracle stopping rule using k opt and on the right-hand side we use the discrepancy principle.
As can be observed, all three methods show a similar (optimal-order) rate, as stated in Theorems 4 and 5. In particular, this verifies one of our findings that the discrepancy principle for Nesterov's iteration leads to an optimal-order method provided β is chosen appropriately.
In Figure 2, we illustrate the semi-saturation phenomenon: here β and ν are deliberately chosen too small (β = 0, ν = 0.4 on the left-hand side and β = −0.5, ν = 0.3 on the right-hand side). We observe that for small ν, the convergence rate of the ν-method is slow as a result of its saturation. The Nesterov iteration also has a slower rate than the non-saturating Landweber iteration but, as can be expected from our residual polynomial representation, lies in between the other two.
We remark that the ν-methods show some unpleasant behaviour when ν is chosen small. The residual is highly oscillating, and for small noise levels we could not always reach the prescribed discrepancy; when we did, the number of iterations was quite high, even higher than for Landweber iteration. This might be attributed to our quite aggressive setting of the discrepancy principle with τ = 1.01. In that respect, the Nesterov iteration was very well-behaved, and we had no problems with a small β, which is probably due to the robust Landweber component in the representation (5).
The optimal-order convergence only partly illustrates the effective performance of the methods. In Table 1 we therefore provide ratios of error values: the numbers in the table are ‖x_{Nest,k_opt}^δ − x^†‖ / ‖x_{method,k}^δ − x^†‖, where x_{Nest,k_opt}^δ denotes the Nesterov iteration with the optimal stopping rule and x_{method,k}^δ the iterate of the respective method with the respective stopping rule. All results correspond to an optimal-order regime of parameters (those of Figure 1). The numbers of iterations (both for the oracle stopping rule and the discrepancy principle) are given in Table 2. In these tables, we also include the corresponding results for the conjugate gradient iteration CGNE [7].
In terms of the number of iterations, the Nesterov iteration is slightly slower than the ν-methods (approximately by a constant factor of 1.5), but both show a similarly modest increase of iterations when δ is decreased. Both need more iterations than the CGNE method, which, of course, is the fastest one by design. The slightly higher number of iterations might be attributed to the better stability estimate established above. It might appear a little paradoxical that a better estimate leads to slower convergence, but this is clear from the theory, as the number of iterations is a decreasing function of δ and thus also of any factor in front of δ. This factor, however, pays off when considering the total error of the method, and we observe that the Nesterov iteration with the optimal choice k_opt indeed almost always has a slightly smaller error than the ν-method. Surprisingly, it is in several instances also better than the CGNE method. However, the Nesterov method sometimes loses some of its advantage over the ν-method when using the discrepancy principle, but the performance is still acceptable.
Some further experiments indicate that the results are rather insensitive to overestimating β. As stated in Remark 2, the best choice is usually related to the smoothness index, but no serious problems arose when β was chosen larger.
Further numerical experiments have been performed in [10]: even though the value of β was not reported there, the results are consistent with our theory for the choice β = 1. The forward operator there was the Green's function for the solution of the 1D boundary value problem −u′′ = f with homogeneous boundary conditions. Exact solutions with various smoothness are stated there: Example 5.1 with µ = 1/8, Example 5.2 with µ = 5/8, and Example 5.3 with µ = 17/8. We used the same problem and the same examples, but we calculated A via a FEM discretization of the boundary value problem, with A the corresponding discrete solution operator. For simplicity, we ignored discretization errors and took the discretized (projected) solution as x^†.
The main purpose of this experiment is to verify that the discrepancy principle (τ = 1.1) can be made an optimal-order method. We chose β = 3.5 for the first two examples and β = 9.5 for the third, which should in any case lead to an optimal-order situation. In Figure 3, we plot the error versus the relative noise level on a logarithmic scale for the three examples with this choice of β, indicated by the marker 'x'. As a comparison, we also indicate the predicted optimal rate by a solid line. Furthermore, also shown and marked with '+' are the corresponding results for β = 1, i.e., the suboptimal case.
These results clearly illustrate that with the discrepancy principle we may achieve the optimal-order rates with the correct choice of β, and that for a wrong choice of β the rate deteriorates. For low smoothness as in Example 1 (left picture in Figure 3), however, almost no deterioration seems to occur, contrary to expectation.

Conclusion
We have provided a representation of the residual polynomials of Nesterov's acceleration method for linear ill-posed problems as a product of Gegenbauer polynomials and Landweber-type residuals. This allowed us to prove optimal-order rates for an a priori stopping rule and for the discrepancy principle as long as β in (2) is sufficiently large. The number of iterations is shown to be of the same order as for other fast methods such as the ν-methods or the conjugate gradient method. Moreover, our representation clearly explains the observed semi-saturation phenomenon.
Within the class of linear iterative methods, Nesterov acceleration is an excellent choice, as it is a fast method as well as a quite robust one, although it must be conceded that it cannot compete with the conjugate gradient method in terms of the number of iterations. However, this is compensated by its flexibility and simplicity of use, which allow one to easily integrate it into existing gradient methods and also to apply it in nonlinear cases.