When are minimizers continuous? An Exploration…


This post deals with the following question. Suppose we have a function f:X\times\mathbb{R}^k \to \mathbb{R} and we want to minimize f(x,t) with respect to x, where x\in X and t\in T\subseteq\mathbb{R}^k. Assume that X is any locally compact topological space. Define the corresponding minimum values and minimizers as

h(t) = \inf_{x\in X}f(x,t)\qquad\mbox{and}\qquad \bar{x}(t) = \mbox{argmin}_{x\in X}f(x,t).

We note here that, in general, \bar{x}(t) can be a set for each t. Now the question we ask is the following: if there is a sequence \{t_n\}\subset\mathbb{R}^k converging to some t_0, when can we guarantee continuity of the minimizers, i.e.,

\bar{x}(t_n)\to\bar{x}(t_0)\quad\mbox{as}\quad n\to\infty,

and if \bar{x}(t_n) is a set, then what does this convergence even mean? Am I asking an artificial question? Let me explain the implications of this question in mathematical statistics.

Example 1 (High Dimensional Regression): The first example I consider is a famous one in high-dimensional statistics, in particular high-dimensional linear regression. Suppose we have p regressors or predictor variables X_1, X_2, \ldots, X_p and a response variable Y. Assuming a linear model of the form

Y = X\beta + \varepsilon,

where X = (X_1, X_2, \ldots, X_p)^{\top} and \beta = (\beta_1, \beta_2, \ldots, \beta_p)^{\top}, the parameter \beta is not identifiable when p > n (with n the sample size), and the least squares estimator is not uniquely defined. So, in the high-dimensional regime, where typically p/n\to c\in(1,\infty), the classical approach breaks down. A proposal made in 1996 by Rob Tibshirani is to minimize the objective function

\frac{1}{n}\sum_{i=1}^n (Y_i - X(i)^{\top}\beta)^2 + \lambda\|\beta\|_1,

where X(i) represents the realization of the X vector for the ith subject/unit/case and \lambda is a tuning parameter. This method is called the LASSO. Note that the minimizer of the objective function is a function of the tuning parameter, so it should be written as \hat{\beta}_n(\lambda). In practice, one would choose \lambda by cross-validation or some such technique. Knowing that \hat{\beta}_n(\lambda) is a continuous function of \lambda gives us a guarantee that even if the chosen \lambda is off by a small amount from the optimal value, the resulting estimator is not too different from the optimal one. Of course, if we can establish something stronger than continuity (say, Lipschitz continuity or differentiability), we are offering more, but the least we can ask is whether the path \hat{\beta}_n(\lambda), as \lambda varies, is continuous. It is well known that the LASSO solution path is piecewise linear as \lambda varies over [0,\infty); this result is available in the many papers on computing the whole LASSO solution path.
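As a quick sanity check on that claim, here is a minimal sketch of mine (not code from any of those papers) that traces the LASSO path on simulated data using scikit-learn's lasso_path; the data, the grid of tuning values, and the "largest jump" check are all illustrative choices, and scikit-learn's penalty parametrization differs from the display above by a constant factor.

```python
# Minimal sketch: trace the LASSO solution path over a grid of tuning parameters
# and check that neighbouring solutions stay close to each other.
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
n, p = 50, 100                      # high-dimensional setting: p > n
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = 1.0
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# lasso_path returns the coefficients beta_hat(lambda) on a decreasing grid of lambdas.
lambdas, coefs, _ = lasso_path(X, y, n_alphas=200)

# Continuity check: successive solutions differ only slightly as lambda moves.
steps = np.linalg.norm(np.diff(coefs, axis=1), axis=0)
print("largest jump between neighbouring lambdas:", steps.max())
```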

Another related estimator in high-dimensional regression, proposed more recently, is the Dantzig selector of Candes and Tao, which is obtained as a solution of the linear programming problem

\min \|\beta\|_1\quad\mbox{subject to}\quad\|X^{\top}(X\beta - Y)\|_{\infty} \le \lambda,

where X is the design matrix, with each column containing the observations on one predictor, and \lambda is a tuning parameter. I do not know of any results on the solution path of the Dantzig selector, but since the estimator is the solution of a linear programming problem, it is easy to compute for each \lambda. What can be said about continuity of the solution path of the Dantzig selector?
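Since the Dantzig selector is a linear program, one can at least trace the path numerically on a grid of \lambda values. Below is a hedged sketch of my own (not code from the Candes and Tao paper) using the standard split \beta = u - v with u, v \ge 0 and scipy's LP solver; the simulated data are purely illustrative.

```python
# Hedged sketch: solve the Dantzig selector LP for a single lambda using the
# split beta = u - v with u, v >= 0, so that ||beta||_1 = sum(u + v).
import numpy as np
from scipy.optimize import linprog

def dantzig_selector(X, y, lam):
    n, p = X.shape
    G = X.T @ X
    b = X.T @ y
    # Variables z = [u, v]; objective is sum(u) + sum(v).
    c = np.ones(2 * p)
    # Constraint -lam <= G(u - v) - b <= lam, rewritten as two "<=" systems.
    A_ub = np.vstack([np.hstack([G, -G]), np.hstack([-G, G])])
    b_ub = np.concatenate([lam + b, lam - b])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    u, v = res.x[:p], res.x[p:]
    return u - v

# Sweeping lambda over a grid gives an (approximate) solution path.
rng = np.random.default_rng(1)
X = rng.standard_normal((30, 60))
y = X[:, :3] @ np.ones(3) + 0.1 * rng.standard_normal(30)
path = np.array([dantzig_selector(X, y, lam) for lam in np.linspace(0.5, 5.0, 20)])
```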

Example 2 (Non-Parametric Regression): This example is very similar to the previous one, but the setup is slightly different. Consider non-parametric regression with one predictor, where we have observations (X_i, Y_i) for 1\le i\le n and we assume the model

Y_i = f(X_i) + \varepsilon_i,\mbox{ for }i = 1,2, \ldots, n.

Now we want to estimate f from the available data. The unrestricted least squares estimator is useless here because it simply interpolates the data and so leads to an inconsistent estimator. A well-known and popular approach is to minimize the penalized objective function

\frac{1}{n}\sum_{i=1}^n (Y_i - f(X_i))^2 + \lambda \int \{f''(x)\}^2dx,

over all twice differentiable functions. Consistency and rates of convergence for these estimators can be found in the book A Distribution-Free Theory of Nonparametric Regression by Györfi, Kohler, Krzyzak and Walk. The estimator in this case is also parametrized by a tuning parameter \lambda, so as before it should be written as \hat{f}_n(\lambda). If there is no restriction on the space of twice-differentiable functions, then the solution can be given explicitly (this can be found in Smoothing Splines and Non-parametric Regression by Eubank) and can be shown analytically to be continuous in \lambda. In fact, the estimator can be shown to be convex in \lambda and so is continuous on any open set; convexity also implies that the solution path is almost everywhere differentiable on any open set.
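To see the continuity of the path concretely, here is a rough stand-in of my own (a discretized approximation, not the exact smoothing spline): the roughness penalty \int\{f''(x)\}^2dx is replaced by squared second differences at equally spaced design points, which gives a closed-form ridge-type fit that is visibly smooth, hence continuous, in \lambda.

```python
# Hedged sketch: a discrete stand-in for the smoothing spline, replacing the
# integral of (f'')^2 by squared second differences at the design points.
# The fit has the closed form mu_hat(lambda) = (I + lambda * D2'D2)^{-1} y,
# which is a smooth (hence continuous) function of lambda.
import numpy as np

rng = np.random.default_rng(2)
n = 100
x = np.linspace(0, 1, n)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(n)

D2 = np.diff(np.eye(n), n=2, axis=0)        # second-difference operator
K = D2.T @ D2

def fit(lam):
    return np.linalg.solve(np.eye(n) + lam * K, y)

# The path lambda -> mu_hat(lambda) varies continuously: nearby lambdas, nearby fits.
lams = np.linspace(0.01, 10.0, 50)
fits = np.array([fit(lam) for lam in lams])
print(np.abs(np.diff(fits, axis=0)).max())
```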

But if we restrict the space of functions over which we minimize the objective function, then the explicit solution is gone, and with it the conclusions about continuity and convexity. Restrictions of this kind on the space of twice differentiable functions appear in shape-constrained inference, where we typically require the functions in the space to be monotone or convex. There are algorithms for computing such estimators (as in my package simest, where convex regression is considered), and their consistency and rates of convergence are also well understood; see, for example, the papers by Utreras in 1981 and 1985 and the references in the simest package description. Here too, continuity in \lambda would offer the same kind of guarantee about the choice of the tuning parameter as before.

Also, there are many variants of non-parametric regression considered in the literature where local adaptivity is the focus. One such estimator is obtained by trend filtering, proposed by Ryan Tibshirani, where we minimize

\frac{1}{n}\sum_{i=1}^n(Y_i - \mu_i)^2 + \lambda\|D^{(k)}\mu\|_1,

where D^{(k)} is a k-banded matrix depending on the design points X_1, X_2, \ldots, X_n. The final estimator of the function f is given by linear interpolation of the \mu_i's. This estimator was shown to be asymptotically equivalent to the minimizer of the objective function

\frac{1}{n}\sum_{i=1}^n (Y_i - f(X_i))^2 + \lambda TV(f^{(k)}),

where TV represents the total variation norm, applied here to the kth derivative f^{(k)}. In both cases explicit expressions for the estimators are not available, but for trend filtering it was shown that the solution path is piecewise linear in \lambda (a small computational sketch is given at the end of this example). Another route to locally adaptive estimators of the regression function is to minimize the function

\frac{1}{n}\sum_{i=1}^n (Y_i - f(X_i))^2 + \int \lambda(x)\left[f^{(k)}(x)\right]^2dx,

where \lambda(x) is a function chosen to achieve local adaptivity, but these estimators are not computationally easy to handle.
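Going back to the trend filtering objective above, a direct (if not the fastest) way to trace \hat{\mu}(\lambda) numerically is a generic convex solver. The sketch below is mine: it uses cvxpy with a difference operator built by differencing the identity matrix; the order of the operator, the data and the \lambda grid are illustrative assumptions, and the exact conventions in the trend filtering papers may differ.

```python
# Hedged sketch: compute a trend filtering estimate for a single lambda with a
# generic convex solver; sweeping lambda over a grid traces the solution path.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(3)
n, k = 100, 1                                  # k = 1: piecewise-linear fits
x = np.linspace(0, 1, n)
y = np.abs(x - 0.5) + 0.1 * rng.standard_normal(n)

D = np.diff(np.eye(n), n=k + 1, axis=0)        # discrete difference operator

def trend_filter(lam):
    mu = cp.Variable(n)
    obj = cp.sum_squares(y - mu) / n + lam * cp.norm1(D @ mu)
    cp.Problem(cp.Minimize(obj)).solve()
    return mu.value

path = np.array([trend_filter(lam) for lam in np.linspace(0.001, 0.1, 10)])
```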

Example 3 (Robust Statistics): This example differs from the previous ones in that the objective functions considered above involve the tuning parameter linearly, whereas here it enters in a more complicated way. Robust statistics deals with estimation when the data are suspected of contamination, which happens often in data analysis. An old-fashioned way is to preprocess the data so as to delete the suspected outliers and then treat the cleaned data as usual. This has many pitfalls: the subsequent inference may be invalid (if tests are involved in identifying the outliers; see the Post-Selection Inference paper in the Annals, 2013), masking may occur so that we do not know when to stop deleting observations, and the exact criterion used for identifying outliers might itself have issues.

The way to deal with this problem is to derive procedures which automatically down-weight (or ignore) observations that might be outliers and still give a reasonable estimator. One of the early robust estimation procedures was developed by Huber in his 1964 Annals paper, where an estimator of the location \mu is obtained as the minimizer of

\frac{1}{n}\sum_{i=1}^n \rho_C(X_i - \mu),\quad\mbox{with respect to}\quad\mu,

where

\rho_C(x) = \begin{cases}x^2, &\mbox{if }|x| \le C,\\2C|x| - C^2,&\mbox{if }|x| > C.\end{cases}

This can be seen as a bridge between the least squares estimator and the least absolute deviation estimator of location. Here the tuning parameter C enters the objective in a complicated (non-linear) way. Denote the estimator by \hat{\mu}_n(C). This estimation procedure extends easily to regression, where the estimator minimizes the function

\frac{1}{n}\sum_{i=1}^n \rho_C(Y_i - X(i)^{\top}\beta),

where X(i) is as in Example 1 above. Note that neither estimator is scale equivariant: if we multiply all the observations by a constant, the location estimator does not get multiplied by the same constant. So, in general, one replaces C by Cs_n, where s_n is a robust scale estimate.
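Here is a small sketch of the location problem: it computes \hat{\mu}_n(C) on a grid of C values for a contaminated sample (simulated by me), using the \rho_C above; the estimate moves continuously from (roughly) the median for small C towards the mean for large C.

```python
# Sketch: the Huber location estimate mu_hat(C) as the tuning constant C varies,
# using rho_C(x) = x^2 for |x| <= C and 2C|x| - C^2 otherwise, as in the post.
import numpy as np
from scipy.optimize import minimize_scalar

def rho(x, C):
    return np.where(np.abs(x) <= C, x**2, 2 * C * np.abs(x) - C**2)

def huber_location(data, C):
    obj = lambda mu: np.mean(rho(data - mu, C))
    return minimize_scalar(obj, bounds=(data.min(), data.max()), method="bounded").x

rng = np.random.default_rng(4)
data = np.concatenate([rng.normal(0, 1, 95), rng.normal(10, 1, 5)])  # contaminated sample

# As C grows, the estimate moves continuously towards the sample mean.
for C in [0.1, 0.5, 1.0, 2.0, 5.0, 50.0]:
    print(C, huber_location(data, C))
```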

There are many more problems where the estimator depends on a tuning parameter and this is chosen by some criterion. My aim is to give some simple condition on the objective function so that continuity of the solution path can be established. In the literature of variational analysis, minimizing a function parametrized as above is treated under the title of parametrized optimization (no surprise there). There are some results known about continuity of minimizers in this part of the literature which I will describe in another post.

The Solution (I think): Getting back to the original problem, where we have a function f:X\times\mathbb{R}^k\to\mathbb{R} and we minimize f(\cdot,t) over x\in X for each t\in T, assume that f(x,t) is concave in t for each x\in X. Then h is a concave function of t, since it is a pointwise infimum of concave functions (each concave function can be represented as an infimum of affine functions, and an infimum of affine functions is concave; a direct verification is also given below). Since h is concave, it is continuous on any open set on which it is finite. Therefore, for any sequence t_n converging to a point t_0 in the interior of T, we have h(t_n)\to h(t_0). Since f(\bar{x}_n, t_n) = h(t_n) and f(\bar{x}_0, t_0) = h(t_0) for any choice \bar{x}_n\in\bar{x}(t_n) and \bar{x}_0\in\bar{x}(t_0), this implies that

f(\bar{x}_n, t_n) \to f(\bar{x}_0, t_0).

Furthermore, if f is assumed to be continuous as a function of (x,t) on X\times T, then any cluster point of the sequence \{\bar{x}_n\}_{n=1}^{\infty} belongs to the set \bar{x}(t_0). To see this, suppose a subsequence \{\bar{x}_{n_k}\} converges to x. By joint continuity of f, f(\bar{x}_{n_k}, t_{n_k}) converges to f(x, t_0); but from the display above, f(\bar{x}_{n_k}, t_{n_k}) also converges to f(\bar{x}_0, t_0), so f(x, t_0) = f(\bar{x}_0, t_0), and since \bar{x}_0\in \bar{x}(t_0), we conclude that x\in\bar{x}(t_0).
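For completeness, here is the promised direct verification that h is concave, using only the concavity of f(x,\cdot) for each fixed x: for t_1, t_2\in T and \alpha\in[0,1],

\begin{aligned}h(\alpha t_1 + (1-\alpha)t_2) &= \inf_{x\in X} f(x, \alpha t_1 + (1-\alpha)t_2) \ge \inf_{x\in X}\left[\alpha f(x,t_1) + (1-\alpha)f(x,t_2)\right]\\ &\ge \alpha\inf_{x\in X}f(x,t_1) + (1-\alpha)\inf_{x\in X}f(x,t_2) = \alpha h(t_1) + (1-\alpha)h(t_2),\end{aligned}

which is exactly the concavity inequality for h.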

Do we need to assume Joint Continuity?

To show that any convergent subsequence converges to a point in the set of minimizers, we assumed that f is jointly continuous. Do we really need to assume that? Rockafellar's Convex Analysis, the encyclopedia of the subject, comes in handy here with the following result.

Theorem (Thm 10.7, Rockafellar: Convex Analysis): Let C be a relatively open convex set in \mathbb{R}^n and let T be any locally compact topological space. Let f be a real-valued function on C\times T such that f(x,t) is convex in x for each t\in T and continuous in t for each x\in C. Then f is continuous on C\times T, i.e., f(x,t) is jointly continuous in x and t.

The conclusion remains valid if the assumption about continuity in t is weakened to the following: there exists a subset C' of C such that C\subset \mbox{cl}(C') and f(x,\cdot) is a continuous function on T for each x\in C'.

Applying this theorem to our function f, which is concave in t for each x\in X, we get joint continuity by assuming only that f(\cdot, t) is continuous for each t\in T, that T is a relatively open convex set in \mathbb{R}^k, and that X is a locally compact topological space. But noting that infinite-dimensional spaces like Hilbert spaces (which appear in the non-parametric examples above) are not locally compact, it may be reasonable to stick with the assumption of joint continuity of f, which is easy to check in the examples above.

If we assume strict convexity of f(\cdot,t) in x, then \bar{x}(t) is a unique vector rather than a set for each t\in T (convexity alone, which does hold in the examples above, only guarantees that \bar{x}(t) is a convex set).

What did I prove?

I proved above that if we assume

(A1) f:X\times T\to\mathbb{R}, with X a locally compact topological space and T an open convex subset of \mathbb{R}^k, is concave in t for each x\in X,

(A2) f(x,t) is continuous in x for each t\in T,

Then any cluster point of a sequence \{\bar{x}_n\} with \bar{x}_n\in\bar{x}(t_n) lies in \bar{x}(t_0) whenever t_n\to t_0\in T. So, we proved the following.

Statement: Any limit point of minimizers of f(\cdot, t_n) is a minimizer of f(\cdot, t_0).

But we did not prove that a convergent subsequence exists. This requires the further assumption that X is sequentially compact, which is equivalent to compactness of X if it is a metric space. This assumption can be relaxed if we can show that there is a compact set containing the minimizers \bar{x}(t) for all t in a neighbourhood of t_0: since t_n\to t_0 as n\to\infty, the sequence \{\bar{x}_n\} eventually lies in that compact set and hence has a convergent subsequence.

In the proof above, we only needed the conclusion that h(t) is continuous in t. If we are willing to assume that X is compact, then it can be proved that the concavity assumption is not required at all; continuity alone suffices. The statement below is taken from Willie Wong’s blog, where a proof is given.

Theorem: Let X and Y be topological spaces with Y compact. Then for any continuous function f:X\times Y\to\mathbb{R}, the function g(x) = \inf_{y\in Y}f(x,y) is well-defined and continuous in x.
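As a toy numerical illustration (not a proof), one can approximate the partial infimum over a compact interval on a grid and check that small changes in x produce only small changes in g(x); the function f below is made up for the purpose.

```python
# Toy illustration: approximate g(x) = inf_{y in [0,2]} f(x,y) on a grid and
# check that g inherits continuity from the jointly continuous f.
import numpy as np

def f(x, y):
    return np.sin(3 * x * y) + (y - x)**2          # continuous on R x [0, 2]

ys = np.linspace(0.0, 2.0, 2001)                   # compact Y, discretized
xs = np.linspace(-3.0, 3.0, 1001)
g = np.array([f(x, ys).min() for x in xs])         # g(x) = min over the y-grid

# Small increments in x produce small increments in g (a continuity sanity check).
print("max jump of g over the x-grid:", np.abs(np.diff(g)).max())
```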

Please Share Your Opinion