This post deals with the following question. Suppose we have a function $latex f : X \times T \to \mathbb{R}$ and we want to minimize $latex f(x,t)$ with respect to $latex x$, where $latex x \in X$ and $latex t \in T$. Assume that $latex X$ is any locally compact topological space. Define the corresponding minimizers and minimum values as

$latex x^*(t) := \arg\min_{x \in X} f(x,t) \quad\mbox{and}\quad f^*(t) := \inf_{x \in X} f(x,t).$

We note here that, in general, $latex x^*(t)$ can be a set for each $latex t \in T$. Now the question we ask is the following: if there is a sequence $latex t_n$ converging to $latex t_0$ (say), then when can we get continuity of the minimizers, i.e.,

$latex x^*(t_n) \to x^*(t_0),$

and if $latex x^*(t)$ is a set, then what does this convergence even mean? Am I asking an artificial question? Let me explain the implications of this question in mathematical statistics.
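Before the statistical examples, here is a small numerical sketch (my own toy example, not taken from the examples below) of why continuity of minimizers is not automatic: for $latex f(x,t) = (x^2-1)^2 + tx$ minimized over $latex x \in [-2,2]$, the minimum value moves continuously in $latex t$, but the minimizer jumps from near $latex +1$ to near $latex -1$ as $latex t$ crosses zero.

import numpy as np

# Toy example: f(x, t) = (x^2 - 1)^2 + t*x minimized over x in [-2, 2].
# The minimum value is continuous in t, but the minimizer jumps at t = 0.
x_grid = np.linspace(-2.0, 2.0, 4001)

def minimize_over_x(t):
    vals = (x_grid**2 - 1.0)**2 + t * x_grid
    i = np.argmin(vals)
    return x_grid[i], vals[i]

for t in [-0.1, -0.01, -0.001, 0.001, 0.01, 0.1]:
    x_star, f_star = minimize_over_x(t)
    print(f"t = {t:+.3f}   argmin ~ {x_star:+.3f}   min value ~ {f_star:.5f}")
# The printed argmin flips sign as t crosses 0 while the min value barely changes.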
Example 1 (High Dimensional Regression): The first example I consider is a famous one in high-dimensional statistics, in particular, high-dimensional linear regression. Suppose we have $latex p$ regressors or predictor variables $latex X_1, \ldots, X_p$ and we have a response variable $latex Y$. Assuming a linear model

$latex Y_i = x_i^{\top}\beta_0 + \varepsilon_i, \quad 1 \le i \le n,$

where $latex \beta_0 \in \mathbb{R}^p$ and $latex \varepsilon_i$ are mean-zero errors, we get that $latex \beta_0$ is not identifiable if $latex p > n$, where $latex n$ is the sample size, and the least squares estimator is also not defined uniquely. So, in the high-dimensional regime, where usually $latex p \gg n$, we get that $latex \beta_0$ is not identifiable. A proposal made in 1996 by Rob Tibshirani is to minimize the objective function

$latex \sum_{i=1}^n (Y_i - x_i^{\top}\beta)^2 + \lambda\sum_{j=1}^p |\beta_j|,$

where $latex x_i$ represents the $latex p$-dimensional vector realization for the $latex i$th subject/unit/case. Here $latex \lambda \ge 0$ represents a tuning parameter. This method is called the LASSO. Note that the minimizer of the objective function is a function of the tuning parameter. So, it should be represented as $latex \hat\beta(\lambda)$. In practice, one would choose $latex \lambda$ by cross-validation or some such technique. Now, knowing that $latex \hat\beta(\lambda)$ is a continuous function of $latex \lambda$ gives us a guarantee that even if the choice of $latex \lambda$ is off by a small value from the optimal value, your estimator is not too different from the optimal one. Of course, if we can establish something more concrete than continuity (say, Lipschitz continuity or differentiability), we are offering more, but the least we can ask for is whether the path $latex \lambda \mapsto \hat\beta(\lambda)$ over $latex \lambda \in [0,\infty)$ is a continuous one or not. It is well known that the solution path for the LASSO is piecewise linear as $latex \lambda$ varies over the real line. There are many papers on computing the whole solution path of the LASSO estimator where this result is available.
Another related, recently proposed estimator in high-dimensional regression is the Dantzig selector of Candes and Tao, which is obtained as a solution of the linear programming problem

$latex \min_{\beta \in \mathbb{R}^p} \|\beta\|_1 \quad\mbox{subject to}\quad \|X^{\top}(Y - X\beta)\|_{\infty} \le \lambda,$

where $latex X$ is the design matrix with each column containing the observations related to each predictor and $latex \lambda$ is a tuning parameter. I do not know of any results regarding the solution path of the Dantzig selector, but since the estimator is a solution of a linear programming problem, it is easy to compute for each $latex \lambda$. What can be said about the continuity of the solution path of the Dantzig selector?
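At small scale one can at least trace the Dantzig selector path numerically by solving the linear program directly, say with scipy.optimize.linprog after writing $latex \beta = \beta^+ - \beta^-$ with $latex \beta^{\pm} \ge 0$. The function dantzig_selector below is my own throwaway implementation, not an efficient path algorithm:

import numpy as np
from scipy.optimize import linprog

def dantzig_selector(X, y, lam):
    """Solve min ||b||_1 subject to ||X^T (y - X b)||_inf <= lam,
    writing b = b_plus - b_minus with b_plus, b_minus >= 0."""
    n, p = X.shape
    XtX = X.T @ X
    Xty = X.T @ y
    # Objective: sum of b_plus and b_minus, which equals ||b||_1 at the optimum.
    c = np.ones(2 * p)
    # Constraints: XtX (b+ - b-) - Xty <= lam and -(XtX (b+ - b-) - Xty) <= lam.
    A_ub = np.block([[XtX, -XtX], [-XtX, XtX]])
    b_ub = np.concatenate([lam + Xty, lam - Xty])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    u = res.x
    return u[:p] - u[p:]

rng = np.random.default_rng(1)
n, p = 40, 60
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + 0.3 * rng.standard_normal(n)

# Solve on a grid of tuning parameters and look at how much the solution moves.
lams = np.linspace(5.0, 1.0, 9)
sols = np.array([dantzig_selector(X, y, lam) for lam in lams])
print(np.max(np.abs(np.diff(sols, axis=0)), axis=1))  # change between adjacent lambdas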
Example 2 (Non-Parametric Regression): This example is very similar to the previous one but the set up is slightly different. Consider the set up of non-parametric regression with one predictor, where we have observations $latex (x_i, Y_i)$ for $latex 1 \le i \le n$ and we are assuming the model

$latex Y_i = f_0(x_i) + \varepsilon_i.$

Now we want to estimate $latex f_0$ from the available data. Clearly, the least squares estimator is useless because it just interpolates the given data and so leads to an inconsistent estimator. So, a well-known popular approach is used which recommends minimizing the objective function

$latex \sum_{i=1}^n (Y_i - f(x_i))^2 + \lambda\int (f''(x))^2\,dx$

over all twice differentiable functions. Consistency and rate of convergence properties of these estimators can be found in the book A Distribution-Free Theory of Nonparametric Regression by Györfi, L., Kohler, M., Krzyzak, A., and Walk, H. The estimator in this case is also parametrized by a tuning parameter, and so, as before, the estimator should be written as $latex \hat f_{\lambda}$. If there are no restrictions on the space of all twice differentiable functions, then the solution can be given explicitly (it can be found in Smoothing Splines and Non-parametric Regression by Eubank) and can be shown analytically to be continuous in $latex \lambda$. Actually, the estimator can be shown to be convex in $latex \lambda$ and so is continuous on any open set. This also shows that the solution path is almost everywhere differentiable on any open set (which follows from convexity).
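A finite-dimensional analogue (my simplification, replacing $latex \int (f'')^2$ by squared second differences $latex D\theta$ of the fitted values at equally spaced design points) makes the explicit solution and its smoothness in $latex \lambda$ transparent: the minimizer is $latex \hat\theta(\lambda) = (I + \lambda D^{\top}D)^{-1}Y$, which is clearly a smooth, hence continuous, function of $latex \lambda$:

import numpy as np

# Discrete analogue of the smoothing spline on an equally spaced grid:
# minimize ||y - theta||^2 + lam * ||D theta||^2, where D takes second differences.
rng = np.random.default_rng(2)
n = 100
x = np.linspace(0, 1, n)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(n)

# Second-difference matrix D of shape (n - 2, n).
D = np.diff(np.eye(n), n=2, axis=0)

def fit(lam):
    # Closed-form minimizer: theta_hat(lam) = (I + lam * D^T D)^{-1} y.
    return np.linalg.solve(np.eye(n) + lam * D.T @ D, y)

# The solution path lam -> theta_hat(lam) barely moves under small perturbations of lam.
base = fit(1.0)
for eps in [1e-1, 1e-3, 1e-6]:
    print(eps, np.max(np.abs(fit(1.0 + eps) - base)))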
But if we restrict the space of twice differentiable functions over which we minimize the objective function, then the explicit solution is gone and the conclusions about continuity or convexity are also lost. This kind of restriction on the space of twice differentiable functions arises in shape-constrained inference, where we would usually ask for monotone or convex functions in the space. There are algorithms for computing such estimators (as in my package simest, where convex regression is considered), and their consistency and rates of convergence are also well understood; see, for example, the papers by Utreras in 1981 and 1985 and also the references in the simest package description. Here too, continuity of the solution path in the tuning parameter would give the same kind of guarantee as before. A toy version of such a restricted problem is sketched below.
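For a concrete feel, here is a bare-bones discrete sketch of penalized convex regression (least squares plus a roughness penalty on the fitted values, subject to non-negative second differences) using cvxpy; this is only a toy analogue and not the estimator computed by simest:

import cvxpy as cp
import numpy as np

# Simplified discrete analogue of penalized convex regression on an equally
# spaced grid: least squares plus a roughness penalty, subject to convexity
# of the fitted values (non-negative second differences).
rng = np.random.default_rng(3)
n = 60
x = np.linspace(-1, 1, n)
y = x**2 + 0.2 * rng.standard_normal(n)

D = np.diff(np.eye(n), n=2, axis=0)   # second-difference matrix
theta = cp.Variable(n)

def fit(lam):
    objective = cp.Minimize(cp.sum_squares(y - theta) + lam * cp.sum_squares(D @ theta))
    problem = cp.Problem(objective, [D @ theta >= 0])
    problem.solve()
    return theta.value.copy()

theta_1 = fit(1.0)
theta_2 = fit(1.01)
print("smallest second difference of the fit (should be >= 0 up to tolerance):", (D @ theta_1).min())
print("max change in the fit between nearby tuning values:", np.max(np.abs(theta_1 - theta_2)))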
Also, there are many variants of non-parametric regression considered in the literature where local adaptiveness is the focus. One such estimator is obtained by trend filtering, proposed by Ryan Tibshirani, where we minimize

$latex \frac{1}{2}\sum_{i=1}^n (Y_i - \theta_i)^2 + \lambda\|D\theta\|_1,$

where $latex D$ is an $latex (n-k-1)\times n$ banded difference matrix depending on the design points $latex x_i$. The final estimator of the function $latex f_0$ is given by linear interpolation of the $latex \hat\theta_i$'s. This estimator was shown to be asymptotically equivalent to the minimizer of the objective function

$latex \frac{1}{2}\sum_{i=1}^n (Y_i - f(x_i))^2 + \lambda\,\mathrm{TV}(f^{(k)}),$

where $latex \mathrm{TV}(\cdot)$ represents the total variation norm of the function. In both cases explicit expressions for the estimators are not available, but in the first case, trend filtering, it was shown that the solution path is piecewise linear. Another alternative leading to locally adaptive estimators of the regression function can be obtained by minimizing a criterion of the form

$latex \sum_{i=1}^n (Y_i - f(x_i))^2 + \int \lambda(x)(f''(x))^2\,dx,$

where $latex \lambda(\cdot)$ is a function chosen for the local adaptivity property, but these estimators are not computationally easy to handle.
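Trend filtering itself is easy to prototype with a generic convex solver such as cvxpy (a sketch for $latex k = 1$, i.e., piecewise linear fits, on an equally spaced grid; this is not the specialized path algorithm from the trend filtering literature):

import cvxpy as cp
import numpy as np

# Trend filtering sketch (k = 1, piecewise linear fits) on an equally spaced
# grid: minimize 0.5 * ||y - theta||^2 + lam * ||D theta||_1, where D takes
# second differences of theta.
rng = np.random.default_rng(4)
n = 100
x = np.linspace(0, 1, n)
y = np.abs(x - 0.5) + 0.1 * rng.standard_normal(n)   # piecewise linear truth

D = np.diff(np.eye(n), n=2, axis=0)
theta = cp.Variable(n)

def trend_filter(lam):
    obj = cp.Minimize(0.5 * cp.sum_squares(y - theta) + lam * cp.norm1(D @ theta))
    cp.Problem(obj).solve()
    return theta.value.copy()

# Solutions at nearby tuning values stay close, consistent with a piecewise
# linear (hence continuous) solution path in lam.
print(np.max(np.abs(trend_filter(1.0) - trend_filter(1.001))))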
Example 3 (Robust Statistics): This example is different from the previous ones in that the objective functions considered above involve the tuning parameter in a linear way, which will not be the case here. Robust statistics deals with estimation when the data is suspected of contamination, which happens often in data analysis. An old-fashioned way involves preprocessing the data so as to delete the suspected outliers and then dealing with the cleaned data as usual. This has many pitfalls: the resulting inference may not be valid (if there are tests involved in identifying the outliers; see the Post Selection Inference paper in the Annals of Statistics, 2013), there might be masking involved so that we do not know when to stop deleting observations, and the exact criterion used for identifying outliers might itself have issues.
The way to deal with this problem is to derive procedures which can automatically ignore (or downweight) observations which might be outliers and give an estimator which is still reasonable. One of the early robust estimation procedures was developed by Huber in 1964 in the Annals, where an estimator of location is obtained as the minimizer of

$latex \mu \mapsto \sum_{i=1}^n \rho_c(X_i - \mu),$

where

$latex \rho_c(x) = \begin{cases} x^2/2, & \mbox{if } |x| \le c,\\ c|x| - c^2/2, & \mbox{if } |x| > c.\end{cases}$

This can be seen as a bridge between the least squares estimator and the least absolute deviation estimator of location. Here the tuning parameter $latex c$ is involved in a complicated way. Denote the estimator by $latex \hat\mu(c)$. This estimation procedure can be easily extended to find a regression estimator by minimizing the function

$latex \beta \mapsto \sum_{i=1}^n \rho_c(Y_i - x_i^{\top}\beta),$

where $latex x_i$ is as in Example 1 above. Note that the estimators in both cases are not scale equivariant, in the sense that if we multiply all the observations by a constant, the location estimator does not get multiplied by the same constant. So, in general, one replaces $latex \rho_c(X_i - \mu)$ by $latex \rho_c((X_i - \mu)/\hat\sigma)$, where $latex \hat\sigma$ is a robust scale estimate.
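A quick numerical sketch of the Huber location estimator as a function of the tuning constant $latex c$ (my own helper functions, using the raw residuals without the scale adjustment just mentioned):

import numpy as np
from scipy.optimize import minimize_scalar

def huber_loss(r, c):
    # rho_c(r) = r^2 / 2 for |r| <= c, and c * |r| - c^2 / 2 otherwise.
    return np.where(np.abs(r) <= c, 0.5 * r**2, c * np.abs(r) - 0.5 * c**2)

def huber_location(x, c):
    # M-estimator of location: minimize sum_i rho_c(x_i - mu) over mu.
    objective = lambda mu: np.sum(huber_loss(x - mu, c))
    return minimize_scalar(objective, bounds=(x.min(), x.max()), method="bounded").x

# Contaminated sample: mostly N(0, 1) with a few gross outliers.
rng = np.random.default_rng(5)
x = np.concatenate([rng.standard_normal(95), np.full(5, 20.0)])

# The estimate moves smoothly with the tuning constant c, interpolating between
# median-like behaviour (small c) and mean-like behaviour (large c).
for c in [0.5, 1.0, 1.345, 2.0, 5.0]:
    print(c, huber_location(x, c))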
There are many more problems where the estimator depends on a tuning parameter and this is chosen by some criterion. My aim is to give some simple condition on the objective function so that continuity of the solution path can be established. In the literature of variational analysis, minimizing a function parametrized as above is treated under the title of parametrized optimization (no surprise there). There are some results known about continuity of minimizers in this part of the literature which I will describe in another post.
The Solution (I think): Getting back to the original problem where we have a function $latex f : X \times T \to \mathbb{R}$ and we are minimizing $latex f(x,t)$ with respect to $latex x \in X$ for each $latex t \in T$, assume that $latex f(x,t)$, with $latex x \in X$ and $latex t \in T$ (for $latex T$ a convex subset of $latex \mathbb{R}^d$), is concave in $latex t$. Then we get that

$latex f^*(t) = \inf_{x \in X} f(x,t)$

is a concave function of $latex t$, since it is an infimum of a family of concave functions. (Here we used the fact that any concave function can be represented as an infimum of linear functions, so an infimum of concave functions is again an infimum of linear functions and hence concave.) Since $latex f^*$ is a concave function, it is continuous on any open subset of its domain. Therefore, for any sequence $latex t_n$ converging to $latex t_0$ with $latex t_0$ in the interior of $latex T$, we have that

$latex f^*(t_n) \to f^*(t_0),$

which implies that for any choice $latex x_n \in x^*(t_n)$ and $latex x_0 \in x^*(t_0)$ we get that

$latex f(x_n, t_n) = f^*(t_n) \to f^*(t_0) = f(x_0, t_0).$

Furthermore, if $latex f$ is assumed to be continuous as a function of $latex (x,t)$ on $latex X\times T$, then any cluster point of the sequence $latex \{x_n\}$ belongs to the set $latex x^*(t_0)$. This is true because if a subsequence $latex x_{n_k}$ converges to $latex \tilde x$, then by joint continuity of $latex f$, we get that $latex f(x_{n_k}, t_{n_k})$ converges to $latex f(\tilde x, t_0)$; but from the above display, we get that $latex f(x_{n_k}, t_{n_k}) \to f^*(t_0)$, and since $latex f(\tilde x, t_0) = f^*(t_0)$, we have $latex \tilde x \in x^*(t_0)$.
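The key concavity fact is easy to check numerically on the LASSO example: $latex f(\beta,\lambda)$ is affine, hence concave, in $latex \lambda$ for each fixed $latex \beta$, so the optimal value must satisfy midpoint concavity in $latex \lambda$ (a rough check with scikit-learn's Lasso, whose objective rescales the squared-error term by $latex 1/(2n)$):

import numpy as np
from sklearn.linear_model import Lasso

# Numerical check of the key fact used above, on the LASSO example:
# ||y - X b||^2 / (2n) + alpha * ||b||_1 is affine (hence concave) in alpha
# for each fixed b, so the optimal value V(alpha) must be concave in alpha;
# in particular V((a1 + a2) / 2) >= (V(a1) + V(a2)) / 2.
rng = np.random.default_rng(6)
n, p = 50, 80
X = rng.standard_normal((n, p))
y = X[:, 0] - 2 * X[:, 1] + 0.5 * rng.standard_normal(n)

def optimal_value(alpha):
    beta = Lasso(alpha=alpha, fit_intercept=False).fit(X, y).coef_
    return np.sum((y - X @ beta) ** 2) / (2 * n) + alpha * np.sum(np.abs(beta))

a1, a2 = 0.05, 0.5
v1, v2, vmid = optimal_value(a1), optimal_value(a2), optimal_value((a1 + a2) / 2)
print("V(mid) =", vmid, " average =", 0.5 * (v1 + v2), " concave?", vmid >= 0.5 * (v1 + v2))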
Do we need to assume Joint Continuity?
To show that any convergent subsequence of minimizers converges to a point in the set of minimizers, we assumed the function to be jointly continuous. Do we really need to assume that? The encyclopedic book Convex Analysis by Rockafellar comes in handy here with the following result.
Theorem (Thm 10.7, Rockafellar, Convex Analysis): Let $latex C$ be a relatively open convex set in $latex \mathbb{R}^m$ and let $latex T$ be any locally compact topological space. Let $latex f$ be a real-valued function on $latex C \times T$ such that $latex f(x,t)$ is convex in $latex x \in C$ for each $latex t \in T$ and continuous on $latex T$ for each $latex x \in C$. Then $latex f$ is continuous on $latex C \times T$, i.e., $latex f$ is jointly continuous in $latex x$ and $latex t$.
The conclusion remains valid if the assumption about continuity in $latex t$ is weakened to the following: there exists a subset $latex C'$ of $latex C$ such that $latex C'$ is dense in $latex C$ and $latex t \mapsto f(x,t)$ is a continuous function on $latex T$ for each $latex x \in C'$.
Applying this theorem to our function $latex f(x,t)$, which is concave in $latex t$ for each $latex x$ (so that $latex -f$ is convex in $latex t$), we get joint continuity by just assuming continuity of $latex x \mapsto f(x,t)$ for each $latex t$, assuming $latex T$ is a relatively open convex set in $latex \mathbb{R}^d$, and assuming $latex X$ is any locally compact topological space; here our $latex T$ plays the role of the convex set $latex C$ in the theorem and our $latex X$ plays the role of the locally compact space. But noting that Hilbert-space-like sets are not locally compact, it may be reasonable to stick with the assumption of joint continuity of $latex f$, which is easy to check in the examples above.
If we assume strict convexity of $latex f(x,t)$ in $latex x$ (which holds in several of the examples above), then $latex x^*(t)$ is a unique vector instead of a set for each $latex t$.
What did I prove?
I proved above that if we assume

(A1) $latex f : X \times T \to \mathbb{R}$, with $latex X$ any locally compact topological space and $latex T$ an open convex subset of $latex \mathbb{R}^d$, is concave in $latex t$ for each $latex x \in X$;

(A2) $latex f(x,t)$ is continuous in $latex x$ for each $latex t \in T$;

then any cluster point of a sequence $latex \{x_n\}$ with $latex x_n \in x^*(t_n)$ for each $latex n$ lies in $latex x^*(t_0)$, where $latex t_n \to t_0$. So, we proved that

Statement: Any limit point of minimizers of $latex f(\cdot, t_n)$ is a minimizer of $latex f(\cdot, t_0)$.
But we did not prove that there exists any subsequence which converges. This requires a further assumption of sequential compactness of the space $latex X$, which is equivalent to compactness of $latex X$ if it is a metric space. This assumption can be avoided if we can prove that there exists a compact set $latex K$ such that the minimizer lies in $latex K$ for all $latex t$ in a neighbourhood of $latex t_0$, which then guarantees a convergent subsequence because $latex x_n \in K$ for all large enough $latex n$ as $latex t_n \to t_0$.
In the proof above, we only required the conclusion that $latex f^*(t)$ is continuous in $latex t$. But if we are willing to assume that $latex X$ is compact, then it can be proved that we do not require the concavity assumption; continuity alone would suffice. The statement below is taken from Willie Wong's blog, where a proof is given.
Theorem: Suppose $latex X$ and $latex T$ are topological spaces with $latex X$ compact. Then for any continuous function $latex f : X \times T \to \mathbb{R}$, the function

$latex f^*(t) = \min_{x \in X} f(x,t)$

is well-defined and continuous in $latex t$.