An important pillar is developing practical optimization algorithms for nonconvex optimization problems. I believe there is an urgent need for such algorithms, particularly in the context of nonconvex data-driven optimization problems for which off-the-shelf software is not available. I have worked on (i) developing exact combinatorial optimization algorithms for learning from high-dimensional data such as genomics data, and (ii) deriving faster first-order methods for specialized nonconvex optimization problems such as those arising in optimal power flow.
Exact Combinatorial Optimization for High-Dimensional Learning
Modern data sets often contain far more features than measured observations. Such high-dimensional data sets are particularly prevalent in genomics and other areas of computational biology. Cancer data sets, an important example, comprise the expression levels of many thousands of genes but typically concern only a few hundred individuals. Learning from high-dimensional data is particularly challenging, as most classical learning methods tend to fit the noise instead of the signal.
Simple and highly regularized learning methods are required here. Sparse linear models with a small number of nonzero coefficients are a popular choice when confronted with high-dimensional data. Indeed, it is often the case that only a very few features in high-dimensional data sets are truly relevant. As an example, most cancer types can be predicted based on a very limited number of genetic markers among all those measured. Equally important, sparse linear models are far more interpretable than most other models. Fast heuristic sparse learning methods such as the Lasso consequently rank among the most cited algorithms in the learning community.
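As a toy illustration of such a heuristic (a minimal sketch that is not taken from the publications below; the synthetic data, the regularization level, and the use of scikit-learn's Lasso are my own assumptions), one can fit the Lasso on a high-dimensional data set in which only a handful of features matter and count how many coefficients it actually selects:

\begin{verbatim}
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, k_true = 100, 1000, 5        # far more features (p) than observations (n)
X = rng.standard_normal((n, p))
w_true = np.zeros(p)
w_true[:k_true] = 1.0              # only the first five features are relevant
Y = X @ w_true + 0.1 * rng.standard_normal(n)

# Heuristic sparse regression: the Lasso typically recovers the signal but may
# also pick up spurious features, unlike an exact l0-constrained formulation.
lasso = Lasso(alpha=0.1).fit(X, Y)
print("Lasso selected", np.count_nonzero(lasso.coef_),
      "features; true support size:", k_true)
\end{verbatim}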
Unfortunately, heuristic sparse methods tend to include many more irrelevant features than their exact counterparts do. I developed scalable learning methods to find the best sparse regressor, i.e.,
\begin{equation} \min_{\Vert w \Vert_0 \leq k} ~~ \frac{1}{2} \Vert Y - X w \Vert^2 + \frac{1}{2\gamma} \Vert w \Vert^2 \end{equation}
or, similarly, the best classifier exactly using integer optimization. Moore's law, taken together with a comparable improvement in modern integer optimization solvers, indeed represents a true revolution in this context. By exploiting an exact convex dual formulation, exact sparse models can be found using modern integer optimization for data sets with up to 100,000 observations and features, which is two orders of magnitude larger than previously possible.
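For very small instances, the exact problem above can be illustrated by brute force (a minimal sketch of the objective in the equation above; this is emphatically not the convex-dual and integer-optimization algorithm from the publications below, which scales to roughly 100,000 observations and features). For each candidate support $S$ of size $k$ the inner problem is a ridge regression with a closed-form solution, so enumerating supports yields the exact optimum:

\begin{verbatim}
import itertools
import numpy as np

def exact_sparse_ridge(X, Y, k, gamma):
    """Exact minimizer of 0.5*||Y - Xw||^2 + ||w||^2/(2*gamma) s.t. ||w||_0 <= k,
    found by enumerating all supports of size k (tractable only for tiny p)."""
    n, p = X.shape
    best_obj, best_w = np.inf, None
    for S in itertools.combinations(range(p), k):
        XS = X[:, S]
        # Ridge solution restricted to the support S.
        wS = np.linalg.solve(XS.T @ XS + np.eye(k) / gamma, XS.T @ Y)
        resid = Y - XS @ wS
        obj = 0.5 * resid @ resid + 0.5 / gamma * wS @ wS
        if obj < best_obj:
            best_obj, best_w = obj, np.zeros(p)
            best_w[list(S)] = wS
    return best_w, best_obj
\end{verbatim}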
Related publications

Bertsimas, D., J. Pauphilet, and B.P.G. Van Parys (2021). “Sparse classification: A scalable discrete optimization perspective”. In: Machine Learning 110.11, pp. 3177–3209. Link

Bertsimas, D. and B.P.G. Van Parys (2020a). “Sparse hierarchical regression with polynomials”. In: Machine Learning 109.5, pp. 973–997. Link

Bertsimas, D., J. Pauphilet, and B.P.G. Van Parys (2020). “Sparse regression: Scalable algorithms and empirical performance”. In: Statistical Science 35.4, pp. 555–578. Link

Bertsimas, D. and B.P.G. Van Parys (2020b). “Sparse high-dimensional regression: Exact scalable algorithms and phase transitions”. In: Annals of Statistics 48.1, pp. 300–323. Link
Optimal First-Order Algorithms
Optimization formulations associated with learning problems based on huge data sets remain, however, out of reach for exact nonconvex optimization methods. At such huge scale, practical necessity dictates the use of first-order methods, which only use simple gradient information to determine a local optimizer. Most convex optimization problems are well studied and have associated optimal first-order methods. In stark contrast, even for simple classes of nonconvex functions, such as all smooth nonconvex functions, no optimal first-order method is known. I have made significant progress on designing faster first-order methods that solve nonconvex optimization problems to local optimality. More significantly, such faster first-order methods are not derived with pen and paper but rather found in a computer-assisted fashion by solving an auxiliary optimization formulation \begin{equation} \min_{a\in \mathcal A} \max_{f\in \mathcal F} ~P_N(f, a) \end{equation} where the decision variable $a \in \mathcal A$ is best thought of as characterizing a particular first-order algorithm, whereas the decision variable $f$ represents a particular function from a class of interest $\mathcal F$, e.g., all smooth nonconvex functions. The performance of an algorithm $a$ on a function $f$ is denoted here as $P_{N}(f, a)$ and may, for instance, represent the norm of the gradient of the best iterate observed among the first $N$ iterates. I believe that the automated discovery of better optimization methods for specialized classes of optimization problems using computer-assisted tools is a promising research direction, as it allows fast tailored methods to be determined algorithmically rather than analytically.
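To make the performance measure concrete, the following is a minimal sketch (under my own assumptions about the test function and step sizes, and using plain gradient descent rather than an algorithm produced by the computer-assisted methodology in the publications below) of evaluating $P_N(f, a)$ as the smallest gradient norm among the first $N$ iterates of a fixed step-size schedule:

\begin{verbatim}
import numpy as np

def performance(grad_f, x0, steps):
    """P_N(f, a): smallest gradient norm among the iterates of gradient descent
    with the given step-size schedule (the schedule plays the role of a)."""
    x = np.asarray(x0, dtype=float)
    best = np.linalg.norm(grad_f(x))
    for h in steps:
        x = x - h * grad_f(x)
        best = min(best, np.linalg.norm(grad_f(x)))
    return best

# A smooth nonconvex test function f(x) = sin(x1^2) - cos(x2), via its gradient.
grad = lambda x: np.array([2.0 * x[0] * np.cos(x[0] ** 2), np.sin(x[1])])
print(performance(grad, x0=[1.0, 2.0], steps=[0.1] * 20))
\end{verbatim}

The computer-assisted approach then optimizes the algorithm parameters against the worst case over the whole class $\mathcal F$, rather than evaluating a single function as done here.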
Related publications

Das Gupta, S., B.P.G. Van Parys, and E.K. Ryu (2023). “Branch-and-bound performance estimation programming: A unified methodology for constructing optimal optimization methods”. In: Mathematical Programming. Also featured in Quanta Magazine. Link

Das Gupta, S., B. Stellato, and B.P.G. Van Parys (2022). “Exterior-point optimization for nonconvex learning”. In: SIAM Journal on Optimization. Submitted. Link