An automated exact solution framework towards solving the logistic regression best subset selection problem

  • Thomas van Niekerk, School of Industrial Engineering, North-West University, Potchefstroom, South Africa
  • Jacques V. Venter, Centre for Business Mathematics and Informatics, North-West University, Potchefstroom, South Africa
  • Stephanus E. Terblanche, School of Industrial Engineering, North-West University, Potchefstroom, South Africa
Keywords: Best subset selection, Independent variable selection, Logistic regression, Mixed integer programming


An automated logistic regression solution framework (ALRSF) is proposed to solve a mixed integer programming (MIP) formulation of the well-known logistic regression best subset selection problem. The framework first determines the optimal number of independent variables to include in the model using an automated cardinality parameter selection procedure. The cardinality parameter dictates the size of the subset of variables and can be problem-specific. A novel regression parameter fixing heuristic that utilises a Benders decomposition algorithm is then applied to prune the solution search space so that the optimal regression parameter values are found faster. An optimality gap is subsequently calculated to quantify the quality of the final regression model, measured as the distance between the best possible log-likelihood value and the log-likelihood value calculated using the current parameter values. Attempts are then made to reduce the optimality gap by adjusting the regression parameter values. The ALRSF serves as a holistic variable selection framework that enables the user to consider larger datasets when solving the logistic regression best subset selection problem by significantly reducing the memory requirements associated with its MIP formulation. Furthermore, the automated framework requires minimal user intervention during model training and hyperparameter tuning. Improvements in the quality of the final model (considering both the optimality gap and the computing resources required to achieve a result) are observed when the ALRSF is applied to well-known real-world UCI machine learning datasets.
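For orientation, the best subset selection problem described above can be sketched in its generic textbook form. This is an illustration under stated assumptions, not the authors' exact formulation: the symbols $n$ (observations), $p$ (candidate variables), $k$ (cardinality parameter), $M$ (a big-M bound on coefficient magnitude), $z_j$ (binary selection indicators) and $\ell^{\mathrm{UB}}$ (an upper bound on the optimal log-likelihood) are all introduced here and do not appear in the abstract.

```latex
\begin{align*}
\max_{\beta_0,\,\beta,\,z}\quad
  & \ell(\beta_0,\beta) = \sum_{i=1}^{n}
    \Bigl[\, y_i\,(\beta_0 + x_i^{\top}\beta)
      - \log\bigl(1 + e^{\beta_0 + x_i^{\top}\beta}\bigr) \Bigr] \\
\text{s.t.}\quad
  & -M z_j \le \beta_j \le M z_j, && j = 1,\dots,p, \\
  & \sum_{j=1}^{p} z_j \le k, \\
  & z_j \in \{0,1\}, && j = 1,\dots,p.
\end{align*}

% One common relative form of an optimality gap (an assumption, not
% necessarily the definition used by the authors):
\[
\text{gap} = \frac{\ell^{\mathrm{UB}} - \ell(\hat{\beta}_0,\hat{\beta})}
                  {\bigl|\ell(\hat{\beta}_0,\hat{\beta})\bigr|}
\]
```

Here $z_j = 0$ forces $\beta_j = 0$, so at most $k$ variables enter the model. Because the log-likelihood is nonlinear, a MIP formulation needs some linearisation of the logarithmic term; how the ALRSF handles this is not stated in the abstract.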

Research Articles