Table 1 Algorithm description and references

From: Logistic regression has similar performance to optimised machine learning algorithms in a clinical setting: application to the discrimination between type 1 and type 2 diabetes in young adults

Algorithm | Description | References

Logistic regression

A classic statistical algorithm for binary outcomes that uses maximum likelihood estimation. It is fully parametric and has no model hyperparameters to set. Coefficients are adjusted to allow for dependence between the characteristics. It is useful for inference, estimation, interpretation and prediction.

[41, 44–46]

Random forest

An algorithm that grows a large ensemble of classification trees on bootstrapped samples, using a random selection of the predictor variables, and aggregates the trees (bagging) for class selection; after all the trees have been grown, the predicted class is determined from the average estimated class probability calculated over the ensemble of trees.

[41, 47, 48]

Gradient boosting machine

An ensemble learning technique similar to random forest in the sense that both average a large number of decision trees to make predictions. The difference between the two is the application of gradient boosting: the decision trees are trained sequentially, with each successive model adjusted to reduce the errors of the previous models. The predicted class is determined from the average estimated class probability (or majority vote of the predicted class) calculated over the ensemble of trees.

[41, 49, 50]

Multivariate adaptive regression spline

MARS and logistic regression share similarities. In the logistic regression model, the logarithm of the odds is fitted with a linear combination of the predictors. In the MARS model, the logarithm of the odds is fitted with splines to capture non-linear and interaction terms. The hinge function (sometimes called a rectifier) is used to model the splines.

[51]

Neural network

A method using an adaptive, non-sequential approach to learning that mimics a biological neural network. It is a non-parametric technique in which signals travel from the first layer (the input layer) to the last layer (the output layer). Each layer is made up of a set of neurons. The output of each neuron is computed as a non-linear function of the sum of its weighted inputs from the neurons in the previous layer. The weights increase or decrease the strength of the signal at a connection.

[41, 52–55]

K-nearest neighbours

A model-free method; it is a type of instance-based learning or lazy learning in which there is no training phase; instead, the algorithm memorises the training data. Based on the principle that observations located close together in n-dimensional space will have the same outcome, classification involves searching the entire dataset for the k training points closest in Euclidean distance (the k neighbours); the predicted class probability is then determined from the vote of the actual classes among these k neighbours.

[41, 53, 56, 57]

Support vector machine

The classifier is obtained by solving a quadratic optimisation problem that minimises penalties and maximises the margin width; the two classes are separated by constructing non-linear decision boundaries (hyperplanes) using a kernel trick that maximises the margin between them. The posterior estimates produced are a rescaled version of the original classifier scores, obtained through a logistic transformation.

[41, 58, 59]
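
The short Python sketches below illustrate each algorithm listed in Table 1. They are minimal scikit-learn examples on synthetic data (generated with make_classification), with arbitrary illustrative settings; they are not the models or software used in the study. First, logistic regression, which fits the log-odds by maximum likelihood and exposes interpretable coefficients:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # Synthetic binary-outcome data standing in for the clinical predictors.
    X, y = make_classification(n_samples=500, n_features=6, random_state=0)

    # No model hyperparameters need tuning; coefficients are estimated by
    # maximum likelihood and adjust for dependence between predictors.
    model = LogisticRegression(max_iter=1000).fit(X, y)

    print(model.coef_)                 # log-odds coefficients (interpretable)
    print(model.predict_proba(X[:5]))  # predicted class probabilities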
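
Random forest, sketched with RandomForestClassifier: each tree is grown on a bootstrap sample with a random subset of predictors considered at each split, and class probabilities are averaged over the ensemble (the tree count and max_features value are illustrative only):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=500, n_features=6, random_state=0)

    # bootstrap=True grows each tree on a bootstrapped sample (bagging);
    # max_features="sqrt" draws a random subset of predictors at each split.
    forest = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                                    bootstrap=True, random_state=0).fit(X, y)

    # Predicted class = argmax of the class probabilities averaged over trees.
    print(forest.predict_proba(X[:5]))
    print(forest.predict(X[:5]))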
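
Gradient boosting, sketched with GradientBoostingClassifier: trees are fitted sequentially, each new tree trained to correct the errors of the ensemble built so far; the learning rate and tree count here are arbitrary:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier

    X, y = make_classification(n_samples=500, n_features=6, random_state=0)

    # Trees are added one at a time; each new tree is fitted to the residual
    # errors (loss gradient) of the current ensemble, scaled by learning_rate.
    gbm = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05,
                                     max_depth=3, random_state=0).fit(X, y)

    print(gbm.predict_proba(X[:5]))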
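
MARS itself is not part of scikit-learn; the sketch below only illustrates the hinge (rectifier) basis functions it relies on, fed into a logistic model of the log-odds. A full MARS implementation also performs forward selection and backward pruning of basis functions (packages such as py-earth do this); the single knot at 0.0 used here is an arbitrary illustration:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=500, n_features=2, n_redundant=0,
                               random_state=0)

    # Hinge (rectifier) basis functions for one predictor at an arbitrary knot.
    knot = 0.0
    h_plus = np.maximum(0.0, X[:, 0] - knot)   # max(0, x - knot)
    h_minus = np.maximum(0.0, knot - X[:, 0])  # max(0, knot - x)

    # The log-odds are fitted as a linear combination of the hinge terms
    # (here alongside the second raw predictor), giving a piecewise-linear fit.
    basis = np.column_stack([h_plus, h_minus, X[:, 1]])
    spline_logit = LogisticRegression(max_iter=1000).fit(basis, y)
    print(spline_logit.coef_)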
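
A feed-forward neural network, sketched with MLPClassifier: one hidden layer of neurons, each computing a non-linear function of the weighted sum of its inputs from the previous layer (the layer size and iteration limit are arbitrary):

    from sklearn.datasets import make_classification
    from sklearn.neural_network import MLPClassifier

    X, y = make_classification(n_samples=500, n_features=6, random_state=0)

    # Signals pass from the input layer through one hidden layer of 16 neurons
    # to the output layer; each neuron applies a ReLU non-linearity to the
    # weighted sum of the previous layer's outputs.
    net = MLPClassifier(hidden_layer_sizes=(16,), activation="relu",
                        max_iter=2000, random_state=0).fit(X, y)

    print(net.predict_proba(X[:5]))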
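
k-nearest neighbours, sketched with KNeighborsClassifier: there is no real training step; prediction searches the stored data for the k points closest in Euclidean distance and votes over their classes (k = 5 is an arbitrary choice):

    from sklearn.datasets import make_classification
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_classification(n_samples=500, n_features=6, random_state=0)

    # "Fitting" only memorises the training data (lazy learning).
    knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean").fit(X, y)

    # For each query point the 5 nearest training points are found; the class
    # probability is the fraction of those neighbours belonging to each class.
    print(knn.predict_proba(X[:5]))
    print(knn.predict(X[:5]))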
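
A support vector machine, sketched with SVC: a kernel implicitly maps the data so that a maximum-margin hyperplane can separate the classes, and probability=True rescales the classifier scores into posterior probabilities through a logistic (Platt) transformation; the kernel and C value are illustrative choices:

    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=500, n_features=6, random_state=0)

    # The RBF kernel gives a non-linear decision boundary ("kernel trick");
    # C penalises margin violations in the quadratic optimisation that
    # maximises the margin width.
    svm = SVC(kernel="rbf", C=1.0, probability=True, random_state=0).fit(X, y)

    # Posterior estimates: classifier scores rescaled by a logistic (Platt) fit.
    print(svm.predict_proba(X[:5]))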