Hey there! Looking to get started with supervised machine learning? As you dive into this exciting field, you're likely to encounter talk of various "algorithms" used to train models.
This beginner's guide will explain 9 of the most popular supervised learning algorithms in plain English. I'll give you the intuition behind how each one works so you can feel more confident when applying them.
What is Supervised Learning?
Firstly, what does "supervised" learning actually mean?
In supervised learning, we provide the algorithm with labeled training data. This means we give it many examples of inputs and the desired outputs. The algorithm's job is to learn the relationship between inputs and outputs, so it can predict the output when given new, unseen inputs.
Supervised learning problems fall into two main categories:
- Classification: Predict a categorical label. E.g. Predict whether an email is "spam" or "not spam". The label is discrete.
- Regression: Predict a continuous numerical value. E.g. Predict the price of a house based on size, location, etc. The label is continuous.
Below I'll explain 9 top supervised learning algorithms, whether they handle classification or regression, and their common use cases.
Algorithm | Handles Classification | Handles Regression | Use Cases |
---|---|---|---|
Naive Bayes | Yes | No | Document classification, spam filtering |
Decision Trees | Yes | Yes | Predict loan defaults, classify images |
Random Forests | Yes | Yes | Finance, medicine, image recognition |
Neural Networks | Yes | Yes | Image/speech recognition, NLP |
Support Vector Machines | Yes | Yes | Bioinformatics, text/image recognition |
Logistic Regression | Yes | No | Predict customer churn |
K-Nearest Neighbors | Yes | Yes | Recommendation systems, image recognition |
Gradient Boosting | Yes | Yes | Fraud detection, sales forecasting |
Linear Discriminant Analysis | Yes | No | Facial recognition |
Now let's examine how each algorithm works and where it shines…
1. Naive Bayes Classifiers
Naive Bayes utilises Bayes' theorem to predict the probability that an input belongs to a certain class. For example, it could predict that an email has a 95% probability of being spam based on its contents.
The "naive" assumption it makes is that all input features are independent of each other once the class is known. For example, it assumes that, within spam emails, the presence of certain words is unrelated to the presence of other words.
In reality this is often a flawed assumption. But surprisingly, Naive Bayes still tends to perform very well on real-world problems like document classification and spam filtering.
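To make the idea concrete, here's a minimal sketch of a word-count Naive Bayes spam scorer in plain Python. The four training emails, the "ham" label for non-spam, and the function name spam_probability are all made up for illustration, and add-one smoothing is an assumption that this guide doesn't otherwise cover.

```python
import math
from collections import Counter

# Tiny made-up training set: (words in the email, label).
train = [
    (["win", "money", "now"], "spam"),
    (["free", "money", "offer"], "spam"),
    (["meeting", "at", "noon"], "ham"),
    (["lunch", "meeting", "tomorrow"], "ham"),
]

labels = ["spam", "ham"]
word_counts = {label: Counter() for label in labels}
doc_counts = Counter()
for words, label in train:
    word_counts[label].update(words)
    doc_counts[label] += 1

vocab = {w for words, _ in train for w in words}

def spam_probability(words):
    """Return P(spam | words) under the naive independence assumption."""
    log_scores = {}
    for label in labels:
        # Start from the prior P(label).
        score = math.log(doc_counts[label] / len(train))
        total = sum(word_counts[label].values())
        for w in words:
            # Add-one (Laplace) smoothing avoids zero probabilities.
            p = (word_counts[label][w] + 1) / (total + len(vocab))
            # Independence lets us simply add up per-word log-probabilities.
            score += math.log(p)
        log_scores[label] = score
    # Normalise the two scores into a probability.
    top = max(log_scores.values())
    exp = {k: math.exp(v - top) for k, v in log_scores.items()}
    return exp["spam"] / sum(exp.values())

p_spam = spam_probability(["free", "money"])          # leans towards spam
p_ham_like = spam_probability(["meeting", "tomorrow"])  # leans towards not spam
```

Notice that each word contributes its own term and nothing else: that is the "naive" part.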
Where Naive Bayes Shines
- Performs surprisingly well despite its simplicity
- Fast to train (single pass through the data)
- Handles high dimensional data well
Naive Bayes is a good starting point before trying more advanced methods like neural networks.
Use Cases: Spam filtering, document classification (e.g. news vs sports articles)
2. Decision Trees
Decision trees model a sequence of "if X then Y" rules that lead to a target value. Just like branching trees in nature, decision trees split the data repeatedly along different axes.
Each node in the tree represents a test of some parameter (e.g. is a fruit red?). Branches indicate outcomes (yes or no), which lead either to another node or a final leaf node classification.
Trees can capture nonlinear relationships and need little preprocessing; for example, features don't have to be scaled. They're also easy for humans to understand compared to "black box" methods like neural networks.
However, they tend to overfit complex datasets. Techniques like pruning or limiting tree depth are needed to improve their ability to generalize.
Random forests overcome some limitations of single decision trees.
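Here's a rough sketch of how a tree might choose a single split, assuming Gini impurity as the quality measure (one common choice; this guide doesn't commit to a particular one). The loan data and the names gini, best_split and predict are hypothetical.

```python
def gini(labels):
    """Gini impurity: 0 means the group is pure (one class only)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Try each midpoint between sorted values; keep the lowest weighted Gini."""
    best_threshold, best_score = None, float("inf")
    points = sorted(set(values))
    for lo, hi in zip(points, points[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

# Hypothetical loan data: debt-to-income ratio and whether the applicant defaulted.
ratios = [0.10, 0.15, 0.20, 0.45, 0.50, 0.60]
defaulted = ["no", "no", "no", "yes", "yes", "yes"]

threshold, impurity = best_split(ratios, defaulted)

def predict(ratio):
    # A one-node tree (a "stump"): a single if/else rule.
    return "yes" if ratio > threshold else "no"
```

A full tree repeats this search inside each branch until the leaves are pure enough or a depth limit is reached.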
Where Decision Trees Shine
- Capturing non-linear relationships
- Interpretability
- Little data preprocessing required (no feature scaling)
Use Cases: Predict whether bank loan applicants will default; classify images as containing cats or dogs.
3. Random Forests
Random forests are ensembles of decision trees, combined using a majority vote for classification or averaging for regression.
They introduce randomness when training each constituent tree:
- Each tree is trained on a bootstrap sample: data points drawn at random, with replacement, from the training set
- Each split considers only a random subset of the features
Averaging many deep, decorrelated trees gives a model with lower variance than any single tree (less overfitting) and accurate predictions.
However the randomness makes the complete model harder to directly interpret. Techniques like feature importance provide some insight into the patterns learned.
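As a toy illustration, the sketch below builds a "forest" of one-split trees in plain Python, using bootstrap sampling, a random feature choice per tree, and a majority vote. It's a simplification under stated assumptions: real random forests grow deep trees and pick a fresh random feature subset at every split. The data and the names train_stump, train_forest and predict are hypothetical.

```python
import random
from collections import Counter

def train_stump(sample, feature):
    """Fit a one-split tree on one feature: predict 1 if value > threshold, else 0."""
    values = sorted({x[feature] for x, _ in sample})
    majority = Counter(y for _, y in sample).most_common(1)[0][0]
    best = None  # (errors, threshold)
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        errors = sum((x[feature] > threshold) != bool(y) for x, y in sample)
        if best is None or errors < best[0]:
            best = (errors, threshold)
    if best is None:
        # Only one distinct value in the sample: fall back to the majority class.
        return lambda x: majority
    threshold = best[1]
    return lambda x: int(x[feature] > threshold)

def train_forest(data, n_trees=25, seed=0):
    rng = random.Random(seed)
    n_features = len(data[0][0])
    forest = []
    for _ in range(n_trees):
        # Bootstrap: draw len(data) points with replacement.
        sample = [rng.choice(data) for _ in data]
        # Random feature choice for this tree's single split.
        feature = rng.randrange(n_features)
        forest.append(train_stump(sample, feature))
    return forest

def predict(forest, x):
    # Classification: every tree votes and the most common answer wins.
    votes = Counter(tree(x) for tree in forest)
    return votes.most_common(1)[0][0]

# Made-up 2-feature data: class 0 clusters low, class 1 clusters high.
data = [((1, 2), 0), ((2, 1), 0), ((3, 3), 0), ((2, 3), 0),
        ((7, 8), 1), ((8, 7), 1), ((9, 9), 1), ((8, 9), 1)]
forest = train_forest(data)
```

Calling predict(forest, (2, 2)) and predict(forest, (8, 8)) shows the vote in action. An individual tree may be trained on an unlucky sample, but the majority smooths that out.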
Where Random Forests Shine
- Powerful predictive accuracy combining multiple deep decision trees
- Reduces the overfitting problems of single trees
- Efficient for large datasets
Use Cases: Finance predictions, medical diagnosis, image classification, robotics
4. Neural Networks
Neural networks take inspiration from neurons in the brain. They contain interconnected layers which transmit signals and enable learning.
The layers transform input data into intermediate representations, which successively convert it into the target output format. Networks can adapt these transformations by updating connection strengths during training.
Their layered processing and flexible learning capabilities enable neural nets to tackle complex problems like image recognition and natural language tasks. However, their adaptability also makes them prone to overfitting without careful regularisation.
Various neural architectures now exist, able to handle different data types like sequences (RNNs), images (CNNs) etc. Researchers actively develop new state-of-the-art models, but core principles remain similar.
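To show what "layers transforming inputs" means, here's a small sketch of a forward pass through a two-layer network in plain Python. The weights are hand-picked for illustration (not learned) so that the network computes XOR, a classic relationship that no single straight line can capture. The names relu, dense and forward are my own.

```python
def relu(v):
    """Common activation: pass positive signals through, block negative ones."""
    return max(0.0, v)

def dense(inputs, weights, biases, activation):
    """One fully connected layer: a weighted sum per neuron, then an activation."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

# Hand-picked (not learned) weights for a 2-2-1 network that computes XOR.
hidden_w, hidden_b = [[1.0, 1.0], [1.0, 1.0]], [0.0, -1.0]
output_w, output_b = [[1.0, -2.0]], [0.0]

def forward(x):
    hidden = dense(x, hidden_w, hidden_b, relu)               # intermediate representation
    return dense(hidden, output_w, output_b, lambda v: v)[0]  # linear output layer

outputs = {x: forward(x) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]}
```

Training replaces the hand-picking: the network nudges these connection strengths, step by step, to shrink the gap between its outputs and the labels.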
Where Neural Networks Shine
- Modelling complex nonlinear relationships
- Pattern recognition in images, text and sound
- Continued innovation in state-of-the-art architectures
Use Cases: Image recognition, natural language processing, speech recognition
5. Support Vector Machines (SVMs)
Support vector machines identify optimal lines or hyperplanes to separate classes in n-dimensional space (with n being the number of features).
Let's explain with a simple 2D example. Say you have data points belonging to two different classes:
Figure: SVM classifier separating data points with straight line (Source: Wikimedia)
The SVM algorithm finds a boundary line maximizing the margins between the classes. This helps correctly classify new points based on which side of the line they fall on.
Kernel functions implicitly map data to higher-dimensional spaces where a separating hyperplane becomes feasible. Overall, SVMs enable effective classification with straightforward implementations.
However, training slows considerably as the dataset grows. SVMs also struggle when the classes overlap heavily, although a soft margin tolerates some misclassified points. Research in areas such as quantum computing explores potential remedies for these weaknesses.
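Here's a minimal sketch of what a trained linear SVM looks like in 2D. The weights w and bias b are hypothetical values chosen by hand; a real SVM would find them by solving an optimisation problem, which is not shown here.

```python
import math

# Hypothetical weights for a 2D linear SVM; training would normally find them.
w = (1.0, 1.0)
b = -3.0

def decision(x):
    """Signed score: positive on one side of the boundary, negative on the other."""
    return w[0] * x[0] + w[1] * x[1] + b

def classify(x):
    return 1 if decision(x) >= 0 else -1

# The closest training points (the support vectors) satisfy |decision(x)| = 1.
support_vectors = [(1.0, 1.0), (2.0, 2.0)]

# Width of the gap between the two classes: 2 / ||w||.
margin_width = 2 / math.hypot(*w)
```

Among all lines that separate the classes, the SVM prefers the one with the widest margin_width, which is why only the points nearest the boundary end up mattering.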
Where SVMs Shine
- Good complex dataset classification performance
- Memory efficient at prediction time, since only the support vectors are needed
- Versatile kernel functions
Use Cases: Bioinformatics (gene classification), text and image recognition
6. Logistic Regression for Classification
Despite its regression-sounding name, logistic regression is actually used for classification problems. For example, predicting whether an email is spam (1) or not spam (0).
It provides probability scores describing whether cases belong to particular outcomes. The scores lie between 0 and 1, indicating lowest to highest probability of class membership.
A key component is the sigmoid logistic function which converts scores to probabilities:
Figure: How logistic regression converts scores to probabilities using the logistic function (Source: Wikimedia)
The ease of interpreting outputs as class probabilities makes logistic regression popular. It trains quickly and often works well without feature scaling, although scaling helps when regularization is used.
However, performance can deteriorate when there are many features but few samples. Regularization helps control overfitting.
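The whole prediction step fits in a few lines. In this sketch the weights, the bias, the feature names and the 0.5 cut-off are all hypothetical; in practice the weights are learned from labeled data.

```python
import math

def sigmoid(z):
    """Logistic function: squashes any score into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical learned weights for two made-up churn features.
weights = {"support_tickets": 0.8, "monthly_logins": -0.1}
bias = -1.0

def churn_probability(customer):
    # Linear score first, then the sigmoid turns it into a probability.
    score = bias + sum(weights[name] * customer[name] for name in weights)
    return sigmoid(score)

p = churn_probability({"support_tickets": 4, "monthly_logins": 2})  # about 0.88
will_churn = p >= 0.5
```

A score of 0 maps to exactly 0.5, so the sign of the linear score decides which side of the cut-off a customer lands on.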
Where Logistic Regression Shines
- Fast training and prediction
- Intuitive probability outputs
- Feature scaling often not required
Use Case: Predict whether customers will churn based on usage data
7. K-Nearest Neighbors (KNN)
The KNN algorithm makes predictions for new data points based on their similarity to points in the training data. For classification, a data point is assigned the label that is most frequent among its K nearest neighbors.
KNN also handles regression. For example, when estimating house prices with K = 5, KNN finds the 5 most comparable houses with known prices based on features like size and location, then averages their prices to estimate the price of the new house.
The similarity metric depends on the problem. Common metrics include:
- Euclidean distance
- Cosine similarity
Performance depends heavily on choosing an appropriate value for K and distance metric. KNN struggles with very large training sets, because every prediction searches the stored data, and with high numbers of features. A larger K smooths predictions and helps control overfitting.
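Both flavours of KNN can be sketched in a few lines of plain Python. The houses and pets data below are made up, Euclidean distance is assumed, and the features are left unscaled to keep the example short (in practice you would usually scale them so that no single feature dominates the distance).

```python
import math
from collections import Counter

def nearest(train, query, k):
    """Return the k training rows closest to query by Euclidean distance."""
    return sorted(train, key=lambda row: math.dist(row[0], query))[:k]

def knn_classify(train, query, k=3):
    # Classification: majority vote among the k nearest labels.
    votes = Counter(label for _, label in nearest(train, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train, query, k=3):
    # Regression: average the k nearest target values.
    neighbours = nearest(train, query, k)
    return sum(target for _, target in neighbours) / k

# Hypothetical houses: (size in square metres, km to centre) -> price in thousands.
houses = [((50, 5), 200), ((60, 4), 240), ((55, 6), 210),
          ((120, 2), 500), ((130, 1), 560)]
estimate = knn_regress(houses, (58, 5), k=3)

# Hypothetical labeled points for classification.
pets = [((1, 1), "cat"), ((1, 2), "cat"), ((2, 1), "cat"),
        ((8, 8), "dog"), ((9, 8), "dog")]
label = knn_classify(pets, (2, 2), k=3)
```

There is no training step at all: the "model" is the stored data, which is why prediction gets slow as the dataset grows.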
Where KNN Shines
- Effective on small to medium-sized datasets with few features
- Simple implementation
Use Cases: Recommender systems, image recognition
8. Gradient Boosting Machines
Like random forests, gradient boosting produces ensembles of decision trees combined to yield robust predictive performance. Trees are added sequentially, each one correcting its predecessor.
At each step, the residual errors of the current ensemble are computed. Then a new tree is fitted to those errors and added to the ensemble. Repeatedly adding trees that focus on the residuals steadily reduces bias, while a small learning rate helps keep variance in check.
The resulting model usually excels on tabular datasets. However, implementation complexity and computational expense grow substantially with deeper trees.
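Here's a bare-bones sketch of the residual-fitting loop for regression with squared error, using one-split trees as the weak learners. The data, the learning rate of 0.5 and the names fit_stump, boost and mse are assumptions for illustration; production libraries add many refinements on top of this.

```python
def fit_stump(xs, residuals):
    """Best single split on x that minimises squared error of the residuals."""
    best = None  # (error, threshold, left_mean, right_mean)
    points = sorted(set(xs))
    for lo, hi in zip(points, points[1:]):
        t = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    base = sum(ys) / len(ys)  # start from the mean prediction
    trees = []

    def predict(x):
        return base + learning_rate * sum(tree(x) for tree in trees)

    for _ in range(n_rounds):
        # Residuals: what the current ensemble still gets wrong.
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        # The next tree targets exactly those errors.
        trees.append(fit_stump(xs, residuals))
    return predict

def mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Made-up 1-D regression data.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.5, 3.0, 3.5, 6.0, 6.5]
baseline = lambda x: sum(ys) / len(ys)
model = boost(xs, ys)
```

Comparing mse(model, xs, ys) with mse(baseline, xs, ys) shows how much the sequence of small corrections improves on simply predicting the mean.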
Where Gradient Boosting Shines
- Powerful predictive accuracy combining weak decision tree models
- Reduces bias, with a small learning rate helping to control variance
- Works well with tabular data
Use Cases: Fraud detection, sales forecasting, ad click-through-rate prediction
9. Linear Discriminant Analysis (LDA)
Linear discriminant analysis (LDA) finds linear combinations of features which separate classes. The combinations form a discriminant function that splits classes most effectively.
For example, LDA could generate a function using the expression:
y = 2x1 + 3x2 - 5x3
Here, x1, x2 and x3 represent numerical input features about each data point. The equation finds weightings that, when multiplied with input values and summed, output the highest y values for Class 1 and the lowest y values for Class 2.
Viewed this way, LDA focuses on separating the classes, although the same rule can also be derived by modelling each class as a Gaussian with a shared covariance matrix. The simplicity of linear decision surfaces limits overfitting but also limits flexibility.
Regularization can enhance resilience to noise when dealing with more complex problems. Neural networks extend the concept to non-linear functions between layers.
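Applying a discriminant function is just a weighted sum followed by a cut-off. The sketch below uses the example coefficients 2, 3 and -5; the threshold of 0 and the two sample points are assumptions for illustration, since in practice both the weights and the cut-off come from the training data.

```python
weights = (2.0, 3.0, -5.0)  # coefficients from the example expression
threshold = 0.0             # assumed cut-off between the two classes

def discriminant(x):
    """y = 2*x1 + 3*x2 - 5*x3 for a point x = (x1, x2, x3)."""
    return sum(w * xi for w, xi in zip(weights, x))

def classify(x):
    return "Class 1" if discriminant(x) > threshold else "Class 2"

score_a = discriminant((3.0, 2.0, 1.0))  # 6 + 6 - 5 = 7
score_b = discriminant((1.0, 1.0, 2.0))  # 2 + 3 - 10 = -5
```

High scores land in Class 1 and low scores in Class 2, so the whole classifier boils down to one straight cut through the feature space.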
Where Linear Discriminant Analysis Shines
- Computationally lightweight
- Resists overfitting
- Extendable to kernelized and neural network versions
Use Case: Facial recognition technology
Putting It All Together
You made it! We've covered a lot of ground on the 9 most popular supervised learning algorithms.
Now you know the core strengths and weaknesses of each method.
Every algorithm can serve a purpose depending on the problem. Start simple and move to more complex approaches only as needed. For example, try logistic regression before neural networks.
Many guides focus on mathematical details. But I hope explaining the intuition in plain English gives you confidence to go apply supervised learning!
Let me know in the comments if you have any other questions. Happy learning!