Algorithms from scratch: RANSAC
RANSAC stands for Random Sample Consensus. In my opinion, it is the best type of algorithm: simple but very powerful and useful. It is especially suited for fitting models when a dataset contains a high number of outliers (e.g. half of the points, or even more). The RANSAC method itself is very general, and it can be used in various use cases: curve fitting, classification, state estimation, number of computer vision tasks, and so on.
There is a lot of good material out there about RANSAC method itself. For example, 5 minutes with Cyrill, or a bit longer video from Cyrill are a great introduction to the method and its applications. This blog post will be more focused on doing the implementation work. My goal is to make implementation general and flexible enough so that it can be used in various kinds of problems. You can find the code from my GitHub repo.
Algorithm
RANSAC is an iterative method to estimate the parameters of a model. Many RANSAC variants have been developed since 1981 when the first RANSAC method was published. At least most of the variants follow the following pseudo-code logic:
Initialize solution
Loop until termination criteria are filled:
- Sample a small number of points from the original dataset
- Fit model with sample dataset
- Based on the fit, divide points in the original dataset as inliers and outliers
- Check the validity of solution; skip to next iteration if not valid (optional)
- Calculate score for solution
- If the score is best so far, update the final inliers
Calculate final model with all the final inliers included (optional)
return final inliers
The algorithm is illustrated with the figure below:
Checking the validity of the solution might serve at least two purposes:
- Filter out solutions we are not interested in (e.g. we want to fit only vertical planes and a solution candidate is a horizontal plane)
- To quickly reject bad candidates (e.g. just based on inliers found), as score calculation might be an expensive operation
In vanilla RANSAC
- The sample is taken randomly
- The division between inliers and outliers is done by model errors for which we have fixed threshold value
- The score is simply just the number of inliers
Implementation
Ok, now we got the algorithm figured out, so let’s do some coding.
Here is my thinking. Basically, we have some data and some sort of model. We want to draw samples from data, either randomly or with certain indices. Model is something that needs to be fitted, with the data. Then we need something that will calculate errors for us so that we can divide data into inliers and outliers.
As data and models come in various forms, I decided to have abstract interfaces for those that need to be implemented by the user. Data has methods to access data with indices. The model has fit method to solve the model parameters and calculate_errors method to divide points to inliers and outliers after fit.
When the model and data are in place, defined by the user, we can create general solver that takes these as an input, together with some parameters, and returns inlier indices. This way, we can decouple the core RANSAC algorithm from use-case specific details.
As there are plenty of RANSAC variants, and you could also develop new one by yourself, I decided to have Solver base class that captures the basic logic of RANSAC methods, but uses abstract methods for implementation details. These abstract methods are then implemented by different variants of the algorithm.
Example use case: curve fitting
As an simple example, let’s use RANSAC together with powerful scipy curve fitting method. The scipy curve fitting function itself is flexiple multi-purpose curve fitting method but it doesn’t handle outliers well; let’s use RANSAC for that. First, let’s define needed Data and Model classes:
After that it is easy to run RANSAC solver with any desider fit function, for example like this:
Here is a comparison with regular curve fit and curve fit with RANSAC, using simulated data with some outliers:
Full curve fitting example can be found from GitHub repo. The repo also contains various other different kind of samples with outliers: plane fit by minimizing orthogonal distances to points, classification with high number of mislabeled data and matching set of points with Iterative closest point (ICP) algorithm.
Summay
In this post I showed how to implement well-known RANSAC algorithm, with non-linear curve fitting as an usage example. In my implementation, I tried to isolate the core RANSAC algorithm, so that it can be used easily in various use cases (with different kind of datasets and model fitting methods). Hopefully you find it useful. As always, any comments are welcome.