Stereo vision is traditionally done with two cameras that capture images simultaneously. Shivang Agarwal and I decided to work on single-view stereo, or monocular stereo vision, for swarm robotics as the starting point of something we call “Distributed Vision”. More about distributed vision in a future post.

## The Problem

Traditionally, stereo vision (generating disparity maps) uses two
synchronized cameras that take pictures at the same time. A one-time
camera calibration tells us how far apart the cameras are (the translation
vector **T**) and how they are oriented with respect to each other (the
rotation matrix **R**). We want to achieve the same thing with only one
camera: take a frame, move the camera, and take another frame. We then
have two images and can, in principle, apply stereo vision. The catch is
that **R** and **T** now change with every pair of images. The usual
chessboard calibration does not help either, because it relies on known
real-world distances, which we don’t have here. Without **R** and **T**
we cannot compute the fundamental matrix **F** from the calibration, so
we need another way to find **F**, **R** and **T**.

## Mathematical Theory

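Let the vector **x** represent a point in the left image and the vector
**x’** the corresponding point in the right image, both in homogeneous
coordinates. The relation between them is:

\begin{equation} x'^T F x = 0 \end{equation}

Let this be equation *(1)*.

Also let the matrix **F** be:

\begin{equation} F = \begin{bmatrix} a & b & c \\ d & e & f \\ g & h & i \end{bmatrix} \end{equation}

So as of now, the problem boils down to finding a, b, c, d, e, f, g, h, i.

### Number of distinct points required

As equation *(1)* suggests, if we have the coordinates of many points
in the left image and their corresponding points in the right image, we can
work out the matrix **F**. Let’s start with a single point in the left image

\begin{equation} x = \begin{bmatrix} u & v & 1 \end{bmatrix}^T \end{equation}

and the corresponding point in the right image

\begin{equation} x' = \begin{bmatrix} u' & v' & 1 \end{bmatrix}^T \end{equation}

On plugging these values into equation *(1)*, we get

\begin{equation} \begin{bmatrix} u' & v' & 1 \end{bmatrix} \begin{bmatrix} a & b & c \\ d & e & f \\ g & h & i \end{bmatrix} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = 0 \end{equation}

After simplifying, we get

\begin{equation} a u' u + b u' v + c u' + d v' u + e v' v + f v' + g u + h v + i = 0 \end{equation}

where we have 9 unknowns but only one equation. If we take 9 points, we get 9 such equations, which can be written as

\begin{equation}
\begin{bmatrix}
u'_1 u_1 & u'_1 v_1 & u'_1 & v'_1 u_1 & v'_1 v_1 & v'_1 & u_1 & v_1 & 1 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
u'_9 u_9 & u'_9 v_9 & u'_9 & v'_9 u_9 & v'_9 v_9 & v'_9 & u_9 & v_9 & 1
\end{bmatrix}
\begin{bmatrix} a \\ b \\ c \\ d \\ e \\ f \\ g \\ h \\ i \end{bmatrix} = 0
\end{equation}

This is equation *(2)*.

which is of the form $A\mathbf{f} = 0$, where $\mathbf{f} = (a, b, \dots, i)^T$. For this system we must avoid the trivial solution in which every variable is zero. Nine independent equations would leave only that trivial solution. Because the system is homogeneous, **F** is defined only up to a scale factor, so just 8 of the 9 unknowns are independent and we only require 8 points.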

Note: We will use SVD to solve this system of equations. While 8 points are the minimum, using more points generally gives a more accurate estimate.
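The linear system above can be sketched in a few lines of NumPy. This is only a minimal illustration under stated assumptions: the function name `eight_point` is made up for this post, the constraint is assumed to be written as $x'^T F x = 0$ with the unknowns a to i ordered row by row, and the coordinate normalization that a practical implementation would usually add is left out.

```python
import numpy as np

def eight_point(left_pts, right_pts):
    """Estimate F (up to scale) from N >= 8 matches given as (N, 2) arrays."""
    left = np.asarray(left_pts, dtype=float)
    right = np.asarray(right_pts, dtype=float)
    ones = np.ones(len(left))
    xl = np.column_stack([left, ones])    # homogeneous left points  x
    xr = np.column_stack([right, ones])   # homogeneous right points x'
    # Row n holds the coefficients of (a, ..., i) in x'_n^T F x_n = 0.
    A = np.einsum("nj,nk->njk", xr, xl).reshape(-1, 9)
    # The unit vector minimising |A f| is the right singular vector that
    # belongs to the smallest singular value, i.e. the last row of vt.
    _, _, vt = np.linalg.svd(A)
    return vt[-1].reshape(3, 3)
```

With exactly 8 points in general position the returned matrix satisfies all constraints; with more points it is the least-squares solution with unit norm.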

### Getting the value of Fundamental matrix F

Thus we can obtain the solution using just 8 points, which gives us the values of a, b, c, d, e, f, g, h and i up to a common scale factor. Please note that the fundamental matrix must have rank 2, i.e. $\det(F) = 0$.
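An estimate computed from noisy points will generally not have rank 2 exactly. One common way to enforce the constraint, sketched here with NumPy, is to zero the smallest singular value. The helper name `enforce_rank2` is an assumption made for illustration.

```python
import numpy as np

def enforce_rank2(F):
    """Return the rank-2 matrix closest to F in the Frobenius norm."""
    u, s, vt = np.linalg.svd(F)
    s[2] = 0.0  # drop the smallest singular value
    return u @ np.diag(s) @ vt
```

The result has determinant zero up to floating-point error, so all epipolar lines meet in a single epipole.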

### Getting the Essential Matrix E

The fundamental matrix **F** relates a point in the left image to the
corresponding point in the right image in 2D pixel coordinates, while the
essential matrix **E** expresses the same relationship in calibrated
(normalized) camera coordinates, which ties it to the 3D motion of the
camera. In traditional stereo vision, the fundamental matrix is obtained
from the essential matrix using

\begin{equation} F = K^{-T} E K^{-1} \end{equation}

This is equation *(3)*.

where **K** is the camera calibration matrix, which is the same for both
images because we use a single camera. Rearranging equation *(3)* gives
the value of **E**:

\begin{equation} E = K^T F K \end{equation}

This is equation *(4)*.

Note that the essential matrix also has rank 2, i.e. $\det(E) = 0$.
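As a rough NumPy sketch of this step, assuming a single known calibration matrix `K` for both frames: the first helper applies $E = K^T F K$, and the second is an optional clean-up that is commonly applied to noisy estimates, forcing the singular values to (1, 1, 0). Both function names are hypothetical.

```python
import numpy as np

def essential_from_fundamental(F, K):
    """E = K^T F K for one camera with calibration matrix K."""
    return K.T @ F @ K

def project_to_essential(E):
    """Replace the singular values of E by (1, 1, 0).

    A valid essential matrix has two equal singular values and one zero,
    and its overall scale is arbitrary.
    """
    u, _, vt = np.linalg.svd(E)
    return u @ np.diag([1.0, 1.0, 0.0]) @ vt
```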

### Calculate Projection

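Let $P_1$ and $P_2$ be the projection matrices of the two images.
Now taking $P_1$ as our reference, we can write $P_2$ in
terms of the rotation matrix **R** and the translation **T**.

\begin{equation} P_1 = K \left[ I \mid 0 \right] \end{equation}

This is equation *(5)*.

\begin{equation} P_2 = K \left[ R \mid T \right] \end{equation}

This is equation *(6)*.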
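A small NumPy sketch of the two projection matrices, assuming the first camera is the reference frame, so that $P_1 = K[I \mid 0]$ and $P_2 = K[R \mid T]$. The helper names and any numeric values used with them are illustrative assumptions.

```python
import numpy as np

def projection_matrices(K, R, T):
    """Return P1 = K [I | 0] and P2 = K [R | T] as 3x4 arrays."""
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, np.reshape(T, (3, 1))])
    return P1, P2

def project(P, X):
    """Project the 3D point X with the 3x4 matrix P to pixel coordinates."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]
```

Projecting the same 3D point with both matrices gives one synthetic correspondence, which is handy for checking the later steps against known values of **R** and **T**.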

### Relation between E, R and T

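Now that we have **E**, we can decompose it into **R** and **T**. The relation
between these is:

\begin{equation} E = [T]_\times R \end{equation}

This is equation *(7)*.

Note: Here $[T]_\times$ is the skew-symmetric matrix of **T**, defined so that $[T]_\times v = T \times v$. In other words, we take the cross product of **T** with each column of **R**.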
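The cross product with a matrix can be made concrete with a short NumPy sketch. It assumes the relation is written as $E = [T]_\times R$, where $[T]_\times$ is the skew-symmetric matrix of **T**; the function names are hypothetical.

```python
import numpy as np

def skew(t):
    """Skew-symmetric matrix [t]_x, so that skew(t) @ v == np.cross(t, v)."""
    tx, ty, tz = t
    return np.array([[0.0, -tz, ty],
                     [tz, 0.0, -tx],
                     [-ty, tx, 0.0]])

def essential_from_pose(R, T):
    """Build E = [T]_x R from a rotation matrix R and a translation T."""
    return skew(T) @ R
```

Since $T \times T = 0$, the product `T @ essential_from_pose(R, T)` is the zero vector, which is the property used to recover **T** from **E**.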

### Reconstructing T

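This can be visualized with the help of epipolar geometry. The line
joining the two camera locations (the baseline) and the rays from both
camera locations to a 3D point form a plane, which cuts each image in an
epipolar line. As the 3D point moves, this plane rotates about the
baseline, and all the epipolar lines of an image pass through a single
point, the **epipole**, which is where the baseline meets that image
plane. With both epipoles we therefore know the direction of the vector
**T**. Because the epipole is the projection of one camera location onto
the image plane of the other camera, we can say that, in normalized
coordinates and up to scale, the epipole $e'$ in the right image is

\begin{equation} e' \sim T \end{equation}

This is equation *(8)*.

Using the principles of traditional stereo vision, we can also say that

\begin{equation} e'^T E = 0 \end{equation}

This is equation *(9)*.

which means that the left null space of the essential matrix is the
epipole in the right image. Thus we can decompose **E** using SVD as

\begin{equation} E = U \Sigma V^T \end{equation}

This is equation *(10)*.

where

\begin{equation} \Sigma = \mathrm{diag}(\sigma, \sigma, 0) \end{equation}

and thus we can say that $T = u_3$ or $T = -u_3$, where $u_3$ is the last column of **U**. These are the two possible values. The length of **T** cannot be recovered from the images alone, so **T** is known only up to scale.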

### Reconstructing R

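Once the value of **T** has been extracted from **E**, we can
show using some linear algebra that

\begin{equation} R = U W V^T \end{equation}

This is equation *(11)*.

or

\begin{equation} R = U W^T V^T \end{equation}

This is equation *(12)*.

where **U** and **V** are the matrices from the SVD of **E** and

\begin{equation} W = \begin{bmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix} \end{equation}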
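The decomposition can be sketched with NumPy as follows. This is a minimal illustration that assumes $E = [T]_\times R$ and the usual 90-degree rotation matrix $W$ about the z-axis; the function name is made up, and the sign flips are there because **E** is only defined up to sign, so the SVD factors can always be chosen with determinant +1.

```python
import numpy as np

W = np.array([[0.0, -1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])

def decompose_essential(E):
    """Return the four (R, T) candidates encoded by E, with |T| = 1."""
    u, _, vt = np.linalg.svd(E)
    # Flip signs if needed so that both candidate rotations have det = +1.
    if np.linalg.det(u) < 0:
        u = -u
    if np.linalg.det(vt) < 0:
        vt = -vt
    t = u[:, 2]  # left null vector of E, known up to sign and scale
    rotations = (u @ W @ vt, u @ W.T @ vt)
    return [(R, sign * t) for R in rotations for sign in (1.0, -1.0)]
```

The true pose is among the four returned pairs; picking it out requires the depth test on triangulated points.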

### Removing Extraneous Values

Since we get two values each for **R** and **T**, we have a total of 4
combinations, of which only one is valid. It is interesting to note
what these combinations represent. The correct one places the
reconstructed point in front of both cameras. The others place the point
behind one of the cameras or behind both. To tell them apart, we compute
the 3D location of a point (this is called **point triangulation**) for
each combination. The correct solution is the one for which the
z-coordinate (depth) of the point is positive both in the left camera
frame, which is our reference, and in the right camera frame. For the
other three, the depth is negative in at least one of them.

### Point Triangulation

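We can get the point triangulation as follows. A 3D point **X** in homogeneous coordinates projects to $x$ in the left image and $x'$ in the right image through the projection matrices $P_1$ and $P_2$, so up to scale

\begin{equation} x \sim P_1 X, \qquad x' \sim P_2 X \end{equation}

This is equation *(13)*.

This implies that the cross product of the two vectors in each relation must be zero. Thus,

\begin{equation} x \times (P_1 X) = 0 \end{equation}

This is equation *(14)*.

And,

\begin{equation} x' \times (P_2 X) = 0 \end{equation}

This is equation *(15)*.

Solving these linear equations for **X** gives the x, y and z coordinates of each point. We can then discard the invalid solutions by checking the z-coordinate as described in the section above.

Note: To get more accurate values, we can triangulate multiple points and solve the equations in the least-squares sense.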
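A minimal NumPy sketch of linear triangulation for a single point, assuming 3x4 projection matrices `P1` and `P2` and pixel coordinates `x1`, `x2` of the same point in the two images. The function name is hypothetical, and this is the plain linear method without any non-linear refinement.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear triangulation of one point from its two image positions."""
    # x × (P X) = 0 contributes two independent rows per view.
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # Least-squares solution of A X = 0 with |X| = 1.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]  # back from homogeneous coordinates
```

The returned point is expressed in the frame of the first camera, so its third component is the depth used in the front-or-behind test.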

## Actual Implementation

### Getting corresponding feature points

As we said above, we can find the fundamental matrix from corresponding points in the two images, so we first need to find those points. Technically 8 points are enough to determine the fundamental matrix, but since real-world camera measurements are noisy, we take as many points as possible. We then solve the resulting linear least-squares problem with SVD and wrap it in RANSAC to reject bad matches; both work better with more points.

To get those points, we can use a feature algorithm such as **SIFT
(Scale-Invariant Feature Transform)**, **ORB (Oriented FAST and Rotated
BRIEF)** or **DAISY**. SIFT is generally more accurate but needs more
computational resources. Since we are dealing with a real-time problem,
it is better to use **ORB**, which is fast, reliable, and comes with
OpenCV, unlike DAISY.

### Why not cv::stereoCalibrate()

This function could have done all of the work of calculating the **R**,
**T**, **F**, **K** and **E** matrices for us, but unfortunately we can’t
use it: it needs the coordinates of the corresponding points in the
actual 3D world, which we don’t know. In traditional stereo vision with
two cameras this is not a problem, since we use a chessboard whose
real-world dimensions are known, and we calibrate only once. For our
problem, we would need to calibrate again and again for every pair of
frames, and we would never know the actual 3D coordinates.

Instead, we use cv::findFundamentalMat(), which takes two input arrays
holding the corresponding points in the left and the right image and
gives us the fundamental matrix **F**. With the RANSAC method selected,
it estimates **F** robustly and cleans it up so that the result has
rank 2, as we require.

### Decomposing F into E, R and T

This is the most challenging part of this problem. We now have **F** but we
don’t have **E**, **R** or **T**. OpenCV did not offer a built-in method
for this when we wrote our implementation (more recent releases include
cv::recoverPose() and cv::decomposeEssentialMat()), so we had to write
it ourselves. Since we know the mathematics behind it, we can use the
linear algebra routines in OpenCV to compute it. Because we have many
points and not just one, we again use SVD to solve the various matrix
equations, which gives us an approximate (least-squares) solution. While
doing this we have to be careful about the data format we choose,
because it has to be compatible with the other OpenCV functions that we
intend to use afterwards to simplify our job.

**Note:** This gives us 4 combinations of **R** and **T**. We eliminate the
invalid ones by triangulating points.

### Removing extra points by triangulation

As we mentioned above, we get 4 combinations of **R** and **T** while we only
require one. We choose a point and triangulate it with respect to both
image planes so that we know whether it lies in front of them or behind
them. To do this there is a function named cv::triangulatePoints(),
which reconstructs points by triangulation. We have described the maths
behind it above. It takes the projection matrices $P_1$ and $P_2$ and
the corresponding points of each image in two separate arrays, and
outputs the 3D points as a 4×N array in which the last row is the
homogeneous coordinate. We can now see which combination puts the points
in front of both cameras, keep that particular value of **R** and **T**,
and discard the rest.
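The front-or-behind test on the triangulated output can be sketched in NumPy as below. It assumes the points come as a 4×N homogeneous array expressed in the frame of the first camera, and that the second camera sees $X_2 = R X_1 + T$; the function name is hypothetical.

```python
import numpy as np

def count_in_front(points_4xn, R, T):
    """Count the points that have positive depth in both cameras."""
    X1 = points_4xn[:3] / points_4xn[3]        # divide out the last row
    X2 = R @ X1 + np.reshape(T, (3, 1))        # same points, second camera
    return int(np.sum((X1[2] > 0) & (X2[2] > 0)))
```

Running this for each of the four combinations, each with the points triangulated from its own $P_2$, and keeping the combination with the highest count is more robust to noisy matches than testing a single point.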

### Image Rectification parameters

We now have all the values that we require, but the images still have finite epipoles. This means that the images aren’t exactly “straight”, so we need to rectify them so that the epipoles move to infinity and the epipolar lines become horizontal. The benefit is that we only need to search for the corresponding point along the same row (the x-direction) instead of searching the whole image. Luckily, OpenCV has a function that can do this for us, namely cv::stereoRectify(),
which gives us a rectification rotation ($R_1$, $R_2$) and a new
projection matrix ($P_1$, $P_2$) for each image. Better still, this
function also returns the matrix **Q**, which relates the disparity map
to the depth map and which we will need later to show the image in 3D.

### Rectified Image

Now that we have the rectification parameters, we have to warp the images according to them. Luckily, OpenCV covers this as well: cv::initUndistortRectifyMap() builds the pixel map between the original and the rectified image, and cv::remap() applies that map, moving the initial point coordinates to their final destination coordinates.

### Generating disparity map with Block Matching

Now we can finally use OpenCV’s block matcher, cv::StereoBM, which operates only on rectified images, to generate the disparity map for us.

### Getting the depth map

Since we now have the disparity map and the **Q** matrix, we can
compute the depth map. Luckily there is an OpenCV function for this,
namely cv::reprojectImageTo3D().
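What the reprojection with **Q** does can be sketched in NumPy: every pixel $(x, y)$ with disparity $d$ is mapped through $[X\ Y\ Z\ W]^T = Q\,[x\ y\ d\ 1]^T$ and then divided by $W$. The function name below is made up, and this sketch ignores the handling of invalid disparities that a real pipeline needs.

```python
import numpy as np

def reproject_to_3d(disparity, Q):
    """Map an (h, w) disparity image to an (h, w, 3) array of 3D points."""
    h, w = disparity.shape
    xs, ys = np.meshgrid(np.arange(w, dtype=float), np.arange(h, dtype=float))
    pix = np.stack([xs, ys, disparity, np.ones_like(xs)], axis=-1)  # (h, w, 4)
    xyzw = pix @ Q.T                      # applies Q to every pixel vector
    with np.errstate(divide="ignore", invalid="ignore"):
        return xyzw[..., :3] / xyzw[..., 3:4]
```

For a simplified **Q** built from an assumed focal length $f$, principal point $(c_x, c_y)$ and baseline $b$, the third component reduces to the familiar $Z = f\,b / d$. In practice the **Q** returned by the rectification step should be used.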

You can find the source code for all of this on GitHub.
