
Teaching Machines to Locate Objects in Photos

How to Create a Holographic Display and Camcorder

In the last part of the series "How to Create a Holographic Display and Camcorder", I talked about how to use the cameras to calculate the disparity between photos.  To do so, we have to locate the same objects in the two photos, and this should be done by a machine.

In this part of the series, I'll talk about how to teach a machine to locate objects in photos.




To calculate the depth information or the disparity of an object, we need to locate where the object is in each photo.

[Insert an illustration of an object and a camera translated along the X-axis]


How to locate an object in each photo?
In each photo, we need to find the same object.  Then, we calculate the disparity between the object's position in the first photo and its position in the second photo.

So, how do we locate the same object in each photo?

Let's say we want to locate the tip of a cat's left ear in two photos.  Each photo shows the same cat, but at a different location.

We could teach a machine to recognize a cat in a photo, in general, by using machine learning, but such detectors only find the cat with about a 70% confidence rate, and it takes time to process each image.

I'll write about machine learning in another blog post.

We need a fast locating algorithm that finds the tip of the cat's ear almost every time.


Using a gradient value between a pixel and its neighboring pixels? 

Going back to the basics of image pixels, we can mark a point as interesting whenever there is a steep difference between the intensity value of the current pixel and the intensity values of its neighboring pixels.

This is called taking a gradient value between pixels.

Shouldn't this give us nice locating points in an image?
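To make the idea concrete, here is a minimal sketch of gradient-based locating.  It assumes OpenCV and NumPy are installed; the file name photo_a.jpg is only a placeholder, not part of this project.

import cv2
import numpy as np

# Load the photo as a grayscale intensity image (placeholder file name).
gray = cv2.imread("photo_a.jpg", cv2.IMREAD_GRAYSCALE)

# Approximate horizontal and vertical intensity gradients with Sobel filters.
grad_x = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
grad_y = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)

# The gradient magnitude is large wherever the intensity changes steeply
# between a pixel and its neighbors.
magnitude = np.sqrt(grad_x ** 2 + grad_y ** 2)

# Keep only the pixels in the top 1% of gradient magnitude as candidate
# locating points (an arbitrary threshold for this sketch).
threshold = np.percentile(magnitude, 99)
candidates = np.argwhere(magnitude > threshold)  # (row, col) coordinates
print(f"{len(candidates)} candidate locating points")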

There are multiple issues with this approach.  In this example, we are using photos that differ only by a one-axis translation.

But in reality, the object in photo A may be at an entirely different location in photo B.  The object may also look bigger, rotated, or viewed from a different angle.

So if we take only raw gradient values, they would no longer give us the same reliable locations once the object in photo B is rotated.


Using a pixel corner as a locating point in an image

To deal with the rotation issue, we can use an L-shaped gradient, i.e., a corner.  Under any rotation, an L is still an L.  Such a point is called an interest point.
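One common way to detect such corner-like points is the Harris detector.  Here is a minimal sketch, again assuming OpenCV and NumPy and the placeholder file photo_a.jpg; the thresholds are arbitrary choices for illustration.

import cv2
import numpy as np

gray = cv2.imread("photo_a.jpg", cv2.IMREAD_GRAYSCALE)

# Harris response: large where the local gradients form an "L" (a corner)
# rather than a straight edge or a flat region.
harris = cv2.cornerHarris(np.float32(gray), blockSize=2, ksize=3, k=0.04)

# Treat strong responses as interest points.
interest_points = np.argwhere(harris > 0.01 * harris.max())  # (row, col)
print(f"{len(interest_points)} interest points found")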






How to find the same locating points in other photos?

Okay, so we locate multiple interest points in photo A.  How do we find the same interest points in photo B?

Let's say we have an interest point X in photo A.

And we want to find the same interest point X' in photo B.

How do we know if they are indeed identical?

One way is to take the neighboring pixel values around interest point X and compare them with the neighboring pixel values around interest point X'.

These neighboring pixel values around an interest point X are called a descriptor.

A locating point is called an interest point.

An interest point together with its descriptor is called a feature.


  • Interest Point
    • A locating point is called an interest point.
  • Descriptor
    • The neighboring pixel values around an interest point are one example of a descriptor (see the sketch after this list).
  • Feature
    • An interest point together with its descriptor is called a feature.
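As a toy illustration of the pixel-value descriptor above, here is a minimal sketch.  The names photo_a, photo_b, x, and x_prime are hypothetical inputs, and the patch radius is an arbitrary choice, not something fixed by this project.

import numpy as np

def patch_descriptor(image, point, radius=8):
    # Return the (2*radius+1) x (2*radius+1) pixel patch around `point`,
    # flattened into a vector of intensity values.
    r, c = point
    patch = image[r - radius:r + radius + 1, c - radius:c + radius + 1]
    return patch.astype(np.float64).ravel()

def descriptor_distance(desc_a, desc_b):
    # Sum of squared differences: small means the two patches look alike.
    return np.sum((desc_a - desc_b) ** 2)

# Compare interest point X in photo A against a candidate X' in photo B:
# distance = descriptor_distance(patch_descriptor(photo_a, x),
#                                patch_descriptor(photo_b, x_prime))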

There are actually multiple ways to define a descriptor.  For this project of creating a Holographic Display and a Holographic Camcorder, we have other options.



Scale-Invariant Feature Transform (SIFT)
There are multiple feature-locating algorithms to consider:

  • SIFT (Scale-Invariant Feature Transform)
  • FAST (Features from Accelerated Segment Test)
  • HOG (Histogram of Oriented Gradients)
  • SURF (Speeded-Up Robust Features)
  • GLOH (Gradient Location and Orientation Histogram)
[Briefly explain each of these]
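As a sketch of how such an algorithm can be used end to end, here is one way to detect and match SIFT features with OpenCV.  This assumes OpenCV 4.4 or newer (where SIFT is included); the file names are placeholders, not files from this project.

import cv2

img_a = cv2.imread("photo_a.jpg", cv2.IMREAD_GRAYSCALE)
img_b = cv2.imread("photo_b.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()

# Detect interest points and compute a 128-dimensional descriptor for each.
keypoints_a, descriptors_a = sift.detectAndCompute(img_a, None)
keypoints_b, descriptors_b = sift.detectAndCompute(img_b, None)

# Match descriptors between the two photos, keeping only matches that pass
# Lowe's ratio test (the best match is clearly better than the second best).
matcher = cv2.BFMatcher()
matches = matcher.knnMatch(descriptors_a, descriptors_b, k=2)
good = []
for pair in matches:
    if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance:
        good.append(pair[0])

# Each good match links an interest point in photo A to the same point in photo B.
for m in good[:5]:
    print(f"A {keypoints_a[m.queryIdx].pt} -> B {keypoints_b[m.trainIdx].pt}")

Each good match is a pair of corresponding interest points, which is exactly what the disparity calculation from the previous part needs.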

[Todo:  Provide Machine Learning and Deep Learning algorithms in comparison in the future]


[Insert an illustration of the intersection of a set A and a set B, i.e., the features common to both photos]


How to extract the depth of an interest point in a photo?

To extract the depth of an interest point in photo A relative to photo B, we will use a method called Triangulation.

In the next part of the series, I'll talk about Triangulation.
