In today’s article, we take a deep dive into video object tracking. Starting from the basics, we will understand the need for object tracking, then go through the challenges and algorithmic models involved in visual object tracking, and finally cover the most popular deep learning based approaches to object tracking, including MDNet, GOTURN, and ROLO. This article assumes you are familiar with object detection.
Object tracking is the process of locating moving objects over time in a video. One might ask: why can’t we just run object detection on every frame of the video and track the object that way? There are a few problems with that. If a frame contains multiple objects, we have no way of connecting the objects in the current frame to those in previous frames. If the object you were tracking leaves the camera view for a few frames and another one appears, we have no way of knowing whether it’s the same object. Essentially, during detection we work with one image at a time, with no knowledge of the object’s motion or past trajectory, so we can’t uniquely track objects in a video.
Object tracking has a wide range of applications in computer vision, such as surveillance, human-computer interaction, medical imaging, traffic flow monitoring, and human activity recognition. Imagine the FBI wants to track a criminal fleeing in a car using citywide surveillance cameras.
Or a sports analytics system needs to analyze a soccer game.
Or you want to install a camera at the entrance of a shopping mall and count how many people come in and go out each hour; here you not only want to track each person uniquely but also reconstruct the path each person takes.
Tracking works when object detection fails:
Whenever there is a moving object in a video, there are cases when its visual appearance is not clear. In all such cases, detection fails while tracking succeeds, because the tracker also has a motion model and the history of the object. Here are examples of cases where object tracking works and object detection fails:
- Occlusion: The object in question is partially or completely occluded.
- Identity switches: After two objects cross each other, how do you know which one is which?
- Motion Blur: The object is blurred due to its own motion or the camera’s. Hence, visually the object doesn’t look the same anymore.
- Viewpoint Variation: Different viewpoints of an object may look very different visually, and without context it becomes very difficult to identify the object using visual detection alone.
- Scale Change: Huge changes in object scale may cause a failure in detection.
- Background Clutter: The background near the object has a color or texture similar to the object’s. Hence, it may become harder to separate the object from the background.
- Illumination Variation: The illumination near the target object changes significantly. Hence, it may become harder to identify it visually.
- Low Resolution: When the number of pixels inside the ground-truth bounding box is very small, it may be too hard to detect the object visually.
Object tracking is a long-standing and hard problem in computer vision. There are various techniques and algorithms that try to solve it in different ways. However, most of them rely on two key components:
1. Motion model:
One of the key components of a good tracker is the ability to understand and model the motion of the object. A motion model captures the dynamic behavior of an object and predicts its potential position in future frames, hence reducing the search space. However, a motion model alone fails when the motion is caused by something not visible in the video, or when the object changes direction or speed abruptly. Examples of such techniques are optical flow, Kalman filtering, the Kanade-Lucas-Tomasi (KLT) feature tracker, and mean-shift tracking.
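As a concrete illustration, here is a minimal constant-velocity Kalman filter in numpy that predicts an object's (x, y) position in the next frame. The noise covariances and the "2 px/frame" motion are illustrative assumptions, not values from any particular tracker.

```python
import numpy as np

class KalmanCV:
    """Constant-velocity Kalman filter over the state [x, y, vx, vy]."""
    def __init__(self):
        dt = 1.0  # one frame between measurements
        self.F = np.array([[1, 0, dt, 0],   # state transition: x' = x + vx*dt
                           [0, 1, 0, dt],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)
        self.H = np.array([[1, 0, 0, 0],    # we measure position only, not velocity
                           [0, 1, 0, 0]], dtype=float)
        self.Q = np.eye(4) * 1e-2           # process noise (assumed)
        self.R = np.eye(2) * 1.0            # measurement noise (assumed)
        self.x = np.zeros(4)                # state estimate
        self.P = np.eye(4) * 10.0           # state covariance

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]                   # predicted position -> reduced search region

    def update(self, z):
        y = z - self.H @ self.x                         # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)        # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P

kf = KalmanCV()
for t in range(10):                 # object moving ~2 px/frame right, at y=50
    kf.predict()
    kf.update(np.array([2.0 * t, 50.0]))
next_pos = kf.predict()             # next-frame prediction, roughly (20, 50)
```

The `predict` step is exactly the "reduced search space" mentioned above: the tracker only needs to examine a region around `next_pos` rather than the whole frame.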
2. Visual Appearance Model:
Most highly accurate trackers need to understand the appearance of the object they are tracking. Most importantly, they need to learn to discriminate the object from the background. In single-object trackers, visual appearance alone can be enough to track the object across frames, while in multiple-object trackers it is not.
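A toy sketch of one classical appearance model: a normalized color histogram of the target patch, compared to candidate patches with the Bhattacharyya coefficient. The patch sizes, bin count, and synthetic pixel values are illustrative assumptions.

```python
import numpy as np

def color_histogram(patch, bins=8):
    """Normalized per-channel histogram of an HxWx3 uint8 patch."""
    hist = [np.histogram(patch[..., c], bins=bins, range=(0, 256))[0]
            for c in range(3)]
    hist = np.concatenate(hist).astype(float)
    return hist / hist.sum()

def bhattacharyya(p, q):
    """Similarity of two distributions in [0, 1]; 1 means identical."""
    return float(np.sum(np.sqrt(p * q)))

rng = np.random.default_rng(0)
target = rng.integers(150, 255, (32, 32, 3), dtype=np.uint8)      # bright target
similar = rng.integers(150, 255, (32, 32, 3), dtype=np.uint8)     # target-like patch
background = rng.integers(0, 100, (32, 32, 3), dtype=np.uint8)    # dark clutter

h = color_histogram(target)
score_similar = bhattacharyya(h, color_histogram(similar))
score_background = bhattacharyya(h, color_histogram(background))
# The target-like patch scores much higher than the background patch.
```

This is the discrimination step in miniature: the tracker scores candidate regions against the learned target model and rejects background-like ones.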
Components of a Tracking Algorithm:
Generally, the object tracking process is composed of four modules:
1. Target initialization: In this phase, we define the initial state of the target by drawing a bounding box around it in the first frame of the video. The tracker then has to estimate the target's position in the remaining frames.
2. Appearance modeling: Next, we learn the visual appearance of the object using learning techniques. In this phase, we model the visual features of the object under motion, varying viewpoints, scale, illumination, etc.
3. Motion estimation: The objective of motion estimation is to predict the zone where the target is most likely to be present in subsequent frames.
4. Target positioning: Motion estimation gives us the possible region where the target could be present, and we scan that region using the appearance model to lock down the exact location of the target.
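The four modules above can be sketched as one loop. This toy tracker uses a synthetic "bright square" target, with thresholding as a stand-in appearance model; it is only meant to show how initialization, motion-based search, appearance matching, and positioning fit together.

```python
import numpy as np

def make_frame(cx, cy, size=100):
    """Synthetic frame: a bright 10x10 square centered near (cx, cy)."""
    f = np.zeros((size, size))
    f[cy - 5:cy + 5, cx - 5:cx + 5] = 1.0
    return f

def track(frames, init_center, search=12):
    cx, cy = init_center                          # 1. target initialization
    trajectory = [(cx, cy)]
    for f in frames[1:]:
        y0, x0 = max(cy - search, 0), max(cx - search, 0)   # 3. motion estimation:
        window = f[y0:cy + search, x0:cx + search]          #    search near last position
        ys, xs = np.nonzero(window > 0.5)         # 2. appearance model: "bright blob"
        cy = int(y0 + ys.mean() + 0.5)            # 4. target positioning: centroid
        cx = int(x0 + xs.mean() + 0.5)
        trajectory.append((cx, cy))
    return trajectory

frames = [make_frame(20 + 3 * t, 50) for t in range(10)]    # moves 3 px/frame
path = track(frames, (20, 50))                              # ends near (47, 50)
```

Note that the search window is the only part of the frame ever examined, which is why a motion model makes tracking cheap.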
In general, tracking algorithms don’t try to learn all the variations of the object. Hence, most of the tracking algorithms are much faster than object detection.
Classification of tracking algorithms:
1. Detection Based or Detection Free Trackers:
1.1 DETECTION BASED TRACKING: Consecutive video frames are given to a pretrained object detector, which produces detection hypotheses that are in turn used to form tracking trajectories. This approach is more popular because new objects are detected and disappearing objects are terminated automatically. In these approaches, the tracker handles the failure cases of object detection. In another approach, the object detector is run every n-th frame and the remaining positions are predicted using the tracker. This approach is very suitable for long-duration tracking.
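The run-the-detector-every-n-frames scheme can be sketched as follows. `detect` and `Tracker` here are hypothetical stand-ins for a real detector/tracker pair; the toy tracker dead-reckons using the velocity estimated between detections.

```python
def detect(frame):
    """Stand-in detector: in this toy setup the frame carries its true box."""
    return frame["true_box"]

class Tracker:
    """Stand-in tracker: extrapolates with the velocity seen between detections."""
    def __init__(self, box):
        self.box, self.last_det, self.vx, self.gap = box, box, 0.0, 0

    def correct(self, box):
        # Re-estimate px/frame velocity from the displacement since the last detection.
        self.vx = (box[0] - self.last_det[0]) / (self.gap + 1)
        self.box, self.last_det, self.gap = box, box, 0

    def step(self):
        self.gap += 1
        x, y, w, h = self.box
        self.box = (x + self.vx, y, w, h)
        return self.box

def track_video(frames, n=5):
    tracker, boxes = None, []
    for i, frame in enumerate(frames):
        if i % n == 0:                      # expensive detector, every n-th frame
            box = detect(frame)
            if tracker is None:
                tracker = Tracker(box)
            else:
                tracker.correct(box)
            boxes.append(box)
        else:
            boxes.append(tracker.step())    # cheap tracker in between
    return boxes

frames = [{"true_box": (10 + 2 * i, 20, 5, 5)} for i in range(11)]
boxes = track_video(frames)
```

After the second detection the tracker has a velocity estimate, so the in-between frames are filled in accurately without running the detector.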
1.2 DETECTION FREE TRACKING: Detection free tracking requires manual initialization of a fixed number of objects in the first frame. It then localizes these objects in the subsequent frames. It cannot deal with the case where new objects appear in the middle frames.
2. Single and Multiple Object trackers:
2.1 : SINGLE OBJECT TRACKING: Only a single object is tracked even if the environment has multiple objects in it. The object to be tracked is determined by the initialization in the first frame.
2.2: MULTI OBJECT TRACKING: All the objects present in the environment are tracked over time. If a detection based tracker is used it can even track new objects that emerge in the middle of the video.
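One common way a multi-object tracker keeps identities across frames is to match the current frame's detections to existing tracks by bounding-box overlap. Here is a minimal greedy IoU association sketch (real systems often use Hungarian matching plus motion cues); the boxes and threshold are illustrative.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def associate(tracks, detections, threshold=0.3):
    """tracks: {track_id: box}. Returns {track_id: index of matched detection}."""
    pairs = sorted(((iou(box, d), tid, j)
                    for tid, box in tracks.items()
                    for j, d in enumerate(detections)), reverse=True)
    matched, used = {}, set()
    for score, tid, j in pairs:        # best overlaps first
        if score < threshold:
            break
        if tid not in matched and j not in used:
            matched[tid] = j
            used.add(j)
    return matched

tracks = {1: (0, 0, 10, 10), 2: (50, 50, 60, 60)}
dets = [(51, 50, 61, 60), (1, 0, 11, 10)]     # both objects moved 1 px right
assignment = associate(tracks, dets)          # identities preserved: {1: 1, 2: 0}
```

Unmatched detections would spawn new tracks and unmatched tracks would eventually be terminated, which is exactly the automatic birth/death behavior described for detection based trackers.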
3. Online vs Offline trackers:
3.1 OFFLINE TRACKERS: Offline trackers are used when you have to track an object in a recorded stream, for example recorded videos of an opponent team's soccer games that need to be analyzed strategically. In such a case, you can use not only past frames but also future frames to make more accurate tracking predictions.
3.2 ONLINE TRACKERS: Online trackers are used where predictions must be available immediately, and hence they can't use future frames to improve the results.
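A small numpy example of why future frames help: smoothing a jittery trajectory with a centered window (possible offline) removes the lag that a causal, online-only filter suffers. The noise level and window size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
true_x = np.arange(100, dtype=float)          # object moving 1 px/frame
noisy_x = true_x + rng.normal(0, 2.0, 100)    # jittery per-frame position estimates

def causal_smooth(x, k=5):
    """Online: each output uses only the current and past frames."""
    return np.array([x[max(0, i - k + 1):i + 1].mean() for i in range(len(x))])

def centered_smooth(x, k=5):
    """Offline: each output also uses future frames."""
    h = k // 2
    return np.array([x[max(0, i - h):i + h + 1].mean() for i in range(len(x))])

online = causal_smooth(noisy_x)
offline = centered_smooth(noisy_x)
err_online = np.abs(online - true_x).mean()    # causal filter lags the motion
err_offline = np.abs(offline - true_x).mean()  # centered filter does not
```

The causal average trails a moving object by about half the window, while the centered one stays unbiased, which is the accuracy advantage offline trackers enjoy.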
4. Based on Learning/Training Strategy:
4.1. ONLINE LEARNING TRACKERS: These trackers typically learn about the object to track from the initialization frame and a few subsequent frames. Such trackers are more general because you can draw a bounding box around any object and track it. For example, if you want to track a person in a red shirt at an airport, you can just draw a bounding box around that person in one or a few frames. The tracker learns about the object from these frames and continues to track that person.
4.2. OFFLINE LEARNING TRACKERS: These trackers are trained entirely offline. As opposed to online learning trackers, they don't learn anything at run time. They learn complete concepts offline, i.e., we can train a tracker to identify persons, and it can then be used to continuously track all the persons in a video stream.
Popular tracking algorithms
Many traditional (non deep learning based) tracking algorithms are integrated into OpenCV's tracking API. Most of these trackers are comparatively not very accurate. However, sometimes they can be useful in a resource-constrained environment such as an embedded system. In case you are forced to use one, I would recommend the Kernelized Correlation Filters (KCF) tracker. In practice, however, deep learning based trackers are now miles ahead of traditional trackers in terms of accuracy. So, in this post I am going to talk about three key approaches that are being used for building AI-based trackers.
1. Convolutional Neural Network Based Offline Training Trackers:
This is one of the early families of trackers to apply the discriminative power of convolutional neural networks to visual object tracking. GOTURN is one such offline learning tracker based on a convolutional neural network; it doesn't learn online at all. First, the tracker is trained on thousands of videos for generic object tracking. It can then track most objects without any problem, even objects that were not part of the training set.
GOTURN runs very fast, around 100 fps on a GPU-powered machine. GOTURN has been integrated into the OpenCV tracking API (the contrib part). In the video below, the original author demonstrates the power of GOTURN.
2. Convolutional Neural Networks Based Online Training Trackers:
These are online training trackers that use convolutional neural networks. One such example is the Multi-Domain Network (MDNet), which won the VOT2015 challenge. Since convolutional neural networks are computationally very expensive to train, these methods have to use a small network to train quickly during deployment. However, small networks don't have much discriminative power. One option is to train the whole network offline but, during inference, use the first few layers as a fixed feature extractor, i.e., only change the weights of the last couple of layers, which are trained online. So we have a CNN as a feature extractor, and the last few layers can quickly be trained online.

Essentially, our goal is to train a generic multi-domain CNN that can distinguish between target and background. However, this poses a problem during training: the target in one video could be the background in another, which would confuse the network. So, MDNet does something clever. It rearranges the network into two parts: a shared part, and a part that is independent for each domain, where each domain corresponds to an independent training video. The network is first trained over K domains iteratively, with each domain classifying between its own target and background. This helps extract domain-independent information and learn a better generic representation for tracking.

After training, the domain-specific binary layers are removed, and we obtain a feature extractor (the shared network) that can distinguish between any target object and the background in a generic way. During inference (production), this shared part is used as a feature extractor, and a new binary classification layer is added on top of it and trained online. In each step, the region around the previous target state is searched for the object by random sampling.
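The shared-extractor-plus-online-head idea can be illustrated without a deep learning framework. In this toy numpy sketch, a frozen random projection stands in for the pre-trained shared CNN, and only a small logistic-regression "binary layer" is trained online on target vs. background patches; all data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
W_shared = rng.normal(size=(64, 16))      # "pre-trained" shared weights, frozen

def features(patches):
    """Frozen feature extractor: weights are never updated at run time."""
    return np.tanh(patches @ W_shared)

def train_online_head(X, y, lr=0.5, steps=200):
    """Binary target-vs-background layer; only these weights are learned online."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid scores
        grad = p - y                             # logistic-loss gradient
        w -= lr * X.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b

# Fake patch vectors: target-like inputs cluster near +0.5, background near -0.5.
target = rng.normal(0.5, 0.3, (40, 64))
background = rng.normal(-0.5, 0.3, (40, 64))
X = features(np.vstack([target, background]))
y = np.concatenate([np.ones(40), np.zeros(40)])

w, b = train_online_head(X, y)
fresh = features(rng.normal(0.5, 0.3, (5, 64)))   # new target-like samples
scores = 1.0 / (1.0 + np.exp(-(fresh @ w + b)))
# Fresh target-like patches should score close to 1.
```

Training only the tiny head is what makes the online step fast enough for per-video adaptation, exactly the trade-off MDNet exploits.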
MDNet is one of the most accurate deep learning based online training, detection free, single object trackers. Have a look at this video, which compares it with other methods.
3. LSTM+ CNN based detection based video object trackers :
Another class of object trackers, which is becoming very popular, uses Long Short-Term Memory (LSTM) networks along with convolutional neural networks for visual object tracking. Recurrent YOLO (ROLO) is one such single object, online, detection based tracking algorithm. It uses the YOLO network for object detection and an LSTM network for finding the trajectory of the target object. There are two reasons why LSTM with CNN is a powerful combination:
a) LSTM networks are particularly good at learning historical patterns, so they are particularly suitable for visual object tracking.
b) LSTM networks are not very computationally expensive, so it's possible to build very fast real-world trackers.
YOLO INPUT – Raw Input frames
YOLO OUTPUT – Feature vector of input frames and Bounding box coordinates
LSTM INPUT – Concat(Image features, Box coordinates)
LSTM OUTPUT – Bounding box coordinates of object to be tracked
The pipeline above gives us the following understanding:
- Input frames go through the YOLO network.
- Two different outputs (image features and bounding box coordinates) are taken from the YOLO network.
- These two outputs are given to the LSTM network.
- The LSTM outputs the trajectory, i.e., the bounding box of the object to be tracked.
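At the level of tensor shapes, the steps above amount to concatenating the per-frame YOLO outputs into the sequence the LSTM consumes. The feature dimension here is an illustrative assumption (ROLO itself uses a much larger YOLO feature vector), and random arrays stand in for real network outputs.

```python
import numpy as np

T, FEAT = 6, 128                       # 6 frames, 128-d features (assumed)
rng = np.random.default_rng(0)

yolo_features = rng.normal(size=(T, FEAT))   # YOLO output 1: per-frame image features
yolo_boxes = rng.uniform(size=(T, 4))        # YOLO output 2: (x, y, w, h), normalized

# LSTM input: one (FEAT + 4)-d vector per frame, fed as a time sequence.
lstm_input = np.concatenate([yolo_features, yolo_boxes], axis=1)

# The LSTM then maps this (T, FEAT + 4) sequence to one refined box per
# frame, i.e. an output of shape (T, 4).
```

Because the box coordinates ride along with the visual features, the LSTM sees both where the object was and what it looked like, which is the spatio-temporal history ROLO exploits.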
The preliminary location inference (from YOLO) helps the LSTM pay attention to certain visual elements. ROLO exploits spatio-temporal history: along with the location history, it also exploits the history of visual features. Even when a YOLO detection is flawed due to motion blur, ROLO's tracking stays stable. Such trackers are also less likely to fail when the target object is occluded.
More recently, many LSTM-based object trackers have been proposed that improve considerably on ROLO. However, we chose ROLO here because it is simple and easy to understand.
Hopefully, this post gives you a good understanding of visual object tracking along with some insights into key successful object tracking approaches.