Lab 1.2 - Tracking

Overview
In this lab you’ll implement the Lucas-Kanade tracking algorithm. The goal of the algorithm is to minimize the sum of squared error between two regions: the template \(T(\mathbf{x})\) and the warped image \(I(\mathbf{W}(\mathbf{x} ; \mathbf{p}))\):
\begin{equation} \sum_{\mathbf{x}}[I(\mathbf{W}(\mathbf{x} ; \mathbf{p}))-T(\mathbf{x})]^2 \end{equation}
\(\mathbf{p}\) is the vector of parameters that defines our warp. The most basic is a simple 2-DoF translation, given by Equation 2. Higher-order warps, such as the affine warp given by Equation 3 or the homography warp given by Equation 4, can also be applied.
\begin{equation}
\mathbf{W}(\mathbf{x} ; \mathbf{p})=\binom{x+p_1}{y+p_2}
\end{equation}
\begin{equation}
\mathbf{W}(\mathbf{x} ; \mathbf{p})=\binom{p_1 x+p_3 y+p_5}{p_2 x+p_4 y+p_6}
\end{equation}
\begin{equation}
\mathbf{W}(\mathbf{x} ; \mathbf{p})=\frac{1}{1+p_7 x+p_8 y}\binom{p_1 x+p_3 y+p_5}{p_2 x+p_4 y+p_6}
\end{equation}
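For reference, each Lucas-Kanade iteration comes from a first-order (Gauss-Newton) linearization of Equation 1; the resulting parameter update (derived in the Baker and Matthews paper referenced in Question 3) is
\begin{equation}
\Delta \mathbf{p}=H^{-1} \sum_{\mathbf{x}}\left[\nabla I \frac{\partial \mathbf{W}}{\partial \mathbf{p}}\right]^{T}[T(\mathbf{x})-I(\mathbf{W}(\mathbf{x} ; \mathbf{p}))], \quad H=\sum_{\mathbf{x}}\left[\nabla I \frac{\partial \mathbf{W}}{\partial \mathbf{p}}\right]^{T}\left[\nabla I \frac{\partial \mathbf{W}}{\partial \mathbf{p}}\right]
\end{equation}
For the translation warp of Equation 2, \(\partial \mathbf{W} / \partial \mathbf{p}\) is the \(2 \times 2\) identity, so \(\Delta \mathbf{p}\) reduces to the optical-flow update \(\mathbf{u}\).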
The user first defines the tracking region of interest, which becomes the template. We aim to find the warped region in subsequent frames that best matches our template. When a suitable match is found, we plot our tracking region and then repeat on the next frame.
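As a concrete reference, here is a minimal sketch of the single Gauss-Newton step above for the pure-translation case. The function name, the use of Sobel gradients, and the float32 grayscale inputs are our assumptions, not requirements of the lab:

```python
import cv2
import numpy as np

def lk_translation_step(template, patch):
    """One Gauss-Newton update for the 2-DoF translation warp.

    template, patch: float32 grayscale arrays of identical shape.
    Returns u = [du, dv] moving the roi toward alignment with the template.
    """
    # Spatial gradients of the warped patch (x = columns, y = rows).
    Ix = cv2.Sobel(patch, cv2.CV_32F, 1, 0, ksize=3)
    Iy = cv2.Sobel(patch, cv2.CV_32F, 0, 1, ksize=3)
    J = np.stack([Ix.ravel(), Iy.ravel()], axis=1)  # (N, 2) per-pixel Jacobian
    err = (template - patch).ravel()                # T(x) - I(W(x; p))
    H = J.T @ J                                     # 2x2 Gauss-Newton Hessian
    return np.linalg.solve(H, J.T @ err)            # u = H^{-1} J^T err
```

Iterating this step while re-sampling the patch at the updated position is the loop your Question 1 tracker needs.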
This zip file contains the sample videos you can use to complete this lab. However, you are encouraged to capture your own.
Important: You can obtain 4 bonus marks for each section you finish and demo in person during the lab period in which the lab is introduced.
1. 2-DoF SSD Tracker (60 marks)
Create a function `simple_tracker(roi, im0, im1, max_iterations, threshold)` that takes in the `roi` and a pair of images and returns the coordinates of the updated tracking region.
a) Apply your tracker to a pair of images.
- Outside of your function, display the first image and define a bounding box `roi` to track using `cv2.selectROI()`.
- We obtain the template from `roi` and `im0`. This can be updated for each new frame or left as the original template from the first frame.
- For subsequent frames, use your function to solve for \(\mathbf{u}=[u,v]^T\) on the `roi` of the next frame, as in optical flow. Your `roi` is the patch.
  - Use \(\mathbf{u}\) to update the tracked region using the 2-DoF transformation given by Equation 2.
  - Repeatedly solve for \(\mathbf{u}\) and update the tracked region until `norm(u_k)/norm(u_{k-1}) < threshold` or `k > max_iterations`, where `k` is the number of iterations (you may define your own end condition; this is just what we suggest). A sketch of this loop follows this list.
- Your function should return the new `roi`. Update and display this bounding box overlaid onto `im1` (see `cv2.rectangle` for plotting).
- Note: We can choose to update the template every frame or keep the template obtained from the first frame for all subsequent comparisons.
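Under the same assumptions, one possible shape for the loop described above; `cv2.getRectSubPix` (for sub-pixel patch sampling), the BGR-to-gray conversion, and the default stopping values are our choices, and `lk_translation_step` is the sketch from the overview:

```python
import cv2
import numpy as np

def simple_tracker(roi, im0, im1, max_iterations=50, threshold=0.1):
    """Sketch of the Part 1 tracker: iterate lk_translation_step until the
    suggested stop condition. roi is (x, y, w, h)."""
    x, y, w, h = roi
    w, h = int(round(w)), int(round(h))    # patch size must be integral
    gray0 = np.float32(cv2.cvtColor(im0, cv2.COLOR_BGR2GRAY))
    gray1 = np.float32(cv2.cvtColor(im1, cv2.COLOR_BGR2GRAY))
    template = cv2.getRectSubPix(gray0, (w, h), (x + w / 2.0, y + h / 2.0))

    prev_norm = None
    for _ in range(max_iterations):
        # Re-sample the patch at the current sub-pixel position in im1.
        patch = cv2.getRectSubPix(gray1, (w, h), (x + w / 2.0, y + h / 2.0))
        u = lk_translation_step(template, patch)
        x, y = x + u[0], y + u[1]          # Equation 2: pure translation
        norm = np.linalg.norm(u)
        if prev_norm is not None and norm / max(prev_norm, 1e-12) < threshold:
            break                          # norm(u_k)/norm(u_{k-1}) < threshold
        prev_norm = norm
    return (x, y, w, h)

# Usage: roi = cv2.selectROI("frame", im0); then, per new frame:
#   roi = simple_tracker(roi, im0, im1)
#   x, y, w, h = map(int, roi)
#   cv2.rectangle(im1, (x, y), (x + w, y + h), (0, 255, 0), 2)
```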
Deliverables:
- Display the pair of images in your report.
Report Question 1a: Why do we have to iteratively warp our tracked region instead of just solving for \(\mathbf{u}\) once and updating our bounding box?
b) Create a live implementation of your tracker.
- Compare the results of using the same template from the first image for all subsequent frames vs. updating the template for every new `roi` we find.
Report Question 1b: What is one benefit and drawback of updating the template every frame?
c) Apply your tracker to a video sequence that you recorded.
Deliverables:
- Save your video with the overlaid tracking region for your submission. Ensure the video is less than 10 MB.
Report Question 1c: In what cases does the tracker perform well? In what cases does the tracker perform poorly? Name one type of image processing that may improve the performance of your tracker.
Important
- Undergraduate students: choose one of the following three questions to complete for this lab and obtain the remaining 40 marks.
- Graduate students: choose two of the following three questions to obtain the remaining 60 marks.
2. Pyramidal 2-DoF SSD Tracker
Create a function `pyramidal_tracker(roi, im0, im1, levels=4, scale=2)` that performs 2-DoF SSD tracking with pyramidal Gaussian downsampling. You may call `simple_tracker()` within this function if you wish.
- The Gaussian pyramid has `levels` layers, each at a `scale`-factor lower resolution than the layer above.
- See `cv2.pyrDown()` for Gaussian downsampling.
- We obtain the template from `roi` and `im0`. This can be updated for each new frame or left as the original template from the first frame.
- For each frame, start at the top of the pyramid (coarsest) and solve for the new region. Propagate the result to the more detailed layer below and repeat for all layers (see the sketch after this list).
  - Ensure the coordinates of the updated region match the scale factor of the layer below (i.e., multiply the translation by `scale`).
- Update the tracked region with the bounding box of the bottom layer (the layer with no downsampling) and repeat for all frames.
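A minimal coarse-to-fine sketch, assuming `roi` is an (x, y, w, h) array and `simple_tracker` follows the Part 1 signature. Note that `cv2.pyrDown` always halves the resolution, so this sketch effectively fixes `scale=2`:

```python
import cv2
import numpy as np

def pyramidal_tracker(roi, im0, im1, levels=4, scale=2,
                      max_iterations=50, threshold=0.1):
    """Coarse-to-fine wrapper around simple_tracker (defaults are ours)."""
    # Build Gaussian pyramids; index 0 is full resolution, index levels-1 coarsest.
    pyr0, pyr1 = [im0], [im1]
    for _ in range(levels - 1):
        pyr0.append(cv2.pyrDown(pyr0[-1]))
        pyr1.append(cv2.pyrDown(pyr1[-1]))

    # Scale the roi down to the coarsest layer, then refine layer by layer.
    roi = np.asarray(roi, np.float64) / scale ** (levels - 1)
    for level in range(levels - 1, -1, -1):
        roi = np.asarray(
            simple_tracker(roi, pyr0[level], pyr1[level],
                           max_iterations, threshold), np.float64)
        if level > 0:
            roi *= scale   # propagate coordinates to the finer layer below
    return roi
```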
Apply your pyramidal tracker to two videos that you record, one with camera motion and one with object motion.
Deliverables:
- Save the videos with overlaid tracking regions to include with your submission. Ensure they are less than 10 MB each.
Report Question 2a: Name one advantage and one disadvantage of adding pyramidal downsampling to our tracker.
Report Question 2b: How do the number of pyramid levels and the chosen downsampling factor affect the trade-off between computational speed and accuracy?
Report Question 2c: Why do we use a coarse-to-fine strategy in pyramidal tracking, and are there any cases where a fine-to-coarse approach might be beneficial?
3. High-DoF Tracker
Create a function `highdof_tracker(img0, img1, roi, max_iterations, threshold)` that warps the bounding box using 4 parameters (x, y, rotation, scale), or using 6 (affine) parameters \(\mathbf{p}\) for 10 additional bonus marks.
- See the OpenCV documentation on transformations.
- See the Baker and Matthews paper on Lucas-Kanade.
- See page 27 of these slides.
- See our warp tutorial.
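For orientation, one possible way to map a 4-parameter vector \(\mathbf{p}\) onto the roi corners; the ordering [tx, ty, theta, s] is our own choice, not something the lab prescribes:

```python
import numpy as np

def warp_corners_4dof(corners, p):
    """Apply a similarity warp x' = s * R(theta) * x + t to roi corners.

    corners: (4, 2) array of (x, y) points; p = [tx, ty, theta, s] is one
    possible 4-DoF parameterization (the lab leaves the choice to you).
    """
    tx, ty, theta, s = p
    c, sn = np.cos(theta), np.sin(theta)
    A = s * np.array([[c, -sn],
                      [sn,  c]])           # rotation + isotropic scale
    return np.asarray(corners) @ A.T + np.array([tx, ty])
```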
Apply your tracker to two videos that you record, one with camera motion and one with object motion.
Deliverables:
- Save the videos with overlaid tracking regions to include with your submission. Ensure they are less than 10 MB each.
Report Question 3: Name one advantage and one disadvantage of using higher order warps for our tracker.
4. Learning-Based Tracking
In traditional intensity-based tracking, we estimate the parameters of a transformation that warps a template \(I_0\) to match a region in the current image \(I_t\). Typically, this is approached by minimizing the intensity error between the template and the warped patch.
However, instead of analytically computing the Jacobian (as you did in previous exercises), we now want to learn the mapping from image intensities to transformation parameters from a large set of synthetically generated training samples.
You will implement one of the following methods (or both, for double credit):
- Hyperplane Approximation. Reference: F. Jurie and M. Dhome, “Hyperplane approximation for template matching,” 2002.
- Nearest Neighbor Approximation (note: this is more difficult; you will receive 10 bonus marks for completing it). Reference: D. Travis, C. Perez, A. Shademan, and M. Jagersand, “Realtime Registration-Based Tracking via Approximate Nearest Neighbour Search,” 2013.
A quick overview of these learning-based approaches can be found in `lec05bRegTrack2.pdf` (slides 47-52).
a. Sampling the Region (`sample_region`)
- Write a function `sample_region` that takes as input the four corner coordinates of a region in an image and returns a rectangular patch corresponding to that region.
- You can implement this by estimating a DLT (Direct Linear Transform) that warps the four-corner region into a fixed rectangle, then using that transformation to sample the image (see the sketch after this list).
- Bonus (1): Successfully handle arbitrary four-corner coordinates. (If time is short, you may use a simpler rectangular crop, but the DLT approach is strongly recommended.)
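A minimal sketch of the DLT-based sampling described above, delegating the four-point solve to `cv2.getPerspectiveTransform` (which solves the same linear system a hand-rolled DLT would); the output size and the ul, ur, lr, ll corner ordering are our assumptions:

```python
import cv2
import numpy as np

def sample_region(img, corners, size=(100, 100)):
    """Warp the quadrilateral `corners` into a fixed `size` rectangle.

    corners: four (x, y) points, assumed in ul, ur, lr, ll order.
    """
    w, h = size
    src = np.float32(corners)
    dst = np.float32([[0, 0], [w - 1, 0], [w - 1, h - 1], [0, h - 1]])
    H = cv2.getPerspectiveTransform(src, dst)   # 4-point DLT solve
    return cv2.warpPerspective(img, H, (w, h))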
Report Question 4a: Describe the parameter settings you used for sampling (e.g., the range of translations, rotations, etc.).
b. Synthetic Perturbations (`synthesis`)
- Write a function `synthesis` to generate training samples (see the sketch after this list).
- Given an original region and a corresponding rectangle, apply small random transformations (e.g., translation, rotation, scaling, or even a homography).
- Use a Gaussian distribution over the transformation parameters, as described in Travis et al.
- Start with fewer DoFs (e.g., translation only) and move on to more complex transformations (4 DoFs: rotation + scale; 6 DoFs: affine; 8 DoFs: full homography).
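A translation-only sketch of the sample generator, reusing the `sample_region` sketch above; the sample count and Gaussian sigma are illustrative values only:

```python
import numpy as np

def synthesis(img, corners, n_samples=1000, sigma=2.0, rect=(100, 100)):
    """Translation-only sample generator (extend the perturbation as you add
    DoFs). Returns parameter deltas (N, 2) and intensity diffs (N, npix)."""
    template = sample_region(img, corners, rect).ravel().astype(np.float32)
    deltas, diffs = [], []
    for _ in range(n_samples):
        dp = np.random.normal(0.0, sigma, size=2)      # Gaussian over params
        jittered = np.float32(corners) + dp            # shift all four corners
        patch = sample_region(img, jittered, rect).ravel().astype(np.float32)
        deltas.append(dp)
        diffs.append(patch - template)
    return np.array(deltas), np.array(diffs)
```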
c. Learning the Tracker (`learn`)
- Implement the core learning procedure for either the Hyperplane method (Jurie & Dhome) or the Nearest Neighbor method (Travis et al.).
- This involves taking your large set (e.g., 1000–2000 samples) of synthetic transformations and the corresponding intensity differences, then learning to predict the parameter updates from the intensity errors (a sketch of the hyperplane fit follows this list).
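For the hyperplane method, the learning step reduces to one linear least-squares fit from intensity differences to parameter updates. A minimal sketch, assuming the arrays produced by the `synthesis` sketch above:

```python
import numpy as np

def learn(deltas, diffs):
    """Hyperplane fit (Jurie & Dhome): find A with delta_p ≈ A @ delta_i.

    deltas: (N, dof) parameter perturbations from synthesis().
    diffs:  (N, npix) corresponding intensity differences.
    """
    # Least squares on diffs @ A.T ≈ deltas, i.e. minimize ||ΔI·Aᵀ − ΔP||².
    A_t, *_ = np.linalg.lstsq(diffs, deltas, rcond=None)
    return A_t.T   # (dof, npix): predicts delta_p = A @ delta_i
```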
d. Incremental Updating (`update`)
- Implement the incremental update procedure, where at each tracking step you refine your estimate of the transformation parameters (see the sketch after this list).
- Refer to the chosen paper (Jurie & Dhome, or Travis et al.) to see how the learned model is updated (or how the nearest neighbor lookup is performed in each new frame).
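A minimal translation-only sketch of the per-frame update with the learned hyperplane matrix; the fixed count of inner refinements is an arbitrary choice:

```python
import numpy as np

def update(img, corners, template, A, rect=(100, 100), iters=5):
    """Per-frame step with the learned hyperplane matrix (translation-only,
    matching the synthesis sketch above)."""
    corners = np.float32(corners)
    for _ in range(iters):
        patch = sample_region(img, corners, rect).ravel().astype(np.float32)
        dp = A @ (patch - template)    # predicted parameter perturbation
        corners = corners - dp         # move the region to cancel the error
    return corners
```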
Report Question 4b: If you completed both methods (Hyperplane & Nearest Neighbor), compare their speeds and accuracies.
Practical Tips
- Sample Generation: Generating 1000–2000 random samples often suffices. Start with a moderate region size (e.g., from `(50, 50)` to `(100, 100)` in your image).
- Performance: If you use Python + OpenCV, raw nearest neighbor lookups can be slow. Consider using libraries like pyflann or other approximate nearest neighbor libraries for faster queries.
- Debugging: Test on a “static image motion” experiment (e.g., a synthetically warped single image) before moving to real video data.
Note: This exercise counts as two if you fully implement and demonstrate both the Hyperplane Approximation and the Nearest Neighbor Approximation. Make sure to manage your time and start with simpler transformations before attempting the full homography approach.
Tracker Competition 🏆:
Use the tracker you developed in this lab to track a selected region of interest (ROI) in a sample video. The target object is a cereal box, and your goal is to accurately track the specified ROI throughout the video.
Rules
- Originality: You must use your own implementation. Importing prebuilt tracking libraries (e.g., OpenCV trackers) is not allowed.
- Scope: Extensions to the tracker must align with the principles and methods discussed in this lab. If you are unsure about your method, ask the teaching staff.
- Submission: Submit your code and a short write-up on how you refined your tracker and what you learned in doing so. In your report, label this section Tracker Competition.
Evaluation
- Accuracy: Tracking performance will be quantitatively evaluated.
  - The tracker will be scored based on how closely the tracked ROI aligns with the ground truth.
  - The tracking error is computed using the following formula (a worked example follows this list):

```python
current_error = math.sqrt(np.sum(np.square(actual_corners - tracker_corners)) / 4)
```
- Bonus Marks:
  - First place: 10% bonus marks.
  - Last place: 1% bonus marks.
  - Bonus marks will be distributed evenly among all participants based on performance ranking.
- Trophy: The winner will also receive a 3D-printed trophy!
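A worked example of the scoring formula, using the frame00001 corners from the sample rows below as ground truth and a hypothetical tracker result that is off by 2 px in every coordinate:

```python
import math
import numpy as np

# Ground-truth corners (ul, ur, lr, ll) and a tracker off by 2 px everywhere.
actual_corners = np.array([[32.0, 313.0], [202.0, 308.0],
                           [316.0, 517.0], [54.0, 540.0]])
tracker_corners = actual_corners + 2.0

current_error = math.sqrt(np.sum(np.square(actual_corners - tracker_corners)) / 4)
print(current_error)   # 8 coordinates each off by 2: sqrt(32 / 4) = sqrt(8) ≈ 2.83
```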
Competition Submission
- Provide a `.txt` file with your best tracking results in the following format:

```
frame ulx uly urx ury lrx lry llx lly
frame00001.jpg 32.00 313.00 202.00 308.00 316.00 517.00 54.00 540.00
frame00002.jpg 32.02 312.99 202.02 307.99 316.02 516.99 54.04 539.99
frame00003.jpg 32.03 313.00 202.03 307.99 316.03 517.00 54.03 539.99
```
- Each row should contain:
  - `frame`: the frame name (e.g., `frame00001.jpg`).
  - `ulx`, `uly`: upper-left corner coordinates.
  - `urx`, `ury`: upper-right corner coordinates.
  - `lrx`, `lry`: lower-right corner coordinates.
  - `llx`, `lly`: lower-left corner coordinates.
- Name the file `tracking_results_CCID.txt` (a sketch of a writer for this format follows this list).
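A minimal sketch of a writer for this format, assuming `results` is a list of (frame_name, corners) pairs with corners as a (4, 2) array in ul, ur, lr, ll order:

```python
import numpy as np

# Hypothetical: `results` holds (frame_name, corners) pairs from your tracker.
with open("tracking_results_CCID.txt", "w") as f:
    f.write("frame ulx uly urx ury lrx lry llx lly\n")
    for frame_name, corners in results:
        coords = " ".join(f"{v:.2f}" for v in np.asarray(corners).ravel())
        f.write(f"{frame_name} {coords}\n")
```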
Submission Details
- Include the accompanying code used to complete each question. Ensure it is adequately commented.
- Ensure all functions and sections are clearly labeled in your report to match the tasks and deliverables outlined in the lab.
- Organize files as follows:
  - `code/` folder containing all scripts used in the assignment.
  - `media/` folder for images, videos, and results.
- Final submission format: a single zip file named `CompVisW25_lab1.2_lastname_firstname.zip` containing the above structure.
- Your combined report for Lab 1.1 and 1.2 is due shortly after (see the calendar for details). The report contains all media, results, and answers as specified in the instructions above. Ensure your answers are concise and directly address the questions.
- The total for this lab is 100 marks for undergraduate students and 120 for graduate students. Your lab assignment grade, including bonus marks, is capped at 120%. Report bonus marks will be applied to the report grade, capped at 110%.
Good luck, and happy tracking! 🚀