Shooting and Digitizing Pictures
The first step in this project is to take multiple photographs with overlapping views. I took the pictures using my iPhone, rotating the camera about a fixed point without translating it so that the views are related by a pure rotation.
Photographs taken at Cedar Street in North Berkeley
Photographs taken in my room
Images adopted from Google Street View
Recovering Homographies
To align overlapping images, we aim to find a homography matrix \( \mathbf{H} \) that relates the coordinates of one image to another. The homography is a \( 3 \times 3 \) matrix with eight unknown entries:

\[ \mathbf{H} = \begin{bmatrix} h_1 & h_2 & h_3 \\ h_4 & h_5 & h_6 \\ h_7 & h_8 & 1 \end{bmatrix} \]

It maps a point \((x, y)\) in one image to the corresponding point \((x', y')\) in another image via homogeneous coordinates:

\[ \begin{bmatrix} w x' \\ w y' \\ w \end{bmatrix} = \mathbf{H} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \]

Dividing by the scale factor \( w = h_7 \cdot x + h_8 \cdot y + 1 \) gives the transformed coordinates:

\[ x' = \frac{h_1 \cdot x + h_2 \cdot y + h_3}{h_7 \cdot x + h_8 \cdot y + 1}, \quad y' = \frac{h_4 \cdot x + h_5 \cdot y + h_6}{h_7 \cdot x + h_8 \cdot y + 1} \]

To find the entries of \( \mathbf{H} \), we use corresponding points \((x_i, y_i)\) and \((x_i', y_i')\) from two overlapping images. Multiplying through by the denominator, each pair of points provides two linear equations:

\[ \begin{aligned} x_i' &= h_1 \cdot x_i + h_2 \cdot y_i + h_3 - h_7 \cdot x_i \cdot x_i' - h_8 \cdot y_i \cdot x_i', \\ y_i' &= h_4 \cdot x_i + h_5 \cdot y_i + h_6 - h_7 \cdot x_i \cdot y_i' - h_8 \cdot y_i \cdot y_i' \end{aligned} \]

Stacking these equations yields a linear system \( \mathbf{A} \mathbf{h} = \mathbf{b} \), where \( \mathbf{h} \) is the vector of unknown entries \( h_1, h_2, \ldots, h_8 \). With at least four correspondences (eight equations) the system can be solved with least squares; using more than four points makes the estimate more robust to noise in the selected points.
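As a rough sketch of this least-squares setup (assuming correspondences are given as NumPy arrays of (x, y) points; the function name and array layout are illustrative, not the exact code used):

```python
import numpy as np

def compute_homography(pts_src, pts_dst):
    """Estimate H mapping pts_src -> pts_dst, each an (N, 2) array with N >= 4,
    by solving the overdetermined system A h = b in the least-squares sense."""
    A, b = [], []
    for (x, y), (xp, yp) in zip(pts_src, pts_dst):
        # Two equations per correspondence, from the linearized formulas above.
        A.append([x, y, 1, 0, 0, 0, -x * xp, -y * xp])
        A.append([0, 0, 0, x, y, 1, -x * yp, -y * yp])
        b.extend([xp, yp])
    h, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
    # Append h9 = 1 and reshape into the 3x3 homography matrix.
    return np.append(h, 1).reshape(3, 3)
```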
Warping the Images
Using the computed homography matrix \( \mathbf{H} \), we warp each image into a common reference frame so that the images align. The warping is performed with inverse mapping: for each output pixel we compute the corresponding source coordinates, which avoids the holes and artifacts that forward mapping would introduce and ensures accurate alignment. Concretely, we transform the image's corners to derive the output bounding box, construct a grid of points within this bounding box, and apply the inverse homography \( \mathbf{H}^{-1} \) to find the corresponding source coordinates. The pixel values at these source coordinates are then interpolated to generate the final warped image.
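A minimal sketch of this inverse-warping procedure, assuming a NumPy float image and using SciPy's map_coordinates for bilinear interpolation (function and variable names are illustrative):

```python
import numpy as np
from scipy.ndimage import map_coordinates

def warp_image(img, H):
    """Warp img (rows x cols x channels) by homography H using inverse mapping."""
    h, w = img.shape[:2]
    # Transform the four corners to find the output bounding box.
    corners = np.array([[0, 0, 1], [w, 0, 1], [w, h, 1], [0, h, 1]]).T
    warped = H @ corners
    warped = warped[:2] / warped[2]
    xmin, ymin = np.floor(warped.min(axis=1)).astype(int)
    xmax, ymax = np.ceil(warped.max(axis=1)).astype(int)

    # Grid of output pixel coordinates, mapped back through H^{-1}.
    xs, ys = np.meshgrid(np.arange(xmin, xmax), np.arange(ymin, ymax))
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(xs.size)])
    src = np.linalg.inv(H) @ pts
    src = src[:2] / src[2]

    # Bilinear interpolation of each channel at the source coordinates.
    out = np.zeros((ymax - ymin, xmax - xmin, img.shape[2]))
    for c in range(img.shape[2]):
        out[..., c] = map_coordinates(
            img[..., c], [src[1], src[0]], order=1, cval=0
        ).reshape(xs.shape)
    return out, (xmin, ymin)  # offset locates the warped image on the mosaic canvas
```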
Examples of rectified images
Blending Images into a Mosaic
The final step in creating the mosaic is to blend the images. We combine the aligned images using weighted averaging with alpha masks in the overlapping areas to create smooth transitions and minimize visible seams. The approach is inspired by Alec Li's project work from last year and uses distance transforms to determine each image's relative contribution in overlapping regions. We also decompose the images into low-pass and high-pass components using a Gaussian filter: the low-pass components are combined with a weighted average, while the high-pass components are blended based on the distance transforms. This approach yields decent results on the original images, as shown below.
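A simplified sketch of this two-band, distance-transform blending for a pair of images already placed on a common canvas (the masks, names, and Gaussian sigma are assumptions, not the exact values used):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt, gaussian_filter

def blend_pair(im1, im2, mask1, mask2, sigma=5):
    """Two-band blend of two warped float images in [0, 1] on the same canvas.
    mask1/mask2 are boolean arrays marking where each image has valid pixels."""
    # Distance transforms give each image more weight deep inside its own region.
    d1 = distance_transform_edt(mask1)
    d2 = distance_transform_edt(mask2)
    w1 = d1 / np.maximum(d1 + d2, 1e-8)
    w2 = 1.0 - w1

    # Split each image into a low-pass band and the residual high-pass band.
    low1 = gaussian_filter(im1, sigma=(sigma, sigma, 0))
    low2 = gaussian_filter(im2, sigma=(sigma, sigma, 0))
    high1, high2 = im1 - low1, im2 - low2

    # Low frequencies: smooth weighted average. High frequencies: take the
    # contribution of whichever image is closer (larger distance transform).
    low = w1[..., None] * low1 + w2[..., None] * low2
    high = np.where((d1 > d2)[..., None], high1, high2)
    return np.clip(low + high, 0, 1)
```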
Cedar Street
My Room
A mysterious Google Street View location
Harris Corner Detection and ANMS
We first detect Harris corners using code adapted from the course implementation, with a threshold_rel of 0.2 so that only the strongest corners are kept. The Harris detector identifies corners as local peaks of a corner-response function computed from image gradients, i.e., points where the intensity changes significantly in multiple directions. Following the MOPS paper, we then apply Adaptive Non-Maximal Suppression (ANMS) to select a better-distributed set of corner points: a point is suppressed if a sufficiently stronger corner lies within a radius r, and we keep the points with the largest suppression radii. Note that all subsequent steps (feature descriptor extraction, matching, and outlier rejection) are implemented following the approaches described in this paper.
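A small sketch of the ANMS step as described in the MOPS paper, assuming coords is an (N, 2) array of corner locations and strengths their Harris responses (names and the robustness constant are illustrative):

```python
import numpy as np

def anms(coords, strengths, n_keep=500, c_robust=0.9):
    """Adaptive Non-Maximal Suppression: keep the n_keep corners with the
    largest suppression radius, i.e. the distance to the nearest corner that
    is significantly stronger (by a factor of 1 / c_robust)."""
    radii = np.full(len(coords), np.inf)
    for i in range(len(coords)):
        # Corners strong enough to suppress corner i.
        stronger = strengths > strengths[i] / c_robust
        if np.any(stronger):
            d = np.linalg.norm(coords[stronger] - coords[i], axis=1)
            radii[i] = d.min()
    keep = np.argsort(-radii)[:n_keep]
    return coords[keep]
```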
Harris corners after ANMS, showing a well-distributed set of points
Feature Descriptor Extraction
For each point returned by ANMS, we extract a feature descriptor by sampling an 8×8 patch of normalized intensity values. As described in the paper, the patch is sampled from a larger 40×40 window (i.e., at a coarser scale), which makes the descriptor less sensitive to the exact location of the interest point. The descriptor \(d\) for each point is computed as: \[ d = \frac{P - \mu}{\sigma} \] where \(P\) is the sampled patch, \(\mu\) is its mean intensity, and \(\sigma\) is its standard deviation; this normalization makes the descriptor invariant to affine changes in intensity (bias and gain).
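A rough sketch of this descriptor extraction, assuming a grayscale float image and (row, col) corner coordinates; the blur sigma and border handling are assumptions:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def extract_descriptors(img, coords, window=40, patch=8):
    """Extract bias/gain-normalized 8x8 descriptors sampled from 40x40 windows."""
    step = window // patch                          # sample every 5th pixel
    blurred = gaussian_filter(img, sigma=step / 2)  # low-pass before subsampling
    half = window // 2
    descriptors = []
    for y, x in coords:
        y, x = int(y), int(x)
        if y - half < 0 or x - half < 0 or y + half > img.shape[0] or x + half > img.shape[1]:
            continue                                # skip corners too close to the border
        win = blurred[y - half:y + half, x - half:x + half]
        p = win[::step, ::step]                     # 8x8 sample of the 40x40 window
        d = (p - p.mean()) / (p.std() + 1e-8)       # bias/gain normalization
        descriptors.append(d.ravel())
    return np.array(descriptors)
```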
Example of feature patches
Feature Matching
Features are matched using Lowe's ratio test, as described in Section 5 of the paper. For each feature in the first image, we find its two nearest neighbors in the second image by Euclidean distance in descriptor space. A match is accepted only if: \[ \frac{d_{1-NN}}{d_{2-NN}} < 0.8 \] where \(d_{1-NN}\) is the distance to the closest match and \(d_{2-NN}\) is the distance to the second-closest match. A low ratio means the best match is clearly better than any alternative, so it is more likely to be correct; the 0.8 threshold was chosen empirically.
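A minimal sketch of the ratio-test matching, assuming descriptors are stored as rows of NumPy arrays (the helper name is illustrative):

```python
import numpy as np
from scipy.spatial.distance import cdist

def match_features(desc1, desc2, ratio=0.8):
    """Return index pairs (i, j) of candidate matches passing Lowe's ratio test."""
    dists = cdist(desc1, desc2)            # pairwise Euclidean distances
    matches = []
    for i, row in enumerate(dists):
        nn = np.argsort(row)[:2]           # indices of the two nearest neighbors
        if row[nn[0]] / (row[nn[1]] + 1e-8) < ratio:
            matches.append((i, nn[0]))
    return matches
```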
Feature matches between image pairs
RANSAC
After obtaining feature matches, we use RANSAC to robustly estimate the homography while removing outlier matches. The algorithm repeatedly selects random sets of four feature pairs and computes a candidate homography from each. For each candidate, it identifies the inlier matches whose geometric transfer error is below a threshold. The candidate with the largest set of inliers is selected, and a final refined homography is computed from all of its inlier matches. Eliminating incorrect matches in this way makes the estimated transformation more accurate for automatic alignment.
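A simplified sketch of the 4-point RANSAC loop, reusing the compute_homography helper sketched in the homography section; the iteration count and pixel threshold are illustrative assumptions:

```python
import numpy as np

def ransac_homography(pts1, pts2, n_iters=2000, thresh=3.0):
    """4-point RANSAC over matched (N, 2) point arrays; returns the homography
    refit on the largest inlier set. Uses compute_homography from the sketch above."""
    best_inliers = np.array([], dtype=int)
    pts1_h = np.hstack([pts1, np.ones((len(pts1), 1))]).T  # homogeneous, 3 x N
    for _ in range(n_iters):
        idx = np.random.choice(len(pts1), 4, replace=False)
        H = compute_homography(pts1[idx], pts2[idx])
        proj = H @ pts1_h
        proj = (proj[:2] / proj[2]).T                      # projected points in image 2
        err = np.linalg.norm(proj - pts2, axis=1)          # geometric transfer error
        inliers = np.flatnonzero(err < thresh)
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    # Final refined homography computed from all inliers of the best model.
    return compute_homography(pts1[best_inliers], pts2[best_inliers]), best_inliers
```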
Automated vs Manual Stitching Results
We compare our autostitching implementation (which includes all the steps above) against manual point selection. Both methods use the same warping and blending techniques from Part A, differing only in how corresponding points are obtained. The automated approach outperforms manual stitching; the difference is most evident in the second example.
Cedar Street: Manual (left) vs Automated (right)
Google Street View: Manual (left) vs Automated (right)
New images
Cedar Street at Night: Original images (top) vs Autostitch (bottom)