Setup
This section introduces the data collection rig used to capture synchronized RGB, depth, and 360° video.
Our setup combines two main components: a ZED X stereo camera with an integrated IMU, and an Insta360 X4 action camera, mounted together on a compact and portable rig.
Camera Overview
- Insta360 X4 Camera: Captures ultra-high-resolution 360° videos at 60 FPS. It records immersive views of the scene and supplies the two perspectives used for ground-truth camera poses: a user-perspective view and a ground-truth calibration view.
- ZED X Stereo Camera: Provides high-resolution RGB-D frames (1920×1200 at 60 FPS) with per-pixel depth and confidence estimates. It also includes an IMU tightly synced to the stereo video stream.
- NVIDIA Jetson Orin NX: Mounted in a backpack with a custom power setup and external SSD, this embedded device handles high-throughput data recording from the ZED camera.
Recording Process
- Calibration Targets: We place calibration boards around the environment.
- Dual Recording: We start the 360 and stereo recordings in parallel. A laser flash visible in both cameras is used to synchronize the video streams.
- Two-Part Capture:
  - First, a full-scene trajectory is recorded using the 360 camera.
  - Then, a close-up pass of the calibration boards is captured to aid in optimizing pose estimation and establishing ground-truth positions.
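The flash-based synchronization described above could be sketched as follows. This is a minimal illustration, not the pipeline actually used for the recordings: it assumes each stream has already been reduced to a per-frame mean brightness signal (the arrays below are synthetic), and the helper names `find_flash_frame` and `stream_offset_seconds` are hypothetical.

```python
import numpy as np


def find_flash_frame(brightness: np.ndarray) -> int:
    """Return the index of the frame with the largest brightness increase."""
    jumps = np.diff(np.asarray(brightness, dtype=np.float64))
    # diff[i] = brightness[i + 1] - brightness[i], so the flash frame is i + 1.
    return int(np.argmax(jumps)) + 1


def stream_offset_seconds(flash_a: int, fps_a: float,
                          flash_b: int, fps_b: float) -> float:
    """Seconds to add to stream B timestamps so both flashes coincide."""
    return flash_a / fps_a - flash_b / fps_b


# Synthetic per-frame mean brightness (assumption: stand-in for real video).
rng = np.random.default_rng(0)
bright_360 = 50.0 + rng.normal(0.0, 0.5, 600)
bright_zed = 60.0 + rng.normal(0.0, 0.5, 600)
bright_360[120:123] += 80.0  # flash appears at frame 120 of the 360 video
bright_zed[90:93] += 80.0    # flash appears at frame 90 of the ZED video

flash_360 = find_flash_frame(bright_360)
flash_zed = find_flash_frame(bright_zed)
offset = stream_offset_seconds(flash_360, 60.0, flash_zed, 60.0)
print(flash_360, flash_zed, offset)
```

With real footage, the brightness signal might be computed from a region of interest around the flash rather than the whole frame, and the detected frame should be checked visually.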
Camera Pose & Coordinate Frames
Understanding the spatial relationship between the different camera views is essential for sensor fusion and pose estimation.
Coordinate Frame Conventions:
All frames follow a standard right-handed coordinate system.
Red = X-axis, Green = Z-axis, Blue = Y-axis
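As a small sanity check for the right-handed convention, one can verify that the axes of a frame satisfy x × y = z. The sketch below assumes the axes are stored as the columns of a 3×3 matrix; the function name `is_right_handed` is hypothetical.

```python
import numpy as np


def is_right_handed(R: np.ndarray, tol: float = 1e-6) -> bool:
    """True if the columns of R form an orthonormal, right-handed basis."""
    R = np.asarray(R, dtype=np.float64)
    if R.shape != (3, 3):
        return False
    orthonormal = np.allclose(R.T @ R, np.eye(3), atol=tol)
    x_axis, y_axis, z_axis = R[:, 0], R[:, 1], R[:, 2]
    # In a right-handed frame the cross product of X and Y gives Z.
    return bool(orthonormal and np.allclose(np.cross(x_axis, y_axis), z_axis, atol=tol))


identity_ok = is_right_handed(np.eye(3))
# Flipping a single axis turns the frame left-handed.
flipped_ok = is_right_handed(np.diag([1.0, 1.0, -1.0]))
print(identity_ok, flipped_ok)
```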
Relative Transformations
Each transformation below is represented as a 4×4 matrix in the form:
\[
\begin{bmatrix}
\mathbf{R} & \mathbf{t} \\
\mathbf{0}^{\top} & 1
\end{bmatrix}
\]
where \( \mathbf{R} \in \mathbb{R}^{3 \times 3} \) is a rotation matrix and \( \mathbf{t} \in \mathbb{R}^{3 \times 1} \) is a translation vector. We have the following transformations:
- 360 GT View ↔ 360 User View
These views are two perspectives from the 360 camera:
- GT View faces downward toward the calibration board (ground truth reference).
- User View faces outward, representing the user's visual experience.
- The 4×4 relative transformation from the 360 GT View coordinate system to the 360 User View coordinate system is provided as relative_pose_gt_to_user.npy.
- 360 User View ↔ ZED Left Camera
The ZED camera is mounted above the 360 camera, slightly angled forward.
- This transformation maps between the ZED depth/IMU data and the 360 user perspective.
- It is essential for unified pose estimation and trajectory reconstruction.
- The 4×4 relative transformation from the 360 User View coordinate system to the ZED Left Camera coordinate system is provided as relative_pose_user_to_zed.npy.
- ZED Left Camera ↔ ZED IMU
The ZED X camera includes a built-in IMU that is factory-calibrated and tightly synchronized with the stereo camera.
- The 4×4 transformation between the ZED Left camera and its IMU is provided as relative_pose_zed_to_imu.npy.
More information about the ZED IMU can be found in the official Stereolabs documentation.
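The relative transformations listed above can be chained and inverted with a few lines of NumPy. The sketch below is illustrative only. It assumes that a matrix described as "from A to B" maps points expressed in frame A into frame B (p_B = T · p_A); this direction should be verified against the data before use. The helper names are hypothetical, and the matrices are synthetic stand-ins so the example runs without the dataset. With the real files, each matrix would instead be read with `np.load`, for example `np.load("relative_pose_gt_to_user.npy")`.

```python
import numpy as np


def check_pose(T: np.ndarray, tol: float = 1e-5) -> np.ndarray:
    """Validate that T is a 4x4 rigid transform [R t; 0 1] and return it."""
    T = np.asarray(T, dtype=np.float64)
    if T.shape != (4, 4):
        raise ValueError(f"expected a 4x4 matrix, got {T.shape}")
    R = T[:3, :3]
    if not np.allclose(R.T @ R, np.eye(3), atol=tol) or np.linalg.det(R) < 0:
        raise ValueError("rotation block is not a proper rotation")
    if not np.allclose(T[3], [0.0, 0.0, 0.0, 1.0], atol=tol):
        raise ValueError("last row must be [0, 0, 0, 1]")
    return T


def invert_pose(T: np.ndarray) -> np.ndarray:
    """Closed-form inverse of a rigid transform: [R^T, -R^T t; 0, 1]."""
    R, t = T[:3, :3], T[:3, 3]
    T_inv = np.eye(4)
    T_inv[:3, :3] = R.T
    T_inv[:3, 3] = -R.T @ t
    return T_inv


def transform_points(T: np.ndarray, points: np.ndarray) -> np.ndarray:
    """Apply T to an (N, 3) array of points."""
    points = np.asarray(points, dtype=np.float64)
    return points @ T[:3, :3].T + T[:3, 3]


def make_pose(angle_deg: float, t) -> np.ndarray:
    """Synthetic pose: rotation about the X-axis plus a translation."""
    a = np.deg2rad(angle_deg)
    c, s = np.cos(a), np.sin(a)
    T = np.eye(4)
    T[:3, :3] = [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]
    T[:3, 3] = t
    return T


# Synthetic stand-ins for the provided .npy files (values are made up).
T_gt_to_user = check_pose(make_pose(90.0, [0.0, 0.0, 0.10]))
T_user_to_zed = check_pose(make_pose(-10.0, [0.0, -0.05, 0.02]))

# Chain GT -> User -> ZED; the rightmost transform is applied first.
T_gt_to_zed = T_user_to_zed @ T_gt_to_user

p_gt = np.array([[0.0, 0.0, 1.0]])
p_zed = transform_points(T_gt_to_zed, p_gt)
p_back = transform_points(invert_pose(T_gt_to_zed), p_zed)
print(p_zed, p_back)
```

The closed-form inverse avoids a general matrix inversion and keeps the result an exact rigid transform; the same chaining pattern would extend to the IMU frame once the direction of relative_pose_zed_to_imu.npy has been confirmed.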