Annotation Pipeline
RoboX processes raw egocentric video through a multi-layer annotation pipeline. A single clip recorded by a contributor generates multiple aligned data layers, each adding training signal without requiring additional collection.
Pipeline Overview
1. Contributor records clip
2. On-device face blur + metadata attachment
3. Encrypted upload (MP4 + JSON)
4. Server-side QA filtering
5. Multi-layer annotation
6. Published to dataset
What Contributors Upload
One MP4 video (up to 15s, 1080p, H.265) and one JSON file containing device metadata. Face blur runs on-device before upload using YOLOv8n-face. Contributors see none of the annotation process.
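The upload pair can be sketched as a metadata record alongside the MP4. The field names below are assumptions for illustration, not the actual RoboX schema; only the documented constraints (15 s cap, 1080p, H.265, on-device face blur) come from this page.

```python
import json

# Illustrative device-metadata record accompanying the MP4 upload.
# Field names are assumptions, not the actual RoboX schema.
metadata = {
    "clip_id": "clip_0001",
    "duration_s": 8.4,          # clips are capped at 15 s
    "resolution": "1920x1080",  # 1080p
    "codec": "h265",
    "fps": 30,
    "face_blur_applied": True,  # YOLOv8n-face runs on-device before upload
}

# Enforce the documented upload constraints before accepting the clip.
assert metadata["duration_s"] <= 15
assert metadata["face_blur_applied"]

payload = json.dumps(metadata)
```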
Output
Curated video paired with annotation layers: object bounding boxes, hand keypoints, action phase segmentation, grip classification, and dense narrations. Annotation is what transforms a smartphone recording into a usable robotics training sample.
Annotation Layers
Each clip receives multiple temporally aligned annotation layers:
Dense Narration: Natural-language description of the full clip, generated by vision-language models. Covers hand motion, object interaction, spatial context, and action sequence.
Action Phase Segmentation: Temporal segmentation of each clip into discrete phases (reach, contact, grasp, lift, hold). Each phase includes start/end timestamps and frame indices.
Hand Keypoints: 21 keypoints per hand, per frame, extracted using MediaPipe Hands v2. An 8.4-second clip at 30 fps spans 252 frames, so a single clip produces at least 5,292 data points (21 x 252 per hand).
Object Tracking: Per-frame bounding boxes for all interacted objects, tracked using YOLO-World + ByteTrack. Includes track ID, confidence score, occlusion flag, and in-hand status.
Grip Classification: Grip type (palmar, pinch, lateral, etc.), hand used, finger count, and confidence score per grasp event.
Scene Context: Environment type, surface material, lighting conditions, and clutter level for each clip.
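The phase-to-frame mapping above can be sketched as follows. The phase vocabulary (reach, contact, grasp, lift, hold) is documented; the timestamps themselves are made up for illustration.

```python
# Sketch: map phase timestamps to frame indices at 30 fps.
# The timestamps are illustrative, not from a real clip.
FPS = 30

phases = [
    ("reach",   0.0, 1.2),
    ("contact", 1.2, 1.5),
    ("grasp",   1.5, 2.4),
    ("lift",    2.4, 4.0),
    ("hold",    4.0, 8.4),
]

segments = [
    {"phase": name, "start_s": s, "end_s": e,
     "start_frame": round(s * FPS), "end_frame": round(e * FPS)}
    for name, s, e in phases
]

assert segments[-1]["end_frame"] == 252  # 8.4 s x 30 fps
```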
Layer Alignment
All layers share the same temporal axis. For a single 8.4-second clip:
Video: 252 frames at 30fps
Action phases: 5 segments mapped to frame ranges
Hand keypoints: 21 points x 252 frames
Object bboxes: tracked across 252 frames with per-frame in-hand flag
Grip type: one classification per grasp event
Narration: one description per clip
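The alignment arithmetic above can be checked directly; this is a minimal sketch of the shared temporal axis, with placeholder values standing in for real annotations.

```python
# Sketch: all per-frame layers share the clip's temporal axis.
FPS = 30
duration_s = 8.4
n_frames = round(duration_s * FPS)  # 252 frames
keypoints_per_hand = 21

assert n_frames == 252
assert keypoints_per_hand * n_frames == 5292  # per-hand keypoint count quoted above

# Each per-frame layer must have exactly n_frames entries (placeholder values).
hand_keypoints = [[(0.0, 0.0, 0.0)] * keypoints_per_hand for _ in range(n_frames)]
object_bboxes = [{"bbox": [0, 0, 0, 0], "in_hand": False} for _ in range(n_frames)]
assert len(hand_keypoints) == len(object_bboxes) == n_frames
```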
Annotation Models
Dense narration: Gemini 2.0 Pro Vision
Hand keypoints: MediaPipe Hands v2
Object detection: YOLO-World-L
Object tracking: ByteTrack
Face blur (on-device): YOLOv8n-face
Per-Frame Data Format
Hand keypoints are stored as one JSON line per frame:
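A per-frame keypoint line might look like the sketch below. The field names are assumptions, not the actual RoboX schema; the only documented constraints are one JSON line per frame and 21 landmarks per hand.

```python
import json

# Hypothetical per-frame keypoint record; field names are illustrative.
# MediaPipe Hands emits 21 normalized (x, y, z) landmarks per hand.
frame_record = {
    "frame": 0,
    "hand": "right",
    "keypoints": [[0.41, 0.55, -0.02]] * 21,  # 21 (x, y, z) points
}

line = json.dumps(frame_record)  # one JSON line per frame
assert len(frame_record["keypoints"]) == 21
```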
Object tracks follow the same per-frame structure:
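An object-track line can be sketched the same way. The documented attributes are track ID, confidence score, occlusion flag, and in-hand status; the exact field names and the bounding-box convention are assumptions.

```python
import json

# Hypothetical per-frame object-track record mirroring the keypoint layout.
# Field names and the [x1, y1, x2, y2] pixel convention are assumptions.
track_record = {
    "frame": 0,
    "track_id": 3,
    "bbox": [412, 230, 585, 398],
    "confidence": 0.91,
    "occluded": False,
    "in_hand": True,
}

line = json.dumps(track_record)  # one JSON line per frame
```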
Per-Clip Annotation Record
Each clip receives a single annotation record containing all layers, stored in annotations/clips.jsonl:
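A combined record might look like the following sketch. The layer names follow the documented annotation layers; the exact field names, paths, and values are assumptions for illustration.

```python
import json

# Hypothetical combined annotation record; field names are assumptions.
clip_record = {
    "clip_id": "clip_0001",
    "narration": "Right hand reaches for a mug on a wooden table.",
    "phases": [{"phase": "reach", "start_frame": 0, "end_frame": 36}],
    "hand_keypoints_path": "keypoints/clip_0001.jsonl",
    "object_tracks_path": "tracks/clip_0001.jsonl",
    "grip": {"type": "palmar", "hand": "right", "fingers": 5, "confidence": 0.88},
    "scene": {"environment": "kitchen", "surface": "wood",
              "lighting": "bright", "clutter": "low"},
}

# Append as one JSON line, mirroring the clips.jsonl layout.
with open("clips.jsonl", "a") as f:
    f.write(json.dumps(clip_record) + "\n")
```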
Layered Annotation on Static Assets
The pipeline is designed to run new annotation passes on existing clips as models improve. A clip recorded today can receive updated keypoint extraction, better object detection, or entirely new annotation types without re-collection. The raw video is the permanent asset. Annotation layers stack on top.
Quality Assurance
Every clip goes through automated QA before annotation:
Hands visible in frame
Target object present
Stable framing (no excessive motion blur)
Adequate lighting
Face blur confirmed applied
Clips that fail QA are flagged and excluded from published datasets. Quality scores are included in clip metadata.
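The QA gate above can be sketched as a single predicate. The check names mirror the documented list; the field names and thresholds are assumptions for illustration.

```python
# Sketch of the automated QA gate; thresholds and field names are assumptions.
def passes_qa(clip: dict) -> bool:
    checks = [
        clip.get("hands_visible", False),         # hands visible in frame
        clip.get("object_present", False),        # target object present
        clip.get("motion_blur_score", 1.0) < 0.3, # stable framing
        clip.get("brightness", 0.0) > 0.2,        # adequate lighting
        clip.get("face_blur_applied", False),     # face blur confirmed
    ]
    return all(checks)

good = {"hands_visible": True, "object_present": True,
        "motion_blur_score": 0.1, "brightness": 0.6,
        "face_blur_applied": True}
assert passes_qa(good)
assert not passes_qa({**good, "face_blur_applied": False})
```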