Annotation Pipeline

RoboX processes raw egocentric video through a multi-layer annotation pipeline. A single clip recorded by a contributor generates multiple aligned data layers, each adding training signal without requiring additional collection.

Pipeline Overview

  1. Contributor records clip

  2. On-device face blur + metadata attachment

  3. Encrypted upload (MP4 + JSON)

  4. Server-side QA filtering

  5. Multi-layer annotation

  6. Published to dataset

What Contributors Upload

One MP4 video (up to 15s, 1080p, H.265) and one JSON file containing device metadata. Face blur runs on-device before upload using YOLOv8n-face. Contributors see none of the annotation process.
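The metadata schema is not specified here; a minimal sketch of what such a device-metadata JSON could contain (all field names are illustrative assumptions, not the actual RoboX schema):

```python
import json

# Illustrative device-metadata record; field names are assumptions,
# not the documented RoboX schema.
metadata = {
    "clip_id": "clip_000184",
    "resolution": "1920x1080",
    "fps": 30,
    "codec": "H.265",
    "duration_s": 8.4,
    "face_blur_applied": True,  # YOLOv8n-face runs on-device before upload
}

line = json.dumps(metadata)
print(line)
```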

Output

Curated video paired with annotation layers: object bounding boxes, hand keypoints, action phase segmentation, grip classification, and dense narrations. Annotation is what transforms a smartphone recording into a usable robotics training sample.

Annotation Layers

Each clip receives multiple temporally aligned annotation layers:

Dense Narration: Natural language description of the full clip generated by vision-language models. Describes hand motion, object interaction, spatial context, and action sequence.

Action Phase Segmentation: Temporal segmentation of each clip into discrete phases: reach, contact, grasp, lift, hold. Each phase includes start/end timestamps and frame indices.

Hand Keypoints: 21 keypoints per hand, per frame, extracted using MediaPipe Hands v2. For an 8.4-second clip at 30fps (252 frames), a single hand yields 5,292 keypoints, with more if both hands are in frame.

Object Tracking: Per-frame bounding boxes for all interacted objects, tracked using YOLO-World + ByteTrack. Includes track ID, confidence score, occlusion flag, and in-hand status.

Grip Classification: Grip type (palmar, pinch, lateral, etc.), hand used, finger count, and confidence score per grasp event.

Scene Context: Environment type, surface material, lighting conditions, and clutter level for each clip.

Layer Alignment

All layers share the same temporal axis. For a single 8.4-second clip:

  • Video: 252 frames at 30fps

  • Action phases: 5 segments mapped to frame ranges

  • Hand keypoints: 21 points x 252 frames

  • Object bboxes: tracked across 252 frames with per-frame in-hand flag

  • Grip type: one classification per grasp event

  • Narration: one description per clip
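The per-clip counts above follow directly from duration and frame rate; a quick arithmetic check using the numbers from the example clip:

```python
# Counts for the example 8.4-second clip; constants come from the
# layer-alignment description above.
FPS = 30
DURATION_S = 8.4
KEYPOINTS_PER_HAND = 21

frames = round(DURATION_S * FPS)          # video frames in the clip
keypoints = KEYPOINTS_PER_HAND * frames   # keypoints for one hand

print(frames, keypoints)
```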

Annotation Models

Layer                   Model
Dense narration         Gemini 2.0 Pro Vision
Hand keypoints          MediaPipe Hands v2
Object detection        YOLO-World-L
Object tracking         ByteTrack
Face blur (on-device)   YOLOv8n-face

Per-Frame Data Format

Hand keypoints are stored as one JSON line per frame:
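A sketch of one such line, built in Python; the exact field names are assumptions, but the content (frame index, hand label, 21 x/y/z points) matches the layer description above:

```python
import json

# One JSON line per frame; 21 (x, y, z) keypoints per hand.
# Field names are illustrative, not the documented schema.
frame_record = {
    "frame": 0,
    "hand": "right",
    "keypoints": [[0.512, 0.431, -0.02]] * 21,  # normalized x, y, z
}

line = json.dumps(frame_record)
print(line)
```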

Object tracks follow the same per-frame structure:
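A sketch of a per-frame track record, mirroring the object-tracking layer description (track ID, confidence, occlusion flag, in-hand status); field names are illustrative assumptions:

```python
import json

# Per-frame object track record; names are illustrative assumptions.
track_record = {
    "frame": 0,
    "track_id": 3,
    "label": "mug",
    "bbox": [412, 286, 590, 471],  # x1, y1, x2, y2 in pixels
    "confidence": 0.94,
    "occluded": False,
    "in_hand": True,
}

line = json.dumps(track_record)
print(line)
```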

Per-Clip Annotation Record

Each clip receives a single annotation record containing all layers, stored in annotations/clips.jsonl:
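A sketch of a clips.jsonl record bundling the layers described above; the structure, field names, and per-frame file references are an illustrative assumption, not the published schema:

```python
import json

# One record per clip; structure and field names are illustrative.
clip_record = {
    "clip_id": "clip_000184",
    "duration_s": 8.4,
    "fps": 30,
    "narration": "Right hand reaches across the counter, grips the mug "
                 "handle with a pinch grasp, and lifts it to chest height.",
    "phases": [  # the five phases from the segmentation layer
        {"phase": "reach",   "start_frame": 0,   "end_frame": 41},
        {"phase": "contact", "start_frame": 42,  "end_frame": 55},
        {"phase": "grasp",   "start_frame": 56,  "end_frame": 88},
        {"phase": "lift",    "start_frame": 89,  "end_frame": 150},
        {"phase": "hold",    "start_frame": 151, "end_frame": 251},
    ],
    "grips": [{"type": "pinch", "hand": "right",
               "fingers": 3, "confidence": 0.91}],
    "scene": {"environment": "kitchen", "surface": "laminate",
              "lighting": "indoor_bright", "clutter": "low"},
    # Per-frame layers kept in sidecar JSONL files (hypothetical paths).
    "keypoints_file": "keypoints/clip_000184.jsonl",
    "tracks_file": "tracks/clip_000184.jsonl",
}

line = json.dumps(clip_record)
print(line)
```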

Layered Annotation on Static Assets

The pipeline is designed to run new annotation passes on existing clips as models improve. A clip recorded today can receive updated keypoint extraction, better object detection, or entirely new annotation types without re-collection. The raw video is the permanent asset. Annotation layers stack on top.

Quality Assurance

Every clip goes through automated QA before annotation:

  • Hands visible in frame

  • Target object present

  • Stable framing (no excessive motion blur)

  • Adequate lighting

  • Face blur confirmed applied

Clips that fail QA are flagged and excluded from published datasets. Quality scores are included in clip metadata.
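The QA gate above amounts to an all-checks-must-pass filter; a minimal sketch, assuming boolean check results are attached to clip metadata (check names and fields are hypothetical):

```python
# Minimal sketch of the automated QA gate; check names and the
# metadata layout are assumptions, not RoboX's actual implementation.
QA_CHECKS = ("hands_visible", "object_present", "stable_framing",
             "adequate_lighting", "face_blur_applied")

def passes_qa(clip_meta: dict) -> bool:
    """A clip is published only if every QA check is true."""
    return all(clip_meta.get(check, False) for check in QA_CHECKS)

good = {check: True for check in QA_CHECKS}
bad = dict(good, face_blur_applied=False)  # fails one check -> excluded
print(passes_qa(good), passes_qa(bad))
```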
