Annotation Pipeline

RoboX processes raw egocentric video through a multi-layer annotation pipeline. A single clip recorded by a contributor generates multiple aligned data layers, each adding training signal without requiring additional collection.

Pipeline Overview

  1. Contributor records clip

  2. On-device face blur + metadata attachment

  3. Encrypted upload (MP4 + JSON)

  4. Server-side QA filtering

  5. Multi-layer annotation

  6. Published to dataset

What Contributors Upload

One MP4 video (up to 15s, 1080p, H.265) and one JSON file containing device metadata. Face blur runs on-device before upload using YOLOv8n-face. Contributors see none of the annotation process.
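The metadata schema is not specified here; a minimal sketch of what such a device-metadata JSON could contain (all field names are illustrative assumptions, not the actual RoboX schema):

```python
import json

# Illustrative device-metadata record; field names are assumptions,
# not the documented RoboX schema.
metadata = {
    "clip_id": "clip_000184",
    "resolution": "1920x1080",
    "fps": 30,
    "codec": "H.265",
    "duration_s": 8.4,
    "face_blur_applied": True,  # YOLOv8n-face runs on-device before upload
}

line = json.dumps(metadata)
print(line)
```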

Output

Curated video paired with annotation layers: object bounding boxes, hand keypoints, action phase segmentation, grip classification, and dense narrations. Annotation is what transforms a smartphone recording into a usable robotics training sample.

Annotation Layers

Each clip receives multiple temporally aligned annotation layers:

Dense Narration: Natural language description of the full clip generated by vision-language models. Describes hand motion, object interaction, spatial context, and action sequence.

Action Phase Segmentation: Temporal segmentation of each clip into discrete phases: reach, contact, grasp, lift, hold. Each phase includes start/end timestamps and frame indices.

Hand Keypoints: 21 keypoints per hand, per frame, extracted using MediaPipe Hands v2. For an 8.4-second clip at 30fps (252 frames), a single hand yields 5,292 keypoints, with more if both hands are in frame.

Object Tracking: Per-frame bounding boxes for all interacted objects, tracked using YOLO-World + ByteTrack. Includes track ID, confidence score, occlusion flag, and in-hand status.

Grip Classification: Grip type (palmar, pinch, lateral, etc.), hand used, finger count, and confidence score per grasp event.

Scene Context: Environment type, surface material, lighting conditions, and clutter level for each clip.

Layer Alignment

All layers share the same temporal axis. For a single 8.4-second clip:

  • Video: 252 frames at 30fps

  • Action phases: 5 segments mapped to frame ranges

  • Hand keypoints: 21 points x 252 frames

  • Object bboxes: tracked across 252 frames with per-frame in-hand flag

  • Grip type: one classification per grasp event

  • Narration: one description per clip
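The per-clip counts above follow directly from duration and frame rate; a quick arithmetic check using the numbers from the example clip:

```python
# Counts for the example 8.4-second clip; constants come from the
# layer-alignment description above.
FPS = 30
DURATION_S = 8.4
KEYPOINTS_PER_HAND = 21

frames = round(DURATION_S * FPS)          # video frames in the clip
keypoints = KEYPOINTS_PER_HAND * frames   # keypoints for one hand

print(frames, keypoints)
```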

Annotation Models

Layer                   Model
Dense narration         Gemini 2.0 Pro Vision
Hand keypoints          MediaPipe Hands v2
Object detection        YOLO-World-L
Object tracking         ByteTrack
Face blur (on-device)   YOLOv8n-face

Per-Frame Data Format

Hand keypoints are stored as one JSON line per frame:
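A sketch of one such line, built in Python; the exact field names are assumptions, but the content (frame index, hand label, 21 x/y/z points) matches the layer description above:

```python
import json

# One JSON line per frame; 21 (x, y, z) keypoints per hand.
# Field names are illustrative, not the documented schema.
frame_record = {
    "frame": 0,
    "hand": "right",
    "keypoints": [[0.512, 0.431, -0.02]] * 21,  # normalized x, y, z
}

line = json.dumps(frame_record)
print(line)
```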

Object tracks follow the same per-frame structure:
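A sketch of a per-frame track record, mirroring the object-tracking layer description (track ID, confidence, occlusion flag, in-hand status); field names are illustrative assumptions:

```python
import json

# Per-frame object track record; names are illustrative assumptions.
track_record = {
    "frame": 0,
    "track_id": 3,
    "label": "mug",
    "bbox": [412, 286, 590, 471],  # x1, y1, x2, y2 in pixels
    "confidence": 0.94,
    "occluded": False,
    "in_hand": True,
}

line = json.dumps(track_record)
print(line)
```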

Per-Clip Annotation Record

Each clip receives a single annotation record containing all layers, stored in annotations/clips.jsonl:
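A sketch of a clips.jsonl record bundling the layers described above; the structure, field names, and per-frame file references are an illustrative assumption, not the published schema:

```python
import json

# One record per clip; structure and field names are illustrative.
clip_record = {
    "clip_id": "clip_000184",
    "duration_s": 8.4,
    "fps": 30,
    "narration": "Right hand reaches across the counter, grips the mug "
                 "handle with a pinch grasp, and lifts it to chest height.",
    "phases": [  # the five phases from the segmentation layer
        {"phase": "reach",   "start_frame": 0,   "end_frame": 41},
        {"phase": "contact", "start_frame": 42,  "end_frame": 55},
        {"phase": "grasp",   "start_frame": 56,  "end_frame": 88},
        {"phase": "lift",    "start_frame": 89,  "end_frame": 150},
        {"phase": "hold",    "start_frame": 151, "end_frame": 251},
    ],
    "grips": [{"type": "pinch", "hand": "right",
               "fingers": 3, "confidence": 0.91}],
    "scene": {"environment": "kitchen", "surface": "laminate",
              "lighting": "indoor_bright", "clutter": "low"},
    # Per-frame layers kept in sidecar JSONL files (hypothetical paths).
    "keypoints_file": "keypoints/clip_000184.jsonl",
    "tracks_file": "tracks/clip_000184.jsonl",
}

line = json.dumps(clip_record)
print(line)
```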

Layered Annotation on Static Assets

The pipeline is designed to run new annotation passes on existing clips as models improve. A clip recorded today can receive updated keypoint extraction, better object detection, or entirely new annotation types without re-collection. The raw video is the permanent asset. Annotation layers stack on top.

Quality Assurance

Every clip goes through automated QA before annotation:

  • Hands visible in frame

  • Target object present

  • Stable framing (no excessive motion blur)

  • Adequate lighting

  • Face blur confirmed applied

Clips that fail QA are flagged and excluded from published datasets. Quality scores are included in clip metadata.
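The QA gate above amounts to an all-checks-must-pass filter; a minimal sketch, assuming boolean check results are attached to clip metadata (check names and fields are hypothetical):

```python
# Minimal sketch of the automated QA gate; check names and the
# metadata layout are assumptions, not RoboX's actual implementation.
QA_CHECKS = ("hands_visible", "object_present", "stable_framing",
             "adequate_lighting", "face_blur_applied")

def passes_qa(clip_meta: dict) -> bool:
    """A clip is published only if every QA check is true."""
    return all(clip_meta.get(check, False) for check in QA_CHECKS)

good = {check: True for check in QA_CHECKS}
bad = dict(good, face_blur_applied=False)  # fails one check -> excluded
print(passes_qa(good), passes_qa(bad))
```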
