# Annotation Pipeline

RoboX processes raw egocentric video through a multi-layer annotation pipeline. A single clip recorded by a contributor generates multiple aligned data layers, each adding training signal without requiring additional collection.

### **Pipeline Overview**

1. Contributor records clip
2. On-device face blur + metadata attachment
3. Encrypted upload (MP4 + JSON)
4. Server-side QA filtering
5. Multi-layer annotation
6. Published to dataset
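The six stages above can be sketched as a simple function chain. All function and field names here are illustrative assumptions, not the real RoboX API:

```python
# Hypothetical sketch of the pipeline stages; names are illustrative only.

def face_blur(clip):            # stage 2: on-device face blur + metadata
    clip["faces_blurred"] = True
    return clip

def upload(clip):               # stage 3: encrypted upload (MP4 + JSON)
    clip["uploaded"] = True
    return clip

def qa_filter(clip):            # stage 4: server-side QA filtering
    return clip if clip.get("hands_visible") else None

def annotate(clip):             # stage 5: multi-layer annotation
    clip["layers"] = ["narration", "phases", "keypoints", "tracks", "grip", "scene"]
    return clip

def run_pipeline(clip):
    for stage in (face_blur, upload, qa_filter, annotate):
        clip = stage(clip)
        if clip is None:        # a QA failure drops the clip entirely
            return None
    return clip

result = run_pipeline({"clip_id": "grasp_000042", "hands_visible": True})
```

The key design point is that QA sits before annotation: a clip that fails stage 4 never consumes annotation compute.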

### **What Contributors Upload**

One MP4 video (up to 15s, 1080p, H.265) and one JSON file containing device metadata. Face blur runs on-device before upload using YOLOv8n-face. Contributors see none of the annotation process.

### **Output**

The pipeline's output is curated video paired with annotation layers: object bounding boxes, hand keypoints, action phase segmentation, grip classification, and dense narrations. Annotation is what transforms a smartphone recording into a usable robotics training sample.

### **Annotation Layers**

Each clip receives multiple temporally aligned annotation layers:

**Dense Narration** \
Natural language description of the full clip generated by vision-language models. Describes hand motion, object interaction, spatial context, and action sequence.

**Action Phase Segmentation** \
Temporal segmentation of each clip into discrete phases: reach, contact, grasp, lift, hold. Each phase includes start/end timestamps and frame indices.

**Hand Keypoints** \
21 keypoints per hand, per frame. Extracted using MediaPipe Hands v2. An 8.4-second clip at 30fps spans 252 frames, yielding 5,292 keypoint observations per hand.

**Object Tracking** \
Per-frame bounding boxes for all interacted objects. Includes track ID, confidence score, occlusion flag, and in-hand status. Tracked using YOLO-World + ByteTrack.

**Grip Classification** \
Grip type (palmar, pinch, lateral, etc.), hand used, finger count, and confidence score per grasp event.

**Scene Context** \
Environment type, surface material, lighting conditions, and clutter level for each clip.

### **Layer Alignment**

All layers share the same temporal axis. For a single 8.4-second clip:

* Video: 252 frames at 30fps
* Action phases: 5 segments mapped to frame ranges
* Hand keypoints: 21 points × 252 frames
* Object bboxes: tracked across 252 frames with per-frame in-hand flag
* Grip type: one classification per grasp event
* Narration: one description per clip
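The alignment arithmetic above can be checked directly:

```python
# Frame and keypoint counts for the 8.4-second example clip.
fps = 30
duration_sec = 8.4
n_frames = round(duration_sec * fps)             # 252 frames share one temporal axis
keypoints_per_hand = 21
kp_observations = keypoints_per_hand * n_frames  # keypoint observations per hand
```

Because every layer indexes the same frame axis, a frame number alone is enough to join keypoints, bounding boxes, and the active action phase.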

### **Annotation Models**

| Layer                 | Model                 |
| --------------------- | --------------------- |
| Dense narration       | Gemini 2.0 Pro Vision |
| Hand keypoints        | MediaPipe Hands v2    |
| Object detection      | YOLO-World-L          |
| Object tracking       | ByteTrack             |
| Face blur (on-device) | YOLOv8n-face          |

### **Per-Frame Data Format**

Hand keypoints are stored as one JSON line per frame:

```json
{
  "frame": 84,
  "timestamp_sec": 2.80,
  "hands": [
    {
      "hand": "right",
      "confidence": 0.96,
      "keypoints": [
        {"name": "wrist", "x": 0.52, "y": 0.61, "conf": 0.98},
        {"name": "thumb_tip", "x": 0.48, "y": 0.55, "conf": 0.95},
        {"name": "index_tip", "x": 0.50, "y": 0.49, "conf": 0.97}
      ]
    }
  ]
}
```

Object tracks follow the same per-frame structure:

```json
{
  "frame": 84,
  "timestamp_sec": 2.80,
  "objects": [
    {
      "label": "mouse",
      "track_id": 1,
      "bbox": {"x_center": 0.45, "y_center": 0.52, "width": 0.08, "height": 0.05},
      "confidence": 0.92,
      "occluded": false,
      "in_hand": true
    }
  ]
}
```
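Both per-frame formats share the same `frame`/`timestamp_sec` envelope, so one reader pattern covers them. A minimal sketch, assuming the JSONL files follow the schema shown above (the helper name is illustrative):

```python
import json

def frames_with_object_in_hand(jsonl_lines, label):
    """Yield frame indices where `label` is tracked with in_hand == true."""
    for line in jsonl_lines:
        frame = json.loads(line)          # one JSON object per line
        for obj in frame.get("objects", []):
            if obj["label"] == label and obj["in_hand"]:
                yield frame["frame"]

# One line from an object-tracking JSONL file, matching the example above.
example = ('{"frame": 84, "timestamp_sec": 2.80, "objects": '
           '[{"label": "mouse", "track_id": 1, "confidence": 0.92, '
           '"occluded": false, "in_hand": true}]}')
print(list(frames_with_object_in_hand([example], "mouse")))  # [84]
```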

### **Per-Clip Annotation Record**

Each clip receives a single annotation record containing all layers, stored in `annotations/clips.jsonl`:

```json
{
  "clip_id": "grasp_000042",
  "narration": "Right hand reaches across a wooden desk toward a black wireless mouse...",
  "object": {"label": "Mouse", "domain": "Office", "material": "plastic"},
  "action_phases": [
    {"phase": "reach", "start_sec": 0.00, "end_sec": 2.30},
    {"phase": "contact", "start_sec": 2.30, "end_sec": 2.80},
    {"phase": "grasp", "start_sec": 2.80, "end_sec": 3.60},
    {"phase": "lift", "start_sec": 3.60, "end_sec": 5.90},
    {"phase": "hold", "start_sec": 5.90, "end_sec": 8.40}
  ],
  "grip": {"type": "palmar", "hand": "right", "finger_count": 5, "confidence": 0.91},
  "scene": {"environment": "office", "surface": "wooden_desk", "lighting": "indoor_artificial"},
  "hand_keypoints_file": "annotations/hand_keypoints/grasp_000042.jsonl",
  "object_tracking_file": "annotations/object_tracks/grasp_000042.jsonl"
}
```
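Because phase boundaries are stored in seconds and all layers share the 30fps frame axis, mapping a phase to frame indices is a single multiplication. A sketch (the helper name is an assumption, not part of the published schema):

```python
# Map an action phase from the per-clip record to a frame range, assuming
# 30 fps video as in the examples above.
clip = {
    "action_phases": [
        {"phase": "reach", "start_sec": 0.00, "end_sec": 2.30},
        {"phase": "grasp", "start_sec": 2.80, "end_sec": 3.60},
    ]
}

def phase_frame_range(clip, phase_name, fps=30):
    for p in clip["action_phases"]:
        if p["phase"] == phase_name:
            return round(p["start_sec"] * fps), round(p["end_sec"] * fps)
    return None  # phase not present in this clip

print(phase_frame_range(clip, "grasp"))  # (84, 108)
```

Note how the grasp phase starting at 2.80s lands on frame 84, the same frame shown in the per-frame examples above.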

### **Layered Annotation on Static Assets**

The pipeline is designed to run new annotation passes on existing clips as models improve. A clip recorded today can receive updated keypoint extraction, better object detection, or entirely new annotation types without re-collection. The raw video is the permanent asset. Annotation layers stack on top.

### **Quality Assurance**

Every clip goes through automated QA before annotation:

* Hands visible in frame
* Target object present
* Stable framing (no excessive motion blur)
* Adequate lighting
* Face blur confirmed applied

Clips that fail QA are flagged and excluded from published datasets. Quality scores are included in clip metadata.
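The five checks above can be expressed as a gate over clip metadata. This is an illustrative sketch only; the field names and thresholds are assumptions, not the real QA schema:

```python
# Hypothetical QA gate: each check is a predicate over assumed metadata fields.
QA_CHECKS = {
    "hands_visible":     lambda m: m.get("hands_visible", False),
    "object_present":    lambda m: m.get("target_object_detected", False),
    "stable_framing":    lambda m: m.get("motion_blur_score", 1.0) < 0.3,
    "adequate_lighting": lambda m: m.get("mean_luma", 0) > 40,
    "face_blur_applied": lambda m: m.get("faces_blurred", False),
}

def qa_score(meta):
    """Return per-check results plus an overall pass/fail flag."""
    results = {name: check(meta) for name, check in QA_CHECKS.items()}
    return results, all(results.values())

meta = {"hands_visible": True, "target_object_detected": True,
        "motion_blur_score": 0.1, "mean_luma": 90, "faces_blurred": True}
results, passed = qa_score(meta)
```

Keeping the per-check results (not just the overall flag) is what allows quality scores to be attached to clip metadata rather than discarded.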
