# Data Format

### Dataset Structure

Every RoboX dataset is organized by campaign. Each campaign dataset contains clips, annotations, and metadata.

```
robox-egograsp/
├── manifest.json               # Dataset version, schema version, stats
├── metadata.parquet             # Per-clip metadata (environment, duration, quality score)
├── clips/
│   ├── clip_001.mp4            # Egocentric video
│   ├── clip_002.mp4
│   └── ...
├── annotations/
│   ├── clip_001.json           # All annotation layers for this clip
│   ├── clip_002.json
│   └── ...
└── README.md                   # Dataset card with license, citation, changelog
```
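
The layout above implies a one-to-one pairing between clip files and annotation files that share a stem. A minimal sketch of deriving one path from the other (the paths are illustrative):

```python
from pathlib import Path

def annotation_path(clip_path: Path) -> Path:
    # clips/clip_001.mp4 -> annotations/clip_001.json, per the layout above
    return clip_path.parent.parent / "annotations" / (clip_path.stem + ".json")

clip = Path("robox-egograsp/clips/clip_001.mp4")
ann = annotation_path(clip)  # robox-egograsp/annotations/clip_001.json
```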

### Video Format

| Property         | Value                                       |
| ---------------- | ------------------------------------------- |
| Container        | MP4 (H.264)                                 |
| Resolution       | 1080p–4K (varies by device)                 |
| Frame rate       | 30 fps (standard), 60 fps (where supported) |
| Color space      | sRGB                                        |
| Audio            | Stripped during anonymization               |
| Typical duration | 10–120 seconds per clip                     |

All videos are anonymized before inclusion: faces, legible text, license plates, and device identifiers are removed or blurred on-device.

### Annotation Schema

Each clip's annotation file contains multiple layers. Not every campaign includes all layers — see the layer availability table below.

#### Per-Clip Annotation Structure

```json
{
  "clip_id": "egograsp_clip_001",
  "campaign": "egograsp",
  "schema_version": "2.1",
  "duration_sec": 24.5,
  "fps": 30,
  "total_frames": 735,
  "annotations": {
    "temporal_segmentation": [...],
    "object_bounding": [...],
    "hand_pose": [...],
    "gaze_direction": [...],
    "spatial_layout": {...},
    "interaction_classification": [...]
  },
  "quality": {
    "overall_score": 0.92,
    "lighting_score": 0.95,
    "stability_score": 0.88,
    "completeness_score": 0.93
  }
}
```
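
A basic structural check against this schema can catch truncated or mismatched files before training. The sketch below validates the required top-level keys and cross-checks `total_frames` against `duration_sec * fps`; the one-frame tolerance is an assumption, not part of the spec:

```python
import json

REQUIRED_TOP_LEVEL = {"clip_id", "campaign", "schema_version", "duration_sec",
                      "fps", "total_frames", "annotations", "quality"}

def validate_clip_annotation(doc: dict) -> list:
    """Return a list of problems; an empty list means the basic checks pass."""
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_TOP_LEVEL - doc.keys())]
    # Cross-check frame count against duration and fps (allow up to 1 s of slack).
    if not problems and abs(doc["total_frames"] - doc["duration_sec"] * doc["fps"]) > doc["fps"]:
        problems.append("total_frames inconsistent with duration_sec * fps")
    return problems

doc = json.loads("""{"clip_id": "egograsp_clip_001", "campaign": "egograsp",
  "schema_version": "2.1", "duration_sec": 24.5, "fps": 30,
  "total_frames": 735, "annotations": {}, "quality": {}}""")
print(validate_clip_annotation(doc))  # []
```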

#### Annotation Layer Details

**Temporal Segmentation**

Action start/end boundaries with labeled phases. Each segment includes a label from the campaign's action vocabulary.

```json
{
  "segments": [
    {
      "start_frame": 0,
      "end_frame": 120,
      "label": "approach",
      "confidence": 0.94
    },
    {
      "start_frame": 121,
      "end_frame": 340,
      "label": "grasp",
      "confidence": 0.91
    }
  ]
}
```
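
The adjacent segments in the example (`0–120` followed by `121–340`) suggest that frame boundaries are inclusive; that reading is assumed in this sketch, which looks up the action label covering a given frame:

```python
def label_at_frame(segments, frame):
    """Return the label of the segment covering `frame` (boundaries inclusive)."""
    for seg in segments:
        if seg["start_frame"] <= frame <= seg["end_frame"]:
            return seg["label"]
    return None  # frame falls outside all annotated segments

segments = [
    {"start_frame": 0, "end_frame": 120, "label": "approach", "confidence": 0.94},
    {"start_frame": 121, "end_frame": 340, "label": "grasp", "confidence": 0.91},
]
print(label_at_frame(segments, 150))  # grasp
```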

**Object Bounding Boxes**

Per-frame object detection with bounding box coordinates and category labels.

```json
{
  "frame": 150,
  "objects": [
    {
      "object_id": "obj_01",
      "category": "mug",
      "bbox": [120, 340, 280, 510],
      "confidence": 0.96
    }
  ]
}
```

Bounding box format: `[x_min, y_min, x_max, y_max]` in pixel coordinates.
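
Tools that expect `[x, y, width, height]` boxes (a common alternative convention) need a small conversion from this corner format, sketched here:

```python
def bbox_to_xywh(bbox):
    """Convert [x_min, y_min, x_max, y_max] to [x, y, width, height]."""
    x_min, y_min, x_max, y_max = bbox
    return [x_min, y_min, x_max - x_min, y_max - y_min]

print(bbox_to_xywh([120, 340, 280, 510]))  # [120, 340, 160, 170]
```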

**Hand Pose**

21-keypoint hand skeleton per frame, following the MediaPipe Hand Landmarks topology.

```json
{
  "frame": 150,
  "hands": [
    {
      "hand_id": "right",
      "keypoints": [
        {"joint": "wrist", "x": 0.45, "y": 0.62, "z": 0.01, "confidence": 0.97},
        {"joint": "thumb_tip", "x": 0.48, "y": 0.55, "z": 0.03, "confidence": 0.93}
      ],
      "grip_state": "power_grasp"
    }
  ]
}
```

Keypoint coordinates are normalized to `[0, 1]` relative to frame dimensions.
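
Recovering pixel positions therefore requires the clip's resolution (available in `metadata.parquet`). A minimal sketch, using an assumed 1920x1080 clip:

```python
def keypoint_to_pixels(kp, width, height):
    """Scale a normalized [0, 1] keypoint to pixel coordinates."""
    return kp["x"] * width, kp["y"] * height

wrist = {"joint": "wrist", "x": 0.45, "y": 0.62, "z": 0.01, "confidence": 0.97}
px, py = keypoint_to_pixels(wrist, 1920, 1080)  # roughly (864, 669.6)
```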

**Gaze Direction**

Estimated gaze vector per frame.

```json
{
  "frame": 150,
  "gaze": {
    "x": 0.52,
    "y": 0.41,
    "confidence": 0.85
  }
}
```

Coordinates represent the estimated fixation point as normalized `[0, 1]` within the frame.
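
Since gaze is normalized while bounding boxes are in pixels, combining the two layers (e.g. "was the user looking at the object?") needs a unit conversion first. A sketch under that assumption, with hypothetical box coordinates:

```python
def gaze_on_object(gaze, bbox, width, height):
    """True if the normalized fixation point falls inside a pixel-space bbox."""
    gx, gy = gaze["x"] * width, gaze["y"] * height
    x_min, y_min, x_max, y_max = bbox
    return x_min <= gx <= x_max and y_min <= gy <= y_max

gaze = {"x": 0.52, "y": 0.41, "confidence": 0.85}
# Hypothetical bbox near the center of a 1920x1080 frame:
print(gaze_on_object(gaze, [900, 400, 1100, 500], 1920, 1080))  # True
```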

**Spatial Layout**

Scene structure and surface annotations, provided once per clip rather than per frame.

```json
{
  "environment": "kitchen",
  "surfaces": [
    {"type": "counter", "area_fraction": 0.35},
    {"type": "floor", "area_fraction": 0.25}
  ],
  "room_dimensions_estimate": {"width_m": 3.2, "depth_m": 4.1}
}
```
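
Assuming `area_fraction` values are shares of the visible frame (the schema does not state this explicitly), a quick sanity check is that they sum to at most 1:

```python
def check_surface_fractions(layout):
    """Flag layouts whose surface area fractions sum to more than 1."""
    total = sum(s["area_fraction"] for s in layout["surfaces"])
    return total <= 1.0

layout = {"environment": "kitchen",
          "surfaces": [{"type": "counter", "area_fraction": 0.35},
                       {"type": "floor", "area_fraction": 0.25}]}
print(check_surface_fractions(layout))  # True
```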

**Interaction Classification**

Action-object pair labels per temporal segment.

```json
{
  "interactions": [
    {
      "action": "pick_up",
      "object": "mug",
      "grasp_type": "power",
      "outcome": "success",
      "segment_ref": [121, 340]
    }
  ]
}
```
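
The `segment_ref` frame range appears to point back at a temporal segmentation entry, so the two layers can be joined by matching frame boundaries. A minimal sketch under that assumption:

```python
def resolve_segment(interaction, segments):
    """Return the temporal segment whose frame range matches segment_ref."""
    start, end = interaction["segment_ref"]
    for seg in segments:
        if seg["start_frame"] == start and seg["end_frame"] == end:
            return seg
    return None

interaction = {"action": "pick_up", "object": "mug", "grasp_type": "power",
               "outcome": "success", "segment_ref": [121, 340]}
segments = [
    {"start_frame": 0, "end_frame": 120, "label": "approach"},
    {"start_frame": 121, "end_frame": 340, "label": "grasp"},
]
print(resolve_segment(interaction, segments)["label"])  # grasp
```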

### Layer Availability by Campaign

| Layer                      | EgoGrasp | EgoScene | EgoNav | EgoDaily | EgoSocial |
| -------------------------- | -------- | -------- | ------ | -------- | --------- |
| Temporal segmentation      | Yes      | —        | Yes    | Yes      | TBD       |
| Object bounding            | Yes      | Yes      | —      | Yes      | TBD       |
| Hand pose                  | Yes      | —        | —      | Yes      | TBD       |
| Gaze direction             | Yes      | —        | Yes    | Yes      | TBD       |
| Spatial layout             | —        | Yes      | Yes    | —        | TBD       |
| Interaction classification | Yes      | —        | —      | Yes      | TBD       |

### Metadata Schema (Parquet)

The `metadata.parquet` file contains one row per clip with the following columns:

| Column              | Type          | Description                                             |
| ------------------- | ------------- | ------------------------------------------------------- |
| `clip_id`           | string        | Unique clip identifier                                  |
| `campaign`          | string        | Campaign name                                           |
| `duration_sec`      | float         | Clip duration in seconds                                |
| `fps`               | int           | Frame rate                                              |
| `resolution`        | string        | Video resolution (e.g. "1920x1080")                     |
| `environment`       | string        | Environment category (kitchen, office, warehouse, etc.) |
| `device_model`      | string        | Anonymized device model identifier                      |
| `quality_score`     | float         | Overall quality score (0–1)                             |
| `annotation_layers` | list\[string] | Available annotation layers for this clip               |
| `object_categories` | list\[string] | Object categories detected in the clip                  |
| `schema_version`    | string        | Annotation schema version                               |
| `created_at`        | timestamp     | Clip creation date                                      |

### Versioning

Datasets are versioned by release date using the format `YYYY.MM.patch` (e.g., `2026.03.1`). The `manifest.json` file in each dataset root tracks the current version and schema version.

When annotation pipelines are updated, new layers may be applied retroactively to existing clips. The schema version in each annotation file indicates which pipeline produced it. See the Changelog for release history.
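
Because the version format is `YYYY.MM.patch`, parsing it into integer tuples makes releases compare chronologically with ordinary tuple comparison. A small sketch (the second version string is hypothetical):

```python
def parse_version(version):
    """Split 'YYYY.MM.patch' into a tuple of integers for comparison."""
    year, month, patch = version.split(".")
    return int(year), int(month), int(patch)

v = parse_version("2026.03.1")  # (2026, 3, 1)
# Tuples compare element-wise, so newer releases sort after older ones:
newer = parse_version("2026.10.0") > v
```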
