Data Format

Dataset Structure

Every RoboX dataset is organized by campaign. Each campaign dataset contains clips, annotations, and metadata.

robox-egograsp/
├── manifest.json               # Dataset version, schema version, stats
├── metadata.parquet             # Per-clip metadata (environment, duration, quality score)
├── clips/
│   ├── clip_001.mp4            # Egocentric video
│   ├── clip_002.mp4
│   └── ...
├── annotations/
│   ├── clip_001.json           # All annotation layers for this clip
│   ├── clip_002.json
│   └── ...
└── README.md                   # Dataset card with license, citation, changelog

Video Format

Property

Value

Container

MP4 (H.264)

Resolution

1080p–4K (varies by device)

Frame rate

30 fps (standard), 60 fps (where supported)

Color space

sRGB

Audio

Stripped during anonymization

Typical duration

10–120 seconds per clip

All videos are anonymized before inclusion. Faces, text, license plates, and device identifiers have been removed or blurred on-device.

Annotation Schema

Each clip's annotation file contains multiple layers. Not every campaign includes all layers — see the layer availability table below.

Per-Clip Annotation Structure

{
  "clip_id": "egograsp_clip_001",
  "campaign": "egograsp",
  "schema_version": "2.1",
  "duration_sec": 24.5,
  "fps": 30,
  "total_frames": 735,
  "annotations": {
    "temporal_segmentation": [...],
    "object_bounding": [...],
    "hand_pose": [...],
    "gaze_direction": [...],
    "spatial_layout": {...},
    "interaction_classification": [...]
  },
  "quality": {
    "overall_score": 0.92,
    "lighting_score": 0.95,
    "stability_score": 0.88,
    "completeness_score": 0.93
  }
}

Annotation Layer Details

Temporal Segmentation

Action start/end boundaries with labeled phases. Each segment includes a label from the campaign's action vocabulary.

{
  "segments": [
    {
      "start_frame": 0,
      "end_frame": 120,
      "label": "approach",
      "confidence": 0.94
    },
    {
      "start_frame": 121,
      "end_frame": 340,
      "label": "grasp",
      "confidence": 0.91
    }
  ]
}

Object Bounding Boxes

Per-frame object detection with bounding box coordinates and category labels.

{
  "frame": 150,
  "objects": [
    {
      "object_id": "obj_01",
      "category": "mug",
      "bbox": [120, 340, 280, 510],
      "confidence": 0.96
    }
  ]
}

Bounding box format: [x_min, y_min, x_max, y_max] in pixel coordinates.

Hand Pose

21-keypoint hand skeleton per frame, following the MediaPipe Hand Landmarks topology.

{
  "frame": 150,
  "hands": [
    {
      "hand_id": "right",
      "keypoints": [
        {"joint": "wrist", "x": 0.45, "y": 0.62, "z": 0.01, "confidence": 0.97},
        {"joint": "thumb_tip", "x": 0.48, "y": 0.55, "z": 0.03, "confidence": 0.93}
      ],
      "grip_state": "power_grasp"
    }
  ]
}

Keypoint coordinates are normalized to [0, 1] relative to frame dimensions.

Gaze Direction

Estimated gaze vector per frame.

{
  "frame": 150,
  "gaze": {
    "x": 0.52,
    "y": 0.41,
    "confidence": 0.85
  }
}

Coordinates represent the estimated fixation point as normalized [0, 1] within the frame.

Spatial Layout

Scene structure and surface annotations per clip (not per-frame).

{
  "environment": "kitchen",
  "surfaces": [
    {"type": "counter", "area_fraction": 0.35},
    {"type": "floor", "area_fraction": 0.25}
  ],
  "room_dimensions_estimate": {"width_m": 3.2, "depth_m": 4.1}
}

Interaction Classification

Action-object pair labels per temporal segment.

{
  "interactions": [
    {
      "action": "pick_up",
      "object": "mug",
      "grasp_type": "power",
      "outcome": "success",
      "segment_ref": [121, 340]
    }
  ]
}

Layer Availability by Campaign

Layer

EgoGrasp

EgoScene

EgoNav

EgoDaily

EgoSocial

Temporal segmentation

Yes

—

Yes

TBD

Object bounding

Yes

—

Yes

TBD

Hand pose

Yes

—

Yes

TBD

Gaze direction

Yes

—

Yes

TBD

Spatial layout

—

Yes

—

TBD

Interaction classification

Yes

—

Yes

TBD

Metadata Schema (Parquet)

The metadata.parquet file contains one row per clip with the following columns:

Column

Type

Description

clip_id

string

Unique clip identifier

campaign

string

Campaign name

duration_sec

float

Clip duration in seconds

fps

int

Frame rate

resolution

string

Video resolution (e.g. "1920x1080")

environment

string

Environment category (kitchen, office, warehouse, etc.)

device_model

string

Anonymized device model identifier

quality_score

float

Overall quality score (0–1)

annotation_layers

list[string]

Available annotation layers for this clip

object_categories

list[string]

Object categories detected in the clip

schema_version

string

Annotation schema version

created_at

timestamp

Clip creation date

Versioning

Datasets are versioned by release date using the format YYYY.MM.patch (e.g., 2026.03.1). The manifest.json file in each dataset root tracks the current version and schema version.

When annotation pipelines are updated, new layers may be applied retroactively to existing clips. The schema version in each annotation file indicates which pipeline produced it. See the Changelog for release history.

PreviousEgocentric Data Samples NextHugging Face Dataset (Coming Soon)

Last updated 10 days ago

hashtagDataset Structure

hashtagVideo Format

hashtagAnnotation Schema

hashtagPer-Clip Annotation Structure

hashtagAnnotation Layer Details

hashtagLayer Availability by Campaign

hashtagMetadata Schema (Parquet)

hashtagVersioning

Dataset Structure

Video Format

Annotation Schema

Per-Clip Annotation Structure

Annotation Layer Details

Layer Availability by Campaign

Metadata Schema (Parquet)

Versioning