HD-EPIC: A Highly-Detailed Egocentric Video Dataset
Diversity in HD-EPIC, which is filmed over 3 days in-the-wild, resulting in many objects, activities and recipes.
Recipe steps and ingredients
For the “Carbonara” recipe, we visualise the prep and step time segments for three consecutive steps (left), along with sample frames with corresponding action narrations (top). The interleaving of different preps/steps is evident in the annotations.
Unlike typical short recipe videos, which are trimmed, edited, or sped up, HD-EPIC captures a broader range of real-world cooking activities, including:
- Fetching ingredients
- Prepping ingredients
- Weighing & adding ingredients (for full nutritional tracking)
To fully capture recipes, we introduce:
- Prep & Step Pairs: Detailed segmentation of preparation and execution.
- Temporal Ingredient Tracking: Enables monitoring of nutrition as ingredients are added.
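As an illustration of what temporal ingredient tracking enables, the sketch below accumulates nutrient totals as ingredients are added over time. This is a minimal sketch only; the data structures (timestamped additions, a per-100g nutrition lookup) and the example values are assumptions, not the dataset's release format.

```python
def nutrition_over_time(additions, nutrition_per_100g):
    """additions: time-ordered list of (timestamp_s, ingredient, grams) tuples.
    nutrition_per_100g: ingredient -> {nutrient: value per 100 g}.
    Returns the running nutrient totals of the dish after each addition."""
    totals, timeline = {}, []
    for timestamp_s, ingredient, grams in additions:
        for nutrient, per_100g in nutrition_per_100g[ingredient].items():
            totals[nutrient] = totals.get(nutrient, 0.0) + per_100g * grams / 100.0
        timeline.append((timestamp_s, dict(totals)))
    return timeline


# Hypothetical example with illustrative nutrition values.
print(nutrition_over_time(
    [(12.0, "pancetta", 80.0), (95.0, "egg", 60.0)],
    {"pancetta": {"kcal": 500.0, "protein_g": 14.0},
     "egg": {"kcal": 143.0, "protein_g": 12.6}},
))
```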
Fine-grained actions
Frequency of verb clusters (top) and noun clusters (bottom) in narrated sentences by category, shown on a logarithmic scale.
Transcriptions: We automatically transcribe all audio narrations provided by participants, then manually check and correct the transcriptions to obtain detailed action descriptions.
Action boundaries: For all narrations, we label precise start and end times. In total, we obtain segments for 59,454 actions, with a mean duration of 2.0s (±3.4s).
Parsing and clustering: We parse verbs, nouns, and hands from the open-vocabulary narrations so they can be used for closed-vocabulary tasks such as action recognition. We also extract how and why clauses from 16,004 and 11,540 narrations, respectively. For example:
Turn the salt container clockwise by pushing it with my left hand so that the lid is aligned with the container opening.
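Parsing open-vocabulary narrations like the one above can be approximated with an off-the-shelf NLP parser. The sketch below uses spaCy, which is not necessarily the tool used to build the dataset, to pull out verbs, nouns, and the how/why clauses from the example narration.

```python
import spacy  # requires: pip install spacy && python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")
narration = ("Turn the salt container clockwise by pushing it with my left hand "
             "so that the lid is aligned with the container opening.")

doc = nlp(narration)
verbs = [tok.lemma_ for tok in doc if tok.pos_ == "VERB"]
nouns = [tok.lemma_ for tok in doc if tok.pos_ == "NOUN"]

# Crude clause split on the connectives used in this narration style.
how = narration.split(" by ", 1)[1].split(" so that ", 1)[0] if " by " in narration else None
why = narration.split(" so that ", 1)[1] if " so that " in narration else None

print("verbs:", verbs)   # e.g. ['turn', 'push', ...]
print("nouns:", nouns)   # e.g. ['salt', 'container', 'hand', 'lid', ...]
print("how:", how)       # 'pushing it with my left hand'
print("why:", why)       # 'the lid is aligned with the container opening.'
```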
Sound annotations: We collect audio annotations capturing the start and end times of audio events along with a class name (e.g. “click”, “rustle”, “metal-plastic collision”, “running water”). Overall, we have 50,968 audio annotations from 44 classes.
Digital twin: Scene & Object Movements
Digital Twin: from point cloud (left), to surfaces (middle) and labelled fixtures (right). We show two moved objects (masks on top) at fixtures: cheese and pan.
We reconstruct digital copies of participants’ kitchens, manually curating:
- Surfaces
- Fixtures (e.g., cupboards, drawers)
- Storage spaces (e.g., shelves, hooks)
- Large appliances (e.g., fridge, microwave)
Unlike digital twins based on predefined replicas, our reconstructions are built from real environments using:
- Multi-video SLAM point clouds from recordings
- Manual curation in Blender
Scene
All kitchens with their fixture annotations (coloured randomly).
Each kitchen contains an average of 44.9 labeled fixtures (min: 32, max: 58):
- 11.1 counters/surfaces
- 11.8 cupboards
- 7.7 drawers
- 3.3 appliances
We also associate narrations with annotated fixtures to describe scene interactions.
Hand Mask Annotations: We annotate both hands in a subset of frames per video, ensuring coverage across various actions and different kitchen locations. Segmentation is automatic, with manual correction for a selected subset. In total, we have 7.7M hand masks.
3D Object Movement Annotations
We annotate object movements by labelling temporal segments from pick-up to placement, along with 2D bounding boxes at movement onset and end. Tracks include even slight shifts/pushes, ensuring full coverage of movements. Every object movement is annotated, providing a rich dataset for analysis.
Statistics:
- 19.9K object movement tracks
- 36.9K bounding boxes
- 9.2 objects taken per minute (on average)
- 9.0 objects placed per minute (on average)
- Average track length: 9.0s (max: 461.5s)
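For reference, statistics such as these can be recomputed from the movement tracks. A minimal sketch, assuming each track is a (start_s, end_s) pair in seconds and the total recording length is known:

```python
def movement_stats(tracks, total_minutes):
    """tracks: list of (start_s, end_s) object-movement segments, in seconds.
    Returns (tracks per minute, mean track length in s, max track length in s)."""
    durations = [end - start for start, end in tracks]
    return (len(tracks) / total_minutes,
            sum(durations) / len(durations),
            max(durations))
```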
Object Masks: We generate pixel-level segmentations from each bounding box, initialising with iterative SAM2 and manually correcting the masks. During this process, 74% of masks were manually corrected; the IoU between SAM2 and manual masks is 0.82. Finally, we lift object masks to 3D using dense depth estimates and 2D-to-3D sparse correspondences provided by MPS.
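The lifting step follows standard pinhole unprojection. The sketch below is illustrative only: the function signature is an assumption, and HD-EPIC additionally anchors the dense depth with the sparse 2D-to-3D correspondences from MPS.

```python
import numpy as np

def lift_mask_to_3d(mask, depth, K, T_world_cam):
    """Unproject the masked pixels of one frame into world coordinates.

    mask: (H, W) boolean object mask.
    depth: (H, W) metric depth in metres.
    K: (3, 3) camera intrinsics.
    T_world_cam: (4, 4) camera-to-world pose.
    Returns an (N, 3) array of world-space points."""
    v, u = np.nonzero(mask)
    z = depth[v, u]
    u, v, z = u[z > 0], v[z > 0], z[z > 0]                  # drop invalid depth
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)  # homogeneous camera coords
    return (T_world_cam @ pts_cam.T).T[:, :3]
```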
Object-Scene Interactions: Using the 3D object locations, we associate each location with its closest fixture. We manually verify these assignments and find 98% accuracy. On average, objects move between 1.8 different fixtures per video.
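A minimal sketch of the fixture association, assuming each fixture is summarised by a 3D centroid; how the dataset measures the distance to a fixture may differ (e.g. using the full fixture geometry).

```python
import numpy as np

def closest_fixture(object_xyz, fixture_centroids, fixture_names):
    """Return the name of the fixture whose centroid is nearest to a 3D object location."""
    dists = np.linalg.norm(np.asarray(fixture_centroids) - np.asarray(object_xyz), axis=1)
    return fixture_names[int(np.argmin(dists))]
```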
Priming Object Movement
Priming Object Interaction Through Gaze. Top: camera position with projected eye-gaze and object positions in 3D. Middle: 2D gaze location. Bottom: timeline for priming object movement, e.g. the glass is primed 8.3s before being taken.
We combine eye-gaze and 3D object locations to detect when an object is primed. Pick-up priming: gaze attends to an object before it is picked up. Put-down priming: gaze attends to the future location before placement. We exclude objects taken or placed off-screen. 94.8% of feasible objects are primed 4.0s before pick-up, and 88.5% are primed 2.6s before placement.
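To illustrate the idea (this is not the dataset's exact criterion), the sketch below reports how long before pick-up the 2D gaze point first falls near the object's projected image position; the pixel radius is a hypothetical threshold.

```python
import numpy as np

def priming_lead_time(gaze_2d, obj_2d, pickup_frame, fps=30.0, radius_px=50.0):
    """gaze_2d, obj_2d: (T, 2) arrays of per-frame image coordinates.
    Returns seconds between the first near-object fixation and pick-up,
    or None if the object is never primed before pickup_frame."""
    for frame in range(pickup_frame):
        if np.linalg.norm(gaze_2d[frame] - obj_2d[frame]) < radius_px:
            return (pickup_frame - frame) / fps
    return None
```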
Long Term Object Tracking: We connect object movements to form longer trajectories, i.e. object itineraries, that capture the sequence of an object’s movements. Our efficient pipeline utilises our lifted 3D locations and allows a 1-hour video to be annotated in minutes.
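A minimal sketch of chaining individual movements into per-object itineraries; the record fields are hypothetical stand-ins, not the released annotation format.

```python
from collections import defaultdict

def build_itineraries(movements):
    """movements: iterable of dicts with hypothetical keys
    'object_id', 't_pick_s', 'from_fixture', 'to_fixture'.
    Returns object_id -> time-ordered list of (from_fixture, to_fixture, t_pick_s)."""
    per_object = defaultdict(list)
    for move in movements:
        per_object[move["object_id"]].append(move)
    itineraries = {}
    for object_id, moves in per_object.items():
        moves.sort(key=lambda m: m["t_pick_s"])
        itineraries[object_id] = [(m["from_fixture"], m["to_fixture"], m["t_pick_s"])
                                  for m in moves]
    return itineraries
```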
Annotation Density in HD-EPIC
Annotation Type | Total annotations | Annotations/min
---|---|---
Narrations | 59,454 | 24.0
Parsing (Verbs + Nouns + Hands + How + Why) | 303,968 | 122.7
Recipes (Preps + Steps) | 4,052 | 1.6
Sound | 50,968 | 20.6
Action boundaries | 59,454 | 24.0
Object Motion (Pick up + Put down + Fixtures + Bboxes + Masks) | 153,480 | 62.0
Object Itinerary | 4,881 | 2.0
Object Priming (Starts + Ends) | 18,264 | 7.4
Total | | 263.2
VQA Benchmark
Benchmark creation
We construct a VQA benchmark leveraging the dense annotations in our dataset, covering 7 key annotation types:
VQA Question Types
- Recipe – Identify, retrieve, and localize recipes and steps.
- Ingredient – Track ingredient usage, weight, timing, and order.
- Nutrition – Analyze ingredient nutrition and its evolution in recipes.
- Fine-Grained Action – Understand the what, how, and why of actions.
- 3D Perception – Reason about object positions in 3D space.
- Object Motion – Track object movements across long video sequences.
- Gaze – Estimate fixation points and anticipate future interactions.
Benchmark Structure
- 5-way multiple choice for each question type
- 30 question prototypes, generating 26,650 multiple-choice questions
- Hard negatives sampled from the dataset for increased difficulty
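A minimal sketch of how a 5-way question with hard negatives could be assembled. This is illustrative only, not the benchmark's generation code; the candidate pool is assumed to contain answers of the same annotation type as the correct one.

```python
import random

def make_mcq(question, correct_answer, candidate_negatives, num_negatives=4, seed=0):
    """Build one 5-way multiple-choice question: the correct answer plus
    hard negatives sampled from other annotations of the same type."""
    rng = random.Random(seed)
    pool = [c for c in candidate_negatives if c != correct_answer]
    options = rng.sample(pool, num_negatives) + [correct_answer]
    rng.shuffle(options)
    return {"question": question,
            "options": options,
            "answer_index": options.index(correct_answer)}
```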
Scalability & Impact
- One of the largest VQA video benchmarks, yet tractable for closed-source VLMs
- Estimated upper bound of 100,000 unique questions due to annotation density
VQA Question Prototypes. We show our 30 question prototypes by category alongside the number of questions. Outer bars indicate the distribution over input lengths for each question.
VLM models
We use 4 representative models as baselines:
- Llama 3.2 90B. We use this as a strong open-source text-only baseline, as LLMs can perform well on visual QA benchmarks without any visual input.
- VideoLlama 2 7B. A strong open-source short-context model.
- LongVA. The longest-context open-source model.
- Gemini Pro. Closed-source, with the longest context of any model, and state-of-the-art on long video.
VQA Results per Question Prototype. Our benchmark contains many challenging questions for current models.
VQA Qualitative Results. We mark ground truth answers with a green background, and predictions from different models, i.e., LLaMA 3.2, VideoLLaMA 2, LongVA, Gemini Pro, with coloured dots alongside the corresponding prediction.
Download
Early Access Data available here: OneDrive
Download Annotations: GitHub
Paper: arXiv
Coming:
- VRS files (soon)
- Object Associations
- Object Masks
Copyright
All datasets and benchmarks on this page are copyright by us and published under the Creative Commons Attribution-NonCommercial 4.0 International License. This means that you must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use. You may not use the material for commercial purposes.
For commercial licenses of EPIC-KITCHENS and any of its annotations, email us at uob-epic-kitchens@bristol.ac.uk.
Disclaimer
The HD-EPIC dataset was collected as a tool for research in computer vision. The dataset may have unintended biases (including those of a societal, gender or racial nature).
BibTeX
Cite our paper if you find the HD-EPIC dataset useful for your research:
@article{perrett2025hdepic,
author = {Perrett, Toby and Darkhalil, Ahmad and Sinha, Saptarshi and Emara, Omar and Pollard, Sam and Parida, Kranti and Liu, Kaiting and Gatti, Prajwal and Bansal, Siddhant and Flanagan, Kevin and Chalk, Jacob and Zhu, Zhifan and Guerrier, Rhodri and Abdelazim, Fahd and Zhu, Bin and Moltisanti, Davide and Wray, Michael and Doughty, Hazel and Damen, Dima},
title = {HD-EPIC: A Highly-Detailed Egocentric Video Dataset},
journal = {arXiv preprint},
volume = {arXiv:2502.04144},
year = {2025},
month = {February},
}
Team
- Toby Perrett, University of Bristol
- Ahmad Darkhalil, University of Bristol
- Saptarshi Sinha, University of Bristol
- Omar Emara, University of Bristol
- Sam Pollard, University of Bristol
- Kranti Kumar Parida, University of Bristol
- Kaiting Liu, Leiden University
- Prajwal Gatti, University of Bristol
- Siddhant Bansal, University of Bristol
- Kevin Flanagan, University of Bristol
- Jacob Chalk, University of Bristol
- Zhifan Zhu, University of Bristol
- Rhodri Guerrier, University of Bristol
- Fahd Abdelazim, University of Bristol
- Bin Zhu, Singapore Management University
- Davide Moltisanti, University of Bath
- Michael Wray, University of Bristol
- Hazel Doughty, Leiden University
- Dima Damen, University of Bristol
Acknowledgements
The project was supported by a charitable donation from Meta (Aria Project Partnership) to the University of Bristol. Gemini Pro results are supported by a research credits grant from Google DeepMind.
Research at Bristol is supported by EPSRC Fellowship UMPIRE (EP/T004991/1), EPSRC Program Grant Visual AI (EP/T028572/1) and EPSRC Doctoral Training Program.
Research at Leiden is supported by the Dutch Research Council (NWO) under a Veni grant (VI.Veni.222.160).
Research at Singapore is supported by the Singapore Ministry of Education (MOE) Academic Research Fund (AcRF) Tier 1 grant (No. MSS23C018).
We thank Rajan from Elancer and his team for their huge assistance with temporal and audio annotation. We thank Srdjan Delic and his team for their assistance with mask annotations and object associations. We also thank Owen Tyley for building the 3D digital twins of the kitchen environments in Blender. We thank David Fouhey and Evangelos Kazakos for early feedback on the project. We thank Pierre Moulon, Vijay Baiyya and Cheng Peng from the Aria team for technical assistance in using the MPS code and services.
We acknowledge the usage of Isambard-AI Phase 1 and EPSRC Tier-2 Jade clusters.