HD-EPIC: A Highly-Detailed Egocentric Video Dataset
Diversity in HD-EPIC, which is filmed over 3 days in-the-wild, resulting in many objects, activities and recipes.
Recipe steps and ingredients
For the “Carbonara” recipe, we visualise the prep and step time segments for three consecutive steps (left), along with sample frames with corresponding action narrations (top). The interleaving of different preps/steps is evident in the annotations.
Unlike typical short recipe videos, which are trimmed, edited, or sped up, HD-EPIC captures a broader range of real-world cooking activities, including:
- Fetching ingredients
- Prepping ingredients
- Weighing & adding ingredients (for full nutritional tracking)
To fully capture recipes, we introduce:
- Prep & Step Pairs: Detailed segmentation of preparation and execution.
- Temporal Ingredient Tracking: Enables monitoring of nutrition as ingredients are added.
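As an illustration of what temporal ingredient tracking enables, the sketch below accumulates nutrient totals as ingredients are added over time. This is a minimal sketch only; the data structures (timestamped additions, a per-100g nutrition lookup) and the example values are assumptions, not the dataset's release format.

```python
def nutrition_over_time(additions, nutrition_per_100g):
    """additions: time-ordered list of (timestamp_s, ingredient, grams) tuples.
    nutrition_per_100g: ingredient -> {nutrient: value per 100 g}.
    Returns the running nutrient totals of the dish after each addition."""
    totals, timeline = {}, []
    for timestamp_s, ingredient, grams in additions:
        for nutrient, per_100g in nutrition_per_100g[ingredient].items():
            totals[nutrient] = totals.get(nutrient, 0.0) + per_100g * grams / 100.0
        timeline.append((timestamp_s, dict(totals)))
    return timeline


# Hypothetical example with illustrative nutrition values.
print(nutrition_over_time(
    [(12.0, "pancetta", 80.0), (95.0, "egg", 60.0)],
    {"pancetta": {"kcal": 500.0, "protein_g": 14.0},
     "egg": {"kcal": 143.0, "protein_g": 12.6}},
))
```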
Fine-grained actions
Frequency of verb clusters (top) and noun clusters (bottom) in narrated sentences by category, shown on a logarithmic scale.
Transcriptions: We automatically transcribe all audio narrations provided by participants, then manually check and correct the transcriptions to obtain detailed action descriptions.
Action boundaries: For all narrations, we label precise start and end times. In total, we obtain segments for 59,454 actions, with a mean duration of 2.0s (±3.4s).
Parsing and clustering: We parse verbs, nouns, and hands from the open-vocabulary narrations so they can be used for closed-vocabulary tasks such as action recognition. We also extract how and why clauses from 16,004 and 11,540 narrations, respectively. For example:
Turn the salt container clockwise by pushing it with my left hand so that the lid is aligned with the container opening.
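Parsing open-vocabulary narrations like the one above can be approximated with an off-the-shelf NLP parser. The sketch below uses spaCy, which is not necessarily the tool used to build the dataset, to pull out verbs, nouns, and the how/why clauses from the example narration.

```python
import spacy  # requires: pip install spacy && python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")
narration = ("Turn the salt container clockwise by pushing it with my left hand "
             "so that the lid is aligned with the container opening.")

doc = nlp(narration)
verbs = [tok.lemma_ for tok in doc if tok.pos_ == "VERB"]
nouns = [tok.lemma_ for tok in doc if tok.pos_ == "NOUN"]

# Crude clause split on the connectives used in this narration style.
how = narration.split(" by ", 1)[1].split(" so that ", 1)[0] if " by " in narration else None
why = narration.split(" so that ", 1)[1] if " so that " in narration else None

print("verbs:", verbs)   # e.g. ['turn', 'push', ...]
print("nouns:", nouns)   # e.g. ['salt', 'container', 'hand', 'lid', ...]
print("how:", how)       # 'pushing it with my left hand'
print("why:", why)       # 'the lid is aligned with the container opening.'
```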
Sound annotations: We collect audio annotations capturing the start and end times of audio events along with a class name (e.g. “click”, “rustle”, “metal-plastic collision”, “running water”). Overall, we have 50,968 audio annotations from 44 classes.
Digital twin: Scene & Object Movements
Digital Twin: from point cloud (left), to surfaces (middle) and labelled fixtures (right). We show two moved objects (masks on top) at fixtures: cheese and pan.
We reconstruct digital copies of participants’ kitchens, manually curating:
- Surfaces
- Fixtures (e.g., cupboards, drawers)
- Storage spaces (e.g., shelves, hooks)
- Large appliances (e.g., fridge, microwave)
Unlike digital twins based on predefined replicas, our reconstructions are built from real environments using:
- Multi-video SLAM point clouds from recordings
- Manual curation in Blender
Scene
All kitchens with their fixture annotations (coloured randomly).
Each kitchen contains an average of 44.9 labeled fixtures (min: 32, max: 58):
- 11.1 counters/surfaces
- 11.8 cupboards
- 7.7 drawers
- 3.3 appliances
We also associate narrations with annotated fixtures to describe scene interactions.
Hand Mask Annotations: We annotate both hands in a subset of frames per video, ensuring coverage across various actions and different kitchen locations. Segmentation is automatic, with manual correction for a selected subset. In total, we have 7.7M hand masks.
3D Object Movement Annotations
We annotate object movements by labelling temporal segments from pick-up to placement, along with 2D bounding boxes at movement onset and end. Tracks include even slight shifts/pushes, ensuring full coverage of movements. Every object movement is annotated, providing a rich dataset for analysis.
Statistics:
- 19.9K object movement tracks
- 36.9K bounding boxes
- 9.2 objects taken per minute (on average)
- 9.0 objects placed per minute (on average)
- Average track length: 9.0s (max: 461.5s)
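For reference, statistics such as these can be recomputed from the movement tracks. A minimal sketch, assuming each track is a (start_s, end_s) pair in seconds and the total recording length is known:

```python
def movement_stats(tracks, total_minutes):
    """tracks: list of (start_s, end_s) object-movement segments, in seconds.
    Returns (tracks per minute, mean track length in s, max track length in s)."""
    durations = [end - start for start, end in tracks]
    return (len(tracks) / total_minutes,
            sum(durations) / len(durations),
            max(durations))
```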
Object Masks: We generate pixel-level segmentations from each bounding box, initialising with iterative SAM2 and manually correcting the masks. During this process, 74% of masks were manually corrected; the IoU between SAM2 and manual masks is 0.82. Finally, we lift object masks to 3D using dense depth estimates and 2D-to-3D sparse correspondences provided by MPS.
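The lifting step follows standard pinhole unprojection. The sketch below is illustrative only: the function signature is an assumption, and HD-EPIC additionally anchors the dense depth with the sparse 2D-to-3D correspondences from MPS.

```python
import numpy as np

def lift_mask_to_3d(mask, depth, K, T_world_cam):
    """Unproject the masked pixels of one frame into world coordinates.

    mask: (H, W) boolean object mask.
    depth: (H, W) metric depth in metres.
    K: (3, 3) camera intrinsics.
    T_world_cam: (4, 4) camera-to-world pose.
    Returns an (N, 3) array of world-space points."""
    v, u = np.nonzero(mask)
    z = depth[v, u]
    u, v, z = u[z > 0], v[z > 0], z[z > 0]                  # drop invalid depth
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)  # homogeneous camera coords
    return (T_world_cam @ pts_cam.T).T[:, :3]
```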
Object-Scene Interactions: Using the 3D object locations, we associate each location with its closest fixture. We manually verify these assignments and find 98% accuracy. On average, objects move between 1.8 different fixtures per video.
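A minimal sketch of the fixture association, assuming each fixture is summarised by a 3D centroid; how the dataset measures the distance to a fixture may differ (e.g. using the full fixture geometry).

```python
import numpy as np

def closest_fixture(object_xyz, fixture_centroids, fixture_names):
    """Return the name of the fixture whose centroid is nearest to a 3D object location."""
    dists = np.linalg.norm(np.asarray(fixture_centroids) - np.asarray(object_xyz), axis=1)
    return fixture_names[int(np.argmin(dists))]
```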
Priming Object Movement
Priming Object Interaction Through Gaze. Top: camera position with projected eye-gaze and object positions in 3D. Middle: 2D gaze location. Bottom: timeline for priming object movement, e.g. the glass is primed 8.3s before being taken.
We combine eye-gaze and 3D object locations to detect when an object is primed. Pick-up priming: gaze attends to an object before it is picked up. Put-down priming: gaze attends to the future location before placement. We exclude objects taken or placed off-screen. 94.8% of feasible objects are primed 4.0s before pick-up, and 88.5% are primed 2.6s before placement.
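To illustrate the idea (this is not the dataset's exact criterion), the sketch below reports how long before pick-up the 2D gaze point first falls near the object's projected image position; the pixel radius is a hypothetical threshold.

```python
import numpy as np

def priming_lead_time(gaze_2d, obj_2d, pickup_frame, fps=30.0, radius_px=50.0):
    """gaze_2d, obj_2d: (T, 2) arrays of per-frame image coordinates.
    Returns seconds between the first near-object fixation and pick-up,
    or None if the object is never primed before pickup_frame."""
    for frame in range(pickup_frame):
        if np.linalg.norm(gaze_2d[frame] - obj_2d[frame]) < radius_px:
            return (pickup_frame - frame) / fps
    return None
```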
Long Term Object Tracking: We connect object movements to form longer trajectories, i.e. object itineraries, that capture the sequence of an object’s movements. Our efficient pipeline utilises our lifted 3D locations and allows a 1-hour video to be annotated in minutes.
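A minimal sketch of chaining individual movements into per-object itineraries; the record fields are hypothetical stand-ins, not the released annotation format.

```python
from collections import defaultdict

def build_itineraries(movements):
    """movements: iterable of dicts with hypothetical keys
    'object_id', 't_pick_s', 'from_fixture', 'to_fixture'.
    Returns object_id -> time-ordered list of (from_fixture, to_fixture, t_pick_s)."""
    per_object = defaultdict(list)
    for move in movements:
        per_object[move["object_id"]].append(move)
    itineraries = {}
    for object_id, moves in per_object.items():
        moves.sort(key=lambda m: m["t_pick_s"])
        itineraries[object_id] = [(m["from_fixture"], m["to_fixture"], m["t_pick_s"])
                                  for m in moves]
    return itineraries
```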
Annotation Density in HD-EPIC
Annotation Type | Total annotations | Annotations/min
---|---|---
Narrations | 59,454 | 24.0
Parsing (Verbs + Nouns + Hands + How + Why) | 303,968 | 122.7
Recipes (Preps + Steps) | 4,052 | 1.6
Sound | 50,968 | 20.6
Action boundaries | 59,454 | 24.0
Object Motion (Pick up + Put down + Fixtures + Bboxes + Masks) | 153,480 | 62.0
Object Itinerary | 4,881 | 2.0
Object Priming (Starts + Ends) | 18,264 | 7.4
Total | | 263.2
VQA Benchmark
Benchmark creation
We construct a VQA benchmark leveraging the dense annotations in our dataset, covering 7 key annotation types:
VQA Question Types
- Recipe – Identify, retrieve, and localize recipes and steps.
- Ingredient – Track ingredient usage, weight, timing, and order.
- Nutrition – Analyze ingredient nutrition and its evolution in recipes.
- Fine-Grained Action – Understand the what, how, and why of actions.
- 3D Perception – Reason about object positions in 3D space.
- Object Motion – Track object movements across long video sequences.
- Gaze – Estimate fixation points and anticipate future interactions.
Benchmark Structure
- 5-way multiple choice for each question type
- 30 question prototypes, generating 26,650 multiple-choice questions
- Hard negatives sampled from the dataset for increased difficulty
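A minimal sketch of how a 5-way question with hard negatives could be assembled. This is illustrative only, not the benchmark's generation code; the candidate pool is assumed to contain answers of the same annotation type as the correct one.

```python
import random

def make_mcq(question, correct_answer, candidate_negatives, num_negatives=4, seed=0):
    """Build one 5-way multiple-choice question: the correct answer plus
    hard negatives sampled from other annotations of the same type."""
    rng = random.Random(seed)
    pool = [c for c in candidate_negatives if c != correct_answer]
    options = rng.sample(pool, num_negatives) + [correct_answer]
    rng.shuffle(options)
    return {"question": question,
            "options": options,
            "answer_index": options.index(correct_answer)}
```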
Scalability & Impact
- One of the largest VQA video benchmarks, yet tractable for closed-source VLMs
- Estimated upper bound of 100,000 unique questions due to annotation density
VQA Question Prototypes. We show our 30 question prototypes by category alongside the number of questions. Outer bars indicate the distribution over input lengths for each question.
VLM models
We use 4 representative models as baselines:
- Llama 3.2 90B. We use this as a strong open-source text-only baseline, as LLMs can perform well on visual QA benchmarks without any visual input.
- VideoLlama 2 7B. A strong open-source short-context model.
- LongVA. The longest-context open-source model.
- Gemini Pro. Closed-source, with the longest context of any model, and state-of-the-art on long video.
VQA Results per Question Prototype. Our benchmark contains many challenging questions for current models.
VQA Qualitative Results. We mark ground truth answers with a green background, and predictions from different models, i.e., LLaMA 3.2, VideoLLaMA 2, LongVA, Gemini Pro, with coloured dots alongside the corresponding prediction.
Download
Early Access Data available here: OneDrive
Download Annotations: GitHub
Paper: arXiv
Coming:
- VRS files (soon)
- Object Associations
- Object Masks
Copyright
All datasets and benchmarks on this page are copyright by us and published under the Creative Commons Attribution-NonCommercial 4.0 International License. This means that you must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use. You may not use the material for commercial purposes.
For commercial licenses of EPIC-KITCHENS and any of its annotations, email us at uob-epic-kitchens@bristol.ac.uk.
Disclaimer
The HD-EPIC dataset was collected as a tool for research in computer vision. The dataset may have unintended biases (including those of a societal, gender or racial nature).
BibTeX
Cite our paper if you find the HD-EPIC dataset useful for your research:
@article{perrett2025hdepic,
author = {Perrett, Toby and Darkhalil, Ahmad and Sinha, Saptarshi and Emara, Omar and Pollard, Sam and Parida, Kranti and Liu, Kaiting and Gatti, Prajwal and Bansal, Siddhant and Flanagan, Kevin and Chalk, Jacob and Zhu, Zhifan and Guerrier, Rhodri and Abdelazim, Fahd and Zhu, Bin and Moltisanti, Davide and Wray, Michael and Doughty, Hazel and Damen, Dima},
title = {HD-EPIC: A Highly-Detailed Egocentric Video Dataset},
journal = {arXiv preprint},
volume = {arXiv:2502.04144},
year = {2025},
month = {February},
}
Team
- Toby Perrett, University of Bristol
- Ahmad Darkhalil, University of Bristol
- Saptarshi Sinha, University of Bristol
- Omar Emara, University of Bristol
- Sam Pollard, University of Bristol
- Kranti Kumar Parida, University of Bristol
- Kaiting Liu, Leiden University
- Prajwal Gatti, University of Bristol
- Siddhant Bansal, University of Bristol
- Kevin Flanagan, University of Bristol
- Jacob Chalk, University of Bristol
- Zhifan Zhu, University of Bristol
- Rhodri Guerrier, University of Bristol
- Fahd Abdelazim, University of Bristol
- Bin Zhu, Singapore Management University
- Davide Moltisanti, University of Bath
- Michael Wray, University of Bristol
- Hazel Doughty, Leiden University
- Dima Damen, University of Bristol
Acknowledgements
The project was supported by a charitable donation from Meta (Aria Project Partnership) to the University of Bristol. Gemini Pro results are supported by a research credits grant from Google DeepMind.
Research at Bristol is supported by EPSRC Fellowship UMPIRE (EP/T004991/1), EPSRC Program Grant Visual AI (EP/T028572/1) and EPSRC Doctoral Training Program.
Research at Leiden is supported by the Dutch Research Council (NWO) under a Veni grant (VI.Veni.222.160).
Research at Singapore is supported by the Singapore Ministry of Education (MOE) Academic Research Fund (AcRF) Tier 1 grant (No. MSS23C018).
We thank Rajan from Elancer and his team for their huge assistance with temporal and audio annotation. We thank Srdjan Delic and his team for their assistance with mask annotations and object associations. We also thank Owen Tyley for building the 3D digital twins of the kitchen environments in Blender. We thank David Fouhey and Evangelos Kazakos for early feedback on the project. We thank Pierre Moulon, Vijay Baiyya and Cheng Peng from the Aria team for technical assistance in using the MPS code and services.
We acknowledge the usage of Isambard-AI Phase 1 and EPSRC Tier-2 Jade clusters.