Database Credentialed Access

Harvard-Emory ECG Database

Zuzana Koscova, Valdery Moura Junior, Matthew Reyna, Shenda Hong, Aditya Gupta, Manohar Ghanta, Reza Sameni, Aaron Aguirre, Qiao Li, Sahar Zafar, Gari Clifford, M Brandon Westover

Published: July 28, 2025. Version: 4.0


When using this resource, please cite:
Koscova, Z., Moura Junior, V., Reyna, M., Hong, S., Gupta, A., Ghanta, M., Sameni, R., Aguirre, A., Li, Q., Zafar, S., Clifford, G., & Westover, M. B. (2025). Harvard-Emory ECG Database (version 4.0). Brain Data Science Platform. https://doi.org/10.60508/rv6h-7d10.

Additionally, please cite the original publication:

The Harvard-Emory ECG Database. Zuzana Koscova, Qiao Li, Chad Robichaux, Valdery Moura Junior, Manohar Ghanta, Aditya Gupta, Jonathan Rosand, Aaron Aguirre, Shenda Hong, David E. Albert, Joel Xue, Aarya Parekh, Reza Sameni, Matthew A. Reyna, M. Brandon Westover, Gari D. Clifford. medRxiv 2024.09.27.24314503; doi: https://doi.org/10.1101/2024.09.27.24314503

Abstract

The Harvard-Emory ECG database (HEEDB) is a large collection of 12-lead electrocardiography (ECG) recordings, prepared through a collaboration between Harvard University and Emory University investigators.

In version 1.0 of the database, these ECGs from Massachusetts General Brigham hospital sites were provided without labels or metadata, to enable pre-training of ECG analysis models.

In version 2.0, metadata is included.

In version 3.0, Emory ECGs are included, together with metadata, labels from the 12SL ECG analysis program (GE Healthcare), and ICD-9/10 codes.

In version 4.0, typos were corrected in the data description.

HEEDB is published as part of the Human Sleep Project (HSP), funded by a grant (R01HL161253) from the National Heart, Lung, and Blood Institute (NHLBI) of the NIH to Massachusetts General Hospital, Emory University, Stanford University, Kaiser Permanente, Boston Children's Hospital, and Beth Israel Deaconess Medical Center.

 


Background

The database comprises clinical ECGs captured during routine clinical care over several decades. They are intended for studying associations between cardiac abnormalities (e.g., abnormal rhythms) and sleep, sleep-related medical conditions, and health outcomes.


Methods

The dataset consists of standard 12-lead ECG recordings, each 10 seconds long, acquired at sampling rates of 250 or 500 Hz. Collection began in the 1980s and continues to the present day. Version 3 of the database includes 10,608,417 ECGs from 1,818,247 unique patients at institution I0001 and 1,061,598 ECGs from 349,548 patients at institution I0006. All recordings were obtained as part of routine clinical care.

Data preprocessing: Data was de-identified following the HIPAA Safe Harbor method.


Data Description

ECG data is stored in a format compatible with WFDB (Waveform Database) and Matlab (V4). Each ECG recording includes one waveform data file (.mat for I0001 and .dat for I0006) and one header file (.hea). The waveform data file can be read with WFDB library functions, applications, or the WFDB Toolbox, or loaded into Matlab directly. Each waveform file holds a 10 s, 12-lead ECG signal sampled at 250 or 500 Hz and encoded in 16 bits. The header file specifies the name of the associated waveform file and its attributes, including sampling rate, units, channel names, and signal gain. It contains line-oriented and field-oriented ASCII text and can be read by the WFDB library or any generic text editor.
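
Because the header is plain ASCII, its main fields can be inspected without any special tooling. The following is a minimal sketch in Python (standard library only), assuming the common WFDB header layout: a record line (name, number of signals, sampling rate, number of samples) followed by one line per signal whose third field is gain(baseline)/units and whose ninth field is the channel name. The sample header text, the record name "example", and the gain of 1000 are hypothetical and were made up for this illustration; actual HEEDB headers may carry additional or different fields. For real work, the wfdb Python package (function rdrecord) is the more complete route.

```python
import re

LEADS = ["I", "II", "III", "aVR", "aVL", "aVF", "V1", "V2", "V3", "V4", "V5", "V6"]

# Hypothetical header text, built only for this illustration.
SAMPLE_HEA = "example 12 500 5000\n" + "".join(
    f"example.dat 16 1000(0)/mV 16 0 0 0 0 {lead}\n" for lead in LEADS
)

# Gain field of a signal line: gain, optional "(baseline)", optional "/units".
GAIN_RE = re.compile(
    r"^(?P<gain>[0-9.eE+-]+)(?:\((?P<baseline>-?\d+)\))?(?:/(?P<units>\S+))?$"
)


def parse_header(text):
    """Parse the record line and signal lines of a simple WFDB header."""
    lines = [ln.strip() for ln in text.splitlines()]
    lines = [ln for ln in lines if ln and not ln.startswith("#")]  # drop comments
    record = lines[0].split()
    info = {
        "record_name": record[0],
        "n_sig": int(record[1]),
        "fs": float(record[2].split("/")[0]),  # ignore optional counter frequency
        "n_samp": int(record[3]),
        "signals": [],
    }
    for ln in lines[1 : 1 + info["n_sig"]]:
        parts = ln.split()
        match = GAIN_RE.match(parts[2])
        if match is None:
            raise ValueError(f"unrecognized gain field: {parts[2]!r}")
        info["signals"].append({
            "file_name": parts[0],
            "format": parts[1],
            "gain": float(match["gain"]),
            # Simplification: fall back to 0 when no baseline is given.
            "baseline": int(match["baseline"] or 0),
            # Assumption: units are millivolts when the field is absent.
            "units": match["units"] or "mV",
            "name": " ".join(parts[8:]),
        })
    return info


def to_physical(digital, gain, baseline):
    """Convert a raw ADC value to physical units: (digital - baseline) / gain."""
    return (digital - baseline) / gain


info = parse_header(SAMPLE_HEA)
duration_s = info["n_samp"] / info["fs"]
print(info["record_name"], info["fs"], duration_s)
print([sig["name"] for sig in info["signals"]])
```

Dividing the sample count by the sampling rate gives a quick check that a record has the expected 10 s duration before any waveform data is loaded.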

The directory structure of the HEEDB is organized as follows:

ECG/
├── I0006/
│   ├── 12SL_diagnoses/
│   │   ├── diagnoses.csv
│   │   ├── diagnoses_dictionary.csv
│   │   └── README
│   ├── ICD_codes/
│   │   ├── icd9_codes.csv
│   │   ├── icd10_codes.csv
│   │   └── README
│   ├── metadata/
│   │   ├── metadata.csv
│   │   └── README
│   └── WFDB/
│       ├── 2010/
│       ├── 2011/
│       ├── ...
│       └── 2018/
├── I0001/
│   ├── 12SL_diagnoses/
│   │   ├── diagnoses.csv
│   │   ├── diagnoses_dictionary.csv
│   │   └── README
│   ├── ICD_codes/
│   │   ├── icd9_codes.csv
│   │   ├── icd10_codes.csv
│   │   └── README
│   ├── metadata/
│   │   ├── metadata.csv
│   │   └── README
│   └── WFDB/
│       ├── S0001/
│       ├── S0002/
│       ├── S0003/
│       └── S0004/

 

Each institution (I0001 and I0006) maintains its own subfolders for diagnoses, ICD codes, metadata, and waveform files. The WFDB/ directory contains the ECG waveform data organized either by year (I0006) or by session identifier (I0001).
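
The layout above can be turned into file paths programmatically. The sketch below is illustrative only: it assumes a record is identified by its path below WFDB/ without a file extension (for example a year folder or session folder plus a record name), which the description does not state explicitly. The helper name record_files and the record name "example_record" are hypothetical.

```python
from pathlib import PurePosixPath

# Waveform file extension per institution, as stated in the data description.
WAVEFORM_EXT = {"I0001": ".mat", "I0006": ".dat"}


def record_files(root, institution, record_rel):
    """Return header and waveform paths for one record.

    record_rel is a hypothetical record path below WFDB/, without extension,
    e.g. "2010/example_record" (I0006) or "S0001/example_record" (I0001).
    """
    base = PurePosixPath(root) / institution / "WFDB" / record_rel
    # with_name is used instead of with_suffix so that record names
    # containing dots are not truncated.
    return {
        "header": base.with_name(base.name + ".hea"),
        "waveform": base.with_name(base.name + WAVEFORM_EXT[institution]),
    }


emory_like = record_files("ECG", "I0006", "2010/example_record")
session_like = record_files("ECG", "I0001", "S0001/example_record")
print(emory_like["header"], emory_like["waveform"])
print(session_like["header"], session_like["waveform"])
```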

12SL Diagnoses Description

The 12SL_diagnoses/ folder contains diagnostic outputs from 12SL (GE Healthcare) software, version 1.

File: diagnoses.csv
This file contains two columns:

  • FileName – Path to the corresponding WFDB file
  • codes – Diagnostic codes, which can be mapped to text labels using diagnoses_dictionary.csv

File: diagnoses_dictionary.csv
This file provides human-readable mappings for 12SL diagnostic codes. It contains the following columns:

  • codes – Integer codes for diagnoses
  • acronym – Abbreviated diagnosis labels
  • diagnoses – Full textual descriptions of diagnoses
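
As an illustration of the mapping between the two files, the sketch below decodes the codes column of diagnoses.csv with diagnoses_dictionary.csv. It makes a loud assumption: that one codes cell may hold several integer codes, separated by any non-digit characters. The description does not specify the cell format, so check the README in the folder first. All rows, codes, acronyms, and file names in the sample strings are placeholders, not real 12SL content.

```python
import csv
import io
import re

# Placeholder file contents; real files are read from 12SL_diagnoses/.
DICTIONARY_CSV = """codes,acronym,diagnoses
1,AAA,Example diagnosis A
2,BBB,Example diagnosis B
"""

DIAGNOSES_CSV = """FileName,codes
path/to/record_1,"1, 2"
path/to/record_2,2
path/to/record_3,99
"""


def load_dictionary(handle):
    """Map integer code -> (acronym, full text) from diagnoses_dictionary.csv."""
    return {
        int(row["codes"]): (row["acronym"], row["diagnoses"])
        for row in csv.DictReader(handle)
    }


def decode(cell, dictionary):
    """Turn one codes cell into a list of (acronym, text) pairs.

    Assumption: codes are integers separated by arbitrary non-digit characters.
    Codes absent from the dictionary are kept and flagged rather than dropped.
    """
    pairs = []
    for token in re.findall(r"\d+", cell):
        code = int(token)
        pairs.append(dictionary.get(code, ("?", f"unmapped code {code}")))
    return pairs


dictionary = load_dictionary(io.StringIO(DICTIONARY_CSV))
decoded = {
    row["FileName"]: decode(row["codes"], dictionary)
    for row in csv.DictReader(io.StringIO(DIAGNOSES_CSV))
}
for file_name, pairs in decoded.items():
    print(file_name, pairs)
```

With the real files, io.StringIO(...) would be replaced by open(path, newline="") on the corresponding CSV.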

 

ICD Codes Description

The ICD_codes/ folder contains diagnostic information extracted from Electronic Health Records (EHR) for each patient.

File: icd10_codes.csv
This file contains diagnostic codes from the 10th revision of the International Classification of Diseases (ICD-10), developed by the World Health Organization (WHO). These alphanumeric codes represent diagnoses and health conditions.

Columns:

  • BDSPPatientID – Brain Data Science Platform Patient ID
  • RECORDED_DT – Shifted date of the diagnosis
  • DIAGNOSIS_ICD10_CD – Full ICD-10 diagnosis code
  • DIAGNOSIS_ICD10_DESC – Description of the ICD-10 diagnosis code

File: icd9_codes.csv
This file contains diagnostic codes from the 9th revision of the International Classification of Diseases (ICD-9), also developed by the WHO. These codes are also sourced from the EHR system.

Columns:

  • BDSPPatientID – Brain Data Science Platform Patient ID
  • RECORDED_DT – Shifted date of the diagnosis
  • DIAGNOSIS_ICD9_CD – Full ICD-9 diagnosis code
  • DIAGNOSIS_ICD9_DESC – Description of the ICD-9 diagnosis code
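
Since both files share the same layout apart from the code and description column names, one helper can serve both. The sketch below collects the patients who have at least one diagnosis code starting with a given prefix. The sample rows, patient IDs, dates, and the date format are hypothetical; I48 is used as the example prefix because it is the ICD-10 category for atrial fibrillation and flutter.

```python
import csv
import io

# Placeholder rows; real data is read from ICD_codes/icd10_codes.csv.
ICD10_CSV = """BDSPPatientID,RECORDED_DT,DIAGNOSIS_ICD10_CD,DIAGNOSIS_ICD10_DESC
P1,2000-01-01,I48.91,Example description
P2,2000-01-02,E11.9,Example description
P3,2000-01-03,I48.0,Example description
P1,2000-02-01,E11.9,Example description
"""


def patients_with_prefix(handle, code_column, prefix):
    """Return the set of BDSPPatientID values with a code starting with prefix."""
    wanted = prefix.upper()
    return {
        row["BDSPPatientID"]
        for row in csv.DictReader(handle)
        if row[code_column].strip().upper().startswith(wanted)
    }


# For ICD-9, pass code_column="DIAGNOSIS_ICD9_CD" and an ICD-9 prefix instead.
af_patients = patients_with_prefix(io.StringIO(ICD10_CSV), "DIAGNOSIS_ICD10_CD", "I48")
print(sorted(af_patients))
```

Note that these tables are keyed by patient, not by ECG, so linking a diagnosis to a particular recording requires joining on BDSPPatientID and comparing RECORDED_DT with the ECG acquisition time from the metadata.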

 

Metadata Description

The metadata/ folder contains demographic and temporal information associated with each ECG recording, including ECG acquisition time, date of birth, date of death, and derived age-related fields.

File: metadata.csv

Columns:

  • BDSPPatientID – Patient ID 
  • FileName – Path to the WFDB file
  • FileID – Basename of the WFDB file
  • PatientRace
  • EthnicGroupDSC
  • MaritalStatusDSC
  • ReligionDSC
  • LanguageDSC
  • VeteranStatusDSC
  • SexDSC
  • PrimaryCauseOfDeathDSC
  • PrimaryCauseOfDeathUNOS
  • FirstContributoryCauseOfDeathDSC
  • FirstContributoryCauseOfDeathUNOS
  • SecondContributoryCauseOfDeathDSC
  • SecondContributoryCauseOfDeathUNOS
  • EducationLevelDSC
  • GenderIdentityDSC
  • SexAssignedAtBirthDSC
  • DateOfDeath
  • DateOfDeathMARegistryData – Massachusetts (MA) state death registry date of death
  • LastKnownVisitDate – Last time the patient had contact with the hospital system
  • ECGAcquisitionTime – Time of ECG acquisition
  • DateOfBirth
  • AgeAtAcquisition – Age at ECG acquisition
  • AgeAtDeath – Age at time of death
  • AgeAtDeathMA – Age at death according to MA state registry
  • AgeAtLastVisit – Age at the last hospital contact

For I0006, the following columns are missing from the metadata.csv file: EthnicGroupDSC, MaritalStatusDSC, ReligionDSC, LanguageDSC, VeteranStatusDSC, PrimaryCauseOfDeathDSC, PrimaryCauseOfDeathUNOS, FirstContributoryCauseOfDeathDSC, FirstContributoryCauseOfDeathUNOS, SecondContributoryCauseOfDeathDSC, SecondContributoryCauseOfDeathUNOS, EducationLevelDSC, GenderIdentityDSC, SexAssignedAtBirthDSC, and DateOfDeathMARegistryData.
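
When metadata from both institutions is combined, the columns absent for I0006 need explicit handling. A minimal sketch, assuming the two metadata.csv files otherwise share column names: rows are read with the standard csv module and the I0001-only columns are filled with None. The sample row and the added Institution key are hypothetical and serve only this illustration.

```python
import csv
import io

# Columns listed in the description as missing from the I0006 metadata.csv.
I0001_ONLY = (
    "EthnicGroupDSC", "MaritalStatusDSC", "ReligionDSC", "LanguageDSC",
    "VeteranStatusDSC", "PrimaryCauseOfDeathDSC", "PrimaryCauseOfDeathUNOS",
    "FirstContributoryCauseOfDeathDSC", "FirstContributoryCauseOfDeathUNOS",
    "SecondContributoryCauseOfDeathDSC", "SecondContributoryCauseOfDeathUNOS",
    "EducationLevelDSC", "GenderIdentityDSC", "SexAssignedAtBirthDSC",
    "DateOfDeathMARegistryData",
)

# Placeholder I0006-style content with only a few of the shared columns.
I0006_CSV = """BDSPPatientID,FileName,FileID,SexDSC,AgeAtAcquisition
P9,path/to/record_9,record_9,Example,50
"""


def read_metadata(handle, institution):
    """Read metadata rows, filling columns a site does not provide with None."""
    rows = []
    for row in csv.DictReader(handle):
        for column in I0001_ONLY:
            row.setdefault(column, None)  # keeps existing values for I0001
        row["Institution"] = institution  # hypothetical helper key
        rows.append(row)
    return rows


rows = read_metadata(io.StringIO(I0006_CSV), "I0006")
print(rows[0]["BDSPPatientID"], rows[0]["ReligionDSC"], rows[0]["Institution"])
```

Filling with None rather than an empty string keeps "not collected at this site" distinguishable from "collected but blank".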

 


Usage Notes

HEEDB is intended to support a wide range of ECG studies, in particular those exploring the relationship between ECG conditions and sleep.


 


Release Notes

  • v1.0: Initial release containing 10,608,417 ECGs from 1,818,247 subjects (I0001 site only).
  • v2.0: Added additional data files.
  • v3.0: Expanded to include two ECG institutions — I0001 (10,608,417 ECGs from 1,818,247 subjects) and I0006 (1,061,598 ECGs from 349,548 patients). Metadata, 12SL diagnostic codes, and ICD-9/10 diagnosis codes were also added. Duplicate ECGs were removed from the I0001 site, and incorrect sampling frequencies in header files were corrected.
  • v4.0: Corrected typos in the data description.

Ethics

The study protocol was approved by the Institutional Review Boards of the Massachusetts General Hospital (protocol # 2013P001024) and Beth Israel Deaconess Medical Center (protocol # 2022P000417). The requirement for written informed consent was waived because of the retrospective study design. The study also complied with the Declaration of Helsinki.


Acknowledgements

Publication of HEEDB is supported by a grant (R01HL161253) from the National Heart, Lung, and Blood Institute (NHLBI) of the NIH to Massachusetts General Hospital, Emory University, Stanford University, Kaiser Permanente, Boston Children's Hospital, and Beth Israel Deaconess Medical Center.


Conflicts of Interest

Dr. Westover is a co-founder of, scientific advisor and consultant to, and holds a personal equity interest in Beacon Biosignals. The other authors declare that they have no conflicts of interest.


Access

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
BDSP Credentialed Health Data License 1.5.0

Data Use Agreement:
BDSP Credentialed Health Data Use Agreement

Required training:
CITI Data or Specimens Only Research

Versions
  • 1.0 - Sept. 8, 2023
  • 2.0 - Nov. 6, 2024
  • 3.0 - July 28, 2025
  • 4.0 - July 28, 2025
