Database Credentialed Access
Harvard-Emory ECG Database
Zuzana Koscova, Valdery Moura Junior, Matthew Reyna, Shenda Hong, Aditya Gupta, Manohar Ghanta, Reza Sameni, Aaron Aguirre, Qiao Li, Sahar Zafar, Gari Clifford, M Brandon Westover
Published: Feb. 25, 2026. Version: 5.0
When using this resource, please cite:
Koscova, Z., Moura Junior, V., Reyna, M., Hong, S., Gupta, A., Ghanta, M., Sameni, R., Aguirre, A., Li, Q., Zafar, S., Clifford, G., & Westover, M. B. (2026). Harvard-Emory ECG Database (version 5.0). Brain Data Science Platform. https://doi.org/10.60508/rv6h-7d10.
Abstract
The Harvard-Emory ECG database (HEEDB) is a large collection of 12-lead electrocardiography (ECG) recordings, prepared through a collaboration between Harvard University and Emory University investigators.
In version 1.0 of the database, these ECGs from Massachusetts General Brigham hospital sites were provided without labels or metadata, to enable pre-training of ECG analysis models.
In version 2.0, metadata is included.
In version 3.0, Emory ECGs are included together with metadata, labels from the 12SL ECG analysis program (GE Healthcare), and ICD-9/10 codes.
In version 4.0, typos were corrected in the data description.
Version 5.0 introduces a new set of 12SL labels containing human expert overreads, along with updated ICD codes for I0006.
HEEDB is published as part of the Human Sleep Project (HSP), funded by a grant (R01HL161253) from the National Heart, Lung, and Blood Institute (NHLBI) of the NIH to Massachusetts General Hospital, Emory University, Stanford University, Kaiser Permanente, Boston Children's Hospital, and Beth Israel Deaconess Medical Center.
Background
The database comprises clinical ECGs captured during routine clinical care over several decades. The data are intended for studying associations between cardiac abnormalities (e.g. abnormal rhythms) and sleep, sleep-related medical conditions, and health outcomes.
Methods
The dataset consists of standard 12-lead ECG recordings, each 10 seconds long, acquired at sampling rates of 250 or 500 Hz. Collection began in the 1980s and continues to the present day. Version 3 of the database includes 10,608,417 ECGs from 1,818,247 unique patients at Massachusetts General Hospital (MGH) and 998,844 ECGs from 349,548 patients at Emory University Hospital (EUH). All recordings were obtained as part of routine clinical care.
Data preprocessing: Data was de-identified following the Safe Harbor method.
Data Description
ECG data is stored in WFDB (Waveform Database) and Matlab (V4) compatible format. Each ECG recording includes one waveform data file (.mat for MGH and .dat for EUH) and one header file (.hea). The waveform data file can be read with the WFDB library functions, the WFDB applications, or the WFDB Toolbox, or loaded into Matlab directly. Each waveform file holds a 10 s, 12-lead ECG signal sampled at 250 or 500 Hz and encoded in 16 bits. The header file specifies the name of the associated waveform file and its attributes, including sampling rate, units, channel names, and signal gain. It contains line-oriented and field-oriented ASCII text and can be read by the WFDB library or any generic text editor.
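Because the header is plain ASCII, its main fields can be read without any special tooling. The sketch below is a minimal, standard-library-only parser for the common single-segment WFDB header layout (a record line followed by one line per signal). It is an illustration, not a replacement for the WFDB library: it ignores optional features such as multi-segment records and comment-based metadata, and the names `parse_header`, `SignalSpec`, `Header`, and `to_physical` are hypothetical helpers introduced here.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SignalSpec:
    file_name: str    # waveform file that stores this channel
    fmt: str          # storage format field, kept verbatim
    gain: float       # ADC units per physical unit
    baseline: int     # ADC value that corresponds to 0 physical units
    units: str        # physical units, e.g. "mV"
    description: str  # channel name, e.g. "I" or "V1"


@dataclass
class Header:
    record_name: str
    n_signals: int
    fs: float
    n_samples: Optional[int]
    signals: List[SignalSpec]


# Gain field layout: gain, optional "(baseline)", optional "/units".
_GAIN_RE = re.compile(r"^([0-9.eE+-]+)(?:\((-?\d+)\))?(?:/(\S+))?$")


def parse_header(text: str) -> Header:
    """Parse the record line and signal lines of a simple WFDB header."""
    lines = [ln.strip() for ln in text.splitlines()]
    lines = [ln for ln in lines if ln and not ln.startswith("#")]

    rec = lines[0].split()
    n_signals = int(rec[1])
    fs = float(rec[2].split("/")[0])  # drop any counter-frequency suffix
    n_samples = int(rec[3]) if len(rec) > 3 else None

    signals = []
    for ln in lines[1:1 + n_signals]:
        fields = ln.split(None, 8)  # the description may contain spaces
        match = _GAIN_RE.match(fields[2])
        if match is None:
            raise ValueError(f"unrecognised gain field: {fields[2]!r}")
        adc_zero = int(fields[4]) if len(fields) > 4 else 0
        # When no baseline is given, it defaults to the ADC zero value.
        baseline = int(match.group(2)) if match.group(2) is not None else adc_zero
        signals.append(SignalSpec(
            file_name=fields[0],
            fmt=fields[1],
            gain=float(match.group(1)),
            baseline=baseline,
            units=match.group(3) or "mV",
            description=fields[8] if len(fields) > 8 else "",
        ))
    return Header(rec[0], n_signals, fs, n_samples, signals)


def to_physical(digital: int, spec: SignalSpec) -> float:
    """Convert one stored sample to physical units."""
    return (digital - spec.baseline) / spec.gain
```

Calling `parse_header(open(path).read())` on a .hea file returns the sampling rate, the sample count, and per-lead gain and units, which is enough to check whether a recording was stored at 250 or 500 Hz before loading the waveform itself.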
The directory structure of the HEEDB is organized as follows:
ECG/
├── I0001/
│   ├── 12SL_diagnoses/
│   │   ├── diagnoses_v24.csv
│   │   ├── diagnoses_acquisition.csv
│   │   ├── diagnoses_dictionary.csv
│   │   └── README
│   ├── ICD_codes/
│   │   ├── icd9_codes.csv
│   │   ├── icd10_codes.csv
│   │   └── README
│   ├── metadata/
│   │   ├── metadata.csv
│   │   └── README
│   └── WFDB/
│       ├── S0001/
│       │   ├── 1987/
│       │   │   ├── 02/
│       │   │   ├── 09/
│       │   │   └── 12/
│       │   └── ...
│       ├── S0002/
│       │   └── .../
│       ├── S0003/
│       │   └── .../
│       └── S0004/
│           └── .../
├── I0006/
│   ├── 12SL_diagnoses/
│   │   ├── diagnoses_v24.csv
│   │   ├── diagnoses_acquisition.csv
│   │   ├── diagnoses_dictionary.csv
│   │   └── README
│   ├── ICD_codes/
│   │   ├── icd9_codes.csv
│   │   ├── icd10_codes.csv
│   │   └── README
│   ├── metadata/
│   │   ├── metadata.csv
│   │   └── README
│   └── WFDB/
│       ├── 2010/
│       ├── 2011/
│       ├── ...
│       └── 2018/
Each site (I0001 for MGH and I0006 for EUH) has its own subfolders for diagnoses, ICD codes, metadata, and waveform files. The WFDB/ directory contains the ECG waveform data, organized by year for EUH and by session identifier (S0001, S0002, ...) with year subfolders beneath each session for MGH.
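Since the two sites nest their waveform folders differently, code that walks the tree is easier to maintain if it does not hard-code the folder depth. The following is a minimal sketch that assumes only that every record has a .hea header file somewhere below a site's WFDB/ directory; `iter_records` is a hypothetical helper name.

```python
from pathlib import Path
from typing import Iterator


def iter_records(wfdb_root: Path) -> Iterator[Path]:
    """Yield record paths (header path without the .hea suffix).

    Searching recursively means the same function works for a layout
    nested by session and year as well as one nested by year only.
    """
    for header_path in sorted(wfdb_root.rglob("*.hea")):
        yield header_path.with_suffix("")
```

For example, `list(iter_records(Path("ECG/I0006/WFDB")))` would list every record under that site, with each path usable as a record name by tools that expect the header and waveform files to share a basename.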
12SL Diagnoses Description
The 12SL_diagnoses/ folder contains diagnostic outputs from 12SL (GE Healthcare) software and human overreads of the diagnoses.
File: diagnoses_v24.csv
This file contains batch-reprocessed diagnostic labels generated using the latest available 12SL software (version 24). It contains the following columns:
- FileName – Path to the corresponding WFDB file
- codes – Diagnostic codes, which can be mapped to text labels using diagnoses_dictionary.csv
File: diagnoses_acquisition.csv
This file contains the original 12SL outputs from the time of ECG acquisition and the corresponding physician overreads. It contains the following columns:
- FileName – Path to the corresponding WFDB file
- codes_software – Diagnostic codes generated by the 12SL software during acquisition (can be mapped to text labels using diagnoses_dictionary.csv)
- codes_physician – Diagnostic codes corresponding to the human expert overreads of the ECG and original 12SL software labels (can be mapped to text labels using diagnoses_dictionary.csv)
File: diagnoses_dictionary.csv
This file provides human-readable mappings for 12SL diagnostic codes. It contains the following columns:
- codes – Integer codes for diagnoses
- acronym – Abbreviated diagnosis labels
- diagnoses – Full textual descriptions of diagnoses
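The code-to-label mapping described above can be sketched with the standard csv module. The column names come from the description of diagnoses_dictionary.csv, but the way several codes are packed into one cell of the codes, codes_software, or codes_physician columns is not documented here. The sketch therefore assumes, as a guess to be checked against the README, that a cell holds integers separated by any non-digit characters. `load_dictionary` and `decode` are hypothetical helper names.

```python
import csv
import re
from typing import Dict, List, TextIO


def load_dictionary(fh: TextIO) -> Dict[int, str]:
    """Map each integer code to its full textual diagnosis."""
    return {int(row["codes"]): row["diagnoses"] for row in csv.DictReader(fh)}


def decode(codes_field: str, dictionary: Dict[int, str]) -> List[str]:
    """Translate one cell of diagnostic codes into text labels.

    ASSUMPTION: the cell contains integers separated by arbitrary
    non-digit delimiters (spaces, commas, semicolons, brackets).
    """
    codes = [int(token) for token in re.findall(r"\d+", codes_field)]
    return [dictionary.get(code, f"unknown code {code}") for code in codes]
```

Returning a visible placeholder for unmapped codes, rather than dropping them silently, makes it easier to spot cells where the delimiter assumption does not hold.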
ICD Codes Description
The ICD_codes/ folder contains diagnostic information extracted from Electronic Health Records (EHR) for each patient.
File: icd10_codes.csv
This file contains diagnostic codes from the 10th revision of the International Classification of Diseases (ICD-10), developed by the World Health Organization (WHO). These alphanumeric codes represent diagnoses and health conditions.
Columns:
- BDSPPatientID – Brain Data Science Platform Patient ID
- RECORDED_DT – Shifted date of the diagnosis
- DIAGNOSIS_ICD10_CD – Full ICD-10 diagnosis code
- DIAGNOSIS_ICD10_DESC – Description of the ICD-10 diagnosis code
File: icd9_codes.csv
This file contains diagnostic codes from the 9th revision of the International Classification of Diseases (ICD-9), also developed by the WHO. These codes are also sourced from the EHR system.
Columns:
- BDSPPatientID – Brain Data Science Platform Patient ID
- RECORDED_DT – Shifted date of the diagnosis
- DIAGNOSIS_ICD9_CD – Full ICD-9 diagnosis code
- DIAGNOSIS_ICD9_DESC – Description of the ICD-9 diagnosis code
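Both ICD files are keyed by BDSPPatientID, so a common first step is to group the diagnosis rows per patient. The sketch below uses only the column names listed above; it keeps RECORDED_DT as an unparsed string because the date format is not specified in this description. `codes_by_patient` is a hypothetical helper name.

```python
import csv
from collections import defaultdict
from typing import Dict, List, TextIO, Tuple


def codes_by_patient(
    fh: TextIO, code_column: str = "DIAGNOSIS_ICD10_CD"
) -> Dict[str, List[Tuple[str, str]]]:
    """Group (shifted date, ICD code) pairs by BDSPPatientID.

    Pass code_column="DIAGNOSIS_ICD9_CD" when reading icd9_codes.csv.
    """
    grouped: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for row in csv.DictReader(fh):
        grouped[row["BDSPPatientID"]].append(
            (row["RECORDED_DT"], row[code_column])
        )
    return dict(grouped)
```

The resulting dictionary can then be looked up with the BDSPPatientID found in metadata.csv to attach a patient's diagnoses to each of their ECG recordings.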
Metadata Description
The metadata/ folder contains demographic and temporal information associated with each ECG recording, including ECG acquisition time, date of birth, date of death, and derived age-related fields.
File: metadata.csv
Columns:
- BDSPPatientID – Patient ID
- FileName – Path to the WFDB file
- FileID – Basename of the WFDB file
- PatientRace
- EthnicGroupDSC
- MaritalStatusDSC
- ReligionDSC
- LanguageDSC
- VeteranStatusDSC
- SexDSC
- PrimaryCauseOfDeathDSC
- PrimaryCauseOfDeathUNOS
- FirstContributoryCauseOfDeathDSC
- FirstContributoryCauseOfDeathUNOS
- SecondContributoryCauseOfDeathDSC
- SecondContributoryCauseOfDeathUNOS
- EducationLevelDSC
- GenderIdentityDSC
- SexAssignedAtBirthDSC
- DateOfDeath
- DateOfDeathMARegistryData – Massachusetts (MA) state death registry date of death
- LastKnownVisitDate – Last time the patient had contact with the hospital system
- ECGAcquisitionTime – Time of ECG acquisition
- DateOfBirth
- AgeAtAcquisition – Age at ECG acquisition
- AgeAtDeath – Age at time of death
- AgeAtDeathMA – Age at death according to MA state registry
- AgeAtLastVisit – Age at the last hospital contact
For the EUH site, the following columns are missing from the metadata.csv file: EthnicGroupDSC, MaritalStatusDSC, ReligionDSC, LanguageDSC, VeteranStatusDSC, PrimaryCauseOfDeathDSC, PrimaryCauseOfDeathUNOS, FirstContributoryCauseOfDeathDSC, FirstContributoryCauseOfDeathUNOS, SecondContributoryCauseOfDeathDSC, SecondContributoryCauseOfDeathUNOS, EducationLevelDSC, GenderIdentityDSC, SexAssignedAtBirthDSC, and DateOfDeathMARegistryData.
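Because the EUH file lacks these columns, code that reads metadata.csv from both sites needs to tolerate absent fields. One way to do this, shown as a minimal sketch with a hypothetical helper `read_metadata`, is to request columns by name and map anything missing or empty to None.

```python
import csv
from typing import Dict, Iterator, Optional, Sequence, TextIO


def read_metadata(
    fh: TextIO, wanted: Sequence[str]
) -> Iterator[Dict[str, Optional[str]]]:
    """Yield one dict per ECG containing only the requested columns.

    Columns that are absent from the file (as several are for EUH)
    and cells that are empty both come back as None.
    """
    for row in csv.DictReader(fh):
        yield {column: (row.get(column) or None) for column in wanted}
```

With this approach the same call, for example `read_metadata(fh, ["BDSPPatientID", "FileName", "SexDSC", "EthnicGroupDSC"])`, works on either site's file, and downstream code only has to handle None values.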
Usage Notes
HEEDB is intended to support a wide range of ECG studies, in particular those exploring the relationship between ECG conditions and sleep.
Release Notes
- v1.0: Initial release containing 10,608,417 ECGs from 1,818,247 subjects (I0001 site only).
- v2.0: Added additional data files
- v3.0: Expanded to include two ECG institutions — I0001 (10,608,417 ECGs from 1,818,247 subjects) and I0006 (1,061,598 ECGs from 349,548 patients). Metadata, 12SL diagnostic codes, and ICD-9/10 diagnosis codes were also added. Duplicate ECGs were removed from the I0001 site, and incorrect sampling frequencies in header files were corrected.
- v4.0: Corrected typos in the data description
- v5.0: Added a new set of 12SL labels containing human expert overreads, along with updated ICD codes for I0006.
Ethics
This project was conducted under IRB protocols BIDMC #2016P000058, MGH #2013P001024, and EUH #STUDY00005810, with the MGH, BIDMC, and EUH IRBs granting a waiver of consent on the basis that the study involved retrospective analysis of routinely collected clinical data and posed minimal risk to participants. Although direct identifiers were removed, the dataset contains multiple indirect identifiers and sensitive diagnostic information; therefore, the IRBs approved the publication of the dataset in a de-identified form with access restricted by a data usage agreement prohibiting attempts at re-identification. The study also complied with the Declaration of Helsinki.
Acknowledgements
Publication of HEEDB is supported by a grant (R01HL161253) from the National Heart, Lung, and Blood Institute (NHLBI) of the NIH to Massachusetts General Hospital, Emory University, Stanford University, Kaiser Permanente, Boston Children's Hospital, and Beth Israel Deaconess Medical Center.
Conflicts of Interest
Dr. Westover is a co-founder, scientific advisor, consultant to, and has personal equity interest in Beacon Biosignals. The other authors declare that they have no conflicts of interest.
Access
Access Policy:
Only credentialed users who sign the DUA can access the files.
License (for files):
BDSP Credentialed Health Data License 1.5.0
Data Use Agreement:
BDSP Credentialed Health Data Use Agreement
Required training:
CITI Data or Specimens Only Research
Versions
- 5.0 - Feb. 25, 2026
Files
To access the files, you must:
- be a credentialed user
- complete the required training: CITI Data or Specimens Only Research
- sign the data use agreement for the project