Database Credentialed Access
Harvard-Emory ECG Database
Zuzana Koscova, Valdery Moura Junior, Matthew Reyna, Shenda Hong, Aditya Gupta, Manohar Ghanta, Reza Sameni, Aaron Aguirre, Qiao Li, Sahar Zafar, Gari Clifford, M Brandon Westover
Published: July 28, 2025. Version: 4.0
When using this resource, please cite:
Koscova, Z., Moura Junior, V., Reyna, M., Hong, S., Gupta, A., Ghanta, M., Sameni, R., Aguirre, A., Li, Q., Zafar, S., Clifford, G., & Westover, M. B. (2025). Harvard-Emory ECG Database (version 4.0). Brain Data Science Platform. https://doi.org/10.60508/rv6h-7d10.
Abstract
The Harvard-Emory ECG database (HEEDB) is a large collection of 12-lead electrocardiography (ECG) recordings, prepared through a collaboration between Harvard University and Emory University investigators.
In version 1.0 of the database, these ECGs from Massachusetts General Brigham hospital sites were provided without labels or metadata, to enable pre-training of ECG analysis models.
In version 2.0, metadata is included.
In version 3.0, Emory ECGs are included together with metadata, labels from the 12SL ECG analysis program (GE Healthcare), and ICD-9/10 codes.
In version 4.0, typos were corrected in the data description.
HEEDB is published as part of the Human Sleep Project (HSP), funded by a grant (R01HL161253) from the National Heart, Lung, and Blood Institute (NHLBI) of the NIH to Massachusetts General Hospital, Emory University, Stanford University, Kaiser Permanente, Boston Children's Hospital, and Beth Israel Deaconess Medical Center.
Background
The database comprises clinical ECGs captured during routine clinical care over several decades. They are intended for studying associations between cardiac abnormalities (e.g. abnormal rhythms) and sleep, sleep-related medical conditions, and health outcomes.
Methods
The dataset consists of standard 12-lead ECG recordings, each 10 seconds long, acquired at sampling rates of 250 or 500 Hz. Collection began in the 1980s and continues to the present day. Version 3 of the database includes 10,608,417 ECGs from 1,818,247 unique patients at institution I0001 and 1,061,598 ECGs from 349,548 patients at institution I0006. All recordings were obtained as part of routine clinical care.
Data preprocessing: Data was de-identified following the Safe Harbor method.
Data Description
ECG data is stored in WFDB (Waveform Database) and Matlab (V4) compatible formats. Each ECG recording includes one waveform data file (.mat for I0001 and .dat for I0006) and one header file (.hea). The waveform data file can be read with the WFDB library functions, applications, or Toolbox, or loaded directly into Matlab. Each waveform file holds a 10 s, 12-lead ECG signal sampled at 250 or 500 Hz and encoded in 16 bits. The header file specifies the name of the associated waveform file and its attributes, including sampling rate, units, channel names, and signal gain. It contains line-oriented and field-oriented ASCII text and can be read by the WFDB library or any text editor.
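Because the header is plain, field-oriented ASCII, its main attributes can be pulled out with a few lines of Python. The sketch below is illustrative only: the sample header is made up and shortened to two leads, and the parser assumes the common WFDB layout (a record line with name, number of signals, sampling frequency, and number of samples, followed by one line per signal). It has not been checked against actual HEEDB headers.

```python
import re

# Hypothetical, shortened header for illustration (two leads instead of twelve).
SAMPLE_HEADER = """\
example_record 2 500 5000
example_record.dat 16 200/mV 16 0 0 0 0 I
example_record.dat 16 200/mV 16 0 0 0 0 II
"""

def parse_header(text):
    """Parse the record line and signal lines of a WFDB-style header."""
    # Drop blank lines and '#' comment lines.
    lines = [ln.strip() for ln in text.splitlines()
             if ln.strip() and not ln.lstrip().startswith("#")]
    # Assumption: the record line carries at least these four fields.
    name, n_sig, fs, n_samp = lines[0].split()[:4]
    record = {"name": name, "n_sig": int(n_sig), "fs": float(fs),
              "n_samp": int(n_samp), "signals": []}
    for ln in lines[1:1 + record["n_sig"]]:
        fields = ln.split(maxsplit=8)
        # Gain field looks like "200/mV" or "200(0)/mV":
        # gain, optional baseline in parentheses, optional units.
        m = re.match(r"([\d.eE+-]+)(?:\((-?\d+)\))?(?:/(\S+))?", fields[2])
        record["signals"].append({
            "file": fields[0],
            "format": fields[1],
            "gain": float(m.group(1)),
            "units": m.group(3) or "mV",  # WFDB default when units are omitted
            "lead": fields[8] if len(fields) > 8 else "",
        })
    return record

hdr = parse_header(SAMPLE_HEADER)
duration_s = hdr["n_samp"] / hdr["fs"]          # recording length in seconds
leads = [s["lead"] for s in hdr["signals"]]     # channel names
```

For real analyses, the `wfdb` Python package (for example `wfdb.rdrecord`) is likely the more robust choice, since it reads the header and the waveform file together and applies the gain for you.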
The directory structure of the HEEDB is organized as follows:
ECG/
├── I0006/
│   ├── 12SL_diagnoses/
│   │   ├── diagnoses.csv
│   │   ├── diagnoses_dictionary.csv
│   │   └── README
│   ├── ICD_codes/
│   │   ├── icd9_codes.csv
│   │   ├── icd10_codes.csv
│   │   └── README
│   ├── metadata/
│   │   ├── metadata.csv
│   │   └── README
│   └── WFDB/
│       ├── 2010/
│       ├── 2011/
│       ├── ...
│       └── 2018/
└── I0001/
    ├── 12SL_diagnoses/
    │   ├── diagnoses.csv
    │   ├── diagnoses_dictionary.csv
    │   └── README
    ├── ICD_codes/
    │   ├── icd9_codes.csv
    │   ├── icd10_codes.csv
    │   └── README
    ├── metadata/
    │   ├── metadata.csv
    │   └── README
    └── WFDB/
        ├── S0001/
        ├── S0002/
        ├── S0003/
        └── S0004/
Each institution (I0001 and I0006) maintains its own subfolders for diagnoses, ICD codes, metadata, and waveform files. The WFDB/ directory contains the ECG waveform data organized either by year (I0006) or by session identifier (I0001).
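As a rough illustration of this layout, the following sketch groups header files by the first folder below WFDB/, which is the year for I0006 and the session folder for I0001. The function name `list_records` and the tiny temporary tree are hypothetical and exist only to exercise the code; how deeply files are nested below each year or session folder is not stated here, so the sketch searches recursively.

```python
from pathlib import Path
import tempfile

def list_records(root, institution):
    """Return .hea paths under <root>/<institution>/WFDB/, grouped by subfolder."""
    wfdb_dir = Path(root) / institution / "WFDB"
    groups = {}
    for hea in sorted(wfdb_dir.rglob("*.hea")):
        # First path component below WFDB/ is the year (I0006)
        # or the session folder (I0001).
        group = hea.relative_to(wfdb_dir).parts[0]
        groups.setdefault(group, []).append(hea)
    return groups

# Build a tiny fake tree just to run the function; file names are made up.
with tempfile.TemporaryDirectory() as tmp:
    for rel in ["ECG/I0006/WFDB/2010/a.hea",
                "ECG/I0006/WFDB/2011/b.hea",
                "ECG/I0001/WFDB/S0001/c.hea"]:
        p = Path(tmp) / rel
        p.parent.mkdir(parents=True, exist_ok=True)
        p.touch()
    by_year = list_records(Path(tmp) / "ECG", "I0006")
    by_session = list_records(Path(tmp) / "ECG", "I0001")
    year_keys = sorted(by_year)
    session_keys = sorted(by_session)
```

With millions of recordings, walking the file system this way may be slow; the FileName column in metadata.csv is probably a better index for large-scale work.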
12SL Diagnoses Description
The 12SL_diagnoses/ folder contains diagnostic outputs from 12SL (GE Healthcare) software, version 1.
File: diagnoses.csv
This file contains two columns:
- FileName – Path to the corresponding WFDB file
- codes – Diagnostic codes, which can be mapped to text labels using diagnoses_dictionary.csv
File: diagnoses_dictionary.csv
This file provides human-readable mappings for 12SL diagnostic codes. It contains the following columns:
- codes – Integer codes for diagnoses
- acronym – Abbreviated diagnosis labels
- diagnoses – Full textual descriptions of diagnoses
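The mapping from codes to text labels can be sketched with the standard csv module. The rows below are made up and are not real 12SL codes or HEEDB paths. How multiple codes are stored in the codes column is not specified here, so the sketch simply extracts every integer it finds in the cell; check the README in 12SL_diagnoses/ before relying on that assumption.

```python
import csv
import io
import re

# Made-up rows for illustration; not real 12SL codes or HEEDB file paths.
DICTIONARY_CSV = """codes,acronym,diagnoses
1,AAA,Example diagnosis A
2,BBB,Example diagnosis B
"""
DIAGNOSES_CSV = """FileName,codes
path/to/record_1,"[1, 2]"
path/to/record_2,2
"""

def load_dictionary(fh):
    """Map integer code -> (acronym, full description)."""
    return {int(r["codes"]): (r["acronym"], r["diagnoses"])
            for r in csv.DictReader(fh)}

def label_records(fh, dictionary):
    """Map FileName -> list of full-text diagnoses."""
    out = {}
    for row in csv.DictReader(fh):
        # Assumption: the cell holds one or more integers in some delimited form.
        codes = [int(c) for c in re.findall(r"\d+", row["codes"])]
        out[row["FileName"]] = [
            dictionary.get(c, ("?", "unknown code"))[1] for c in codes
        ]
    return out

dictionary = load_dictionary(io.StringIO(DICTIONARY_CSV))
labels = label_records(io.StringIO(DIAGNOSES_CSV), dictionary)
```

Codes missing from the dictionary are reported as "unknown code" rather than raising an error, which makes gaps in the mapping easy to spot.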
ICD Codes Description
The ICD_codes/ folder contains diagnostic information extracted from Electronic Health Records (EHR) for each patient.
File: icd10_codes.csv
This file contains diagnostic codes from the 10th revision of the International Classification of Diseases (ICD-10), developed by the World Health Organization (WHO). These alphanumeric codes represent diagnoses and health conditions.
Columns:
- BDSPPatientID – Brain Data Science Platform Patient ID
- RECORDED_DT – Shifted date of the diagnosis
- DIAGNOSIS_ICD10_CD – Full ICD-10 diagnosis code
- DIAGNOSIS_ICD10_DESC – Description of the ICD-10 diagnosis code
File: icd9_codes.csv
This file contains diagnostic codes from the 9th revision of the International Classification of Diseases (ICD-9), also developed by the WHO. These codes are also sourced from the EHR system.
Columns:
- BDSPPatientID – Brain Data Science Platform Patient ID
- RECORDED_DT – Shifted date of the diagnosis
- DIAGNOSIS_ICD9_CD – Full ICD-9 diagnosis code
- DIAGNOSIS_ICD9_DESC – Description of the ICD-9 diagnosis code
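A typical use of these files is to find patients with a given diagnosis family. The sketch below filters ICD-10 rows by code prefix (I48 covers atrial fibrillation and flutter) and groups the hits by patient. The sample rows, the patient IDs, the dotted code format, and the ISO-style dates are all assumptions made for illustration; the actual formatting in icd10_codes.csv may differ.

```python
import csv
import io
from collections import defaultdict

# Made-up rows; patient IDs, dates, and code formatting are assumptions.
ICD10_CSV = """BDSPPatientID,RECORDED_DT,DIAGNOSIS_ICD10_CD,DIAGNOSIS_ICD10_DESC
1001,2015-03-02,I48.0,Paroxysmal atrial fibrillation
1001,2016-07-19,E11.9,Type 2 diabetes mellitus without complications
1002,2014-11-30,I10,Essential (primary) hypertension
"""

def patients_with_prefix(fh, prefix):
    """Map patient ID -> sorted (date, code) pairs whose code starts with prefix."""
    hits = defaultdict(list)
    for row in csv.DictReader(fh):
        code = row["DIAGNOSIS_ICD10_CD"]
        if code.upper().startswith(prefix.upper()):
            hits[row["BDSPPatientID"]].append((row["RECORDED_DT"], code))
    # Sorting strings orders dates correctly only for ISO-style dates.
    return {pid: sorted(rows) for pid, rows in hits.items()}

af_patients = patients_with_prefix(io.StringIO(ICD10_CSV), "I48")
```

The same function would work for icd9_codes.csv after swapping the column name to DIAGNOSIS_ICD9_CD. The resulting patient IDs can then be matched against BDSPPatientID in metadata.csv to reach the ECG files.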
Metadata Description
The metadata/ folder contains demographic and temporal information associated with each ECG recording, including ECG acquisition time, date of birth, date of death, and derived age-related fields.
File: metadata.csv
Columns:
- BDSPPatientID – Patient ID
- FileName – Path to the WFDB file
- FileID – Basename of the WFDB file
- PatientRace
- EthnicGroupDSC
- MaritalStatusDSC
- ReligionDSC
- LanguageDSC
- VeteranStatusDSC
- SexDSC
- PrimaryCauseOfDeathDSC
- PrimaryCauseOfDeathUNOS
- FirstContributoryCauseOfDeathDSC
- FirstContributoryCauseOfDeathUNOS
- SecondContributoryCauseOfDeathDSC
- SecondContributoryCauseOfDeathUNOS
- EducationLevelDSC
- GenderIdentityDSC
- SexAssignedAtBirthDSC
- DateOfDeath
- DateOfDeathMARegistryData – Massachusetts (MA) state death registry date of death
- LastKnownVisitDate – Last time the patient had contact with the hospital system
- ECGAcquisitionTime – Time of ECG acquisition
- DateOfBirth
- AgeAtAcquisition – Age at ECG acquisition
- AgeAtDeath – Age at time of death
- AgeAtDeathMA – Age at death according to MA state registry
- AgeAtLastVisit – Age at the last hospital contact
For I0006, the following columns are missing from the metadata.csv file: EthnicGroupDSC, MaritalStatusDSC, ReligionDSC, LanguageDSC, VeteranStatusDSC, PrimaryCauseOfDeathDSC, PrimaryCauseOfDeathUNOS, FirstContributoryCauseOfDeathDSC, FirstContributoryCauseOfDeathUNOS, SecondContributoryCauseOfDeathDSC, SecondContributoryCauseOfDeathUNOS, EducationLevelDSC, GenderIdentityDSC, SexAssignedAtBirthDSC, and DateOfDeathMARegistryData.
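Because the two institutions do not share the same column set, combining their metadata needs a harmonization step. One possible approach, sketched below, reads each file with csv.DictReader and fills any absent column with None so that every row has the full schema. The `Institution` field and the placeholder values are additions for illustration and are not part of the dataset.

```python
import csv
import io

ALL_COLUMNS = [
    "BDSPPatientID", "FileName", "FileID", "PatientRace", "EthnicGroupDSC",
    "MaritalStatusDSC", "ReligionDSC", "LanguageDSC", "VeteranStatusDSC",
    "SexDSC", "PrimaryCauseOfDeathDSC", "PrimaryCauseOfDeathUNOS",
    "FirstContributoryCauseOfDeathDSC", "FirstContributoryCauseOfDeathUNOS",
    "SecondContributoryCauseOfDeathDSC", "SecondContributoryCauseOfDeathUNOS",
    "EducationLevelDSC", "GenderIdentityDSC", "SexAssignedAtBirthDSC",
    "DateOfDeath", "DateOfDeathMARegistryData", "LastKnownVisitDate",
    "ECGAcquisitionTime", "DateOfBirth", "AgeAtAcquisition", "AgeAtDeath",
    "AgeAtDeathMA", "AgeAtLastVisit",
]

# Columns listed above as missing from the I0006 metadata.csv.
I0006_ABSENT = {
    "EthnicGroupDSC", "MaritalStatusDSC", "ReligionDSC", "LanguageDSC",
    "VeteranStatusDSC", "PrimaryCauseOfDeathDSC", "PrimaryCauseOfDeathUNOS",
    "FirstContributoryCauseOfDeathDSC", "FirstContributoryCauseOfDeathUNOS",
    "SecondContributoryCauseOfDeathDSC", "SecondContributoryCauseOfDeathUNOS",
    "EducationLevelDSC", "GenderIdentityDSC", "SexAssignedAtBirthDSC",
    "DateOfDeathMARegistryData",
}

def read_metadata(fh, institution):
    """Read metadata rows, filling absent or empty columns with None."""
    rows = []
    for row in csv.DictReader(fh):
        full = {col: (row.get(col) or None) for col in ALL_COLUMNS}
        full["Institution"] = institution  # hypothetical helper field
        rows.append(full)
    return rows

# Build a one-row, I0006-style file in memory with placeholder values.
i0006_cols = [c for c in ALL_COLUMNS if c not in I0006_ABSENT]
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=i0006_cols)
writer.writeheader()
writer.writerow({c: "x" for c in i0006_cols})
buf.seek(0)
rows = read_metadata(buf, "I0006")
```

Filling with None rather than an empty string keeps "column not provided by this institution" distinguishable from a real text value in downstream code, although it does not separate an absent column from an empty cell.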
Usage Notes
HEEDB is intended to support a wide range of ECG studies, in particular those exploring the relationship between ECG conditions and sleep.
Release Notes
- v1.0: Initial release containing 10,608,417 ECGs from 1,818,247 subjects (I0001 site only).
- v2.0: Added further data files.
- v3.0: Expanded to include two ECG institutions: I0001 (10,608,417 ECGs from 1,818,247 subjects) and I0006 (1,061,598 ECGs from 349,548 patients). Metadata, 12SL diagnostic codes, and ICD-9/10 diagnosis codes were also added. Duplicate ECGs were removed from the I0001 site, and incorrect sampling frequencies in header files were corrected.
- v4.0: Corrected typos in the data description.
Ethics
The study protocol was approved by the Institutional Review Boards of the Massachusetts General Hospital (protocol # 2013P001024) and Beth Israel Deaconess Medical Center (protocol # 2022P000417). The requirement for written informed consent was waived because of the retrospective study design. The study complied with the Declaration of Helsinki.
Acknowledgements
Publication of HEEDB is supported by a grant (R01HL161253) from the National Heart, Lung, and Blood Institute (NHLBI) of the NIH to Massachusetts General Hospital, Emory University, Stanford University, Kaiser Permanente, Boston Children's Hospital, and Beth Israel Deaconess Medical Center.
Conflicts of Interest
Dr. Westover is a co-founder of, scientific advisor and consultant to, and holds a personal equity interest in Beacon Biosignals. The other authors declare that they have no conflicts of interest.
Access
Access Policy:
Only credentialed users who sign the DUA can access the files.
License (for files):
BDSP Credentialed Health Data License 1.5.0
Data Use Agreement:
BDSP Credentialed Health Data Use Agreement
Required training:
CITI Data or Specimens Only Research
Files
To access the files, you must:
- be a credentialed user
- complete the required training: CITI Data or Specimens Only Research
- sign the data use agreement for the project