Database Restricted Access
Harvard Electroencephalography Database
Sahar Zafar, Tobias Loddenkemper, Jong Woo Lee, Andrew Cole, Daniel Goldenholz, Jurriaan Peters, Alice Lam, Edilberto Amorim, Catherine Chu, Sydney Cash, Valdery Moura Junior, Aditya Gupta, Manohar Ghanta, Marta Fernandes, Haoqi Sun, Jin Jing, M Brandon Westover
Published: Feb. 10, 2025. Version: 4.1
When using this resource, please cite:
Zafar, S., Loddenkemper, T., Lee, J. W., Cole, A., Goldenholz, D., Peters, J., Lam, A., Amorim, E., Chu, C., Cash, S., Moura Junior, V., Gupta, A., Ghanta, M., Fernandes, M., Sun, H., Jing, J., & Westover, M. B. (2025). Harvard Electroencephalography Database (version 4.1). Brain Data Science Platform. https://doi.org/10.60508/k85b-fc87.
Abstract
The Harvard EEG Database will encompass data gathered from four hospitals affiliated with Harvard University: Massachusetts General Hospital (MGH), Brigham and Women's Hospital (BWH), Beth Israel Deaconess Medical Center (BIDMC), and Boston Children's Hospital (BCH). The EEG data includes three types:
- rEEG: "routine EEGs" recorded in the outpatient setting.
- EMU: recordings obtained in the inpatient setting, within the Epilepsy Monitoring Unit (EMU).
- ICU/LTM: recordings obtained from acutely and critically ill patients within the intensive care unit (ICU).
Background
Electroencephalography (EEG) is a cornerstone technology in clinical neurology, enabling the diagnosis of epilepsy, assessment of encephalopathy, prediction of neurological recovery after cardiac arrest, monitoring of consciousness during anesthesia, and evaluation of sleep disorders. This repository contains over 280,000 clinical EEG recordings from over 100,000 patients.
Complementing the EEG recordings, the database includes comprehensive electronic health record (EHR) data from participating institutions. The structured EHR data encompasses ICD diagnostic codes, laboratory results, vital signs, medications, procedure codes, and demographics, while the unstructured data includes detailed clinical notes from physicians, nurses, and other healthcare personnel, as well as diagnostic reports from imaging studies and EEG interpretations. Though currently maintained in site-specific formats, future releases will provide harmonized EHR data across all participating institutions, enabling more comprehensive multi-site analyses.
By making these rich clinical resources available to researchers worldwide, we aim to accelerate the development of advanced EEG analysis methods, improve the accuracy of automated interpretation, and ultimately expand access to high-quality neurological care in resource-limited settings. The database's scope and diversity create unprecedented opportunities for advancing our understanding of brain disorders and developing more effective diagnostic tools.
Methods
Most EEG data in this repository is recorded using the International 10-20 system for scalp electrode placement. Sampling rates of recordings are provided in the EEG header files.
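As a rough illustration of reading a sampling rate from a header, the sketch below parses the header of an EDF file (the `.edf` format listed under Data Description) using only the Python standard library. It relies on the fixed byte layout of the EDF specification, where the rate of each channel is its samples-per-record count divided by the record duration. The function name `read_edf_sampling_rates` is our own and is not part of the database tooling; the BIDS `_eeg.json` sidecar normally also carries a `SamplingFrequency` field, which may be the simpler source when it is present.

```python
def read_edf_sampling_rates(stream):
    """Return [(channel label, sampling rate in Hz), ...] from an EDF header.

    `stream` is a binary file object positioned at the start of the file.
    """
    fixed = stream.read(256)
    if len(fixed) < 256:
        raise ValueError("too short to contain the 256-byte EDF main header")

    # Main header: record duration sits at bytes 244-251, signal count at 252-255.
    record_duration_s = float(fixed[244:252].decode("latin-1").strip())
    n_signals = int(fixed[252:256].decode("latin-1").strip())
    if record_duration_s <= 0:
        raise ValueError("record duration must be positive to derive a rate")

    # Signal header: 256 bytes per signal, stored field by field
    # (all labels first, then all transducer types, and so on).
    signal_header = stream.read(256 * n_signals)
    if len(signal_header) < 256 * n_signals:
        raise ValueError("truncated EDF signal header")

    # Samples-per-record comes after label (16) + transducer (80) + unit (8)
    # + physical min/max (16) + digital min/max (16) + prefiltering (80)
    # = 216 bytes per signal.
    samples_offset = 216 * n_signals
    rates = []
    for i in range(n_signals):
        label = signal_header[16 * i:16 * (i + 1)].decode("latin-1").strip()
        start = samples_offset + 8 * i
        n_samples = int(signal_header[start:start + 8].decode("latin-1").strip())
        rates.append((label, n_samples / record_duration_s))
    return rates
```

A typical call would be `with open(path, "rb") as f: rates = read_edf_sampling_rates(f)`, where `path` points to one of the `_eeg.edf` files. Channels within one recording can have different rates, which is why the function returns one value per channel.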
Data Description
As of February 9, 2025, the Harvard Electroencephalography Database includes 284,343 EEG studies conducted on 109,178 distinct patients across four sites. The data encompasses:
- Site S0001: 98,932 EEG files from 37,394 unique patients
- Site S0002: 66,604 EEG files from 29,161 unique patients
- Site I0003: 61,255 EEG files from 25,045 unique patients
- Site I0002: 57,552 EEG files from 17,578 unique patients
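The per-site figures can be checked against the stated totals with a few lines of Python. This is only a consistency check on the numbers listed above; adding patient counts across sites assumes that no patient is counted at more than one site, which the matching total of 109,178 suggests but the text does not state outright.

```python
# Per-site counts copied from the list above: (EEG files, unique patients).
SITE_COUNTS = {
    "S0001": (98_932, 37_394),
    "S0002": (66_604, 29_161),
    "I0003": (61_255, 25_045),
    "I0002": (57_552, 17_578),
}


def totals(site_counts):
    """Return (total EEG files, total patients) summed over all sites."""
    n_files = sum(files for files, _ in site_counts.values())
    n_patients = sum(patients for _, patients in site_counts.values())
    return n_files, n_patients


def files_per_patient(site_counts):
    """Return the mean number of EEG files per patient for each site."""
    return {
        site: files / patients
        for site, (files, patients) in site_counts.items()
    }


total_files, total_patients = totals(SITE_COUNTS)
print(total_files, total_patients)  # 284343 109178
for site, ratio in files_per_patient(SITE_COUNTS).items():
    print(f"{site}: {ratio:.2f} EEG files per patient")
```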
EEG Dataset Folder Structure:
The folder structure follows the BIDS (Brain Imaging Data Structure) specification version 1.7.0 for organizing EEG (electroencephalogram) data collected from multiple sites.
The folder hierarchy has four main levels:
bids -> sub-ID -> ses-ID -> eeg
Bids-root-folder/
├── dataset_description.json
├── participants.json
├── participants.tsv
├── README
└── sub-Id/
    └── ses-01/
        ├── sub-SiteIdPatientId_ses-01_scans.tsv
        └── eeg/
            ├── sub-Id_ses-1_task-eeg_annotations.tsv
            ├── sub-Id_ses-1_task-eeg_channels.tsv
            ├── sub-Id_ses-1_task-eeg_eeg.edf
            ├── sub-Id_ses-1_task-eeg_eeg.json
            └── sub-Id_ses-1_task-eeg_pre.csv
Description:
1. Top Level: BIDS (root-folder)
The top-level files provide metadata and general information about the dataset:
- dataset_description.json: A description of the dataset.
- participants.json: Metadata definitions for columns in participants.tsv.
- participants.tsv: A list of participants with demographic and physical details.
- README: General information and notes about the dataset.
2. Subject Level: sub-SiteIdPatientId
Each folder at this level represents a distinct patient. The subject ID is a combination of the study site ID and the patient's unique ID. All studies related to a specific patient can be found within their corresponding folder.
3. Session Level: ses-XX
Within each participant's folder, individual sessions correspond to separate EEG studies, labeled in chronological order.
- sub-SiteIdPatientId_ses-01_scans.tsv: lists all EEG file names in the session and their acquisition times.
4. EEG Data Level: eeg/
The EEG sub-directory within each session contains:
- Annotations: e.g., sub-SiteIdPatientId_ses-01_task-eeg_annotations.csv.
- Data File: e.g., sub-SiteIdPatientId_ses-01_task-eeg_eeg.edf.
- Metadata: e.g., sub-SiteIdPatientId_ses-01_task-eeg_eeg.json.
- Channels Description: e.g., sub-SiteIdPatientId_ses-01_task-eeg_channels.tsv.
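The naming scheme described above can be sketched in Python as a pair of helpers: one that builds the expected path of a file from a subject and session, and one that splits a file name back into its parts. The function names, the two-digit session padding, and the subject label `S0001123` in the usage note are assumptions made for illustration. The listing above shows both `ses-01` and `ses-1`, and both `.tsv` and `.csv` for annotations, so the parser accepts either form.

```python
import re
from pathlib import PurePosixPath

# Matches names such as sub-<SiteIdPatientId>_ses-01_task-eeg_eeg.edf.
# Assumption: subject labels are alphanumeric and session labels are digits.
BIDS_NAME = re.compile(
    r"^sub-(?P<subject>[A-Za-z0-9]+)_ses-(?P<session>\d+)"
    r"_task-(?P<task>[A-Za-z0-9]+)_(?P<suffix>[a-z]+)\.(?P<ext>[a-z]+)$"
)


def parse_bids_name(filename):
    """Split an EEG-level file name into subject, session, task, suffix, ext."""
    match = BIDS_NAME.match(filename)
    if match is None:
        raise ValueError(f"not an EEG-level BIDS file name: {filename!r}")
    return match.groupdict()


def eeg_path(root, subject, session, suffix="eeg", ext="edf"):
    """Build root/sub-<subject>/ses-<NN>/eeg/<file name> for one file.

    Assumption: session labels are zero-padded to two digits (ses-01).
    """
    ses = f"{int(session):02d}"
    name = f"sub-{subject}_ses-{ses}_task-eeg_{suffix}.{ext}"
    return PurePosixPath(root) / f"sub-{subject}" / f"ses-{ses}" / "eeg" / name
```

For example, `eeg_path("bids", "S0001123", 1)` yields `bids/sub-S0001123/ses-01/eeg/sub-S0001123_ses-01_task-eeg_eeg.edf`, and `parse_bids_name` applied to the final path component recovers the subject and session labels. Session-level files such as `_scans.tsv` have no `task-` entity and are deliberately rejected by the parser.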
Metadata File
Along with the dataset, a CSV file is provided to assist in identifying the locations of specific studies. This CSV can be found in the Files section below and is structured as follows:
Column Name | Description |
---|---|
SiteID | Unique identifier of the hospital where the EEG was recorded. |
BDSPPatientID | Unique identifier of the patient. |
BidsFolder | Folder where studies for a specific patient are available in the BDSP OpenData Repository. |
SessionID | Folder in the BDSP OpenData Repository containing a specific study and its auxiliary files for a particular patient. |
CreationTime | De-identified timestamp indicating when the EEG was recorded. |
StartTime | De-identified timestamp indicating when the EEG started. |
EndTime | De-identified timestamp indicating when the EEG finished. |
DurationInSecond | Duration of the EEG recording in seconds. |
HasXLTEKAnnotations | Flag indicating if the study has annotations created on Natus/XLTEK. |
HasPersystAnnotations | Flag indicating if the study has annotations created on Persyst. |
ServiceName | EEG type: Routine, LTM, or EMU. |
AgeAtVisit | Age of the patient at the time of the study. |
SexDSC | Patient-reported gender. |
BDSPLastModifiedDTS | The last time the record was updated. |
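A minimal sketch of how this metadata CSV might be used to locate studies is shown below. It filters rows by `ServiceName` and `DurationInSecond` and returns the `BidsFolder` and `SessionID` of each match. The column names come from the table above; the sample rows, the exact value formats (for example `sub-...` and `ses-...` strings), and the helper name `select_studies` are assumptions made for illustration only.

```python
import csv
import io

# Hypothetical sample rows: the column names come from the table above,
# but every value below is made up for illustration.
SAMPLE_CSV = """SiteID,BDSPPatientID,BidsFolder,SessionID,DurationInSecond,ServiceName
S0001,111,sub-S0001111,ses-01,1500,Routine
S0001,111,sub-S0001111,ses-02,86400,LTM
S0002,222,sub-S0002222,ses-01,90000,EMU
"""


def select_studies(csv_file, service_name, min_duration_s=0.0):
    """Return (BidsFolder, SessionID) pairs for studies of one EEG type.

    `csv_file` is any text file object; rows shorter than `min_duration_s`
    seconds are skipped.
    """
    selected = []
    for row in csv.DictReader(csv_file):
        if row["ServiceName"] != service_name:
            continue
        if float(row["DurationInSecond"]) < min_duration_s:
            continue
        selected.append((row["BidsFolder"], row["SessionID"]))
    return selected


print(select_studies(io.StringIO(SAMPLE_CSV), "LTM", min_duration_s=3600))
# [('sub-S0001111', 'ses-02')]
```

With the real file, the same function can be called on `open(path, newline="")`, where `path` is wherever the CSV was downloaded to.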
Electronic Health Records (EHR) Data
The EHR data from each site follows a standardized directory structure with three main branches: data_Imaging, data_Structured, and data_Unstructured. The data_Structured branch contains Parquet files organized by data type (e.g., demographics, medications, lab results, vital signs, problem lists, ICD codes, and CPT codes) and split into manageable partitions. The data_Unstructured branch contains clinical notes and reports organized chronologically, with both raw data files (.tar and .zip format) and corresponding metadata (.csv files). Clinical notes are separated into distinct periods and organized by year. The data_Imaging branch contains imaging metadata and reports.
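The three-branch layout can be illustrated with a short Python sketch that groups a list of relative file paths by branch and then picks out the Parquet partitions from the structured branch. Only the three branch names are taken from the description above; the example paths, the `.parquet` file extension, and the helper names are assumptions and may not match the actual file names at any site.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Branch names as given in the description above.
BRANCHES = ("data_Imaging", "data_Structured", "data_Unstructured")

# Hypothetical paths, made up to illustrate the layout.
EXAMPLE_PATHS = [
    "S0001/data_Structured/medications/part-000.parquet",
    "S0001/data_Unstructured/notes/2019/notes.tar",
    "S0001/data_Unstructured/notes/2019/notes.csv",
    "S0001/data_Imaging/reports.csv",
    "S0001/README",
]


def group_by_branch(relative_paths):
    """Group paths by the first EHR branch name found among their parts."""
    grouped = defaultdict(list)
    for raw in relative_paths:
        path = PurePosixPath(raw)
        branch = next((part for part in path.parts if part in BRANCHES), None)
        if branch is not None:  # skip files outside the three branches
            grouped[branch].append(path)
    return dict(grouped)


def parquet_files(grouped):
    """Return the structured-branch files that look like Parquet partitions."""
    return [p for p in grouped.get("data_Structured", []) if p.suffix == ".parquet"]


grouped = group_by_branch(EXAMPLE_PATHS)
for branch in BRANCHES:
    print(branch, len(grouped.get(branch, [])))
print(parquet_files(grouped))
```

The resulting Parquet paths could then be handed to a reader such as pandas or pyarrow, neither of which is needed for the grouping step itself.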
Usage Notes
Code for loading the EEG data is available in the associated GitHub repository (https://github.com/bdsp-core/Harvard-EEG-Database-Tools).
Release Notes
In this new release, the data has been converted to the EEG-BIDS format.
Ethics
In this dataset, all data were anonymized with all identifiable patient information removed.
Acknowledgements
Thanks to the EEG technologists, attending physicians, and fellows who provide EEG diagnostic services.
Conflicts of Interest
This work was supported by grants from the NIH (R01NS102190, R01NS102574, R01NS107291, RF1AG064312, RF1NS120947, R01AG073410, R01HL161253, R01NS126282, R01AG073598).
Access
Access Policy:
Only registered users who sign the specified data use agreement can access the files.
License (for files):
BDSP Restricted Health Data License 1.0.0
Data Use Agreement:
BDSP Restricted Health Data Use Agreement
Files
To access the files, sign the data use agreement for the project.