Database Restricted Access

Harvard Electroencephalography Database

Sahar Zafar Tobias Loddenkemper Jong Woo Lee Andrew Cole Daniel Goldenholz Jurriaan Peters Alice Lam Edilberto Amorim Catherine Chu Sydney Cash Valdery Moura Junior Aditya Gupta Manohar Ghanta Marta Fernandes Haoqi Sun Jin Jing M Brandon Westover

Published: Feb. 10, 2025. Version: 4.1


When using this resource, please cite:
Zafar, S., Loddenkemper, T., Lee, J. W., Cole, A., Goldenholz, D., Peters, J., Lam, A., Amorim, E., Chu, C., Cash, S., Moura Junior, V., Gupta, A., Ghanta, M., Fernandes, M., Sun, H., Jing, J., & Westover, M. B. (2025). Harvard Electroencephalography Database (version 4.1). Brain Data Science Platform. https://doi.org/10.60508/k85b-fc87.

Abstract

The Harvard EEG Database encompasses data gathered from four hospitals affiliated with Harvard University: Massachusetts General Hospital (MGH), Brigham and Women's Hospital (BWH), Beth Israel Deaconess Medical Center (BIDMC), and Boston Children's Hospital (BCH). The EEG data includes three types:

  • rEEG: "routine EEGs" recorded in the outpatient setting.
  • EMU: recordings obtained in the inpatient setting, within the Epilepsy Monitoring Unit (EMU).
  • ICU/LTM: long-term monitoring (LTM) recordings obtained from acutely and critically ill patients within the intensive care unit (ICU).

Background

Electroencephalography (EEG) is a cornerstone technology in clinical neurology, enabling the diagnosis of epilepsy, assessment of encephalopathy, prediction of neurological recovery after cardiac arrest, monitoring of consciousness during anesthesia, and evaluation of sleep disorders. This repository contains over 280,000 clinical EEG recordings from over 100,000 patients.

Complementing the EEG recordings, the database includes comprehensive electronic health record (EHR) data from participating institutions. The structured EHR data encompasses ICD diagnostic codes, laboratory results, vital signs, medications, procedure codes, and demographics, while the unstructured data includes detailed clinical notes from physicians, nurses, and other healthcare personnel, as well as diagnostic reports from imaging studies and EEG interpretations. The EHR data is currently maintained in site-specific formats; future releases will provide harmonized EHR data across all participating institutions, enabling more comprehensive multi-site analyses.

By making these rich clinical resources available to researchers worldwide, we aim to accelerate the development of advanced EEG analysis methods, improve the accuracy of automated interpretation, and ultimately expand access to high-quality neurological care in resource-limited settings. The database's scope and diversity create unprecedented opportunities for advancing our understanding of brain disorders and developing more effective diagnostic tools.
 


Methods

Most EEG data in this repository was recorded with scalp electrodes placed according to the International 10-20 system. The sampling rate of each recording is given in the header of its EEG file.
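As an illustration of reading sampling rates from a header, the sketch below parses the fixed-layout header of an EDF file using only the Python standard library. It assumes the file follows the published EDF layout (a 256-byte fixed header followed by 256 bytes per signal); the function name read_edf_sampling_rates is hypothetical and is not part of the database's own tooling.

```python
def read_edf_sampling_rates(stream):
    """Return [(channel_label, sampling_rate_hz), ...] from an EDF header.

    `stream` is a binary file object positioned at the start of the file.
    """
    fixed = stream.read(256)
    if len(fixed) < 256:
        raise ValueError("file is too short to contain an EDF header")

    # Fixed header fields are ASCII, left-justified and space-padded.
    record_duration_s = float(fixed[244:252].decode("ascii").strip())
    n_signals = int(fixed[252:256].decode("ascii").strip())
    if record_duration_s <= 0:
        raise ValueError("data record duration must be positive")

    # Per-signal header: 256 bytes per signal, stored field by field
    # (all labels first, then all transducer types, and so on).
    per_signal = stream.read(256 * n_signals)
    if len(per_signal) < 256 * n_signals:
        raise ValueError("truncated per-signal header")

    labels_at = 0                 # n_signals * 16 bytes of channel labels
    samples_at = 216 * n_signals  # n_signals * 8 bytes: samples per data record

    rates = []
    for i in range(n_signals):
        label = per_signal[labels_at + 16 * i:labels_at + 16 * (i + 1)]
        n_samples = per_signal[samples_at + 8 * i:samples_at + 8 * (i + 1)]
        rate_hz = int(n_samples.decode("ascii").strip()) / record_duration_s
        rates.append((label.decode("ascii", errors="replace").strip(), rate_hz))
    return rates
```

A typical call would open the .edf file in binary mode, for example with open(path, "rb") as f, and pass f to the function. The rate is computed as samples per data record divided by the record duration, so channels in the same file may report different rates.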


Data Description

As of February 9, 2025, the Harvard Electroencephalography Database includes 284,343 EEG studies conducted on 109,178 distinct patients across four sites. The data encompasses:

  • Site S0001: 98,932 EEG files from 37,394 unique patients 
  • Site S0002: 66,604 EEG files from 29,161 unique patients 
  • Site I0003: 61,255 EEG files from 25,045 unique patients 
  • Site I0002: 57,552 EEG files from 17,578 unique patients 
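
The per-site figures can be checked against the stated totals with a few lines of arithmetic. The sketch below only re-adds the numbers listed above; the variable names are illustrative.

```python
# (EEG files, unique patients) per site, copied from the list above.
site_counts = {
    "S0001": (98_932, 37_394),
    "S0002": (66_604, 29_161),
    "I0003": (61_255, 25_045),
    "I0002": (57_552, 17_578),
}

total_files = sum(files for files, _ in site_counts.values())
total_patients = sum(patients for _, patients in site_counts.values())

# Both sums should match the totals quoted for the whole database.
assert total_files == 284_343
assert total_patients == 109_178

# Average number of EEG files per patient, overall and per site.
overall_ratio = total_files / total_patients
per_site_ratio = {
    site: files / patients for site, (files, patients) in site_counts.items()
}
```

Because the per-site patient counts add up exactly to the overall patient total, patients appear to be counted once per site rather than de-duplicated across sites.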
     

EEG Dataset Folder Structure:

The folder structure follows the BIDS (Brain Imaging Data Structure) specification version 1.7.0 for organizing EEG (electroencephalogram) data collected from multiple sites.

The folder hierarchy has four main levels:

bids -> sub-ID -> ses-ID -> eeg

Bids-root-folder/
	├── dataset_description.json
	├── participants.json
	├── participants.tsv
	├── README
	└── sub-SiteIdPatientId/
		└── ses-01/
			├── sub-SiteIdPatientId_ses-01_scans.tsv
			└── eeg/
				├── sub-SiteIdPatientId_ses-01_task-eeg_annotations.tsv
				├── sub-SiteIdPatientId_ses-01_task-eeg_channels.tsv
				├── sub-SiteIdPatientId_ses-01_task-eeg_eeg.edf
				├── sub-SiteIdPatientId_ses-01_task-eeg_eeg.json
				└── sub-SiteIdPatientId_ses-01_task-eeg_pre.csv

 

Description:

1. Top Level: BIDS (root-folder)

The top-level files provide metadata and general information about the dataset:

  • dataset_description.json: A description of the dataset.
  • participants.json: Metadata definitions for columns in participants.tsv.
  • participants.tsv: A list of participants with demographic and physical details.
  • README: General information and notes about the dataset.

2. Subject Level: sub-SiteIdPatientId

Each folder at this level represents a distinct patient. The subject ID is a combination of the study site ID and the patient's unique ID. All studies related to a specific patient can be found within their corresponding folder.

3. Session Level: ses-XX

Within each participant's folder, individual sessions correspond to separate EEG studies, labeled in chronological order.

  • sub-SiteIdPatientId_ses-01_scans.tsv: lists all EEG file names and their acquisition times for a session.
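
A scans file of this kind can be read with the standard csv module. The sketch below assumes the column names used by the BIDS specification for scans files (filename and acq_time) and a tab delimiter; the actual columns in this dataset may differ, and the function name read_scans is hypothetical.

```python
import csv

def read_scans(stream):
    """Return [(filename, acq_time), ...] from an open scans.tsv text stream.

    Assumes a header row with 'filename' and 'acq_time' columns, as in the
    BIDS specification. Values are returned as strings, unparsed.
    """
    reader = csv.DictReader(stream, delimiter="\t")
    return [(row["filename"], row["acq_time"]) for row in reader]
```

Acquisition times are left as strings here because the timestamps are de-identified and BIDS allows placeholder values such as n/a.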

4. EEG Data Level: eeg/

The EEG sub-directory within each session contains:

  • Annotations: e.g., sub-SiteIdPatientId_ses-01_task-eeg_annotations.csv.
  • Data File: e.g., sub-SiteIdPatientId_ses-01_task-eeg_eeg.edf.
  • Metadata: e.g., sub-SiteIdPatientId_ses-01_task-eeg_eeg.json.
  • Channels Description: e.g., sub-SiteIdPatientId_ses-01_task-eeg_channels.tsv.
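
The naming pattern described above can be turned into paths programmatically. The following is a minimal sketch: the function name session_files is hypothetical, the two-digit zero-padded session label is an assumption based on the ses-01 examples, and the annotations file is left out because its extension is not stated consistently.

```python
from pathlib import PurePosixPath

def session_files(bids_root, site_id, patient_id, session):
    """Build the expected paths for one EEG session.

    The subject label is the site ID followed by the patient ID, and the
    session number is zero-padded to two digits (an assumption).
    """
    subject = f"sub-{site_id}{patient_id}"
    ses = f"ses-{session:02d}"
    session_dir = PurePosixPath(bids_root) / subject / ses
    eeg_dir = session_dir / "eeg"
    stem = f"{subject}_{ses}_task-eeg"
    return {
        "scans": session_dir / f"{subject}_{ses}_scans.tsv",
        "data": eeg_dir / f"{stem}_eeg.edf",
        "metadata": eeg_dir / f"{stem}_eeg.json",
        "channels": eeg_dir / f"{stem}_channels.tsv",
    }

# Example with made-up identifiers (not real patient IDs).
paths = session_files("bids", "S0001", "111", 1)
```

Returning a dictionary keyed by file role keeps the naming convention in one place, so downstream code does not need to rebuild file names by hand.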

Metadata File

Along with the dataset, a CSV file is provided to help locate specific studies. This CSV can be found in the Files section below and has the following columns:

Column Name              Description
SiteID                   Unique identifier of the hospital where the EEG was recorded.
BDSPPatientID            Unique identifier of the patient.
BidsFolder               Folder where studies for a specific patient are available in the BDSP OpenData Repository.
SessionID                Folder in the BDSP OpenData Repository containing a specific study and its auxiliary files for a particular patient.
CreationTime             De-identified timestamp indicating when the EEG was recorded.
StartTime                De-identified timestamp indicating when the EEG started.
EndTime                  De-identified timestamp indicating when the EEG finished.
DurationInSecond         Duration of the EEG recording in seconds.
HasXLTEKAnnotations      Flag indicating whether the study has annotations created on Natus/XLTEK.
HasPersystAnnotations    Flag indicating whether the study has annotations created on Persyst.
ServiceName              EEG type: Routine, LTM, or EMU.
AgeAtVisit               Age of the patient at the time of the study.
SexDSC                   Gender as reported by the patient.
BDSPLastModifiedDTS      Time at which the record was last updated.
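
As a sketch of how this file might be used to locate studies, the function below filters rows by EEG type and minimum duration and returns the folder and session for each match. It assumes a comma-delimited file with a header row using the column names listed above, and that ServiceName is spelled exactly as Routine, LTM, or EMU; the function name find_studies is hypothetical.

```python
import csv

def find_studies(stream, service_name, min_duration_s=0.0):
    """Return [(BidsFolder, SessionID), ...] for matching studies.

    `stream` is an open text stream over the metadata CSV. Rows with a
    missing or non-numeric duration are skipped.
    """
    matches = []
    for row in csv.DictReader(stream):
        if row["ServiceName"] != service_name:
            continue
        try:
            duration_s = float(row["DurationInSecond"])
        except (TypeError, ValueError):
            continue
        if duration_s >= min_duration_s:
            matches.append((row["BidsFolder"], row["SessionID"]))
    return matches
```

For example, find_studies(f, "EMU", min_duration_s=3600) would list EMU studies lasting at least one hour, provided the assumptions about the file format hold.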

 

Electronic Health Records (EHR) Data

The EHR data from each site follows a standardized directory structure with three main branches: data_Imaging, data_Structured, and data_Unstructured. The data_Structured branch contains Parquet files organized by data type (e.g., demographics, medications, lab results, vital signs, problem lists, ICD codes, and CPT codes) and split into manageable partitions. The data_Unstructured branch contains clinical notes and reports organized chronologically, with both raw data files (.tar and .zip format) and corresponding metadata (.csv files). Clinical notes are separated into distinct periods and organized by year. The data_Imaging branch contains imaging metadata and reports. 
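
To get an overview of the structured branch for one site, the Parquet partitions can be grouped by the folder that contains them. This sketch assumes that each data type has its own sub-folder under data_Structured, which is not spelled out above; the function name list_structured_partitions is hypothetical.

```python
from pathlib import Path

def list_structured_partitions(site_root):
    """Group the Parquet files under <site_root>/data_Structured by folder.

    Returns {parent_folder_name: [Path, ...]} with paths in sorted order.
    The folder-per-data-type layout is an assumption.
    """
    structured = Path(site_root) / "data_Structured"
    groups = {}
    for path in sorted(structured.rglob("*.parquet")):
        groups.setdefault(path.parent.name, []).append(path)
    return groups
```

Each listed partition could then be loaded with a Parquet reader such as pandas.read_parquet and the partitions of one data type concatenated.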
 


Usage Notes

Code for loading the EEG data is available in the associated GitHub repository (https://github.com/bdsp-core/Harvard-EEG-Database-Tools). 
 


Release Notes

In this new release, the data has been converted to the EEG-BIDS format.


Ethics

In this dataset, all data were anonymized with all identifiable patient information removed.


Acknowledgements

Thanks to the EEG technologists, attending physicians, and fellows who provide EEG diagnostic services. 


Conflicts of Interest

This work was supported by grants from the NIH (R01NS102190, R01NS102574, R01NS107291, RF1AG064312, RF1NS120947, R01AG073410, R01HL161253, R01NS126282, R01AG073598). 
 


Access

Access Policy:
Only registered users who sign the specified data use agreement can access the files.

License (for files):
BDSP Restricted Health Data License 1.0.0

Data Use Agreement:
BDSP Restricted Health Data Use Agreement

Versions
  • 1.0 - June 15, 2023
  • 2.0 - Nov. 7, 2023
  • 3.0 - Dec. 12, 2024
  • 4.0 - Feb. 10, 2025
  • 4.1 - Feb. 10, 2025

Files