Database Restricted Access

The Human Sleep Project

M Brandon Westover, Valdery Moura Junior, Robert Thomas, Sydney Cash, Samaneh Nasiri, Haoqi Sun, Aditya Gupta, Jonathan Rosand, Manohar Ghanta, Wolfgang Ganglberger, Umakanth Katwa, Katie Stone, Zhiyong Zhang, Gauri Ganjoo, Thijs E Nassi PhD Candidate, Ruoqi Wei, Dennis Hwang, Lynn Marie Trotti, Ankit Parekh, ErikJan Meulenbrugge, Emmanuel Mignot, Rhoda Au, Gari Clifford, David Rapoport

Published: Nov. 1, 2023. Version: 2.0

When using this resource, please cite:
Westover, M. B., Moura Junior, V., Thomas, R., Cash, S., Nasiri, S., Sun, H., Gupta, A., Rosand, J., Ghanta, M., Ganglberger, W., Katwa, U., Stone, K., Zhang, Z., Ganjoo, G., Nassi PhD Candidate, T. E., Wei, R., Hwang, D., Trotti, L. M., Parekh, A., ... Rapoport, D. (2023). The Human Sleep Project (version 2.0). Brain Data Science Platform.


The Human Sleep Project (HSP) sleep physiology dataset is a growing collection of clinical polysomnography (PSG) recordings. Beginning with PSG recordings from ~19K patients evaluated at the Massachusetts General Hospital, the HSP will grow over the coming years to include data from >200K patients, as well as people evaluated outside of the clinical setting.


The HSP dataset is being used to develop CAISR (Complete AI Sleep Report), a collection of deep neural networks, rule-based algorithms, and signal processing approaches designed to provide better-than-human detection of conventional PSG scoring metrics, including sleep stages, arousals, apnea and hypopnea events and their subtypes, and periodic limb movements.

Beyond conventional scoring, the HSP dataset is intended to support research seeking to identify "hidden" information within the brain's activity during sleep that can be used to directly measure brain health. These brain health indicators include measures of risk for common neurologic diseases (cerebrovascular disease, Alzheimer's disease, and related neurodegenerative diseases of aging) and indicators of response to therapies, including lifestyle interventions (e.g. diet, meditation, exercise) and pharmacologic interventions.

Over time, we will add further data to enable research on the relationships between sleep and health, including medical diagnoses, medical testing and imaging results, brain images (MRI, CT, PET), genetics, and omics data.


As of 4/1/2023, the dataset includes 25,941 PSG recordings from the Massachusetts General Hospital’s (MGH) Sleep Lab in the Sleep Division. The PSG recordings were captured following the AASM standards and include thirteen signals. These signals comprise six channels of electroencephalography (EEG) at F3-M2, F4-M1, C3-M2, C4-M1, O1-M2, and O2-M1, based on the International 10/20 System; electrooculography (EOG) on the left side (EEG and EOG referenced to the contralateral ear lobe); electromyography (EMG) measured at the chin; two channels of respiration signals from the abdomen and chest; airflow and oxygen saturation (SaO2); and one ECG channel recorded below the right clavicle near the sternum and over the left lateral chest wall. All signals except SaO2 are measured at (or resampled to) a sampling frequency of 200 Hz. SaO2 signals have been upsampled to 200 Hz using sample and hold so that all signals are synchronized. All signals are measured in microvolts.
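The sample-and-hold upsampling mentioned above can be illustrated with a minimal sketch. It assumes the target rate is an integer multiple of the original rate; the 1 Hz input rate and the example values are hypothetical, since the original SaO2 sampling rate is not stated here, and this is not necessarily the procedure used to prepare the dataset.

```python
def sample_and_hold(signal, fs_in, fs_out):
    """Upsample by repeating each sample (zero-order hold).

    Assumes fs_out is an integer multiple of fs_in.
    """
    if fs_out % fs_in != 0:
        raise ValueError("fs_out must be an integer multiple of fs_in")
    factor = fs_out // fs_in
    # Each input sample is held for `factor` output samples.
    return [value for value in signal for _ in range(factor)]


# Hypothetical SaO2 trace at 1 Hz (assumed rate, illustration only).
spo2_1hz = [97.0, 96.0, 95.0]
spo2_200hz = sample_and_hold(spo2_1hz, fs_in=1, fs_out=200)
```

Because values are only repeated, the upsampled trace contains no information beyond the original samples; it simply shares a time base with the 200 Hz channels.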

All HSP data is shared under protocols reviewed by appropriate local Institutional Review Boards (IRBs). Data is deidentified following the Safe Harbor Method.

Data Description

The Human Sleep Project dataset includes 26,200 PSG studies conducted on 19,492 distinct patients. The studies are annotated as outlined below:

  • Sleep stages were annotated by certified sleep technologists as part of routine clinical care, according to the American Academy of Sleep Medicine (AASM) manual for the scoring of sleep. Stages were annotated in contiguous 30-second intervals and include: wakefulness (W), non-REM stage 1 (N1), non-REM stage 2 (N2), non-REM stage 3 (N3), and rapid eye movement (REM) sleep.
  • Arousals are annotated and classified as spontaneous arousals, respiratory effort-related arousals (RERAs), or arousals associated with other events, including bruxism (teeth grinding), hypoventilation, hypopneas, apneas (central, obstructive, and mixed), vocalizations, snores, and periodic leg movements.
  • Respiratory events are scored as obstructive apnea, central apnea, mixed apnea, hypopnea, and respiratory effort-related arousal. 
  • Periodic limb movements and isolated limb movements are scored. 
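As a rough illustration of working with 30-second stage annotations, the sketch below summarizes a list of per-epoch stage labels. The label strings ("W", "N1", "N2", "N3", "REM") follow the abbreviations above, but how stages are actually encoded in the annotation files is an assumption, and the example hypnogram is made up.

```python
from collections import Counter

EPOCH_SECONDS = 30                       # epoch length stated in the text
SLEEP_STAGES = {"N1", "N2", "N3", "REM"}  # every stage except wakefulness


def summarize_hypnogram(stages):
    """Return (minutes per stage, total sleep time in minutes)."""
    counts = Counter(stages)
    minutes = {stage: n * EPOCH_SECONDS / 60 for stage, n in counts.items()}
    total_sleep = sum(m for stage, m in minutes.items() if stage in SLEEP_STAGES)
    return minutes, total_sleep


# Made-up sequence of eight epochs (four minutes of recording).
example = ["W", "W", "N1", "N2", "N2", "N3", "REM", "W"]
minutes, total_sleep_minutes = summarize_hypnogram(example)
```

Here three epochs are wake and five are sleep, so the total sleep time is 2.5 minutes.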


Dataset Folder Structure:

The folder structure follows the BIDS (Brain Imaging Data Structure) specification version 1.7.0 for organizing EEG (electroencephalogram) data collected from multiple sites.

There are four main levels in the folder hierarchy:

bids -> sub-ID -> ses-ID -> eeg

	├── dataset_description.json
	├── participants.json
	├── participants.tsv
	└── sub-Id/
		└── ses-01/
			├── sub-SiteIdPatientId_ses-01_scans.tsv
			└── eeg/
				├── sub-Id_ses-1_task-psg_annotations.csv
				├── sub-Id_ses-1_task-psg_channels.tsv
				├── sub-Id_ses-1_task-psg_eeg.edf
				├── sub-Id_ses-1_task-psg_eeg.json
				└── sub-Id_ses-1_task-psg_pre.csv



1. Top Level: BIDS (root-folder)

The top-level files provide metadata and general information about the dataset:

  • dataset_description.json: A description of the dataset.
  • participants.json: Metadata definitions for columns in participants.tsv.
  • participants.tsv: A list of participants with demographic and physical details.
  • README: General information and notes about the dataset.

2. Subject Level: sub-SiteIdPatientId

Each folder at this level represents a distinct patient. The subject ID is a combination of the study site ID and the patient's unique ID. All studies related to a specific patient can be found within their corresponding folder.

3. Session Level: ses-XX

Within each participant's folder, individual sessions correspond to separate PSG studies, labeled in chronological order.

  • sub-SiteIdPatientId_ses-01_scans.tsv: lists all PSG file names and their acquisition times for a session.

4. PSG Data Level: eeg/

The eeg subdirectory within each session contains:

  • Annotations: e.g., sub-SiteIdPatientId_ses-01_task-psg_annotations.csv.
  • Data File: e.g., sub-SiteIdPatientId_ses-01_task-psg_eeg.edf.
  • Metadata: e.g., sub-SiteIdPatientId_ses-01_task-psg_eeg.json.
  • Channels Description: e.g., sub-SiteIdPatientId_ses-01_task-psg_channels.tsv.
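The naming pattern described above can be turned into paths programmatically. The following is a minimal sketch, not an official helper: the function name, the example subject ID, and the root folder name are hypothetical. The session label is passed through verbatim because its zero-padding is an assumption, as is the use of the same label for the folder and the file names. The annotations extension follows the file list above (.csv).

```python
from pathlib import PurePosixPath


def psg_paths(root, subject_id, session_label):
    """Build the expected file paths for one PSG session.

    subject_id: SiteId + PatientId, without the "sub-" prefix.
    session_label: used verbatim (e.g. "1" or "01"); padding is an assumption.
    """
    sub = f"sub-{subject_id}"
    ses = f"ses-{session_label}"
    eeg_dir = PurePosixPath(root) / sub / ses / "eeg"
    stem = f"{sub}_{ses}_task-psg"
    return {
        "annotations": eeg_dir / f"{stem}_annotations.csv",
        "channels": eeg_dir / f"{stem}_channels.tsv",
        "edf": eeg_dir / f"{stem}_eeg.edf",
        "metadata": eeg_dir / f"{stem}_eeg.json",
    }


# Hypothetical subject ID, for illustration only.
paths = psg_paths("bids", "S0001123456", "01")
```

Checking that each returned path exists before loading is a cheap way to detect studies with missing annotation or metadata files.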


Metadata File

Along with the dataset, a CSV file is provided to assist in identifying the locations of specific studies. This CSV can be found in the Files section below and is structured as follows:

Column Name            Description
SiteID                 Unique identifier of the hospital where the PSG was recorded.
BDSPPatientID          Unique identifier of the patient.
CreationTime           De-identified timestamp when the PSG was recorded.
BidsFolder             Folder where studies for a specific patient are available in the BDSP OpenData Repository.
SessionID              Folder in the BDSP OpenData Repository containing a specific study and its auxiliary files for a particular patient.
PreSleepQuestionnaire  Flag indicating if the study has a pre-sleep questionnaire.
HasAnnotations         Flag indicating if the study has annotations.
HasStaging             Flag indicating if the study has recorded sleep stages.
StudyType              Type of PSG study.
AgeAtVisit             Age of the patient at the time of the study.
SexDSC                 Patient gender.
BDSPLastModifiedDTS    The last time the record was updated.
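A metadata file with these columns can be used to pick out studies of interest before downloading anything. The sketch below selects studies that have both annotations and sleep staging. The column names come from the table above; the sample rows, the "Y"/"N" flag encoding, and the folder naming in the sample are assumptions made for illustration, so the `true_value` parameter should be adjusted to whatever the real file uses.

```python
import csv
import io

# Made-up rows; the flag encoding ("Y"/"N") and ID formats are assumptions.
SAMPLE = """SiteID,BDSPPatientID,BidsFolder,SessionID,HasAnnotations,HasStaging,AgeAtVisit
S0001,111,sub-S0001111,ses-1,Y,Y,54
S0001,222,sub-S0001222,ses-1,Y,N,61
"""


def studies_with_staging(csv_file, true_value="Y"):
    """Return (BidsFolder, SessionID) pairs that have annotations and staging."""
    reader = csv.DictReader(csv_file)
    return [
        (row["BidsFolder"], row["SessionID"])
        for row in reader
        if row["HasAnnotations"] == true_value and row["HasStaging"] == true_value
    ]


selected = studies_with_staging(io.StringIO(SAMPLE))
```

With a real metadata file, the same function can be called on an opened file handle instead of the in-memory sample.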

Usage Notes

The PSG signal data is available in .edf files, and the annotations are available as .csv files. We have ensured the de-identification of all files on the platform, with no names or real dates included. Dates have been shifted to protect the privacy of the participants.
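For a quick look at an .edf file without loading the signals, the fixed-width ASCII header defined by the EDF format can be read directly. The sketch below parses the 256-byte fixed header and the channel labels that follow it, and exercises the parser on a synthetic header built in memory. The channel labels and field values in the synthetic header are made up; the labels in real HSP files may differ. For reading the signals themselves, a dedicated EDF library is the more practical choice.

```python
import io

# (field name, width in bytes) of the fixed 256-byte EDF header.
FIXED_FIELDS = [
    ("version", 8), ("patient_id", 80), ("recording_id", 80),
    ("start_date", 8), ("start_time", 8), ("header_bytes", 8),
    ("reserved", 44), ("n_records", 8), ("record_duration", 8),
    ("n_signals", 4),
]


def read_edf_header(fh):
    """Parse the fixed EDF header and the 16-byte label of each signal."""
    header = {}
    for name, width in FIXED_FIELDS:
        header[name] = fh.read(width).decode("ascii").strip()
    n_signals = int(header["n_signals"])
    header["labels"] = [
        fh.read(16).decode("ascii").strip() for _ in range(n_signals)
    ]
    return header


def _pad(text, width):
    """Left-justify text in a fixed-width ASCII field."""
    return text.ljust(width).encode("ascii")


# Synthetic two-channel header (made-up values, illustration only).
fake_header = b"".join([
    _pad("0", 8), _pad("X X X X", 80), _pad("Startdate X X X X", 80),
    _pad("01.01.85", 8), _pad("22.00.00", 8), _pad(str(256 + 2 * 256), 8),
    _pad("", 44), _pad("-1", 8), _pad("1", 8), _pad("2", 4),
    _pad("F3-M2", 16), _pad("C3-M2", 16),
])
header = read_edf_header(io.BytesIO(fake_header))
```

On a real file, the same function can be called with a handle opened in binary mode, e.g. `open(path, "rb")`.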

We have provided detailed documentation on how to read and work with the files, which can be found on our GitHub repository. 

Code for automated scoring of events will be available on the CAISR GitHub repository.

Release Notes

By requesting access to the data, you agree not to download, copy, repost, publish, or otherwise share any work that uses the data, in full or in part, without written consent from the BDSP leadership team. This condition will be loosened at a later date.


Data collection and sharing for the HSP is performed under Institutional Review Board (IRB) approvals and data sharing agreements among participating hospitals, with waiver of the requirement for informed consent. HSP data is generated as part of usual patient care. All data is deidentified. 


The Human Sleep Project has received support from the Glenn Foundation and the American Federation for Aging Research (AFAR) through the 2018 Glenn / AFAR Award for Medical Research Breakthroughs in Gerontology (BIG), the American Academy of Sleep Medicine (AASM) through a 2019 Strategic Research Award, the National Institutes of Health (NIH) (R01NS102190, R01NS102574, R01NS107291, RF1AG064312, RF1NS120947, R01AG073410, R01HL161253, R01NS126282, R01AG073598), the National Science Foundation (NSF 2014431), and the Henry and Allison McCance Center for Brain Health.

Conflicts of Interest

MBW is a co-founder of Beacon Biosignals. Beacon Biosignals did not contribute funding and played no role in this work.


Access Policy:
Only registered users who sign the specified data use agreement can access the files.

License (for files):
BDSP Restricted Health Data License 1.0.0

Data Use Agreement:
BDSP Restricted Health Data Use Agreement

Versions:
  • 1.0 - May 23, 2023
  • 2.0 - Nov. 1, 2023