Database Credentialed Access
Narcolepsy Risk Estimation from Clinical Notes
Niels Turley, Haoqi Sun, M Brandon Westover
Published: March 2, 2026. Version: 1.0
When using this resource, please cite:
Turley, N., Sun, H., & Westover, M. B. (2026). Narcolepsy Risk Estimation from Clinical Notes (version 1.0). Brain Data Science Platform. https://doi.org/10.60508/ac4f-q855.
Abstract
Narcolepsy is a chronic neurological disorder that is often underdiagnosed and subject to long diagnostic delays. We developed and validated machine learning models to phenotype narcolepsy type 1 (NT1) and narcolepsy type 2/idiopathic hypersomnia (NT2/IH) using electronic health record (EHR) data from five sites within the Brain Data Science Platform (BDSP): Mass General Brigham, Beth Israel Deaconess Medical Center, Boston Children's Hospital, Stanford University, and Emory University. Clinical notes were manually annotated by physician reviewers following a standardized protocol, and model features were derived from ICD codes, medication orders, and natural language keyword extraction. For cross-sectional classification, we trained logistic regression, random forest, gradient boosting, and XGBoost models using nested leave-one-site-out cross-validation. NT1 classification achieved mean AUROCs of 0.991–0.994 and AUPRCs of 0.906–0.935; NT2/IH classification was more challenging, with mean AUROCs of 0.967–0.984 and AUPRCs of 0.692–0.778. For longitudinal prediction, we trained regularized logistic regression models (SGD with L1 penalty) using cumulative NLP features from pre-diagnostic notes, with a 6-month horizon exclusion to prevent learning from diagnostic-workup features. Leave-one-site-out validation achieved AUROCs of 0.80 (any narcolepsy) and 0.79 (NT1), enabling identification of at-risk patients prior to clinical diagnosis. We release the associated data and code to support reproducible research in narcolepsy phenotyping from large-scale EHR data.
Background
Narcolepsy is a chronic neurological disorder characterized by excessive daytime sleepiness, with narcolepsy type 1 (NT1) additionally defined by cataplexy and/or hypocretin deficiency. Narcolepsy type 2 (NT2) and idiopathic hypersomnia (IH) share overlapping clinical features and are often grouped together as diagnoses of exclusion. Both conditions are substantially underdiagnosed, with long delays between symptom onset and clinical diagnosis. Electronic health records (EHR) contain rich clinical information—including clinical notes, ICD codes, and medication records—that can support automated identification of these patients at scale. However, manual chart review remains the gold standard for diagnostic ascertainment and is impractical for large populations. This project develops machine learning models that integrate structured and unstructured EHR data to phenotype narcolepsy subtypes, enabling both cross-sectional classification and longitudinal prediction of narcolepsy risk prior to clinical diagnosis.
Methods
We analyzed data from 6,498 patients (8,990 clinical encounters) across five sites within the Brain Data Science Platform (BDSP): Mass General Brigham (MGB), Beth Israel Deaconess Medical Center (BIDMC), Boston Children's Hospital (BCH), Stanford University (SU), and Emory University (EU). Ground truth diagnoses were established through manual chart review by six physician annotators using a standardized operating procedure that incorporated CSF hypocretin levels, MSLT/PSG results, cataplexy documentation, and clinical assessment. Features were extracted from three data sources: ICD codes (3 features), medication orders (27 features), and keywords/phrases from clinical notes (894 features), yielding 924 features after filtering.
For cross-sectional classification, we trained logistic regression, random forest, gradient boosting, and XGBoost models to distinguish NT1 vs. others, NT2/IH vs. others, and combined narcolepsy vs. others. Model selection and evaluation used nested cross-validation with leave-one-site-out (LOSO) in the outer loop and 5-fold patient-level splits with Bayesian hyperparameter optimization in the inner loop.
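The nested validation scheme can be illustrated with a minimal sketch. It is not the project's training code: the data are synthetic, only logistic regression is shown, and a small grid search stands in for the Bayesian hyperparameter optimization. The variable names (`site`, `patient_id`) and the grid of `C` values are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, GroupKFold, LeaveOneGroupOut

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): 600 visits, 20 features, 200 patients.
n_visits, n_features = 600, 20
X = rng.normal(size=(n_visits, n_features))
y = (X[:, 0] + 0.5 * rng.normal(size=n_visits) > 1.0).astype(int)
patient_id = rng.integers(0, 200, size=n_visits)
site = patient_id % 5  # each patient belongs to exactly one of 5 sites

aurocs, auprcs = [], []
# Outer loop: hold out one site at a time.
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=site):
    # Inner loop: 5-fold splits that keep all visits of a patient together.
    search = GridSearchCV(
        LogisticRegression(max_iter=1000),
        param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # hypothetical grid
        scoring="roc_auc",
        cv=GroupKFold(n_splits=5),
    )
    search.fit(X[train_idx], y[train_idx], groups=patient_id[train_idx])
    # The refit best model is evaluated once on the held-out site.
    scores = search.predict_proba(X[test_idx])[:, 1]
    aurocs.append(roc_auc_score(y[test_idx], scores))
    auprcs.append(average_precision_score(y[test_idx], scores))

print(f"mean AUROC {np.mean(aurocs):.3f}, mean AUPRC {np.mean(auprcs):.3f}")
```

Grouping the inner folds by patient keeps visits from the same patient out of both the fitting and tuning partitions, and the held-out site is never seen during hyperparameter selection.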
For longitudinal prediction, we trained regularized logistic regression models using stochastic gradient descent (SGD) with an L1 (lasso) penalty. Features were the same 924 NLP features as in the cross-sectional task, accumulated as cumulative counts over time. We applied chi-squared feature selection (top 100 features) and trained on balanced minibatches for 200 epochs, with each minibatch containing one visit per case patient and an equal number of control visits. The training window was restricted to [-2.5 years, -0.5 years] before diagnosis to prevent the model from learning diagnostic-workup features. L1 regularization strength was tuned via inner 3-fold cross-validation. Primary validation used stratified 5-fold CV; secondary validation used leave-one-site-out (LOSO) CV across all five sites.
This study was approved by the BIDMC Institutional Review Board.
Data Description
The dataset is organized into the following directories on S3:
Root files (legacy format):
- feat.parquet: Extracted NLP features per clinical note
- icd.parquet: ICD diagnosis codes per patient visit
- med.parquet: Medication order records per patient visit
- note.parquet: Clinical note text with patient identifiers and dates
discriminative-modeling/ — Data for the cross-sectional classification task:
- notes.parquet: Clinical notes with patient IDs and dates
- icd.parquet: ICD diagnosis codes
- med.parquet: Medication records
- features.parquet: Extracted 924-dimensional feature vectors per visit
- bdsp_narco_pts.parquet: Patient-level metadata and cohort membership
- bdsp_narco_swimmer.parquet: Swimmer plot data (patient clinical timelines)
- predictive_annotation.parquet: Ground truth diagnostic annotations from physician chart review
predictive-modeling/ — Data for the longitudinal risk prediction task:
- features.parquet, icd.parquet, med.parquet, notes.parquet: Input clinical data
- diagnosis_annotation.parquet: Diagnosis timing annotations
- groupings.parquet: Feature grouping definitions
- nt1/: Cumulative feature matrices for NT1 prediction (narcolepsy-positive, narcolepsy-negative, and control cohorts)
- nt2ih/: Cumulative feature matrices for NT2/IH prediction (same structure as nt1/)
results/ — Full evaluation results for two classification tasks:
- nt1_vs_others/: NT1 vs. non-narcolepsy: trained fold models (Logistic Regression, Random Forest, Gradient Boosting, XGBoost), confusion matrices, per-fold and average performance CSVs
- nt2ih_vs_others/: NT2/IH vs. non-narcolepsy: same structure as nt1_vs_others/
All tabular data are stored in Apache Parquet format, and trained models are serialized as Python pickle (.pkl) files. All data have been de-identified. Data were collected from five BDSP sites: Mass General Brigham, Beth Israel Deaconess Medical Center, Boston Children's Hospital, Stanford University, and Emory University.
Usage Notes
Code to reproduce all results is available at https://github.com/bdsp-core/NAX-Narcolepsy. The repository contains the following components:
discriminative-modeling/ — Cross-sectional note-level classification. Includes the NarcolepsyModel class for feature extraction (ICD codes, medications, NLP keywords with negation detection) and prediction, a model training/evaluation framework with leave-one-site-out cross-validation, and pre-trained Random Forest classifiers. See discriminative-model.ipynb for a complete walkthrough.
predictive-modeling/ — Longitudinal pre-diagnostic risk scoring. The primary pipeline (risk_score_v2/) uses SGD logistic regression with L1 penalty for cumulative risk estimation. See predictive-model.ipynb for usage instructions. An alternative pooled logistic regression approach is also included (pooled-logistic-regression/).
paper_figures/ — Jupyter notebooks to reproduce the main figures from the manuscript, including confusion matrices, ROC/precision-recall curves, and swimmer plots of patient clinical timelines.
timeline-viewer/ — A git submodule linking to the timeline-viewer web application for reviewing and annotating patient clinical timelines.
Dependencies are listed in discriminative-modeling/env.toml. Key packages: polars, pandas, scikit-learn, NLTK, Ray (for parallel NLP processing), xgboost, matplotlib, seaborn.
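To give a sense of what keyword extraction with negation detection involves, here is a deliberately simple sketch. It is not the `NarcolepsyModel` implementation: the keyword list, the negation cues, the five-token look-back window, and the `_neg` feature naming are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Hypothetical cue and keyword lists; the released model uses its own vocabulary.
NEGATION_CUES = ("no", "not", "denies", "without", "negative for")
KEYWORDS = ("cataplexy", "sleep paralysis", "daytime sleepiness")


def keyword_features(note: str, window: int = 5) -> Counter:
    """Count affirmed and negated keyword mentions, sentence by sentence."""
    features = Counter()
    for sentence in re.split(r"[.;\n]+", note.lower()):
        for kw in KEYWORDS:
            for match in re.finditer(r"\b" + re.escape(kw) + r"\b", sentence):
                # Look for a negation cue in the few tokens before the keyword.
                context = " ".join(sentence[: match.start()].split()[-window:])
                negated = any(
                    re.search(r"\b" + re.escape(cue) + r"\b", context)
                    for cue in NEGATION_CUES
                )
                features[f"{kw}_neg" if negated else kw] += 1
    return features


note = (
    "Patient reports daytime sleepiness. "
    "Denies cataplexy or sleep paralysis. No history of seizures."
)
features = keyword_features(note)
print(dict(features))
```

On this example the sketch counts one affirmed mention of "daytime sleepiness" and one negated mention each of "cataplexy" and "sleep paralysis". A fixed look-back window is a crude heuristic; it misses negations that follow the keyword and can over-negate in long sentences.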
Ethics
This study was conducted under IRB protocols approved and overseen by the BIDMC ethics committee (protocols 2024P000807, 2022P000417, 2024P000804), which granted a waiver of consent for retrospective analysis of de-identified EHR data.
Conflicts of Interest
Dr. Westover is a co-founder, serves as a scientific advisor and consultant to, and has a personal equity interest in Beacon Biosignals.
Access
Access Policy:
Only credentialed users who sign the DUA can access the files.
License (for files):
BDSP Credentialed Health Data License 1.5.0
Data Use Agreement:
BDSP Credentialed Health Data Use Agreement
Required training:
CITI Data or Specimens Only Research
Discovery
DOI:
https://doi.org/10.60508/ac4f-q855
Project Website:
https://github.com/bdsp-core/NAX-Narcolepsy
Files
To access the files, you must:
- be a credentialed user
- complete the required training: CITI Data or Specimens Only Research
- sign the data use agreement for the project