Name: Narcolepsy Risk Estimation from Clinical Notes
Published: March 2, 2026
License: https://github.com/bdsp-core/bdsp-license-and-dua

Database Credentialed Access

Niels Turley , Haoqi Sun , M Brandon Westover

Version: 1.0

When using this resource, please cite:

MLA: Turley, Niels, et al. "Narcolepsy Risk Estimation from Clinical Notes" (version 1.0). Brain Data Science Platform (2026), https://doi.org/10.60508/ac4f-q855.
APA: Turley, N., Sun, H., & Westover, M. B. (2026). Narcolepsy Risk Estimation from Clinical Notes (version 1.0). Brain Data Science Platform. https://doi.org/10.60508/ac4f-q855.
Chicago: Turley, Niels, Sun, Haoqi, and M Brandon Westover. "Narcolepsy Risk Estimation from Clinical Notes" (version 1.0). Brain Data Science Platform (2026). https://doi.org/10.60508/ac4f-q855.
Harvard: Turley, N., Sun, H., and Westover, M. B. (2026) 'Narcolepsy Risk Estimation from Clinical Notes' (version 1.0), Brain Data Science Platform. Available at: https://doi.org/10.60508/ac4f-q855.
Vancouver: Turley N, Sun H, Westover M B. Narcolepsy Risk Estimation from Clinical Notes (version 1.0). Brain Data Science Platform. 2026. Available from: https://doi.org/10.60508/ac4f-q855.

Abstract

Narcolepsy is a chronic neurological disorder that is often underdiagnosed and subject to long diagnostic delays. We developed and validated machine learning models to phenotype narcolepsy type 1 (NT1) and narcolepsy type 2/idiopathic hypersomnia (NT2/IH) using electronic health record (EHR) data from five sites within the Brain Data Science Platform (BDSP): Mass General Brigham, Beth Israel Deaconess Medical Center, Boston Children's Hospital, Stanford University, and Emory University. Clinical notes were manually annotated by physician reviewers following a standardized protocol, and model features were derived from ICD codes, medication orders, and natural language keyword extraction. For cross-sectional classification, we trained logistic regression, random forest, gradient boosting, and XGBoost models using nested leave-one-site-out cross-validation. NT1 classification achieved mean AUROCs of 0.991–0.994 and AUPRCs of 0.906–0.935; NT2/IH classification was more challenging, with mean AUROCs of 0.967–0.984 and AUPRCs of 0.692–0.778. For longitudinal prediction, we trained regularized logistic regression models (SGD with L1 penalty) on cumulative NLP features from pre-diagnostic notes, excluding the 6 months before diagnosis so that the models could not learn from diagnostic-workup features. Leave-one-site-out validation achieved AUROCs of 0.80 (any narcolepsy) and 0.79 (NT1), enabling identification of at-risk patients prior to clinical diagnosis. Here we release the associated data and code to support reproducible research in narcolepsy phenotyping from large-scale EHR data.

Background

Narcolepsy is a chronic neurological disorder characterized by excessive daytime sleepiness, with narcolepsy type 1 (NT1) additionally defined by cataplexy and/or hypocretin deficiency. Narcolepsy type 2 (NT2) and idiopathic hypersomnia (IH) share overlapping clinical features and are often grouped together as diagnoses of exclusion. These conditions are substantially underdiagnosed, with long delays between symptom onset and clinical diagnosis. Electronic health records (EHR) contain rich clinical information, including clinical notes, ICD codes, and medication records, that can support automated identification of these patients at scale. However, manual chart review, which remains the gold standard for diagnostic ascertainment, is impractical for large populations. This project develops machine learning models that integrate structured and unstructured EHR data to phenotype narcolepsy subtypes, enabling both cross-sectional classification and longitudinal prediction of narcolepsy risk prior to clinical diagnosis.

Methods

We analyzed data from 6,498 patients (8,990 clinical encounters) across five sites within the Brain Data Science Platform (BDSP): Mass General Brigham (MGB), Beth Israel Deaconess Medical Center (BIDMC), Boston Children's Hospital (BCH), Stanford University (SU), and Emory University (EU). Ground truth diagnoses were established through manual chart review by six physician annotators using a standardized operating procedure that incorporated CSF hypocretin levels, MSLT/PSG results, cataplexy documentation, and clinical assessment. Features were extracted from three data sources: ICD codes (3 features), medication orders (27 features), and keywords/phrases from clinical notes (894 features), yielding 924 features after filtering.

For cross-sectional classification, we trained logistic regression, random forest, gradient boosting, and XGBoost models to distinguish NT1 vs. others, NT2/IH vs. others, and combined narcolepsy vs. others. Model selection and evaluation used nested cross-validation with leave-one-site-out (LOSO) in the outer loop and 5-fold patient-level splits with Bayesian hyperparameter optimization in the inner loop.
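The nested scheme described above can be illustrated with a minimal sketch, assuming scikit-learn and synthetic data. This is not the released training code: the data, the site and patient labels, and the small grid search over `C` are all stand-ins. In particular, a plain grid search replaces the Bayesian hyperparameter optimization used in the study, and only logistic regression is shown.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, GroupKFold, LeaveOneGroupOut

rng = np.random.default_rng(0)

# Synthetic stand-in data: 60 patients, 2 visits each, spread over 5 sites.
n_patients, visits_per_patient, n_features = 60, 2, 10
patient_of_visit = np.repeat(np.arange(n_patients), visits_per_patient)
site_of_patient = np.arange(n_patients) % 5           # hypothetical sites 0..4
label_of_patient = (np.arange(n_patients) // 5) % 2   # both classes at every site

sites = site_of_patient[patient_of_visit]
y = label_of_patient[patient_of_visit]
X = rng.normal(size=(y.size, n_features))
X[:, 0] += 3.0 * y                                    # one informative feature

aurocs = {}
# Outer loop: hold out one whole site at a time.
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=sites):
    # Inner loop: 5-fold splits grouped by patient, so that visits from the
    # same patient never straddle a train/validation boundary.
    search = GridSearchCV(
        LogisticRegression(max_iter=1000),
        param_grid={"C": [0.01, 0.1, 1.0, 10.0]},     # grid search, not Bayesian
        scoring="roc_auc",
        cv=GroupKFold(n_splits=5),
    )
    search.fit(X[train_idx], y[train_idx], groups=patient_of_visit[train_idx])
    scores = search.predict_proba(X[test_idx])[:, 1]
    aurocs[int(sites[test_idx][0])] = roc_auc_score(y[test_idx], scores)

mean_auroc = float(np.mean(list(aurocs.values())))
print(aurocs, mean_auroc)
```

Because hyperparameters are chosen using only the training sites, each held-out site's AUROC is an estimate of performance at an unseen institution.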

For longitudinal prediction, we trained regularized logistic regression models using stochastic gradient descent (SGD) with L1 (lasso) penalty. Features were the same 924 NLP features as in the cross-sectional task, accumulated as cumulative counts over time. We applied chi-squared feature selection (top 100 features) and trained for 200 epochs on balanced minibatches, each containing one visit per case patient matched with an equal number of control visits. The training window was restricted to the interval from 2.5 years to 0.5 years before diagnosis, so that the model could not learn from diagnostic-workup features recorded in the final 6 months. L1 regularization strength was tuned via inner 3-fold cross-validation. Primary validation used stratified 5-fold CV; secondary validation used leave-one-site-out (LOSO) CV across all five sites.

This study was approved by the BIDMC Institutional Review Board.

Data Description

The dataset is organized into the following directories on S3:

Root files (legacy format):

feat.parquet — Extracted NLP features per clinical note
icd.parquet — ICD diagnosis codes per patient visit
med.parquet — Medication order records per patient visit
note.parquet — Clinical note text with patient identifiers and dates

discriminative-modeling/ — Data for the cross-sectional classification task:

notes.parquet — Clinical notes with patient IDs and dates
icd.parquet — ICD diagnosis codes
med.parquet — Medication records
features.parquet — Extracted 924-dimensional feature vectors per visit
bdsp_narco_pts.parquet — Patient-level metadata and cohort membership
bdsp_narco_swimmer.parquet — Swimmer plot data (patient clinical timelines)
predictive_annotation.parquet — Ground truth diagnostic annotations from physician chart review

predictive-modeling/ — Data for the longitudinal risk prediction task:

features.parquet, icd.parquet, med.parquet, notes.parquet — Input clinical data
diagnosis_annotation.parquet — Diagnosis timing annotations
groupings.parquet — Feature grouping definitions
nt1/ — Cumulative feature matrices for NT1 prediction (narcolepsy-positive, narcolepsy-negative, and control cohorts)
nt2ih/ — Cumulative feature matrices for NT2/IH prediction (same structure as nt1/)

results/ — Full evaluation results for two classification tasks:

nt1_vs_others/ — NT1 vs. non-narcolepsy: trained fold models (Logistic Regression, Random Forest, Gradient Boosting, XGBoost), confusion matrices, per-fold and average performance CSVs
nt2ih_vs_others/ — NT2/IH vs. non-narcolepsy: same structure as nt1_vs_others/

All tabular data are in Apache Parquet format. Trained models are serialized as Python pickle (.pkl) files. All data have been de-identified. Data were collected from five BDSP sites: Mass General Brigham, Beth Israel Deaconess Medical Center, Boston Children's Hospital, Stanford University, and Emory University.
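After downloading the dataset, the layout listed above can be checked with a short script. The sketch below uses only the Python standard library; the file and directory names are transcribed from this section, while the `missing_entries` helper and the local root path passed to it are hypothetical.

```python
from pathlib import Path

# Entries transcribed from the Data Description. Directories such as nt1/
# and nt1_vs_others/ are checked for existence only, not for their contents.
EXPECTED = {
    ".": ["feat.parquet", "icd.parquet", "med.parquet", "note.parquet"],
    "discriminative-modeling": [
        "notes.parquet", "icd.parquet", "med.parquet", "features.parquet",
        "bdsp_narco_pts.parquet", "bdsp_narco_swimmer.parquet",
        "predictive_annotation.parquet",
    ],
    "predictive-modeling": [
        "features.parquet", "icd.parquet", "med.parquet", "notes.parquet",
        "diagnosis_annotation.parquet", "groupings.parquet", "nt1", "nt2ih",
    ],
    "results": ["nt1_vs_others", "nt2ih_vs_others"],
}


def missing_entries(root):
    """Return the expected paths (relative to root) that do not exist."""
    root = Path(root)
    return [
        str(Path(folder) / name)
        for folder, names in EXPECTED.items()
        for name in names
        if not (root / folder / name).exists()
    ]
```

Calling `missing_entries` on the local download directory returns an empty list when every documented entry is present. Individual tables can then be read with, for example, `pandas.read_parquet`.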

Usage Notes

Code to reproduce all results is available at https://github.com/bdsp-core/NAX-Narcolepsy. The repository contains the following components:

discriminative-modeling/ — Cross-sectional note-level classification. Includes the NarcolepsyModel class for feature extraction (ICD codes, medications, NLP keywords with negation detection) and prediction, a model training/evaluation framework with leave-one-site-out cross-validation, and pre-trained Random Forest classifiers. See discriminative-model.ipynb for a complete walkthrough.
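To illustrate what keyword extraction with negation detection involves, here is a deliberately naive sketch using only the standard library. It is not the `NarcolepsyModel` implementation: the keyword list, the negation cues, the four-token look-back window, and the `keyword_counts` function are all hypothetical choices made for this example.

```python
import re

KEYWORDS = ["cataplexy", "sleep paralysis", "excessive daytime sleepiness"]  # hypothetical
NEGATION_CUES = {"no", "not", "denies", "denied", "without", "negative"}     # hypothetical
WINDOW = 4  # number of tokens before a keyword searched for a negation cue


def keyword_counts(note):
    """Count non-negated keyword mentions in a clinical note."""
    counts = {kw: 0 for kw in KEYWORDS}
    # Treat '.', ';' and newlines as sentence boundaries so that a negation
    # cue in one sentence does not affect keywords in the next.
    for sentence in re.split(r"[.;\n]+", note.lower()):
        tokens = re.findall(r"[a-z']+", sentence)
        for kw in KEYWORDS:
            kw_tokens = kw.split()
            n = len(kw_tokens)
            for i in range(len(tokens) - n + 1):
                if tokens[i:i + n] == kw_tokens:
                    preceding = tokens[max(0, i - WINDOW):i]
                    if not NEGATION_CUES.intersection(preceding):
                        counts[kw] += 1
    return counts


note = (
    "Patient reports excessive daytime sleepiness. "
    "Denies cataplexy; no history of sleep paralysis. "
    "Cataplexy was discussed."
)
print(keyword_counts(note))
# {'cataplexy': 1, 'sleep paralysis': 0, 'excessive daytime sleepiness': 1}
```

Counts of this kind, computed per note, are the sort of input that the keyword features in the dataset represent; the repository's own extraction rules should be consulted for the actual vocabulary and negation handling.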

predictive-modeling/ — Longitudinal pre-diagnostic risk scoring. The primary pipeline (risk_score_v2/) uses SGD logistic regression with L1 penalty for cumulative risk estimation. See predictive-model.ipynb for usage instructions. An alternative pooled logistic regression approach is also included (pooled-logistic-regression/).

paper_figures/ — Jupyter notebooks to reproduce the main figures from the manuscript, including confusion matrices, ROC/precision-recall curves, and swimmer plots of patient clinical timelines.

timeline-viewer/ — A git submodule linking to the timeline-viewer web application for reviewing and annotating patient clinical timelines.

Dependencies are listed in discriminative-modeling/env.toml. Key packages: polars, pandas, scikit-learn, NLTK, Ray (for parallel NLP processing), xgboost, matplotlib, seaborn.

Ethics

This study was conducted under IRB protocols approved and overseen by the BIDMC ethics committee (protocols 2024P000807, 2022P000417, 2024P000804), which granted a waiver of consent for retrospective analysis of de-identified EHR data.