PRediction Of Disease PHEnoTypes (PROPHET)

Niels Turley, Marta Fernandes, Shadi Sartipi, Han Wu, Alice Lam, Lydia Petersen, Catherine Clive, Daniel Sumsion, Ruoqi Wei, Bram Overmeer, Jaden Searle, Gregory Hooke, Spencer Boris, Wan-Yee Kong, Arjun Singh, Marjan Sarami, Alihan Yaramis, Imad Akbar, Rebecca Milde, Jet Veltink, Elijah Davis, Aditya Gupta, Manohar Ghanta, Aidan McDonald Wojciechowski, Shibani Mukerji, Haoqi Sun, M Brandon Westover, Sahar Zafar

Published: March 31, 2026. Version: 1.0


When using this resource, please cite:
Turley, N., Fernandes, M., Sartipi, S., Wu, H., Lam, A., Petersen, L., Clive, C., Sumsion, D., Wei, R., Overmeer, B., Searle, J., Hooke, G., Boris, S., Kong, W., Singh, A., Sarami, M., Yaramis, A., Akbar, I., Milde, R., ... Zafar, S. (2026). PRediction Of Disease PHEnoTypes (PROPHET) (version 1.0). Brain Data Science Platform. https://doi.org/10.60508/zkvt-gt36.

Additionally, please cite the original publication:

Sun C, Jing J, Turley N, Alcott C, Kang WY, Cole AJ, Goldenholz DM, Lam A, Amorim E, Chu C, Cash S, Junior VM, Gupta A, Ghanta M, Nearing B, Nascimento FA, Struck A, Kim J, Sartipi S, Tauton AM, Fernandes M, Sun H, Bayas G, Gallagher K, Wagenaar JB, Sinha N, Lee-Messer C, Silvers CT, Gunapati B, Rosand J, Peters J, Loddenkemper T, Lee JW, Zafar S, Westover MB. Harvard Electroencephalography Database: A comprehensive clinical electroencephalographic resource from four Boston hospitals. Epilepsia. 2025 Sep;66(9):3411-3425. doi: 10.1111/epi.18487. Epub 2025 Jun 4. PMID: 40464151; PMCID: PMC12455399.

Abstract

Large-scale electronic health record (EHR) phenotyping is essential for epidemiology, outcomes research, and clinical-trial recruitment, yet existing resources are largely single-center, limited to binary diagnostic labels, and lack computational efficiency for deployment across millions of clinical notes. No publicly available, multicenter annotated EHR resource exists for neurological phenotyping spanning diagnoses, severity scales, and outcomes.

We assembled a multicenter, de-identified EHR resource spanning six U.S. health systems (2010–2023) with expert-annotated reference standards. We developed a high-throughput phenotyping framework—Prophet (PRediction Of Disease PHEnoTypes)—combining machine learning and natural language processing (NLP) across routinely available EHR data types. Prophet is architected for scale and speed, enabling analysis of millions of clinical notes in hours at low marginal cost. The modular design supports rapid integration of additional phenotypes. We evaluated generalizability using leave-one-site-out cross-validation with nested hyperparameter optimization.

The resource covers 17 neurological phenotypes across 18,282 unique patients and 34,162 annotated clinical visits, encompassing acute (e.g., traumatic brain injury, stroke) and chronic (e.g., Parkinson’s disease, epilepsy) diagnoses, as well as severity and outcomes (NIH Stroke Scale, modified Rankin Scale). We release this large, multicenter, expert-annotated EHR dataset and the validated, open-source phenotyping framework to enable scalable neurological EHR research.


Background

Large-scale observational research in neurology depends on accurate patient identification, disease severity characterization, and outcome tracking across heterogeneous clinical populations. Electronic health records (EHR) offer an unprecedented opportunity to do this at scale, spanning millions of patient encounters across diverse health systems. Realizing this potential requires reliable automated methods to extract clinically meaningful phenotypes from both structured administrative data and unstructured clinical notes.

EHR phenotyping has advanced considerably over the past decade. Rule-based methods using ICD and CPT codes are efficient but have well-documented accuracy limitations, particularly where administrative coding is inconsistent or nuanced clinical distinctions are important. Machine learning incorporating structured and unstructured data improves performance and generalizability, and natural language processing (NLP) can extract phenotypes that coded data miss. However, most published phenotyping work has been single-center, validated only on internal test sets, and released without underlying annotated data, blocking independent replication, cross-institutional validation, and methodological extension.

Because annotated resources are not shared, researchers must recreate annotation pipelines from scratch and cannot assess cross-institutional generalizability. As large language models (LLMs) emerge as promising phenotyping tools, the lack of high-quality annotated EHR data creates a bottleneck for evaluation and cost-efficient deployment at scale. This project addresses this gap by providing a large, multicenter, de-identified EHR dataset with expert annotations, together with the Prophet open-source phenotyping framework.


Methods

The study was conducted under Institutional Review Board (IRB) protocols approved by Beth Israel Deaconess Medical Center (BIDMC), Mass General Brigham (MGB), Boston Children’s Hospital (BCH), Stanford University, Emory University, and Kaiser Permanente. Informed consent was waived as this was a secondary analysis of de-identified data.

We obtained EHR data from all participating sites for the period 2010 to 2023. The data included demographics, ICD and CPT codes, medications, clinical notes, visit type (inpatient vs. outpatient), and admission type (emergency, urgent, elective). Ground-truth diagnoses were established through manual chart review by trained physician annotators following standard operating procedures specific to each phenotype.

Features were extracted from multiple EHR data sources: ICD diagnosis codes, CPT procedure codes, medication orders, and keywords/phrases from clinical notes using NLP with negation detection. The Prophet framework uses a modular architecture where each phenotype has a configurable feature extraction pipeline and classifier. Models were trained using nested leave-one-site-out cross-validation with hyperparameter optimization in the inner loop.
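To illustrate what keyword extraction with negation detection can look like, the following is a minimal sketch using only the Python standard library. It is not the Prophet implementation: the cue list, the five-token look-back window, and the function name keyword_features are assumptions chosen for illustration.

```python
import re

# Hypothetical cue list and window size; Prophet's actual rules may differ.
NEGATION_CUES = ("no", "not", "denies", "without", "negative for")
WINDOW = 5  # number of tokens before a keyword that are searched for a cue


def keyword_features(note: str, keywords: list[str]) -> dict[str, dict[str, int]]:
    """Count affirmed and negated mentions of each keyword in a note."""
    features = {kw: {"affirmed": 0, "negated": 0} for kw in keywords}
    # Split into rough sentences so a cue cannot negate across sentences.
    for sentence in re.split(r"[.;\n]+", note.lower()):
        for kw in keywords:
            for match in re.finditer(rf"\b{re.escape(kw.lower())}\b", sentence):
                # Look only at the few tokens immediately before the keyword.
                preceding = sentence[: match.start()].split()[-WINDOW:]
                context = " " + " ".join(preceding) + " "
                negated = any(f" {cue} " in context for cue in NEGATION_CUES)
                features[kw]["negated" if negated else "affirmed"] += 1
    return features


example = "Patient denies seizure. History of stroke in 2015; no evidence of new stroke."
print(keyword_features(example, ["seizure", "stroke"]))
```

In this example, "seizure" is counted once as negated, and "stroke" once as affirmed and once as negated. A real pipeline would need a richer cue list and handling of punctuation, scope terminators, and uncertainty terms.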

We additionally evaluated a two-step hybrid approach in which Prophet serves as a high-throughput filter, routing a small fraction of high-yield notes to a large language model (LLM) for a semantic second pass, reducing LLM compute costs while preserving classification accuracy.
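The routing logic of such a two-step approach can be sketched as follows. This is an illustration under stated assumptions, not the evaluated pipeline: "high-yield" is assumed here to mean a first-pass score at or above a threshold, and prophet_score, llm_classify, and the default threshold of 0.5 are hypothetical stand-ins supplied by the caller.

```python
from typing import Callable


def two_step_classify(
    notes: list[str],
    prophet_score: Callable[[str], float],  # hypothetical first-pass scorer
    llm_classify: Callable[[str], bool],    # hypothetical LLM second pass
    threshold: float = 0.5,                 # assumed routing cutoff
) -> tuple[list[bool], int]:
    """Return one label per note plus the number of notes sent to the LLM."""
    labels = []
    llm_calls = 0
    for note in notes:
        if prophet_score(note) >= threshold:
            # High-yield note: spend an LLM call on a semantic second pass.
            labels.append(llm_classify(note))
            llm_calls += 1
        else:
            # Low score: accept the cheap first-pass negative.
            labels.append(False)
    return labels, llm_calls
```

Returning the call count alongside the labels makes the cost side of the trade-off explicit: lowering the threshold sends more notes to the LLM, while raising it saves compute at the risk of missing positives in the first pass.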


Data Description

The dataset is organized into 17 subdirectories, one per neurological phenotype. Each subdirectory contains de-identified, visit-level data in Apache Parquet format:

Phenotype directories:

  • brain_tumor/ — Brain tumor diagnosis
  • cardiac_arrest/ — Cardiac arrest
  • congestive_heart_failure/ — Congestive heart failure
  • epilepsy/ — Epilepsy diagnosis
  • epilepsy_subtypes/ — Epilepsy subtype classification
  • intracranial_hemorrhage/ — Intracranial hemorrhage
  • ischemic_stroke/ — Ischemic stroke
  • mild_cognitive_impairment_alzhiemers_disease/ — Mild cognitive impairment and Alzheimer’s disease
  • modified_rankin_score/ — Modified Rankin Scale (functional outcome)
  • narcolepsy/ — Narcolepsy (NT1 and NT2/IH)
  • neuroinfectious_diseases/ — Neuroinfectious diseases
  • nih_stroke_scale/ — NIH Stroke Scale (severity)
  • parkinsons_disease/ — Parkinson’s disease
  • subarachnoid_hemorrhage/ — Subarachnoid hemorrhage
  • subdural_hematoma/ — Subdural hematoma
  • traumatic_brain_injury/ — Traumatic brain injury
  • withdrawal_of_life_sustaining_therapy/ — Withdrawal of life-sustaining therapy

Common file types within each phenotype directory:

  • annot.parquet — Expert-annotated ground truth labels from physician chart review
  • demo.parquet — De-identified patient demographics (age, sex, race, ethnicity, site)
  • feat.parquet — Extracted NLP features (keyword counts, negation flags) per clinical note
  • note.parquet — De-identified clinical note text with patient identifiers and dates
  • train_input.parquet — Combined feature matrix used as model input for training and evaluation
  • icd.parquet — ICD diagnosis codes per patient visit (present for most phenotypes)
  • med.parquet — Medication records per patient visit (present for select phenotypes)
  • cpt.parquet — CPT procedure codes (present for subdural hematoma)

The dataset spans 18,282 unique patients and 34,162 annotated clinical visits from six U.S. academic medical centers: Beth Israel Deaconess Medical Center, Massachusetts General Hospital, Boston Children’s Hospital, Stanford University, Emory University, and Kaiser Permanente. All data has been de-identified. Total size: approximately 451 MB (102 files).
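Because some of the file types listed above are present only for certain phenotypes, it can help to check what each directory contains before loading anything. The sketch below uses only the standard library; the function name inventory and the root path passed to it are placeholders, and the file stems are taken from the list above.

```python
from pathlib import Path

# File stems taken from the file-type list; adjust if the release differs.
EXPECTED = ("annot", "demo", "feat", "note", "train_input", "icd", "med", "cpt")


def inventory(root: str) -> dict[str, dict[str, bool]]:
    """Map each phenotype directory to which expected Parquet files it holds."""
    table = {}
    for phenotype_dir in sorted(Path(root).iterdir()):
        if not phenotype_dir.is_dir():
            continue  # skip stray files at the dataset root
        present = {p.stem for p in phenotype_dir.glob("*.parquet")}
        table[phenotype_dir.name] = {stem: stem in present for stem in EXPECTED}
    return table
```

For example, inventory("path/to/dataset")["subdural_hematoma"]["cpt"] reports whether cpt.parquet exists for that phenotype, where "path/to/dataset" stands for wherever the files were downloaded.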


Usage Notes

Code to reproduce all results is available at https://github.com/bdsp-core/prophet. The Prophet (PRediction Of Disease PHEnoTypes) framework provides a modular, configurable pipeline for EHR phenotyping that can be applied to the data in this repository or extended to new phenotypes.

Each phenotype directory is self-contained: load the parquet files for a given phenotype to access the clinical notes, extracted features, and expert annotations for that condition. The train_input.parquet file in each directory provides the combined feature matrix ready for model training and evaluation.

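The data can be loaded with standard Python libraries (pandas, polars, pyarrow). Note that pandas.read_parquet requires a Parquet engine such as pyarrow to be installed. Example:

import pandas as pd

# Paths are relative to the dataset root; each phenotype has its own directory.
annot = pd.read_parquet('epilepsy/annot.parquet')  # expert-annotated labels
feat = pd.read_parquet('epilepsy/feat.parquet')    # extracted NLP features
notes = pd.read_parquet('epilepsy/note.parquet')   # de-identified note text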

For cross-site validation experiments, the demo.parquet file contains site identifiers enabling leave-one-site-out evaluation.
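A nested leave-one-site-out split can be built from the site labels alone. The sketch below is a standard-library illustration rather than the code in the Prophet repository; it takes a plain sequence of site labels, one per row, because the name of the site column in demo.parquet is not specified here. The function names are hypothetical.

```python
from collections.abc import Iterator, Sequence


def leave_one_site_out(sites: Sequence[str]) -> Iterator[tuple[str, list[int], list[int]]]:
    """Yield (held_out_site, train_indices, test_indices) for each site."""
    for held_out in sorted(set(sites)):
        train = [i for i, s in enumerate(sites) if s != held_out]
        test = [i for i, s in enumerate(sites) if s == held_out]
        yield held_out, train, test


def nested_splits(sites: Sequence[str]):
    """Yield outer folds, each with inner folds drawn from training sites only."""
    for outer_site, outer_train, outer_test in leave_one_site_out(sites):
        inner_sites = [sites[i] for i in outer_train]
        # Map inner-fold positions back to row indices of the full dataset.
        inner = [
            (site, [outer_train[i] for i in tr], [outer_train[i] for i in va])
            for site, tr, va in leave_one_site_out(inner_sites)
        ]
        yield outer_site, outer_train, outer_test, inner
```

Hyperparameters would be chosen on the inner folds and the selected model scored once on the outer held-out site, so that site never influences tuning. With scikit-learn available, LeaveOneGroupOut produces the same kind of outer split when given the site labels as groups.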


Ethics

This study was conducted under IRB protocols approved by Beth Israel Deaconess Medical Center (BIDMC), Mass General Brigham (MGB), Boston Children’s Hospital (BCH), Stanford University, Emory University, and Kaiser Permanente (BIDMC IRB #s: 2024P000804, 2024P000807, 2022P000417, 2022P000481; MGB IRB #s: 2023P000487, 2013P001024; sites other than BIDMC and MGB ceded review to the BIDMC IRB). Informed consent was waived as this was a secondary analysis of de-identified data.


Acknowledgements

MBW’s laboratory is supported by grants from the NIH (R01AG073410, R01HL161253, R01NS126282, R01AG073598, R01NS131347, R01NS130119) and AWS.


Conflicts of Interest

MBW is a co-founder, serves as a scientific advisor and consultant to, and has a personal equity interest in Beacon Biosignals. Beacon Biosignals did not contribute funding and played no role in this work. SFZ received royalties from Springer and Wolters Kluwer.


Parent Projects
PRediction Of Disease PHEnoTypes (PROPHET) was derived from: Please cite them when using this project.
Access

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
BDSP Credentialed Health Data License 1.5.0

Data Use Agreement:
BDSP Credentialed Health Data Use Agreement

Required training:
