Database Credentialed Access
PRediction Of Disease PHEnoTypes (PROPHET)
Niels Turley, Marta Fernandes, Shadi Sartipi, Han Wu, Alice Lam, Lydia Petersen, Catherine Clive, Daniel Sumsion, Ruoqi Wei, Bram Overmeer, Jaden Searle, Gregory Hooke, Spencer Boris, Wan-Yee Kong, Arjun Singh, Marjan Sarami, Alihan Yaramis, Imad Akbar, Rebecca Milde, Jet Veltink, Elijah Davis, Aditya Gupta, Manohar Ghanta, Aidan McDonald Wojciechowski, Shibani Mukerji, Haoqi Sun, M Brandon Westover, Sahar Zafar
Published: March 31, 2026. Version: 1.0
When using this resource, please cite:
Turley, N., Fernandes, M., Sartipi, S., Wu, H., Lam, A., Petersen, L., Clive, C., Sumsion, D., Wei, R., Overmeer, B., Searle, J., Hooke, G., Boris, S., Kong, W., Singh, A., Sarami, M., Yaramis, A., Akbar, I., Milde, R., ... Zafar, S. (2026). PRediction Of Disease PHEnoTypes (PROPHET) (version 1.0). Brain Data Science Platform. https://doi.org/10.60508/zkvt-gt36.
Abstract
Large-scale electronic health record (EHR) phenotyping is essential for epidemiology, outcomes research, and clinical-trial recruitment, yet existing resources are largely single-center, limited to binary diagnostic labels, and lack computational efficiency for deployment across millions of clinical notes. No publicly available, multicenter annotated EHR resource exists for neurological phenotyping spanning diagnoses, severity scales, and outcomes.
We assembled a multicenter, de-identified EHR resource spanning six U.S. health systems (2010–2023) with expert-annotated reference standards. We developed a high-throughput phenotyping framework—Prophet (PRediction Of Disease PHEnoTypes)—combining machine learning and natural language processing (NLP) across routinely available EHR data types. Prophet is architected for scale and speed, enabling analysis of millions of clinical notes in hours at low marginal cost. The modular design supports rapid integration of additional phenotypes. We evaluated generalizability using leave-one-site-out cross-validation with nested hyperparameter optimization.
The resource covers 17 neurological phenotypes across 18,282 unique patients and 34,162 annotated clinical visits, encompassing acute (e.g., traumatic brain injury, stroke) and chronic (e.g., Parkinson’s disease, epilepsy) diagnoses, as well as severity and outcomes (NIH Stroke Scale, modified Rankin Scale). We release this large, multicenter, expert-annotated EHR dataset and the validated, open-source phenotyping framework to enable scalable neurological EHR research.
Background
Large-scale observational research in neurology depends on accurate patient identification, disease severity characterization, and outcome tracking across heterogeneous clinical populations. Electronic health records (EHRs) offer an unprecedented opportunity to do this at scale, spanning millions of patient encounters across diverse health systems. Realizing this potential requires reliable automated methods to extract clinically meaningful phenotypes from both structured administrative data and unstructured clinical notes.
EHR phenotyping has advanced considerably over the past decade. Rule-based methods using ICD and CPT codes are efficient but have well-documented accuracy limitations, particularly where administrative coding is inconsistent or nuanced clinical distinctions are important. Machine learning incorporating structured and unstructured data improves performance and generalizability, and natural language processing (NLP) can extract phenotypes that coded data miss. However, most published phenotyping work has been single-center, validated only on internal test sets, and released without underlying annotated data, blocking independent replication, cross-institutional validation, and methodological extension.
Because annotated resources are not shared, researchers must recreate annotation pipelines from scratch and cannot assess cross-institutional generalizability. As large language models (LLMs) emerge as promising phenotyping tools, the lack of high-quality annotated EHR data creates a bottleneck for evaluation and cost-efficient deployment at scale. This project addresses this gap by providing a large, multicenter, de-identified EHR dataset with expert annotations, together with the Prophet open-source phenotyping framework.
Methods
The study was conducted under Institutional Review Board (IRB) protocols approved by Beth Israel Deaconess Medical Center (BIDMC), Mass General Brigham (MGB), Boston Children’s Hospital (BCH), Stanford University, Emory University, and Kaiser Permanente. Informed consent was waived as this was a secondary analysis of de-identified data.
We obtained EHR data from all participating sites for the period 2010 to 2023. The data included demographics, ICD and CPT codes, medications, clinical notes, visit type (inpatient vs. outpatient), and admission type (emergency, urgent, elective). Ground-truth diagnoses were established through manual chart review by trained physician annotators following standard operating procedures specific to each phenotype.
Features were extracted from multiple EHR data sources: ICD diagnosis codes, CPT procedure codes, medication orders, and keywords/phrases from clinical notes using NLP with negation detection. The Prophet framework uses a modular architecture where each phenotype has a configurable feature extraction pipeline and classifier. Models were trained using nested leave-one-site-out cross-validation with hyperparameter optimization in the inner loop.
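The keyword extraction with negation detection described above can be illustrated with a minimal sketch. This is a simplified, NegEx-style rule written for illustration, not the Prophet implementation: the function name `keyword_features`, the cue list, and the five-token window are all assumptions.

```python
import re
from collections import Counter

# Hypothetical negation cues and window size; not taken from the Prophet code base.
NEGATION_CUES = ("no", "not", "denies", "without", "negative for")
WINDOW = 5  # number of tokens before a keyword in which a cue negates it


def keyword_features(note, keywords):
    """Count affirmed and negated mentions of each keyword in one note."""
    text = note.lower()
    counts = Counter()
    for kw in keywords:
        pattern = r"\b" + re.escape(kw.lower()) + r"\b"
        for match in re.finditer(pattern, text):
            # Restrict the search for cues to the current sentence or line.
            last_period = text.rfind(".", 0, match.start())
            last_newline = text.rfind("\n", 0, match.start())
            sentence_start = max(last_period, last_newline) + 1
            preceding = text[sentence_start:match.start()]
            window = " ".join(preceding.split()[-WINDOW:])
            negated = any(
                re.search(r"\b" + re.escape(cue) + r"\b", window)
                for cue in NEGATION_CUES
            )
            suffix = "negated" if negated else "affirmed"
            counts[f"{kw}_{suffix}"] += 1
    return dict(counts)
```

Keeping affirmed and negated counts as separate features lets a downstream classifier learn that "denies seizure" and "history of seizure" carry opposite evidence, rather than discarding negated mentions outright.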
We additionally evaluated a two-step hybrid approach in which Prophet serves as a high-throughput filter, routing a small fraction of high-yield notes to a large language model (LLM) for a semantic second pass, reducing LLM compute costs while preserving classification accuracy.
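One plausible reading of this two-step design is a score threshold: notes that the first-stage model scores above a cutoff are forwarded to the LLM, and all others are labeled negative without an LLM call. The sketch below assumes exactly that. The function name, the `llm_classify` callable, and the 0.5 default cutoff are hypothetical, and the source does not state how high-yield notes are actually selected.

```python
def two_step_classify(notes, scores, llm_classify, threshold=0.5):
    """Label notes with a cheap first-stage filter followed by an LLM second pass.

    notes: iterable of note texts.
    scores: first-stage probabilities, one per note (e.g. from a trained classifier).
    llm_classify: hypothetical callable that takes note text and returns a truthy label.
    Returns (labels, n_llm_calls).
    """
    labels = []
    n_llm_calls = 0
    for note, score in zip(notes, scores):
        if score >= threshold:
            # Only high-scoring notes incur the cost of an LLM call.
            labels.append(bool(llm_classify(note)))
            n_llm_calls += 1
        else:
            # Low-scoring notes are labeled negative by the filter alone.
            labels.append(False)
    return labels, n_llm_calls
```

Under this scheme the LLM cost scales with the number of notes above the cutoff rather than with the full corpus, so the cutoff trades sensitivity of the first stage against compute.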
Data Description
The dataset is organized into 17 subdirectories, one per neurological phenotype. Each subdirectory contains de-identified, visit-level data in Apache Parquet format:
Phenotype directories:
- brain_tumor/: Brain tumor diagnosis
- cardiac_arrest/: Cardiac arrest
- congestive_heart_failure/: Congestive heart failure
- epilepsy/: Epilepsy diagnosis
- epilepsy_subtypes/: Epilepsy subtype classification
- intracranial_hemorrhage/: Intracranial hemorrhage
- ischemic_stroke/: Ischemic stroke
- mild_cognitive_impairment_alzhiemers_disease/: Mild cognitive impairment and Alzheimer’s disease
- modified_rankin_score/: Modified Rankin Scale (functional outcome)
- narcolepsy/: Narcolepsy (NT1 and NT2/IH)
- neuroinfectious_diseases/: Neuroinfectious diseases
- nih_stroke_scale/: NIH Stroke Scale (severity)
- parkinsons_disease/: Parkinson’s disease
- subarachnoid_hemorrhage/: Subarachnoid hemorrhage
- subdural_hematoma/: Subdural hematoma
- traumatic_brain_injury/: Traumatic brain injury
- withdrawal_of_life_sustaining_therapy/: Withdrawal of life-sustaining therapy
Common file types within each phenotype directory:
- annot.parquet: Expert-annotated ground truth labels from physician chart review
- demo.parquet: De-identified patient demographics (age, sex, race, ethnicity, site)
- feat.parquet: Extracted NLP features (keyword counts, negation flags) per clinical note
- note.parquet: De-identified clinical note text with patient identifiers and dates
- train_input.parquet: Combined feature matrix used as model input for training and evaluation
- icd.parquet: ICD diagnosis codes per patient visit (present for most phenotypes)
- med.parquet: Medication records per patient visit (present for select phenotypes)
- cpt.parquet: CPT procedure codes (present for subdural hematoma)
The dataset spans 18,282 unique patients and 34,162 annotated clinical visits from six U.S. academic medical centers: Beth Israel Deaconess Medical Center, Massachusetts General Hospital, Boston Children’s Hospital, Stanford University, Emory University, and Kaiser Permanente. All data has been de-identified. Total size: approximately 451 MB (102 files).
Usage Notes
Code to reproduce all results is available at https://github.com/bdsp-core/prophet. The Prophet (PRediction Of Disease PHEnoTypes) framework provides a modular, configurable pipeline for EHR phenotyping that can be applied to the data in this repository or extended to new phenotypes.
Each phenotype directory is self-contained: load the parquet files for a given phenotype to access the clinical notes, extracted features, and expert annotations for that condition. The train_input.parquet file in each directory provides the combined feature matrix ready for model training and evaluation.
The data can be loaded using standard Python libraries (pandas, polars, pyarrow). Example:
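import pandas as pd

# Expert-annotated labels from physician chart review
annot = pd.read_parquet('epilepsy/annot.parquet')
# Extracted NLP features per clinical note
feat = pd.read_parquet('epilepsy/feat.parquet')
# De-identified clinical note text
notes = pd.read_parquet('epilepsy/note.parquet')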
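Because icd.parquet, med.parquet, and cpt.parquet are present only for some phenotypes, a loader that skips missing files can be convenient. The helper below is a sketch: `load_phenotype` and its `reader` parameter are hypothetical names, and the file stems are those listed in the Data Description.

```python
from pathlib import Path

import pandas as pd

# File stems from the Data Description; icd, med, and cpt are not present everywhere.
FILE_STEMS = ("annot", "demo", "feat", "note", "train_input", "icd", "med", "cpt")


def load_phenotype(root, phenotype, reader=pd.read_parquet):
    """Read every known parquet file that exists in one phenotype directory.

    root: path to the downloaded dataset.
    phenotype: subdirectory name, e.g. "epilepsy".
    reader: callable taking a path; defaults to pandas.read_parquet.
    Returns a dict mapping file stem to the loaded table.
    """
    directory = Path(root) / phenotype
    tables = {}
    for stem in FILE_STEMS:
        path = directory / f"{stem}.parquet"
        if path.exists():
            tables[stem] = reader(path)
    return tables
```

Passing a different `reader` (for example `polars.read_parquet`) swaps the backend without changing the rest of the code. Since the column names of each file are not documented here, inspecting `tables["annot"].columns` and the other schemas is a sensible first step before joining tables.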
For cross-site validation experiments, the demo.parquet file contains site identifiers enabling leave-one-site-out evaluation.
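A nested leave-one-site-out evaluation of the kind described in the Methods could be sketched with scikit-learn as follows. The logistic-regression classifier, the parameter grid, and the function name `nested_loso_auc` are assumptions made for illustration; the reference implementation is in the Prophet repository. The function expects NumPy arrays, with one site label per row.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, LeaveOneGroupOut


def nested_loso_auc(X, y, sites, param_grid=None):
    """Return {held_out_site: AUROC} from nested leave-one-site-out CV.

    X: feature matrix, y: binary labels, sites: site label per row.
    Needs at least three sites so the inner loop still has two to split.
    """
    param_grid = param_grid or {"C": [0.1, 1.0, 10.0]}  # hypothetical grid
    results = {}
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=sites):
        # Inner loop: tune on the training sites only, again leaving one site out.
        search = GridSearchCV(
            LogisticRegression(max_iter=1000),
            param_grid,
            cv=LeaveOneGroupOut(),
            scoring="roc_auc",
        )
        search.fit(X[train_idx], y[train_idx], groups=sites[train_idx])
        # Outer loop: score the refit best model on the untouched site.
        proba = search.predict_proba(X[test_idx])[:, 1]
        held_out = str(sites[test_idx][0])
        results[held_out] = roc_auc_score(y[test_idx], proba)
    return results
```

With the released files, `X` would presumably come from train_input.parquet, `y` from annot.parquet, and `sites` from demo.parquet; the exact column names are not given here and should be checked against the files. AUROC requires both classes at every held-out site, so phenotypes with very few positives at a site may need a different metric.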
Ethics
This study was conducted under IRB protocols approved by Beth Israel Deaconess Medical Center (BIDMC), Mass General Brigham (MGB), Boston Children’s Hospital (BCH), Stanford University, Emory University, and Kaiser Permanente (BIDMC IRB #s: 2024P000804, 2024P000807, 2022P000417, 2022P000481; MGB IRB #s: 2023P000487, 2013P001024; sites other than BIDMC and MGB ceded review to the BIDMC IRB). Informed consent was waived as this was a secondary analysis of de-identified data.
Acknowledgements
MBW’s laboratory is supported by grants from the NIH (R01AG073410, R01HL161253, R01NS126282, R01AG073598, R01NS131347, R01NS130119) and AWS.
Conflicts of Interest
MBW is a co-founder, serves as a scientific advisor and consultant to, and has a personal equity interest in Beacon Biosignals. Beacon Biosignals did not contribute funding and played no role in this work. SFZ received royalties from Springer and Wolters Kluwer.
Access
Access Policy:
Only credentialed users who sign the DUA can access the files.
License (for files):
BDSP Credentialed Health Data License 1.5.0
Data Use Agreement:
BDSP Credentialed Health Data Use Agreement
Discovery
DOI:
https://doi.org/10.60508/zkvt-gt36
Project Website:
https://github.com/bdsp-core/prophet
Files
To access the files, you must:
- be a credentialed user
- sign the data use agreement for the project