Database Credentialed Access
Identification of patients with epilepsy using automated electronic health records phenotyping - Data and Code
Marta Fernandes, Sahar Zafar, M Brandon Westover
Published: June 5, 2025. Version: 1.0
When using this resource, please cite:
Fernandes, M., Zafar, S., & Westover, M. B. (2025). Identification of patients with epilepsy using automated electronic health records phenotyping - Data and Code (version 1.0). Brain Data Science Platform. https://doi.org/10.60508/9erc-bs61.
Abstract
Unstructured data in electronic health records (EHR) provide valuable medical insights but require significant manual effort for abstraction. We developed an automated EHR phenotyping (AEP) model to identify epilepsy patients using clinical notes, antiseizure medications (ASMs), and International Classification of Diseases (ICD) codes. The model, trained on structured annotations and manual chart review, achieved high accuracy in distinguishing epilepsy from non-epilepsy cases. This approach facilitates large-scale epilepsy research by enabling efficient EHR-based patient identification.
Background
Epilepsy affects millions worldwide and is commonly identified using ICD codes and medication prescriptions, but these methods can be inaccurate. Manual review of clinical notes provides better diagnostic certainty but is labor-intensive. Our study introduces a machine learning-based AEP model that integrates structured and unstructured EHR data to accurately classify epilepsy cases.
Methods
This study included 3903 patients seen at outpatient clinics of nine hospitals between 2015 and 2022. The ground truth was established through structured physician questionnaires, manual chart review, and computational keyword searches. We trained logistic regression and gradient boosting models using features from text-based clinical notes, ASMs, and ICD codes. Model performance was evaluated using the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC).
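For readers unfamiliar with the two evaluation metrics named above, the following is a minimal, standard-library-only sketch of how AUROC and AUPRC can be computed from binary labels and predicted scores. The function names `auroc` and `auprc` and the toy labels and scores are illustrative assumptions, not taken from the study; the code actually used to produce the published results is in the project repository.

```python
from typing import Sequence


def auroc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """AUROC via the pairwise (Mann-Whitney) identity: the fraction of
    positive/negative pairs in which the positive case scores higher."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    # Correctly ordered pairs count 1, tied pairs count 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auprc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """AUPRC estimated as average precision: the mean of the precision
    values at the rank of each positive case. Assumes no tied scores."""
    n_pos = sum(1 for y in y_true if y == 1)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    # Walk through cases from highest to lowest predicted score.
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    true_positives = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            true_positives += 1
            precision_sum += true_positives / rank
    return precision_sum / n_pos


# Toy example (made-up values): 1 = epilepsy, 0 = no epilepsy.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(round(auroc(labels, scores), 3))  # 0.778
print(round(auprc(labels, scores), 3))  # 0.806
```

The pairwise AUROC loop is quadratic in the number of cases, which is fine for illustration; for data of the size described here, a library implementation would be the more practical choice.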
IRB Approval: This study was conducted under Mass General Brigham IRB approval #2012P001929, with a waiver of informed consent.
Data Description
The dataset includes anonymized EHR-derived features, including:
- Text features: Keywords and phrases extracted from unstructured clinical notes
- Structured features: Prescription records of ASMs and epilepsy-related ICD codes
- Demographic information: Age, sex, and number of epilepsy-related encounters
The final dataset used for model training and validation consists of 8415 encounters across 3903 patients.
The data files, available on AWS after signing the DUA and providing proof of CITI training, are:
- dataset_demographics.csv
- dataset_icds_encounter.csv
- dataset_meds.csv
- enc_diag_all.csv
- icds_encounter_meds_after_build.csv
- meds_all.csv
- notes.csv
- problist_diag_all.csv
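Because the column layout of these files is not documented on this page, a reasonable first step after download is to list each file's header and row count. The sketch below does that with the Python standard library only. The local directory name `data`, the UTF-8 encoding, and the helper names `summarize_csv` and `summarize_dataset` are assumptions made for illustration; only the file names come from the list above.

```python
import csv
from pathlib import Path
from typing import Dict, List

# File names as listed in the Data Description.
DATA_FILES = [
    "dataset_demographics.csv",
    "dataset_icds_encounter.csv",
    "dataset_meds.csv",
    "enc_diag_all.csv",
    "icds_encounter_meds_after_build.csv",
    "meds_all.csv",
    "notes.csv",
    "problist_diag_all.csv",
]


def summarize_csv(path: Path) -> Dict[str, object]:
    """Return the column names and number of data rows of one CSV file."""
    # Assumption: files are UTF-8 encoded with a single header row.
    with path.open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])
        n_rows = sum(1 for _ in reader)
    return {"file": path.name, "columns": header, "rows": n_rows}


def summarize_dataset(data_dir: Path) -> List[Dict[str, object]]:
    """Summarize every expected file; missing files are reported, not fatal."""
    summaries = []
    for name in DATA_FILES:
        path = data_dir / name
        if path.exists():
            summaries.append(summarize_csv(path))
        else:
            summaries.append({"file": name, "columns": None, "rows": None})
    return summaries


if __name__ == "__main__":
    # "data" is a hypothetical local folder holding the downloaded files.
    for summary in summarize_dataset(Path("data")):
        print(summary)
```

Comparing the reported row counts against the 8415 encounters and 3903 patients stated above is a quick sanity check that the download is complete.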
Usage Notes
Data and code to generate all results and figures from the publication are provided here: https://github.com/bdsp-core/NAX-Epilepsy
Ethics
This study was approved by the Mass General Brigham Institutional Review Board (IRB #2012P001929), including review of EEG and other clinical data. The Partners Healthcare Human Research Committee provided a waiver of written consent. All data used were deidentified prior to analysis.
Acknowledgements
This research was supported by the National Institutes of Health (NIH), the Epilepsy Foundation of America, and the Centers for Disease Control and Prevention. We thank the Epilepsy Learning Healthcare System for additional help.
Conflicts of Interest
MBW is a co-founder of, scientific advisor and consultant to, and holds a personal equity interest in Beacon Biosignals.
References
- Fernandes M, Cardall A, Jing J, Ge W, Moura LMVR, Jacobs C, McGraw C, Zafar SF, Westover MB. Identification of patients with epilepsy using automated electronic health records phenotyping. Epilepsia. 2023 Jun;64(6):1472-1481. doi: 10.1111/epi.17589. Epub 2023 Apr 4. PMID: 36934317; PMCID: PMC10239346.
Access
Access Policy:
Only credentialed users who sign the DUA can access the files.
License (for files):
BDSP Credentialed Health Data License 1.5.0
Data Use Agreement:
BDSP Credentialed Health Data Use Agreement
Required training:
CITI Data or Specimens Only Research
Discovery
DOI:
https://doi.org/10.60508/9erc-bs61
Project Website:
https://github.com/bdsp-core/NAX-Epilepsy
Files
To access the files, you must:
- be a credentialed user
- complete the required training: CITI Data or Specimens Only Research
- sign the data use agreement for the project