Model Credentialed Access

Automated extraction of post-stroke functional outcomes from unstructured electronic health records

Marta Fernandes, Kaileigh Gallagher, Niels Turley, Aditya Gupta, M Brandon Westover, Aneesh Singhal, Sahar Zafar

Published: Oct. 2, 2025. Version: 1.0.0


When using this resource, please cite:
Fernandes, M., Gallagher, K., Turley, N., Gupta, A., Westover, M. B., Singhal, A., & Zafar, S. (2025). Automated extraction of post-stroke functional outcomes from unstructured electronic health records (version 1.0.0). Brain Data Science Platform. https://doi.org/10.60508/zksv-mq70.

Additionally, please cite the original publication:

Fernandes M, Gallagher K, Turley N, et al. Automated extraction of post-stroke functional outcomes from unstructured electronic health records. European Stroke Journal. 2025;10(3):829-836. doi:10.1177/23969873251314340

Abstract

Purpose:

Population-level tracking of post-stroke functional outcomes is critical to guide interventions that reduce the burden of stroke-related disability. However, functional outcomes are often missing or documented only in unstructured notes. We developed a natural language processing (NLP) model that reads electronic health record (EHR) notes to automatically determine the modified Rankin Scale (mRS) score.

Method:

We included consecutive patients (⩾18 years) with acute stroke admitted to our center (2015–2024). mRS scores were obtained from the Get With the Guidelines registry and clinical notes (if documented), and used as the gold standard to compare against NLP-generated scores. We used text-based features from notes, along with age, sex, discharge status, and outpatient follow-up to train a logistic regression for prediction of good (0–2) versus poor (3–6) mRS, and a linear regression for the full range of mRS scores. The models were trained for prediction of mRS at hospital discharge and post-discharge. The models were externally validated in a dataset of patients with brain injuries from a different healthcare center.

Findings:

We included 5307 patients, 5006 in train and test and 301 in validation; average age was 69 (SD 15) and 65 (SD 17) years, respectively; 47% female. The logistic regression achieved an area under the receiver operating characteristic curve (AUROC) of 0.94 [CI 0.93–0.95] (test) and 0.94 [0.91–0.96] (validation), and the linear model a root mean squared error (RMSE) of 0.91 [0.87–0.94] (test) and 1.17 [1.06–1.28] (validation).

Discussion and Conclusion:

The NLP-based model is suitable for use in large-scale phenotyping of stroke functional outcomes and population health research.

Background

We selected each note type (physical therapy, occupational therapy, discharge summary, and other types) from the calendar date closest to the time of gold standard mRS measurement. Models predicting discharge mRS used notes closest to the day of discharge, and models predicting post-discharge mRS used notes documented closest to the day of post-discharge gold standard measurement. The data from our center was split into train (70%) and test (30%) sets, with unique patients in each set. With the train set we developed a logistic regression model for prediction of good (mRS 0–2) versus poor (mRS 3–6) mRS and a linear regression model for prediction of the full range of mRS 0–6.
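The note selection and patient-level split described above can be sketched as follows. This is a minimal illustration, not the released code: the dictionary keys (`type`, `date`, `text`), the function names, the tie-breaking rule, and the random seed are all assumptions introduced here.

```python
import random
from datetime import date


def closest_note_per_type(notes, target_date):
    """For each note type, keep the note dated closest to target_date.

    `notes` is a list of dicts with hypothetical keys 'type', 'date', 'text'.
    On ties, the first note encountered is kept (an arbitrary choice).
    """
    best = {}
    for note in notes:
        gap = abs((note["date"] - target_date).days)
        current = best.get(note["type"])
        if current is None or gap < abs((current["date"] - target_date).days):
            best[note["type"]] = note
    return best


def split_by_patient(patient_ids, train_fraction=0.7, seed=0):
    """Split unique patient IDs into disjoint train and test sets,
    so that no patient contributes encounters to both."""
    unique_ids = sorted(set(patient_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_ids)
    n_train = round(train_fraction * len(unique_ids))
    return set(unique_ids[:n_train]), set(unique_ids[n_train:])


# Toy example: two physical therapy notes, gold standard measured at discharge.
notes = [
    {"type": "physical therapy", "date": date(2020, 1, 3), "text": "..."},
    {"type": "physical therapy", "date": date(2020, 1, 9), "text": "..."},
    {"type": "discharge summary", "date": date(2020, 1, 10), "text": "..."},
]
selected = closest_note_per_type(notes, target_date=date(2020, 1, 10))
train_ids, test_ids = split_by_patient([f"p{i}" for i in range(10)])
```

Splitting on patient identifiers rather than on encounters is what keeps repeat admissions of the same patient from leaking between the train and test sets.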

Model Description

Our final model had three stages: (stage 1) for patients with a discharge status of deceased, we automatically assigned mRS 6 as the predicted score; (stage 2) for any encounter where mRS was documented by clinicians, regular expressions were used to extract the score; (stage 3) for all other encounters (patients alive at discharge without mRS documentation), LASSO models were used for prediction.
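The three-stage cascade can be illustrated with a short sketch. The regular expression below is a hypothetical pattern written for this example, not the pattern used by the authors, and `model_predict` is a placeholder callable standing in for the trained LASSO models; the `"deceased"` status string is likewise an assumption.

```python
import re

# Hypothetical pattern (an assumption, not the authors' expression): matches
# phrases such as "mRS 3", "mRS: 4", or "modified Rankin Scale score of 2".
MRS_PATTERN = re.compile(
    r"\b(?:mRS|modified\s+Rankin\s+(?:Scale|score))\b[^0-9\n]{0,20}?([0-6])\b",
    re.IGNORECASE,
)


def predict_mrs(discharge_status, note_text, model_predict):
    """Return (score, stage label) following the three-stage logic.

    `model_predict` is a placeholder for the stage 3 model: any callable
    that maps note text to a predicted score.
    """
    # Stage 1: patients deceased at discharge are assigned mRS 6.
    if discharge_status == "deceased":
        return 6, "stage 1: deceased"
    # Stage 2: use a clinician-documented score if one can be extracted.
    match = MRS_PATTERN.search(note_text)
    if match:
        return int(match.group(1)), "stage 2: regex"
    # Stage 3: fall back to the statistical model.
    return model_predict(note_text), "stage 3: model"


# Toy usage with a dummy model that always predicts 3.
dummy_model = lambda text: 3
result = predict_mrs("home", "Exam today. mRS: 4 at discharge.", dummy_model)
```

Ordering the stages this way means the statistical model is only consulted when neither a deterministic rule nor an explicit clinician score is available.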


Technical Implementation

Both models used the least absolute shrinkage and selection operator (LASSO) to select informative text-based features, age, sex, patient discharge status, and an outpatient follow-up flag (yes/no) to predict the mRS scores. Age values were normalized using min–max normalization. For each model, we performed 100 iterations of five-fold cross-validation in the training data to determine the best regularization parameter.
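As a rough sketch of the two preprocessing and tuning steps above, the following uses only the Python standard library. The function names, the candidate grid, and the `fit_and_score` callable (which would fit a LASSO model on the training indices and return a validation score) are assumptions made for illustration; the source does not state how folds were drawn or which score was optimized.

```python
import random
from statistics import mean


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant input: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def repeated_kfold_indices(n_samples, n_splits=5, n_repeats=100, seed=0):
    """Yield (train_idx, val_idx) pairs for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)  # fresh random partition on every repeat
        folds = [order[i::n_splits] for i in range(n_splits)]
        for k in range(n_splits):
            train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield train_idx, folds[k]


def select_regularization(candidates, n_samples, fit_and_score, **cv_kwargs):
    """Return the candidate with the best mean validation score
    (higher is better) across all repeats and folds."""
    splits = list(repeated_kfold_indices(n_samples, **cv_kwargs))
    mean_scores = {
        c: mean(fit_and_score(c, tr, va) for tr, va in splits) for c in candidates
    }
    return max(mean_scores, key=mean_scores.get)


ages_scaled = min_max_normalize([18, 59, 100])
```

With the defaults, each candidate value is scored on 100 × 5 = 500 train/validation splits, and the value with the best average score is retained.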
 
 

Installation and Requirements

Python


Usage Notes

Python


Ethics

Ethical approval

In this dataset, all data were anonymized with all identifiable patient information removed. Scans were identified retrospectively from IRB-approved chart review under protocols approved by the BIDMC IRB (protocols #2022P000481, #2022P000417) and MGB IRB (protocol #2013P001024), which provided a waiver of consent for retrospective data analysis; no prospective data acquisition or participant recruitment was performed. 
 

Informed consent

A waiver of informed consent was obtained for this observational study.

Acknowledgements

This project was supported by NIH R01NS131347 (PI Sahar F. Zafar).


Conflicts of Interest

The author(s) declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: Dr. Zafar is a clinical neurophysiologist for Corticare, received speaking honoraria from Marinus, and received royalties from Springer publishing, unrelated to this work. Dr. Westover is a co-founder, scientific advisor, and consultant to Beacon Biosignals and has a personal equity interest in the company. None of these interests played any role in the present work.


References

  1. Fernandes M, Gallagher K, Turley N, et al. Automated extraction of post-stroke functional outcomes from unstructured electronic health records. European Stroke Journal. 2025;10(3):829-836. doi:10.1177/23969873251314340

Parent Projects
Automated extraction of post-stroke functional outcomes from unstructured electronic health records was derived from: Please cite them when using this project.
Access

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
BDSP Credentialed Health Data License 1.5.0

Data Use Agreement:
BDSP Credentialed Health Data Use Agreement

Required training:
CITI Data or Specimens Only Research

