Model Credentialed Access

Automated extraction of stroke severity from unstructured electronic health records using natural language processing

Marta Fernandes, M. Brandon Westover, Aneesh Singhal, Sahar Zafar

Published: Oct. 2, 2025. Version: 1.0.0


When using this resource, please cite:
Fernandes, M., Westover, M. B., Singhal, A., & Zafar, S. (2025). Automated extraction of stroke severity from unstructured electronic health records using natural language processing (version 1.0.0). Brain Data Science Platform. https://doi.org/10.60508/gbcr-e844.

Additionally, please cite the original publication:

Fernandes M, Westover MB, Singhal AB, Zafar SF. Automated Extraction of Stroke Severity From Unstructured Electronic Health Records Using Natural Language Processing. J Am Heart Assoc. 2024 Nov 5;13(21):e036386. doi: 10.1161/JAHA.124.036386. Epub 2024 Oct 25. PMID: 39450737; PMCID: PMC11935650.

Abstract

Background

Multicenter electronic health records can support quality improvement and comparative effectiveness research in stroke. However, limitations of electronic health record–based research include challenges in abstracting key clinical variables, including stroke severity, along with missing data. We developed a natural language processing model that reads electronic health record notes to directly extract the National Institutes of Health Stroke Scale score when documented and predict the score from clinical documentation when missing.

Methods and Results

The study included notes from patients with acute stroke (aged ≥18 years) admitted to Massachusetts General Hospital (2015–2022). The Massachusetts General Hospital data were divided into training/holdout test (70%/30%) sets. We developed a 2‐stage model to predict the admission National Institutes of Health Stroke Scale, obtained from the GWTG (Get With The Guidelines) stroke registry. We trained a model with the least absolute shrinkage and selection operator. For test notes with documented National Institutes of Health Stroke Scale, scores were extracted using regular expressions (stage 1); when not documented, least absolute shrinkage and selection operator was used for prediction (stage 2). The 2‐stage model was tested on the holdout test set and validated in the Medical Information Mart for Intensive Care (2001–2012) version 1.4, using root mean squared error and Spearman correlation. We included 4163 patients (Massachusetts General Hospital, 3876; Medical Information Mart for Intensive Care, 287); average age, 69 (SD, 15) years; 53% men, and 72% White individuals. The model achieved a root mean squared error of 2.89 (95% CI, 2.62–3.19) and Spearman correlation of 0.92 (95% CI, 0.91–0.93) in the Massachusetts General Hospital test set, and 2.20 (95% CI, 1.69–2.66) and 0.96 (95% CI, 0.94–0.97) in the MIMIC validation set, respectively.

Conclusions

The automatic natural language processing–based model can enable large‐scale stroke severity phenotyping from the electronic health record and support real‐world quality improvement and comparative effectiveness studies in stroke.

Background

The EHR data in our study comprised free-text admission notes (MGH) and discharge summaries (MIMIC). The MGH notes were extracted from the first and second dates of admission, given our goal of measuring and predicting admission stroke severity. For external validation of the model, we used discharge summaries from the MIMIC data set, because the admission NIHSS scores are mainly documented in this type of note.
 
We split the MGH data randomly into a training set (70%) and a holdout test set (30%). Using the training data, we developed a linear regression model to predict the patients' NIHSS scores, which range from 0 to 42.
 
We also developed, within the training data, an ordinal logistic regression model over 4 severity classes defined by NIHSS score: minor stroke (0–4), moderate stroke (5–15), moderate to severe stroke (16–20), and severe stroke (21–42).
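The four severity classes above amount to a simple binning of the NIHSS score; a minimal sketch (the function name is ours, not from the study):

```python
def nihss_severity_class(score: int) -> str:
    """Map an NIHSS score (0-42) to one of the four severity classes used in the study."""
    if not 0 <= score <= 42:
        raise ValueError("NIHSS scores range from 0 to 42")
    if score <= 4:
        return "minor"
    if score <= 15:
        return "moderate"
    if score <= 20:
        return "moderate to severe"
    return "severe"
```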
 
 

Model Description

Our final model was a 2-stage model, applied to the MGH holdout test set and externally validated on the MIMIC validation set. In stage 1, notes were checked for the NIHSS and, if detected, the score was directly extracted; this stage used simple hard-coded regular expressions. In stage 2, for notes in which the NIHSS score was not detected or documented, we applied the LASSO model to estimate the NIHSS from information contained in the note.
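The two-stage logic can be sketched as follows. Note that the regular expression here is illustrative only, not the exact pattern used in the study, and `lasso_predict` is a placeholder for the trained stage-2 model:

```python
import re

# Illustrative pattern: matches phrasings such as "NIHSS 14", "NIHSS: 14",
# or "NIH stroke scale score of 14".
NIHSS_PATTERN = re.compile(
    r"\bNIH(?:SS|\s+stroke\s+scale)\b[^0-9]{0,20}(\d{1,2})", re.IGNORECASE
)

def two_stage_nihss(note: str, lasso_predict) -> float:
    """Stage 1: extract a documented NIHSS score with a regular expression.
    Stage 2: if no score is documented, fall back to the LASSO model."""
    match = NIHSS_PATTERN.search(note)
    if match:
        score = int(match.group(1))
        if 0 <= score <= 42:  # keep only plausible NIHSS values
            return float(score)
    return lasso_predict(note)  # stage 2: model-based estimate
```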


Technical Implementation

The linear regression model used the least absolute shrinkage and selection operator (LASSO) to select text-based features from the notes for predicting the patients' NIHSS scores. We performed 100 iterations of 5-fold cross-validation within the training data to determine the best regularization parameter.
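A minimal scikit-learn sketch of this setup, using toy data; the feature extraction, vocabulary, and cross-validation settings here are illustrative and do not reproduce the study's exact pipeline (the fold count is reduced to fit the toy sample):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LassoCV

# Toy notes and NIHSS targets standing in for the MGH training data.
notes = [
    "patient alert, mild facial droop, strength largely intact",
    "dense hemiplegia, aphasia, gaze deviation",
    "no focal deficit on exam",
    "severe dysarthria and right-sided weakness",
]
scores = np.array([3.0, 22.0, 0.0, 12.0])

# Text-based features extracted from the note text.
X = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(notes).toarray()

# LASSO with cross-validation to choose the regularization parameter;
# the study used 5-fold cross-validation repeated over 100 iterations.
model = LassoCV(cv=2, random_state=0).fit(X, scores)
predicted = model.predict(X)
```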

The ordinal regression consisted of fitting a parallel adjacent-category probability model with a logit link, using the ordinalNet R package. As with the linear regression model, the ordinal regression used LASSO regularization on the text-based features from the notes to predict each patient's NIHSS severity class (one of the four classes defined above). We again performed 100 iterations of 5-fold cross-validation within the training data to determine the best regularization parameter.
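Under a parallel adjacent-category logit model, the class probabilities can be recovered from the per-category intercepts and a single shared linear predictor. A NumPy sketch of that computation (the intercepts and linear predictor value below are made up for illustration):

```python
import numpy as np

def adjacent_category_probs(eta: float, alphas: np.ndarray) -> np.ndarray:
    """Class probabilities under a parallel adjacent-category logit model:
    log(P(Y=j+1) / P(Y=j)) = alphas[j] + eta for each adjacent pair of classes,
    with the same eta (x'beta) shared across categories (the "parallel" assumption)."""
    # Cumulative log-odds of each class relative to the first class.
    log_unnorm = np.concatenate([[0.0], np.cumsum(alphas + eta)])
    probs = np.exp(log_unnorm - log_unnorm.max())  # subtract max for stability
    return probs / probs.sum()

# Four severity classes (minor, moderate, moderate to severe, severe)
# require three adjacent-category intercepts.
p = adjacent_category_probs(eta=0.5, alphas=np.array([-1.0, -0.5, -2.0]))
```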


Installation and Requirements

Python and RStudio


Usage Notes

Python and RStudio


Ethics

All data were anonymized, with all identifiable patient information removed. Data were identified retrospectively through chart review under protocols approved by the BIDMC IRB (#2022P000481, #2022P000417) and MGB IRB (#2013P001024); the IRB granted a waiver of informed consent for this retrospective, observational study, and no prospective data acquisition or participant recruitment was performed.


Acknowledgements

M.B.W. was supported by grants from the National Institutes of Health (RF1AG064312, RF1NS120947, R01AG073410, R01HL161253, R01NS126282, R01AG073598, R01NS131347, R01NS130119), and the National Science Foundation (2014431). S.F.Z. was supported by the National Institutes of Health (K23NS114201, R01NS126282, R01AG082693, R01NS131347).


Conflicts of Interest

Dr Zafar is a clinical neurophysiologist for Corticare, received speaking honoraria from Marinus, and received royalties from Springer Publishing, unrelated to this work. Dr Westover is a cofounder, scientific advisor, and consultant to Beacon Biosignals and has a personal equity interest in the company. He receives royalties for authoring Pocket Neurology from Wolters Kluwer and Atlas of Intensive Care Quantitative EEG by Demos Medical. None of these interests played any role in the present work. The remaining authors have no disclosures to report.


Parent Projects
This resource was derived from parent projects; please cite them when using this project.
Access

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
BDSP Credentialed Health Data License 1.5.0

Data Use Agreement:
BDSP Credentialed Health Data Use Agreement

Required training:
CITI Data or Specimens Only Research

