Model Credentialed Access
Automated extraction of stroke severity from unstructured electronic health records using natural language processing
Marta Fernandes, M Brandon Westover, Aneesh Singhal, Sahar Zafar
Published: Oct. 2, 2025. Version: 1.0.0
When using this resource, please cite:
Fernandes, M., Westover, M. B., Singhal, A., & Zafar, S. (2025). Automated extraction of stroke severity from unstructured electronic health records using natural language processing (version 1.0.0). Brain Data Science Platform. https://doi.org/10.60508/gbcr-e844.
Abstract
Background
Methods and Results
Conclusions
Background
Model Description
Our final model has two stages. It was applied to the MGH holdout test set and externally validated on the MIMIC validation set. (1) In stage 1, each note is checked for a documented National Institutes of Health Stroke Scale (NIHSS) score and, if one is detected, the score is extracted directly. This stage uses simple hard-coded regular expressions. (2) In stage 2, for notes in which no NIHSS score was detected or documented, the LASSO model estimates the NIHSS from the information contained in the note.
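The two-stage flow can be sketched as follows. This is a minimal illustration, not the authors' code: the regular expression, the function names `extract_nihss` and `two_stage_nihss`, and the `estimate` callback (standing in for the fitted LASSO model) are all assumptions. The only clinical fact relied on is that a valid NIHSS total lies between 0 and 42.

```python
import re
from typing import Callable, Optional

# Hypothetical pattern: the source does not list the actual regular expressions.
# It matches "NIHSS" or "NIH Stroke Scale", skips up to 20 non-digit characters
# on the same line, then captures a one- or two-digit number.
NIHSS_PATTERN = re.compile(
    r"\b(?:NIHSS|NIH\s+stroke\s+scale)\b[^\d\n]{0,20}?(\d{1,2})\b",
    re.IGNORECASE,
)


def extract_nihss(note: str) -> Optional[int]:
    """Stage 1: return the first documented NIHSS score in the note, if any."""
    for match in NIHSS_PATTERN.finditer(note):
        score = int(match.group(1))
        if 0 <= score <= 42:  # valid range of the NIHSS total score
            return score
    return None


def two_stage_nihss(note: str, estimate: Callable[[str], float]) -> float:
    """Use the documented score when present; otherwise fall back to stage 2.

    `estimate` is a placeholder for the fitted text-based regression model.
    """
    score = extract_nihss(note)
    if score is not None:
        return float(score)
    return estimate(note)
```

For example, `two_stage_nihss("NIHSS: 12 on admission", model)` would return the documented score without calling `model`, whereas a note with no NIHSS mention would be passed to `model`.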
Technical Implementation
The linear regression model used the least absolute shrinkage and selection operator (LASSO) to select text-based features from the notes for predicting the patients' NIHSS scores. Within the training data, we performed 100 iterations of 5-fold cross-validation to determine the best regularization parameter.
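The selection of the regularization parameter can be illustrated with a small, dependency-free sketch. It assumes that "100 iterations of 5-fold cross-validation" means 100 reshuffled repeats of 5-fold cross-validation, and that the parameter with the lowest total held-out squared error is chosen; neither detail is confirmed by the source. The coordinate-descent solver and the names `fit_lasso` and `select_lambda` are illustrative only.

```python
import random


def soft_threshold(z, t):
    """Shrink z toward zero by t (the LASSO proximal step)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def fit_lasso(X, y, lam, n_iter=100):
    """Coordinate descent for (1/2n)*||y - b0 - X b||^2 + lam*||b||_1."""
    n, p = len(X), len(X[0])
    cols = [[row[j] for row in X] for j in range(p)]
    scale = [sum(c * c for c in col) / n for col in cols]
    b0 = sum(y) / n
    beta = [0.0] * p
    resid = [yi - b0 for yi in y]  # residuals of the current fit
    for _ in range(n_iter):
        for j, col in enumerate(cols):
            if scale[j] == 0.0:
                continue  # constant-zero feature carries no information
            rho = sum(c * r for c, r in zip(col, resid)) / n + scale[j] * beta[j]
            new = soft_threshold(rho, lam) / scale[j]
            step = new - beta[j]
            if step != 0.0:
                resid = [r - c * step for c, r in zip(col, resid)]
                beta[j] = new
        shift = sum(resid) / n  # refit the unpenalized intercept
        b0 += shift
        resid = [r - shift for r in resid]
    return b0, beta


def predict(b0, beta, row):
    return b0 + sum(b * v for b, v in zip(beta, row))


def select_lambda(X, y, lambdas, n_splits=5, n_repeats=100, seed=0):
    """Repeated k-fold CV: return the lambda with the lowest held-out error."""
    rng = random.Random(seed)
    sse = {lam: 0.0 for lam in lambdas}
    for _ in range(n_repeats):
        idx = list(range(len(y)))
        rng.shuffle(idx)  # new random fold assignment on every repeat
        for k in range(n_splits):
            held = idx[k::n_splits]
            held_set = set(held)
            train = [i for i in idx if i not in held_set]
            X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
            for lam in lambdas:
                b0, beta = fit_lasso(X_tr, y_tr, lam)
                sse[lam] += sum((y[i] - predict(b0, beta, X[i])) ** 2 for i in held)
    return min(lambdas, key=sse.get)
```

In practice a library implementation would normally be used instead; the sketch only makes the repeated cross-validation loop explicit.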
The ordinal regression consisted of fitting a parallel adjacent category probability model with logit link, using the ordinalNet R package. Like the linear regression model, the ordinal regression used LASSO regularization over the text-based features from the notes, here to predict each patient's NIHSS score as one of four classes. Within the training data, we again performed 100 iterations of 5-fold cross-validation to determine the best regularization parameter.
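For reference, a parallel adjacent-category logit model with a LASSO penalty can be written as below. This is a general textbook form under stated assumptions, not a transcription of the fitted model: the direction of the odds ratio (class j + 1 versus class j), the unpenalized intercepts, and the 1/N scaling of the log-likelihood are assumptions that depend on the package settings actually used. Here Y is the NIHSS class (K = 4), x the text-based feature vector, and "parallel" means that a single coefficient vector is shared across all adjacent-class comparisons.

```latex
% Parallel adjacent-category logit model for K = 4 ordered NIHSS classes
\log \frac{P(Y = j + 1 \mid x)}{P(Y = j \mid x)} = \alpha_j + x^\top \beta,
\qquad j = 1, 2, 3

% LASSO-penalized fit over N training notes (intercepts \alpha_j unpenalized)
(\hat{\alpha}, \hat{\beta}) = \arg\min_{\alpha, \beta} \;
  -\frac{1}{N} \sum_{i=1}^{N} \log P(Y = y_i \mid x_i)
  + \lambda \lVert \beta \rVert_1
```

The regularization parameter \lambda is the quantity chosen by the repeated cross-validation described above.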
Installation and Requirements
Python and RStudio
Usage Notes
Python and RStudio
Ethics
All data were anonymized, with all identifiable patient information removed. Scans were identified retrospectively through IRB-approved chart review under protocols approved by the BIDMC IRB (protocols #2022P000481, #2022P000417) and the MGB IRB (protocol #2013P001024), which granted a waiver of informed consent for this retrospective, observational analysis. No prospective data acquisition or participant recruitment was performed.
Acknowledgements
M.B.W. was supported by grants from the National Institutes of Health (RF1AG064312, RF1NS120947, R01AG073410, R01HL161253, R01NS126282, R01AG073598, R01NS131347, R01NS130119), and the National Science Foundation (2014431). S.F.Z. was supported by the National Institutes of Health (K23NS114201, R01NS126282, R01AG082693, R01NS131347).
Conflicts of Interest
Dr Zafar is a clinical neurophysiologist for Corticare, received speaking honoraria from Marinus, and received royalties from Springer Publishing, unrelated to this work. Dr Westover is a cofounder, scientific advisor, and consultant to Beacon Biosignals and has a personal equity interest in the company. He receives royalties for authoring Pocket Neurology from Wolters Kluwer and Atlas of Intensive Care Quantitative EEG by Demos Medical. None of these interests played any role in the present work. The remaining authors have no disclosures to report.
Parent Projects
Access
Access Policy:
Only credentialed users who sign the DUA can access the files.
License (for files):
BDSP Credentialed Health Data License 1.5.0
Data Use Agreement:
BDSP Credentialed Health Data Use Agreement
Required training:
CITI Data or Specimens Only Research
Discovery
Corresponding Author
Files
To access the files, you must:
- be a credentialed user
- complete required training:
  - CITI Data or Specimens Only Research
- sign the data use agreement for the project