Model Credentialed Access
Automated Prediction of Glasgow Coma Scale Scores from Unstructured Electronic Health Records: a Natural Language Processing Approach
Marta Fernandes , Niels Turley , Haoqi Sun , Shibani Mukerji , Lidia M. V. R. Moura , M Brandon Westover , Sahar Zafar
Published: April 17, 2026. Version: 1.0.0
When using this resource, please cite:
Fernandes, M., Turley, N., Sun, H., Mukerji, S., Moura, L. M. V. R., Westover, M. B., & Zafar, S. (2026). Automated Prediction of Glasgow Coma Scale Scores from Unstructured Electronic Health Records: a Natural Language Processing Approach (version 1.0.0). Brain Data Science Platform. https://doi.org/10.60508/y6ra-1g90.
Abstract
Background: Multicenter electronic health records (EHR) can support quality improvement and comparative effectiveness research in critical care. However, limitations of EHR-based research include challenges in abstracting key clinical variables, including a patient’s level of consciousness.
Objective: The objective of our study was to develop a natural language processing (NLP) model to predict the Glasgow Coma Scale (GCS) scores from daily EHR notes.
Methods: The study included adult patients (≥18 years) admitted to Mass General Brigham (MGB) hospitals (2017-2024) and patients from the MIMIC-III database (Medical Information Mart for Intensive Care III, v1.4; 2001-2012). A dataset of all patients from both institutions was split into train/hold-out test (70%/30%) sets. Variables consisted of daily notes, age, sex, and admission type. We trained a pooled ordinal regression model (ordinalNet) with an elastic net penalty to predict the lowest daily level of consciousness across three classes: severe (GCS 3-8), moderate (GCS 9-12), and mild (GCS 13-15), and a pooled linear model to predict continuous GCS scores (3-15). Gold-standard GCS was obtained from structured flowsheet data. External generalizability was assessed using a single-institution ordinal model trained on MGB and tested on MIMIC. Following post-hoc calibration, ordinal and linear model performance was evaluated on the hold-out test sets using the areas under the receiver operating characteristic curve (AUROC) and precision-recall curve (AUPRC), and the root mean square error (RMSE) and Pearson correlation, respectively.
Results: Our modeling cohort included 145,897 patients (MGB = 123,257; MIMIC = 22,640) with 1,446,965 days of hospitalization across the training and testing sets; average age 62 [SD 18] years and a balanced sex distribution. The pooled ordinalNet achieved AUROC and AUPRC [95% CI] of 0.96 [0.96-0.96] and 0.77 [0.76-0.77]. The single-institution ordinal model achieved AUROC 0.90 [0.89-0.90] and AUPRC 0.80 [0.79-0.80]. The pooled linear model achieved RMSE 2.30 [2.30-2.30] and correlation 0.76 [0.76-0.76]. Predictions for severe GCS were driven by terms indicating unresponsiveness and critical interventions, moderate GCS by intermediate alertness descriptors, and mild GCS by mentions of normal or awake behavior.
Conclusions: Pooled ordinal and linear models can accurately predict GCS from unstructured data and can support large-scale phenotyping of neurological assessments for future critical care research.
Methods
Structured variables included age, sex and admission types (emergency, urgent, elective). Text-based variables were extracted from preprocessed daily clinical notes and binarized. Our outcome was the lowest daily GCS score for each day of hospital admission. For analysis, the lowest daily GCS was categorized as severe (GCS 3-8), moderate (GCS 9-12) and mild (GCS 13-15). Data from each institution was randomly split by patient into train (70%) and hold-out test (30%) sets. The training sets from both MGB and MIMIC, as well as their respective test sets, were combined to create pooled training and testing sets for a single, multi-institution model. We developed an ordinal regression model with elastic net penalty (ordinalNet) within the training data to predict the three classes of GCS scores. We also developed a linear regression model to predict the full range of GCS scores 3-15. We additionally evaluated cross-institution generalizability by training a single-institution model using data from MGB and testing it on MIMIC data.
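The class definitions and patient-level split described above can be illustrated with a minimal Python sketch (the function names are my own, not those used in the project code):

```python
# Hypothetical sketch: map a daily lowest GCS score to the three severity
# classes used in the study, and split unique patients (not hospital days)
# into train/hold-out test sets so no patient appears in both.
import random


def gcs_class(gcs: int) -> str:
    """Categorize a GCS score (3-15) as severe, moderate, or mild."""
    if not 3 <= gcs <= 15:
        raise ValueError(f"GCS must be in 3-15, got {gcs}")
    if gcs <= 8:
        return "severe"
    if gcs <= 12:
        return "moderate"
    return "mild"


def patient_level_split(patient_ids, train_frac=0.7, seed=42):
    """Randomly split unique patient IDs into train and test sets."""
    ids = sorted(set(patient_ids))
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_train = int(len(ids) * train_frac)
    return set(ids[:n_train]), set(ids[n_train:])
```

Splitting by patient rather than by hospital day prevents leakage of a patient's notes between the training and test sets.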
Model Description
The data package is hosted on S3 at s3://bdsp-opendata-repository/NAX/nax-gcs/ and is organized into the following directories:
data/deid_notes/ — De-identified sample clinical notes provided for reproducibility testing of the preprocessing pipeline:
- deid_notes_main.parquet — primary sample
- deid_notes_sens.parquet — sensitivity analysis sample
- notes_deid_mgb.parquet — MGB-specific de-identified notes
data/main/ — Primary analysis artifacts (features, labels, trained models, and prediction probabilities) for three modeling strategies:
- Xtrain_*.parquet / Xtest_*.parquet / x_train_*.parquet / x_test_*.parquet — feature matrices
- y_train_*.parquet / y_test_*.parquet / y_mimic_*.parquet — outcome labels (train, held-out test, MIMIC external validation)
- clf_linear_pooled_main.pkl — trained linear pooled classifier
- ordinalNet_ordinal_pooled_main.rds / ordinalNet_ordinal_single_main.rds — R ordinalNet model objects
- coefs_ordinal_*_main.csv — fitted coefficients
- vect_*_main.pkl — fitted vectorizers
- probs_train_*_main.csv / probs_test_*_main.csv — per-class predicted probabilities
data/sens/ — Sensitivity analysis artifacts with the same structure as data/main/.
code/ — An archived copy of the code in the associated GitHub repository (https://github.com/bdsp-core/nax-gcs), split into data_preprocessing/ (6 scripts for notes cleaning and feature engineering) and data_modeling/ (4 scripts for linear and ordinal regression models in Python and R).
utils/ — 12 shared Python helper modules (used by both preprocessing and modeling scripts): abbreviation expansion, stopword removal, n-gram extraction, train/test encoding, data loaders, linear-model helpers, ordinal performance metrics, calibration, and plotting routines for feature importance and performance curves.
Three modeling strategies are provided:
- Linear pooled — linear regression trained on combined MGB + MIMIC data to predict the total GCS score (3–15).
- Ordinal pooled — ordinal regression trained on combined MGB + MIMIC data to predict the three GCS severity classes.
- Ordinal single — ordinal regression trained on MGB only (with MIMIC used for external validation) to predict the three GCS severity classes.
All tabular data is stored in Apache Parquet format. Trained models are stored as Python pickle (.pkl) or R data (.rds) files. All clinical notes have been de-identified in accordance with HIPAA Safe Harbor. Total dataset size: approximately 1.94 GB across 86 files.
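The per-class probability files (probs_*_main.csv) contain one column per GCS severity class; the predicted class for a given day is the one with the highest probability. A stdlib-only sketch (the column names and values below are assumptions, not the actual CSV headers):

```python
# Hypothetical sketch: pick the predicted GCS class from per-class
# probabilities by taking the maximum, as done for the ordinal models.
csv_rows = [
    {"severe": 0.71, "moderate": 0.20, "mild": 0.09},
    {"severe": 0.05, "moderate": 0.15, "mild": 0.80},
]

# For each row, the predicted class is the key with the largest probability.
predicted = [max(row, key=row.get) for row in csv_rows]
```

In practice the rows would be read from the CSV files (e.g. with pandas) rather than defined inline.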
Technical Implementation
We developed an ordinal regression model in which we fitted a parallel adjacent-category probability model with a logit link, using the ordinalNet R package. For each daily prediction, we selected the class with the maximum of the three predicted class probabilities. To find the best regularization parameter, we performed five-fold cross-validation within the training data using the ordinalNetTune R function; the regularization parameter yielding the minimum error was then used to fit the ordinalNet model on the full training set. We also developed a linear regression model with the training data using the least absolute shrinkage and selection operator (LASSO) to predict the daily lowest GCS scores (3-15). For this model, we likewise performed five-fold cross-validation within the training data to determine the best regularization parameter, which was then used to fit the LASSO model on the training set. A combination of under-sampling and over-sampling strategies was applied within the training set to prevent one institution from dominating the pooled model, to address class imbalance, and to mitigate potential bias towards the majority class.
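A minimal scikit-learn analogue of the LASSO step is shown below; the actual pipeline lives in code/data_modeling/, and the synthetic features and outcome here are illustrative only:

```python
# Hypothetical sketch: LASSO with five-fold cross-validation to choose the
# regularization strength, then a fit on the full training set. LassoCV
# performs both steps in a single call.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
# Synthetic stand-in for binarized n-gram features (200 days x 20 features)
X_train = rng.integers(0, 2, size=(200, 20)).astype(float)
# Synthetic continuous outcome bounded to the GCS range 3-15
y_train = np.clip(9 + 4 * X_train[:, 0] - 3 * X_train[:, 1]
                  + rng.normal(0, 1, 200), 3, 15)

model = LassoCV(cv=5, random_state=0).fit(X_train, y_train)
preds = np.clip(model.predict(X_train), 3, 15)  # keep predictions in 3-15
```

The cross-validated regularization parameter is available afterwards as model.alpha_, mirroring the "minimum-error parameter, then refit" procedure described above.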
Installation and Requirements
The pipelines use both Python and R. Versions used in the original experiments:
Python 3.10.15:
pandas==2.3.3
numpy==1.26.0
scikit-learn==1.5.2
nltk==3.8.1
imbalanced-learn==0.12.3
scipy==1.11.3
pingouin==0.5.5
matplotlib==3.9.2
R 4.4.1:
readxl 1.4.3
ordinalNet 2.12
MASS 7.3.64
Install Python dependencies with pip install -r requirements.txt (see the GitHub repository for the pinned dependency list).
Usage Notes
Code to reproduce all results is available at https://github.com/bdsp-core/nax-gcs. An archived copy is also included under code/ in the S3 data package.
Quick start: after downloading the data package, run the preprocessing on the de-identified sample, then run the modeling scripts:
# 1. Download the data package (requires credentialed access)
aws s3 sync s3://bdsp-opendata-repository/NAX/nax-gcs/data/ ./data/
# 2. Preprocess the de-identified sample notes
python code/data_preprocessing/Notes_preprocessing.py
# 3. Train and evaluate models
python code/data_modeling/Linear_pooled_model.py
python code/data_modeling/Ordinal_pooled_model.py
python code/data_modeling/Ordinal_single_model.py
For the ordinal elastic-net models, open code/data_modeling/OrdinalNet_models.Rmd in RStudio and knit.
All experiments use fixed random seeds for reproducibility.
Ethics
Ethical approval
The study was approved by the Mass General Brigham Institutional Review Board, Protocol 2013P001024; a waiver of informed consent was obtained for this observational study.
Acknowledgements
This work was funded by National Institutes of Health (NIH) grant R01NS131347 (SFZ).
Conflicts of Interest
Dr. Zafar is a clinical neurophysiologist for Corticare, received speaking honoraria from Marinus, and received royalties from Springer publishing, unrelated to this work. Dr. Westover is a co-founder, scientific advisor, and consultant to Beacon Biosignals and has a personal equity interest in the company. He receives royalties for authoring Pocket Neurology from Wolters Kluwer and Atlas of Intensive Care Quantitative EEG by Demos Medical. None of these interests played any role in the present work. Dr. Moura has no significant financial relationship with any commercial or proprietary entity that produces healthcare-related products and/or services relevant to the content of this manuscript. The authors declare no other conflicts of interest.
Access
Access Policy:
Only credentialed users who sign the DUA can access the files.
License (for files):
BDSP Credentialed Health Data License 1.5.0
Data Use Agreement:
BDSP Credentialed Health Data Use Agreement
Required training:
CITI Data or Specimens Only Research
Discovery
DOI:
https://doi.org/10.60508/y6ra-1g90
Programming Languages:
Python, R
Topics:
glasgow coma scale
natural language processing
ordinal regression
electronic health records
clinical notes
Project Website:
https://github.com/bdsp-core/nax-gcs
Corresponding Author
Files
To access the files, you must:
- be a credentialed user
- complete the required training: CITI Data or Specimens Only Research
- sign the data use agreement for the project