Model Open Access

Automated phenotyping of mild cognitive impairment and Alzheimer's disease and related dementias using electronic health records

Haoqi Sun, Ruoqi Wei, Niels Turley, Aditya Gupta, Manohar Ghanta, Robert Thomas, Sahar Zafar, M Brandon Westover

Published: Sept. 25, 2025. Version: 1.0


When using this resource, please cite:
Sun, H., Wei, R., Turley, N., Gupta, A., Ghanta, M., Thomas, R., Zafar, S., & Westover, M. B. (2025). Automated phenotyping of mild cognitive impairment and Alzheimer's disease and related dementias using electronic health records (version 1.0). Brain Data Science Platform. https://doi.org/10.60508/2m08-nf79.

Abstract

Objectives: Unstructured and structured data in electronic health records (EHRs) are a rich source of information for research and quality improvement studies. However, extracting accurate information from EHRs is labor-intensive. Timely and accurate identification of patients with Alzheimer's disease and related dementias (ADRD) or mild cognitive impairment (MCI) is critical for improving patient outcomes through early intervention, optimizing care plans, and reducing healthcare system burdens. Here we introduce an automated EHR phenotyping model to streamline this process and enable efficient identification of these conditions.

Methods: We analyzed data from 3,626 outpatients seen at two hospitals between February 2015 and June 2022. Through manual chart review, we established ground truth labels for the presence or absence of MCI/ADRD diagnoses. Our model combined three types of data: (1) unstructured clinical notes, from which we extracted single words, two-word phrases (bigrams), and three-word phrases (trigrams) as features, weighted using Term Frequency-Inverse Document Frequency (TF-IDF) to capture their relative importance; (2) International Classification of Diseases (ICD) codes; and (3) medication prescriptions related to MCI/ADRD. We trained a regularized logistic regression model to predict MCI/ADRD diagnoses and evaluated its performance using standard metrics including area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPRC), accuracy, specificity, precision, recall, and F1 score.

Results: Thirty percent of patients in the cohort carried diagnoses of MCI/ADRD based on manual review. When evaluated on a held-out test set, the best model, which used clinical notes, ICD codes, and medications, achieved an AUROC of 0.98, an AUPRC of 0.98, an accuracy of 0.93, a sensitivity (recall) of 0.91, a specificity of 0.96, a precision of 0.96, and an F1 score of 0.93. The estimated overall accuracy for patients randomly selected from EHRs was 99.88%.

Conclusion: Automated EHR phenotyping accurately identifies patients with MCI/ADRD based on clinical notes, ICD codes, and medication records. This approach holds potential for large-scale MCI/ADRD research utilizing EHR databases.


Background

Globally, 12% to 18% of people aged 60 or older are living with mild cognitive impairment (MCI), and 10% to 15% of individuals living with MCI develop dementia each year. About one-third of people living with MCI due to Alzheimer’s disease (AD) develop dementia within five years. Meanwhile, observational data from electronic health records (EHRs) are an increasingly important resource for research on risk factors and potential interventions for MCI and AD. A key challenge in scaling up EHR-based research is the accurate phenotyping of patients with a diagnosis of MCI or ADRD. Current ICD-based approaches often struggle with data incompleteness and lack the flexibility to capture complex relationships and contextual nuances within clinical language, particularly when unstructured data, such as clinical notes, are involved. Here, we propose a phenotyping model that integrates rule-based methods for structured data, natural language processing (NLP) techniques for unstructured data, and machine learning (ML) to dynamically learn patterns and interactions.


Model Description

Input data included unstructured (i.e., free-text) clinical notes and structured data. Structured data included International Classification of Diseases (ICD) codes for MCI/ADRD and dementia-related medications. The model consists of a TF-IDF transformation followed by logistic regression.
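A minimal sketch of how a model of this kind could be assembled with scikit-learn is shown here. It is an illustration under stated assumptions, not the released pipeline: the toy notes and labels are made up, the way the two binary indicators are appended to the text features is one plausible design, and the regularization strength C is a placeholder. The actual structure of "mci_model.joblib" should be checked by inspecting the loaded object.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy, made-up examples for illustration only (not study data).
notes = [
    "patient reports progressive memory loss, started donepezil",
    "follow up for mild cognitive impairment, stable",
    "annual physical, no cognitive complaints",
    "knee pain after fall, cognition intact",
]
icd_flags = [1, 1, 0, 0]  # 1 if an MCI/ADRD-related ICD code is present
med_flags = [1, 0, 0, 0]  # 1 if an MCI/ADRD-related medication is present
labels = [1, 1, 0, 0]     # 1 = MCI/ADRD according to (hypothetical) chart review

# Unigrams, bigrams, and trigrams, weighted by TF-IDF.
vectorizer = TfidfVectorizer(ngram_range=(1, 3))
X_text = vectorizer.fit_transform(notes)

# Append the two binary structured indicators as extra columns (assumed design).
X_struct = csr_matrix(np.column_stack([icd_flags, med_flags]), dtype=float)
X = hstack([X_text, X_struct]).tocsr()

# LogisticRegression applies L2 regularization by default; C=1.0 is a placeholder.
clf = LogisticRegression(C=1.0, max_iter=1000)
clf.fit(X, labels)

# Predicted probability of the positive (MCI/ADRD) class for each note.
probs = clf.predict_proba(X)[:, 1]
```

Concatenating sparse TF-IDF features with a small number of dense indicator columns keeps the whole feature matrix sparse, which matters when the n-gram vocabulary is large.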


Technical Implementation

This retrospective cohort study aimed to develop and evaluate an ML-based phenotyping model to identify patients with MCI and ADRD by integrating structured data, including ICD codes and medications, with unstructured data from clinical notes. The study utilized data from Massachusetts General Hospital (MGH) and Beth Israel Deaconess Medical Center (BIDMC). Data were collected for visits occurring between January 3, 2012, and November 3, 2017.

The primary outcome was a confirmed diagnosis of MCI or ADRD, determined through manual chart review.

ICD codes:
  • Alzheimer's disease – ICD-10 F00, G30.0, G30.1, G30.8, G30.9 and ICD-9 290.0, 290.2x, 290.3, 331.0
  • Vascular dementia – ICD-10 F01.x and ICD-9 290.4x
  • Lewy body dementia – ICD-10 G31.83 and ICD-9 331.82
  • Frontotemporal dementia – ICD-10 G31.0, G31.01, G31.09 and ICD-9 331.11, 331.19
  • Unspecified dementias – ICD-10 F02.8x, F03.9x and ICD-9 294.1x, 294.2x
  • Mild cognitive impairment – ICD-10 F06.7 and ICD-9 331.83

Medications: Aricept, donepezil, Exelon, rivastigmine, memantine, Namenda, Namzaric, Razadyne, and galantamine.
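The code and medication lists above can be turned into the binary indicators the model expects. The sketch below is one possible way to do this, not the authors' code: the function names, the (date, value) event format, prefix matching for the "x" wildcards, and the use of 183 days for "6 months" are all assumptions, and only a subset of the listed ICD codes is included for brevity.

```python
from datetime import date

# Subset of the ICD codes listed above, written as prefixes
# (trailing "x" wildcards dropped). Prefix matching is an assumption.
ICD_PREFIXES = (
    "F00", "G30.0", "G30.1", "G30.8", "G30.9",   # Alzheimer's disease (ICD-10)
    "F01", "G31.83", "G31.0",                    # vascular, Lewy body, frontotemporal
    "F02.8", "F03.9", "F06.7",                   # unspecified dementias, MCI
    "331.0", "290.4", "331.82", "294.1", "294.2", "331.83",  # ICD-9
)

MED_NAMES = (
    "aricept", "donepezil", "exelon", "rivastigmine", "memantine",
    "namenda", "namzaric", "razadyne", "galantamine",
)


def _in_window(event_date, note_date, window_days):
    """True if the event falls within +/- window_days of the note date."""
    return abs((event_date - note_date).days) <= window_days


def icd_flag(events, note_date, window_days=183):
    """Return 1 if any (date, icd_code) event in the window matches a prefix."""
    for event_date, code in events:
        normalized = code.strip().upper()
        if _in_window(event_date, note_date, window_days) and normalized.startswith(ICD_PREFIXES):
            return 1
    return 0


def med_flag(events, note_date, window_days=183):
    """Return 1 if any (date, medication_text) event in the window names a listed drug."""
    for event_date, text in events:
        lowered = text.lower()
        if _in_window(event_date, note_date, window_days) and any(m in lowered for m in MED_NAMES):
            return 1
    return 0
```

For example, a note dated 2020-01-15 with an ICD event ("G30.9") on 2019-11-01 would get an ICD indicator of 1, while an event from 2018 would fall outside the window and be ignored.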

To evaluate model performance and assess generalizability, we used stratified 5-fold nested cross-validation.
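As an illustration of nested cross-validation, the sketch below tunes the regularization strength in an inner loop and estimates performance in an outer loop, so the held-out folds never influence hyperparameter selection. It uses synthetic data; the number of inner folds, the grid of C values, and the choice of AUROC as the tuning metric are assumptions, not details reported for this model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic data with roughly 30% positives (stand-in for real features and labels).
X, y = make_classification(
    n_samples=300, n_features=20, weights=[0.7, 0.3], random_state=0
)

# Inner loop: selects the hyperparameter. Outer loop: estimates performance.
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # placeholder grid
    scoring="roc_auc",
    cv=inner_cv,
)

# Each outer fold refits the full grid search on its own training portion.
outer_scores = cross_val_score(search, X, y, scoring="roc_auc", cv=outer_cv)
print("AUROC per outer fold:", outer_scores)
print("Mean AUROC:", outer_scores.mean())
```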


Installation and Requirements

Python 3.9+ with the scikit-learn (sklearn) and joblib packages.


Usage Notes

Load the model file "mci_model.joblib" with the joblib package and inspect its pipeline to confirm the expected inputs. The model requires three inputs: (1) a list of notes (strings); (2) a list of 1/0 indicators of whether MCI/ADRD-related ICD codes are present within 6 months before or after the date of the note; and (3) a list of 1/0 indicators of whether MCI/ADRD-related medications are present within 6 months before or after the date of the note.
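A minimal loading sketch follows. The helper names (load_model, describe, check_inputs) are hypothetical, and the requirement that the three lists have the same length is an assumption based on the description above. How the three inputs must be packaged for prediction is not documented here, so the sketch stops at loading, inspection, and input validation. Because joblib files are pickle-based, only load files from a trusted source.

```python
from pathlib import Path

import joblib


def load_model(path="mci_model.joblib"):
    """Load the serialized model; the default path assumes the file is in the working directory."""
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(f"Model file not found: {path}")
    return joblib.load(path)


def describe(model):
    """Return (step_name, class_name) pairs for a Pipeline, or a single pair otherwise."""
    steps = getattr(model, "steps", None)
    if steps is None:
        return [("model", type(model).__name__)]
    return [(name, type(step).__name__) for name, step in steps]


def check_inputs(notes, icd_flags, med_flags):
    """Validate the three inputs and return the number of notes (equal lengths assumed)."""
    if not (len(notes) == len(icd_flags) == len(med_flags)):
        raise ValueError("notes, icd_flags, and med_flags must have the same length")
    if not all(isinstance(n, str) for n in notes):
        raise ValueError("each note must be a string")
    if not all(f in (0, 1) for f in list(icd_flags) + list(med_flags)):
        raise ValueError("indicators must be 0 or 1")
    return len(notes)


# Example (requires the downloaded file in the working directory):
# model = load_model()
# print(describe(model))
```

Printing the loaded object, or the output of describe, shows the pipeline steps and is the most reliable way to see what input format the model expects.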


Release Notes

This version may produce false positives when notes mention MCI/ADRD in the context of family history or negation.


Ethics

The study protocol was approved by the Institutional Review Board of Beth Israel Deaconess Medical Center (protocol # 2022P000417). Written informed consent was waived because of the retrospective study design, which posed minimal risk to participants. The study also complied with the Declaration of Helsinki.


Acknowledgements

This work was supported by grants from the National Institutes of Health (NIH): R01NS102190, R01NS102574, R01NS107291, RF1AG064312, RF1NS120947, R01AG073410, R01HL161253, R01NS126282, R01AG073598, and the National Science Foundation (NSF): 2014431.


Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.


Access

Access Policy:
Anyone can access the files, as long as they conform to the terms of the specified license.

License (for files):
BDSP non-commercial use

Versions
  • 1.0 - Sept. 25, 2025
  • 1.1 - Sept. 25, 2025

Files

Total uncompressed size: 0 B.

Name Size Modified
LICENSE.txt 0 B 2025-09-25
mci_model.joblib 87.3 MB 2025-09-25