Database Credentialed Access
Automated phenotyping of mild cognitive impairment and dementias using electronic health records - Data and Code
Ruoqi Wei, Robert Thomas, M Brandon Westover, Haoqi Sun
Published: June 5, 2025. Version: 1.0
When using this resource, please cite:
Wei, R., Thomas, R., Westover, M. B., & Sun, H. (2025). Automated phenotyping of mild cognitive impairment and dementias using electronic health records - Data and Code (version 1.0). Brain Data Science Platform. https://doi.org/10.60508/z6ap-6g20.
Abstract
This repository contains data and code for the paper "Automated phenotyping of mild cognitive impairment and Alzheimer's disease and related dementias using electronic health records." We developed and evaluated a machine learning model that integrates unstructured clinical notes, ICD codes, and medication data to identify patients with mild cognitive impairment (MCI) and Alzheimer's disease and related dementias (ADRD). The model achieved high performance and generalized well across two hospital systems.
Background
Timely and accurate identification of patients with Alzheimer's disease and related dementias (ADRD) or mild cognitive impairment (MCI) is critical for improving patient outcomes. Electronic health records (EHR) contain rich information for research, but manual chart review is labor-intensive and impractical at scale. Our automated EHR phenotyping model streamlines this process by analyzing clinical notes, ICD codes, and medication data to identify patients with MCI or ADRD.
Methods
We analyzed data from 3,626 outpatients seen at Massachusetts General Hospital (MGH) and Beth Israel Deaconess Medical Center (BIDMC) between February 2015 and June 2022. Through manual chart review, we established ground truth labels for MCI/ADRD diagnoses. We trained a regularized logistic regression model using three types of data: (1) unstructured clinical notes with TF-IDF weighted features, (2) ICD codes, and (3) medication prescriptions related to MCI/ADRD. Data were extracted under protocols approved by the MGH and BIDMC Institutional Review Boards (IRB protocol numbers 2013P001024 and 2023P000487 for MGH, and 2022P000417 for BIDMC) with waivers of informed consent.
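The three feature types described above can be combined into a single design matrix before fitting the classifier. The following is a minimal sketch using scikit-learn, not the authors' implementation: the example notes, the two flag columns, the default L2 penalty, and the value of `C` are all assumptions made for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy, made-up inputs for illustration only; none of this comes from the dataset.
notes = [
    "patient reports memory loss and confusion",
    "follow up for hypertension, cognition intact",
    "progressive memory decline, started donepezil",
    "annual visit, no cognitive complaints",
    "family notes forgetfulness and disorientation",
    "knee pain after fall, alert and oriented",
]
# Hypothetical binary flags per patient: [has_relevant_icd_code, has_relevant_medication]
flags = np.array([[1, 0], [0, 0], [1, 1], [0, 0], [0, 1], [0, 0]])
# Hypothetical chart-review labels: 1 = MCI/ADRD, 0 = neither
labels = np.array([1, 0, 1, 0, 1, 0])

# (1) TF-IDF weighted features from the free-text notes.
vectorizer = TfidfVectorizer()
X_text = vectorizer.fit_transform(notes)

# (2) and (3): append the ICD and medication flags as extra columns.
X = hstack([X_text, csr_matrix(flags)]).tocsr()

# Regularized logistic regression. scikit-learn applies an L2 penalty by
# default; the penalty type and the strength C used in the paper are not
# stated here, so both are assumptions.
model = LogisticRegression(C=1.0, max_iter=1000)
model.fit(X, labels)

# Predicted probability of the positive (MCI/ADRD) class for each patient.
probabilities = model.predict_proba(X)[:, 1]
print(X.shape, probabilities.round(2))
```

Stacking the sparse TF-IDF matrix with the flag columns keeps the note vocabulary and the structured indicators in one model, so each coefficient can be read against a single named feature.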
Data Description
The repository contains deidentified model features and outputs used in our analysis. The data includes TF-IDF features extracted from clinical notes, binary flags for relevant ICD codes and medications, and the ground truth labels from manual chart review. We provide separate datasets for the MGH and BIDMC cohorts, enabling reproducibility of our cross-institution validation experiments.
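Because model outputs and chart-review labels are provided per cohort, each site can be scored separately. The sketch below uses only the Python standard library and made-up numbers; the cohort names, the 0.5 threshold, and the choice of AUROC, sensitivity, and specificity as metrics are assumptions for illustration, not a description of the released files or of the paper's evaluation.

```python
def auroc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(labels, scores, threshold=0.5):
    """Sensitivity and specificity when scores >= threshold are called positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Made-up (labels, predicted probabilities) for two hypothetical cohorts.
cohorts = {
    "site_a": ([1, 1, 0, 0, 1, 0], [0.9, 0.7, 0.2, 0.4, 0.6, 0.1]),
    "site_b": ([1, 0, 0, 1, 0, 1], [0.8, 0.3, 0.6, 0.4, 0.2, 0.9]),
}

results = {}
for name, (y_true, y_score) in cohorts.items():
    sens, spec = sensitivity_specificity(y_true, y_score)
    results[name] = {
        "auroc": auroc(y_true, y_score),
        "sensitivity": sens,
        "specificity": spec,
    }
    print(name, results[name])
```

Reporting the same metrics for each cohort side by side makes any drop in performance between the development site and the external site easy to see.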
Usage Notes
Data and code to generate all results and figures from the publication are provided here. The repository includes notebooks for data preprocessing, model training, evaluation, and visualization of results as presented in the paper.
Ethics
This study of human subjects was approved by the Massachusetts General Hospital and Beth Israel Deaconess Medical Center Institutional Review Boards (IRB protocol numbers 2013P001024 and 2023P000487 for MGH, and 2022P000417 for BIDMC). The IRBs provided waivers of informed consent for this study. All data is deidentified in accordance with HIPAA guidelines.
Acknowledgements
This work was supported by grants from the National Institutes of Health (NIH): R01NS102190, R01NS102574, R01NS107291, RF1AG064312, RF1NS120947, R01AG073410, R01HL161253, R01NS126282, R01AG073598, and from the National Science Foundation (NSF): 2014431.
Conflicts of Interest
M.B.W. is a co-founder of and holds equity in Beacon Biosignals. Beacon Biosignals neither contributed funding nor played any role in the study. The remaining authors declare no competing interests.
References
- Wei R, Buss SS, Milde R, Fernandes M, Sumsion D, Davis E, Kong WY, Xiong Y, Veltink J, Rao S, Westover TM, Petersen L, Turley N, Singh A, Das S, Junior VM, Ghanta M, Gupta A, Kim J, Lam AD, Stone KL, Mignot E, Hwang D, Trotti LM, Clifford GD, Katwa U, Thomas RJ, Mukerji S, Zafar SF, Westover MB, Sun H. Automated phenotyping of mild cognitive impairment and Alzheimer's disease and related dementias using electronic health records. Int J Med Inform. 2025 Aug;200:105917. doi: 10.1016/j.ijmedinf.2025.105917. Epub 2025 Apr 11. PMID: 40222334.
Access
Access Policy:
Only credentialed users who sign the DUA can access the files.
License (for files):
BDSP Credentialed Health Data License 1.5.0
Data Use Agreement:
BDSP Credentialed Health Data Use Agreement
Required training:
CITI Data or Specimens Only Research
Discovery
DOI:
https://doi.org/10.60508/z6ap-6g20
Project Website:
https://github.com/bdsp-core/NAX-MCI-AD
Files
To access the files, you must:
- be a credentialed user
- complete the required training: CITI Data or Specimens Only Research
- sign the data use agreement for the project