Database Restricted Access
NIDX: A Machine Learning Approach for Identifying People with Neuroinfectious Diseases in Electronic Health Records
Arjun Singh , Shadi Sartipi , Haoqi Sun , Niels Turley , Sahar Zafar , Sudeshna Das , Marta Fernandes , M Brandon Westover , Shibani Mukerji
Published: May 31, 2025. Version: 1.0
When using this resource, please cite:
Singh, A., Sartipi, S., Sun, H., Turley, N., Zafar, S., Das, S., Fernandes, M., Westover, M. B., & Mukerji, S. (2025). NIDX: A Machine Learning Approach for Identifying People with Neuroinfectious Diseases in Electronic Health Records (version 1.0). Brain Data Science Platform. https://doi.org/10.60508/k3z8-cd24.
Abstract
We developed a machine learning model to identify neuroinfectious diseases (NID) from electronic health record notes. Using 3,000 notes from Mass General Brigham, we trained an XGBoost model on text features extracted from clinical documentation. Our model achieved excellent performance with an AUROC of 0.977 and AUPRC of 0.894 on internal validation, significantly outperforming both ICD-code based identification (which showed high sensitivity of 97.1% but poor specificity of 59.1%) and zero-shot classification using LLaMA 3.2 (AUROC 0.80). The model maintained strong performance when tested on an external dataset of 600 notes from Beth Israel Deaconess Medical Center (AUROC 0.976, AUPRC 0.779). This approach provides an accurate, automated method for identifying patients with neuroinfectious diseases from clinical notes, enabling more precise cohort creation for research compared to traditional ICD code-based methods.
Background
Meningitis and encephalitis pose serious threats to health, potentially leading to severe neurological compromise or death. Many neuroinvasive pathogens, including viruses, bacteria, and fungi, are linked to long-term cognitive sequelae. Recent population-based studies suggest associations between pathogen exposures, especially neuroinvasive infections causing encephalitis or meningitis, and subsequent risk of developing neurodegenerative conditions such as Alzheimer's disease.
These findings underscore the need for comprehensive, longitudinal studies to elucidate the mechanisms by which neuroinvasive pathogens might contribute to neurodegenerative diseases. However, such research is severely hampered by the scarcity of large, well-annotated hospital datasets. Accurately identifying neuroinfectious diseases (NIDs) in hospital patient records remains a significant challenge for supporting evidence-based clinical decision-making and enabling large-scale research into their long-term consequences.
Most epidemiological studies linking infectious disease burden to subsequent dementia rely on healthcare databases that use International Classification of Diseases (ICD) codes but do not validate diagnoses with microbial assays. The limited reliability of ICD codes in accurately identifying infectious diseases has been previously reported, with positive predictive values as low as 56-57% observed in some registries.
Recent efforts to move beyond ICD-based coding have aimed at improving the timely and accurate diagnosis of NID etiologies, particularly in differentiating between viral and bacterial meningitis. These efforts predominantly employ approaches such as decision trees and ensemble methods, but most classification studies on NIDs lack external validation or rely on small datasets with limited cross-validation. To our knowledge, no study has utilized unstructured data such as clinical notes, which limits the generalizability and insights that could be derived.
Natural language processing (NLP) tools can leverage the rich information present in large-scale comprehensive electronic health records (EHRs), including unstructured clinical assessments and notes. NLP holds the potential to surpass the accuracy of traditional ICD-based phenotyping while reducing reliance on semi-structured data formats. Our goal is to enhance existing methods to achieve a more accurate classification of NID cases in EHRs.
Methods
We included 34,556 patients who underwent lumbar punctures (LP), sourcing medical notes from the EHR database of the Mass General Brigham (MGB) network of hospitals. This approach enriched the cohort for individuals likely to be evaluated for CNS infections, with a total of 4,971,491 notes extracted between 01/22/2010 and 09/21/2023. ICD diagnostic codes billed within one calendar day before or after each note were extracted and labeled as likely to be associated with NID. For external validation, we also extracted 600 notes from patients in the Beth Israel Deaconess Medical Center (BIDMC) EHR system.
We excluded notes with fewer than 5,000 characters, as they likely lacked the necessary detail for clinicians to accurately assess patient status or provide valid training for models. We categorized notes into two groups: (1) notes with NID-related ICD codes (969 unique patients, 44,259 notes) and (2) those without NID ICD codes (24,320 unique patients, 641,063 notes). From these two groups, we randomly selected 3,000 notes from 2,469 patients at MGB, of which 50% were associated with NID-related ICD codes, and 50% were not.
For ground-truth labeling, 39,245 regular expressions were formulated based on domain-specific knowledge, categorized as (1) Positive NID, (2) Negative NID, (3) NID drugs, and (4) NID-likely keywords. Six physicians, all domain experts in neuroimmunology or neuroinfectious diseases, independently classified 500 notes each from the MGB dataset following a standardized operating procedure. Ambiguous cases were reviewed by an independent physician who was not involved in the initial classification.
Preprocessing notes to identify model features required: (1) limiting text to regions matching regular expressions with a 100-character buffer, (2) converting to lowercase letters, (3) removing non-alphabetical characters, (4) replacing consecutive whitespace characters with a single space, (5) keeping only words with more than 2 characters, and (6) removing stop words. The remaining text was lemmatized using WordNetLemmatizer. The text was then transformed into a bag-of-words representation, considering unigrams (1-gram), bigrams (2-gram), and trigrams (3-gram).
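The preprocessing steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the study's code: PATTERNS and STOP_WORDS are hypothetical stand-ins for the 39,245 expert-written regular expressions and the stop-word list, all function names are invented for this example, and the lemmatization step is left out because WordNetLemmatizer requires NLTK and its WordNet data.

```python
import re
from collections import Counter

# Hypothetical stand-ins: the real pipeline used 39,245 expert-written
# patterns and its own stop-word list; neither is reproduced here.
PATTERNS = [re.compile(r"meningitis", re.IGNORECASE)]
STOP_WORDS = {"the", "and", "with", "for", "was", "were"}
BUFFER = 100  # characters kept on each side of a regex match


def matched_regions(text, patterns, buffer=BUFFER):
    """Step 1: keep only text within `buffer` characters of any match."""
    spans = sorted(
        (max(0, m.start() - buffer), min(len(text), m.end() + buffer))
        for p in patterns
        for m in p.finditer(text)
    )
    merged = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            # Overlapping windows are merged so text is not duplicated.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return " ".join(text[s:e] for s, e in merged)


def tokenize(text):
    """Steps 2-6: lowercase, strip non-letters, collapse spaces, filter words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    # Lemmatization (WordNetLemmatizer in the source) would be applied here.
    return [w for w in text.split(" ") if len(w) > 2 and w not in STOP_WORDS]


def ngram_counts(tokens, max_n=3):
    """Bag-of-words counts over unigrams, bigrams, and trigrams."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def featurize(note):
    return ngram_counts(tokenize(matched_regions(note, PATTERNS)))


features = featurize("CSF PCR was positive; bacterial meningitis suspected.")
```

A note with no pattern match yields an empty feature set, since nothing survives the region filter in step 1.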
To reduce the number of features and enhance model interpretability, we employed an iterative approach using Logistic Regression with L1 regularization. We addressed class imbalances by incorporating class weights inversely proportional to the sample size of each class. We used this process to identify the most common features, resulting in a set of 1,284 features. A manual review by three experts eliminated the remaining irrelevant features, leading to a reduced set of 342 non-zero features.
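As an illustration of class weights inversely proportional to class size, the sketch below uses the formula N / (K * n_c), which is the convention behind scikit-learn's class_weight="balanced" option. Whether the study used exactly this formula is an assumption, and the label counts are taken from the full 3,000-note sample (479 NID, 2,521 non-NID) purely for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}


# Illustrative labels only: 479 expert-confirmed NID notes out of 3,000.
labels = [1] * 479 + [0] * 2521
weights = balanced_class_weights(labels)
```

With these weights, each class contributes the same total weight to the loss, so the minority NID class is not swamped by the majority class.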
The MGB dataset was randomly split 80:20 into a training set (2,400 notes) and a holdout test set (600 notes), maintaining the 50/50 ICD code distribution. We utilized 5-fold cross-validation on the MGB training dataset to select the optimal model. Given the imbalanced nature of the dataset, with only 16% of the total observations positive for NID, we compared Logistic Regression, Random Forest, and XGBoost using the Area Under the Precision-Recall Curve (AUPRC). XGBoost was selected for its superior AUPRC performance.
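A split that preserves the 50/50 ICD code distribution can be obtained by shuffling and splitting each stratum separately. The sketch below is one possible way to do this, not the authors' implementation; the function name, the seed, and the synthetic note IDs are all assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_split(items, strata, test_fraction=0.2, seed=0):
    """Split items so each stratum contributes the same test fraction."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for item, stratum in zip(items, strata):
        by_stratum[stratum].append(item)
    train, test = [], []
    for stratum in sorted(by_stratum):
        group = by_stratum[stratum]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Synthetic IDs: the first 1,500 notes stand in for those with an
# NID-related ICD code, the remaining 1,500 for those without.
note_ids = list(range(3000))
has_nid_icd = [i < 1500 for i in note_ids]
train_ids, test_ids = stratified_split(note_ids, has_nid_icd)
```

With 1,500 notes per stratum and a 20% test fraction, this yields 2,400 training notes and 600 holdout notes, 300 from each ICD group.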
We then trained an XGBoost model on the 2,400 training notes and tested its performance on both the holdout dataset (445 notes from MGB) and an external dataset consisting of 600 notes from patients at BIDMC. Model performance was assessed using the AUPRC and the AUROC. We used bootstrapping with replacement over 1,000 iterations to estimate the 95% confidence intervals (CIs) for these metrics.
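The bootstrap procedure can be sketched as follows. This is a minimal percentile-bootstrap example under stated assumptions: the function names, the seed, and the toy labels and scores are invented for illustration, and the study may have used a different interval construction. AUROC is computed directly as the probability that a positive note outscores a negative one; an AUPRC function could be passed as the metric argument in the same way.

```python
import random


def auroc(labels, scores):
    """AUROC as P(positive score > negative score); ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, metric=auroc, n_iter=1000, alpha=0.05, seed=0):
    """Percentile CI from n_iter resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_iter:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # skip resamples that contain only one class
        stats.append(metric(ys, [scores[i] for i in idx]))
    stats.sort()
    lo_idx = int(n_iter * alpha / 2)
    return stats[lo_idx], stats[n_iter - lo_idx - 1]


# Toy data for illustration only (not study results).
toy_labels = [0] * 30 + [1] * 10
toy_scores = [0.01 * i for i in range(30)] + [0.2 + 0.05 * j for j in range(10)]
point = auroc(toy_labels, toy_scores)
ci_low, ci_high = bootstrap_ci(toy_labels, toy_scores)
```

The pairwise AUROC computation is quadratic in the number of notes, which is acceptable for a sketch but slow for large test sets.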
We also evaluated the performance of the 3B-parameter LLaMA 3.2 auto-regressive model as a zero-shot learning approach for predicting NID from clinical reports, contrasting it with our NLP-based method. LLaMA was not fine-tuned or trained on examples for this task; instead, it directly classified NID presence or absence from clinical notes formatted in JSON.
This study was approved by the Mass General Brigham Institutional Review Board (IRB approval #2017P001133).
Data Description
The dataset consists of clinical notes from 2,469 patients comprising 3,000 notes from Mass General Brigham (MGB) and 600 notes from Beth Israel Deaconess Medical Center (BIDMC). The median age (IQR) of the MGB cohort was 61.0 years (46-73 years); 55% (n=1350) were female sex at birth, 77% (n=1911) reported being of white race, and 83% (n=2041) reported being of non-Hispanic ethnicity.
Out of the 3,000 MGB notes, 16% (479 notes) were labeled as NID based on expert review, and 97% of these 479 notes had an NID-related ICD code. Among the 1,500 notes with an NID-related ICD code, 465 notes (31%) were confirmed as NID cases by expert review. Conversely, among the 1,500 notes without an NID-related ICD code, 0.93% (14 notes) were identified as NID cases by expert review.
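The counts in the paragraph above define a 2x2 table for ICD codes against expert review, from which the usual screening metrics follow. The worked example below is a sketch derived only from those counts over the full 3,000-note sample; figures reported elsewhere for other evaluation subsets may differ slightly.

```python
# Counts taken from the paragraph above (ICD code vs. expert label).
tp = 465           # ICD-positive notes confirmed as NID
fn = 479 - tp      # NID notes without an NID-related ICD code (14)
fp = 1500 - tp     # ICD-positive notes not confirmed as NID (1,035)
tn = 1500 - fn     # ICD-negative notes without NID (1,486)

sensitivity = tp / (tp + fn)   # share of NID notes carrying an ICD code
specificity = tn / (tn + fp)   # share of non-NID notes without an ICD code
ppv = tp / (tp + fp)           # share of ICD-positive notes that are NID

print(f"sensitivity={sensitivity:.3f} specificity={specificity:.3f} ppv={ppv:.3f}")
```

On these counts, ICD codes capture about 97% of expert-confirmed NID notes, but only 31% of ICD-positive notes are true NID cases, which is the gap the text-based model is meant to close.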
The dataset includes features extracted from clinical notes, with the top features for NID prediction including direct markers of central nervous system inflammation (e.g., 'meningitis,' 'ventriculitis,' and 'meningoencephalitis'), diagnostic tests (e.g., 'cytology,' 'viral load,' and 'pcr'), cell types ('lymphocytic'), specific pathogens, and medical conditions associated with NID.
Usage Notes
Data and code to generate all results and figures from the publication are provided here.
Ethics
This study of human subjects was approved by the Mass General Brigham Institutional Review Board (IRB approval #2017P001133), including review of electronic health record data. The Partners Healthcare Human Research Committee provided a waiver of written consent for this study. All data is deidentified.
Conflicts of Interest
M.B.W. is a co-founder of and holds equity in Beacon Biosignals. Beacon Biosignals neither contributed funding nor played any role in the study.
References
- https://www.neurology.org/doi/10.1212/WNL.0000000000204782
Access
Access Policy:
Only registered users who sign the specified data use agreement can access the files.
License (for files):
BDSP Restricted Health Data License 1.0.0
Data Use Agreement:
BDSP Restricted Health Data Use Agreement
Files
- sign the data use agreement for the project