Name: The Brain Imaging and Neurophysiology Database (BIND)
Published: Sept. 9, 2025
License: https://github.com/bdsp-core/bdsp-license-and-dua

Database Credentialed Access

Charlotte Maschke, Peter Hadar, Yicheng Zhang, Jian Li, Gauri Ganjoo, Andrew Hoopes, Alessandro Guazzo, Aditya Gupta, Manohar Ghanta, Bruce Nearing, Christine Tsien Silvers, Bharath Gunapati, Robert Thomas, Jennifer Kim, Shibani Mukerji, Adrian Dalca, Sahar Zafar, Alice Lam, Emmanuel Mignot, M Brandon Westover

Published: Sept. 9, 2025. Version: 1.0

When using this resource, please cite:

MLA	Maschke, Charlotte, et al. "The Brain Imaging and Neurophysiology Database (BIND)" (version 1.0). Brain Data Science Platform (2025), https://doi.org/10.60508/mby8-3a26.
APA	Maschke, C., Hadar, P., Zhang, Y., Li, J., Ganjoo, G., Hoopes, A., Guazzo, A., Gupta, A., Ghanta, M., Nearing, B., Silvers, C. T., Gunapati, B., Thomas, R., Kim, J., Mukerji, S., Dalca, A., Zafar, S., Lam, A., Mignot, E., & Westover, M. B. (2025). The Brain Imaging and Neurophysiology Database (BIND) (version 1.0). Brain Data Science Platform. https://doi.org/10.60508/mby8-3a26.
Chicago	Maschke, Charlotte, Hadar, Peter, Zhang, Yicheng, Li, Jian, Ganjoo, Gauri, Hoopes, Andrew, Guazzo, Alessandro, Gupta, Aditya, Ghanta, Manohar, Nearing, Bruce, Silvers, Christine Tsien, Gunapati, Bharath, Thomas, Robert, Kim, Jennifer, Mukerji, Shibani, Dalca, Adrian, Zafar, Sahar, Lam, Alice, Mignot, Emmanuel, and M Brandon Westover. "The Brain Imaging and Neurophysiology Database (BIND)" (version 1.0). Brain Data Science Platform (2025). https://doi.org/10.60508/mby8-3a26.
Harvard	Maschke, C., Hadar, P., Zhang, Y., Li, J., Ganjoo, G., Hoopes, A., Guazzo, A., Gupta, A., Ghanta, M., Nearing, B., Silvers, C. T., Gunapati, B., Thomas, R., Kim, J., Mukerji, S., Dalca, A., Zafar, S., Lam, A., Mignot, E., and Westover, M. B. (2025) 'The Brain Imaging and Neurophysiology Database (BIND)' (version 1.0), Brain Data Science Platform. Available at: https://doi.org/10.60508/mby8-3a26.
Vancouver	Maschke C, Hadar P, Zhang Y, Li J, Ganjoo G, Hoopes A, Guazzo A, Gupta A, Ghanta M, Nearing B, Silvers C T, Gunapati B, Thomas R, Kim J, Mukerji S, Dalca A, Zafar S, Lam A, Mignot E, Westover M B. The Brain Imaging and Neurophysiology Database (BIND) (version 1.0). Brain Data Science Platform. 2025. Available from: https://doi.org/10.60508/mby8-3a26.

Abstract

The Brain Imaging and Neurophysiology Database (BIND) represents one of the largest multi-institutional, multimodal neuroimaging repositories, comprising 1.8 million brain scans from 38,945 subjects linked to neurophysiological recordings. This comprehensive dataset addresses critical limitations in neuroimaging research by providing unprecedented scale and diversity across pathologies and healthy controls. BIND integrates de-identified data from three major academic medical centers -- Massachusetts General Hospital, Brigham and Women's Hospital, and Stanford University Medical Center -- including 1,724,300 MRI scans (1.5T, 3T, and 7T), 54,154 CT scans, 5,720 PET scans, and 655 SPECT scans, converted to standardized NIfTI format following BIDS organization. The database spans the full age spectrum and encompasses diverse neurological conditions alongside healthy subjects. We deployed Bio-Medical Large Language Models to extract structured clinical metadata from 84,960 associated radiology reports, categorizing findings into standardized pathology classifications. All imaging data are linked to previously published EEG and polysomnography recordings from the Harvard Electroencephalography Database and Human Sleep Project, enabling unprecedented multimodal analyses. BIND is freely accessible for academic research through the Brain Data Science Platform (https://bdsp.io/). This resource facilitates large-scale neuroimaging studies, machine learning applications, and multimodal brain research to accelerate discoveries in clinical neuroscience.

Background

The past decade has witnessed remarkable growth in neuroinformatics, with computational advances unlocking new potential from neuroimaging data. Machine learning techniques and computational neuroimaging analyses now demonstrate performance matching or exceeding that of human neuroradiologists. These methods can detect subtle findings in neurological conditions including epilepsy, stroke, multiple sclerosis, traumatic brain injury, and dementia, yielding novel insights and advancing our understanding of these conditions and of human neuroscience.

However, progress has been constrained by datasets that are small, institution-specific, and focused on narrow clinical populations. The vast diversity of neurological conditions and significant inter-rater variability require large-scale datasets for robust analysis. While researchers have created various neuroimaging databases, most target specific disorders and fail to capture the natural variability encountered in clinical neurological practice. Open data sharing is essential for driving innovation in neuroscience and developing practical clinical applications.

BIND addresses these limitations as a comprehensive resource for the research community. This large-scale multimodal imaging database contains over 1.8 million scans from nearly 39,000 patients who underwent clinical electroencephalogram (EEG) or polysomnogram (PSG) testing over three decades. The dataset encompasses diverse pathologies and normal findings across all age ranges, from newborns to elderly patients. Available modalities include MRI (acquired at 1.5, 3, and 7 Tesla field strengths), Positron Emission Tomography (PET), Single Photon Emission Computed Tomography (SPECT), and CT. MRI sequences span structural, diffusion, perfusion, and functional imaging protocols. The corresponding EEG and PSG data are available through the Harvard Electroencephalography Database, and all modalities can be linked with additional neurological testing and electronic health record data through the Brain Data Science Platform (BDSP). By providing this comprehensive resource to researchers worldwide, BIND aims to accelerate neuroscience discoveries and enable innovative clinical applications that improve patient care and outcomes.

Methods

Data Acquisition

The imaging dataset consists of clinical brain imaging scans acquired from three major academic medical centers. Scans were identified retrospectively from IRB-approved chart review under institutional protocols, which provided waivers of consent for retrospective data analysis; no prospective data acquisition or participant recruitment was performed. The dataset includes imaging from all patients who had undergone EEG (routine or long-term monitoring) or polysomnography testing for clinical purposes over a 30-year period.

Data De-identification

Data were de-identified by institutional de-identification services following strict HIPAA Safe Harbor standards. Demographics and imaging metadata were processed through each institution's imaging service, and all data were date-shifted for additional privacy protection. Clinical reports underwent automated de-identification, and all images were processed through optical character recognition (OCR) to remove any embedded text that could contain identifiable information.

Access to the data requires "controlled access" protocols: all users must sign a mandatory Data Use Agreement with strict terms and conditions and provide proof of CITI training certification. These agreements prohibit attempts to reidentify individual records or further sharing of the data.

Computational Processing

All imaging data were converted from DICOM format to NIfTI format using dcm2niix software. Sequence identification and clinical metadata extraction were performed using custom software tools developed specifically for this dataset.
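
A conversion step of this kind can be sketched as follows. This is a minimal illustration rather than the pipeline used for BIND: the function names, the output filename pattern, and the flag choices are assumptions, and the `dcm2niix` executable must be installed and on the PATH for `convert_series` to run.

```python
import subprocess
from pathlib import Path


def build_dcm2niix_command(dicom_dir, out_dir, name_pattern="%p_%s"):
    """Return the dcm2niix argument list for one DICOM series folder.

    The flag choices are illustrative assumptions, not the settings used for BIND.
    """
    return [
        "dcm2niix",
        "-z", "y",           # write gzip-compressed .nii.gz files
        "-b", "y",           # also write a BIDS-style JSON sidecar
        "-f", name_pattern,  # output name: protocol name and series number
        "-o", str(out_dir),  # output directory
        str(dicom_dir),      # input directory containing DICOM files
    ]


def convert_series(dicom_dir, out_dir):
    """Run dcm2niix on one folder; raises CalledProcessError on failure."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_dcm2niix_command(dicom_dir, out_dir), check=True)
```

Separating command construction from execution makes it possible to log or inspect the exact call before running it on a large batch.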

MRI Sequence Standardization

To standardize MRI sequence naming across the heterogeneous multi-institutional dataset, we implemented a metadata-driven approach using acquisition parameters from de-identified DICOM headers. Key parameters included Image Type, Scanning Sequence, Sequence Variant, Echo Time (TE), Repetition Time (TR), Inversion Time (TI), Flip Angle, Diffusion B Value, and Echo Train Length.

Using parameter-guided thresholds based on MRI physics principles and unsupervised clustering with Gaussian Mixture Models, sequences were classified into standardized categories:

T1-weighted (T1)
T2-weighted (T2)
T2 fluid-attenuated inversion recovery (FLAIR)
Diffusion-weighted imaging (DWI)
Functional MRI (fMRI)
Susceptibility-weighted imaging (SWI)
Perfusion-weighted imaging (PWI)
Magnetic resonance angiography (MRA)
Localizer, other, and unknown categories

Additional keyword-based matching was applied using scanner-generated sequence descriptions when available after de-identification. The sequence identification accuracy was validated using manually annotated ground truth datasets. Custom code for sequence identification is available on GitHub.
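
The parameter-guided part of this approach can be illustrated with a small rule-based classifier. This is a hedged sketch only: every threshold below is a generic rule of thumb chosen for illustration, not a value used for BIND, the function name and arguments are hypothetical, and the Gaussian Mixture Model clustering step is omitted entirely. The project's GitHub repository remains the reference for the actual method.

```python
def classify_sequence(te_ms, tr_ms, ti_ms=None, b_value=None, description=""):
    """Assign a coarse MRI sequence label from acquisition parameters.

    All thresholds are illustrative rules of thumb, NOT the values used for BIND.
    """
    text = description.lower()
    # Keyword matching on the scanner description, when one survived de-identification.
    for keyword, label in (("flair", "FLAIR"), ("swi", "SWI"), ("localizer", "localizer")):
        if keyword in text:
            return label
    # A non-zero diffusion b-value indicates diffusion weighting.
    if b_value is not None and b_value > 0:
        return "DWI"
    # Long inversion time with long TR and long TE suggests FLAIR.
    if ti_ms is not None and ti_ms >= 1500 and tr_ms >= 5000 and te_ms >= 70:
        return "FLAIR"
    # Short TE with either a short TR or a short inversion time suggests T1 weighting.
    short_ti = ti_ms is not None and ti_ms < 1500
    if te_ms <= 30 and (tr_ms <= 1000 or short_ti):
        return "T1"
    # Long TE and long TR are typical of T2 weighting.
    if te_ms >= 70 and tr_ms >= 2000:
        return "T2"
    return "unknown"
```

For example, `classify_sequence(te_ms=90, tr_ms=4000)` falls through to the T2 rule, while parameter combinations that match no rule are left as "unknown" rather than forced into a category.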

Clinical Metadata Extraction

Clinical metadata was extracted from unstructured imaging reports using Bio-Medical Large Language Models (Bio-Medical-Llama-3-8B), specifically fine-tuned for processing clinical text. The extraction process followed a systematic four-step approach:

Step 1 - Information Extraction: The LLM categorized each report as brain-related or not and identified the presence or absence of pathological conditions (responses limited to 'Yes', 'No', or 'Unknown').

Step 2 - Pathology Identification: The model extracted and listed all mentioned pathologies, generating JSON-formatted output with detailed information including pathology type, clinical term, anatomical location, brain-relatedness, severity, acuity (acute vs chronic), and additional details.

Step 3 - Standardization: Extracted findings were assigned to standardized clinical categories. Experienced neurologists iteratively developed 10 overarching clinical categories with 150 subcategories from over 300,000 initial findings. The model was prompted to select the best-fitting clinical category for each finding through dialogue-driven refinements evaluated on randomly selected samples.

Step 4 - Self-validation: The model verified whether each extracted pathology was explicitly mentioned in the original imaging report to identify potential over-interpretations or hallucinations.
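
Users who want to run their own sanity checks on extracted findings could start from something like the sketch below. It parses a JSON list in the spirit of Step 2 and then applies a plain string-matching check, which is a much simpler stand-in for the model-based self-validation of Step 4. The key name `clinical_term` and both function names are hypothetical; the actual field names in the BIND metadata may differ.

```python
import json


def parse_findings(llm_output):
    """Parse a JSON list of findings; return [] if the text is not a valid list."""
    try:
        findings = json.loads(llm_output)
    except json.JSONDecodeError:
        return []
    if not isinstance(findings, list):
        return []
    # Keep only dict entries that carry a non-empty clinical term (hypothetical key).
    return [
        f for f in findings
        if isinstance(f, dict)
        and isinstance(f.get("clinical_term"), str)
        and f["clinical_term"]
    ]


def flag_unsupported(findings, report_text):
    """Mark each finding by whether its clinical term literally appears in the report."""
    report = report_text.lower()
    return [
        {**f, "mentioned_in_report": f["clinical_term"].lower() in report}
        for f in findings
    ]
```

Literal matching will miss synonyms and paraphrases, so a finding flagged as not mentioned is a prompt to read the original report, not evidence of a hallucination.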

Important Limitation: Due to the nature of LLMs and the scale of nearly 2 million brain scans, we cannot guarantee complete accuracy of extracted information. The standardized metadata serves as a navigation tool to facilitate dataset access and identification of clinical subgroups of interest, not as clinical ground truth. Original de-identified clinical reports are provided alongside extracted metadata for user verification and should always be consulted for authoritative interpretation.

Data Description

Dataset Storage and Organization

BIND is published as part of the Brain Data Science Platform (BDSP; https://bdsp.io/), a comprehensive collection of open-source clinical datasets for brain research. BDSP hosts multiple repositories, including the Human Sleep Project (HSP), I-CARE (International Cardiac Arrest REsearch consortium), and others. On November 21, 2024, the NIH recognized BDSP as an approved data-sharing repository, aligning with the NIH's Data Management and Sharing (DMS) policy and demonstrating its commitment to open science and collaborative neuroscience research. Data storage within BDSP is sponsored by the Amazon Web Services (AWS) Open Data Sponsorship Program, enabling secure, scalable, and free dataset access.

The BIND dataset follows a standardized hierarchical folder structure using the Brain Imaging Data Structure (BIDS) format. The organization includes:

Top-level directories by data collection site
Subdirectories for individual patients
Session folders containing neuroimaging data organized by modality (anat/dwi/func/swi/etc.)
Demographic information and clinical notes provided alongside BIDS-compliant data
Comprehensive metadata tables at the site level
Original de-identified imaging reports in separate directories
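
Given a hierarchy like the one listed above, a local copy can be summarized with a short directory walk. The sketch below assumes a layout of the form site/patient/session/modality/file and counts NIfTI files per modality folder; the function name is hypothetical and the real folder names in BIND may differ.

```python
from collections import Counter
from pathlib import Path


def count_scans_by_modality(site_dir):
    """Count NIfTI files per modality folder under one site directory.

    Assumes the layout <site>/<patient>/<session>/<modality>/<file>.nii[.gz];
    the real folder names in BIND may differ.
    """
    counts = Counter()
    for path in Path(site_dir).rglob("*.nii*"):
        # Accept only .nii and .nii.gz, skipping names such as x.nii.json.
        if path.name.endswith((".nii", ".nii.gz")):
            counts[path.parent.name] += 1
    return counts
```

The immediate parent folder name (for example anat or dwi) is used as the modality key, which is why JSON sidecars and other non-image files are filtered out first.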

Multimodal Integration: Most patients in BIND (>80%) are also included in the Harvard Electroencephalography Database or the Human Sleep Project, enabling powerful multimodal analyses combining neuroimaging with neurophysiological data.

File Formats:

Imaging files: NIfTI format
Metadata: JSON and CSV formats
Clinical reports: TXT files

Demographics

Total Population: 38,942 individuals across 108,737 clinical encounters

Age Distribution:

Range: 20 days to >100 years
Mean age at first encounter: 53.7 ± 23.2 years

Racial and Ethnic Statistics:

White: 67.8%
Asian: 6.4%
Black or African American: 9.2%
Other/Multiple identities: 11.2%
Unknown: 5.4%

Additional Demographics Available: Comprehensive demographic metadata includes marital status, occupation, primary language, education level, veteran status, height, weight, BMI, smoking history, alcohol use, and, when applicable, cause and date of death.

Imaging Modalities and Sequences

Total Dataset: 1,791,935 clinical images

Modality Breakdown:

MRI: 1,724,300 scans (95.87%)
- Multiple field strengths: 1.5T, 3T, and 7T
- Sequences: T1-weighted, T2-weighted, FLAIR, DWI, fMRI, SWI, PWI, MRA, localizers
CT: 54,154 scans (3.01%)
PET: 5,720 scans (0.32%)
SPECT: 655 scans (0.04%)

Clinical Context: Scanning sequences were selected based on clinical indications and are therefore not standardized across studies, reflecting real-world clinical practice variability.

Clinical Metadata

Report Availability: Clinical reports are available for 96.54% of sessions

LLM Processing Results:

84,960 reports determined to be brain-related (87.31% of total)
17,209 reports (17.69%) describe no pathology
76,274 reports (78.39%) describe pathological findings
3,822 reports (3.93%) categorized as unknown

Extracted Findings: 394,122 total findings (mean 3.05 findings per session)

318,726 findings (80.87%) identified as brain-related

Clinical Categories (brain-related findings):

Vascular conditions: 22.88%
Acquired and traumatic injuries: 15.97%
Neoplasms: 12.17%
Neurodegenerative conditions: 9.41%
Inflammatory conditions: 8.95%
White matter conditions: 3.57%
Structural abnormalities: 2.10%
Cyst-like lesions: 1.04%
Other/miscellaneous: 8.78%
Unassigned category: 8.98%
Technical/artifacts: 3.40%

A standardized clinical metadata table is provided to facilitate data selection and interpretation, enabling researchers to efficiently identify relevant clinical subgroups for their studies.
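
As an illustration of subgroup selection, the sketch below filters a metadata table for sessions with findings in one clinical category. The column names (`session_id`, `category`, `brain_related`) and the function name are hypothetical placeholders; the actual table schema should be checked before adapting this.

```python
import csv
import io


def select_sessions(csv_file, category, brain_related_only=True):
    """Return sorted session IDs whose findings match one clinical category.

    Column names (session_id, category, brain_related) are hypothetical.
    """
    selected = set()
    for row in csv.DictReader(csv_file):
        if brain_related_only and row["brain_related"].strip().lower() != "yes":
            continue
        if row["category"].strip().lower() == category.lower():
            selected.add(row["session_id"])
    return sorted(selected)


example = io.StringIO(
    "session_id,category,brain_related\n"
    "ses-01,Neoplasms,Yes\n"
    "ses-02,Vascular conditions,Yes\n"
    "ses-03,Neoplasms,No\n"
)
print(select_sessions(example, "neoplasms"))  # prints ['ses-01']
```

Because the metadata is LLM-derived, sessions selected this way are candidates only and should be confirmed against the corresponding full-text reports.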

Usage Notes

How to Access the Data

Data access is provided via the Brain Data Science Platform (BDSP). Complete data access instructions and security protocols are available on bdsp.io.

Requirements:

Active Amazon Web Services (AWS) account with Amazon ID (must be provided in BDSP profile settings)
Signed Data Use Agreement with strict terms and conditions
Proof of completed CITI Training certification

Access Methods: After application approval, data can be accessed through:

AWS Command Line Interface using AWS Access Keys
Directory listing and file downloads
Bulk folder copying to local systems
Cloud-based data processing

BDSP provides flexible options for both local download and cloud-based analysis workflows.
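
The directory listing and bulk copy operations can be scripted around the AWS CLI, for example as sketched below. The bucket name and prefix are placeholders to be replaced with the values provided on bdsp.io after approval, the helper names are hypothetical, and the AWS CLI must already be installed and configured with your access keys.

```python
import subprocess


def s3_list_command(bucket, prefix=""):
    """Build the AWS CLI command that lists objects under a prefix."""
    return ["aws", "s3", "ls", f"s3://{bucket}/{prefix}"]


def s3_copy_folder_command(bucket, prefix, local_dir):
    """Build the AWS CLI command that copies a whole folder to local disk."""
    return ["aws", "s3", "cp", f"s3://{bucket}/{prefix}", str(local_dir), "--recursive"]


def run(command):
    """Run a command and return its standard output as text."""
    return subprocess.run(command, check=True, capture_output=True, text=True).stdout
```

A typical session would call `run(s3_list_command(...))` first to confirm the path, then `run(s3_copy_folder_command(...))` for a single patient or session folder before attempting larger transfers.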

Use of Clinical Metadata

⚠️ Important Limitation: Due to inherent limitations of Large Language Models (LLMs) and the absence of human-curated metadata, we cannot guarantee the accuracy or completeness of extracted information.

The standardized metadata should be viewed as a preliminary contextual framework to help users:

Navigate the dataset efficiently
Search for relevant subgroups
Identify clinically relevant cases

Always refer to original reports: Full-text de-identified clinical reports are available alongside extracted metadata and serve as the authoritative source for clinical interpretation. Clinical metadata should always be validated against the corresponding full-text clinical report.

Use of Diffusion Images

Missing Parameters: A subset of diffusion images is missing the corresponding B-values and B-vectors. This occurred during the institutional de-identification process, which overwrote portions of the DICOM headers to ensure patient privacy.

Future Updates: We expect to restore the B-value and B-vector files in future database updates to improve diffusion processing capabilities.
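
Until then, affected scans can be screened out before diffusion processing. The sketch below assumes that diffusion images sit in folders named dwi and that gradient tables, when present, are stored as `.bval` and `.bvec` files sharing the image's base name (the dcm2niix convention). Whether missing gradients appear in BIND as absent files or in some other form is an assumption to verify on the actual data.

```python
from pathlib import Path


def dwi_missing_gradients(root):
    """List diffusion NIfTI files that lack a .bval or .bvec file beside them.

    Assumes diffusion scans sit in folders named "dwi" and that gradient
    tables share the image's base name (dcm2niix convention).
    """
    missing = []
    for nifti in sorted(Path(root).rglob("dwi/*.nii.gz")):
        base = nifti.name[: -len(".nii.gz")]
        has_bval = (nifti.parent / f"{base}.bval").exists()
        has_bvec = (nifti.parent / f"{base}.bvec").exists()
        if not (has_bval and has_bvec):
            missing.append(nifti)
    return missing
```

Scans returned by this function can still be used for tasks that do not need gradient information, such as visual review, but not for tensor fitting or tractography.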

Imaging Quality

Unprocessed Clinical Data: This database contains unprocessed clinical imaging studies in NIfTI format acquired without prior signal quality assessment or preprocessing.

Quality Considerations:

Scanning quality varies significantly
Resolution differences across studies
Artifact levels may impact interpretation
Clinical acquisition protocols (not research-optimized)

Recommendation: Perform visual quality assessment of images for your specific subgroups of interest before analysis.
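
Before visual review, it can help to screen scans by basic geometry. The sketch below reads image dimensions and voxel sizes directly from a NIfTI-1 header using only the standard library. It assumes NIfTI-1 files with voxel sizes stored in millimetres; the 2.0 mm threshold and both function names are illustrative. A dedicated library such as nibabel would be the more usual choice in practice.

```python
import gzip
import struct


def read_nifti1_geometry(path):
    """Return (shape, voxel_sizes) from a NIfTI-1 header without loading the image."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rb") as handle:
        header = handle.read(348)
    if len(header) < 348:
        raise ValueError("file too short for a NIfTI-1 header")
    # sizeof_hdr must equal 348; use it to detect the byte order.
    if struct.unpack("<i", header[:4])[0] == 348:
        order = "<"
    elif struct.unpack(">i", header[:4])[0] == 348:
        order = ">"
    else:
        raise ValueError("not a NIfTI-1 file")
    dim = struct.unpack(order + "8h", header[40:56])      # dim[0] = number of dimensions
    pixdim = struct.unpack(order + "8f", header[76:108])  # pixdim[1..3] = voxel sizes
    ndim = dim[0]
    return dim[1 : ndim + 1], pixdim[1:4]


def is_low_resolution(voxel_sizes, max_mm=2.0):
    """Flag scans whose largest voxel edge exceeds max_mm (illustrative threshold)."""
    return max(voxel_sizes) > max_mm
```

A geometry screen of this kind only narrows the set of candidates; it does not detect motion, artifacts, or incomplete coverage, which still require looking at the images.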

Sequence Identification

Overall Performance: Our automated sequence identification achieves high accuracy, but several limitations should be considered:

Sources of Variability:

Heterogeneous clinical data from multiple sites
Scans acquired at external institutions with non-standardized protocols
Different scanner vendors and field strengths
Sequence names removed during de-identification
Research or experimental sequences included in clinical data

Common Misclassifications:

Localizers occasionally identified as T1- or T2-weighted images
Site-specific variations and overlapping parameter values
Neonatal scans present additional classification challenges
Non-brain images and uncommon sequence types

Best Practices:

Verify sequence classifications for your specific research needs
Consider manual review for critical analyses
Refer to our GitHub repository for sequence identification code and ongoing improvements

These limitations highlight the importance of understanding the clinical context of the data and performing appropriate quality checks for your specific research applications.

Release Notes

Limitations & Expected Future Updates

Of note, during the institutional de-identification process, some of the header information for the diffusion scans, namely the B-values and B-vectors, was overwritten. In the current dataset, approximately 12.57% of the diffusion scans have missing (overwritten) B-values and B-vectors. Future updates to the dataset will restore the B-values and B-vectors for all diffusion scans and will expand the participating institutions for both the neurophysiology and neuroimaging datasets.

Ethics

In this dataset, all data were anonymized with all identifiable patient information removed. Scans were identified retrospectively from IRB-approved chart review under protocols approved by the BIDMC IRB (protocols #2022P000481, #2022P000417) and MGB IRB (protocol #2013P001024), which provided a waiver of consent for retrospective data analysis; no prospective data acquisition or participant recruitment was performed.

Acknowledgements

This work was supported by grants from the NIH (RF1AG064312, RF1NS120947, R01AG073410, R01HL161253, R01NS126282, R01AG073598, R01NS131347, R01NS130119).

SSM is supported by the National Institute of Mental Health [R01MH131194, R01MH134823] and the Claflin Distinguished Scholar award [Massachusetts General Hospital].

Conflicts of Interest

Dr. Westover is a co-founder, scientific advisor, consultant to, and has personal equity interest in Beacon Biosignals.

ADL has served as a consultant for Neurona Therapeutics, and the institution of ADL has received research funding from Neurona Therapeutics and Sage Therapeutics.

Dr. Silvers is employed by and has personal equity interest in AWS.

Parent Projects

The Brain Imaging and Neurophysiology Database (BIND) was derived from:

Please cite them when using this project.

Access

Access Policy:
Only credentialed users who sign the DUA can access the files.

License (for files):
BDSP Credentialed Health Data License 1.5.0

Data Use Agreement:
BDSP Credentialed Health Data Use Agreement

Required training:
CITI Data or Specimens Only Research

Discovery

DOI:
https://doi.org/10.60508/mby8-3a26

Topics:
ct mri brain imaging

Project Website:
https://github.com/bdsp-core/BigBrainImagingDatabase

Files

This is a restricted-access resource. To access the files, you must fulfill all of the following requirements:

be a credentialed user
complete required training: CITI Data or Specimens Only Research
sign the data use agreement for the project
