Database Credentialed Access
The Brain Imaging and Neurophysiology Database (BIND)
Charlotte Maschke, Peter Hadar, Yicheng Zhang, Jian Li, Gauri Ganjoo, Andrew Hoopes, Alessandro Guazzo, Aditya Gupta, Manohar Ghanta, Bruce Nearing, Christine Tsien Silvers, Bharath Gunapati, Robert Thomas, Jennifer Kim, Shibani Mukerji, Adrian Dalca, Sahar Zafar, Alice Lam, Emmanuel Mignot, M Brandon Westover
Published: Sept. 9, 2025. Version: 1.0
When using this resource, please cite:
Maschke, C., Hadar, P., Zhang, Y., Li, J., Ganjoo, G., Hoopes, A., Guazzo, A., Gupta, A., Ghanta, M., Nearing, B., Silvers, C. T., Gunapati, B., Thomas, R., Kim, J., Mukerji, S., Dalca, A., Zafar, S., Lam, A., Mignot, E., & Westover, M. B. (2025). The Brain Imaging and Neurophysiology Database (BIND) (version 1.0). Brain Data Science Platform. https://doi.org/10.60508/mby8-3a26.
Abstract
The Brain Imaging and Neurophysiology Database (BIND) represents one of the largest multi-institutional, multimodal neuroimaging repositories, comprising 1.8 million brain scans from 38,945 subjects linked to neurophysiological recordings. This comprehensive dataset addresses critical limitations in neuroimaging research by providing unprecedented scale and diversity across pathologies and healthy controls. BIND integrates de-identified data from three major academic medical centers -- Massachusetts General Hospital, Brigham and Women's Hospital, and Stanford University Medical Center -- including 1,724,300 MRI scans (1.5T, 3T, and 7T), 54,154 CT scans, 5,720 PET scans, and 655 SPECT scans, converted to standardized NIfTI format following BIDS organization. The database spans the full age spectrum and encompasses diverse neurological conditions alongside healthy subjects. We deployed Bio-Medical Large Language Models to extract structured clinical metadata from 84,960 associated radiology reports, categorizing findings into standardized pathology classifications. All imaging data are linked to previously published EEG and polysomnography recordings from the Harvard Electroencephalography Database and Human Sleep Project, enabling unprecedented multimodal analyses. BIND is freely accessible for academic research through the Brain Data Science Platform (https://bdsp.io/). This resource facilitates large-scale neuroimaging studies, machine learning applications, and multimodal brain research to accelerate discoveries in clinical neuroscience.
Background
The past decade has witnessed remarkable growth in neuroinformatics, with computational advances unlocking new potential from neuroimaging data. Machine learning techniques and computational neuroimaging analyses now demonstrate performance matching or exceeding human neuroradiologists. These sophisticated analytical methods can detect subtle findings in neurological conditions including epilepsy, stroke, multiple sclerosis, traumatic brain injury, and dementia that yield novel insights, advancing our understanding of neurological conditions and human neuroscience.
However, progress has been constrained by datasets that are small, institution-specific, and focused on narrow clinical populations. The vast diversity of neurological conditions and significant inter-rater variability require large-scale datasets for robust analysis. While researchers have created various neuroimaging databases, most target specific disorders and fail to capture the natural variability encountered in clinical neurological practice. Open data sharing is essential for driving innovation in neuroscience and developing practical clinical applications.
BIND addresses these limitations as a comprehensive resource for the research community. This large-scale multimodal imaging database contains over 1.8 million scans from nearly 39,000 patients who underwent clinical electroencephalogram (EEG) or polysomnogram (PSG) testing over three decades. The dataset encompasses diverse pathologies and normal findings across all age ranges, from newborns to elderly patients. Available modalities include MRI (acquired at 1.5, 3, and 7 Tesla field strengths), Positron Emission Tomography (PET), Single Photon Emission Computed Tomography (SPECT), and CT. MRI sequences span structural, diffusion, perfusion, and functional imaging protocols. The corresponding EEG and PSG data are available through the Harvard Electroencephalography Database, and all modalities can be linked with additional neurological testing and electronic health record data through the Brain Data Science Platform (BDSP). By providing this comprehensive resource to researchers worldwide, BIND aims to accelerate neuroscience discoveries and enable innovative clinical applications that improve patient care and outcomes.
Methods
Data Acquisition
The imaging dataset consists of clinical brain imaging scans acquired from three major academic medical centers. Scans were identified retrospectively from IRB-approved chart review under institutional protocols, which provided waivers of consent for retrospective data analysis; no prospective data acquisition or participant recruitment was performed. The dataset includes imaging from all patients who had undergone EEG (routine or long-term monitoring) or polysomnography testing for clinical purposes over a 30-year period.
Data De-identification
Data were de-identified by institutional de-identification services following strict HIPAA Safe Harbor standards. Demographics and imaging metadata were processed through each institution's imaging service, and all data were date-shifted for additional privacy protection. Clinical reports underwent automated de-identification, and all images were processed through optical character recognition (OCR) to remove any embedded text that could contain identifiable information.
Access to the data requires "controlled access" protocols: all users must sign a mandatory Data Use Agreement with strict terms and conditions and provide proof of CITI training certification. These agreements prohibit attempts to reidentify individual records or further sharing of the data.
Computational Processing
All imaging data were converted from DICOM format to NIfTI format using dcm2niix software. Sequence identification and clinical metadata extraction were performed using custom software tools developed specifically for this dataset.
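For readers who want to reproduce a comparable conversion on their own DICOM data, the following is a minimal sketch of a dcm2niix call assembled in Python. The flags shown (`-z`, `-b`, `-f`, `-o`) are standard dcm2niix options, but the exact options used to build BIND are not stated here, so the choices below and the folder names are assumptions for illustration only.

```python
def build_dcm2niix_command(dicom_dir: str, output_dir: str) -> list[str]:
    """Build a dcm2niix call that writes compressed NIfTI plus JSON sidecars.

    The option values are illustrative assumptions, not the BIND settings.
    """
    return [
        "dcm2niix",
        "-z", "y",      # gzip-compress the output (.nii.gz)
        "-b", "y",      # write a BIDS-style JSON sidecar for each series
        "-f", "%p_%s",  # output name pattern: protocol name and series number
        "-o", output_dir,
        dicom_dir,
    ]


# Hypothetical input and output folders.
cmd = build_dcm2niix_command("dicom/series_001", "nifti_out")
print(" ".join(cmd))

# To run the conversion (requires dcm2niix on the PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Building the argument list separately from running it makes the command easy to log and review before it is applied to a large batch.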
MRI Sequence Standardization
To standardize MRI sequence naming across the heterogeneous multi-institutional dataset, we implemented a metadata-driven approach using acquisition parameters from de-identified DICOM headers. Key parameters included Image Type, Scanning Sequence, Sequence Variant, Echo Time (TE), Repetition Time (TR), Inversion Time (TI), Flip Angle, Diffusion B Value, and Echo Train Length.
Using parameter-guided thresholds based on MRI physics principles and unsupervised clustering with Gaussian Mixture Models, sequences were classified into standardized categories:
- T1-weighted (T1)
- T2-weighted (T2)
- T2 fluid-attenuated inversion recovery (FLAIR)
- Diffusion-weighted imaging (DWI)
- Functional MRI (fMRI)
- Susceptibility-weighted imaging (SWI)
- Perfusion-weighted imaging (PWI)
- Magnetic resonance angiography (MRA)
- Localizer, other, and unknown categories
Additional keyword-based matching was applied using scanner-generated sequence descriptions when available after de-identification. The sequence identification accuracy was validated using manually annotated ground truth datasets. Custom code for sequence identification is available on GitHub.
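The parameter-guided part of this approach can be illustrated with a small rule-based sketch. The thresholds below are rough textbook-style values chosen for illustration; they are assumptions, not the thresholds used for BIND, and the Gaussian Mixture Model clustering and keyword matching steps are omitted. The function and parameter names are hypothetical.

```python
from typing import Optional


def classify_sequence(
    te_ms: float,
    tr_ms: float,
    ti_ms: Optional[float] = None,
    b_value: Optional[float] = None,
) -> str:
    """Assign a coarse sequence label from acquisition parameters.

    All thresholds are illustrative assumptions, not the values used for BIND.
    """
    if b_value is not None and b_value > 0:
        return "DWI"    # any non-zero diffusion weighting
    if ti_ms is not None and ti_ms >= 1500 and te_ms >= 80:
        return "FLAIR"  # long inversion time combined with a long echo time
    if te_ms <= 30 and tr_ms <= 1000:
        return "T1"     # short TE and short TR
    if te_ms >= 80 and tr_ms >= 2000:
        return "T2"     # long TE and long TR
    return "unknown"    # ambiguous combinations are left unlabeled


examples = {
    "spin echo, short TE/TR": classify_sequence(te_ms=10, tr_ms=500),
    "fast spin echo, long TE/TR": classify_sequence(te_ms=100, tr_ms=4000),
    "inversion recovery": classify_sequence(te_ms=120, tr_ms=9000, ti_ms=2500),
    "diffusion": classify_sequence(te_ms=90, tr_ms=5000, b_value=1000),
    "intermediate": classify_sequence(te_ms=50, tr_ms=1500),
}
for name, label in examples.items():
    print(f"{name}: {label}")
```

Fixed cut-offs like these leave many real protocols in the "unknown" bucket (for example, inversion-prepared gradient-echo T1 acquisitions with longer TR), which is one reason a data-driven clustering step and manual validation are useful on top of simple rules.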
Clinical Metadata Extraction
Clinical metadata was extracted from unstructured imaging reports using Bio-Medical Large Language Models (Bio-Medical-Llama-3-8B), specifically fine-tuned for processing clinical text. The extraction process followed a systematic four-step approach:
Step 1 - Information Extraction: The LLM categorized each report as brain-related or not and identified the presence or absence of pathological conditions (responses limited to 'Yes', 'No', or 'Unknown').
Step 2 - Pathology Identification: The model extracted and listed all mentioned pathologies, generating JSON-formatted output with detailed information including pathology type, clinical term, anatomical location, brain-relatedness, severity, acuity (acute vs chronic), and additional details.
Step 3 - Standardization: Extracted findings were assigned to standardized clinical categories. Experienced neurologists iteratively developed 10 overarching clinical categories with 150 subcategories from over 300,000 initial findings. The model was prompted to select the best-fitting clinical category for each finding through dialogue-driven refinements evaluated on randomly selected samples.
Step 4 - Self-validation: The model verified whether each extracted pathology was explicitly mentioned in the original imaging report to identify potential over-interpretations or hallucinations.
Important Limitation: Due to the nature of LLMs and the scale of approximately 1.8 million brain scans, we cannot guarantee complete accuracy of extracted information. The standardized metadata serves as a navigation tool to facilitate dataset access and identification of clinical subgroups of interest, not as clinical ground truth. Original de-identified clinical reports are provided alongside extracted metadata for user verification and should always be consulted for authoritative interpretation.
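Users who want an independent check on the extracted findings can apply a simple post-hoc test of their own. The sketch below parses JSON-formatted findings and flags any clinical term that does not appear verbatim in the report text. This is a plain substring check swapped in for illustration; it is not the model-based self-validation described in Step 4, and the JSON field names (`clinical_term`, `location`, `acuity`) are hypothetical.

```python
import json


def validate_findings(llm_output: str, report_text: str) -> list[dict]:
    """Parse JSON findings and flag terms that never appear in the report."""
    findings = json.loads(llm_output)
    report_lower = report_text.lower()
    checked = []
    for finding in findings:
        term = finding.get("clinical_term", "")
        mentioned = bool(term) and term.lower() in report_lower
        checked.append({**finding, "mentioned_in_report": mentioned})
    return checked


# Invented report text and model output, for illustration only.
report = "MRI brain: small chronic infarct in the left thalamus. No hemorrhage."
llm_output = json.dumps([
    {"clinical_term": "chronic infarct", "location": "left thalamus", "acuity": "chronic"},
    {"clinical_term": "meningioma", "location": "frontal", "acuity": "unknown"},
])

results = validate_findings(llm_output, report)
for item in results:
    print(item["clinical_term"], item["mentioned_in_report"])
```

A substring match cannot handle negation ("No hemorrhage" still contains "hemorrhage") or paraphrase, so it only catches terms that are absent from the report altogether; reading the report remains necessary.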
Data Description
Dataset Storage and Organization
BIND is published as part of the Brain Data Science Platform (BDSP; https://bdsp.io/), a comprehensive collection of open-source clinical datasets for brain research. BDSP hosts multiple repositories including the Human Sleep Project (HSP), the I-CARE (International Cardiac Arrest REsearch consortium), and others. On November 21, 2024, the NIH recognized BDSP as an approved data-sharing repository, aligning with the NIH's Data Management and Sharing (DMS) policy and demonstrating its commitment to open science and collaborative neuroscience research. Data storage within BDSP is sponsored by the Amazon Web Services (AWS) Open Data Sponsorship Program, enabling secure, scalable, and free dataset access.
The BIND dataset follows a standardized hierarchical folder structure using the Brain Imaging Data Structure (BIDS) format. The organization includes:
- Top-level directories by data collection site
- Subdirectories for individual patients
- Session folders containing neuroimaging data organized by modality (anat/dwi/func/swi/etc.)
- Demographic information and clinical notes provided alongside BIDS-compliant data
- Comprehensive metadata tables at the site level
- Original de-identified imaging reports in separate directories
Multimodal Integration: Most patients in BIND (>80%) are also included in the Harvard Electroencephalography Database or the Human Sleep Project, enabling powerful multimodal analyses combining neuroimaging with neurophysiological data.
File Formats:
- Imaging files: NIfTI format
- Metadata: JSON and CSV formats
- Clinical reports: TXT files
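As an illustration of how this hierarchy can be used programmatically, the sketch below splits a file path into site, subject, session, and modality. The exact layout `<site>/sub-<id>/ses-<id>/<modality>/<file>` and the example names are assumptions based on common BIDS conventions and should be checked against the actual directory listing.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class ScanLocation(NamedTuple):
    site: str
    subject: str
    session: str
    modality: str
    filename: str


def parse_scan_path(path: str) -> ScanLocation:
    """Split an assumed <site>/sub-*/ses-*/<modality>/<file> path into fields."""
    parts = PurePosixPath(path).parts
    if len(parts) != 5:
        raise ValueError(f"expected 5 path components, got {len(parts)}: {path}")
    site, subject, session, modality, filename = parts
    if not subject.startswith("sub-") or not session.startswith("ses-"):
        raise ValueError(f"path does not follow the assumed sub-/ses- layout: {path}")
    return ScanLocation(site, subject, session, modality, filename)


# Hypothetical path; site and subject identifiers are placeholders.
loc = parse_scan_path("siteA/sub-0001/ses-01/anat/sub-0001_ses-01_T1w.nii.gz")
print(loc.site, loc.subject, loc.session, loc.modality)
```

Raising an error on unexpected layouts, rather than guessing, keeps files stored outside the imaging hierarchy (such as reports and metadata tables) from being misread as scans.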
Demographics
Total Population: 38,942 individuals across 108,737 clinical encounters
Age Distribution:
- Range: 20 days to >100 years
- Mean age at first encounter: 53.7 ± 23.2 years
Racial and Ethnic Statistics:
- White: 67.8%
- Asian: 6.4%
- Black or African American: 9.2%
- Other/Multiple identities: 11.2%
- Unknown: 5.4%
Additional Demographics Available: Comprehensive demographic metadata includes marital status, occupation, primary language, education level, veteran status, height, weight, BMI, smoking history, alcohol use, and when applicable, cause and date of death.
Imaging Modalities and Sequences
Total Dataset: 1,791,935 clinical images
Modality Breakdown:
- MRI: 1,724,300 scans (95.87%)
  - Multiple field strengths: 1.5T, 3T, and 7T
  - Sequences: T1-weighted, T2-weighted, FLAIR, DWI, fMRI, SWI, PWI, MRA, localizers
- CT: 54,154 scans (3.01%)
- PET: 5,720 scans (0.32%)
- SPECT: 655 scans (0.04%)
Clinical Context: Scanning sequences were selected based on clinical indications and are therefore not standardized across studies, reflecting real-world clinical practice variability.
Clinical Metadata
Report Availability: Clinical reports are available for 96.54% of sessions
LLM Processing Results:
- 84,960 reports determined to be brain-related (87.31% of the 97,305 reports processed)
- 17,209 reports (17.69%) describe no pathology
- 76,274 reports (78.39%) describe pathological findings
- 3,822 reports (3.93%) categorized as unknown
Extracted Findings: 394,122 total findings (mean 3.05 findings per session)
- 318,726 findings (80.87%) identified as brain-related
Clinical Categories (brain-related findings):
- Vascular conditions: 22.88%
- Acquired and traumatic injuries: 15.97%
- Neoplasms: 12.17%
- Neurodegenerative conditions: 9.41%
- Inflammatory conditions: 8.95%
- White matter conditions: 3.57%
- Structural abnormalities: 2.10%
- Cyst-like lesions: 1.04%
- Other/miscellaneous: 8.78%
- Unassigned category: 8.98%
- Technical/artifacts: 3.40%
A standardized clinical metadata table is provided to facilitate data selection and interpretation, enabling researchers to efficiently identify relevant clinical subgroups for their studies.
Usage Notes
How to Access the Data
Data access is provided via the Brain Data Science Platform (BDSP). Complete data access instructions and security protocols are available on bdsp.io.
Requirements:
- Active Amazon Web Services (AWS) account with Amazon ID (must be provided in BDSP profile settings)
- Signed Data Use Agreement with strict terms and conditions
- Proof of completed CITI Training certification
Access Methods: After application approval, data can be accessed through:
- AWS Command Line Interface using AWS Access Keys
- Directory listing and file downloads
- Bulk folder copying to local systems
- Cloud-based data processing
BDSP provides flexible options for both local download and cloud-based analysis workflows.
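The listing and bulk-copy steps might look like the following sketch, which assembles standard AWS CLI commands (`aws s3 ls` and `aws s3 cp --recursive`) in Python. The bucket name and prefix are placeholders, not real BIND locations; the actual paths are provided on bdsp.io after approval. Access keys are assumed to have been set up beforehand, for example with `aws configure`.

```python
def build_s3_commands(bucket: str, prefix: str, local_dir: str) -> dict[str, list[str]]:
    """Return AWS CLI argument lists for listing and copying one folder."""
    remote = f"s3://{bucket}/{prefix.strip('/')}/"
    return {
        "list": ["aws", "s3", "ls", remote],
        "copy": ["aws", "s3", "cp", remote, local_dir, "--recursive"],
    }


# "example-bdsp-bucket" and the prefix are hypothetical placeholders.
commands = build_s3_commands("example-bdsp-bucket", "bind/siteA/sub-0001", "./sub-0001")
for name, cmd in commands.items():
    print(name, "->", " ".join(cmd))

# Once credentials are configured, a command can be run with:
# import subprocess
# subprocess.run(commands["list"], check=True)
```

Listing a folder before copying it is a cheap way to confirm both the path and the access permissions before starting a large transfer.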
Use of Clinical Metadata
⚠️ Important Limitation: Due to inherent limitations of Large Language Models (LLMs) and the absence of human-curated metadata, we cannot guarantee the accuracy or completeness of extracted information.
The standardized metadata should be viewed as a preliminary contextual framework to help users:
- Navigate the dataset efficiently
- Search for relevant subgroups
- Identify clinically relevant cases
Always refer to original reports: Full-text de-identified clinical reports are available alongside extracted metadata and serve as the authoritative source for clinical interpretation. Clinical metadata should always be validated against the corresponding full-text clinical report.
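A typical workflow is to use the metadata table to shortlist sessions and then read the corresponding reports before including any case. The sketch below shows the shortlisting step on a small in-memory CSV. The column names (`subject`, `session`, `category`, `report_file`) and all rows are invented for illustration and may not match the actual metadata tables.

```python
import csv
import io

# Stand-in for a site-level metadata table; all column names are hypothetical.
SAMPLE_CSV = """subject,session,category,report_file
sub-0001,ses-01,Neoplasms,reports/sub-0001_ses-01.txt
sub-0002,ses-01,Vascular conditions,reports/sub-0002_ses-01.txt
sub-0003,ses-02,Neoplasms,reports/sub-0003_ses-02.txt
"""


def select_sessions(table_text: str, category: str) -> list[dict]:
    """Return the rows whose extracted category matches, for manual review."""
    rows = csv.DictReader(io.StringIO(table_text))
    return [row for row in rows if row["category"] == category]


candidates = select_sessions(SAMPLE_CSV, "Neoplasms")
for row in candidates:
    # The report file is what should be read before a case is included.
    print(row["subject"], row["session"], row["report_file"])
```

Carrying the report path through the selection keeps the authoritative text one step away from every shortlisted session.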
Use of Diffusion Images
Missing Parameters: A subset of diffusion images is missing the corresponding B-values and B-vectors. This occurred during the institutional de-identification process, which overwrote portions of the DICOM headers to ensure patient privacy.
Future Updates: We expect to restore the B-value and B-vector files in future database updates to improve diffusion processing capabilities.
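Before starting a diffusion analysis, it can help to screen for scans without usable gradient information. The sketch below assumes that gradient tables are stored the way dcm2niix writes them, as `.bval` and `.bvec` files sharing the base name of the `.nii.gz` image, and that affected scans simply lack these files. How missing parameters actually appear in BIND (absent files, empty files, or zeroed values) is not specified here, so treat this as a starting point.

```python
import tempfile
from pathlib import Path


def missing_gradient_files(dwi_dir: Path) -> list[Path]:
    """List DWI NIfTI files that lack a matching .bval or .bvec file."""
    incomplete = []
    for nifti in sorted(dwi_dir.glob("*.nii.gz")):
        stem = nifti.name[: -len(".nii.gz")]
        bval = nifti.with_name(stem + ".bval")
        bvec = nifti.with_name(stem + ".bvec")
        if not (bval.exists() and bvec.exists()):
            incomplete.append(nifti)
    return incomplete


# Demonstration on empty placeholder files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for name in ("a_dwi.nii.gz", "a_dwi.bval", "a_dwi.bvec", "b_dwi.nii.gz", "b_dwi.bval"):
        (folder / name).touch()
    flagged = [p.name for p in missing_gradient_files(folder)]

print(flagged)
```

In this demonstration the second scan is flagged because its `.bvec` file is absent, even though a `.bval` file exists.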
Imaging Quality
Unprocessed Clinical Data: This database contains unprocessed clinical imaging studies in NIfTI format acquired without prior signal quality assessment or preprocessing.
Quality Considerations:
- Scanning quality varies significantly
- Resolution differences across studies
- Artifact levels may impact interpretation
- Clinical acquisition protocols (not research-optimized)
Recommendation: Perform visual quality assessment of images for your specific subgroups of interest before analysis.
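A header-based pre-screen can narrow down which images to review visually. The sketch below reads the matrix size and voxel size directly from a NIfTI-1 header using only the Python standard library, then flags images with coarse voxels. The 3 mm limit is an arbitrary assumption, voxel sizes are assumed to be in millimetres (the header's unit field is not checked), and NIfTI-2 files are not handled. This does not replace looking at the images.

```python
import gzip
import struct
import tempfile
from pathlib import Path


def read_nifti1_geometry(path: Path) -> tuple[tuple[int, ...], tuple[float, ...]]:
    """Return (matrix size, voxel size in header units) from a NIfTI-1 header."""
    opener = gzip.open if path.name.endswith(".gz") else open
    with opener(path, "rb") as handle:
        header = handle.read(348)
    if len(header) < 348:
        raise ValueError(f"file too short for a NIfTI-1 header: {path}")
    # sizeof_hdr (first 4 bytes) must equal 348; use it to detect byte order.
    for order in ("<", ">"):
        if struct.unpack(order + "i", header[0:4])[0] == 348:
            break
    else:
        raise ValueError(f"not a NIfTI-1 file: {path}")
    dim = struct.unpack(order + "8h", header[40:56])      # dim[0] = number of dimensions
    pixdim = struct.unpack(order + "8f", header[76:108])  # pixdim[1..] = grid spacing
    ndim = dim[0]
    return dim[1 : ndim + 1], pixdim[1 : ndim + 1]


def has_coarse_voxels(voxel_sizes: tuple[float, ...], limit_mm: float = 3.0) -> bool:
    """Flag images whose spatial voxel size exceeds an (assumed) limit."""
    return any(size > limit_mm for size in voxel_sizes[:3])


# Demonstration on a synthetic header: 256 x 256 x 24 matrix, 0.5 x 0.5 x 5 voxels.
header = bytearray(348)
struct.pack_into("<i", header, 0, 348)
struct.pack_into("<8h", header, 40, 3, 256, 256, 24, 1, 1, 1, 1)
struct.pack_into("<8f", header, 76, 1.0, 0.5, 0.5, 5.0, 0.0, 0.0, 0.0, 0.0)
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.nii"
    demo.write_bytes(bytes(header))
    dims, voxels = read_nifti1_geometry(demo)

print(dims, voxels, has_coarse_voxels(voxels))
```

Thick-slice clinical acquisitions like the synthetic example are common in this kind of data, so a flag here is a prompt for review rather than a reason to exclude a scan.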
Sequence Identification
Overall Performance: Our automated sequence identification achieves high accuracy, but several limitations should be considered:
Sources of Variability:
- Heterogeneous clinical data from multiple sites
- Scans acquired at external institutions with non-standardized protocols
- Different scanner vendors and field strengths
- Sequence names removed during de-identification
- Research or experimental sequences included in clinical data
Common Misclassifications:
- Localizers occasionally identified as T1- or T2-weighted images
- Site-specific variations and overlapping parameter values
- Neonatal scans present additional classification challenges
- Non-brain images and uncommon sequence types
Best Practices:
- Verify sequence classifications for your specific research needs
- Consider manual review for critical analyses
- Refer to our GitHub repository for sequence identification code and ongoing improvements
These limitations highlight the importance of understanding the clinical context of the data and performing appropriate quality checks for your specific research applications.
Release Notes
Limitations & Expected Future Updates
Of note, during the institutional de-identification process, some header information for the diffusion scans, namely the B-values and B-vectors, was overwritten. In the current dataset, approximately 12.57% of the diffusion scans have missing (overwritten) B-values and B-vectors. Future updates will restore the B-values and B-vectors for all diffusion scans and will expand the participating institutions for both the neurophysiology and neuroimaging datasets.
Ethics
In this dataset, all data were anonymized with all identifiable patient information removed. Scans were identified retrospectively from IRB-approved chart review under protocols approved by the BIDMC IRB (protocols #2022P000481, #2022P000417) and MGB IRB (protocol #2013P001024), which provided a waiver of consent for retrospective data analysis; no prospective data acquisition or participant recruitment was performed.
Acknowledgements
This work was supported by grants from the NIH (RF1AG064312, RF1NS120947, R01AG073410, R01HL161253, R01NS126282, R01AG073598, R01NS131347, R01NS130119).
SSM is supported by the National Institute of Mental Health [R01MH131194, R01MH134823] and the Claflin Distinguished Scholar award [Massachusetts General Hospital].
Conflicts of Interest
Dr. Westover is a co-founder, scientific advisor, consultant to, and has personal equity interest in Beacon Biosignals.
ADL has served as a consultant for Neurona Therapeutics, and the institution of ADL has received research funding from Neurona Therapeutics and Sage Therapeutics.
Dr. Silvers is employed by and has personal equity interest in AWS.
Access
Access Policy:
Only credentialed users who sign the DUA can access the files.
License (for files):
BDSP Credentialed Health Data License 1.5.0
Data Use Agreement:
BDSP Credentialed Health Data Use Agreement
Required training:
CITI Data or Specimens Only Research
Discovery
DOI:
https://doi.org/10.60508/mby8-3a26
Topics:
ct
mri
brain imaging
Project Website:
https://github.com/bdsp-core/BigBrainImagingDatabase