Database Credentialed Access
Automated Extraction of Seizures and Ictal-Interictal Continuum Patterns from EEG Reports to Enable Large-Scale Neurophysiology and Neurocritical Care Research - Data and Code
Shadi Sartipi , Deena S. Godfrey , Alexandra-Maria Tauțan , Marta P. Fernandes , Manohar Ghanta , Aditya Gupta , Bruce Nearing , Jennifer Kim , Aaron F. Struck , Tobias Loddenkemper , Jurriaan Peters , Jong Woo Lee , M. Brandon Westover , Sahar F. Zafar
Published: June 1, 2026. Version: 1.0.0
When using this resource, please cite:
Sartipi, S., Godfrey, D. S., Tauțan, A., Fernandes, M. P., Ghanta, M., Gupta, A., Nearing, B., Kim, J., Struck, A. F., Loddenkemper, T., Peters, J., Lee, J. W., Westover, M. B., & Zafar, S. F. (2026). Automated Extraction of Seizures and Ictal-Interictal Continuum Patterns from EEG Reports to Enable Large-Scale Neurophysiology and Neurocritical Care Research - Data and Code (version 1.0.0). Brain Data Science Platform. https://doi.org/10.60508/j6w0-v279.
Abstract
Objective. Critical clinical information in EEG reports is often locked in free text, limiting large-scale research. This work develops and validates an automated, cross-institutional pipeline to extract seizures, rhythmic patterns, and ictal-interictal continuum features for population-level neurophysiology studies.
Methods. We developed a five-stage hybrid pipeline comprising: (1) text normalization, (2) rule-based pattern filtering to focus the analysis on relevant report segments, (3) segment-focused prompt generation, (4) large language model (LLM) inference for structured attribute extraction, and (5) post-processing. This hybrid design combines the consistency of rule-based screening with the flexibility of LLM-based extraction from variable free-text EEG reports. The pipeline was applied to 156,582 EEG reports from three health systems (two adult, one pediatric). Performance was evaluated on a balanced set of 1,500 expert-reviewed samples covering seizure presence, count, burden, and timing, as well as the frequency and prevalence of periodic and rhythmic patterns. Patterns included lateralized periodic discharges (LPD), generalized periodic discharges (GPD), and generalized/lateralized rhythmic delta activity (GRDA/LRDA).
Results. The pipeline achieved high performance across health systems for seizure (average accuracy 0.94) and periodic and rhythmic pattern detection (LPD: 0.97 [95% CI: 0.94–0.99], GPD: 0.96 [0.92–0.98], GRDA/LRDA: 0.97 [0.94–0.99]). Seizure specificity was 0.72 at two adult institutions (95% CI: 0.66–0.77) and 0.99 at the other two (one adult and one pediatric). Frequency and prevalence extraction showed strong agreement with mean values 0.94 and 0.93, respectively. High-frequency discharges (≥2 Hz) were associated with higher seizure occurrence within 24 and 48 hours in descriptive analysis.
Conclusions. This study introduces a scalable method for EEG analysis using clinical narratives, linking rhythmic patterns to seizure progression and supporting critical care risk stratification and large-scale EEG outcome prediction research.
Background
Continuous electroencephalography (cEEG) plays a vital role in seizure detection, ischemia detection, and prognostication in patients with acute brain injuries and altered mental status [1], [2], [3], [4]. An estimated 20-40% of these patients are found to have nonconvulsive seizures and rhythmic and periodic EEG patterns that fall on the ictal–interictal continuum (IIC) [2], [5]. Patterns such as lateralized periodic discharges (LPDs), generalized periodic discharges (GPDs), and lateralized rhythmic delta activity (LRDA) are increasingly recognized as markers of brain dysfunction and prognosis in critically ill patients [6], [7], [8].
Despite the clinical significance of these EEG patterns, large-scale studies investigating treatment of seizures and other IIC patterns have been limited due to the time-consuming nature of extracting structured data from EEG reports [9]. Typically, EEG interpretations are recorded as unstructured free-text reports, which vary widely in terminology, formatting, and level of detail depending on the institution and the individual clinician [10], [11]. Manual chart review is labor-intensive, time-consuming, unscalable, and requires trained experts, posing a significant roadblock to population-level studies aimed at understanding the prognostic implications of EEG patterns or evaluating treatment responses in heterogeneous patient cohorts [12], [13].
Numerous approaches have been developed to automate EEG interpretation, and they fall into two main categories: waveform-based and report-based methods [14], [15]. Waveform-based models apply machine learning frameworks to detect seizures or classify EEG waveforms directly from raw data [16], [17]. These methods can achieve high accuracy but depend on access to large volumes of raw EEG data, which is not always available, and often require substantial computational resources. Alternative efforts, such as the American Clinical Neurophysiology Society (ACNS) standardized terminology initiatives, have focused on harmonizing EEG annotation protocols [2]. Text-based approaches using natural language processing (NLP) have shown promise in extracting seizure-related information from EEG reports. Some have employed rule-based systems or keyword extraction [18], [19], [20], while others have applied machine learning and transformer-based models [21], [22].
While these prior works report strong performance, they have important limitations. Many NLP pipelines focus only on binary seizure detection, overlooking critical contextual features such as seizure frequency, duration, and burden. In addition, no existing pipelines extract periodic, rhythmic, or IIC patterns, or their frequency and prevalence [23], [24], [25]. Moreover, most have been validated on small datasets or at a single institution, limiting generalizability. Furthermore, large language models (LLMs) used in recent studies are often prone to hallucination, particularly when tasked with extracting information that is only implicitly mentioned or inconsistently reported [26]. There remains a need for a validated method that can extract both the presence and the characteristics of seizures and of periodic, rhythmic, and IIC patterns from EEG reports across diverse hospital systems.
To address these challenges, we developed a cross-institutional pipeline that leverages recent advances in LLMs for structured extraction of seizure and IIC features from unstructured EEG reports. Our pipeline combines preprocessing, pattern-focused segmentation, rule-based filtering, LLM prompting, and post-processing to generate high-resolution structured outputs. Our approach explicitly extracts detailed seizure and IIC characteristics, including frequency, prevalence, and timing, while operating locally on clinical reports, ensuring privacy preservation. Our contributions are threefold: (1) we introduce a scalable, reproducible pipeline for automated EEG report parsing across diverse hospital systems; (2) we validate this approach on expert-annotated samples across three adult sites and one pediatric site; (3) we show that early IIC patterns, especially high-frequency discharges, are associated with increased seizure risk, highlighting their utility in neurocritical care triage and monitoring. Together, this work provides a framework for unlocking large-scale EEG datasets to advance research in neurocritical care.
Methods
Dataset
We used EEG report data from three health systems: Massachusetts General Hospital and Brigham and Women's Hospital (grouped as MGB), Beth Israel Deaconess Medical Center (BIDMC), and Boston Children's Hospital (BCH). For MGB and BIDMC, we used EEG reports from adult inpatient hospitalizations between 1998 and 2023. The BCH dataset included inpatient and outpatient EEG reports from 2007 to 2024. The study was conducted under research protocols approved by the Mass General Brigham and BIDMC IRBs (MGB: 2023P000478, 2024P002630, BIDMC: 2022P000417, 2022P000481), with a Data Use Agreement with Boston Children's Hospital.
Table 1 below provides a breakdown of segment counts, age, gender, and EEG findings across the three health systems. Out of a total of 156,582 EEG segments, the pipeline identified 22,510 (14.8%) segments with seizures, 11,954 (7.6%) with LPDs, 10,511 (6.7%) with GPDs, 5,394 (3.4%) with LRDA, and 14,849 (9.5%) with GRDA.
Table 1. Summary of data across all cohorts. N and NT are the number of segments and unique patients respectively.
| | MGB N=58,612 | BIDMC N=48,978 | BCH N=48,992 | Total segments N=156,582 | Total unique patients NT=56,580 |
|---|---|---|---|---|---|
| Age, Count (percentage%) | |||||
| [0, 10) | 0 (0.0) | 0 (0.0) | 8,566 (17.4) | 8,566 (5.5) | 3,849 (6.8) |
| [10, 20) | 867 (1.5) | 354 (0.7) | 20,307 (41.4) | 21,528 (13.8) | 10,818 (19.1) |
| [20, 30) | 4,203 (7.2) | 4,796 (9.8) | 16,904 (34.5) | 25,903 (16.5) | 10,278 (18.2) |
| [30, 40) | 4,483 (7.6) | 5,133 (10.5) | 3,080 (6.3) | 12,696 (8.1) | 3,765 (6.7) |
| [40, 50) | 5,998 (10.2) | 6,122 (12.5) | 116 (0.2) | 12,236 (7.8) | 3,054 (5.4) |
| [50, 60) | 10,385 (17.7) | 9,139 (18.7) | 6 (0.0) | 19,530 (12.5) | 5,320 (9.4) |
| [60, 70) | 14,274 (24.4) | 9,866 (20.1) | 13 (0.0) | 24,153 (15.4) | 7,583 (13.4) |
| [70, 80) | 11,675 (19.9) | 7,733 (15.8) | 0 (0.0) | 19,408 (12.4) | 6,896 (12.2) |
| [80, 90) | 5,717 (9.8) | 4,830 (9.7) | 0 (0.0) | 10,547 (6.7) | 4,113 (7.3) |
| [90, +∞) | 1,020 (1.7) | 1,005 (2.1) | 0 (0.0) | 2,025 (1.3) | 904 (1.6) |
| Gender, Count (percentage%) | |||||
| Male | 30,532 (52.1) | 26,183 (53.5) | 26,772 (54.6) | 83,487 (53.3) | 30,763 (54.4) |
| Female | 28,079 (47.9) | 22,795 (46.5) | 22,217 (45.3) | 73,091 (46.7) | 25,795 (45.6) |
| Other | 1 (0.0) | 2 (0.0) | 3 (0.0) | 6 (0.0) | 6 (0.0) |
| EEG Findings, Count (percentage%) | |||||
| Seizure | 7,084 (12.1) | 5,653 (11.5) | 8,169 (16.7) | 22,510 (14.8) | — |
| LPD | 9,658 (16.5) | 2,124 (4.3) | 172 (0.4) | 11,954 (7.6) | — |
| GPD | 8,156 (13.9) | 2,295 (4.7) | 60 (0.1) | 10,511 (6.7) | — |
| LRDA | 4,602 (7.9) | 736 (1.5) | 56 (0.1) | 5,394 (3.4) | — |
| GRDA | 12,431 (21.2) | 2,288 (4.7) | 130 (0.3) | 14,849 (9.5) | — |
Five-stage pipeline
We developed a pipeline consisting of five stages: preprocessing, rule-based pattern detection, segment-focused prompt generation, LLM inference, and post-processing. The rule-based filtering step served as an initial screen that identifies candidate report segments containing relevant EEG terminology before LLM-based structured extraction. This reduces irrelevant context, limits false-positive or unsupported extractions, and lowers compute cost. This hybrid design combines the consistency of rule-based screening with the flexibility of LLM-based extraction from variable free-text EEG reports.
- Preprocessing. Whitespace normalization, conversion to lowercase, removal of non-informative lines, and elimination of redundant symbols such as asterisks and exclamation points.
- Rule-based pattern filtering. The processed reports were divided into clinically meaningful text segments. Regular expressions based on American Clinical Neurophysiology Society (ACNS) standardized terminology identified segments containing explicit mentions of seizures, LPD, GPD, LRDA, and GRDA. This rule-based step limited downstream extraction to the most relevant portions of each report.
- Segment-focused prompt generation. Selected segments were stemmed and inserted into structured prompts designed for pattern-specific information extraction. Prompts were written to extract only explicitly stated findings and to return the output in a predefined JSON format.
- LLM inference. Prompts were passed to the language model for structured inference. Output was restricted to JSON format to facilitate downstream processing and reduce ambiguity. Local inference used Llama 3.2 (3B-parameter) through Ollama, temperature 0, with task-specific output token limits.
- Post-processing. Validation that extracted attributes appeared verbatim in the source segments (reducing hallucination risk); re-sorting segments chronologically; aggregation to encounter-level and 24-hour windows; imputation of missing values using the encounter-level mode.
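The stages above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code in `code_eeg_annotation.ipynb`: the regular expressions, the prompt wording, the JSON keys (`present`, `frequency`), and all function names are hypothetical, and stage 4 (LLM inference) is replaced by hand-written response strings so that the example runs without a model.

```python
import json
import re

# Hypothetical patterns for illustration; the released list_*Patterns*.txt
# files hold the keyword lists actually used for rule-based filtering.
PATTERNS = {
    "seizure": re.compile(r"\bseizures?\b"),
    "lpd": re.compile(r"\b(?:lpds?|lateralized periodic discharges?)\b"),
    "gpd": re.compile(r"\b(?:gpds?|generalized periodic discharges?)\b"),
}


def preprocess(report: str) -> str:
    """Stage 1: lowercase, strip asterisks/exclamation points, tidy whitespace."""
    text = re.sub(r"[*!]+", " ", report.lower())
    lines = (re.sub(r"\s+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)  # drop empty lines


def filter_segments(text: str) -> dict[str, list[str]]:
    """Stage 2: split into sentence-like segments and keep keyword matches."""
    segments = [s.strip() for s in re.split(r"(?<=[.;])\s+|\n", text) if s.strip()]
    return {
        name: [seg for seg in segments if rx.search(seg)]
        for name, rx in PATTERNS.items()
    }


def build_prompt(pattern: str, segment: str) -> str:
    """Stage 3: request explicitly stated findings only, in a fixed JSON shape."""
    return (
        f"Report segment: {segment}\n"
        f"Extract only explicitly stated findings about '{pattern}'. "
        'Answer as JSON: {"present": true|false, "frequency": string|null}'
    )


def postprocess(raw_response: str, segment: str) -> dict:
    """Stage 5: parse the JSON and drop any value not found verbatim in the segment."""
    try:
        parsed = json.loads(raw_response)
    except json.JSONDecodeError:
        parsed = None
    if not isinstance(parsed, dict):
        return {"present": None, "frequency": None}
    frequency = parsed.get("frequency")
    if frequency is not None and str(frequency) not in segment:
        parsed["frequency"] = None  # unsupported by the source text
    return parsed


report = "EEG REPORT ***\nFrequent LPDs at 1.5 Hz over the left hemisphere. No seizures were recorded!!"
hits = filter_segments(preprocess(report))
lpd_segment = hits["lpd"][0]
prompt = build_prompt("lpd", lpd_segment)

# Hand-written stand-ins for model output (stage 4 is not run here).
supported = postprocess('{"present": true, "frequency": "1.5"}', lpd_segment)
unsupported = postprocess('{"present": true, "frequency": "2.5"}', lpd_segment)
print(hits)
print(supported, unsupported)
```

Note that the keyword filter in this sketch also keeps the negated segment "no seizures were recorded"; it only selects candidate text, and deciding presence or absence is left to the extraction step. The verbatim check then discards the unsupported frequency value "2.5" because that string does not occur in the segment.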
Model performance verification
To validate pipeline performance, two annotators (one neurologist and one neurophysiologist) independently reviewed report segments. We selected 100 segments per pattern per site, with balanced class representation (50 positive and 50 negative samples per site). This resulted in 300 samples per pattern and a total of 1,500 expert-annotated samples across all five patterns. The validation set was intentionally constructed with balanced positive and negative samples to enable stable evaluation across all target findings, including less frequent patterns; this does not reflect the natural prevalence of EEG findings in the full cohort. Performance metrics included accuracy, sensitivity, and specificity with 95% binomial proportion confidence intervals.
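As an illustration of how such metrics can be computed from a confusion matrix, the sketch below derives accuracy, sensitivity, and specificity with 95% intervals. The text above does not state which binomial interval was used, so the Wilson score interval here is an assumption, and the counts are invented toy values rather than results from this study.

```python
import math


def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Point estimates and Wilson intervals for the three reported metrics."""
    total = tp + fp + tn + fn
    return {
        "accuracy": ((tp + tn) / total, wilson_ci(tp + tn, total)),
        "sensitivity": (tp / (tp + fn), wilson_ci(tp, tp + fn)),
        "specificity": (tn / (tn + fp), wilson_ci(tn, tn + fp)),
    }


# Toy counts for one pattern at one site (50 positive, 50 negative samples).
metrics = binary_metrics(tp=48, fp=3, tn=47, fn=2)
for name, (estimate, (low, high)) in metrics.items():
    print(f"{name}: {estimate:.2f} [{low:.2f}, {high:.2f}]")
```

Because the validation set is balanced by design, accuracy computed this way weights positives and negatives equally and should not be read as an estimate of accuracy at the natural prevalence of each finding.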
Data Description
This BDSP project hosts the deidentified data and code that accompany Sartipi et al. (manuscript under review at the International Journal of Medical Informatics). The companion S3 folder s3://bdsp-opendata-credentialed/eeg-report-extraction/ contains three CSV files of extracted EEG findings and a code directory with the extraction pipeline.
Data files
Three CSV files of deidentified EEG findings, one per source health system:
| File | Source | Columns | Size |
|---|---|---|---|
| data/bch_eeg_findings_deidentified.csv | Boston Children's Hospital (BCH, pediatric) | 26 | 6.4 MB |
| data/bidmc_eeg_findings_deidentified.csv | Beth Israel Deaconess Medical Center (BIDMC, adult) | 21 | 3.5 MB |
| data/mgb_eeg_findings_deidentified.csv | Mass General Brigham (MGB, adult; deidentified shareable subset) | 31 | 12 MB |
Each row corresponds to a single EEG epoch / report segment. Patient identifiers have been replaced with BDSP-style integer IDs (BDSPPatientID); dates have been shifted (ShiftedDateOfBirth, ShiftedDate, etc.) to prevent re-identification while preserving relative intervals.
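Because shifting preserves relative intervals, quantities such as age at recording or days between recordings can still be derived from the shifted fields. The sketch below shows this on an invented toy frame; it assumes that `ShiftedDateOfBirth` and `ShiftedDate` are both present, parse as dates, and share the same shift within a patient, which should be checked against the actual file headers.

```python
import pandas as pd

# Toy rows with invented values; real files may use other date formats.
toy = pd.DataFrame(
    {
        "BDSPPatientID": [101, 101, 202],
        "ShiftedDateOfBirth": ["1950-03-01", "1950-03-01", "2010-07-15"],
        "ShiftedDate": ["2015-03-01", "2015-03-03", "2020-01-15"],
    }
)
for col in ("ShiftedDateOfBirth", "ShiftedDate"):
    toy[col] = pd.to_datetime(toy[col])

# Age at recording: the shift cancels out in the difference.
toy["age_years"] = (toy["ShiftedDate"] - toy["ShiftedDateOfBirth"]).dt.days / 365.25

# Days since each patient's first recording in the file.
first_eeg = toy.groupby("BDSPPatientID")["ShiftedDate"].transform("min")
toy["days_since_first_eeg"] = (toy["ShiftedDate"] - first_eeg).dt.days

print(toy[["BDSPPatientID", "age_years", "days_since_first_eeg"]])
```

Absolute calendar dates (for example, year of recording or season) are not meaningful after shifting and should not be analyzed.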
Common columns across health systems
Each CSV contains the following groups of extracted attributes:
- Demographics: BDSPPatientID, ShiftedDateOfBirth, Gender/SexDSC, and date/time fields for the EEG and (where applicable) hospital admission and discharge.
- Seizures: presence, count, timestamps, duration, burden.
- IIC patterns (each of: GPD, GRDA, LPD, LRDA, BIRD where reported): presence, frequency, prevalence, and (for MGB/BCH) the LLM-emitted response string for downstream audit.
Column schemas differ slightly across health systems because of source-report formatting differences; refer to the file headers and the per-cohort sections of the manuscript for full column definitions.
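One way to see where the schemas diverge is to compare column sets before combining files. The sketch below uses small stand-in frames so that it runs without the data; apart from `BDSPPatientID`, `ShiftedDateOfBirth`, `Gender`, and `SexDSC`, the column names are hypothetical, and which file uses `Gender` versus `SexDSC` is an assumption made for illustration only.

```python
import pandas as pd


def schema_overlap(frames: dict[str, pd.DataFrame]) -> tuple[list[str], dict[str, list[str]]]:
    """Return columns shared by all frames and the extra columns of each frame."""
    column_sets = {name: set(df.columns) for name, df in frames.items()}
    shared = set.intersection(*column_sets.values())
    extras = {name: sorted(cols - shared) for name, cols in column_sets.items()}
    return sorted(shared), extras


# Stand-in frames; with data access, replace these with e.g.
# pd.read_csv("data/bch_eeg_findings_deidentified.csv", nrows=0) to read headers only.
frames = {
    "bch": pd.DataFrame(columns=["BDSPPatientID", "ShiftedDateOfBirth", "Gender", "seizure_present"]),
    "mgb": pd.DataFrame(columns=["BDSPPatientID", "ShiftedDateOfBirth", "SexDSC", "seizure_present", "lpd_response"]),
}
shared, extras = schema_overlap(frames)
print("shared:", shared)
print("extras:", extras)
```

Columns that carry the same information under different names would need an explicit rename map before the files are concatenated.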
Coverage caveat
The MGB CSV in this release contains the deidentified-shareable subset of the MGB data analyzed in the manuscript, in accordance with MGB's data-sharing rules. The BIDMC and BCH CSVs match the manuscript counts exactly. Manuscript-level segment counts (per Table 1 in the Methods section): MGB 58,612; BIDMC 48,978; BCH 48,992; total 156,582.
Code
The code/ folder contains the extraction pipeline:
- code_eeg_annotation.ipynb: main Jupyter notebook implementing the five-stage pipeline (preprocessing, rule-based filtering, segment-focused prompt generation, LLM inference, post-processing).
- help_functions.py: shared utility functions (text normalization, pattern matching, structured-attribute parsing).
- list_szPatterns0x.txt, list_szPatterns1.txt, list_szPatternsR.txt: seizure-pattern keyword lists used in rule-based filtering.
- list_gpdPatterns.txt, list_grdaPatterns.txt, list_lpdPatterns.txt, list_lrdaPatterns.txt: pattern keyword lists for each IIC pattern (GPD, GRDA, LPD, LRDA).
Usage Notes
Code on GitHub
The pipeline code is mirrored at https://github.com/bdsp-core/eeg-report-extraction (public, browseable without credentialed access). Clone for the canonical reference; data still requires credentialed access via the S3 paths below.
Loading the data
import pandas as pd
bch = pd.read_csv("data/bch_eeg_findings_deidentified.csv")
bidmc = pd.read_csv("data/bidmc_eeg_findings_deidentified.csv")
mgb = pd.read_csv("data/mgb_eeg_findings_deidentified.csv")
print(bch.shape, bidmc.shape, mgb.shape)
print(bch.columns.tolist())
Reproducing the extraction pipeline
The notebook code/code_eeg_annotation.ipynb walks through the five-stage pipeline end-to-end, applying it to a sample input. To run it on your own EEG-report corpus:
- Open the notebook in JupyterLab (Python ≥3.10). Install pandas, numpy, and an LLM client. The results reported in Methods used local inference with Llama 3.2 (3B) through Ollama; any chat-completion-compatible client should work with minor adjustments.
- Point the input path to your own EEG-report CSV/TSV with one report per row.
- The pattern-keyword files (list_*Patterns*.txt) drive the rule-based filtering stage; adjust them if your reports use different terminology.
- Configure your model name and any credentials your client requires. Extraction quality depends on the model and may degrade with a different or smaller one, especially on rare pattern types.
Suggested analyses
- Reproduce the Table 1 demographic and pattern-prevalence breakdowns by institution.
- Reproduce the 24-/48-hour seizure association analysis for high-frequency IIC patterns (≥2 Hz) reported in the manuscript.
- Cross-validate extraction against a separate manually-annotated set in your own institution to estimate transferability.
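As a starting point for the second analysis, the sketch below computes, on invented toy data, the share of pattern-bearing segments followed by a seizure within 24 and 48 hours, split at 2 Hz. It is descriptive only and rests on assumptions: the column names (`patient`, `hours`, `freq_hz`, `seizure`) are hypothetical, and "within N hours" is taken to mean a later segment of the same patient in the interval (t, t + N]. The manuscript's exact definitions may differ.

```python
import pandas as pd


def seizure_within(df: pd.DataFrame, horizon_h: float) -> pd.Series:
    """For each row, True if the same patient has a seizure in (t, t + horizon_h]."""
    flags = []
    for _, row in df.iterrows():
        later = df[
            (df["patient"] == row["patient"])
            & (df["hours"] > row["hours"])
            & (df["hours"] <= row["hours"] + horizon_h)
        ]
        flags.append(bool(later["seizure"].any()))
    return pd.Series(flags, index=df.index)


# Toy segments: hours since first recording, pattern frequency if a pattern was reported.
toy = pd.DataFrame(
    {
        "patient": [1, 1, 1, 2, 2, 3, 3],
        "hours": [0, 20, 40, 0, 30, 0, 40],
        "freq_hz": [2.5, None, None, 1.0, None, 2.0, None],
        "seizure": [False, True, False, False, False, False, True],
    }
)
toy["sz_24h"] = seizure_within(toy, 24)
toy["sz_48h"] = seizure_within(toy, 48)

# Keep segments with a reported pattern frequency and label them by the 2 Hz cut-off.
patterns = toy.dropna(subset=["freq_hz"]).copy()
patterns["freq_group"] = (patterns["freq_hz"] >= 2.0).map({True: "high", False: "low"})
summary = patterns.groupby("freq_group")[["sz_24h", "sz_48h"]].mean()
print(summary)
```

The row-wise loop is easy to read but slow on large files; for the full cohorts, a sort by patient and time followed by a windowed merge would scale better.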
Accessing the S3 data
If you've been granted credentialed access through bdsp.io, you can download the files via the AWS CLI:
aws s3 sync s3://bdsp-opendata-credentialed/eeg-report-extraction/ ./eeg-report-extraction/
Or via the access-point alias if your team uses one:
aws s3 sync s3://bdsp-credentialed-pr-fymwc8rqh9fzdisq7om7eiq9wutqhuse1b-s3alias/eeg-report-extraction/ ./
Release Notes
Updated 2026-06-02 to match the final revision of Sartipi et al. (IJMI-D-25-03601-R1). Changes: title updated to match the manuscript; author list expanded from 11 to 14 (added Tobias Loddenkemper, Jurriaan Peters, and Jong Woo Lee); corresponding author updated to Sahar F. Zafar; Aaron F. Struck affiliation updated from University of Wisconsin to Washington University in St. Louis; segment count corrected from 177,902 to 156,582 (the manuscript groups MGH+BWH as the MGB health system); Table 1 rebuilt as a proper HTML table; funding statement updated to NIH R01NS131347, R01NS126282; COI extended to include disclosures from Loddenkemper and Lee. The underlying S3 data files are unchanged.
Ethics
The study was conducted under research protocols approved by the Mass General Brigham and Beth Israel Deaconess Medical Center (BIDMC) IRBs (MGB: 2023P000478, 2024P002630, BIDMC: 2022P000417, 2022P000481), with a Data Use Agreement with Boston Children's Hospital.
Acknowledgements
Funding: This work was supported by grants from the NIH (R01NS131347, R01NS126282).
Conflicts of Interest
Dr. Westover is a co-founder, scientific advisor, consultant to, and has personal equity interest in Beacon Biosignals. Dr. Zafar receives publishing royalties from Springer and Wolters Kluwer. Dr. Loddenkemper is part of patent applications that detect and predict clinical outcomes, as well as manage, diagnose, and treat neurological conditions, epilepsy, and seizures. Dr. Loddenkemper received device donations from various companies, including Epitel, and receives research support from Epitel. Dr. Lee is a co-founder and scientific advisor to Soterya, Inc.
Access
Access Policy:
Only credentialed users who sign the DUA can access the files.
License (for files):
BDSP Credentialed Health Data License 1.5.0
Data Use Agreement:
BDSP Credentialed Health Data Use Agreement
Discovery
DOI:
https://doi.org/10.60508/j6w0-v279
Project Website:
https://github.com/bdsp-core/eeg-report-extraction
Files
To access the files, you must:
- be a credentialed user
- sign the data use agreement for the project