Database Credentialed Access
Evaluating crowdsourcing for ICU EEG annotation: A comparison with expert performance — Data and Code
Wan-Yee Kong, Fabio Nascimento, Aaron F Struck, Erik Duhaime, Srishti Kapur, Edilberto Amorim, Gregory Kapinos, Andres Rodriguez, Brendan Thomas, Masoom Desai, Jong Woo Lee, M Brandon Westover, Jin Jing
Published: May 17, 2026. Version: 1.0.0
When using this resource, please cite:
Kong, W., Nascimento, F., Struck, A. F., Duhaime, E., Kapur, S., Amorim, E., Kapinos, G., Rodriguez, A., Thomas, B., Desai, M., Lee, J. W., Westover, M. B., & Jing, J. (2026). Evaluating crowdsourcing for ICU EEG annotation: A comparison with expert performance — Data and Code (version 1.0.0). Brain Data Science Platform. https://doi.org/10.60508/fn0v-8w16
Abstract
Objective. Detection of seizures and rhythmic or periodic patterns (SRPPs) on electroencephalography (EEG) is crucial for the diagnosis and management of patients with neurological critical illness. Automated detection methods require large, high-quality, expert-annotated datasets for training, but expert annotation is bottlenecked by the limited supply of trained neurophysiologists. Crowdsourcing may offer a scalable alternative. This BDSP publication shares the underlying data and analysis code from a study that evaluated whether crowdsourced annotations of short epochs of ICU EEG can match expert-quality labels.
Methods. An EEG scoring contest was conducted via the DiagnosUs mobile app (in collaboration with Centaur Labs) over a 1-month period in 2021. Participants annotated 10-second EEG epochs (with an accompanying 10-minute spectrogram) for six SRPP categories: seizure, generalized periodic discharges (GPDs), lateralized periodic discharges (LPDs), generalized rhythmic delta activity (GRDA), lateralized rhythmic delta activity (LRDA), and "Other." Performance was assessed via pairwise agreement, Fleiss' kappa, and Gwet's AC1 between expert raters, and via accuracy comparisons between experts and the crowd using both individual votes and weighted majority voting.
Results. A total of 1,542 participants (8 board-certified clinical neurophysiologists/epileptologists and 1,534 non-experts) answered 478,834 questions across six SRPPs. Using unweighted individual votes, the crowd's performance was inferior to experts overall and for every SRPP. Using weighted majority voting, the crowd was non-inferior to experts overall (accuracy 0.70 [0.69-0.70] vs 0.68 [0.68-0.70]) and matched or exceeded experts in most SRPPs — except LPDs and "Other." No individual expert outperformed the crowd on overall metrics.
Significance. This proof-of-concept demonstrates that crowdsourcing, with appropriate weighting, can yield expert-level SRPP annotations and offers a path toward the large, diverse datasets needed for training automated detection algorithms.
Background
Continuous EEG monitoring is now standard practice in many ICUs for identifying seizures, periodic discharges, and rhythmic patterns associated with secondary brain injury. Automated detection methods based on machine-learning models — particularly deep neural networks — have advanced rapidly, but every such method requires substantial volumes of expert-annotated EEG. Manual annotation is labor-intensive and limited by the number of trained neurophysiologists worldwide. Existing public EEG datasets are mainly geared toward seizure detection/prediction; only one public dataset to date is curated for the full set of seizures and rhythmic/periodic patterns (SRPPs).
The classical "wisdom of the crowd" argument — that the aggregated judgment of many non-experts can rival or exceed that of an individual expert — has been validated in medical image annotation tasks (skin lesions, sleep spindles, lung ultrasound, dermoscopy). Whether it generalises to the more complex visual reasoning required for ICU EEG SRPP identification was, until this study, an open question.
Methods
Participants and scoring contest
The study was conducted under IRB protocols at MGH (#2013P001024) and BIDMC (#2024P000804), with consent waivers for the use of de-identified EEG data. EEG segments were drawn from the IIIC labeling study (Jing et al., 2023). The contest was launched at the Critical Care EEG Monitoring Research Consortium (CCEMRC) annual meeting on June 6, 2021 and hosted on the DiagnosUs iOS app for one month. Each question displayed a 10-second EEG epoch (bipolar montage) with a 10-minute companion spectrogram (Figure 1 of the paper). Top performers received small monetary prizes; the contest also served as an educational tool with real-time feedback. Participants self-reported their experience level; "Expert" was defined as having completed at least one year of specialty training (board-certified in epilepsy or clinical neurophysiology).
Gold standard
The reference labels were inherited from prior work (Jing et al., 2023, Neurology), in which 30 expert raters scored 50,697 EEG segments from 2,711 patients. Only segments receiving ≥10 expert votes were retained; the class with the highest vote count (the modal class) was taken as the gold-standard SRPP label.
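A minimal sketch of the gold-standard rule described above, assuming each segment's expert votes are available as a list of label strings. The function name, the label spellings, and the tie-breaking behaviour are illustrative assumptions and are not taken from the study code.

```python
from collections import Counter

MIN_VOTES = 10  # retention threshold stated in the text


def gold_label(votes, min_votes=MIN_VOTES):
    """Return the modal label, or None when the segment has too few votes.

    Tie-breaking is an assumption: Counter.most_common keeps first-seen
    order among equal counts, so the earliest tied label wins.
    """
    if len(votes) < min_votes:
        return None
    label, _count = Counter(votes).most_common(1)[0]
    return label


# Hypothetical segments: 11 votes (retained) and 9 votes (dropped).
retained = gold_label(["lpd"] * 7 + ["gpd"] * 4)
dropped = gold_label(["seizure"] * 9)
```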
Calibration / test split
For each qualified user (one whose responses included each of the six SRPPs at least once, both among their own answers and among the corresponding gold-standard labels), a greedy set-cover procedure selected a minimal calibration set; all remaining responses formed that user's test set. The calibration set is used only to compute per-user, per-pattern accuracies (which serve as voting weights); performance is evaluated on the test set.
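The split can be illustrated with a generic greedy set-cover sketch. It assumes each response is a (problem_id, user_answer, gold_label) tuple and that the twelve requirements to cover are "each pattern appears among the answers" and "each pattern appears among the gold labels". The function names and the data layout are hypothetical; the repository's implementation may differ.

```python
PATTERNS = ("seizure", "lpd", "gpd", "lrda", "grda", "other")


def covers(response):
    """Requirements satisfied by one (problem_id, answer, gold) response."""
    _problem_id, answer, gold = response
    return {("answer", answer), ("gold", gold)}


def greedy_calibration_set(responses, patterns=PATTERNS):
    """Pick responses until every pattern is covered in answers and gold.

    Returns the chosen responses, or None if the user is not qualified.
    """
    needed = {(kind, p) for kind in ("answer", "gold") for p in patterns}
    remaining = list(responses)
    chosen = []
    while needed:
        # Greedy step: take the response covering the most unmet requirements.
        best = max(remaining, key=lambda r: len(covers(r) & needed), default=None)
        if best is None or not (covers(best) & needed):
            return None
        chosen.append(best)
        remaining.remove(best)
        needed -= covers(best)
    return chosen


# Hypothetical user: six responses that jointly cover everything, plus extras.
responses = [(i, p, PATTERNS[(i + 1) % 6]) for i, p in enumerate(PATTERNS)]
responses += [(100, "seizure", "seizure"), (101, "lpd", "lpd")]
calibration = greedy_calibration_set(responses)
test_set = [r for r in responses if r not in calibration]
```

Greedy selection is not guaranteed to find the smallest possible cover in general, but it keeps the calibration set small so that most responses remain available for testing.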
Inter-rater agreement
Pairwise agreement was computed over all overlapping items (pairs with ≥5 shared questions retained). Fleiss' kappa was computed by comparing observed item-level agreement to expected agreement; Gwet's AC1 was computed as an alternative metric less sensitive to skewed prevalence.
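The two chance-corrected statistics can be sketched from their textbook definitions. The input here is assumed to be an items-by-categories table of vote counts in which every item has at least two ratings; this layout and the function names are illustrative and not necessarily those used in the analysis code.

```python
def observed_agreement(counts):
    """Mean fraction of agreeing rater pairs per item."""
    total = 0.0
    for row in counts:
        r = sum(row)  # number of ratings for this item (must be >= 2)
        total += sum(n * (n - 1) for n in row) / (r * (r - 1))
    return total / len(counts)


def fleiss_kappa(counts):
    """Fleiss' kappa; undefined if all ratings fall in a single category."""
    p_obs = observed_agreement(counts)
    n_ratings = sum(sum(row) for row in counts)
    k = len(counts[0])
    p_k = [sum(row[c] for row in counts) / n_ratings for c in range(k)]
    p_exp = sum(p * p for p in p_k)
    return (p_obs - p_exp) / (1 - p_exp)


def gwet_ac1(counts):
    """Gwet's AC1, whose chance term shrinks when prevalence is skewed."""
    p_obs = observed_agreement(counts)
    k = len(counts[0])
    pi_k = [sum(row[c] / sum(row) for row in counts) / len(counts) for c in range(k)]
    p_exp = sum(p * (1 - p) for p in pi_k) / (k - 1)
    return (p_obs - p_exp) / (1 - p_exp)


# Hypothetical two-category data: balanced perfect agreement vs. skewed prevalence.
perfect = [[3, 0], [0, 3]]
skewed = [[3, 0]] * 9 + [[2, 1]]
```

On the skewed table, raters agree on almost every item, yet Fleiss' kappa is slightly negative while AC1 stays high, which is the prevalence sensitivity the text refers to.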
Mixed-effects models
Four mixed-effects models were fitted using Restricted Maximum Likelihood (REML), each with a random intercept for problem_id to capture between-problem variability:
- Model 1 (overall, non-weighted): accuracy ~ group + avg_question_count + (1 | problem_id)
- Model 2 (by-pattern, non-weighted): accuracy ~ group * pattern + avg_question_count + (1 | problem_id)
- Model 3 (overall, weighted majority): same fixed effects, outcome is per-problem WM correctness
- Model 4 (by-pattern, weighted majority)
Weighted majority voting
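For each problem j, the predicted class is
$$\hat{c}_j = \arg\max_{c} \sum_{i} w_i \, \mathbb{I}(c_{ij} = c)$$
where the sum runs over the users i who answered problem j, c_{ij} is the class chosen by user i, and w_i is user i's weight, defined as the mean of their six per-pattern calibration accuracies.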
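The voting rule can be sketched in a few lines. The dictionaries, user identifiers, and accuracy values below are invented for illustration; only the rule itself (sum the weights of the users voting for each class and take the largest) comes from the description above.

```python
from collections import defaultdict


def user_weight(per_pattern_accuracy):
    """Weight = mean of a user's per-pattern calibration accuracies."""
    return sum(per_pattern_accuracy.values()) / len(per_pattern_accuracy)


def weighted_majority(votes, weights):
    """votes: {user_id: label} for one problem; weights: {user_id: w_i}."""
    score = defaultdict(float)
    for user, label in votes.items():
        score[label] += weights[user]
    # Ties are broken by insertion order here, which is an assumption.
    return max(score, key=score.get)


# Hypothetical problem: one strong rater outvotes two weak ones.
weights = {"a": 0.9, "b": 0.4, "c": 0.4}
votes = {"a": "lpd", "b": "gpd", "c": "gpd"}
prediction = weighted_majority(votes, weights)
```

In this toy case an unweighted majority would pick "gpd" (two votes to one), whereas the weighted vote picks "lpd" (0.9 against 0.8).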
Non-inferiority test
A non-inferiority margin of 0.05 was chosen a priori. The null hypothesis is that Expert accuracy exceeds Crowd accuracy by more than the margin; rejection demonstrates non-inferiority. Per-pattern p-values were Bonferroni-corrected; p < 0.025 was considered statistically significant.
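As a rough illustration of the hypothesis structure, the sketch below runs a one-sided Wald-type test on an estimated accuracy difference. This is a simplification: the study derived its estimates from mixed-effects models, and the standard error used in the example is a made-up value. Only the margin (0.05), the significance threshold (0.025), and the overall accuracies (0.68 and 0.70) come from the text.

```python
from statistics import NormalDist

MARGIN = 0.05  # non-inferiority margin from the text
ALPHA = 0.025  # significance threshold from the text


def noninferiority_p(acc_expert, acc_crowd, se_diff, margin=MARGIN):
    """One-sided p-value for H0: acc_expert - acc_crowd >= margin."""
    z = ((acc_expert - acc_crowd) - margin) / se_diff
    return NormalDist().cdf(z)


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Overall accuracies from the abstract; se_diff = 0.01 is purely illustrative.
p_overall = noninferiority_p(acc_expert=0.68, acc_crowd=0.70, se_diff=0.01)
non_inferior = p_overall < ALPHA
```

Rejecting the null hypothesis (a small p-value) supports non-inferiority; a crowd that trails the experts by more than the margin would yield a large p-value instead.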
Sensitivity analyses
(i) Leave-one-group-out: sequentially removed each crowd subgroup (MD/DO, Medical student, NP/PA/pharmacist, Other students, Other) and recomputed weighted-majority accuracy. (ii) Sample-size bootstrap: drew 1,000 bootstrap samples of N users (N = 5...50) and recomputed crowd WM accuracy. (iii) Hard filter: excluded the bottom 20% of crowd raters by calibration accuracy. (iv) By-subgroup forest plot: fitted separate mixed models pitting each crowd subgroup against the experts.
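The sample-size bootstrap in (ii) might look roughly like the following. The data structures are hypothetical, and sampling users with replacement (so that a user drawn twice counts twice) is an assumption about how the resampling was done.

```python
import random
from collections import defaultdict


def wm_accuracy(users, responses, weights, gold):
    """Weighted-majority accuracy using only the sampled users."""
    scores = defaultdict(lambda: defaultdict(float))
    for u in users:
        for problem, label in responses[u].items():
            scores[problem][label] += weights[u]
    if not scores:
        return float("nan")
    correct = sum(max(s, key=s.get) == gold[p] for p, s in scores.items())
    return correct / len(scores)


def bootstrap_wm(responses, weights, gold, n_users, n_boot=1000, seed=0):
    """Draw n_boot samples of n_users (with replacement); return accuracies."""
    rng = random.Random(seed)
    pool = sorted(responses)
    return [
        wm_accuracy(rng.choices(pool, k=n_users), responses, weights, gold)
        for _ in range(n_boot)
    ]


# Hypothetical crowd: one reliable rater and two unreliable ones.
gold = {"p1": "lpd", "p2": "grda"}
responses = {
    "a": {"p1": "lpd", "p2": "grda"},
    "b": {"p1": "gpd", "p2": "lrda"},
    "c": {"p1": "gpd", "p2": "lrda"},
}
weights = {"a": 0.9, "b": 0.3, "c": 0.3}
accs = bootstrap_wm(responses, weights, gold, n_users=3, n_boot=200)
```

Repeating the call for each N from 5 to 50 and summarising each list (for example by its mean and percentile interval) gives an accuracy-versus-crowd-size curve.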
Data Description
This BDSP project hosts the data and code that accompany Kong et al. (Epilepsia 2025). The S3 folder is organised into five sub-directories — annotations, raw exports, EEG signals, contest images, and docs:
<project-root>/
├── README.md, LICENSE.txt, citation.bib, CHANGELOG.md
├── annotations/
│   ├── test_df4.csv (~150 MB) ← primary analysis dataframe
│   ├── test_df4_dictionary.md
│   ├── calibration_assignments.csv
│   └── labels_experts30.xlsx
├── raw_contest_exports/
│   ├── 1251-all-reads_ac.csv (~133 MB)
│   ├── 1251-all-users_deidentified.csv
│   ├── eeg-all-users-and-topics_deidentified.csv
│   └── Results_SeizureLike_Patterns_Dec_12_2022.csv
├── eeg_signals/
│   ├── iiic_contest_eeg.h5 (~12 GB) ← bundled per-segment EEG + spectrograms
│   └── iiic_contest_eeg_schema.md
├── contest_images/ ← optional (subject to Centaur release)
│   ├── manifest.csv
│   └── png/{segment_id}.png
└── docs/
    ├── methods_summary.pdf
    └── data_dictionary_full.pdf
Key files:
- annotations/test_df4.csv: the master per-response dataframe (~496k rows: ~478,834 in the test split + ~17,619 in the calibration split). One row per (user_id, problem_id) with the user's selected SRPP, the gold-standard SRPP, the experience-level group, the per-user/per-pattern calibration accuracies (the weights used for weighted-majority voting), and the contest image URL. Schema documented in annotations/test_df4_dictionary.md.
- annotations/labels_experts30.xlsx: the pivoted 30-expert label matrix from Jing et al. 2023; one row per EEG segment, one column per expert; numeric values 0-5 map to (other, seizure, lpd, gpd, lrda, grda). Used for Supp S4 and S5.
- raw_contest_exports/: Centaur Labs source exports with PII removed (email, first/last names, app_display_name stripped). These are kept for provenance: the upstream pipeline that builds test_df4.csv uses them directly.
- eeg_signals/iiic_contest_eeg.h5: one HDF5 archive bundling the 50 s EEG signal (21 channels @ 200 Hz) and four 10-minute regional spectrograms (LL/RL/LP/RP) per segment, plus per-segment metadata (gold-standard label, 30-expert vote tallies, contest image URL). Random access by segment_id via h5py; replaces what would otherwise be 10,704 individual .mat files.
- contest_images/ (if Centaur Labs releases them): the PNGs participants actually saw. Otherwise these can be regenerated locally from iiic_contest_eeg.h5 using scripts/contest_image_gen/make_contest_image.py in the GitHub repo.
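For readers working with the 30-expert label matrix, the numeric coding can be decoded as in the sketch below. The code-to-label mapping is the one given above; representing an expert who did not rate a segment as None is an assumption (a spreadsheet reader would more likely yield an empty cell or NaN, which would need converting first), and the function name is hypothetical.

```python
# Mapping stated in the file description: 0-5 -> (other, seizure, lpd, gpd, lrda, grda).
CODE_TO_LABEL = {0: "other", 1: "seizure", 2: "lpd", 3: "gpd", 4: "lrda", 5: "grda"}


def tally_row(codes):
    """Count votes per class for one segment (one row of the matrix).

    None marks an expert who did not rate the segment (an assumption
    about how missing cells are represented after loading).
    """
    tally = {label: 0 for label in CODE_TO_LABEL.values()}
    for code in codes:
        if code is None:
            continue
        tally[CODE_TO_LABEL[int(code)]] += 1
    return tally


# Hypothetical row with five expert columns, one of them empty.
example = tally_row([1, 1, 2, None, 0])
```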
Usage Notes
A complete, documented Python reproduction of every figure and table in the paper is available on GitHub at https://github.com/bdsp-core/iiic-crowdsourcing-wanyee.
To reproduce the paper from this data:
- Apply for credentialed access through BDSP if you don't already have it; the data live at s3://bdsp-opendata-credentialed/iiic-irr-crowd/.
- Clone the GitHub repository and create a Python environment (pip install -r requirements.txt).
- Sync the S3 folder into your local checkout:
  aws s3 sync s3://bdsp-opendata-credentialed/iiic-irr-crowd/annotations/ ./data/
- Run python scripts/reproduce_all.py to regenerate every figure, table, and supplemental analysis. Individual figures can be run via the per-script entry points (e.g. python scripts/figure3_forest_plot.py).
Outputs are written to figures/ as PNG + PDF, with a CSV of the
underlying numeric values for each figure to ease spot-checking.
Suggested use cases:
- Training and evaluating machine-learning classifiers for ICU EEG pattern detection (the per-question gold-standard labels in test_df4.csv are suitable as training targets).
- Benchmarking new crowdsourcing or label-aggregation algorithms (the per-response, per-user dataset supports virtually any weighting scheme).
- Methodological research on inter-rater agreement and consensus building among medical experts.
Release Notes
Version 1.0.0 — initial public release accompanying the published manuscript (Epilepsia, August 2025).
Ethics
This study used de-identified EEG data and was conducted under institutional review board (IRB) protocols at Massachusetts General Hospital (Protocol no. MGH 2013P001024) and Beth Israel Deaconess Medical Center (BIDMC #2024P000804). Both protocols provided waiver of consent for the research use of de-identified EEG. Contest participants registered freely through the DiagnosUs app and were notified that aggregated, de-identified contest results would be used for research.
No directly identifying information about EEG patients is included in this data package; all per-segment identifiers are pseudonymous and cannot be linked to a specific individual without keys held by Massachusetts General Hospital.
For contest participants, the BDSP release contains only the Centaur Labs
user_id and self-reported demographics (country,
experience_level, preferred_specialty); names, email addresses, and any
other directly identifying fields present in the original Centaur exports
have been removed prior to upload.
Acknowledgements
The authors gratefully acknowledge the contributions of all participants in the SRPP scoring contest. The contest was hosted by Centaur Labs on the DiagnosUs platform. The 30-expert gold-standard labels are inherited from the IIIC labeling study (Jing et al., 2023, Neurology) and the authors thank all contributors to that effort.
Conflicts of Interest
This work was supported by the National Institutes of Health (NIH; RF1AG064312, RF1NS120947, R01AG073410, R01HL161253, R01NS126282, R01AG073598, R01NS131347, R01NS130119) and the National Science Foundation (NSF; 2014431).
Dr. Westover is a co-founder, scientific advisor, consultant to, and has personal equity interest in Beacon Biosignals. He also receives royalties for authoring Pocket Neurology from Wolters Kluwer and Atlas of Intensive Care Quantitative EEG by Demos Medical. Erik Duhaime and Srishti Kapur are employees and have personal equity interest in Centaur Labs. There are no conflicts of interest for the other authors.
References
- Kong W-Y, Nascimento FA, Struck A, Duhaime E, Kapur S, Amorim E, Kapinos G, Rodriguez A, Thomas B, Desai M, Lee JW, Westover MB, Jing J. Evaluating crowdsourcing for ICU EEG annotation: A comparison with expert performance. Epilepsia 2025;66(11):4366-4380. https://doi.org/10.1111/epi.18547
- Jing J, Ge W, Struck AF, Fernandes MB, Hong S, An S, et al. Interrater reliability of expert electroencephalographers identifying seizures and rhythmic and periodic patterns in EEGs. Neurology 2023;100(17):e1737-e1749. https://doi.org/10.1212/WNL.0000000000201670
Access
Access Policy:
Only credentialed users who sign the DUA can access the files.
License (for files):
BDSP Credentialed Health Data License 1.5.0
Data Use Agreement:
BDSP Credentialed Health Data Use Agreement
Required training:
Discovery
DOI:
https://doi.org/10.60508/fn0v-8w16
Project Website:
https://github.com/bdsp-core/iiic-crowdsourcing-wanyee
Corresponding Author
Files
To access the files, you must:
- be a credentialed user
- sign the data use agreement for the project