Designing Benchmarks and Challenges for Measuring Algorithm Performance in Biomedical Image Analysis

Wednesday, 13 April, 8:30am-12:15pm


  • Michal Kozubek (Masaryk University, Brno, Czech Republic)


  • Bram van Ginneken (Radboud University Medical Center, Nijmegen, the Netherlands)
  • Adriënne Mendrik (University Medical Center Utrecht, Utrecht, the Netherlands)
  • Stephen R. Aylward (Kitware Inc., North Carolina, USA)

Topic and background

Biologists and physicians have to be able to rely on the correctness of results obtained by automatic analysis of biomedical images. This, in turn, requires proper quality control of the algorithms and software developed for this task. Both the medical image analysis and bioimage analysis communities are becoming increasingly aware of the strong need for benchmarking image analysis methods in order to compare their performance and assess their suitability for specific applications. Reference benchmark datasets with ground truth (both simulated data and real data annotated by experts) have become publicly available, and challenges are being organized in association with well-known conferences such as ISBI and MICCAI. This tutorial summarizes recent developments in this area, describes common ways of measuring algorithm performance, and provides best-practice guidelines for designing biomedical image analysis benchmarks and challenges, including proper dataset selection (training versus test sets, simulated versus real data), task description, and the definition of corresponding evaluation metrics that can be used to rank performance.

Proper benchmarking of image analysis algorithms and software makes life easier not only for future developers (who can learn the strengths and weaknesses of existing methods) but also for users (who can select the methods that best suit their particular needs). Reviewers can also better assess the usefulness of a newly developed analysis method if it is compared with the best-performing methods for a particular task on the same type of data using standard metrics.


The tutorial will concentrate on best practices for designing biomedical image analysis benchmarks and challenges that measure algorithm performance in a standardized way. First, the design of a benchmark or challenge will be presented, including proper selection of datasets, tasks and evaluation metrics. Next, past benchmarks and challenges will be briefly reviewed. Finally, the topic will be summarized and future directions discussed.


  1. Representative dataset selection: Covering variability of imaged objects
  2. Real versus synthetic data: Advantages and disadvantages
  3. Annotation of real data: Combining ground truth from several experts
  4. Training versus test data: Splitting principles
  5. Evaluation metrics: Measuring performance of classification, segmentation, tracking, restoration and other methods
  6. Merging multiple metrics: Normalization and weighting
  7. Creating rankings: Coping with variable method performance across datasets
  8. Benchmark and challenge lifecycle: Updates, repetitions and open submission modes
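
As a rough illustration of items 3, 5, 6 and 7 above, the following Python sketch combines several expert annotations by majority vote, scores segmentations with the Dice overlap, and merges two metrics into one ranking through min-max normalization and weighting. It is a minimal sketch under stated assumptions, not the procedure of any particular challenge: the method names, the runtime values, the weights, and the choice of majority voting and min-max normalization are all hypothetical and chosen only for illustration.

```python
def majority_vote(annotations):
    """Combine expert masks: a pixel is foreground if more than half the experts marked it."""
    n = len(annotations)
    return [1 if 2 * sum(votes) > n else 0 for votes in zip(*annotations)]


def dice(pred, truth):
    """Dice overlap between two binary masks given as flat 0/1 sequences."""
    inter = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    return 1.0 if total == 0 else 2.0 * inter / total


def min_max_normalize(scores, higher_is_better=True):
    """Map {method: raw score} onto [0, 1], where 1 is always the best value."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {m: 1.0 for m in scores}
    norm = {m: (s - lo) / (hi - lo) for m, s in scores.items()}
    return norm if higher_is_better else {m: 1.0 - v for m, v in norm.items()}


def rank_methods(metric_tables, weights):
    """Rank methods by the weighted mean of their normalized metric scores.

    metric_tables: {metric name: ({method: score}, higher_is_better)}
    weights:       {metric name: non-negative weight}
    """
    total_weight = sum(weights.values())
    methods = list(next(iter(metric_tables.values()))[0])
    combined = {m: 0.0 for m in methods}
    for name, (scores, higher_is_better) in metric_tables.items():
        norm = min_max_normalize(scores, higher_is_better)
        for m in methods:
            combined[m] += weights[name] * norm[m] / total_weight
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy data: three expert annotations of a four-pixel image.
experts = [[1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 1, 0]]
reference = majority_vote(experts)

# Hypothetical segmentations submitted by three methods.
predictions = {
    "method_a": [1, 1, 0, 0],
    "method_b": [1, 0, 0, 0],
    "method_c": [1, 1, 1, 1],
}
dice_scores = {m: dice(p, reference) for m, p in predictions.items()}

# Hypothetical runtimes in seconds (lower is better).
runtimes = {"method_a": 12.0, "method_b": 3.0, "method_c": 6.0}

ranking = rank_methods(
    {"dice": (dice_scores, True), "runtime": (runtimes, False)},
    weights={"dice": 0.75, "runtime": 0.25},
)
for method, score in ranking:
    print(f"{method}: {score:.3f}")
```

Min-max normalization depends on which methods happen to be in the comparison, so adding or removing a submission can change the combined scores. This is one reason rankings are sometimes built from per-metric ranks instead.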


This tutorial is open to everyone with an interest in designing benchmarks or challenges for biomedical image analysis, and/or using benchmarks and challenges for validating algorithms.

The goal is to increase awareness of the importance of benchmarking and proper validation of algorithms. Results on public benchmarks tend to convince reviewers more easily of the novelty of your method; for example, evaluating on several benchmarks can demonstrate that your method generalizes. Organizing a challenge is a great way to help the community and to learn more about the strengths and weaknesses of various methods, which helps researchers build on previous work and can lead to more novel algorithms and better solutions to the task at hand.

Other tutorials

T2: SimpleITK: An Interactive, Python-Based Introduction to SimpleITK with the Insight Segmentation and Registration Toolkit (ITK)
T3: Heart Mechanics by Magnetic Resonance Imaging: Techniques and Applications
T4: Deep Learning