ICML 2026 · vision-language robustness

VLM-RobustBench

A comprehensive benchmark for robustness of vision-language models.

We evaluate modern VLMs under realistic visual corruptions and find a consistent pattern: current models are semantically strong, but spatially fragile.

Rohit Saxena Alessandro Suglia Pasquale Minervini

University of Edinburgh · Miniml.AI

Comparison from VLM-RobustBench: original image, high-severity brightness reduction, and low-severity glass blur.

49 augmentation types

133 corrupted settings

15 VLMs evaluated

2 benchmarks

Benchmark design

From clean accuracy to deployment stress tests

VLM-RobustBench applies 42 severity-based corruptions at low, mid, and high levels, plus 7 binary transforms, to MMBench and MMMU-Pro. The evaluation keeps prompts and answer formats fixed so drops isolate visual robustness.
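The headline counts follow directly from this design: 42 corruptions at three severity levels plus 7 binary transforms give 49 augmentation types and 133 corrupted settings. A minimal sketch of that enumeration is below; the corruption names are hypothetical placeholders, not the benchmark's actual identifiers.

```python
from itertools import product

SEVERITIES = ("low", "mid", "high")
NUM_SEVERITY_CORRUPTIONS = 42  # stated in the benchmark description
NUM_BINARY_TRANSFORMS = 7      # stated in the benchmark description

# Placeholder names (hypothetical); the real suite uses named corruptions
# such as glass blur or brightness reduction.
severity_corruptions = [f"corruption_{i:02d}" for i in range(NUM_SEVERITY_CORRUPTIONS)]
binary_transforms = [f"binary_{i}" for i in range(NUM_BINARY_TRANSFORMS)]


def enumerate_settings():
    """Return every (name, severity) pair; binary transforms have severity None."""
    settings = list(product(severity_corruptions, SEVERITIES))
    settings += [(name, None) for name in binary_transforms]
    return settings


settings = enumerate_settings()
num_types = len(severity_corruptions) + len(binary_transforms)
print(num_types, len(settings))  # 49 133
```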

Blur

Gaussian, motion, defocus, glass, zoom.

Noise

Gaussian, shot, speckle, salt and pepper.

Weather

Fog, frost, snow, rain, spatter.

Spatial

Rotation, shear, perspective, elastic transforms.

Resolution

Downsample, upsample, sharpen, posterize, solarize.

Binary

Flips, grayscale, invert, channel swap, equalize.

Corruption gallery

The suite covers photometric degradation, spatial warping, resampling artifacts, occlusion, and VLM-specific overlays.

Examples of geometric augmentations at increasing severity levels.

Interactive results

Spatial and resampling corruptions dominate the tail

Explore the main robustness table across MMBench and MMMU-Pro. Worst-case drops reach 34 percentage points on MMBench, with upsampling as the dominant failure mode.

Clean accuracy by model

Higher is better.

Accuracy vs tail risk

Bubble size tracks visual gain.

mCE profile

Lower is better; 100 matches the dataset reference.
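This page does not spell out how mCE is computed. The sketch below assumes the common convention in which each corruption's error, summed over severities, is divided by a reference error for the same corruption and scaled by 100, so that a model matching the reference scores exactly 100. The function names and the toy error rates are illustrative assumptions, not the benchmark's implementation or results.

```python
def corruption_error(model_err, ref_err):
    """Normalized error for one corruption.

    model_err, ref_err: error rates (0..1), one entry per severity level.
    Assumed normalization: 100 * sum(model errors) / sum(reference errors).
    """
    total_ref = sum(ref_err)
    if total_ref == 0:
        raise ValueError("reference error must be non-zero")
    return 100.0 * sum(model_err) / total_ref


def mean_corruption_error(model_errs, ref_errs):
    """Average the normalized corruption errors over all corruptions."""
    ces = [corruption_error(model_errs[name], ref_errs[name]) for name in ref_errs]
    return sum(ces) / len(ces)


# Toy error rates for two hypothetical corruptions (made up for illustration).
reference = {"a": [0.2, 0.3, 0.4], "b": [0.1, 0.1, 0.2]}
halved = {name: [e / 2 for e in errs] for name, errs in reference.items()}

mce_reference = mean_corruption_error(reference, reference)  # about 100
mce_halved = mean_corruption_error(halved, reference)        # about 50
```

Under this assumed normalization, a value below 100 means the model loses less to corruption than the reference does.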

Severity trajectories

Family-level mean drops on MMBench.

Failure modes

Visually mild does not mean model-easy

Low-severity glass blur reduces MMBench accuracy by about 8 points on average, while high-severity brightness reduction costs only about 2 points. This breaks the usual assumption that visual severity is a reliable proxy for model difficulty.
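The comparison above can be written as a simple check: a setting counts as a mismatch when it is visually milder than another yet costs more accuracy. The sketch uses only the two approximate figures quoted in the text. The helper name and the cross-corruption comparison of severity labels are assumptions for illustration, and this is not necessarily how the benchmark defines its violation rate.

```python
# Ordinal rank of the visual severity labels used by the benchmark.
VISUAL_RANK = {"low": 1, "mid": 2, "high": 3}

# Approximate mean MMBench drops in percentage points, as quoted in the text.
mean_drop = {
    ("glass_blur", "low"): 8.0,
    ("brightness_reduction", "high"): 2.0,
}


def is_severity_mismatch(milder, harsher, drops):
    """True if `milder` has a lower visual severity than `harsher`
    but causes a larger accuracy drop."""
    lower_severity = VISUAL_RANK[milder[1]] < VISUAL_RANK[harsher[1]]
    return lower_severity and drops[milder] > drops[harsher]


mismatch = is_severity_mismatch(
    ("glass_blur", "low"), ("brightness_reduction", "high"), mean_drop
)
print(mismatch)  # True
```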

Top corruptions on MMBench by severity and mean accuracy drop over 12 open-weights models.

Top corruptions on MMMU-Pro by severity and mean accuracy drop over 12 open-weights models.

Answer flips by severity

Harmful flips are correct-to-wrong transitions.
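A minimal sketch of counting such flips from per-example correctness on clean and corrupted inputs. Only the "harmful" definition comes from this page; the "beneficial" and "stable" labels, the function names, and the toy data are hypothetical.

```python
from collections import Counter


def classify_flip(clean_correct, corrupted_correct):
    """Label one example by how its correctness changes under corruption."""
    if clean_correct and not corrupted_correct:
        return "harmful"      # correct-to-wrong, as defined above
    if not clean_correct and corrupted_correct:
        return "beneficial"   # wrong-to-correct (hypothetical label)
    return "stable"           # correctness unchanged (hypothetical label)


def flip_counts(clean, corrupted):
    """Count flip labels over paired per-example correctness lists."""
    if len(clean) != len(corrupted):
        raise ValueError("clean and corrupted must have the same length")
    return Counter(classify_flip(c, k) for c, k in zip(clean, corrupted))


# Toy correctness flags for five examples (made up for illustration).
clean = [True, True, False, False, True]
corrupted = [False, True, True, False, False]

counts = flip_counts(clean, corrupted)
harmful_rate = counts["harmful"] / len(clean)
```

Tracking flips rather than net accuracy change separates examples that break from examples that happen to recover, which a single drop figure averages away.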

Tier distribution

Counts over corruption configurations and models.

Domain sensitivity

Worst drop by MMBench category or MMMU-Pro domain.

Relative corruption error

Top corruptions by lost visual contribution.

Qualitative examples of VLM prediction changes under visual corruptions: spatial and resampling corruptions flip otherwise correct predictions, while some severe photometric changes preserve the answer.

Spatial fragility

Upsampling and elastic transforms drive the largest drops, up to 34 percentage points.

Severity mismatch

Violation rates are 30.2% on MMBench and 56.1% on MMMU-Pro.

Family fingerprints

Robustness is not explained by parameter count alone; families fail differently.

Citation

VLM-RobustBench

@inproceedings{saxena2026vlmrobustbench,
  title = {VLM-RobustBench: A Comprehensive Benchmark for Robustness of Vision-Language Models},
  author = {Saxena, Rohit and Suglia, Alessandro and Minervini, Pasquale},
  booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},
  year = {2026}
}