Project Page • Vision-Language Model Robustness

What Makes VLMs Robust? Towards Reconciling Robustness and Accuracy in Vision-Language Models

State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences

University of Chinese Academy of Sciences

+10.8% higher clean accuracy than FARE on 16-dataset average
+4.4% higher AutoAttack (4/255) robustness than FARE on 16-dataset average
640× fewer training images than standard adversarial fine-tuning
Broad applicability across classification, retrieval, VQA, and captioning tasks
Overview of the R-Adapt framework and average clean/robust performance.
R-Adapt shows that adversarial robustness in VLMs is largely established in the earliest layers. By freezing the pre-trained backbone and adapting only the input filter and first attention block, it achieves a much stronger robustness-accuracy trade-off than standard adversarial fine-tuning.

R-Adapt: Adversarial Robustness Adaptation

Achieving adversarial robustness in Vision-Language Models (VLMs) inevitably compromises accuracy on clean data, presenting a long-standing and challenging trade-off. In this work, we revisit this trade-off by investigating a fundamental question: What makes VLMs robust? Through a detailed analysis of adversarially fine-tuned models, we examine how robustness mechanisms function internally and how they interact with clean accuracy. Our analysis reveals that adversarial robustness is not uniformly distributed across network depth. Instead, unexpectedly, it is primarily localized within the shallow layers, driven by a low-frequency spectral bias and input-insensitive attention patterns. Meanwhile, updates to the deep layers tend to undermine both clean accuracy and robust generalization.

Motivated by these insights, we propose Adversarial Robustness Adaptation (R-Adapt), a simple yet effective framework that freezes all pre-trained weights and introduces minimal, insight-driven adaptations only in the initial layers. This design achieves an exceptional balance between adversarial robustness and clean accuracy. R-Adapt further supports training-free, model-guided, and data-driven paradigms, offering flexible pathways to seamlessly equip standard models with robustness. Extensive evaluations on 18 datasets and diverse tasks demonstrate our state-of-the-art performance under various attacks. Notably, R-Adapt generalizes efficiently to large vision-language models (e.g., LLaVA and Qwen-VL) to enhance their robustness.

Why shallow layers matter

Layer-wise analysis shows that adversarial robustness emerges early, while deeper updates often hurt both clean accuracy and robust generalization.

Shallow layers are the primary source of robustness

Centered Kernel Alignment and progressive replacement experiments show that most robustness gains are already established after the embedding layer and the first attention block. Replacing deeper blocks adds little robustness while increasing the risk of semantic overfitting.
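The replacement protocol can be sketched in a few lines — a toy, model-agnostic illustration in which each "block" is a stand-in function (the names `clean`, `robust`, and `progressive_replace` are hypothetical, not the paper's code):

```python
def progressive_replace(clean_blocks, robust_blocks, k):
    """Build a hybrid encoder: the first k blocks come from the robust
    model, the remaining blocks from the clean model."""
    return robust_blocks[:k] + clean_blocks[k:]

def run(blocks, x):
    """Run the input through the block sequence."""
    for f in blocks:
        x = f(x)
    return x

# Toy stand-ins: each "block" is a simple function of its input. Only the
# first block differs between the clean and robust models, mirroring the
# finding that robustness is concentrated at the start of the network.
clean = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
robust = [lambda x: x + 10, lambda x: x * 2, lambda x: x - 3]

# k = 1: only the first block is swapped in from the robust model.
hybrid = progressive_replace(clean, robust, 1)
```

Sweeping `k` from 0 to the network depth and measuring robust accuracy at each step is what reveals that gains saturate after the first block.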

Layer-wise CKA analysis and progressive replacement results.

Two-stage robustness pattern in shallow layers

Robust models first suppress high-frequency perturbations in the embedding layer, then redirect first-block attention away from fragile low-level primitives toward a stable, input-insensitive response. R-Adapt explicitly recreates these two effects.

Spectral shift and attention pattern visualizations.

Updates to deeper layers hurt generalization

Progressive replacement shows that once robustness has already been established in the shallow layers, continuing to update deeper blocks becomes harmful. Deeper updates often hurt both clean accuracy and robust generalization, especially when the model is transferred beyond the source distribution used for adversarial fine-tuning.

Progressive replacement trends showing the robustness-accuracy trade-off and overfitting.
Once replacement is applied to deeper layers, clean accuracy and robust generalization both begin to drop in the out-of-distribution setting.

Minimal adaptation, frozen backbone

R-Adapt keeps the original CLIP backbone intact and intervenes only where the analysis says robustness lives: the embedding layer and the first attention block.

Three robustness anchor acquisition paradigms.
R-Adapt supports training-free, model-guided, and data-driven anchor acquisition.

Gaussian Input Filter

A lightweight Gaussian low-pass filter attenuates the high-frequency components typically exploited by adversarial perturbations before they enter the visual encoder.
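A minimal sketch of such a filter, assuming a separable Gaussian blur applied to the image before the visual encoder (the kernel radius rule and σ are illustrative choices, not the paper's exact settings):

```python
import numpy as np

def gaussian_kernel1d(sigma, radius):
    """Normalized 1-D Gaussian kernel."""
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def gaussian_filter_image(img, sigma=1.0):
    """Separable Gaussian low-pass filter over an (H, W) image.
    Attenuates the high-frequency content that adversarial
    perturbations typically exploit, before the encoder sees the input."""
    radius = int(3 * sigma)  # common 3-sigma truncation (an assumption)
    k = gaussian_kernel1d(sigma, radius)
    pad = np.pad(img, radius, mode="reflect")
    # Convolve rows, then columns (separability of the Gaussian).
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, pad)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)
```

For multi-channel inputs the same filter would be applied per channel; since the kernel is normalized, flat regions pass through unchanged while high-frequency noise is damped.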

Fixed Robustness Anchor

The first attention block is replaced with a linear combination of the original response and a fixed robustness anchor. This stabilizes the most fragile stage of the network without changing any deeper weights.
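The blending rule can be sketched as a convex combination (the mixing coefficient `alpha` is an assumed hyperparameter; the paper's exact parameterization may differ):

```python
import numpy as np

def anchored_attention(attn_out, anchor, alpha=0.5):
    """Blend the first block's attention response with a fixed robustness
    anchor. Higher alpha makes the output more input-insensitive, since the
    anchor term does not depend on the current input."""
    return (1.0 - alpha) * attn_out + alpha * anchor
```

Because the anchor is constant, perturbing the input can only move the blended response through the `(1 - alpha)` term, which shrinks the effect of fragile low-level attention shifts without touching any deeper weights.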

Training-Free

R-AdaptCLIP

Extract the anchor from a standard CLIP using a uniform white image. No training, no external robust model.
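A sketch of the training-free variant, assuming the anchor is simply the cached response of the first block to a uniform white input (`first_block` is a hypothetical stand-in for CLIP's patch embedding plus first attention block):

```python
import numpy as np

def extract_anchor(first_block, image_shape=(224, 224, 3)):
    """Training-free anchor: the first block's response to a uniform white
    image, computed once and reused at every inference call."""
    white = np.ones(image_shape, dtype=np.float32)  # all-ones "white" input
    return first_block(white)

# Toy stand-in for the first block: a fixed map over simple image statistics.
toy_block = lambda img: np.array([img.mean(), img.max()])
anchor = extract_anchor(toy_block)
```

The appeal of this paradigm is that it needs no training data and no external robust model: the anchor comes entirely from the standard CLIP checkpoint itself.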

Model-Guided

R-AdaptM

Borrow the robust prior from an adversarially fine-tuned model (M), then run inference with a frozen clean backbone.

Data-Driven

R-Adapt+

Optimize only the anchor on 2k adversarial samples, then transfer the learned anchor across tasks and datasets.
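A toy sketch of anchor-only optimization: with the backbone frozen, fit the anchor so that the blended response on adversarial features approaches the clean response (the MSE objective and plain gradient descent here are illustrative assumptions, not the paper's training recipe):

```python
import numpy as np

def optimize_anchor(adv_feats, clean_feats, alpha=0.5, lr=0.1, steps=200):
    """Fit only the anchor vector; every other parameter stays frozen.
    adv_feats / clean_feats: (N, D) first-block responses on adversarial
    and clean versions of the same N inputs."""
    anchor = np.zeros(adv_feats.shape[1])
    for _ in range(steps):
        # Blended output on adversarial inputs (same rule as at inference).
        blended = (1.0 - alpha) * adv_feats + alpha * anchor
        # Gradient of mean squared error w.r.t. the anchor.
        grad = 2.0 * alpha * (blended - clean_feats).mean(axis=0)
        anchor -= lr * grad
    return anchor
```

Because only a single D-dimensional vector is trained, a couple of thousand adversarial samples suffice, and the learned anchor transfers across tasks without retraining the backbone.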

Strong robustness without collapsing clean performance

R-Adapt+ improves the trade-off across classification, retrieval, and multimodal generation (Captioning and VQA), while remaining much more data-efficient than standard adversarial fine-tuning.

67.0%

Average clean accuracy

R-Adapt+ reaches the best 16-dataset average clean accuracy among robust methods (TeCoA, TGA, FARE) while remaining close to standard CLIP.

57.2%

Average robustness

R-Adapt+ sets the strongest average robustness across 16 datasets at ε = 4/255 under AutoAttack, outperforming FARE by 4.4 points.

83.6 / 79.7

Retrieval R@1

On Flickr30k image-to-text retrieval, R-Adapt+ preserves clean alignment and improves robust R@1 under attack.

85.3 / 79.4

CAP / VQA robustness

The same design extends to larger LVLMs and restores robust captioning and VQA under strong visual attacks.

Average classification trade-off on 16 datasets

R-Adapt reaches stronger clean-robust balance than standard adversarial fine-tuning baselines.

| Method | Training Budget | Clean Avg. | Robust Avg. |
|---|---|---|---|
| CLIP | None | 69.8 | 0.0 |
| TeCoA | 1.28M images | 49.6 | 48.2 |
| TGA | 1.28M images | 56.3 | 50.2 |
| FARE | 1.28M images | 56.2 | 52.8 |
| R-AdaptCLIP (ours) | 0 images (training-free) | 66.2 | 48.4 |
| R-AdaptFARE (ours) | 0 images (model-guided) | 64.5 | 55.6 |
| R-Adapt+ (ours) | 2k images (data-driven) | 67.0 (↑10.8) | 57.2 (↑4.4) |

BibTeX

Use the entry below to cite this work.

@article{nie2026what,
  title   = {What Makes VLMs Robust? Towards Reconciling Robustness and Accuracy in Vision-Language Models},
  author  = {Nie, Sen and Zhang, Jie and Wang, Zhongqi and Wei, Zhaoyang and Shan, Shiguang and Chen, Xilin},
  journal = {arXiv preprint arXiv:2603.12799},
  year    = {2026},
  url     = {https://arxiv.org/abs/2603.12799}
}