Project Page • Vision-Language Model Robustness
What Makes VLMs Robust? Towards Reconciling Robustness and Accuracy in Vision-Language Models
1State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences
2University of Chinese Academy of Sciences
Abstract
R-Adapt: Adversarial Robustness Adaptation
Achieving adversarial robustness in Vision-Language Models (VLMs) inevitably compromises accuracy on clean data, presenting a long-standing and challenging trade-off. In this work, we revisit this trade-off by investigating a fundamental question: What makes VLMs robust? Through a detailed analysis of adversarially fine-tuned models, we examine how robustness mechanisms function internally and how they interact with clean accuracy. Our analysis reveals that adversarial robustness is not uniformly distributed across network depth. Instead, it is, unexpectedly, primarily localized within the shallow layers, driven by a low-frequency spectral bias and input-insensitive attention patterns. Meanwhile, updates to the deep layers tend to undermine both clean accuracy and robust generalization.
Motivated by these insights, we propose Adversarial Robustness Adaptation (R-Adapt), a simple yet effective framework that freezes all pre-trained weights and introduces minimal, insight-driven adaptations only in the initial layers. This design achieves an exceptional balance between adversarial robustness and clean accuracy. R-Adapt further supports training-free, model-guided, and data-driven paradigms, offering flexible pathways to seamlessly equip standard models with robustness. Extensive evaluations on 18 datasets and diverse tasks demonstrate our state-of-the-art performance under various attacks. Notably, R-Adapt generalizes efficiently to large vision-language models (e.g., LLaVA and Qwen-VL) to enhance their robustness.
Insights
Why shallow layers matter
Layer-wise analysis shows that adversarial robustness emerges early, while deeper updates often hurt both clean accuracy and robust generalization.
Shallow layers are the primary source of robustness
Centered Kernel Alignment and progressive replacement experiments show that most robustness gains are already established after the embedding layer and the first attention block. Replacing deeper blocks adds little robustness while increasing the risk of semantic overfitting.
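Linear Centered Kernel Alignment (CKA) is the standard tool for this kind of layer-wise similarity analysis. The sketch below is an illustrative minimal implementation (the function name and array shapes are our own, not from the paper); comparing a clean model's and a robust model's activations layer by layer with such a function is how one would observe that similarity diverges only in the shallowest blocks.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two activation matrices X (n, d1) and Y (n, d2),
    e.g. the same n inputs passed through corresponding layers of a clean
    and a robust model. Returns a similarity score in [0, 1]."""
    # Center each feature dimension over the batch.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # HSIC-style cross-covariance norm, normalized by self-similarities.
    hsic = np.linalg.norm(X.T @ Y, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return hsic / (norm_x * norm_y)
```

Identical activations yield a score of 1; unrelated activations score near 0, so a per-layer CKA curve directly localizes where two models diverge.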
Two-stage robustness pattern in shallow layers
Robust models first suppress high-frequency perturbations in the embedding layer, then redirect first-block attention away from fragile low-level primitives toward a stable, input-insensitive response. R-Adapt explicitly recreates these two effects.
Updates to deeper layers hurt generalization
Progressive replacement shows that once robustness has already been established in the shallow layers, continuing to update deeper blocks becomes harmful. Deeper updates often hurt both clean accuracy and robust generalization, especially when the model is transferred beyond the source distribution used for adversarial fine-tuning.
Method
Minimal adaptation, frozen backbone
R-Adapt keeps the original CLIP backbone intact and intervenes only where the analysis says robustness lives: the embedding layer and the first attention block.
Gaussian Input Filter
A lightweight Gaussian low-pass filter attenuates the high-frequency components typically exploited by adversarial perturbations before they enter the visual encoder.
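One way to realize such a filter is a Gaussian transfer function applied in the Fourier domain, which passes low frequencies nearly unchanged and exponentially suppresses high ones. This is a minimal single-channel sketch under assumed conventions (the function name and the `sigma` parameterization are illustrative, not the paper's exact formulation):

```python
import numpy as np

def gaussian_low_pass(image, sigma=1.0):
    """Attenuate high-frequency components of a 2-D image (H, W)
    before it enters the visual encoder."""
    h, w = image.shape
    fy = np.fft.fftfreq(h)[:, None]   # vertical frequencies
    fx = np.fft.fftfreq(w)[None, :]   # horizontal frequencies
    # Gaussian transfer function: 1 at DC, decaying with frequency.
    transfer = np.exp(-2.0 * (np.pi * sigma) ** 2 * (fx**2 + fy**2))
    return np.real(np.fft.ifft2(np.fft.fft2(image) * transfer))
```

Because adversarial perturbations concentrate energy in high frequencies, such a filter shrinks the perturbation far more than it shrinks the image content.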
Fixed Robustness Anchor
The first attention block is replaced with a linear combination of the original response and a fixed robustness anchor. This stabilizes the most fragile stage of the network without changing any deeper weights.
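The blending described above can be sketched as a single-head first block whose output mixes the input-dependent attention response with a cached, input-independent anchor. All weights, shapes, and the mixing coefficient `lam` below are illustrative assumptions, not the paper's exact parameterization:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def anchored_first_block(x, wq, wk, wv, anchor, lam=0.7):
    """First attention block with a fixed robustness anchor.
    x: token features (n, d); wq/wk/wv: projection weights (d, d);
    anchor: fixed response, independent of x."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    attn_out = scores @ v
    # Convex combination: the anchor term does not depend on x,
    # so adversarial shifts in x only reach the output scaled by lam.
    return x + lam * attn_out + (1.0 - lam) * anchor
```

At `lam = 0` the block's residual update is fully input-insensitive; at `lam = 1` it reduces to the original attention, so `lam` directly trades robustness against fidelity to the clean model.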
Training-Free
R-AdaptCLIP
Extract the anchor from a standard CLIP using a uniform white image. No training, no external robust model.
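Conceptually, the training-free extraction amounts to running a uniform white image through the frozen embedding and first block and caching the response. The helper names and shapes below are hypothetical stand-ins for the corresponding frozen CLIP components:

```python
import numpy as np

def extract_anchor(first_block, patch_embed, size=224):
    """Training-free anchor extraction: feed a uniform white image
    through the frozen embedding and first block, and cache the
    result as the fixed, input-independent anchor."""
    white = np.ones((size, size, 3))   # uniform white input image
    tokens = patch_embed(white)        # frozen patch embedding
    return first_block(tokens)         # cached once, reused at inference
```

Since the anchor is computed once from a constant input, it adds no training cost and no dependency on an external robust model.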
Model-Guided
R-AdaptM
Borrow the robust prior from an adversarially fine-tuned model (M), then run inference with a frozen clean backbone.
Data-Driven
R-Adapt+
Optimize only the anchor on 2k adversarial samples, then transfer the learned anchor across tasks and datasets.
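Because only the anchor is trainable, the optimization reduces to gradient descent on a single vector with every backbone weight frozen. The sketch below illustrates this with an assumed mean-squared-error objective that pulls the blended response toward a robust target; the loss, shapes, and hyperparameters are our own simplifications, not the paper's training recipe:

```python
import numpy as np

def fit_anchor(clean_feats, robust_feats, lam=0.5, lr=0.1, steps=200):
    """Data-driven variant: optimize only the anchor so that
    lam * clean + (1 - lam) * anchor approaches a robust target.
    clean_feats, robust_feats: (n, d) feature batches."""
    anchor = np.zeros(clean_feats.shape[-1])
    for _ in range(steps):
        blended = lam * clean_feats + (1.0 - lam) * anchor
        # Gradient of the MSE loss w.r.t. the anchor alone.
        grad = 2.0 * (1.0 - lam) * (blended - robust_feats).mean(axis=0)
        anchor -= lr * grad
    return anchor
```

With so few trainable parameters, a small batch of adversarial samples (2k in the paper) suffices, and the learned anchor can be transferred across tasks and datasets.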
Results
Strong robustness without collapsing clean performance
R-Adapt+ improves the trade-off across classification, retrieval, and multimodal generation (Captioning and VQA), while remaining much more data-efficient than standard adversarial fine-tuning.
67.0%
Average clean accuracy
R-Adapt+ reaches the best 16-dataset average clean accuracy among robust methods (TeCoA, TGA, FARE) while remaining close to standard CLIP.
57.2%
Average robustness
R-Adapt+ sets the strongest average robustness across 16 datasets at ε = 4/255 under AutoAttack, outperforming FARE by 4.4 points.
83.6 / 79.7
Retrieval R@1
On Flickr30k image-to-text retrieval, R-Adapt+ preserves clean alignment and improves robust R@1 under attack.
85.3 / 79.4
CAP / VQA robustness
The same design extends to larger LVLMs and restores robust captioning and VQA under strong visual attacks.
Average classification trade-off on 16 datasets
R-Adapt reaches stronger clean-robust balance than standard adversarial fine-tuning baselines.
| Method | Training Budget | Clean Avg. | Robust Avg. |
|---|---|---|---|
| CLIP | None | 69.8 | 0.0 |
| TeCoA | 1.28M images | 49.6 | 48.2 |
| TGA | 1.28M images | 56.3 | 50.2 |
| FARE | 1.28M images | 56.2 | 52.8 |
| R-AdaptCLIP (ours) | 0 images (training-free) | 66.2 | 48.4 |
| R-AdaptFARE (ours) | 0 images (model-guided) | 64.5 | 55.6 |
| R-Adapt+ (ours) | 2k images (data-driven) | 67.0 (↑10.8) | 57.2 (↑4.4) |
Against Diverse Attacks
R-Adapt preserves cross-modal retrieval (image-to-text) quality and improves robustness under PGD, C&W, APGD-CE, DLR, FAB, and Square attacks, indicating that the effect is not tied to a single attack family.
Transfer to larger LVLMs
On LLaVA, robust captioning climbs from 21.3% to 85.3% and robust VQA from 22.6% to 79.4%. Appendix results further show that Qwen3-VL also benefits strongly from the same shallow-layer adaptation strategy.
The Gaussian Input Filter (GIF) and the Fixed Robustness Anchor (FRA) are complementary: the strongest robustness appears when both are used together.
Citation
BibTeX
Use the entry below to cite the project page or copy the file directly.
@article{nie2026what,
title = {What Makes VLMs Robust? Towards Reconciling Robustness and Accuracy in Vision-Language Models},
author = {Nie, Sen and Zhang, Jie and Wang, Zhongqi and Wei, Zhaoyang and Shan, Shiguang and Chen, Xilin},
journal = {arXiv preprint arXiv:2603.12799},
year = {2026},
url = {https://arxiv.org/abs/2603.12799}
}