Paper reading: "VLM4VLA: REVISITING VISION-LANGUAGE MODELS IN VISION-LANGUAGE-ACTION MODELS"

Study notes on noteworthy articles

10 Apr 2026 — 8 min read

Vision-Language-Action (VLA) models, which integrate pretrained large Vision-Language Models (VLMs) into their policy backbone, are gaining significant attention for their promising generalization capabilities. This paper revisits a fundamental yet seldom systematically studied question: how VLM choice and competence translate to downstream VLA policy performance. We introduce VLM4VLA, a minimal adaptation pipeline that converts general-purpose VLMs into VLA policies using only a small set of new learnable parameters for fair and efficient comparison. Despite its simplicity, VLM4VLA proves surprisingly competitive with more sophisticated network designs.
Through extensive empirical studies on various downstream tasks across three benchmarks, we find that while VLM initialization offers a consistent benefit over training from scratch, a VLM’s general capabilities are poor predictors of its downstream task performance. This challenges common assumptions, indicating that standard VLM competence is necessary but insufficient for effective embodied control.
We further investigate the impact of specific embodied capabilities by fine-tuning VLMs on seven auxiliary embodied tasks (e.g., embodied QA, visual pointing, depth estimation). Contrary to intuition, improving a VLM’s performance on specific embodied skills does not guarantee better downstream control performance. Finally, modality-level ablations identify the visual module in VLM, rather than the language component, as the primary performance bottleneck. We demonstrate that injecting control-relevant supervision into the vision encoder of the VLM yields consistent gains, even when the encoder remains frozen during downstream fine-tuning. This isolates a persistent domain gap between current VLM pretraining objectives and the requirements of embodied action planning.

Conclusion

In this paper, we investigated the impact of various VLMs—including the effect of auxiliary fine-tuning tasks—on the performance of VLA models. Through over 100 training and evaluation experiments conducted across three distinct environments, we assessed the capabilities of nine models and eight categories of auxiliary data for executing manipulation tasks. Our findings offer practical recommendations and a performance reference for the community. A core insight from our study is the significant gap between the capabilities of current VLMs and the demands of VLA embodied tasks. Specifically, we observe a notable discrepancy between a VLM’s performance on standard VQA benchmarks and its actual effectiveness when deployed in a VLA.
A limitation of our work is the absence of experiments on physical robots. This decision was motivated by challenges related to fairness and reproducibility, as well as difficulties in ensuring test efficiency and fairness across physical hardware setups. We conducted an in-depth experimental analysis of the Real-to-Sim gap and found that the visual discrepancy between VLMs and VLAs likely arises from the inherent heterogeneity between vision–language tasks and low-level action control tasks, rather than merely from a simple image-level sim-to-real gap. This issue is universal across both simulation and real-robot settings. From this perspective, while real-world deployment remains the ultimate goal, we believe that our comprehensive results across multiple, diverse simulation benchmarks provide valuable insights that can inspire and guide future research in this area.

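1. Research background and core question

This paper examines a basic but rarely systematically studied question: how the pretrained vision-language model (VLM) affects the performance of the downstream vision-language-action (VLA) model. Current VLA research focuses mainly on:

Designing more complex network architectures
Introducing additional training paradigms or modalities
Improving action decoding schemes

The effect of the VLM backbone itself on VLA performance, however, has been overlooked. The paper revisits this question and systematically evaluates how different VLMs perform on robot control tasks.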

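2. Method innovation: the VLM4VLA framework

The authors propose VLM4VLA, a minimal adaptation pipeline that turns a general-purpose VLM into a VLA policy while adding fewer than 1% new parameters. Its core design choices are:

Simple architecture: a learnable action query token extracts embodied knowledge from the VLM, and a small MLP decodes it into actions
Unified training objective: a deterministic Huber loss plus a cross-entropy loss, avoiding the stochasticity of diffusion or flow-matching methods
Fair comparison: every VLM uses the same training and evaluation settings so that results are comparable
Input standardization: all input images are resized to 224×224 to remove input differences
Modality isolation: only vision-language inputs are used, excluding other modalities such as proprioception, so that the VLM's intrinsic ability is evaluated directly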
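To make the "Huber loss plus cross-entropy" objective concrete, here is a minimal sketch in plain Python. The notes only name the two loss terms; the split used here (Huber on the continuous pose dimensions, binary cross-entropy on a gripper open/close logit), the function names, `delta`, and `ce_weight` are all assumptions made for illustration, not the authors' implementation.

```python
import math

def huber(pred, target, delta=1.0):
    """Huber loss averaged over dimensions: quadratic near zero, linear beyond delta."""
    total = 0.0
    for p, t in zip(pred, target):
        err = abs(p - t)
        if err <= delta:
            total += 0.5 * err * err
        else:
            total += delta * (err - 0.5 * delta)
    return total / len(pred)

def binary_cross_entropy(logit, label):
    """Numerically stable binary cross-entropy on one logit; label is 0.0 or 1.0."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

def action_loss(pred_pose, target_pose, gripper_logit, gripper_label, ce_weight=1.0):
    """Assumed combination: Huber on continuous pose + weighted BCE on the gripper."""
    return huber(pred_pose, target_pose) + ce_weight * binary_cross_entropy(
        gripper_logit, gripper_label
    )

# Hypothetical 6-DoF pose delta and gripper target, for illustration only.
loss = action_loss(
    pred_pose=[0.1, 0.0, -0.2, 0.0, 0.0, 0.3],
    target_pose=[0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    gripper_logit=2.0,
    gripper_label=1.0,
)
print(f"loss = {loss:.4f}")
```

Both terms are deterministic functions of the prediction, which is the property the notes contrast with diffusion or flow-matching heads that sample noise during training.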

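Despite its simple structure, VLM4VLA performs on par with more complex network designs (such as pi0) on the benchmarks, which gives a solid basis for fair comparison.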

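3. Experimental setup and evaluation benchmarks

The study runs extensive evaluations in three simulated environments:

Calvin ABC-D: train on scenes A, B, and C and test on scene D, measuring generalization across visual domains
SimplerEnv Bridge (WidowX): train on real-world BridgeV2 data and test in simulation
Libero-Long (Libero-10): the most challenging task suite, containing 10 long-horizon tasks

Nine open-source VLMs are evaluated (roughly 1B to 30B parameters), including:

QwenVL series (2B to 30B)
Paligemma series (Paligemma-1 and Paligemma-2)
Kosmos-2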

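4. Key findings

4.1 VLM general capability versus VLA performance

VLM pretraining gives a consistent benefit: every pretrained VLM clearly outperforms a policy trained from scratch
General capability is not a reliable predictor: a VLM's scores on standard VQA benchmarks correlate poorly with its VLA performance
Environment dependence: in Calvin, VQA scores correlate strongly with VLA performance, but in Simpler and Libero the correlation almost disappears
Model size does not determine performance: the smallest model, Kosmos-2 (1.7B), beats larger QwenVL models on Simpler and Libero tasks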
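One way to check a claim like "VQA scores are a poor predictor of VLA performance" is a rank correlation between the two score lists across models. The sketch below uses Spearman's rank correlation in plain Python; the choice of Spearman is an assumption of these notes (the statistic the paper uses is not stated here), and the numbers in the usage lines are made-up placeholders, not results from the paper.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length lists (assumes non-zero variance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Placeholder scores for five hypothetical models (NOT taken from the paper).
vqa_score = [1.0, 2.0, 3.0, 4.0, 5.0]
vla_success_env_a = [10.0, 20.0, 30.0, 40.0, 50.0]  # same ordering as VQA
vla_success_env_b = [30.0, 10.0, 50.0, 20.0, 40.0]  # shuffled ordering
print(spearman(vqa_score, vla_success_env_a))
print(spearman(vqa_score, vla_success_env_b))
```

A value near 1 means the VQA ranking carries over to the control benchmark; a value near 0 is the "almost no correlation" pattern the notes report for Simpler and Libero.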

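4.2 Effect of fine-tuning on auxiliary embodied tasks

The authors fine-tune the VLMs on seven auxiliary embodied tasks (such as visual pointing, spatial understanding, and trajectory prediction) and find:

Performance generally drops: every fine-tuned VLM performs worse than the original baseline
Task-specific skill does not transfer: improving a VLM on a specific embodied skill does not improve downstream control
Mixed data works best among the fine-tuned variants: training on a mix of general VQA and embodied tasks (VQA-Mix) gives the best result, suggesting that VLA needs broad capabilities
Generation tasks help little: dense prediction tasks such as depth estimation and segmentation-map generation bring no significant VLA gain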

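4.3 Modality-level ablation

The vision encoder is critical: freezing it causes a large performance drop (42% on Paligemma-1)
The language module matters less: freezing the word embeddings has only a minor effect
Parameter count is not performance: with its vision encoder frozen, Qwen2.5VL-7B (7.6B trainable parameters) performs even worse than a fully fine-tuned Qwen2.5VL-3B (3.8B parameters)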
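The bookkeeping behind such a freezing ablation can be sketched as follows: mark a module as frozen, then report how much of the model is still trainable. This is a plain-Python illustration only; the `Module` class, the module names, and the parameter counts are hypothetical and do not describe any real model or the authors' code. In a deep learning framework the same step would normally be done by disabling gradients on the module's parameters.

```python
from dataclasses import dataclass

@dataclass
class Module:
    name: str
    num_params: int
    trainable: bool = True

def freeze(modules, names):
    """Return a new list in which the named modules are marked non-trainable."""
    return [
        Module(m.name, m.num_params, m.trainable and m.name not in names)
        for m in modules
    ]

def trainable_fraction(modules):
    """Share of all parameters that still receive gradient updates."""
    total = sum(m.num_params for m in modules)
    trainable = sum(m.num_params for m in modules if m.trainable)
    return trainable / total

# Hypothetical parameter counts in arbitrary units, for illustration only.
model = [
    Module("vision_encoder", 400),
    Module("word_embedding", 500),
    Module("language_model", 3000),
    Module("action_head", 10),
]
for frozen_name in ("vision_encoder", "word_embedding"):
    ablated = freeze(model, {frozen_name})
    print(frozen_name, round(trainable_fraction(ablated), 3))
```

The point of the ablation in the notes is that the trainable fraction alone says little: freezing a relatively small vision encoder hurts far more than freezing the word embeddings.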

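4.4 The visual representation gap between VLM and VLA

A closer analysis points to two factors behind the VLM-to-VLA performance gap:

Real images versus simulated renderings: VLM pretraining data lacks tabletop simulation scenes
Mismatched task objectives: the visual features a VLM learns are optimized for language output, not low-level action control

The key experiment: even when training on real-world images (BridgeV2), freezing the vision encoder still causes a significant performance drop. The gap therefore comes not only from the sim-to-real difference but, more fundamentally, from the mismatch between vision-language understanding and action control.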

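5. Theoretical explanation and insights

The authors offer a notable view: VLM and VLA representation learning follow a similar trajectory early in training, then diverge into different regions after some point, producing the performance gap observed today. This explains:

Why VLM pretraining is indispensable for VLA generalization (the initial learning directions agree)
Why simple fine-tuning cannot close the gap (the later learning objectives differ)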

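6. Contributions and implications

Main contributions

A fair evaluation framework: VLM4VLA provides a fair benchmark for comparing VLM backbones
An important finding: there is no reliable correlation between a VLM's general capability and its VLA performance
The core bottleneck identified: the vision encoder, not the language module, is the main limit on VLA performance
Practical guidance: performance references and methodological lessons for VLA research

Implications for the field

VLM pretraining needs rethinking: current VLM pretraining objectives do not match the needs of embodied control
Visual representations need specialization: training methods should target control-relevant visual features
Cross-domain knowledge transfer: general VQA ability is not enough to support embodied tasks, so new pretraining paradigms are needed
Shift in research direction: VLA research should pay more attention to adapting the vision encoder rather than only scaling up the model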

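7. Limitations and future directions

Limitations:

Evaluation is in simulation only, with no validation on physical robots
The effects of multi-view input and temporal modeling are not explored

Future directions:

Develop VLM pretraining methods designed for embodied control
Explore ways to inject control-relevant supervision directly into the VLM's vision encoder
Study how to balance general vision-language ability against control-specific ability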

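Summary

Through systematic experiments, the paper exposes a key gap between VLMs and VLAs, above all a mismatch in visual representations. It shows that current VLM pretraining gives VLA a necessary foundation, yet the resulting representations differ fundamentally from what control tasks require. This challenges the assumption that "a stronger general-purpose VLM necessarily yields a better VLA" and points to a new direction: visual representation learning designed specifically for embodied control. The work has both theoretical and practical value for the VLA field.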