VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks (Preview Version)

Shiduo Zhang1, Zhe Xu1, Peiju Liu1*, Xiaopeng Yu1*, Yuan Li1, Qinghui Gao1,
Zhaoye Fei1, Zhangyue Yin1, Zuxuan Wu1, Yu-Gang Jiang1, Xipeng Qiu1
1School of Computer Science and Technology, Fudan University.
*Equal Contribution.

Abstract

General-purpose embodied agents are designed to understand users' natural instructions or intentions and act precisely to complete universal tasks. Recently, methods based on foundation models, especially Vision-Language-Action models (VLAs), have shown substantial potential for solving language-conditioned manipulation (LCM) tasks. However, existing benchmarks do not adequately meet the needs of VLAs and related algorithms. To better define such general-purpose tasks in the context of LLMs and advance research on VLAs, we present VLABench, an open-source benchmark for evaluating universal LCM task learning. VLABench provides 100 carefully designed task categories, with strong randomization within each category and a total of 2000+ objects. VLABench stands out from previous benchmarks in four key aspects: 1) tasks requiring world knowledge and common-sense transfer, 2) natural language instructions with implicit human intentions rather than templates, 3) long-horizon tasks demanding multi-step reasoning, and 4) evaluation of both action policies and language model capabilities. The benchmark assesses multiple competencies, including understanding of mesh & texture, spatial relationships, semantic instructions, physical laws, knowledge transfer, and reasoning. To support downstream finetuning, we provide high-quality training data collected via an automated framework incorporating heuristic skills and prior information. The experimental results indicate that both current state-of-the-art pretrained VLAs and workflows based on VLMs face challenges in our tasks.


VLABench

New Definition for LCM Tasks Suitable for Foundation Models

——What Abilities Should a True VLA Have?

From the perspective of "intelligence", the required capabilities are divided into six dimensions:

  • Mesh & Texture Understanding. It should be able to recognize irregular and uniquely shaped meshes as well as diversified textures with rich semantic information. This involves basic open-vocabulary object recognition, OCR capabilities, and so on.
  • Spatial Understanding. It should possess basic spatial perception abilities, enabling accurate judgment of the relative positions of objects in an image, spatial constraints between different objects, and even direct distance estimation.
  • Common Sense & World Knowledge Transfer. It should acquire world knowledge and common sense from large-scale pretraining and apply such priors to corresponding tasks. For example, associating visual information with world knowledge to align it with user requirements.
  • Semantic Instruction Understanding. It should retain strong language comprehension abilities, enabling it to extract user needs from natural interactions or understand the implicit goals of a task, and then execute dynamic action sequences, rather than following template instructions like "pick A and then place it on B".
  • Physical Laws Understanding. It should understand the principles of the physical world, such as friction, gravity, acceleration, and even fundamental physical concepts like the lever principle.
  • Long-Horizon Reasoning. Reasoning here primarily refers to the ability to plan for long-horizon, multi-step tasks, where logical correlations between multiple action steps are required. Broader reasoning encompasses several of the aforementioned abilities, such as semantic inference, the incorporation of world knowledge, and alignment between vision and task objectives. However, in this context, we focus solely on the former.

100 Tasks in VLABench

——What Tasks Should a True VLA Do?

VLABench divides tasks into two categories: Primitive and Composite.

  • Primitive Tasks: 60 tasks that require abilities from only one or two dimensions and few skill combinations.
  • Composite Tasks: 40 tasks that require multi-step reasoning and long-horizon planning, involving more skills and abilities.
The 100 task categories are just a starting point: new tasks can be created by combining the above-mentioned abilities and skills, as illustrated by the hypothetical sketch below.
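
For illustration, a new task could be specified by pairing a natural-language instruction with the ability dimensions it probes and the heuristic skill sequence used to collect demonstrations. The sketch below is a hypothetical specification: the class and field names are assumptions for illustration and do not reflect the actual VLABench API.

```python
# Hypothetical sketch of composing a new task from ability dimensions and
# heuristic skills. Names (Skill, Ability, TaskSpec) are illustrative
# assumptions, not the real VLABench API.
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import List


class Skill(Enum):
    PICK = auto()
    PLACE = auto()
    OPEN = auto()
    POUR = auto()


class Ability(Enum):
    MESH_TEXTURE = auto()
    SPATIAL = auto()
    COMMON_SENSE = auto()
    SEMANTIC = auto()
    PHYSICAL_LAW = auto()
    LONG_HORIZON = auto()


@dataclass
class TaskSpec:
    name: str
    instruction: str                    # natural-language instruction, possibly with implicit intent
    abilities: List[Ability]            # capability dimensions the task probes
    skill_sequence: List[Skill]         # heuristic skills used to auto-collect demonstrations
    asset_pool: List[str] = field(default_factory=list)  # objects sampled for scene randomization


# Example: a composite task mixing common-sense transfer with long-horizon planning.
brew_herbal_tea = TaskSpec(
    name="brew_herbal_tea",
    instruction="I feel a bit of internal heat today, could you prepare something soothing?",
    abilities=[Ability.COMMON_SENSE, Ability.SEMANTIC, Ability.LONG_HORIZON],
    skill_sequence=[Skill.PICK, Skill.PLACE, Skill.OPEN, Skill.POUR],
    asset_pool=["teapot", "chrysanthemum", "mug", "distractor_snacks"],
)
```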

Examples of Primitive Tasks

  • Mesh & Texture: "Pick the banana into the plate."
  • Spatial: "Pick the pear outside into the plate."
  • Common Sense & World Knowledge: "Pick the fruit with Heat-clearing character into the plate."
  • Semantic: "One apple one day, keep doctor away! I want one please."

Examples of Composite Tasks

Evaluation

The evaluation in VLABench includes both interactive and non-interactive approaches.

  • Interactive: mainly for evaluating VLA policies and workflows built on VLMs/LLMs. The policy steps through the environment on specific tasks, and we compute a progress score and the success rate.
  • Non-interactive: mainly for evaluating VLMs. The VLM generates an action sequence in our skill-sequence format from the instruction and segmented scene images; the generated sequence is scored by matching its DAG against the ground-truth DAG. The framework is illustrated in the figure below, followed by a minimal scoring sketch.

[Figure: non-interactive evaluation framework]
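
To make the DAG-based matching concrete, here is a minimal sketch in Python. The node format, dependency check, and normalization are illustrative assumptions rather than the exact metric used in VLABench; see the paper for the precise scoring rules.

```python
# Minimal sketch of DAG-based matching between a predicted skill sequence and
# the ground-truth task DAG. Node format, weights, and scoring rules are
# illustrative assumptions, not the exact VLABench metric.
from typing import Dict, List, Tuple

# Each node is a (skill, target_entity) pair, e.g. ("pick", "banana").
Node = Tuple[str, str]


def longest_valid_prefix(pred: List[Node],
                         nodes: List[Node],
                         edges: Dict[Node, List[Node]]) -> int:
    """Count how many predicted steps are correct and respect the DAG dependencies."""
    parents: Dict[Node, List[Node]] = {n: [] for n in nodes}
    for src, dsts in edges.items():
        for dst in dsts:
            parents[dst].append(src)

    done = set()
    matched = 0
    for step in pred:
        # A step counts only if it exists in the ground-truth DAG
        # and all of its parent steps have already been completed.
        if step in parents and all(p in done for p in parents[step]):
            done.add(step)
            matched += 1
        else:
            break  # stop at the first invalid step (wrong content or wrong order)
    return matched


def sequence_score(pred: List[Node],
                   nodes: List[Node],
                   edges: Dict[Node, List[Node]]) -> float:
    """Normalized progress score in [0, 1]."""
    return longest_valid_prefix(pred, nodes, edges) / max(len(nodes), 1)


if __name__ == "__main__":
    # Ground truth: pick the banana, then place it on the plate (ordered dependency).
    gt_nodes = [("pick", "banana"), ("place", "plate")]
    gt_edges = {("pick", "banana"): [("place", "plate")]}

    prediction = [("pick", "banana"), ("place", "plate")]
    print(sequence_score(prediction, gt_nodes, gt_edges))  # 1.0

    wrong_order = [("place", "plate"), ("pick", "banana")]
    print(sequence_score(wrong_order, gt_nodes, gt_edges))  # 0.0
```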

Experimental Results

——Latest Leaderboard Coming Soon

VLABench provides standard evaluations for three types of methods: Vision-Language-Action models, workflows utilizing VLMs/LLMs, and Vision-Language Models. In this preview version, we present early experimental results to facilitate further analysis. The complete leaderboard will be released soon.

Leaderboard of policies (mainly VLAs)

The experimental design related to VLA revolves around the following questions:

  • Q1: Do pre-trained VLAs exhibit stronger general abilities with unseen categories of objects?
  • Q2: Can pre-trained VLAs transfer their general knowledge and behavioral abilities to similar but unseen tasks?
  • Q3: Can pre-trained VLAs understand natural user interactions and implicit goal requirements?
  • Q4: Do pre-trained VLAs have the potential to transfer their world knowledge to related tasks?
  • Q5: Can existing VLA architectures accurately support the completion of long-horizon tasks?


The current VLAs have not demonstrated the expected capabilities, particularly in terms of the intelligence derived from pretraining, as they struggle with tasks involving generalization, skill transfer, and long-horizon planning. Drawing an analogy to the development trajectory of large language models, the present state of VLAs is still far from reaching a level comparable to GPT-2.


Leaderboard of workflows


In practice, these so-called zero-shot manipulation workflows are designed for specific types of tasks; when the task scenarios or capability requirements exceed their original design, these methods become less effective.

  • Bottleneck Effect in Submodules. The workflow's hierarchical and modular design results in a bottleneck effect. This is particularly evident in the actuator module (e.g., errors in predicting grasp poses, or behaviors that violate spatial constraints) and in the perception module (e.g., failure to recognize complex objects causes the perception stage to fail).
  • Errors in Module Connections. The requirement for specific input-output formats between modules leads to information loss or mismatches, especially when the LLM or VLM behaves unexpectedly (e.g., the language model fails to output the expected rotation matrix, or generates incorrect code that causes execution to fail).

Leaderboard of VLMs

Setting aside errors caused by action execution, we evaluated the capabilities of current leading language model series in embodied scenarios, using action-sequence matching to compute scores. We found that these state-of-the-art multimodal models do not perform as well as expected.

  • Overall Failing Grades. Compared to VLAs, these models retain more high-level intelligence, and since they are evaluated primarily on perception and decision making, their scores are relatively higher. Qwen-VL-7B and GPT-4v demonstrated relatively decent performance, but none of the VLMs achieved a passing score.
  • Poor Planning Ability. All models performed poorly in task reasoning and planning; only GPT-4o achieved a relatively balanced score in reasoning.

Please refer to our paper for more detailed analysis.


BibTeX


      @misc{zhang2024vlabenchlargescalebenchmarklanguageconditioned,
        title={VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks},
        author={Shiduo Zhang and Zhe Xu and Peiju Liu and Xiaopeng Yu and Yuan Li and Qinghui Gao and Zhaoye Fei and Zhangyue Yin and Zuxuan Wu and Yu-Gang Jiang and Xipeng Qiu},
        year={2024},
        eprint={2412.18194},
        archivePrefix={arXiv},
        primaryClass={cs.RO},
        url={https://arxiv.org/abs/2412.18194},
      }