VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks (Preview Version)

Shiduo Zhang1, Zhe Xu1, Peiju Liu1*, Xiaopeng Yu1*, Yuan Li1, Qinghui Gao1,
Zhaoye Fei1, Zhangyue Yin1, Zuxuan Wu1, Yu-Gang Jiang1, Xipeng Qiu1
1School of Computer Science and Technology, Fudan University.
*Equal Contribution.

Abstract

General-purpose embodied agents are designed to understand users' natural instructions or intentions and act precisely to complete universal tasks. Recently, methods based on foundation models, especially Vision-Language-Action models (VLAs), have shown substantial potential for solving language-conditioned manipulation (LCM) tasks. However, existing benchmarks do not adequately meet the needs of VLAs and related algorithms. To better define such general-purpose tasks in the context of LLMs and advance research on VLAs, we present VLABench, an open-source benchmark for evaluating universal LCM task learning. VLABench provides 100 carefully designed task categories, with strong randomization within each category and a total of 2000+ objects. VLABench stands out from previous benchmarks in four key aspects: 1) tasks requiring world knowledge and common-sense transfer, 2) natural language instructions with implicit human intentions rather than templates, 3) long-horizon tasks demanding multi-step reasoning, and 4) evaluation of both action policies and language model capabilities. The benchmark assesses multiple competencies, including understanding of mesh & texture, spatial relationships, semantic instructions, physical laws, knowledge transfer, and reasoning. To support downstream finetuning, we provide high-quality training data collected via an automated framework incorporating heuristic skills and prior information. The experimental results indicate that both current state-of-the-art pretrained VLAs and workflows based on VLMs face challenges in our tasks.


VLABench

New Definition for LCM Tasks Suitable for Foundation Models

——What Abilities Should a True VLA Have?

From the perspective of "intelligence", the required capabilities are divided into six dimensions:

  • Mesh & Texture Understanding. It should be able to recognize irregular and uniquely shaped meshes as well as diversified textures with rich semantic information. This involves basic open-vocabulary object recognition, OCR capabilities, and so on.
  • Spatial Understanding. It should possess basic spatial perception abilities, enabling accurate judgment of the relative positions of objects in an image, spatial constraints between different objects, and even direct distance estimation.
  • Common Sense & World Knowledge Transfer. It should acquire world knowledge and common sense from large-scale pretraining and apply such priors to corresponding tasks. For example, associating visual information with world knowledge to align it with user requirements.
  • Semantic Instruction Understanding. It should retain strong language comprehension abilities, enabling it to extract user needs from natural interactions or understand the implicit goals of a task, and then execute dynamic action sequences, rather than following template instructions like "pick A and then place it on B".
  • Physical Laws Understanding. It should understand the principles of the physical world, such as friction, gravity, acceleration, and even fundamental physical concepts like the lever principle.
  • Long-Horizon Reasoning. Reasoning here primarily refers to the ability to plan for long-horizon, multi-step tasks, where logical correlations between multiple action steps are required. Broader reasoning encompasses several of the aforementioned abilities, such as semantic inference, the incorporation of world knowledge, and alignment between vision and task objectives. However, in this context, we focus solely on the former.

100 Tasks in VLABench

——What Tasks Should a True VLA Do?

VLABench divides tasks into two categories: Primitive and Composite.

  • Primitive Tasks: 60 tasks that require abilities from only one or two dimensions and few skill combinations.
  • Composite Tasks: 40 tasks that require multi-step reasoning and long-horizon planning, involving more skills and abilities.
The 100 task categories are just a starting point: new tasks can be created by combining the above-mentioned abilities and skills, as illustrated by the hypothetical sketch below.
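
For illustration, a new task could be specified by pairing a natural-language instruction with the ability dimensions it probes and the heuristic skill sequence used to collect demonstrations. The sketch below is a hypothetical specification: the class and field names are assumptions for illustration and do not reflect the actual VLABench API.

```python
# Hypothetical sketch of composing a new task from ability dimensions and
# heuristic skills. Names (Skill, Ability, TaskSpec) are illustrative
# assumptions, not the real VLABench API.
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import List


class Skill(Enum):
    PICK = auto()
    PLACE = auto()
    OPEN = auto()
    POUR = auto()


class Ability(Enum):
    MESH_TEXTURE = auto()
    SPATIAL = auto()
    COMMON_SENSE = auto()
    SEMANTIC = auto()
    PHYSICAL_LAW = auto()
    LONG_HORIZON = auto()


@dataclass
class TaskSpec:
    name: str
    instruction: str                    # natural-language instruction, possibly with implicit intent
    abilities: List[Ability]            # capability dimensions the task probes
    skill_sequence: List[Skill]         # heuristic skills used to auto-collect demonstrations
    asset_pool: List[str] = field(default_factory=list)  # objects sampled for scene randomization


# Example: a composite task mixing common-sense transfer with long-horizon planning.
brew_herbal_tea = TaskSpec(
    name="brew_herbal_tea",
    instruction="I feel a bit of internal heat today, could you prepare something soothing?",
    abilities=[Ability.COMMON_SENSE, Ability.SEMANTIC, Ability.LONG_HORIZON],
    skill_sequence=[Skill.PICK, Skill.PLACE, Skill.OPEN, Skill.POUR],
    asset_pool=["teapot", "chrysanthemum", "mug", "distractor_snacks"],
)
```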

Examples of Primitive Tasks

  • Mesh & Texture: "Pick the banana into the plate."
  • Spatial: "Pick the pear outside into the plate."
  • Common Sense & World Knowledge: "Pick the fruit with Heat-clearing character into the plate."
  • Semantic: "One apple one day, keep doctor away! I want one please."

Examples of Composite Tasks

Evaluation

The evaluation in VLABench includes both interactive and non-interactive approaches.

  • Interactive: mainly for evaluating VLA policies and workflows built on VLMs/LLMs. The policy steps through the environment on specific tasks, and we compute a progress score and the success rate.
  • Non-interactive: mainly for evaluating VLMs. The VLM generates an action sequence in our skill-sequence format from the instruction and segmented scene images; the generated sequence is scored by matching its DAG against the ground-truth DAG. The framework is illustrated in the figure below, followed by a minimal scoring sketch.

[Figure: non-interactive evaluation framework]
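
To make the DAG-based matching concrete, here is a minimal sketch in Python. The node format, dependency check, and normalization are illustrative assumptions rather than the exact metric used in VLABench; see the paper for the precise scoring rules.

```python
# Minimal sketch of DAG-based matching between a predicted skill sequence and
# the ground-truth task DAG. Node format, weights, and scoring rules are
# illustrative assumptions, not the exact VLABench metric.
from typing import Dict, List, Tuple

# Each node is a (skill, target_entity) pair, e.g. ("pick", "banana").
Node = Tuple[str, str]


def longest_valid_prefix(pred: List[Node],
                         nodes: List[Node],
                         edges: Dict[Node, List[Node]]) -> int:
    """Count how many predicted steps are correct and respect the DAG dependencies."""
    parents: Dict[Node, List[Node]] = {n: [] for n in nodes}
    for src, dsts in edges.items():
        for dst in dsts:
            parents[dst].append(src)

    done = set()
    matched = 0
    for step in pred:
        # A step counts only if it exists in the ground-truth DAG
        # and all of its parent steps have already been completed.
        if step in parents and all(p in done for p in parents[step]):
            done.add(step)
            matched += 1
        else:
            break  # stop at the first invalid step (wrong content or wrong order)
    return matched


def sequence_score(pred: List[Node],
                   nodes: List[Node],
                   edges: Dict[Node, List[Node]]) -> float:
    """Normalized progress score in [0, 1]."""
    return longest_valid_prefix(pred, nodes, edges) / max(len(nodes), 1)


if __name__ == "__main__":
    # Ground truth: pick the banana, then place it on the plate (ordered dependency).
    gt_nodes = [("pick", "banana"), ("place", "plate")]
    gt_edges = {("pick", "banana"): [("place", "plate")]}

    prediction = [("pick", "banana"), ("place", "plate")]
    print(sequence_score(prediction, gt_nodes, gt_edges))  # 1.0

    wrong_order = [("place", "plate"), ("pick", "banana")]
    print(sequence_score(wrong_order, gt_nodes, gt_edges))  # 0.0
```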

Experimental Results

——Latest Leaderboard Coming Soon

VLABench provides standard evaluations for three types of methods: Vision-Language-Action models, workflows utilizing VLMs/LLMs, and Vision-Language Models. In this preview version, we present early experimental results to facilitate further analysis. The complete leaderboard will be released soon.

Leaderboard of policies (mainly VLAs)

The experimental design related to VLA revolves around the following questions:

  • Q1: Do pre-trained VLAs exhibit stronger general abilities with unseen categories of objects?
  • Q2: Can pre-trained VLAs transfer their general knowledge and behavioral abilities to similar but unseen tasks?
  • Q3: Can pre-trained VLAs understand natural user interactions and implicit goal requirements?
  • Q4: Do pre-trained VLAs have the potential to transfer their world knowledge to related tasks?
  • Q5: Can existing VLA architectures accurately support the completion of long-horizon tasks?


The current VLAs have not demonstrated the expected capabilities, particularly in terms of the intelligence derived from pretraining, as they struggle with tasks involving generalization, skill transfer, and long-horizon planning. Drawing an analogy to the development trajectory of large language models, the present state of VLAs is still far from reaching a level comparable to GPT-2.


Leaderboard of workflows


In practice, these so-called zero-shot manipulation workflows are designed for specific types of tasks; when the task scenarios or capability requirements exceed their original design, these methods become less effective.

  • Bottleneck Effect in Submodules. The workflow's hierarchical and modular design results in a bottleneck effect. This is particularly evident in the actuator module (e.g., errors in predicting grasp poses, or behaviors that violate spatial constraints) and in the perception module (e.g., failure to recognize complex objects causes the perception stage to fail).
  • Errors in Module Connections. The requirement for specific input-output formats between modules leads to information loss or mismatches, especially when the LLM or VLM behaves unexpectedly (e.g., the language model fails to output the expected rotation matrix, or generates incorrect code that causes execution to fail).

Leaderboard of VLMs

Setting aside errors caused by action execution, we evaluated the capabilities of current leading language model series in embodied scenarios, using action-sequence matching to compute scores. We found that these state-of-the-art multimodal models do not perform as well as expected.

  • Overall Failing Grades. Compared to VLAs, these models retain more high-level intelligence, and since they are evaluated primarily on perception and decision making, their scores are relatively higher. Qwen-VL-7B and GPT-4v demonstrated relatively decent performance, but none of the VLMs achieved a passing score.
  • Poor Planning Ability. All models performed poorly in task reasoning and planning; only GPT-4o achieved a relatively balanced score in reasoning.

Please refer to our paper for more detailed analysis.


BibTeX


      @misc{zhang2024vlabenchlargescalebenchmarklanguageconditioned,
        title={VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks},
        author={Shiduo Zhang and Zhe Xu and Peiju Liu and Xiaopeng Yu and Yuan Li and Qinghui Gao and Zhaoye Fei and Zhangyue Yin and Zuxuan Wu and Yu-Gang Jiang and Xipeng Qiu},
        year={2024},
        eprint={2412.18194},
        archivePrefix={arXiv},
        primaryClass={cs.RO},
        url={https://arxiv.org/abs/2412.18194},
      }