We present CALVIN (Composing Actions from Language and Vision), an open-source simulated benchmark for learning long-horizon language-conditioned tasks. Our aim is to make it possible to develop agents that can solve many robotic manipulation tasks over a long horizon, from onboard sensors, and specified only via human language. CALVIN tasks are more complex in terms of sequence length, action space, and language than those in existing vision-and-language task datasets, and the benchmark supports flexible specification of sensor suites.


Leaderboard


Train D → Test D

Columns 1 to 5 and Avg. Len. belong to LH-MTLC: No. Instructions in a Row (1000 chains).

| Method | Input | MTLC (32 tasks) | 1 | 2 | 3 | 4 | 5 | Avg. Len. |
|---|---|---|---|---|---|---|---|---|
| Baseline | Static RGB | 53.9% | 48.9% | 12.9% | 2.6% | 0.5% | 0.08% | 0.64 |
| Baseline | Static RGB + Gripper RGB | 51.8% | 34.4% | 5.8% | 1.1% | 0.2% | 0.08% | 0.41 |
| HULC | Static RGB + Gripper RGB | - | 82.5% | 66.8% | 52% | 39.3% | 27.5% | 2.68 |
| Baseline | Static RGB + Tactile | 54.2% | 28.5% | 3.2% | 0% | 0% | 0% | 0.31 |
| Baseline | Static RGB-D + Gripper RGB-D | 46.1% | 28.2% | 4.6% | 0.3% | 0.08% | 0% | 0.33 |

Train A, B, C, D → Test D

Columns 1 to 5 and Avg. Len. belong to LH-MTLC: No. Instructions in a Row (1000 chains).

| Method | Input | MTLC (32 tasks) | 1 | 2 | 3 | 4 | 5 | Avg. Len. |
|---|---|---|---|---|---|---|---|---|
| Baseline | Static RGB | 35.6% | 28.2% | 2.5% | 0.3% | 0% | 0% | 0.28 |
| Baseline | Static RGB + Gripper RGB | 49.7% | 37.3% | 2.7% | 0.17% | 0% | 0% | 0.40 |
| Baseline | Static RGB + Tactile | 47.9% | 22.7% | 2.3% | 0.3% | 0% | 0% | 0.25 |
| HULC | Static RGB + Gripper RGB | - | 88.9% | 73.3% | 58.7% | 47.5% | 38.3% | 3.06 |
| Baseline | Static RGB-D + Gripper RGB-D | 40.7% | 14.4% | 1.8% | 0.08% | 0.08% | 0% | 0.16 |

Train A, B, C → Test D

Columns 1 to 5 and Avg. Len. belong to LH-MTLC: No. Instructions in a Row (1000 chains).

| Method | Input | MTLC (32 tasks) | 1 | 2 | 3 | 4 | 5 | Avg. Len. |
|---|---|---|---|---|---|---|---|---|
| Baseline | Static RGB | 38.6% | 20.2% | 0.2% | 0% | 0% | 0% | 0.20 |
| Baseline | Static RGB + Gripper RGB | 38% | 30.4% | 1.3% | 0.17% | 0% | 0% | 0.31 |
| Baseline | Static RGB + Tactile | 43.7% | 17.3% | 0.8% | 0.08% | 0% | 0% | 0.26 |
| HULC | Static RGB + Gripper RGB | - | 41.8% | 16.5% | 5.7% | 1.9% | 1.1% | 0.67 |
| Baseline | Static RGB-D + Gripper RGB-D | 30.8% | 21.1% | 1.3% | 0% | 0% | 0% | 0.22 |
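
The LH-MTLC numbers in the leaderboard can be read as follows: column k is the fraction of the 1000 chains in which at least k instructions were completed in a row, and Avg. Len. is the mean number of instructions completed per chain. Under that reading, Avg. Len. equals the sum of the five per-step rates, which roughly matches the reported rows up to rounding (for example, 82.5% + 66.8% + 52% + 39.3% + 27.5% gives 2.681 for the HULC row listed with 2.68). The snippet below is a minimal sketch of that arithmetic, not the benchmark's evaluation code; the function name `chain_metrics` and its input format are assumptions made for illustration.

```python
def chain_metrics(completed_counts, chain_len=5):
    """Summarise long-horizon rollouts (hypothetical helper, not part of CALVIN).

    completed_counts[i] is the number of instructions solved in a row
    in chain i before the first failure (0..chain_len).
    Returns (rates, avg_len) where rates[k-1] is the fraction of chains
    that completed at least k instructions in a row.
    """
    n = len(completed_counts)
    if n == 0:
        raise ValueError("need at least one chain")
    rates = [
        sum(1 for c in completed_counts if c >= k) / n
        for k in range(1, chain_len + 1)
    ]
    avg_len = sum(completed_counts) / n
    return rates, avg_len


# Toy example with five chains instead of 1000.
rates, avg_len = chain_metrics([0, 1, 1, 3, 5])

# Cross-check against a reported leaderboard row (HULC, Train D -> Test D):
# summing the per-step rates recovers the average length up to rounding.
hulc_rates = [0.825, 0.668, 0.52, 0.393, 0.275]
hulc_avg_len = sum(hulc_rates)
```

Because a chain stops at its first failure, the per-step rates are non-increasing from left to right, which is the pattern visible in every row of the tables.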

Code

The CALVIN environments, baselines, and benchmarks are available in our GitHub repository for academic use and are released under the MIT license.

Publications

CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks
Oier Mees, Lukas Hermann, Erick Rosete, Wolfram Burgard
IEEE Robotics and Automation Letters (RAL), 2022

People