We present CALVIN (Composing Actions from Language and Vision), an open-source simulated benchmark for learning long-horizon language-conditioned tasks. Our aim is to make it possible to develop agents that can solve many robotic manipulation tasks over a long horizon, from onboard sensors, specified only via human language. CALVIN tasks are more complex in terms of sequence length, action space, and language than those in existing vision-and-language datasets, and the benchmark supports flexible specification of sensor suites.


Leaderboard

CALVIN comprises four simulated environments (A, B, C, D) that share the same tasks but differ in textures and the placement of fixed elements. Results are reported for three splits: training and testing on D, training on all four environments and testing on D, and zero-shot transfer from environments A, B, and C to the unseen environment D. MTLC (multi-task language control) reports the success rate on individual tasks; LH-MTLC (long-horizon MTLC) evaluates 1000 chains of five consecutive language instructions, with columns 1-5 giving the fraction of chains in which at least that many instructions were completed in a row, and Avg. Len. the average number of instructions completed per chain.

Train D → Test D

| Method | Input | MTLC (34 tasks) | 1 | 2 | 3 | 4 | 5 | Avg. Len. |
|---|---|---|---|---|---|---|---|---|
| Baseline + delta actions | Static RGB | - | 76.4% | 48.8% | 30.1% | 18.1% | 9.3% | 1.82 |
| Baseline | Static RGB | 53.9% | 48.9% | 12.9% | 2.6% | 0.5% | 0.08% | 0.64 |
| Baseline | Static RGB + Gripper RGB | 51.8% | 34.4% | 5.8% | 1.1% | 0.2% | 0.08% | 0.41 |
| HULC++ | Static RGB + Gripper RGB | - | 93% | 79% | 64% | 52% | 40% | 3.3 |
| LCD | Static RGB + Gripper RGB | 74.01% | 88.7% | 69.9% | 54.5% | 42.7% | 32.2% | 2.88 |
| HULC | Static RGB + Gripper RGB | - | 82.7% | 64.9% | 50.4% | 38.5% | 28.3% | 2.64 |
| SPIL | Static RGB + Gripper RGB | - | 84.6% | 65.1% | 50.8% | 38.0% | 28.6% | 2.67 |
| MDT | Static RGB + Gripper RGB | - | 93.7% | 84.5% | 74.1% | 64.4% | 55.6% | 3.72 |
| RoboUniView | Static RGB-D + Gripper RGB-D + Cam params | - | 96.2% | 88.8% | 77.6% | 66.6% | 56.3% | 3.85 |
| Baseline | Static RGB + Tactile | 54.2% | 28.5% | 3.2% | 0% | 0% | 0% | 0.31 |
| Baseline | Static RGB-D + Gripper RGB-D | 46.1% | 28.2% | 4.6% | 0.3% | 0.08% | 0% | 0.33 |
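
Note that Avg. Len. can be read off directly from the per-step success rates: the expected number of consecutively completed instructions is the sum of the rates in columns 1-5 (e.g., for HULC above, 0.827 + 0.649 + 0.504 + 0.385 + 0.283 ≈ 2.64). A minimal sketch of this computation is shown below; the `chain_results` input is a hypothetical per-chain tally, not part of the released code.

```python
# Minimal sketch (not from the CALVIN codebase): computing the LH-MTLC
# leaderboard numbers from per-chain results. `chain_results` holds, for
# each evaluation chain, how many of its five instructions were completed
# consecutively (an integer from 0 to 5).

def lh_mtlc_metrics(chain_results, chain_len=5):
    n = len(chain_results)
    # Column k = fraction of chains that completed at least k instructions.
    rates = [sum(r >= k for r in chain_results) / n
             for k in range(1, chain_len + 1)]
    # Avg. Len. = expected number of completed instructions per chain,
    # which equals the sum of the per-step success rates.
    avg_len = sum(rates)
    return rates, avg_len

# Toy example with 5 chains (the leaderboard uses 1000):
rates, avg_len = lh_mtlc_metrics([5, 3, 0, 2, 1])
print(rates)    # [0.8, 0.6, 0.4, 0.2, 0.2]
print(avg_len)  # 2.2 (= mean of [5, 3, 0, 2, 1])
```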

Train A, B, C, D → Test D

| Method | Input | MTLC (34 tasks) | 1 | 2 | 3 | 4 | 5 | Avg. Len. |
|---|---|---|---|---|---|---|---|---|
| Baseline | Static RGB | 35.6% | 28.2% | 2.5% | 0.3% | 0% | 0% | 0.28 |
| Baseline | Static RGB + Gripper RGB | 49.7% | 37.3% | 2.7% | 0.17% | 0% | 0% | 0.40 |
| Baseline | Static RGB + Tactile | 47.9% | 22.7% | 2.3% | 0.3% | 0% | 0% | 0.25 |
| HULC | Static RGB + Gripper RGB | - | 88.9% | 73.3% | 58.7% | 47.5% | 38.3% | 3.06 |
| RoboFlamingo | Static RGB + Gripper RGB | - | 96.4% | 89.6% | 82.4% | 74.0% | 66.0% | 4.08 |
| MDT | Static RGB + Gripper RGB | - | 98.6% | 95.8% | 91.6% | 86.2% | 80.1% | 4.52 |
| GR-1 | Static RGB + Gripper RGB + Proprio | - | 94.9% | 89.6% | 84.4% | 78.9% | 73.1% | 4.21 |
| Baseline | Static RGB-D + Gripper RGB-D | 40.7% | 14.4% | 1.8% | 0.08% | 0.08% | 0% | 0.16 |

Train A, B, C → Test D

| Method | Input | MTLC (34 tasks) | 1 | 2 | 3 | 4 | 5 | Avg. Len. |
|---|---|---|---|---|---|---|---|---|
| Baseline | Static RGB | 38.6% | 20.2% | 0.2% | 0% | 0% | 0% | 0.20 |
| Baseline | Static RGB + Gripper RGB | 38% | 30.4% | 1.3% | 0.17% | 0% | 0% | 0.31 |
| Baseline | Static RGB + Tactile | 43.7% | 17.3% | 0.8% | 0.08% | 0% | 0% | 0.26 |
| HULC | Static RGB + Gripper RGB | - | 41.8% | 16.5% | 5.7% | 1.9% | 1.1% | 0.67 |
| SuSIE | Static RGB | - | 87.0% | 69.0% | 49.0% | 38.0% | 26.0% | 2.69 |
| RoboFlamingo | Static RGB + Gripper RGB | - | 82.4% | 61.9% | 46.6% | 33.1% | 23.5% | 2.47 |
| GR-1 | Static RGB + Gripper RGB + Proprio | - | 85.4% | 71.2% | 59.6% | 49.7% | 40.1% | 3.06 |
| SPIL | Static RGB + Gripper RGB | - | 74.2% | 46.3% | 27.6% | 14.7% | 8.0% | 1.71 |
| 3D Diffuser Actor | Static RGB-D + Gripper RGB-D + Proprio + Cam params | - | 92.2% | 78.7% | 63.9% | 51.2% | 41.2% | 3.27 |
| GR-MG | Static RGB + Gripper RGB + Proprio | - | 96.8% | 89.3% | 81.5% | 72.7% | 64.4% | 4.04 |
| RoboUniView | Static RGB-D + Gripper RGB-D + Cam params | - | 94.2% | 84.2% | 73.4% | 62.2% | 50.7% | 3.64 |
| Baseline | Static RGB-D + Gripper RGB-D | 30.8% | 21.1% | 1.3% | 0% | 0% | 0% | 0.22 |

Code

The CALVIN environments, baselines, and benchmarks can be found in our GitHub repository and are released under the MIT license for academic usage.
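
For orientation, the sketch below shows how a language-conditioned policy is typically rolled out and scored in the long-horizon evaluation. It assumes a generic gym-style environment interface; the actual module paths, observation format, success detection, and episode horizon in the CALVIN repository may differ, so every name here is illustrative rather than the repository's API.

```python
# Illustrative sketch only, not the CALVIN repository's API. It assumes a
# gym-style environment whose observations bundle the onboard sensors
# (e.g. static and gripper RGB) and whose actions are the 7-DoF
# end-effector commands (position, orientation, gripper) used by CALVIN.

def rollout(env, policy, instruction, max_steps=360):
    """Run one episode conditioned on a natural-language instruction."""
    obs = env.reset()
    for _ in range(max_steps):
        action = policy.act(obs, instruction)  # hypothetical policy interface
        obs, reward, done, info = env.step(action)
        if info.get("success"):  # hypothetical success flag from the env
            return True
    return False

def evaluate_chain(env, policy, instructions):
    """LH-MTLC: count how many instructions in a chain succeed in a row."""
    completed = 0
    for instruction in instructions:
        if not rollout(env, policy, instruction):
            break  # the chain stops at the first failed instruction
        completed += 1
    return completed  # per-chain tally, as consumed by lh_mtlc_metrics above
```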

Publications

CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks
Oier Mees, Lukas Hermann, Erick Rosete, Wolfram Burgard
IEEE Robotics and Automation Letters (RAL), 2022

People