We present CALVIN (Composing Actions from Language and Vision), an open-source simulated benchmark for learning long-horizon language-conditioned tasks. Our aim is to make it possible to develop agents that can solve many robotic manipulation tasks over a long horizon, from onboard sensors, and specified only via human language. CALVIN tasks are more complex in terms of sequence length, action space, and language than those in existing vision-and-language task datasets, and the benchmark supports flexible specification of sensor suites.


Leaderboard


Train D → Test D

Method      Input                           MTLC (32 tasks)   LH-MTLC: No. Instructions in a Row (1158 chains)
                                                              1        2        3        4        5
Baseline    Static RGB                      53.9%             48.9%    12.9%    2.6%     0.5%     0.08%
Baseline    Static RGB + Gripper RGB        51.8%             34.4%    5.8%     1.1%     0.2%     0.08%
Baseline    Static RGB + Tactile            54.2%             28.5%    3.2%     0%       0%       0%
Baseline    Static RGB-D + Gripper RGB-D    46.1%             28.2%    4.6%     0.3%     0.08%    0%

Train A, B, C, D → Test D

Method      Input                           MTLC (32 tasks)   LH-MTLC: No. Instructions in a Row (1158 chains)
                                                              1        2        3        4        5
Baseline    Static RGB                      35.6%             28.2%    2.5%     0.3%     0%       0%
Baseline    Static RGB + Gripper RGB        49.7%             37.3%    2.7%     0.17%    0%       0%
Baseline    Static RGB + Tactile            47.9%             22.7%    2.3%     0.3%     0%       0%
Baseline    Static RGB-D + Gripper RGB-D    40.7%             14.4%    1.8%     0.08%    0.08%    0%

Train A, B, C → Test D

Method      Input                           MTLC (32 tasks)   LH-MTLC: No. Instructions in a Row (1158 chains)
                                                              1        2        3        4        5
Baseline    Static RGB                      38.6%             20.2%    0.2%     0%       0%       0%
Baseline    Static RGB + Gripper RGB        38.0%             30.4%    1.3%     0.17%    0%       0%
Baseline    Static RGB + Tactile            43.7%             17.3%    0.8%     0.08%    0%       0%
Baseline    Static RGB-D + Gripper RGB-D    30.8%             21.1%    1.3%     0%       0%       0%
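
In the long-horizon evaluation (LH-MTLC), the agent is given chains of five language instructions in a row, and each numbered column reports the fraction of the 1158 evaluation chains in which at least that many consecutive instructions were completed. The following minimal Python sketch shows this aggregation; the function name and data layout are illustrative and not part of the released evaluation code.

    from typing import List, Sequence

    def chain_success_rates(completed_per_chain: Sequence[int], chain_len: int = 5) -> List[float]:
        """Fraction of chains in which at least k instructions were solved consecutively.

        completed_per_chain holds, for every evaluation chain, the number of
        instructions the agent completed in a row before its first failure.
        The result has one entry per k = 1..chain_len, matching the table columns.
        """
        n = len(completed_per_chain)
        return [sum(c >= k for c in completed_per_chain) / n for k in range(1, chain_len + 1)]

    # Toy example with 4 chains instead of the benchmark's 1158:
    print(chain_success_rates([5, 2, 0, 1]))  # -> [0.75, 0.5, 0.25, 0.25, 0.25]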

Videos


Code

The CALVIN environments, baselines, and benchmark can be found in our GitHub repository for academic use and are released under the MIT license. For any commercial purpose, please contact the authors.
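
As a quick orientation, the simulated environments follow a Gym-style reset/step loop. The sketch below illustrates how a language-conditioned rollout could be driven; the policy interface and the "success" entry in info are assumptions for illustration, not the repository's documented API.

    # Illustrative sketch only: the policy interface and the "success" entry in
    # info are assumptions, not the confirmed API of the CALVIN repository.
    def run_rollout(env, policy, instruction: str, max_steps: int = 360) -> bool:
        """Run one language-conditioned episode; return True if the task succeeds."""
        obs = env.reset()
        for _ in range(max_steps):
            action = policy.act(obs, instruction)       # map observation + text to a robot action
            obs, reward, done, info = env.step(action)
            if done or info.get("success", False):
                return bool(info.get("success", False))
        return False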

Publications

CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks
Oier Mees, Lukas Hermann, Erick Rosete, Wolfram Burgard
arXiv:2112.03227

People