Multi-GraspLLM: A Multimodal LLM for Multi-Hand Semantic Guided Grasp Generation

1Institute of Software, Chinese Academy of Sciences 2Waseda University 3Hong Kong University 4MEGVII Technology 5Hong Kong University of Science and Technology
Abstract Image

Abstract

Multi-hand semantic grasp generation aims to generate feasible and semantically appropriate grasp poses for different robotic hands based on natural language instructions. Although the task is highly valuable, it remains a long-standing challenge due to the lack of multi-hand grasp datasets with fine-grained contact descriptions between robotic hands and objects. In this paper, we present Multi-GraspSet, the first large-scale multi-hand grasp dataset with automatically generated contact annotations. Based on Multi-GraspSet, we propose Multi-GraspLLM, a unified language-guided grasp generation framework. It leverages large language models (LLMs) to handle variable-length sequences, generating grasp poses for diverse robotic hands within a single unified architecture. Multi-GraspLLM first aligns the encoded point cloud features and text features in a unified semantic space. It then generates grasp bin tokens, which are subsequently converted into grasp poses for each robotic hand via hand-aware linear mapping. Experimental results demonstrate that our approach significantly outperforms existing methods on Multi-GraspSet.

Dataset Construction

The initial unified grasp generation produces physically stable grasps. Then, through two levels of annotation, we generate paired data containing basic conversations with the corresponding grasp pose for each robotic hand.
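To make the structure of these paired data concrete, below is a minimal sketch of what a single conversation/grasp record could look like. All field names, the bin-token format, and the values are illustrative assumptions, not the released Multi-GraspSet schema.

```python
# Hypothetical record layout for one conversation / grasp-pose pair.
# Field names and values are illustrative assumptions, not the released format.
sample_record = {
    "object_id": "mug_0042",              # object the grasp is annotated on
    "hand": "shadow_hand",                # which robotic hand this grasp targets
    "annotation_level": "finger_contact", # e.g. part-level vs. finger-level contact
    "conversation": [
        {"role": "user",
         "content": "Grasp the mug by its handle with the Shadow Hand."},
        {"role": "assistant",
         "content": "<grasp_bin_17><grasp_bin_203><grasp_bin_88>"},
    ],
    "grasp_pose": {
        "wrist_translation": [0.12, -0.03, 0.21],   # meters, in the object frame
        "wrist_rotation": [0.0, 0.0, 0.0, 1.0],     # quaternion (x, y, z, w)
        "joint_angles": [0.10, 0.42, 0.35, 0.18],   # hand-specific DoF (shortened)
    },
    "contact": {"part": "handle", "fingers": ["thumb", "index"]},
}
```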

Dataset Construction Image

Dataset Construction

Multi-GraspLLM

The point encoder extracts features from the object point cloud and maps them, together with the language description, into a shared latent space. The LLM backbone then generates grasp bin tokens as output. Finally, we convert these grasp bin tokens into the corresponding grasp angles for each robotic hand.
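As a rough illustration of the final conversion step, the sketch below assumes each grasp parameter is discretized into a fixed number of uniform bins over a per-hand joint range, so a predicted bin index can be mapped back to a continuous angle. The bin count, joint limits, and decoding scheme are assumptions for illustration, not the paper's exact hand-aware mapping.

```python
import numpy as np

def bins_to_angles(bin_indices, joint_limits, num_bins=256):
    """Map one predicted bin index per joint back to a continuous angle.

    bin_indices  : (D,) integer bin index for each of the hand's D joints
    joint_limits : (D, 2) per-joint [min, max] range in radians
    """
    bin_indices = np.asarray(bin_indices, dtype=np.float64)
    lo, hi = joint_limits[:, 0], joint_limits[:, 1]
    # Take the center of each bin, spread linearly over the joint's range.
    return lo + (bin_indices + 0.5) / num_bins * (hi - lo)

# Example: a hypothetical 2-DoF hand (wrist roll + opening angle).
limits = np.array([[-np.pi, np.pi], [0.0, 1.2]])
print(bins_to_angles([128, 32], limits))
```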

Pipeline Image

Pipeline of Multi-GraspLLM

Visualization

Multi-GraspSet and Training Data Templates

Multi-GraspSet includes grasp poses of common objects across three robotic hands with two types of contact annotations. Based on this dataset, we generate training data for Multi-GraspLLM, consisting of three different categories.
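As a concrete illustration of the three instruction categories, the templates below are hypothetical stand-ins for the kind of prompts used to build the training conversations; the actual template wording in Multi-GraspSet may differ.

```python
# Illustrative instruction templates for the three training-data categories
# (low / mid / high level of contact information). Wording is an assumption.
TEMPLATES = {
    "low":  "Generate a grasp for the {object} using the {hand}.",
    "mid":  "Grasp the {object} by its {part} using the {hand}.",
    "high": "Grasp the {object} so that the {fingers} touch its {part}, using the {hand}.",
}

prompt = TEMPLATES["high"].format(
    object="mug", part="handle", fingers="thumb and index finger", hand="Shadow Hand"
)
print(prompt)
```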

Grasp Visualization Image 2

Illustration of Multi-GraspSet

Grasp Visualization Image 1

Training Data Templates

Multi-GraspLLM Results

Multi-GraspLLM can generate grasping poses for different robotic hands based on instructions with varying levels of contact information. For a low-level instruction, the model generates grasps without any contact information, while a mid-level instruction focuses on object part information. A high-level instruction utilizes finger contact information, enabling our model to control individual fingers.
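Regardless of the instruction level, the model's response is a sequence of grasp bin tokens that must be parsed before the hand-aware mapping is applied. The snippet below sketches that parsing step; the `<grasp_bin_K>` token format is an assumption used for illustration, not the model's actual vocabulary.

```python
import re

def parse_bin_tokens(generated_text):
    """Extract integer bin indices from a generated token sequence."""
    return [int(k) for k in re.findall(r"<grasp_bin_(\d+)>", generated_text)]

response = "<grasp_bin_17><grasp_bin_203><grasp_bin_88>"
print(parse_bin_tokens(response))  # [17, 203, 88]
```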

Visualization Image

Visualization of the Multi-GraspLLM Results

Real World Experiment

We conducted extensive experiments in real-world environments. We collected 11 objects from daily life, ranging from tools to a potted plant. We tested both the gripper baseline method and Multi-GraspLLM on these objects, evaluating grasping success rate and part selection accuracy.
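For reference, both metrics reduce to simple ratios over the trial outcomes, as in the sketch below; the per-trial record format is a hypothetical one.

```python
# Hypothetical per-trial records; field names are illustrative.
trials = [
    {"object": "mug",    "lifted": True,  "correct_part": True},
    {"object": "hammer", "lifted": True,  "correct_part": False},
    {"object": "plant",  "lifted": False, "correct_part": True},
]

success_rate = sum(t["lifted"] for t in trials) / len(trials)
part_accuracy = sum(t["correct_part"] for t in trials) / len(trials)
print(f"grasp success rate: {success_rate:.2f}, part selection accuracy: {part_accuracy:.2f}")
```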

Real World Experiment Image

Real World Experiment for the Gripper

BibTeX

@misc{li2024multigraspllmmultimodalllmmultihand,
      title={Multi-GraspLLM: A Multimodal LLM for Multi-Hand Semantic Guided Grasp Generation}, 
      author={Haosheng Li and Weixin Mao and Weipeng Deng and Chenyu Meng and Haoqiang Fan and Tiancai Wang and Ping Tan and Hongan Wang and Xiaoming Deng},
      year={2024},
      eprint={2412.08468},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2412.08468}, 
}