TU Wien develops ZS6D: Zero-Shot 6D Object Pose Estimation
As robotic systems interact with increasingly complex environments, recognising a wide variety of objects becomes crucial. Traditional 6D object pose estimation methods rely on object-specific training, which limits their ability to generalise to unseen objects. Recent methods address this limitation by employing deep template matching through fine-tuned CNNs. However, these approaches require costly data rendering and extensive training.
To overcome these challenges, TU Wien introduced ZS6D, a zero-shot 6D object pose estimation method. ZS6D uses visual descriptors from pre-trained Vision Transformers (ViTs) to match rendered templates with query images and establish local correspondences. These correspondences are then used to estimate an object's 6D pose with RANSAC-based PnP, without requiring task-specific fine-tuning.
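The correspondence step above can be sketched in a few lines. The snippet below is a simplified, illustrative stand-in (not the actual ZS6D code): it matches query-image patch descriptors against template patch descriptors by mutual nearest-neighbour cosine similarity, the kind of local matching that would then feed 2D–3D correspondences into a RANSAC-based PnP solver such as OpenCV's `solvePnPRansac`. The function name, threshold, and synthetic data are assumptions for illustration.

```python
import numpy as np

def match_descriptors(query_desc, template_desc, sim_thresh=0.8):
    """Mutual nearest-neighbour matching of patch descriptors by
    cosine similarity (a simplified sketch of descriptor-based
    correspondence; names and threshold are illustrative)."""
    # L2-normalise so dot products become cosine similarities
    q = query_desc / np.linalg.norm(query_desc, axis=1, keepdims=True)
    t = template_desc / np.linalg.norm(template_desc, axis=1, keepdims=True)
    sim = q @ t.T                 # (n_query, n_template) similarity matrix
    best_t = sim.argmax(axis=1)   # best template patch for each query patch
    best_q = sim.argmax(axis=0)   # best query patch for each template patch
    matches = []
    for qi, ti in enumerate(best_t):
        # keep only mutual matches above the similarity threshold
        if best_q[ti] == qi and sim[qi, ti] >= sim_thresh:
            matches.append((qi, ti))
    return matches

# Synthetic check: template descriptors are noisy copies of the query's
rng = np.random.default_rng(0)
query = rng.normal(size=(5, 64))
template = query + 0.01 * rng.normal(size=(5, 64))
print(match_descriptors(query, template))  # each patch matches its own copy
```

In the full pipeline, each matched template patch carries a known 3D point on the object model, so the resulting 2D–3D pairs can be passed to a PnP solver with RANSAC to reject outlier correspondences and recover the 6D pose.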
Their experiments demonstrate that ZS6D outperforms state-of-the-art methods such as MegaPose and OSOP on the LMO, YCBV, and TLESS datasets. The use of ViT descriptors yields superior generalisation without the need for massive datasets and extensive training.
This novel approach will be integrated into the MANiBOT cobot, which will demonstrate fast and effective manipulation, even of unknown objects, in environments with a human presence.
You can read the publication here.