This Artificial Intelligence Paper Proposes ‘SuperGlue,’ A Graph Neural Network That Simultaneously Performs Context Aggregation, Matching, And Filtering of Local Features for Wide-Baseline Pose Estimation

Imagine you have two images of the same scene taken from different angles. Most of the objects in both images are the same; you are simply looking at them from different viewpoints. Computer vision assumes that objects have distinctive features, such as edges and corners, and matching these features across images is critical for many applications. But what does it take to match features between two images?

Finding matches between images is a prerequisite for estimating 3D structure and camera poses in computer vision tasks such as Simultaneous Localization and Mapping (SLAM) and Structure-from-Motion (SfM). This is done by matching local features, which is difficult because of changes in lighting conditions, occlusion, blur, and so on.

Traditionally, feature matching is done using a two-step approach. First, the front-end step extracts visual features from the images. Second, the back-end step applies bundle adjustment and pose estimation, which rely on matches between the extracted features. Between these two steps, once the features are ready, feature matching is modeled as a linear assignment problem.
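To make the assignment view concrete, here is a minimal sketch, assuming a small hand-written cost matrix of descriptor distances. It finds the one-to-one pairing with the lowest total cost by exhaustive search, which is only practical for a handful of features; real systems would use a dedicated solver such as the Hungarian algorithm. The function name and the cost values are illustrative and do not come from the paper.

```python
from itertools import permutations

def best_assignment(cost):
    """Exhaustively search for the one-to-one row-to-column assignment
    with the lowest total cost. `cost` is a square list of lists."""
    n = len(cost)
    best = min(
        permutations(range(n)),
        key=lambda perm: sum(cost[i][perm[i]] for i in range(n)),
    )
    total = sum(cost[i][best[i]] for i in range(n))
    return list(best), total

# Hypothetical descriptor distances: rows are features in image A,
# columns are features in image B. Both rows prefer column 0, so a
# greedy nearest-neighbor choice would conflict.
cost = [
    [0.10, 0.20],
    [0.15, 0.90],
]
assignment, total = best_assignment(cost)
print(assignment, total)
```

Here the cheapest consistent pairing gives feature 0 its second choice so that feature 1 can keep its only good match, which is exactly what a global assignment buys over independent nearest-neighbor lookups.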

As in many other fields, deep neural networks have played a crucial role in feature matching in recent years. Convolutional neural networks (CNNs) have been used to learn better sparse detectors and local descriptors from data.

However, these networks typically handled only one component of the feature matching problem and were not an end-to-end solution. What if a single neural network could perform context aggregation, matching, and filtering in a single architecture? Time to introduce SuperGlue.

SuperGlue approaches the matching problem differently. It learns the matching process from pre-existing local features using a graph neural network. This replaces existing approaches, in which task-agnostic features are learned first and then compared using heuristics and simple methods. As an end-to-end approach, SuperGlue has a strong advantage over existing methods. SuperGlue is a learnable middle-end that can be used to improve existing approaches.
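For comparison, one common example of such a simple heuristic is the mutual nearest-neighbor check: a pair is kept only if each descriptor is the other's closest match. The sketch below is dependency-free and uses made-up 2-D descriptors; the function names and the `max_dist` threshold are illustrative assumptions, not part of SuperGlue.

```python
import math

def l2(a, b):
    """Euclidean distance between two descriptors (sequences of floats)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mutual_nearest_neighbors(desc_a, desc_b, max_dist=float("inf")):
    """Return (i, j) pairs where desc_a[i] and desc_b[j] are each
    other's nearest neighbor and lie within max_dist of each other."""
    if not desc_a or not desc_b:
        return []
    nn_ab = [min(range(len(desc_b)), key=lambda j: l2(a, desc_b[j]))
             for a in desc_a]
    nn_ba = [min(range(len(desc_a)), key=lambda i: l2(desc_a[i], b))
             for b in desc_b]
    return [(i, j) for i, j in enumerate(nn_ab)
            if nn_ba[j] == i and l2(desc_a[i], desc_b[j]) <= max_dist]

# Toy descriptors: the third feature in image A has no counterpart in B.
desc_a = [[1.0, 0.0], [0.0, 1.0], [4.0, 5.0]]
desc_b = [[0.1, 0.9], [0.9, 0.1]]
matches = mutual_nearest_neighbors(desc_a, desc_b, max_dist=1.0)
print(matches)
```

The heuristic looks at each descriptor pair in isolation. It has no notion of scene context, which is the gap a learned matcher is meant to fill.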

How does SuperGlue achieve this? It looks at the problem from a new angle and treats feature matching as a partial assignment between two sets of local features. Instead of solving a linear assignment problem to match features, it treats matching as an optimal transport problem. SuperGlue uses a Graph Neural Network (GNN) to predict the costs of this transport optimization.
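The optimal transport step can be sketched with a few Sinkhorn iterations in log space. In this minimal sketch the score matrix is written by hand, whereas in SuperGlue the scores come from the GNN. An extra "dustbin" row and column absorb keypoints that have no match; the fixed `dustbin_score`, the iteration count, and all names are assumptions made for illustration.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))

def sinkhorn_with_dustbins(scores, dustbin_score=0.0, iterations=50):
    """Turn an M x N score matrix into an (M+1) x (N+1) soft assignment.
    The last row and column are dustbins for unmatched keypoints."""
    m, n = len(scores), len(scores[0])
    # Augment the scores with a dustbin column and a dustbin row.
    z = [row + [dustbin_score] for row in scores]
    z.append([dustbin_score] * (n + 1))
    # Each keypoint carries mass 1; each dustbin can absorb every
    # keypoint from the other image.
    log_a = [0.0] * m + [math.log(n)]
    log_b = [0.0] * n + [math.log(m)]
    u = [0.0] * (m + 1)
    v = [0.0] * (n + 1)
    for _ in range(iterations):
        # Alternately rescale rows and columns to match the marginals.
        u = [log_a[i] - logsumexp([z[i][j] + v[j] for j in range(n + 1)])
             for i in range(m + 1)]
        v = [log_b[j] - logsumexp([z[i][j] + u[i] for i in range(m + 1)])
             for j in range(n + 1)]
    return [[math.exp(z[i][j] + u[i] + v[j]) for j in range(n + 1)]
            for i in range(m + 1)]

# Two keypoints in image A, three in image B; B's third has no partner.
scores = [
    [4.0, 0.0, 0.0],
    [0.0, 4.0, 0.0],
]
P = sinkhorn_with_dustbins(scores, dustbin_score=1.0)
for row in P:
    print([round(p, 3) for p in row])
```

Each keypoint row of `P` sums to one, so an entry can be read as a soft match probability, and the unmatched keypoint in image B sends most of its mass to the dustbin row rather than being forced onto a wrong partner.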

We all know how Transformers have made massive strides in natural language processing and, more recently, in computer vision tasks. SuperGlue borrows the Transformer's attention mechanism, applying self-attention within each image and cross-attention between the two images, to take advantage of the keypoints' spatial relationships as well as their visual appearance.
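As a rough illustration of attention over keypoints, the sketch below lets every keypoint gather a softmax-weighted message either from keypoints in its own image (self) or from the other image (cross). It is heavily simplified: there are no learned projections, multiple heads, or MLPs, and the input vectors are assumed to already combine appearance and position. All names and values are hypothetical.

```python
import math

def attend(queries, sources):
    """For each query vector, return a softmax-weighted average of the
    source vectors, weighted by scaled dot-product similarity."""
    dim = len(queries[0])
    messages = []
    for q in queries:
        logits = [sum(a * b for a, b in zip(q, s)) / math.sqrt(dim)
                  for s in sources]
        peak = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(l - peak) for l in logits]
        total = sum(weights)
        messages.append([sum(w / total * s[k] for w, s in zip(weights, sources))
                         for k in range(dim)])
    return messages

def residual(features, messages):
    """Add each message to the feature it was computed for."""
    return [[x + y for x, y in zip(f, m)] for f, m in zip(features, messages)]

def gnn_layer(feats_a, feats_b, cross):
    """One message-passing step: self-attention within each image, or
    cross-attention between the two images when cross is True."""
    src_for_a, src_for_b = (feats_b, feats_a) if cross else (feats_a, feats_b)
    new_a = residual(feats_a, attend(feats_a, src_for_a))
    new_b = residual(feats_b, attend(feats_b, src_for_b))
    return new_a, new_b

# Toy keypoint features for two images.
feats_a = [[1.0, 0.0], [0.0, 1.0]]
feats_b = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]
for layer in range(4):
    # Alternate between self- and cross-attention layers.
    feats_a, feats_b = gnn_layer(feats_a, feats_b, cross=(layer % 2 == 1))
print(feats_a)
print(feats_b)
```

Alternating the two kinds of layers lets information flow first among keypoints of the same image and then across the pair, so each keypoint's representation ends up depending on both images.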

SuperGlue is trained end-to-end, with image pairs as training data. Priors for pose estimation are learned from a large labeled dataset, which allows SuperGlue to reason about the 3D scene.

SuperGlue can be applied to a range of multi-view geometry problems that require high-quality feature correspondences. It runs in real time on off-the-shelf hardware and can be used with both classical and learned features. For more information on SuperGlue, see the links below.

Check out the paper, project, and code. All credit for this research goes to the researchers on this project. Also, don't forget to join our Reddit page and Discord channel, where we share the latest AI research news, cool AI projects, and more.

Ekrem Çetinkaya received his B.Sc. in 2018 and his M.Sc. in 2019 from Ozyegin University, Istanbul, Türkiye. He wrote his M.Sc. thesis on image denoising using deep convolutional networks. He is currently pursuing a Ph.D. at the University of Klagenfurt, Austria, and works as a researcher on the ATHENA project. His research interests include deep learning, computer vision, and multimedia networking.