Object recognition has been an important task in computer vision for the last few decades. The goal is to detect instances of objects like people, cars, etc. in digital images. Hundreds of methods have been developed to answer a single question: what objects are where?
Traditional methods attempted to answer this question by extracting handcrafted features, such as edges and corners, from the image. Most of these approaches used a sliding-window strategy: they examined small patches of the image at multiple scales to check whether any of them contained the object of interest. This was extremely time-consuming, and even a slight change in object shape, lighting, etc. could cause the algorithm to miss the object.
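To see why this was so slow, here is a minimal sketch of single-scale sliding windows (the window size and stride are illustrative; a real pipeline would also loop over an image pyramid and run a handcrafted-feature classifier such as HOG + SVM on every patch):

```python
import numpy as np

def sliding_windows(image, window=(64, 64), stride=32):
    """Yield (x, y, patch) crops of `image` at a single scale."""
    h, w = image.shape[:2]
    win_h, win_w = window
    for y in range(0, h - win_h + 1, stride):
        for x in range(0, w - win_w + 1, stride):
            yield x, y, image[y:y + win_h, x:x + win_w]

# Even a modest 640x480 image yields hundreds of windows per scale,
# and the classifier must be evaluated on every single one of them.
image = np.zeros((480, 640), dtype=np.uint8)
patches = list(sliding_windows(image))
print(len(patches))  # 266 windows at this one scale alone
```

Multiply that by several scales and a per-patch feature extractor, and the cost of classical detection becomes obvious.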
Then came the deep learning era. As computing hardware became more powerful and large datasets were introduced, it became possible to leverage advances in deep learning to develop a reliable and robust object detection algorithm that could work end-to-end.
Deep learning methods can lead to extremely successful object detection techniques. They are robust to changes in the environment and to variations in the objects themselves, and most of them can run in real time, even on mobile devices. Sounds really good, doesn’t it? Does that mean the problem of object detection is finally solved? Not yet.
The problem is that all of these methods are limited by the dataset they are trained on. If you want your model to recognize pandas in pictures, you need lots of labeled panda pictures to teach it what a panda looks like. Collecting these images is one challenge, but the bigger problem is labeling them. Going through thousands of images and marking the exact location of every panda is an extremely time-consuming task.
Also, you would need to do this for every object you want your model to recognize. Imagine you want to develop a generic object detector that recognizes every object it sees. You could use a large dataset like COCO, which contains a variety of objects, but you would still be limited to the categories present in your dataset.
What if the model could discover new objects on its own? Then we wouldn’t have to label every single object in the world. The model would be given a set of familiar objects, and when it encountered a new one, it would understand it and predict a label for it. That is exactly what the authors of the RNCDL paper set out to achieve.
This problem is known as Novel Class Discovery and Localization (NCDL). The goal is to discover and recognize objects in raw, unlabeled data. Existing methods approach this problem in a data-driven manner by injecting prior knowledge, so that unknown objects are grouped into semantic classes under some form of supervision. However, relying on such priors is not a general solution.
Solving novel class discovery and localization jointly is a harder problem, since each image in the dataset contains both labeled and unlabeled object classes. Each of these objects must therefore be localized and categorized at the same time.
RNCDL is trained on a mixed and modified combination of the COCO and LVIS datasets. Half of the COCO dataset is used as-is, while the labels are removed from the other half; the long-tail LVIS label set is then used to assess how well the network learns to recognize novel classes.
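The idea of the split can be sketched as follows: half of the images keep their annotations, while the other half are treated as unlabeled data on which classes must be discovered. This is an illustrative sketch, not the paper's exact split files; the annotation structure is simplified to an image-id-to-boxes mapping:

```python
import random

def make_half_labeled(image_ids, annotations, seed=0):
    """Split images 50/50; keep annotations only for the labeled half.

    `annotations` maps image_id -> list of (box, label) pairs.
    The unlabeled half is what the network must discover classes on.
    """
    rng = random.Random(seed)
    ids = sorted(image_ids)
    rng.shuffle(ids)
    half = len(ids) // 2
    labeled_ids = set(ids[:half])
    labeled = {i: annotations.get(i, []) for i in labeled_ids}
    unlabeled = [i for i in ids if i not in labeled_ids]
    return labeled, unlabeled

# Toy example with 10 images, one annotation each.
imgs = list(range(10))
anns = {i: [("box", i)] for i in imgs}
labeled, unlabeled = make_half_labeled(imgs, anns)
print(len(labeled), len(unlabeled))  # 5 5
```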
A two-stage detector is trained end-to-end for this problem. The model has two goals: to correctly recognize labeled objects and to recognize unlabeled objects by simultaneously learning feature representations. Supervised training is performed on the labeled data first, followed by self-supervised training on the unlabeled data. Knowledge learned in the supervised phase is transferred to the later phase by keeping the weights of the class-agnostic modules and the segmentation heads.
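The weight-transfer step can be sketched as copying over only the class-agnostic parts of the detector when moving from the supervised phase to the discovery phase. The module names below are hypothetical stand-ins for a typical two-stage detector's components, not the paper's actual identifiers:

```python
def transfer_class_agnostic(supervised, discovery,
                            class_agnostic=("backbone", "rpn", "box_head", "mask_head")):
    """Copy weights of class-agnostic modules from the supervised model;
    everything else (e.g. the classifier) keeps its fresh initialization."""
    for name, weights in supervised.items():
        if name in class_agnostic:
            discovery[name] = weights  # keep supervised knowledge
    return discovery

# Toy "models" as dicts of module name -> weights.
supervised = {"backbone": [1.0], "rpn": [2.0], "classifier": [3.0]}
discovery = {"backbone": [0.0], "rpn": [0.0], "classifier": [0.0]}
out = transfer_class_agnostic(supervised, discovery)
print(out["backbone"], out["classifier"])  # [1.0] [0.0]
```

The classifier is deliberately left untouched, since the discovery phase must learn to categorize novel classes that the supervised classifier has never seen.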
For classification, a second classification head is added alongside the primary one, and the two are trained jointly to categorize every region proposal. A non-uniform classification prior helps the network capture features that represent the diversity of objects and prevents it from being biased toward the labeled or background classes.
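The effect of a non-uniform prior can be illustrated with a simple per-proposal assignment: adding the log of an assumed long-tailed class-frequency prior to the classifier scores discourages collapsing all proposals into a few dominant classes. This is a simplified illustration only; RNCDL uses a constrained assignment over the whole batch rather than this independent argmax, and the prior values here are made up:

```python
import numpy as np

def assign_with_prior(logits, prior):
    """Assign each proposal to a class, biased by a class-frequency prior.

    logits: (num_proposals, num_classes) classifier scores.
    prior:  (num_classes,) expected class frequencies (sums to 1).
    """
    log_prior = np.log(prior + 1e-12)  # avoid log(0)
    return np.argmax(logits + log_prior, axis=1)

rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 4))
uniform = np.full(4, 0.25)
long_tail = np.array([0.6, 0.25, 0.1, 0.05])  # hypothetical long-tail prior
print(assign_with_prior(logits, uniform))
print(assign_with_prior(logits, long_tail))
```

With the long-tail prior, assignments shift toward the head classes in proportion to their assumed frequency, instead of treating all classes as equally likely.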
RNCDL can discover and recognize novel classes and outperforms previous approaches. In addition, it generalizes beyond the COCO dataset.
Check out the paper, GitHub, and project page. All credit for this research goes to the researchers on this project. Also, don’t forget to join our Reddit page and Discord channel, where we share the latest AI research news, cool AI projects, and more.
Ekrem Çetinkaya received his B.Sc. in 2018 and his M.Sc. in 2019 from Ozyegin University, Istanbul, Türkiye. He wrote his M.Sc. thesis on image denoising using deep convolutional networks. He is currently pursuing a Ph.D. at the University of Klagenfurt, Austria, and works as a researcher on the ATHENA project. His research interests include deep learning, computer vision, and multimedia networking.