The dominant paradigm for solving modern image processing tasks such as image classification and object detection on small datasets is to fine-tune the latest pre-trained deep network, which used to be ImageNet-based and is now most often CLIP-based. This pipeline has been largely successful, but it still has some limitations.
The main concern is probably the tremendous effort required to collect and tag such large volumes of images. Notably, the size of the most popular pre-training dataset has grown from 1.2 million images (ImageNet) to 400 million image-text pairs (CLIP), with no sign of stopping. As a direct consequence, training these generalist networks requires a computational effort that nowadays only a few industrial or academic laboratories can afford. Another critical point regarding such aggregated databases is their static nature: although these datasets are very large, they are not updated, so the information they carry about known concepts becomes stale over time.
Recent work by researchers at Carnegie Mellon University and UC Berkeley proposes treating the Internet itself as an open-ended dataset to overcome the aforementioned problems of the current pre-training and fine-tuning paradigm.
Specifically, the paper proposes a reinforcement learning-inspired, disembodied online agent called Internet Explorer that actively searches the web using standard search engines to find relevant visual data that improves feature quality on a target dataset.
The agent’s actions are text queries to search engines, and the observations are the data obtained from the search.
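This action-observation loop can be pictured with a minimal, self-contained sketch. The concept vocabulary, the stubbed search engine, and all function names below are illustrative assumptions, not the paper's actual API; the stub simply returns fake feature vectors in place of downloaded images.

```python
import random

# Hypothetical concept vocabulary; the actual system draws concepts from WordNet.
CONCEPT_VOCAB = ["sparrow", "tulip", "pasta", "terrier", "bicycle"]

def search_images(query, n=4):
    """Stub standing in for an image search engine: returns n fake
    8-dimensional 'image features' deterministically derived from the query."""
    rng = random.Random(query)  # str seeds are valid for random.Random
    return [[rng.random() for _ in range(8)] for _ in range(n)]

def explore(num_steps=3, seed=0):
    """One exploration episode: each action is a text query,
    each observation is the batch of images returned for it."""
    rng = random.Random(seed)
    history = []
    for _ in range(num_steps):
        query = rng.choice(CONCEPT_VOCAB)  # action: pick a concept to search
        images = search_images(query)      # observation: downloaded results
        history.append((query, images))
    return history
```

In the real system the query distribution is updated after each step based on how useful the returned images were; here the policy is just a uniform random choice.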
The proposed approach differs from active learning and related work in that it performs a progressively improving, directed search over an ever-expanding dataset in a fully self-supervised manner, requiring no labels for training, not even from the target dataset. In particular, the method is not tied to a single fixed dataset and does not require human annotators, as standard active learning does.
In practice, Internet Explorer uses WordNet concepts to query a search engine (e.g., Google Images) and embeds these concepts in a representation space in order to learn, over time, which queries are relevant. The model uses self-supervised learning to extract useful representations from the unlabeled images downloaded from the Internet; the initial vision encoder is a self-supervised, pre-trained MoCo-v3 model. Downloaded images are ranked by their self-supervised loss, which serves as a proxy for their similarity to the target dataset and hence their relevance for training.
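The ranking step can be illustrated with a toy sketch. Here the self-supervised loss is replaced by a simple stand-in, the squared distance of an image's feature vector to the mean feature of the target dataset; the real system scores images with the MoCo-v3 loss, so the proxy and all names below are assumptions for illustration only.

```python
def mean_feature(feats):
    """Average a list of equal-length feature vectors."""
    dim = len(feats[0])
    return [sum(f[i] for f in feats) / len(feats) for i in range(dim)]

def loss_proxy(feat, center):
    """Squared Euclidean distance to the target mean, standing in for
    the self-supervised (MoCo-v3) loss used in the paper."""
    return sum((a - b) ** 2 for a, b in zip(feat, center))

def rank_downloads(downloaded, target_feats, keep=2):
    """Keep the `keep` downloaded images most similar to the target data:
    a lower proxy loss means closer to the target distribution,
    hence more relevant for further self-supervised training."""
    center = mean_feature(target_feats)
    return sorted(downloaded, key=lambda f: loss_proxy(f, center))[:keep]
```

A quick usage example: with target features clustered near the origin, `rank_downloads` keeps the downloaded vectors closest to that cluster and discards outliers.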
On five popular, challenging fine-grained benchmarks, namely Birdsnap, Flowers, Food101, Pets, and VOC2007, Internet Explorer (with the additional use of GPT-generated descriptors for concepts) manages to compete with a CLIP-trained ResNet-50 oracle while reducing compute and the number of training images by one to two orders of magnitude.
In summary, this paper introduces a novel, intelligent agent that queries the Internet to download and learn from helpful data, solving a specific image classification task at a fraction of the training cost of previous approaches, and it opens further research on the topic.
Check out the paper and GitHub. All credit for this research goes to the researchers on this project.
Lorenzo Brigato is a postdoctoral researcher at the ARTORG Center, a research institution affiliated with the University of Bern, and is currently working on the application of AI to health and nutrition. He holds a Ph.D. in Computer Science from Sapienza University of Rome, Italy. His Ph.D. dissertation focused on image classification problems with sample- and label-deficient data distributions.