There’s nothing like a good benchmark to help motivate the computer vision field.
That’s why one of the research teams at the Allen Institute for AI, also known as AI2, recently worked jointly with the University of Illinois at Urbana-Champaign to develop a new, unifying benchmark called GRIT (General Robust Image Task) for general-purpose computer vision models. Their goal is to help AI developers build the next generation of computer vision programs that can be applied to a wide range of tasks – an especially complex challenge.
“We discuss, like weekly, the need to create more general computer vision systems that are able to solve a range of tasks and can generalize in ways that current systems cannot,” said Derek Hoiem, professor of computer science at the University of Illinois at Urbana-Champaign. “We realized that one of the challenges is that there’s no good way to evaluate the general vision capabilities of a system. All of the current benchmarks are set up to evaluate systems that have been trained specifically for that benchmark.”
What general computer vision models need to be able to do
Tanmay Gupta, who joined AI2 as a research scientist after receiving his Ph.D. from the University of Illinois at Urbana-Champaign, said there have been other efforts to build multitask models that can do more than one thing – but a general-purpose model requires more than being able to do three or four different tasks.
“Often you wouldn’t know ahead of time what are all tasks that the system would be required to do in the future,” he said. “We wanted to make the architecture of the model such that anybody from a different background could issue natural language instructions to the system.”
For example, he explained, someone could say “describe the image” or “find the brown dog,” and the system could carry out that instruction, either returning a bounding box – a rectangle around the dog in question – or a caption such as “there’s a brown dog playing on a green field.” “So, that was the challenge: to build a system that can carry out instructions, including instructions it has never seen before, and do it for a wide array of tasks encompassing segmentation, bounding boxes, captioning and question answering,” he said.
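The interface Gupta describes – one model, one entry point, with the task selected by a natural-language instruction – can be sketched in a few lines. This is a purely hypothetical toy, not the actual GRIT or AI2 API; the class name, method, and canned outputs are all invented for illustration.

```python
# Hypothetical sketch of an instruction-driven vision interface,
# as described in the article -- NOT the actual GRIT or AI2 code.
from dataclasses import dataclass


@dataclass
class BoundingBox:
    # A rectangle around the referenced object, in pixel coordinates.
    x: int
    y: int
    width: int
    height: int


class GeneralVisionModel:
    """Toy stand-in: one entry point, many tasks, chosen by instruction."""

    def run(self, image, instruction: str):
        text = instruction.lower()
        if text.startswith("find"):
            # Localization-style instructions return a bounding box
            # (hard-coded here; a real model would predict it).
            return BoundingBox(x=40, y=60, width=120, height=90)
        if text.startswith("describe"):
            # Captioning-style instructions return free-form text.
            return "there's a brown dog playing on a green field"
        raise NotImplementedError(f"unseen instruction: {instruction}")


model = GeneralVisionModel()
print(model.run(None, "describe the image"))
print(model.run(None, "find the brown dog"))
```

The point of the design is that adding a new task changes only the model’s behavior, not the caller’s code: every task flows through the same `run(image, instruction)` entry point.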
The GRIT benchmark, Gupta continued, evaluates these capabilities: how robust a system is to distortions in the images and how well it generalizes across different data sources. “Does it solve the problem for not just one or two or ten or twenty different concepts, but across thousands of concepts?” he said.
Benchmarks have served as drivers for computer vision research
Benchmarks have been a big driver of computer vision research since the early aughts, said Hoiem. “When a new benchmark is created, if it’s well-geared towards evaluating the kinds of research that people are interested in, then it really facilitates that research by making it much easier to compare progress and evaluate innovations without having to reimplement algorithms, which takes a lot of time,” he said.
Computer vision and AI have made a lot of genuine progress over the past decade, he added. “You can see that in smartphones, home assistants and vehicle safety systems, with AI out and about in ways that were not the case ten years ago,” he said. “We used to go to computer vision conferences and people would ask ‘What’s new?’ and we’d say, ‘It’s still not working’ – but now things are starting to work.”
The downside, however, is that existing computer vision systems are typically designed and trained to do only specific tasks. “For example, you could make a system that can put boxes around vehicles and people and bicycles for a driving application, but then if you wanted it to also put boxes around motorcycles, you would have to change the code and the architecture and retrain it,” he said.
The GRIT researchers wanted to figure out how to build systems that are more like people, in the sense that they can learn to do a whole host of different kinds of tasks. “We don’t need to change our bodies to learn how to do new things,” he said. “We want that kind of generality in AI, where you don’t need to change the architecture, but the system can do lots of different things.”
Benchmark will advance computer vision field
The large computer vision research community, in which tens of thousands of papers are published each year, has seen an increasing amount of work on making vision systems more general, he added, with many groups reporting results on the same benchmarks.
The researchers say they hope to create a workshop around the GRIT benchmark and announce it at the 2022 Conference on Computer Vision and Pattern Recognition, June 19-20. “Hopefully, that will encourage people to submit their methods, their new models, and evaluate them on this benchmark,” said Gupta. “We hope that within the next year we will see a significant amount of work in this direction and quite a bit of performance improvement from where we are today.”
Because of the growth of the computer vision community, there are many researchers and industries that want to advance the field, said Hoiem.
“They are always looking for new benchmarks and new problems to work on,” he said. “A good benchmark can shift a large focus of the field, so this is a great venue for us to lay down that challenge and to help motivate the field, to build in this exciting new direction.”