They say a picture is worth a thousand words. But an image can’t “speak” to people who are blind or partially sighted without a little help. In a world driven by visual imagery, especially online, that creates a significant barrier. The good news is that when screen readers (software that reads webpage content aloud to blind and low-vision, or BLV, users) come across an image, they read out the “alt text” description that the website creator added to the underlying HTML code, making the image accessible. The bad news: few images come with proper alt text descriptions.
In fact, according to one study, fewer than 6% of English-language Wikipedia images have alt-text descriptions. And even when websites do provide descriptions, they may be of little help to the BLV community. Imagine, for example, alt-text descriptions that merely list the photographer’s name, the image’s filename, or a few keywords added to make searching easier. Or imagine a home button that’s shaped like a house but has no alt text that says “Home.”
Due to missing or unhelpful image descriptions, members of the BLV community are often locked out of valuable social media interactions or unable to access important information on websites that use images for site navigation or to convey meaning.
While we should encourage better tools and interfaces that prompt people to make images accessible, society’s failure to date to provide useful and accessible alt text descriptions for every image on the web points to the potential for an AI solution, says Elisa Kreiss, a graduate student in linguistics at Stanford University and a member of the Stanford Natural Language Processing Group. But natural language generation (NLG) image descriptions have not yet proven beneficial to the BLV community. “There’s a disconnect between the models we have in computer science that are supposed to generate text from images and what actual users find useful,” she says.
In a new paper, Kreiss and her co-authors (including researchers from Stanford, Google Brain, and Columbia University) found that BLV users prefer image descriptions that take context into account. Since context can dramatically change the meaning of an image (a photo of a soccer player reads very differently in a Nike ad than in a story about traumatic brain injury), contextual information is critical to writing useful alt text descriptions. Yet existing image description quality metrics do not take context into account. These metrics therefore steer the development of NLG image descriptions in a direction that does not improve image accessibility, Kreiss says.
Read the paper, “Context Matters for Image Descriptions for Accessibility: Challenges for Referenceless Evaluation Metrics”
Kreiss and her team also found that BLV users prefer longer alt-text descriptions, rather than the succinct descriptions typically recommended by prominent accessibility guidelines, a result contrary to expectations.
These results, Kreiss says, underscore the need not only for new ways to train sophisticated language models but also for new ways to evaluate them, to ensure they meet the needs of the communities they are developed to serve.
Measuring the usefulness of image descriptions in context
Computer scientists have long assumed that image descriptions should be objective and context-agnostic, Kreiss says, but research in human-computer interaction shows that BLV users tend to prefer descriptions that are both subjective and contextual. “If the dog is cute or the sunny day is beautiful, the description might need to say that, depending on the context,” she says. And if a picture appears on a shopping website versus a news blog, its alt text description should reflect that context to make its meaning clear.
However, existing metrics for assessing the quality of image descriptions focus on whether a description appropriately fits the image, regardless of the context in which it appears, Kreiss says. For example, current metrics might rate highly a description of a soccer team photo that reads “a soccer team plays on a field” whether it accompanies an article about collaboration (in which case the alt text should say something about how the team works together), a story about the athletes’ unusual hairstyles (in which case the hairstyles should be described), or a report on the spread of advertising in soccer stadiums (in which case the ads around the field could be mentioned). If image descriptions are to better serve the needs of BLV users, they need to be more context-aware, Kreiss says.
To examine the importance of context, Kreiss and her colleagues recruited workers on Amazon Mechanical Turk to write image descriptions for 18 images, each of which appeared in three different Wikipedia articles. In addition to the soccer example cited above, the dataset included images such as a church tower linked to articles on roofs, building materials, and Christian crosses; and a view of a mountain range and lake paired with articles on montane ecosystems, bodies of water, and orogeny (the process by which mountains form). The researchers then showed the images to both sighted and BLV study participants and asked them to rate each description’s overall quality, its envisionability (how well it helped them visualize the image), its relevance (how well it captured relevant information), its irrelevance (how much irrelevant information it added), and overall “fit” (how well the image fits the article).
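For readers who think in code, the collected judgments amount to one record per image, article, and description combination, scored along the dimensions just listed. The sketch below is only an illustrative restatement of that schema; the class and field names are hypothetical, not taken from the paper’s released data.

```python
from dataclasses import dataclass

@dataclass
class DescriptionRating:
    """One participant's judgment of an image description shown in a specific
    article context (class and field names are hypothetical, for illustration)."""
    image_id: str
    article_id: str          # the same image appears in three different articles
    description: str         # the alt text being judged
    overall_quality: float   # overall quality of the description
    envisionability: float   # how well it helps the rater visualize the image
    relevance: float         # how well it captures relevant information
    irrelevance: float       # how much irrelevant information it adds
    image_fit: float         # how well the image fits the article
```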
The study found that ratings from BLV and sighted participants were highly correlated. Knowing that the two groups align in their assessments will help in the development of future NLG systems for generating image descriptions, Kreiss says. “The perspectives of people in the BLV community are important, but often during system development we need much more data than we can get from the low-incidence BLV population.”
Another insight: context matters. Participants’ ratings of the overall quality of an image description closely matched their ratings for relevance.
As for description length, BLV participants rated longer descriptions more highly than sighted participants did, a result that surprised Kreiss and warrants further research. “Users’ preference for shorter or longer image descriptions may also depend on context,” she notes. Figures in scientific papers, for example, might merit longer descriptions.
Steering towards better metrics
Kreiss hopes her team’s research will drive image description quality metrics that better serve the needs of BLV users. In their paper, she and her colleagues found that two current methods (CLIPScore and SPURTS) fail to capture context. CLIPScore, for example, provides only a compatibility score for an image and its description, while SPURTS evaluates the quality of the descriptive text without reference to the image. While these metrics can assess the truthfulness of an image description, that is just a first step toward generating “useful” descriptions, which also requires relevance (i.e., context dependence), says Kreiss.
It was therefore not surprising that CLIPScore’s ratings of the image descriptions in the researchers’ dataset did not correlate with the ratings of either the BLV or the sighted participants: CLIPScore rated a description’s quality essentially the same regardless of context. When the team added the text of the various Wikipedia articles to the way CLIPScore is calculated, the correlation with human ratings improved somewhat, a proof of concept, Kreiss says, that referenceless evaluation metrics can be made context-aware. She and her team are now working to create a metric that takes context into account from the start, to make descriptions more accessible and responsive to the community of people they are designed to serve.
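To make the distinction concrete, here is a minimal sketch in Python of a context-blind, CLIPScore-style compatibility score and one naive way to fold in article context before comparing metric scores with human ratings. It is not the paper’s implementation: the openai/clip-vit-base-patch32 checkpoint, the strategy of prepending a slice of the article text to the description, and the placeholder variables images, descriptions, articles, and human_ratings are all assumptions made for illustration.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hypothetical checkpoint choice for illustration.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_compatibility(image: Image.Image, text: str) -> float:
    """Cosine similarity between CLIP's image and text embeddings.
    This is the context-blind setup: the surrounding article never enters."""
    inputs = processor(text=[text], images=image, return_tensors="pt",
                       padding=True, truncation=True)  # CLIP truncates long text
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())

def contextual_compatibility(image: Image.Image, description: str,
                             article_text: str) -> float:
    """One naive way to make the score context-aware: score the description
    together with a slice of the article it appears in (an illustrative
    choice, not the method used in the paper)."""
    return clip_compatibility(image, article_text[:300] + " " + description)

# Hypothetical evaluation loop: compare metric scores against averaged human
# quality ratings for the same (image, article, description) triples.
# from scipy.stats import spearmanr
# scores = [contextual_compatibility(img, desc, art)
#           for img, desc, art in zip(images, descriptions, articles)]
# rho, _ = spearmanr(scores, human_ratings)
# print(f"Spearman correlation with human ratings: {rho:.2f}")
```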
“We want to work towards metrics that can lead us to success in this very important social area,” says Kreiss. “If we don’t start with the right metrics, we won’t drive progress in the direction we want to go.”
“Context Matters for Image Descriptions for Accessibility: Challenges for Referenceless Evaluation Metrics” was accepted at the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP). Co-authors include Cynthia Bennett, Senior Research Scientist in Google’s People + AI Research Group; Shayan Hooshmand, student and NLP researcher at Columbia University; Stanford computer science graduate student Eric Zelikman; Google Brain Principal Scientist Meredith Ringel Morris; and Stanford Professor of Linguistics Christopher Potts.
Stanford HAI’s mission is to advance AI research, education, policy and practice to improve the human condition.