There is a hot new trend in AI: text-to-image generators. Feed these programs any text you like and they will generate remarkably accurate images that match the description. They can handle a range of styles, from oil paintings to CGI renders and even photographs, and, although it sounds clichéd, in many ways the only limit is your imagination.
So far, the leader in the field has been DALL-E, a program created by the commercial AI lab OpenAI (and updated as recently as April). Yesterday, however, Google announced its own take on the genre, Imagen, which the company says comfortably beats DALL-E in the quality of its output.
The best way to understand the remarkable ability of these models is simply to look at some of the images they can generate. There are some generated by Imagen above and more below (you can see further examples on Google's dedicated landing page).
In each case, the text at the bottom of the image is the prompt fed into the program, and the image above it is the output. To emphasize: that's all it takes. You type what you want to see and the program generates it. Pretty fantastic, right?
But while these images are unquestionably impressive in their coherence and accuracy, they should also be taken with a pinch of salt. When research teams like Google Brain release a new AI model, they tend to cherry-pick the best results. So, although these images all look perfectly polished, they may not represent the average output of the Imagen system.
Often, images generated by text-to-image models look unfinished, smeared, or blurry: problems we've seen with images generated by OpenAI's DALL-E program. (For more on the trouble spots for text-to-image systems, take a look at this interesting Twitter thread digging into the problems DALL-E has. It highlights, among other things, the system's tendency to misunderstand prompts and to struggle with both text and faces.)
However, Google claims that Imagen produces consistently better images than DALL-E 2, based on a new benchmark it created for this project, called DrawBench.
DrawBench is not a particularly complex metric: it is essentially a list of around 200 text prompts that Google's team fed into Imagen and other text-to-image generators, with each program's output then judged by human raters. As shown in the graphs below, Google found that humans generally preferred Imagen's output to that of its rivals.
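To make that setup a little more concrete, here is a minimal sketch of how a DrawBench-style pairwise comparison could be tallied. The prompt list, the generator functions, and the collect_human_vote helper are all hypothetical placeholders invented for illustration; Google has not published its evaluation code, so this only illustrates the idea of feeding the same prompts to two systems and counting which output human raters prefer.

```python
from collections import Counter

# Hypothetical prompts standing in for the ~200 DrawBench prompts.
PROMPTS = [
    "A small cactus wearing a straw hat and neon sunglasses in the Sahara desert",
    "A robot couple fine dining with the Eiffel Tower in the background",
    # ... roughly 200 prompts in the real benchmark
]

def collect_human_vote(prompt: str, image_a: str, image_b: str) -> str:
    """Placeholder: in a real study a human rater views both images and
    returns 'A', 'B', or 'tie' based on fidelity and prompt alignment."""
    raise NotImplementedError("Requires a human-in-the-loop rating interface")

def run_pairwise_eval(generate_a, generate_b) -> Counter:
    """Feed every prompt to both generators and tally rater preferences."""
    tally = Counter()
    for prompt in PROMPTS:
        image_a = generate_a(prompt)  # e.g. a path or URL to the generated image
        image_b = generate_b(prompt)
        tally[collect_human_vote(prompt, image_a, image_b)] += 1
    return tally
```

The final tally (how often raters picked system A over system B, or called it a tie) is the kind of preference rate plotted in the graphs Google shared.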
However, it will be difficult to judge this for ourselves, as Google is not making the Imagen model available to the public. There's good reason for this, too. Although text-to-image models certainly have fantastic creative potential, they also have a range of troubling applications. Imagine a system that generates pretty much any image you like being used for fake news, hoaxes, or harassment, for example. As Google notes, these systems also encode social biases, and their output is often racist, sexist, or toxic in some other inventive fashion.
Much of this is due to how these systems are trained. Essentially, they learn from huge amounts of data (in this case: many pairs of images and captions), which they study for patterns and learn to reproduce. But these models need a lot of data, and most researchers, even those working for well-funded tech giants like Google, have decided it is too onerous to comprehensively filter this input. So they scrape huge amounts of data from the web, and their models end up ingesting (and learning to reproduce) all the hateful bile you'd expect to find online.
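As a rough illustration of the pipeline described above, the sketch below pairs scraped images with their captions and applies only a token filtering step before they would be used for training. The data structure, the blocklist, and the is_safe heuristic are all invented for illustration; real web-scale datasets (and whatever filtering labs actually apply) are vastly larger and more involved, which is exactly why so much harmful material slips through.

```python
from dataclasses import dataclass

@dataclass
class TrainingExample:
    image_url: str   # location of the scraped image
    caption: str     # the alt-text or caption found alongside it

def is_safe(caption: str) -> bool:
    """Toy keyword filter; doing this comprehensively at web scale is what
    researchers generally consider too onerous to do by hand."""
    blocklist = {"placeholder_slur", "placeholder_explicit_term"}  # illustrative only
    return not any(word in caption.lower() for word in blocklist)

def build_dataset(scraped_pairs):
    """Keep only pairs that pass the (very leaky) filter. Anything the filter
    misses is learned by the model and can be reproduced in its output."""
    return [
        TrainingExample(url, caption)
        for url, caption in scraped_pairs
        if is_safe(caption)
    ]
```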
As Google's researchers summarize this problem in their paper: "[T]he large scale data requirements of text-to-image models […] have led researchers to rely heavily on large, mostly uncurated, web-scraped datasets […] Dataset audits have revealed these datasets tend to reflect social stereotypes, oppressive viewpoints, and derogatory, or otherwise harmful, associations to marginalized identity groups."
In other words, the well-worn adage of computer scientists still holds in the whizzy world of AI: garbage in, garbage out.
Google doesn't go into too much detail about the troubling content generated by Imagen, but notes that the model "encodes several social biases and stereotypes, including an overall bias towards generating images of people with lighter skin tones and a tendency for images portraying different professions to align with Western gender stereotypes."
This is something researchers have also found while evaluating DALL-E. Ask DALL-E to generate images of a "flight attendant," for example, and almost all the subjects will be women. Ask for pictures of a "CEO," and, surprise, surprise, you get a bunch of white men.
That's why OpenAI also decided not to release DALL-E publicly, though the company does give access to select beta testers. It also filters certain text inputs to stop the model being used to generate racist, violent, or pornographic images. These measures go some way to limiting potential harmful applications of the technology, but the history of AI tells us that such text-to-image models will almost certainly become public at some point in the future, with all the troubling implications that wider access brings.
Google's own conclusion is that Imagen is "not suitable for public use at present," and the company says it plans to develop a new way to benchmark "social and cultural bias in future work" and to test future iterations. For now, though, we'll have to settle for the company's upbeat selection of images: raccoon royalty and cacti wearing sunglasses. That's just the tip of the iceberg, though. An iceberg made from the unintended consequences of technological research, if Imagen wants to try generating that.