Have you ever witnessed a puppy breaking out of a dragon egg? What about a picture of a cyberpunk, airship-filled city with greenery spilling over balconies? Or a photo of two robots fighting in an octagon? All of these images, and many more, are possible thanks to AI-generated imagery.
These scenes may seem impossible, yet they can be created through text-to-image generation, a cutting-edge form of machine learning. These models are capable of producing detailed, photorealistic images from a simple text prompt.
Scientists and engineers at Google Research have been investigating text-to-image generation using a range of AI algorithms. Following extensive testing, they recently introduced two new text-to-image models, Imagen and Parti. Both are capable of producing photorealistic images, but they do so in different ways. We would like to talk a little more about how these models work and what they can do.
How Does Text-to-Image Generation (AI-Generated Images) Work?
People use text-to-image models to create visuals that match the provided text descriptions as closely as possible. A prompt like "fractal building architecture made of cross-laminated timber peeking above the clouds" is an example of the more complicated details, interactions, and descriptive indicators used in AI-generated images.
In recent years, ML models have been trained on large image datasets with accompanying textual descriptions, leading to better images and a wider variety of descriptions. This has produced important developments in the field, such as OpenAI's DALL-E 2 or Midjourney, which was used for the illustrations in this article.
How Do Imagen and Parti Work?
Both Imagen and Parti build on earlier models. Transformer models, which can analyze how words relate to one another within a sentence, are the basis of how these text-to-image models represent text. Both also employ a novel method that helps produce visuals that more closely match the text description. Although Imagen and Parti use comparable technology, they pursue different but complementary approaches.
Imagen is a diffusion model for AI-generated images that learns how to create pictures from a pattern of random dots. These images start out at a low resolution and are gradually upscaled. Diffusion models have recently enabled advances in image and audio tasks such as improving image resolution, recoloring black-and-white photos, modifying specific areas of an image, uncropping images, and text-to-speech synthesis.
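The core idea of starting from random dots and refining them step by step can be illustrated with a toy sketch. This is not Imagen's actual implementation: a real diffusion model uses a trained neural network to predict and remove noise at each step, whereas here a simple nudge toward a fixed target stands in for that network.

```python
import numpy as np

def toy_denoise_step(noisy, target, strength=0.1):
    """One conceptual denoising step: nudge the image toward the target.
    In a real diffusion model, a neural network predicts the noise to remove."""
    return noisy + strength * (target - noisy)

def generate(target, steps=50, seed=0, size=(8, 8)):
    """Start from pure random noise (a 'pattern of random dots') and
    iteratively refine it into an image."""
    rng = np.random.default_rng(seed)
    img = rng.standard_normal(size)
    for _ in range(steps):
        img = toy_denoise_step(img, target)
    return img

# Hypothetical target standing in for "the image implied by a text prompt".
target = np.ones((8, 8))
result = generate(target)
print(np.abs(result - target).mean())  # small residual error after refinement
```

Each iteration shrinks the remaining noise by a constant factor, which is why the output converges toward a coherent image; real diffusion models do the same over tens to hundreds of learned denoising steps.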
In Parti's method, images are first transformed into sequences of code entries that resemble puzzle pieces. A new image is then produced by translating a given text prompt into these code entries. This method is essential for handling long, complicated text prompts and producing high-quality images, because it builds on existing research and infrastructure for large language models such as PaLM.
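The "puzzle piece" idea can be sketched as follows. This is a deliberately tiny stand-in: the two-entry codebook, the patch shapes, and the `prompt_to_tokens` rule are all invented for illustration, whereas Parti uses a learned codebook with thousands of entries and a large transformer that emits image tokens autoregressively.

```python
import numpy as np

# Toy codebook: each discrete token ID maps to a small image patch
# (a "puzzle piece"). Parti learns such a codebook from real images.
CODEBOOK = {
    0: np.zeros((2, 2)),  # dark patch
    1: np.ones((2, 2)),   # bright patch
}

def prompt_to_tokens(prompt):
    """Stand-in for the autoregressive model: in Parti, a large
    transformer translates the text prompt into image tokens."""
    return [1, 0, 0, 1] if "checker" in prompt else [0, 0, 0, 0]

def tokens_to_image(tokens, grid=(2, 2)):
    """Assemble the patches in row-major order, mirroring how a
    token sequence is decoded back into a full picture."""
    h, w = grid
    rows = [
        np.hstack([CODEBOOK[tokens[r * w + c]] for c in range(w)])
        for r in range(h)
    ]
    return np.vstack(rows)

img = tokens_to_image(prompt_to_tokens("a checker pattern"))
print(img.shape)  # (4, 4)
```

Because the image is represented as a flat sequence of discrete tokens, the same training machinery used for large language models can be reused to generate pictures, which is the point the paragraph above makes about PaLM.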
These models used for AI-generated images still have notable drawbacks. For instance, neither can reliably place objects according to precise spatial descriptions (such as "a red sphere to the left of a blue block with a yellow triangle on it") or render exact counts of objects (such as "ten apples"). Additionally, as prompts grow more complicated, the models start to fall short, either omitting details or adding details that were not in the prompt.
These behaviors stem from several shortcomings, including a lack of explicit training material, inadequate data representation, and a lack of 3D awareness. The researchers aim to close these gaps through broader representations and better integration into the text-to-image generation process.