Contrastive models such as CLIP have been shown to learn robust representations of images that capture both semantics and style. To exploit these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We show that explicitly generating image representations improves image diversity with minimal loss of photorealism and caption similarity. Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation. Additionally, CLIP’s joint embedding space enables language-guided image manipulations in a zero-shot fashion. We use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding the latter to be computationally more efficient and to produce higher-quality samples.
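The two-stage procedure in the abstract can be sketched with toy stand-ins. Everything here is hypothetical for illustration: the function names, the tiny 8-dimensional embeddings, and the noise model are made up; a real system would use CLIP's encoders, an autoregressive or diffusion prior, and a diffusion decoder.

```python
import numpy as np

DIM = 8  # toy embedding dimension; real CLIP image embeddings are much larger


def clip_text_embed(caption: str) -> np.ndarray:
    """Hypothetical stand-in for CLIP's text encoder: a deterministic unit vector."""
    rng = np.random.default_rng(sum(ord(c) for c in caption))
    v = rng.standard_normal(DIM)
    return v / np.linalg.norm(v)


def prior_sample(text_emb: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Stage 1 (the prior): generate a plausible CLIP *image* embedding from the
    text embedding. Modeled here as the text embedding plus noise; the paper uses
    an autoregressive or diffusion model for this step."""
    z = text_emb + 0.1 * rng.standard_normal(DIM)
    return z / np.linalg.norm(z)


def decoder_sample(image_emb: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Stage 2 (the decoder): generate an image conditioned on the image embedding.
    This stand-in returns a flat 16-value 'image'; the paper uses a diffusion decoder."""
    return np.tanh(image_emb @ rng.standard_normal((DIM, 16)))


rng = np.random.default_rng(42)
text_emb = clip_text_embed("a corgi playing a trumpet")
image_emb = prior_sample(text_emb, rng)   # the image representation is generated explicitly
image = decoder_sample(image_emb, rng)    # resampling either stage yields image variations
print(image.shape)  # (16,)
```

Because the two stages are sampled separately, re-running `decoder_sample` with a fresh `rng` but the same `image_emb` mimics how the decoder produces variations that share the embedding's semantics and style.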
In today’s technology landscape, image and text understanding are becoming increasingly important for businesses. AI is helping companies automate tasks, create new products and experiences, and improve customer relations. Ikaroa, a full stack tech company, is investigating a new approach to hierarchical text-conditional image generation that leverages deep learning and emerging transformer models.
The approach builds on CLIP (Contrastive Language-Image Pre-training), developed by OpenAI’s research team, which aligns arbitrary text snippets with arbitrary images by contrastively training transformer-based text and image encoders. CLIP latents, the image embeddings produced by a pre-trained CLIP model, serve as the intermediate representation in this hierarchical approach: a model first generates a CLIP image embedding from the given text, and a decoder then turns that embedding into a highly realistic image.
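The contrastive alignment at the heart of CLIP can be illustrated with a small numpy sketch. This is not OpenAI’s implementation: the embeddings are synthetic, and the `temperature` value stands in for CLIP’s learned logit scale. The idea is that both encoders’ outputs are L2-normalized and compared by scaled cosine similarity, so each image’s highest-scoring caption is its true match.

```python
import numpy as np


def clip_logits(image_embs: np.ndarray, text_embs: np.ndarray,
                temperature: float = 0.07) -> np.ndarray:
    """CLIP-style scoring: L2-normalize both sets of embeddings, then compute
    scaled cosine similarities. Matched pairs should dominate the diagonal."""
    i = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return (i @ t.T) / temperature


rng = np.random.default_rng(0)
text = rng.standard_normal((4, 8))            # 4 synthetic caption embeddings
image = text + 0.05 * rng.standard_normal((4, 8))  # toy: each image nearly matches its caption
logits = clip_logits(image, text)

# Each image retrieves its own caption (row-wise argmax hits the diagonal).
print(np.argmax(logits, axis=1))  # [0 1 2 3]
```

Training pushes real embeddings toward this regime: a cross-entropy loss over these logits (in both the image-to-text and text-to-image directions) rewards high diagonal similarity and penalizes the rest.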
At Ikaroa, we have been researching the potential of this hierarchical text-conditional image generation approach for use in our business and for the benefit of our customers. Unlike traditional single-stage text-to-image generation, the engines we have built use transformer models to generate images from natural language via an intermediate image embedding, which makes conditioning more intuitive. The hierarchical structure of the generation process also lets businesses quickly produce and explore variations of visual content, helping them make more informed decisions faster.
Overall, hierarchical text-conditional image generation with CLIP latents is a powerful and inspiring tool for businesses, enabling them to generate image data quickly and accurately. We hope to continue exploring these ideas and using them to further enhance our services, both for our customers and for businesses across the globe.