We observed that our internal predecessors to DALL·E 2 would sometimes reproduce training images verbatim. This behavior was undesirable, since we would like DALL·E 2 to create original, unique images by default and not just “stitch together” pieces of existing images. In addition, reproducing training images verbatim can raise legal questions around copyright infringement, ownership, and privacy (if photos of individuals were present in the training data).
To better understand the problem of image regurgitation, we collected a dataset of prompts that frequently resulted in duplicated images. To do this, we used a trained model to sample images for 50,000 prompts from our training dataset and sorted the samples by perceptual similarity to the corresponding training image. Finally, we inspected the top matches by hand, finding only a few hundred true duplicate pairs out of the 50,000 total prompts. Although the regurgitation rate appeared to be less than 1%, we felt it was necessary to push the rate down to 0 for the reasons stated above.
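The ranking step above can be sketched as follows. This is a minimal illustration, not the pipeline actually used: the text does not say which perceptual similarity measure was applied, so cosine similarity over per-image feature vectors is an assumption, and the names `cosine_similarity` and `rank_candidates` and the toy vectors are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_candidates(sample_embeddings, train_embeddings):
    """Pair each sample with the training image for the same prompt and
    return (prompt_index, similarity) tuples, most similar first."""
    scored = [
        (index, cosine_similarity(sample, train))
        for index, (sample, train) in enumerate(zip(sample_embeddings, train_embeddings))
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored

# Toy data: one hypothetical 2-D feature vector per prompt.
samples = [[0.0, 1.0], [1.0, 0.0], [0.6, 0.8]]
training = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
ranked = rank_candidates(samples, training)
# The head of `ranked` is what a human reviewer would inspect first.
```

Sorting first means manual inspection can stop once the top of the list no longer contains true duplicates, which keeps the hand-review cost small relative to the 50,000 prompts.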
When we studied our dataset of regurgitated images, we noticed two patterns. First, the images were almost all simple vector graphics, which were probably easy to memorize because of their low information content. Second, and more importantly, the images all had many near-duplicates in the training dataset. For example, there might be a vector graphic that looks like a clock showing 1 o'clock, but then we would discover a training sample containing the same clock showing 2 o'clock, then 3 o'clock, and so on. Once we realized this, we used a distributed nearest neighbor search to verify that, indeed, all of the regurgitated images had perceptually similar duplicates in the dataset. Other work has observed a similar phenomenon in large language models, finding that data duplication is strongly linked to memorization.
The above finding suggested that, if we deduplicated our dataset, we might solve the regurgitation problem. To achieve this, we planned to use a neural network to identify groups of images that looked similar, and then remove all but one image from each group.[^footnote-2]
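One way to turn a list of detected near-duplicate pairs into groups and keep a single image per group is a union-find pass. This is a sketch under stated assumptions: the text does not say how groups are formed, so treating duplication as transitive (if A matches B and B matches C, all three form one group) is an assumption, and `keep_one_per_group` is a hypothetical helper name.

```python
def keep_one_per_group(num_images, duplicate_pairs):
    """Group images connected by duplicate pairs and return the indices to keep.

    Assumption: duplication is treated as transitive. The lowest index in each
    group is the one retained; every other member would be removed.
    """
    parent = list(range(num_images))

    def find(i):
        # Follow parent links to the group root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in duplicate_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Attach the larger root to the smaller so the root is the lowest index.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    return [i for i in range(num_images) if find(i) == i]

# Toy example: images 0, 1, 2 form one group, 4 and 5 another, 3 is unique.
kept = keep_one_per_group(6, [(0, 1), (1, 2), (4, 5)])
```

Which member of a group survives is arbitrary here; a real pipeline might prefer, for example, the highest-resolution copy.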
However, this would require checking, for each image, whether it is a duplicate of every other image in the dataset. Since our whole dataset contains hundreds of millions of images, we would naively need to check hundreds of quadrillions of image pairs to find all the duplicates. While this is technically within reach, especially on a large compute cluster, we found a much more efficient alternative that works almost as well at a small fraction of the cost. Consider what happens if we cluster our dataset before performing deduplication. Since nearby samples often fall into the same cluster, most of the duplicate pairs would not cross cluster decision boundaries. We could then deduplicate samples within each cluster without checking for duplicates outside of the cluster, while only missing a small fraction of all duplicate pairs. This is much faster than the naive approach, since we no longer have to check every single pair of images.[^footnote-3]
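The cluster-then-deduplicate idea can be illustrated on toy data. The sketch below is not the production method: it assumes images are represented by small feature vectors, uses a plain k-means loop as the clustering step, and treats two images as duplicates when their Euclidean distance is below a threshold. The function names, the threshold, and the synthetic points are all hypothetical.

```python
import math
import random

def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))

def kmeans_labels(points, k, iterations=10, seed=0):
    """Plain k-means (Lloyd's algorithm); returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [nearest(p, centroids) for p in points]
    for _ in range(iterations):
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # leave the centroid of an empty cluster where it is
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        labels = [nearest(p, centroids) for p in points]
    return labels

def duplicates_within_clusters(points, labels, threshold):
    """Compare only pairs that share a cluster; return (pairs, comparison count)."""
    buckets = {}
    for index, label in enumerate(labels):
        buckets.setdefault(label, []).append(index)
    pairs, comparisons = set(), 0
    for members in buckets.values():
        for x in range(len(members)):
            for y in range(x + 1, len(members)):
                comparisons += 1
                i, j = members[x], members[y]
                if math.dist(points[i], points[j]) <= threshold:
                    pairs.add((i, j))
    return pairs, comparisons

# Toy data: 100 random points plus one slightly shifted near-duplicate of each.
rng = random.Random(1)
base = [[rng.random(), rng.random()] for _ in range(100)]
points = base + [[x + 0.001, y + 0.001] for x, y in base]

# Putting every point in one cluster reproduces the naive all-pairs check.
all_pairs, naive_comparisons = duplicates_within_clusters(points, [0] * len(points), 0.005)
found, clustered_comparisons = duplicates_within_clusters(
    points, kmeans_labels(points, k=8), 0.005
)
recall = len(found) / len(all_pairs)
```

Comparing `clustered_comparisons` with `naive_comparisons` shows the saving, and `recall` shows the price: any duplicate pair split across a cluster boundary is missed.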
When we tested this approach empirically on a small subset of our data, it found 85% of all duplicate pairs when using K=1024 clusters.
To improve the success rate of the above algorithm, we took advantage of a key observation: when you cluster different random subsets of a dataset, the resulting cluster decision boundaries are often quite different. Therefore, if a duplicate pair crosses a cluster boundary in one clustering of the data, the same pair might fall inside a single cluster in a different clustering. The more clusterings you try, the more likely you are to discover a given duplicate pair. In practice, we settled on five clusterings, which means that we search for duplicates of each image in the union of five different clusters. This found 97% of all duplicate pairs on a subset of our data.
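The effect of taking the union over several clusterings can be sketched on toy data. As a crude stand-in for running k-means on different random subsets, each clustering below simply picks K random points as centroids and assigns every point to the nearest one; that simplification, the distance threshold, and all function names are assumptions made for illustration.

```python
import math
import random

def assign_to_random_centroids(points, k, seed):
    """One clustering: pick k random points as centroids (stand-in for k-means
    fitted on a random subset) and label every point by its nearest centroid."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    return [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]

def pairs_within_clusters(points, labels, threshold):
    """Near-duplicate pairs (i, j), i < j, whose members share a cluster."""
    buckets = {}
    for index, label in enumerate(labels):
        buckets.setdefault(label, []).append(index)
    pairs = set()
    for members in buckets.values():
        for x in range(len(members)):
            for y in range(x + 1, len(members)):
                i, j = members[x], members[y]
                if math.dist(points[i], points[j]) <= threshold:
                    pairs.add((i, j))
    return pairs

def pairs_from_clusterings(points, k, seeds, threshold):
    """Union of the pairs found under several different clusterings."""
    found = set()
    for seed in seeds:
        labels = assign_to_random_centroids(points, k, seed)
        found |= pairs_within_clusters(points, labels, threshold)
    return found

# Toy data: 150 random points plus one slightly shifted near-duplicate of each.
rng = random.Random(42)
base = [[rng.random(), rng.random()] for _ in range(150)]
points = base + [[x + 0.001, y + 0.001] for x, y in base]

true_pairs = pairs_within_clusters(points, [0] * len(points), 0.005)  # all-pairs reference
single = pairs_from_clusterings(points, 16, [0], 0.005)
union = pairs_from_clusterings(points, 16, [0, 1, 2, 3, 4], 0.005)
print(f"one clustering: {len(single)}/{len(true_pairs)} pairs, "
      f"five clusterings: {len(union)}/{len(true_pairs)} pairs")
```

The union can only grow as clusterings are added, so recall never decreases; the cost is that each image is compared against the members of several clusters instead of one.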
Surprisingly, almost a quarter of our dataset was removed by deduplication. When we looked at the near-duplicate pairs that were found, many of them included meaningful changes. Recall the clock example from above: the dataset might include many images of the same clock at different times of day. While these images are likely to make the model memorize that particular clock's appearance, they might also help the model learn to distinguish between times of day on a clock. Given how much data was removed, we were worried that removing images like this might hurt the model’s performance.
To test the effect of deduplication on our models, we trained two models with identical hyperparameters: one on the full dataset and one on the deduplicated version of the dataset. To compare the models, we used the same human evaluations that we used to evaluate our original GLIDE model. Surprisingly, we found that human evaluators preferred the model trained on deduplicated data, suggesting that the large number of redundant images in the dataset was actually hurting performance.
Once we had a model trained on deduplicated data, we reran the regurgitation search we had previously done over 50,000 prompts from the training dataset. We found that the new model never regurgitated a training image when given the exact prompt for that image from the training dataset. To take this test a step further, we also performed a nearest neighbor search over the entire training dataset for each of the 50,000 generated images. This way, we thought we might catch the model regurgitating a different image than the one associated with a given prompt. Even with this more thorough check, we never found a case of image regurgitation.
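The second, more thorough check can be sketched as a nearest neighbor lookup of every generated image against the whole training set. This is a brute-force illustration on toy vectors: at the scale described, a distributed or approximate search would be needed, and the feature vectors, the distance threshold, and the function names here are hypothetical.

```python
import math

def nearest_training_image(query, train_embeddings):
    """Brute-force nearest neighbor: (index, distance) of the closest training vector."""
    best = min(range(len(train_embeddings)),
               key=lambda i: math.dist(query, train_embeddings[i]))
    return best, math.dist(query, train_embeddings[best])

def flag_regurgitations(generated, train_embeddings, threshold):
    """Return (generated_index, training_index, distance) for every generated
    image whose nearest training image lies within `threshold`."""
    flagged = []
    for g_index, query in enumerate(generated):
        t_index, distance = nearest_training_image(query, train_embeddings)
        if distance <= threshold:
            flagged.append((g_index, t_index, distance))
    return flagged

# Toy data: the second generated vector sits almost on top of training vector 1.
train = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]]
generated = [[0.5, 0.4], [1.0, 1.01], [3.0, 3.0]]
flagged = flag_regurgitations(generated, train, threshold=0.05)
```

Because the search covers every training image rather than only the one paired with the prompt, it can also flag a generated image that copies some other training sample.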