In our Where’d This Sh** Come From post, you saw why data provenance is critical to a quality data set and the impact of human bias and poor algo implementation. Now we’ll see what happens when AI loses its mind.
The first rule that everyone learns when beginning to program is “garbage in, garbage out.” AI is entirely dependent on the data it receives. Because the training data sets for generative AI models tend to be sourced from the internet, today’s AI models are being trained on increasing amounts of AI-synthesized data. Content such as text and images that used to be created only by humans is now being created by AI models. It’s often faster, cheaper, and easier to use synthetic data in a range of applications. And as deep learning models become gargantuan in size, we’re running out of genuine human-generated data of the right type to support specific applications. The problem is that there is often no indication of whether a given piece of data is synthesized or original. And once the fundamental bedrock of a dataset is built, untangling it will be all but impossible.
A recent paper written by collaborating computer engineers from Stanford and Rice Universities, Self-Consuming Generative Models Go MAD, explores what happens over time when synthetic data is used to train new AI models. Reusing synthetic data in successive generations of models creates different types of autophagous (self-consuming) loops that trade off quality (precision) against diversity (the variety of results), depending on how much fresh data versus synthesized data is used. In a nutshell, generative AI models trained solely, or even mostly, on synthesized data will degrade over time. If you don’t ensure that the model receives enough real, correct data, the AI program will suffer from non compos mentis, also known as an unsound mind.
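You can see the flavor of this degradation with a toy simulation. This is not the paper’s experiment, just a hedged illustration: stand in for a generative model with a one-dimensional Gaussian, repeatedly fit it to its own purely synthetic output with no fresh data, and watch diversity (the standard deviation) collapse over generations.

```python
import random
import statistics

def next_generation(data, n):
    """'Train' a toy model (fit a Gaussian to the data),
    then emit n purely synthetic samples from it."""
    mu = statistics.mean(data)
    sigma = statistics.stdev(data)
    return [random.gauss(mu, sigma) for _ in range(n)]

random.seed(0)
# A small pool of scarce "real" data drawn from the true distribution.
real = [random.gauss(0.0, 1.0) for _ in range(10)]

# Fully synthetic autophagous loop: each generation trains only on
# the previous generation's output, never on fresh real data.
data = real
for _ in range(500):
    data = next_generation(data, 10)

print("real std:     ", round(statistics.stdev(real), 4))
print("synthetic std:", statistics.stdev(data))
```

Each refit introduces a small estimation error, and with no fresh data to correct it, those errors compound: the synthetic distribution’s spread shrinks toward nothing, which is the diversity loss the paper describes. Mixing real data back in at each generation slows or arrests the collapse.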
Keep in mind that bias and parameter choices still shape how users mix synthesized and fresh data. According to the study, generative model users tend to cherry-pick their synthetic data, preferring high-quality samples. They can also tune parameters to boost quality at the expense of diversity, or vice versa. The full impact of sampling bias is too complicated to cover here, but the paper’s authors are happy to walk you through it.
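The cherry-picking effect can be sketched the same way. In this hypothetical illustration (again, not the paper’s actual experiment), each generation keeps only the synthetic samples closest to the mode, a crude stand-in for “preferring high-quality samples,” which drives diversity to zero even faster than the plain loop:

```python
import random
import statistics

random.seed(1)
data = [random.gauss(0.0, 1.0) for _ in range(200)]

for _ in range(10):
    # Fit the toy Gaussian "model" to the current data.
    mu, sigma = statistics.mean(data), statistics.stdev(data)
    # Generate a large synthetic batch from the fitted model.
    synthetic = [random.gauss(mu, sigma) for _ in range(1000)]
    # Cherry-pick: keep only the 200 samples nearest the mode,
    # a crude proxy for selecting the "highest-quality" outputs.
    data = sorted(synthetic, key=lambda x: abs(x - mu))[:200]

print("std after cherry-picked generations:", statistics.stdev(data))
```

Truncating to the densest region shrinks the spread by a large factor every generation, so the quality/diversity trade-off compounds: each round looks “cleaner,” but the distribution forgets everything outside its center.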
What does this mean for companies considering, or already heading down, an AI road? Expect a lot of unintended consequences if you rely on synthesized data. The paper’s authors offer (kind of) tongue-in-cheek advice: “Practitioners who are deliberately using synthetic data for training because it is cheap and easy can take our conclusions as a warning and consider tempering their synthetic data habits, perhaps by joining an appropriate 12-step program. Those in truly data-scarce applications can interpret our results as a guide to how much scarce real data is necessary to avoid MADness in the future.”
Although the benevolent, magic e-wizard AI machine is looking a little less shiny now, the good news is that we can keep the maniacal hell-bot from rearing its ugly head. In other words, you can avoid wasting a lot of time and money simply by verifying the characteristics of your dataset before you start down the AI road. Thank goodness, it’s not as hard as it sounds. Flying Cloud CrowsNest can make it much easier.