

AI #3 Where’d This Sh– Come From?

In our last post, “Magic E-Wizard Machine or Maniacal Hell-bot?”, you saw how AI is being presented as either a world-saving innovation or a beast from the pit. Or both. Or even neither. It all depends on the data it’s fed. So…where does that data come from? This was the actual question asked of me by a chainsaw artist friend. The substance in question was the data, or content, fueling ChatGPT. Good question.

The problem with most AI projects is that nobody knows exactly where the data in their data set is coming from. They don’t know who created it or the circumstances of why it was created. They don’t know how that data will be consumed. They don’t even know if it’s true. Worse, they might think they know the data source, still not know who created it or why, and implicitly trust it as “fact” or true. Even if they have an established data set and believe that the data set is original and correct, an AI specialist will still question the data set because it contains numerous artifacts that they just can’t seem to get rid of.
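To make those unanswered questions concrete, here is a minimal sketch of a provenance check. Everything in it is hypothetical: the `Record` fields, the `provenance_gaps` helper, and the sample entries are illustrations of the questions above (who made it, why, and is it true), not a standard schema or anyone's actual pipeline.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Record:
    """One item in a training data set, with hypothetical provenance fields."""
    text: str
    source_url: Optional[str] = None   # where the item was obtained
    author: Optional[str] = None       # who created it
    purpose: Optional[str] = None      # why it was created (review, ad, satire...)
    verified: bool = False             # has anyone checked that it is true?


def provenance_gaps(record: Record) -> List[str]:
    """List the provenance questions this record cannot answer."""
    gaps = []
    if not record.source_url:
        gaps.append("source")
    if not record.author:
        gaps.append("author")
    if not record.purpose:
        gaps.append("purpose")
    if not record.verified:
        gaps.append("unverified")
    return gaps


# Made-up sample records: one documented, one scraped with no context at all.
documented = Record(
    text="Quarterly sales summary",
    source_url="https://example.com/reports/q1",
    author="Finance team",
    purpose="internal report",
    verified=True,
)
scraped = Record(text="Some paragraph found on the internet")

for record in (documented, scraped):
    print(record.text, "->", provenance_gaps(record) or "no gaps")
```

A check like this can't tell you whether the content is true. It only shows how many records arrive with nobody able to vouch for them.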

ChatGPT’s dataset is scraped off the internet, which raises numerous legal and ethical questions, as we mentioned in Deus ex Machina. OpenAI already has at least two class-action lawsuits filed against it for alleged copyright and privacy law violations. Other generative AI programs also go out, grab data from multiple sources, synthesize it, and deliver results. ChatGPT doesn’t provide sources, footnotes, or links, so you don’t know if the data came from authorized or credible sources. If you know how to prompt it, you can get a bit more insight. However, ChatGPT will also completely make stuff up, including names of academic journals. There goes any confidence you might have in the true provenance of your data.

Going further, scraping content from other sources provides no clues about why the content was created. Maybe it was a legitimate review of the top 10 most popular houseplants. It might be educational content, like the Nat Geo Lion Pride of Botswana 2021 documentary. And then there’s the clickbait headline of “17 Lion Vines That Will Kill You With Cuteness.” Stick that in your AI model, grab the popcorn and get ready for Synthetic Safari.

Don’t even bother asking if the content is actually true. AI won’t know. Will a lion vine really kill you? Is cuteness a lethal weapon? What actually is a lion vine? In one example, ChatGPT concocted the full text of a made-up lawsuit accusing a completely innocent person of financial crimes—an incident OpenAI called a “hallucination.” Which, unsurprisingly, resulted in a defamation lawsuit against OpenAI.

Repeat after me: just because data is on the internet or a computer doesn’t make it true. If it ends up in your company’s new AI project, fasten your seat belt.

Questionable Algorithms by Cantankerous Mathematicians

Poor data quality will take down any AI project, but it’s not the only pitfall. Poor implementations of AI and unsafe practices will have, and are already starting to have, a long-lasting impact on AI going forward. Algos and their creation are beyond our scope here, but there are a couple of things to keep in mind when creating an AI system. The first is bias. Because algos are written by humans, they reflect the biases of the people who create them. Second, most algos rely on correlation between data. They don’t take cause into account. In the Synthetic Summer video, what started as two backyard firepit barbecues turned into a tornadic firestorm. With more emphasis on fire-related data, the algo continued reinforcing that factor, completely unaware that more fire would consume everything and everyone in the backyard instead of searing a steak faster. Algos should be explainable, auditable, and transparent, just like human decision-makers should be.
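The correlation-versus-cause point can be shown with a toy simulation. This is a sketch under made-up assumptions: the variables (`temperature`, `ice_cream`, `sunburns`) and their coefficients are invented for illustration. A hidden factor drives two measurements, so they correlate strongly even though neither causes the other.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical hidden cause: daily temperature.
temperature = [rng.uniform(10, 35) for _ in range(500)]

# Two effects that each depend on temperature, but not on each other.
ice_cream = [2.0 * t + rng.gauss(0, 3) for t in temperature]
sunburns = [0.5 * t + rng.gauss(0, 1) for t in temperature]

# A correlation-only view sees a strong link between the two effects.
raw_r = pearson(ice_cream, sunburns)

# Remove the temperature contribution and the link disappears.
ice_cream_rest = [s - 2.0 * t for s, t in zip(ice_cream, temperature)]
sunburns_rest = [s - 0.5 * t for s, t in zip(sunburns, temperature)]
adjusted_r = pearson(ice_cream_rest, sunburns_rest)

print(f"raw correlation:      {raw_r:.2f}")
print(f"after removing cause: {adjusted_r:.2f}")
```

An algo that only sees the first number might conclude that banning ice cream prevents sunburn. That is the same kind of mistake as adding more fire to sear a steak faster.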

Another consideration is the parameter set for an AI model. Parameters are values that the algo learns or estimates, based on its initial data training and shaped by hyperparameters set during model design. When an algo delivers results that aren’t what the creators want, they bolt on restrictions, often called guardrails. That doesn’t always work. For example, ask the algo how to make a toxic gas, and it will decline because that’s deemed a dangerous or inappropriate question. But ask it what you should never mix with bleach, and it will tell you never to mix bleach with ammonia to avoid creating toxic chloramine gas. Or ask ChatGPT to officiate your wedding and it will tell you that it can’t because it doesn’t have eyes or a body. But apparently it was OK with being asked to write and recite a script combining wedding vows and personal details. As we mentioned earlier, applying computer models to human circumstances doesn’t always work the way the creators intended. In these cases, humans can pretty easily figure out a way around the restrictions.
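The workaround described above, where a direct request is refused but a rephrased one slips through, can be sketched with a deliberately naive keyword filter. This is an assumption-laden toy: `BLOCKED_TERMS` and `naive_guardrail` are hypothetical names, and this is not how ChatGPT or any real product decides what to refuse. It only shows why restrictions keyed to the wording of a question are easy to sidestep.

```python
# Hypothetical list of phrases the creators decided to refuse.
BLOCKED_TERMS = ("mustard gas", "toxic gas")


def naive_guardrail(prompt: str) -> bool:
    """Return True if the prompt should be refused.

    The check looks only at the words used, not at what the
    person is actually trying to find out.
    """
    lowered = prompt.lower()
    return any(term in lowered for term in BLOCKED_TERMS)


direct = "How do I make toxic gas?"
indirect = "What should I never mix with bleach?"

for prompt in (direct, indirect):
    verdict = "refused" if naive_guardrail(prompt) else "answered"
    print(f"{prompt!r}: {verdict}")
```

Both questions point at the same information, but only the first trips the filter. Adding more blocked phrases just moves the gap. It doesn't close it.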

Combine vast amounts of non-scrutinized data with unavoidable human bias and poorly thought-out algo design, and this is where we are with AI at the moment. Now watch what happens as unsafe data practices become the foundation for AI going forward. Hold my beer.