AI #5: Show Me What You Got

By Brian Christian September 22, 2023

It should be obvious by now that data and its quality are critical to avoiding AI algo insanity and delivering a model that can fulfill its intended purpose. But how do you even begin to assess the quality of what you’ve got? Few organizations really understand their data. Geoffrey Moore recently posted on LinkedIn a possible matrix for better data management in the age of AI, based on whether the data is structured, unstructured, curated or not. That’s a start. However, as the MAD researchers found, without enough good-quality, fresh data, generative chat programs do not actually get “better and better over time.” And no matter what data management strategy is in place, until you can assess the quality of the data itself, at the binary level, you still won’t know what’s feeding your AI.

Data doesn’t create itself, so to assess its quality, you need to know where it’s coming from, who created it, and why. AI cannot overcome problems that you already have with data quality. You have to get an accurate picture of your data as it is today before you can know if it’s usable for AI.

These are the questions that need to be answered to give you a clear picture of your data now:

What is the data provenance? Where did the data come from: persons, devices, systems, external sources, the internet? If it came from outside of the organization, where was it gathered? How did it arrive?
Is it original? If generated by a person, who? If by a device, which device? Understanding who, which group, or which device created the data provides insight into its original purpose.
What was its purpose? Was its original purpose aligned with the AI model purpose? As we mentioned previously, content created to simply generate clicks probably isn’t good data for an AI-based sales analytics purpose. AI doesn’t know click-bait isn’t real.
Is the data qualifiable? Is it complete? Is it verifiably accurate?
What is its structure? Does the data represent text, an image, video, audio, numerical values or other formats?
Does it have integrity? Is the data actual, original data or a derivative of a dataset? Has it changed from its original state, and if so, how? Is it synthetic data?
What is its destiny? Where is data allowed to move and be used? Who should be allowed to use it? What is allowed to happen to it, whether it’s acted on by people, other systems, devices, or applications?
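
One way to make these questions actionable is to treat them as fields on a record and check which ones are still blank. The following is only a minimal sketch in Python; the class name `DataAssessment`, its field names, and the sample values are assumptions made for illustration, not part of any product or standard.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class DataAssessment:
    """One record per dataset; None means 'I don't know' (hypothetical schema)."""
    provenance: Optional[str] = None         # where it came from and how it arrived
    creator: Optional[str] = None            # person, group, or device that made it
    original_purpose: Optional[str] = None   # why it was created
    verified_accurate: Optional[bool] = None # complete and verifiably accurate?
    structure: Optional[str] = None          # text, image, video, audio, numeric...
    integrity: Optional[str] = None          # original, derivative, or synthetic
    allowed_uses: Optional[List[str]] = None # where it may move and who may use it

    def unanswered(self) -> List[str]:
        """Return the names of the questions that still have no answer."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# Example: a CAD export where only the source system and format are known.
record = DataAssessment(provenance="PLM export", structure="CAD file")
print(record.unanswered())
```

Anything returned by `unanswered()` is a gap to close before the dataset is considered for an AI project.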

What’s in the Water?

Once you have the questions, you need answers.

You might know where various data streams in your organization are coming from. For example, in a Product Lifecycle Management (PLM) platform, data might be fed from a CATIA Magic system, Word files, Simulink, and CAD systems. If you know the system, you’ll have a good idea of what it’s sending. But there’s a lot you don’t know. Exactly who originally created the data? Is data in a file actually a composite of data from multiple documents or creators? You won’t know how much has changed from its original creation. The data might be tagged or classified as sensitive or regulated data, but you won’t know where else it has moved outside of the system of origin. Outside the PLM system, who has received it? Where did it go?

You can see the river of data but you really don’t know what’s in the water.

If you’re moving toward incorporating AI into business processes, visibility into data has to be crystal clear. That’s where data surveillance comes in. Data surveillance enables you to see, for the first time, any and all business-critical data at the binary level. You can identify, fingerprint, and catalog it to create a data ledger that supports enterprise initiatives and individual department objectives. Data surveillance monitors this data everywhere it goes. You can see where it’s created, how it’s consumed, who has it, where it goes, and how it changes. You’ll have a chain of data custody that enables you to gather intelligence and analyze activity in support of your AI project, as well as to support compliance, cybersecurity, IP protection, and other corporate objectives. Finally, data surveillance also defends the data. Anomalous activity is immediately flagged, identified, and quarantined or stopped. Security teams have the details and context of what happened to inform their remediation and response efforts.
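
To give a feel for what fingerprinting and a data ledger can mean in practice, here is a small Python sketch. It is not how any particular data surveillance product works; it simply assumes a SHA-256 hash of the raw bytes as the fingerprint, and the names `DataLedger`, `LedgerEntry`, `register`, and `record_event` are invented for this illustration.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List


def fingerprint(content: bytes) -> str:
    """Fingerprint data at the byte level (assumption: SHA-256 of the raw bytes)."""
    return hashlib.sha256(content).hexdigest()


@dataclass
class LedgerEntry:
    digest: str
    creator: str
    source_system: str
    events: List[str] = field(default_factory=list)  # chain of custody


class DataLedger:
    """Hypothetical in-memory catalog keyed by content fingerprint."""

    def __init__(self) -> None:
        self._entries: Dict[str, LedgerEntry] = {}

    def register(self, content: bytes, creator: str, source_system: str) -> LedgerEntry:
        digest = fingerprint(content)
        # Keep the first registration if the same bytes are seen again.
        return self._entries.setdefault(
            digest, LedgerEntry(digest, creator, source_system)
        )

    def record_event(self, content: bytes, event: str) -> bool:
        """Log a custody event; return False if these bytes were never registered."""
        entry = self._entries.get(fingerprint(content))
        if entry is None:
            return False  # unknown or altered content: worth investigating
        entry.events.append(event)
        return True


ledger = DataLedger()
entry = ledger.register(b"wing spec rev A", creator="alice", source_system="PLM")
ledger.record_event(b"wing spec rev A", "emailed to supplier")
print(entry.events)
print(ledger.record_event(b"wing spec rev B", "uploaded"))  # altered content
```

Because the key is derived from the bytes themselves, a file that has been changed no longer matches its ledger entry, which is the kind of signal a chain of custody depends on.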

Don’t Poison the Pond

Answering “I don’t know” to questions about data quality will increasingly become an unacceptable response. Until you can validate data creators, consumers, provenance, movement, and purpose of the binaries in your data set, your organization can unwittingly be placed at risk. There are dozens of articles from academics, think tanks, and industry analysts focusing on the ethics behind how and why AI models should be used. Our focus here is more fundamental. Just because you can find and use data in your AI model doesn’t mean it’s the right thing to do.

Large amounts of synthetic data will poison your pond. We previously mentioned incorporating synthetic data in AI models. If you don’t have a way to tell whether data is fresh or synthetic, you’re almost certainly poisoning your pond—you just won’t know how much, and as a result, you won’t be able to control your results.

Use of legally protected data is another consideration. You must be able to clearly recognize personally identifiable information (PII), medical data, financial data, and other regulated data types before you can decide if they should be used. The universe of content created by individuals and organizations is also protected by copyright, trademark, and other intellectual property laws. Art, photography, music, literature, blog posts, patents, code, and similar works are owned by their creators and legally recognized as intellectual property (IP). One argument says that AI really just “examines” these items and doesn’t copy them exactly to come up with new content. However, other entities (media, corporations) that “examine” and incorporate IP into their finished work must pay the content creator for that right.
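
As a rough illustration of what recognizing regulated data can look like, the Python sketch below scans text for two easy-to-spot patterns, email addresses and US Social Security numbers. The patterns and the function name `flag_regulated` are assumptions for this example; real PII, medical, and financial data detection needs far broader coverage and validation than two regular expressions.

```python
import re
from typing import Dict, List

# Deliberately simple, illustrative patterns; they will miss many real cases.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def flag_regulated(text: str) -> Dict[str, List[str]]:
    """Return every match per pattern name; an empty dict means nothing was flagged."""
    hits = {name: pattern.findall(text) for name, pattern in PATTERNS.items()}
    return {name: found for name, found in hits.items() if found}


sample = "Contact jane@example.com, SSN 123-45-6789"
print(flag_regulated(sample))
```

A record that gets flagged is not automatically off limits, but it should be reviewed before it goes anywhere near a training set.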

What happens when a company’s software developer turns to ChatGPT for some code to be used in the company’s own IP? The developer feeds in the requirements, and ChatGPT returns code based on the request. The developer copies and pastes it into the company’s codebase. Where did that code come from? Who owns it? Is it licensed? Is it original or synthetically derived? Do you now have to open source all of your code? Can you be sued for infringement?

How secure can that code be if an AI model simply finds it on the public internet or even behind a paywall? What if the code came from an exploit website? Why wouldn’t bad actors build libraries of exploitable code, feed them to ChatGPT or other generative models and then follow the poisoned breadcrumbs? Worse, requests fed into the AI model revealing the requestor’s need are now content fodder for the AI model. The organization’s intellectual property has just left the building.

Data provenance becomes even more critical when building AI data sets. When new data shows up in your network, how do you verify its creator or intent, not just where it was sent from? Data tainting will become a new attack vector as more companies adopt and use AI in day-to-day operations. When an attacker knows where your data is coming from, it’s a simple matter of intercepting it and poisoning it. If someone can throw off your analytics just a little, that synthetic sabotage can steer the AI toward the attacker’s desired results. Data surveillance will be an essential defensive tool in ensuring a strong data validation and verification system is in place.
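
One common building block for verifying that incoming data really came from its claimed creator, and was not altered in transit, is a keyed signature checked on arrival. The Python sketch below uses an HMAC over the payload; it assumes the producer and consumer already share a secret key, and the function names `sign` and `verify` are chosen for this example only. It shows the idea, not a complete validation system.

```python
import hashlib
import hmac


def sign(payload: bytes, key: bytes) -> str:
    """Producer side: compute an HMAC-SHA256 tag over the exact bytes sent."""
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify(payload: bytes, key: bytes, signature: str) -> bool:
    """Consumer side: recompute the tag and compare in constant time."""
    return hmac.compare_digest(sign(payload, key), signature)


shared_key = b"example-key-distributed-out-of-band"  # assumption for the demo
batch = b"sensor_id=7,temp=21.4"
tag = sign(batch, shared_key)

print(verify(batch, shared_key, tag))                       # untouched batch
print(verify(b"sensor_id=7,temp=91.4", shared_key, tag))    # poisoned in transit
```

An attacker who intercepts and edits the batch cannot produce a matching tag without the key, so the tampered data can be rejected before it reaches the training pipeline.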

Flying Cloud
