How are you dealing with the accuracy of your data when thinking about using GenAI?
I agree with the other comments that data quality is a foundational need for AI ROI. My favorite analogy to use with my executive team when we talk about shiny new AI desires is this: if you have a Roomba and there's poop on the floor, don't turn on the robot! It's been pretty effective so far.
Data cleanup is foundational for us. We have a data retention project underway, focusing on data policies for emails and team messages, and we're expanding this to other systems. We've recently revamped our digital ecosystem, transferring only essential data, which naturally involved data cleanup.

One of our AI framework's guiding principles is never to use GenAI output verbatim. Everything must be reviewed by a human. GenAI can get you about 70% of the way there, but it should never have the final say. This principle will be emphasized in training so users understand that GenAI's output must be verified. You can't trust it entirely unless it's a specifically trained model. We're mostly using our Microsoft ecosystem and public models, so caution is essential.
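That "never verbatim" rule is easy to encode as a hard gate in a publishing pipeline. Here's a minimal sketch, assuming a workflow where drafts carry a review status; the `Draft`, `approve`, and `publish` names are illustrative, not from any particular framework:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class ReviewStatus(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"

@dataclass
class Draft:
    """A GenAI draft that can't ship until a human signs off."""
    content: str
    status: ReviewStatus = ReviewStatus.PENDING
    reviewer: Optional[str] = None

def approve(draft: Draft, reviewer: str) -> Draft:
    """Record the human sign-off after the reviewer has verified the text."""
    return Draft(content=draft.content, status=ReviewStatus.APPROVED, reviewer=reviewer)

def publish(draft: Draft) -> str:
    """Hard gate: verbatim GenAI output never reaches the final channel."""
    if draft.status is not ReviewStatus.APPROVED:
        raise PermissionError("GenAI output must be human-reviewed before use")
    return draft.content
```

The point of the gate is that the 70% draft is a starting artifact, not a deliverable; the system itself should make it impossible to skip the human step.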
We've been asking ourselves, "How do we address data accuracy?" One of the challenges is dealing with old data. Sometimes, when we run a query, it might seem like the AI is hallucinating, but it's actually just referencing outdated information. This raises the question: how do we refresh our data and ensure our policies, especially around records retention, are accurate? We can't have data lingering for decades, especially in a company that's over a century old. We need to determine whether data should be retained for seven years or whether older data should be purged. This is crucial because the AI might mistakenly treat old data as authoritative. Data sanitization is key here: understanding what's relevant and what's obsolete. Removing old data and deciding what the AI should access is a significant leap. Without clarity on what's current, the AI will always rely on what it knows.
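One way to make that sanitization concrete is to enforce the retention cutoff before anything is indexed for the AI at all. A minimal sketch, assuming documents carry a `last_modified` timestamp; the field names and sample records are illustrative:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

RETENTION_YEARS = 7  # the seven-year policy discussed above; adjust per records schedule

def is_current(doc: dict, now: Optional[datetime] = None) -> bool:
    """True if the document is inside the retention window.

    Anything older is excluded before indexing, so the model can't
    mistake stale records for authoritative answers.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=365 * RETENTION_YEARS)
    return doc["last_modified"] >= cutoff

all_documents = [
    {"id": "policy-2021", "last_modified": datetime(2021, 3, 1, tzinfo=timezone.utc)},
    {"id": "memo-1998", "last_modified": datetime(1998, 6, 15, tzinfo=timezone.utc)},
]
corpus = [d for d in all_documents if is_current(d)]  # only policy-2021 gets indexed
```

Filtering at ingestion rather than at query time means the AI literally can't "know" the obsolete data, which sidesteps the outdated-answer-that-looks-like-a-hallucination problem.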
Kairi nailed it. Data is a critical factor. We've worked closely with business users to clean up data sets. For instance, if there are five versions of a document, a user might know to pick version five, but the AI wouldn't inherently know this. We've had to archive older documents to ensure only relevant data is accessible. Another challenge was grounding data by region, given our global presence. You can't serve North American data to an EMEA user; it would be inaccurate. Grounding was essential to deliver accurate information to specific users. Accuracy is a complex issue. While prompt engineering can achieve around 70% accuracy, pushing beyond that is challenging. We've experimented with RAG models and SLMs to enhance efficiency, but results vary. It's case by case, and we baseline against specific use cases rather than a standard form.
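In RAG terms, both the versioning and the regional grounding usually come down to metadata filters applied before similarity ranking. A toy sketch of that idea, assuming chunks carry `region` and `is_latest_version` metadata; the names and the keyword-overlap scorer are stand-ins, not a specific vector-store API:

```python
from typing import Dict, List

def score(query: str, chunk: Dict) -> float:
    """Placeholder relevance: keyword overlap. Swap in embedding similarity."""
    q = set(query.lower().split())
    return float(len(q & set(chunk["text"].lower().split())))

def retrieve(query: str, user_region: str, index: List[Dict], top_k: int = 5) -> List[Dict]:
    """Ground by region and version *before* ranking.

    Filtering first means an EMEA user can never be answered from
    North American documents, or from version three of a five-version file.
    """
    candidates = [
        c for c in index
        if c["region"] == user_region and c["is_latest_version"]
    ]
    return sorted(candidates, key=lambda c: score(query, c), reverse=True)[:top_k]

index = [
    {"text": "EMEA leave policy", "region": "EMEA", "is_latest_version": True},
    {"text": "NA leave policy", "region": "NA", "is_latest_version": True},
    {"text": "EMEA leave policy draft v3", "region": "EMEA", "is_latest_version": False},
]
print(retrieve("leave policy", "EMEA", index))  # only the current EMEA chunk survives
```

The archiving work described above is what makes the `is_latest_version` flag trustworthy in the first place; the retrieval filter is only as good as the metadata behind it.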
I am going to go contrarian on this one and argue that any data pool where a data-accuracy concern is raised is a pool that is not a fit for GenAI!
Let me anthropomorphize. AI was born from, lives in, and thrives in a universe of data measured in zettabytes and yottabytes. In this world, GenAI doesn't consume clean data. It learns from noise, contradiction, and pattern density. If you are feeding it clean data, you are starving a whale by feeding it slowly, one organically sourced, hand-picked plankton at a time. The whale doesn't care about whole foods. It thrives on a messy plankton bloom.
But it's not a report! It's a model. And all models are wrong, but some are useful. GenAI output should not be confused with deterministic truth! It's a different kind of truth, one for intelligence, cognitive expansion, and ultimately, human reasoning.
And so I would argue that you need to shift your argument away from data cleanliness and toward model utility. The real questions are ones of intentionality. What are you trying to explore? What are you trying to generate? What are you trying to simulate? From that purpose you work back to the guardrails, the feedback loops, and the prompts. And from there you identify the datasets (structured and unstructured, clean and dirty, documents and databases, owned or borrowed, trusted or not) that might ground the model.

But realise that if you are feeding the whale anything less than terabytes, pragmatically speaking, you are not likely giving it the metaphorical energy to produce output influenced by your grounding. So data readiness is not the same as data cleanliness. Data readiness is about identifying where you can source the petabytes, exabytes, and zettabytes that contain data that is **relevant**, even if that data is full of contradictions. AI will sort that out. Although... sometimes it will sort it out with hallucination, which is why you need guardrails and feedback loops, not shiny data bytes.
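To make "guardrails and feedback loops, not shiny data bytes" concrete: the loop can be as simple as validating each generation and folding failures back into the prompt. A minimal sketch, with `generate` and `validate` as stand-ins for whatever model call and checks you actually run (grounding-overlap, schema, policy filter):

```python
from typing import Callable, Optional, Tuple

def run_with_guardrails(
    prompt: str,
    generate: Callable[[str], str],
    validate: Callable[[str], Tuple[bool, str]],
    max_attempts: int = 3,
) -> Optional[str]:
    """Feedback loop: regenerate until the output passes the guardrail.

    Returning None routes the case to a human instead of shipping noise.
    """
    for _ in range(max_attempts):
        output = generate(prompt)
        ok, feedback = validate(output)
        if ok:
            return output
        # Fold the failure back into the prompt: this is the feedback loop itself.
        prompt = f"{prompt}\n\nYour previous answer failed a check: {feedback}. Revise."
    return None  # escalate to a human reviewer
```

Notice the loop never touches the training data at all. It accepts that the whale eats messy plankton and polices what comes out the other end.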
my view, imho, 2cents, ymmv.