Your data is the real bottleneck (not the AI)
Part 4 of the "Build with AI" series
Here's a scenario that plays out constantly.
A team spends weeks evaluating AI models. They test GPT-5.2 against Claude Opus 4.6. They debate which one handles their use case better. They read benchmarks. They run trials. They finally commit to a model and build their workflow.
Then they feed it their actual data — the spreadsheets, the CRM exports, the customer notes, the internal documents — and the results are a mess. Inconsistent. Unreliable. Sometimes useful, often wrong.
They blame the model. Switch to a different one. Same results.
The model was never the problem.
What's actually going on
There's a principle in data engineering that predates AI by decades: garbage in, garbage out. Feed a system bad data, and it produces bad results — regardless of how sophisticated the system is. This was true for databases in the 1980s. It's true for machine learning models. And it's especially true for AI today.
The difference is that AI is unusually good at hiding this problem. A language model can take inconsistent, messy, incomplete data and produce something that sounds coherent and confident. It fills gaps, smooths over contradictions, generates plausible-sounding outputs. Which means you can spend a long time thinking the AI is doing fine — until you try to act on its outputs and realize they were built on shaky foundations.
Most people evaluating AI tools are evaluating the model. They should be evaluating their data.
What "data" actually means here
When we talk about data in the context of AI, we don't just mean spreadsheets and databases. We mean any information you feed into an AI to help it do its job. That includes:
- Structured data — spreadsheets, CSV files, database exports, CRM records
- Unstructured text — emails, meeting notes, customer feedback, internal documents, PDFs
- Context you provide in prompts — background information, instructions, examples
- External sources — websites, APIs, documents you retrieve and pass to the model
All of it is data. And all of it has quality issues that affect what AI can do with it.
The four data problems that kill AI projects
1. Inconsistency
This is the most common and most damaging. The same thing described in five different ways: "New York", "NY", "NYC", "New York City", "new york". A customer status that's "Active" in one system and "active" in another. A product name with a space in some records and a hyphen in others.
Humans handle inconsistency automatically — our brains normalize it without effort. AI doesn't. It treats "Active" and "active" as potentially different things. It notices the pattern variations and either gets confused or makes assumptions that compound errors downstream.
Pick any key field in your data — a category, a status, a name — and count how many variations of the same value exist. If you find more than two or three, you have an inconsistency problem.
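That count is easy to automate. A minimal sketch in Python: group raw values by a normalized key and report any key that appears under more than one spelling (the city values are illustrative; note that abbreviations like "NY" survive normalization and still need a hand-built mapping table):

```python
from collections import Counter

def variant_report(values):
    """Group raw values by a normalized key so spelling variants
    of the same thing surface together."""
    groups = {}
    for v in values:
        # Trim, lowercase, and collapse internal whitespace
        key = " ".join(v.strip().lower().split())
        groups.setdefault(key, Counter())[v] += 1
    # Keep only keys that appear under more than one raw spelling
    return {k: dict(c) for k, c in groups.items() if len(c) > 1}

cities = ["New York", "NY", "NYC", "new york", "New York City", "New York"]
print(variant_report(cities))
# Flags "New York"/"new york"; "NY" and "NYC" pass untouched,
# which is exactly why normalization alone is not enough.
```

If this report is non-empty for any key field, you have an inconsistency problem before the AI ever sees the data.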
2. Incompleteness
Missing fields. Empty cells. "N/A" where there should be a value. Records that are half-filled because the person entering them was in a hurry, or because the system didn't require it.
AI can work with incomplete data — but it will fill the gaps with assumptions. Sometimes those assumptions are reasonable. Often they're not. And you won't always know which is which.
The danger compounds when incompleteness is systematic — when the same fields are always missing for the same type of record. That means the AI is making the same wrong assumption over and over, consistently, at scale.
Check what percentage of your records have all key fields populated. If it's below 80%, your AI will be operating on assumptions more than facts.
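That percentage check can be sketched in a few lines (the field names and the set of missing-value markers are assumptions; adjust them to your data):

```python
def population_rate(records, key_fields):
    """Fraction of records where every key field has a real value."""
    missing_markers = {None, "", "N/A", "n/a", "unknown"}
    complete = sum(
        1 for r in records
        if all(r.get(f) not in missing_markers for f in key_fields)
    )
    return complete / len(records) if records else 0.0

records = [
    {"name": "Acme", "status": "Active", "email": "ops@acme.example"},
    {"name": "Globex", "status": "", "email": "N/A"},
]
rate = population_rate(records, ["name", "status", "email"])
print(f"{rate:.0%}")  # 50%, well below the 80% threshold
```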
3. Staleness
Data has an expiry date. A customer record from two years ago may have a wrong address, an old job title, a phone number that no longer works. A product description written in 2022 may describe features that no longer exist. A market analysis from 18 months ago is describing a different market.
When you feed AI stale data and ask it to do something current — make a recommendation, generate a proposal, summarize a customer's situation — it will work from that stale information and produce output that sounds authoritative but is built on an outdated picture of reality.
Ask yourself: when was the last time each major data source was verified or updated? If you don't know, that's the answer.
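One way to make that question answerable is a per-record verification timestamp and a small staleness check. A sketch, assuming a `last_verified` ISO-date field and a one-year cutoff (both are illustrative choices):

```python
from datetime import datetime, timedelta

def stale_records(records, max_age_days=365, now=None):
    """Return records whose last_verified date is older than the
    cutoff, or missing entirely (which is treated as stale)."""
    now = now or datetime.now()
    cutoff = now - timedelta(days=max_age_days)
    flagged = []
    for r in records:
        verified = r.get("last_verified")
        if verified is None or datetime.fromisoformat(verified) < cutoff:
            flagged.append(r)
    return flagged

records = [
    {"id": 1, "last_verified": "2025-01-10"},
    {"id": 2, "last_verified": "2022-06-01"},
    {"id": 3},  # never verified: stale by definition
]
print([r["id"] for r in stale_records(records, now=datetime(2025, 6, 1))])
# → [2, 3]
```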
4. Context collapse
This is the subtlest problem. Your data makes perfect sense to you — because you carry years of context that the data doesn't capture.
"Customer is on hold" means something specific to your team: they requested a pause in service, they're waiting for a hardware shipment, they're mid-negotiation on a contract renewal. But that phrase alone, fed to an AI, could mean any of a dozen things. The AI doesn't have the organizational context to interpret it correctly.
This is especially true of shorthand, internal jargon, abbreviations, and codes that make perfect sense to insiders but are opaque to any outside reader — including AI.
Take a random sample of your records and ask someone unfamiliar with your business to interpret them. Every place they're confused is a context collapse risk.
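A lightweight mitigation is to document the shorthand in a glossary and expand it before handing records to a model. A minimal sketch with an invented glossary (this uses naive string replacement; a real version would want word-boundary matching so short codes don't fire inside other words):

```python
# Hypothetical glossary mapping internal shorthand to plain meaning.
GLOSSARY = {
    "on hold": "customer requested a service pause",
    "POC": "proof of concept",
    "CHB": "chargeback",
}

def expand_jargon(text, glossary):
    """Annotate documented shorthand with its definition so the
    record is readable without internal context."""
    for term, meaning in glossary.items():
        text = text.replace(term, f"{term} ({meaning})")
    return text

note = "Customer is on hold pending CHB review"
print(expand_jargon(note, GLOSSARY))
# → Customer is on hold (customer requested a service pause)
#   pending CHB (chargeback) review
```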
The data readiness checklist
Before you build any AI workflow on top of your data, run through this checklist:
Consistency
- Key categorical fields use standardized values (not multiple variants of the same thing)
- Names, IDs, and references match across systems
- Date formats are consistent
- Text fields don't contain structured data that should be in separate fields
Completeness
- Key fields have >80% population rate
- Missing values are marked one consistent way (a single explicit marker, not a mix of empty cells, "N/A", and "unknown")
- No required fields are routinely left blank
Freshness
- You know when each data source was last verified
- Stale records are flagged or archived, not mixed with current ones
- Time-sensitive fields (contact info, prices, statuses) have a review cadence
Context
- Abbreviations and internal jargon are documented somewhere
- Status codes and categories have definitions
- Records can be understood by someone without internal organizational context
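If you want to track the checklist over time rather than run it once, a tiny scoring helper mirroring the thresholds above can do it (the item names are placeholders):

```python
def readiness(checks):
    """checks: dict of checklist item -> bool.
    Returns the pass fraction and a verdict matching the
    all-checked / fewer-than-half thresholds."""
    frac = sum(checks.values()) / len(checks)
    if frac == 1.0:
        return frac, "AI-ready"
    if frac < 0.5:
        return frac, "unreliable regardless of model"
    return frac, "needs work"

checks = {
    "standardized values": True,
    "ids match across systems": False,
    "dates consistent": True,
    "key fields >80% populated": False,
}
print(readiness(checks))  # (0.5, 'needs work')
```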
If you can check everything on this list, your data is AI-ready. If you're checking fewer than half, your AI results will be unreliable regardless of which model you use.
What to do about it
The good news: AI itself is an excellent tool for cleaning and preparing data. The process is:
Step 1: Audit first, fix second. Don't start cleaning data randomly. Use AI to help you audit — feed it a sample and ask it to identify inconsistencies, missing patterns, and ambiguous values. Let it surface the problems before you decide how to fix them.
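One low-effort way to run that audit is to assemble the sample and the questions into a single prompt; sending it is left to whichever model client you use. A sketch with invented sample rows:

```python
import json

def build_audit_prompt(sample_rows):
    """Assemble a data-quality audit prompt from a small sample.
    The three questions map to the problems described above."""
    sample = json.dumps(sample_rows, indent=2)
    return (
        "Audit this data sample. List: (1) inconsistent values for the "
        "same concept, (2) fields with missing or placeholder values, "
        "(3) ambiguous terms an outsider could not interpret.\n\n"
        f"Sample:\n{sample}"
    )

rows = [
    {"city": "NYC", "status": "Active"},
    {"city": "new york", "status": "active"},
]
prompt = build_audit_prompt(rows)
print(prompt)
```

Keep the sample small and representative; a few dozen rows is usually enough to surface the patterns.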
Step 2: Standardize the high-impact fields first. You don't need to fix everything. Focus on the fields that your AI workflow will actually use. If you're building a customer outreach tool, customer name, status, and contact info matter more than internal notes. Fix the critical path, not the entire dataset.
Step 3: Document what you've standardized. A short reference document — what values are allowed in each field, what each status means, what the abbreviations stand for — is worth more than hours of cleaning. It prevents the problem from recurring and gives AI a reference to work from.
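That reference document can double as machine-readable validation if you keep it as structured data. A minimal sketch with invented status values:

```python
# Hypothetical data dictionary: allowed values per field, with meanings.
DATA_DICTIONARY = {
    "status": {
        "Active": "customer is currently billed",
        "On Hold": "service paused at customer request",
        "Churned": "contract ended, no renewal",
    },
}

def validate(record, dictionary):
    """Return fields whose value is not in the documented allowed set."""
    return [
        field for field, allowed in dictionary.items()
        if record.get(field) not in allowed
    ]

print(validate({"status": "active"}, DATA_DICTIONARY))
# → ['status']  (case mismatch: the dictionary catches the drift)
```

The same structure can be pasted into a prompt so the AI interprets status codes the way your team does.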
Step 4: Build freshness into the process. Data quality is not a one-time project. It decays. Build a simple review cadence — quarterly for slow-changing data, monthly for fast-changing — so you're not solving the same staleness problem repeatedly.
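The cadence itself can live in code rather than a calendar reminder. A sketch with assumed source names and intervals:

```python
from datetime import date

# Hypothetical review cadences per data source, in days.
CADENCE_DAYS = {"contact_info": 30, "pricing": 30, "org_chart": 90}

def reviews_due(last_reviewed, today):
    """Sources whose last review is older than their cadence."""
    return [
        src for src, reviewed in last_reviewed.items()
        if (today - reviewed).days > CADENCE_DAYS[src]
    ]

last = {
    "contact_info": date(2025, 3, 1),
    "pricing": date(2025, 5, 20),
    "org_chart": date(2025, 1, 5),
}
print(reviews_due(last, today=date(2025, 6, 1)))
# → ['contact_info', 'org_chart']
```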
What this is really about
Choosing the right AI model is a 10% problem. Getting your data right is a 90% problem.
This isn't a knock on AI. It's a reflection of where the real work is in any information-intensive task. The model is a tool. The data is the raw material. No craftsperson blames their tools when they're working with bad material.
The builders who get consistently good results from AI aren't necessarily using the best models. They're using clean, consistent, well-documented data — and they've built habits to keep it that way.
That's the unsexy truth the AI demo world never shows you. The demos use pristine, pre-cleaned example data. Your reality is messier. And fixing that mess is the highest-leverage thing you can do before adding AI to anything.
Key takeaways
- AI can produce confident, fluent-sounding outputs from bad data, which makes the problem harder to detect, not easier.
- Most data quality problems fall into one of four categories: inconsistency, incompleteness, staleness, and context collapse. Know which ones you have before you build.
- The model choice matters far less than people think; data quality matters far more.
- Before you clean, use AI to identify what needs cleaning: feed it a sample and ask it to surface problems.
- Data quality is a habit, not a project. It decays. Build review cadences, document standards, and treat it as ongoing maintenance.
Want the full framework?
This post covers the data foundation. The AI Development Guide by Jaehee Song goes deeper — into how to structure your data for specific AI use cases, how to use AI to clean and enrich what you have, and how data quality connects to every layer of the AI stack.
📱 Apple Books ▶️ Google Play Books 🌐 All Platforms (Books2Read)
Next in the series: "The Real Skill Isn't Coding — It's Defining the Problem" — why in an age of tools that can build almost anything, the bottleneck has shifted entirely to how precisely you can articulate what you want.