We Cut KYC From 7 Days to 30 Min Using VLMs. The Hard Part Wasn't the Model.

Nigerian addresses are descriptive, not structured. Designing a pipeline that could extract them reliably was harder than picking the model.

Jordan Browne-Moore · June 2026

KYC automation at Kuda · Gemini Flash + Qwen3 8B VL · 400K+ submissions processed · 98% automated decision accuracy

Before: 7-day processing with 176 monthly hours of manual review and 75% reviewer agreement. After: 30-min processing with 2 monthly hours of human oversight and 98% automated decision accuracy.

The Problem That Benchmark Scores Don't Capture

Eight customer experience workers spent roughly an hour per day each manually reviewing KYC documents. Total: about 176 hours a month. Each submission took 7 days end to end. The review queue never emptied.

The same document sent to two reviewers could get different results. Take a utility bill listing “Plot 5, Adeola Odeku, beside the blue building, behind the church.” Does that match the address on file? One reviewer might recognize the area and approve it. Another, working a late shift, might flag it for follow up. A third might misread the handwritten text and decline it. The process was structurally inconsistent, not because the reviewers were careless but because the documents themselves defied standardized extraction.

The Central Bank of Nigeria had updated its KYC regulations to require physical address verification for every customer. The manual process was the bottleneck. And off the shelf OCR built for structured forms could not help because Nigerian addresses do not conform to the standard schema of street number, street name, city, and postal code that OCR assumes. Landmarks and relative locations replace street numbers. The address fields those tools expect do not exist in the same way here.

This is not a model accuracy problem. It is a data standardization problem that most ML content assumes away.

Model Selection Under Constraints

The deadline was one week, driven by CBN compliance urgency. That left no time for model benchmarking.

I chose Gemini Flash because it was available immediately, had strong OCR and document understanding, and had a latency profile that suited batch processing. I built the initial system in one week. It went on to process 400K+ submissions.

I later expanded to Qwen3 8B VL for sensitive and edge cases that required self hosted inference. Fine tuned with SFT LoRA on 10K curated records. That project involved its own fine tuning and evaluation cycle. The details are in my writeup on LLM evaluation methodology, but the numbers are different because the task and data were different. The pattern is the same: curated data beats auto labeled data at any scale.

I picked what would ship in a week, not the benchmark leader. The benchmark leader came later, once we knew what the production distribution actually looked like.

The Consistency Problem

This section is about consistency, not detection accuracy. The distinction matters.

We measured inter reviewer agreement on address matching across a common document set. On documents with descriptive addresses (landmarks, relative locations, area names), reviewers agreed on the correct extraction roughly 75% of the time. The remaining quarter represented genuine ambiguity. The same address could plausibly map to multiple locations depending on how you interpreted the description.
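A minimal sketch of one way to compute an agreement rate like this. It assumes agreement is measured as the share of reviewer pairs that gave the same label on the same document; the post does not say which agreement statistic was used, and the labels and document IDs below are toy data, not production records.

```python
from itertools import combinations


def pairwise_agreement(labels_by_doc):
    """Share of reviewer pairs that gave the same label, pooled over documents."""
    agree = 0
    total = 0
    for labels in labels_by_doc.values():
        for a, b in combinations(labels, 2):
            agree += int(a == b)
            total += 1
    return agree / total if total else float("nan")


# Toy data: three hypothetical reviewers per document.
reviews = {
    "doc-1": ["match", "match", "match"],
    "doc-2": ["match", "no_match", "match"],
    "doc-3": ["no_match", "no_match", "no_match"],
    "doc-4": ["match", "match", "no_match"],
}

rate = pairwise_agreement(reviews)
print(f"pairwise agreement: {rate:.2f}")
```

Raw pairwise agreement does not correct for agreement by chance. A chance-corrected statistic would be a stricter choice if the label distribution is heavily skewed.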

The system produced identical extractions on the same inputs 100% of the time. Consistent within a run and consistent across runs with identical inputs.

This raises an obvious question: if the initial human labels were inconsistent, what did we train against? The initial ground truth was established through adjudication: a senior reviewer resolved disagreements across the document set, producing a single consistent label per submission. Subsequent training used system generated decisions for the majority of cases, with human review reserved for the less than 2% of submissions that fell below the confidence threshold. The system did not train on the raw inconsistent human labels. It trained on the adjudicated subset, then generated its own decisions, which were more consistent than either individual reviewer.

This is the difference between a process that depends on human judgment and one that depends on deterministic rules. The system was not better because the model was smarter. It was better because it was consistent. Consistency is what compliance actually needs. A regulator wants to know the rule was applied uniformly, not that each decision was individually brilliant.

The result: 400K+ submissions processed more consistently than the human QA team managed.

The Evaluation Harness

What We Measured and Why

A regulator does not ask “does the system work?” They ask “how do you know it works?” An aggregate accuracy memo is not proof. A stratified evaluation that reports accuracy per document type, per extraction field, per confidence band is proof.

We tracked TP/FP/FN/TN across five dimensions: address extraction accuracy, bill type classification, document date, extraction date, and LLM confidence. Each dimension got its own substratified test set. Different document types (utility bill vs bank statement vs photograph) each had their own evaluation. Different image quality tiers had their own. Different address format variants had their own.

The decision to substratify by document type came from a specific failure. Early in development, the model showed 96% aggregate accuracy on extraction. That number looked fine. When we split by document type, accuracy on utility bills was 99% but accuracy on the category we had labeled “photographs” was 83%. The natural assumption was that photographs of handwritten documents were a hard problem for the model. It turned out the category was wrong. “Photographs” included selfies, drawings, unrelated screenshots, and wrong file types. Inputs that were not valid KYC documents at all.

This changed how we thought about the pipeline. The problem was not model accuracy on a hard document type. The problem was that our input category definitions were incorrect, and an aggregate metric was masking it. The solution was not to make the model better at extracting addresses from photographs. The solution was to add upstream document classification to filter invalid submissions before extraction, then route the filtered results through separate evaluation pipelines per valid document type. After that change, the 83% figure became irrelevant because those inputs never reached the extraction model, and we never reported an unstratified accuracy number again.
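To make the masking effect concrete, here is a small sketch of stratified accuracy. The record layout (`doc_type`, `correct`) and the helper names are hypothetical, and the counts are synthetic, chosen only so the arithmetic reproduces the percentages quoted above with two strata. The real volumes are not given here.

```python
from collections import defaultdict


def stratified_accuracy(records, key):
    """Accuracy per value of `key` (for example per document type)."""
    hits = defaultdict(int)
    counts = defaultdict(int)
    for r in records:
        counts[r[key]] += 1
        hits[r[key]] += int(r["correct"])
    return {k: hits[k] / counts[k] for k in counts}


def make_records(doc_type, n, n_correct):
    """Build n synthetic evaluation records, the first n_correct marked correct."""
    return [{"doc_type": doc_type, "correct": i < n_correct} for i in range(n)]


# Synthetic counts: 1287/1300 = 99% and 249/300 = 83%.
records = (
    make_records("utility_bill", 1300, 1287)
    + make_records("photograph", 300, 249)
)

aggregate = sum(r["correct"] for r in records) / len(records)
per_type = stratified_accuracy(records, "doc_type")

print(f"aggregate: {aggregate:.2%}")
for doc_type, acc in per_type.items():
    print(f"{doc_type}: {acc:.2%}")
```

The aggregate looks healthy because the large, easy stratum dominates the average. The same function can be called with any other key (extraction field, quality tier, format variant) as long as the records carry that field.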

The evaluation harness was built to the standard a regulator would require. That is the correct engineering decision regardless of whether that specific regulator has asked yet.

I set this as the team's policy: we do not report aggregate accuracy. Every metric is stratified by document type, extraction field, quality tier, and format variant. This is more expensive to maintain and harder to explain to stakeholders who want a single number. But a single number is not defensible to a regulator. We built to that standard before anyone asked for it, because retrofitting compliance onto a production pipeline is always more expensive than building to compliance standards from the start.

The Regression Problem Nobody Talks About

Models are stochastic by default. Unless decoding is pinned down, the same prompt with the same input can produce different extractions across runs. And adding a new capability (a new document type, a new address format variant) can silently break existing behavior.

The most vivid example: we added support for a new bank statement format. The new format extracted correctly. But the model's confidence calibration shifted on an unrelated existing document type, causing roughly 1,200 submissions that would have been auto approved to flag for manual review instead. The evaluation harness caught it because it ran the full test suite on every model update, not just the new function.
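A sketch of the kind of check that catches this, assuming the harness records an auto approve rate per stratum on a fixed test set for both the current and the candidate model. The function name, the stratum names, the rates, and the 1 percentage point tolerance are all hypothetical.

```python
def routing_regressions(baseline, candidate, tolerance=0.01):
    """Return strata whose auto-approve rate moved by more than `tolerance`.

    baseline / candidate: dict mapping stratum name -> auto-approve rate
    measured on the same fixed test set.
    """
    flagged = {}
    for stratum, base_rate in baseline.items():
        new_rate = candidate.get(stratum)
        if new_rate is None:
            # Stratum missing from the candidate report: treat as a failure.
            flagged[stratum] = (base_rate, None)
        elif abs(new_rate - base_rate) > tolerance:
            flagged[stratum] = (base_rate, new_rate)
    return flagged


# Toy numbers: a new format is added and an unrelated stratum shifts.
baseline = {"utility_bill": 0.97, "bank_statement_v1": 0.95}
candidate = {
    "utility_bill": 0.91,
    "bank_statement_v1": 0.95,
    "bank_statement_v2": 0.96,  # the newly supported format
}

flagged = routing_regressions(baseline, candidate)
print(flagged)
```

The important property is that every existing stratum is compared on every update, so a shift in an unrelated document type blocks the release even when the new format passes its own tests.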

The Pipeline Architecture

Document Flow

Document classification determined the type: utility bill, bank statement, or photograph. Field extraction pulled the relevant data. Fraud detection assessed authenticity. Confidence scoring determined whether the decision could be automated. Routing sent the result to approval, decline, or CS review.
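The final routing step can be sketched as follows. This is an illustration under stated assumptions, not the production logic: the `Decision` fields, the 0.90 threshold, and the rule that a confident mismatch or fraud signal leads to a decline are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    APPROVE = "approve"
    DECLINE = "decline"
    CS_REVIEW = "cs_review"


@dataclass
class Decision:
    doc_type: str          # output of document classification
    address_matches: bool  # extracted address vs the address on file
    fraud_suspected: bool  # authenticity signal from the extraction pass
    confidence: float      # model confidence in [0, 1]


# Hypothetical threshold; the real value is not given in this post.
CONFIDENCE_THRESHOLD = 0.90


def route(d: Decision, threshold: float = CONFIDENCE_THRESHOLD) -> Route:
    # Low confidence: do not automate, hand the case to a human.
    if d.confidence < threshold:
        return Route.CS_REVIEW
    # Confident and clean: approve.
    if d.address_matches and not d.fraud_suspected:
        return Route.APPROVE
    # Confident that something is wrong: decline.
    return Route.DECLINE


print(route(Decision("utility_bill", True, False, 0.97)))
print(route(Decision("photograph", True, False, 0.50)))
```

Checking confidence first means a low confidence fraud flag goes to a person rather than producing an automated decline.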

Fraud detection lived inside the VLM extraction step, not as a separate classifier. The reason: fraud signals like handwriting alterations, inconsistent document formatting, and spurious details are content dependent. A separate classifier would need features engineered from VLM outputs anyway. Baking it into the extraction step means the same model that reads the document also assesses its authenticity.

The tradeoff: extraction and fraud decisions share error modes. If the model misses a field, it might also miss the fraud signal in that field. In practice, the shared error modes did not produce correlated failures that escaped the evaluation harness. The simpler architecture was the better value.

The 98% Metric

98% of automated decisions were correct. This is overall accuracy: true positives plus true negatives, divided by all automated decisions. Precision (what fraction of approved submissions were correct) and negative predictive value (what fraction of declined submissions were correct) both feed into this number.

Timeouts that succeeded on retry are included. The less than 2% that went to CS are excluded from the 98% because those are not automated decisions.
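For reference, a short sketch of how these figures fall out of confusion counts over automated decisions only. The counts are synthetic, picked so overall accuracy comes to 98%, and "positive" is assumed to mean "approve".

```python
def decision_metrics(tp, fp, tn, fn):
    """Metrics over automated decisions, with 'positive' meaning 'approve'."""
    total = tp + fp + tn + fn
    return {
        # Share of all automated decisions that were correct.
        "accuracy": (tp + tn) / total,
        # Share of approved submissions that should have been approved.
        "precision": tp / (tp + fp),
        # Share of declined submissions that should have been declined.
        "npv": tn / (tn + fn),
        # Share of truly invalid submissions that were declined.
        "tnr": tn / (tn + fp),
    }


# Synthetic counts; cases routed to CS review are excluded by construction.
metrics = decision_metrics(tp=890, fp=5, tn=90, fn=15)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Accuracy alone hides how the errors split between approvals and declines, which is why the approve side and decline side rates are worth reporting next to it.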

Scale: 80K+ customers per month. All onboarding KYC ran through the system first.

The Curation Insight

This is the single most important decision in the pipeline, and it applied across every model I built here.

We had two training data sources: roughly 10K hand curated examples corrected by humans, and roughly 50K auto labeled ones from the reference model's raw outputs. We trained separate models on each and compared.

The curated model outperformed the auto labeled one on every metric that mattered. Format adherence was higher. Edge case handling (handwritten addresses, poor quality scans, unusual bill formats) was meaningfully better. The gap showed up in the evaluation harness immediately.

Why: the auto labeled data amplified the reference model's systematic errors. If the API teacher consistently misread a specific bill format, training on its outputs baked that error into the student model. The curated data corrected those errors, so the student learned to outperform the teacher.

Each curated example was worth roughly five auto labeled ones in terms of downstream accuracy improvement. If you are deciding between labeling 10K examples yourself and pulling 50K from an API, label the 10K. The API's errors are invisible to you until your evaluation harness catches them in production.

This same pattern showed up in a different project at CB Insights, where we fine tuned Qwen3 32B for financial metric extraction. In that case it was 15K curated examples. The ratio was similar. The mechanism was the same. It is a consistent finding across engagements, not a one off result.

What Didn't Work

The initial system was built for approvals only, not declines. This was a deliberate risk decision, not a technical limitation. Under a one week deadline with CBN compliance pressure, I chose the path that maximized impact while minimizing downside. A false approval meant a document routed for additional verification. A false decline meant a lost customer. Those risks are asymmetric, and I weighted them accordingly in the initial scope. The decline capability came later, after we had enough production data to calibrate confidence thresholds for both directions safely.

Adding decline logic also required rebuilding the evaluation harness. The system could tell you with high confidence when a document was valid and should be approved. It could not tell you with the same confidence when a document was invalid and should be declined. Those are different problems with different evaluation requirements, and we designed for one but not the other initially. The scoping call was the right one.

Timeline: 1 week for the MVP under the CBN deadline. 3 to 4 months to production grade accept and reject capabilities. The model was running in a week. The infrastructure around it (evaluation, rejection handling, confidence calibration, compliance documentation) took months. That is not a weakness. That is the point.

Cost

Before: 8 CX workers spending roughly an hour per day each on manual KYC review. Total around 176 hours per month.

After: 1 CX worker spending roughly two hours per month handling the less than 2% of edge cases that routed to CS.

Roughly a 90x reduction in human review time, while maintaining or improving decision accuracy. The exact multiple depends on what you count, but the direction is clear.
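The arithmetic behind those figures, as a quick check. The 22 working days per month is an assumption: it is not stated here, but it is the value that makes 8 reviewers at an hour a day come out to 176 hours.

```python
# Assumption: 22 working days per month (not stated in the post).
REVIEWERS = 8
HOURS_PER_REVIEWER_PER_DAY = 1
WORKING_DAYS_PER_MONTH = 22

hours_before = REVIEWERS * HOURS_PER_REVIEWER_PER_DAY * WORKING_DAYS_PER_MONTH
hours_after = 2  # one CX worker, roughly two hours per month

reduction = hours_before / hours_after
print(hours_before, reduction)  # 176 88.0
```

That is 88x on these inputs, which rounds to the "roughly 90x" quoted above.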

The CX team went from being the primary KYC processors to being exception handlers. That changed their workflow, their skill requirements, and their relationship to the rest of the organization. The engineering was the straightforward part. Getting the team to trust automated decisions, to shift from processing every case to handling only edge cases, to accept that the system was more consistent than they were, that was the harder conversation.

If you are processing 80K submissions a month, manual review at 7 days per submission means your review queue never empties. You are not scaling headcount. You are scaling your compliance bottleneck. Automation is not a cost play. It is the only way the process works at all.

If you are running KYC review at scale and want to understand what automation looks like for your document types and regulatory environment, I do targeted discovery engagements. We map your current pipeline, identify which steps are automation ready and which need human judgment, and scope the evaluation infrastructure you would need to make the decision defensible. The model is the easiest part. Everything around it is where the engineering goes.

If you're considering KYC automation at scale, I'm happy to talk through it.
