POC Specification

Unstructured Data Extraction & Quality Improvement
Client: Insurance Client
Contact: SIVA (AI Lead)
Timeline: 5-day Prototype + 30-day POC
Use Case 1
Confidential

Executive Summary

Problem: Insurance Client struggles to reliably extract structured fields from unstructured policy documents (PDS, schedules, endorsements). Poor extraction cascades downstream, causing delays, rework, and critical errors.

Solution: Deploy an AI-powered document extraction platform using Kyndryl's Agent Builder to automatically extract key policy information, embed quality checks, and deliver trusted structured data to downstream systems.

Business Outcomes:
95% Extraction Accuracy
60% Processing Speed Improvement
80% Error Reduction

Phase 4: Prototype (5 Days)

Rapid Validation

Interactive Demo

Browser-based prototype demonstrating:

  • Policy document upload workflow
  • Automated extraction of key fields
  • Real-time quality scoring
  • Dual-persona UI (Claims Adjuster + Underwriter)
  • Side-by-side document vs. extracted data comparison
  • Human review and correction workflow

Key Features

Success Criteria (Phase 4)

Phase 5: POC (30 Days)

Production Validation

Data: 50+ real Insurance Client policy documents (PDS, schedules, endorsements) + synthetic variations
Integration: Azure Document Intelligence + LLM extraction + downstream system APIs

Architecture: 13-Step Processing Pipeline

Step 1: UPLOAD HANDLER
  └─ Receives uploaded policy documents (PDF, JPG, PNG)
Step 2: DOCUMENT IDENTIFIER
  └─ Detects document type & extracts metadata
Step 3: OCR / LAYOUT EXTRACTION ⭐ [NEW]
  └─ Azure Document Intelligence / Claude Vision
  └─ Extracts text + spatial layout
Step 4: CLAUSE / SECTION PARSER ⭐ [NEW]
  └─ Identifies policy sections & hierarchical structure
  └─ Extracts condition boundaries
Step 5: POLICY EXTRACTOR
  └─ LLM-powered field extraction (policy#, date, coverage, exclusions)
  └─ Confidence scoring per field
Step 6: QUALITY CHECKER
  └─ Validates fields & detects anomalies
  └─ Missing field detection & consistency checks
Step 7: DATA NORMALIZER
  └─ Standardizes formats (dates, amounts, text casing)
  └─ Converts to canonical forms
Step 8: CLAUSE INDEXER
  └─ Indexes normalized clauses and sections for rapid retrieval
  └─ Creates structured clause metadata (section#, type, hierarchy)
  └─ Enables fast lookup and cross-reference capabilities
Step 9: EMBEDDING / VECTORIZATION
  └─ Converts normalized structured data into vector embeddings
  └─ Uses embedding models (OpenAI, Azure OpenAI embeddings)
  └─ Enables semantic search, similarity matching, ML workflows
Step 10: VECTOR STORE
  └─ Stores embeddings in Vector Database (Azure Cognitive Search, Pinecone, etc.)
  └─ Indexes vectors for semantic retrieval and similarity search
  └─ Links vectors to source policy records for traceability
Step 11: RULES TRANSFORMER
  └─ Converts extracted clauses → machine-executable rules
  └─ PolicyAsCode format with vector-backed context
Step 12: POLICY STORE
  └─ PostgreSQL: Structured data + clause metadata
  └─ MongoDB: Audit trail & full document metadata
  └─ Redis: Cache for fast lookups
Step 13: RULE EVALUATOR / DECISION ENGINE
  └─ Claims eligibility checking
  └─ Underwriting decisioning
  └─ Benefit calculation
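
As an illustration of Steps 6 and 7, the sketch below shows one way missing-field detection, a simple consistency check, and date/amount normalization could look. It is a minimal sketch, not the platform's implementation: the field names, the REQUIRED_FIELDS list, the accepted date formats, and the day-first date assumption are all hypothetical and would need to be agreed with the client.

```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal
from typing import Dict, List, Optional
import re


@dataclass
class ExtractedField:
    name: str
    value: Optional[str]
    confidence: float  # 0.0-1.0, as reported by the extractor (Step 5)


# Hypothetical required-field list; the real one would come from the client schema.
REQUIRED_FIELDS = ("policy_number", "effective_date", "coverage_amount")


def normalize_date(raw: str) -> str:
    """Step 7 (sketch): convert a date string to ISO 8601, assuming day-first input."""
    for fmt in ("%d/%m/%Y", "%Y-%m-%d", "%d %B %Y"):
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")


def normalize_amount(raw: str) -> Decimal:
    """Step 7 (sketch): strip currency symbols and separators, keep exact decimals."""
    return Decimal(re.sub(r"[^\d.]", "", raw))


def check_quality(fields: Dict[str, ExtractedField]) -> List[str]:
    """Step 6 (sketch): report missing required fields and one consistency check."""
    issues = [f"missing:{name}" for name in REQUIRED_FIELDS
              if name not in fields or not fields[name].value]
    start, end = fields.get("effective_date"), fields.get("expiry_date")
    if start and end and start.value and end.value:
        # ISO dates compare correctly as strings.
        if normalize_date(end.value) <= normalize_date(start.value):
            issues.append("inconsistent:expiry_date_not_after_effective_date")
    return issues
```

Amounts are kept as Decimal rather than float so that currency values are not subject to binary rounding when passed downstream.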

Data Flow Summary

Step | Component | Input | Output | Purpose
1 | Upload Handler | PDF/Image | Stored document | Ingestion & validation
2 | Document ID | Stored file | Type + metadata | Route to pipeline
3-4 | OCR & Parser | Image | Sections + boundaries | Text extraction & structure
5-6 | Extractor & QC | Structured sections | Extracted fields + quality score | LLM extraction & validation
7-10 | Normalizer → Vector Store | Validated fields | Indexed vectors + clause metadata | Standardization, indexing, vectorization, storage
11-13 | Rules Transformer → Decision | Vectors + clauses | Decision + audit log | Rule generation and business logic execution
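
The data flow above, where each component consumes the previous component's output, can be sketched as an ordered list of stages that pass a shared context along and record a trace for the audit log. This is an illustrative sketch only: the stage functions, the context keys, and the filename-based type detection are hypothetical stand-ins for the real Step 1 and Step 2 components.

```python
from typing import Any, Callable, Dict, List, Tuple

Context = Dict[str, Any]
Stage = Callable[[Context], Context]


def upload_handler(ctx: Context) -> Context:
    # Step 1 stand-in: accept only the file types named in the spec.
    if not ctx["filename"].lower().endswith((".pdf", ".jpg", ".png")):
        raise ValueError(f"unsupported file type: {ctx['filename']}")
    return {**ctx, "stored": True}


def document_identifier(ctx: Context) -> Context:
    # Step 2 stand-in: a real identifier would inspect content, not the filename.
    doc_type = "schedule" if "schedule" in ctx["filename"].lower() else "unknown"
    return {**ctx, "doc_type": doc_type}


def run_pipeline(ctx: Context,
                 stages: List[Tuple[str, Stage]]) -> Tuple[Context, List[str]]:
    """Run stages in order; the trace lists every stage that completed."""
    trace: List[str] = []
    for name, stage in stages:
        ctx = stage(ctx)
        trace.append(name)
    return ctx, trace


result, trace = run_pipeline(
    {"filename": "Home_Schedule.pdf"},
    [("upload_handler", upload_handler),
     ("document_identifier", document_identifier)],
)
```

Keeping stages as plain named callables means a single step can be replaced or re-ordered without touching the others, which is the property the later agent-lifecycle requirement relies on.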

Key Capabilities

Success Criteria (Phase 5)

Technology Stack

Infrastructure

Document Processing

Agent Architecture

Infrastructure

Resources & Timeline

Project Delivery

Kyndryl Team

Role | Duration | FTE
Engagement Lead | 10 weeks | 0.5
AI/Extraction Architect | 10 weeks | 0.8
LLM/Integration Engineer | 10 weeks | 1.0
Azure/DevOps | 10 weeks | 0.5
QA/Testing | 10 weeks | 0.4

Timeline

Risk Mitigation

Risk | Likelihood | Impact | Mitigation
Document quality / OCR accuracy | Medium | High | Azure DI + LLM validation; test diverse samples
LLM hallucinations | Medium | High | Quality gates; human-in-loop for low confidence
Policy domain complexity | Medium | High | Close collaboration with underwriting/claims teams
Azure integration delays | Low | Medium | Pre-stage Azure resources; test environment setup
Data privacy/confidentiality | Medium | High | Anonymize samples; Privacy Act compliance; governance approval

KAF Non-Functional Requirements

Enterprise AI Governance

Kyndryl Agentic Framework (KAF) provides the enterprise control plane for Agentic AI, addressing trust, safety, cost, and compliance gaps that Kubernetes does not natively solve. The Insurance Client UC1 extraction pipeline leverages KAF capabilities:

KAF Dimension | Requirement for Insurance Client UC1 | How It Enables Success
Agent Identity & Trust | Secure agent-to-agent communication (A2A) | Each extraction agent (Indexer, Vectorizer, Transformer, etc.) has a verified identity, certificates, and mutual trust
Agent Lifecycle | Register, version, retire agents dynamically | Hot-swap agents (e.g., upgrade LLM model) without stopping the 13-step pipeline
A2A Protocol & Routing | JSON-RPC 2.0 standardized messaging | Extraction steps communicate reliably; guarantees message delivery and routing
Tool Governance (MCP) | Centralized tool catalog with permissions | Agents discover & execute tools (Azure DI, LLM, Vector DB) with audit trails
LLM Management | Model routing, versioning, A/B testing | Switch between Claude, GPT-4, Azure OpenAI; test model changes before production
Token Economics | Cost attribution & budgets per agent | Track LLM tokens per extraction step; alert when over budget; optimize cost
Prompt Security | Injection & jailbreak protection | Prevent prompt manipulation attacks; secure field extraction from documents
Output Guardrails | Hallucination & PII checks | Validation gates on extracted data; detect anomalies; mask sensitive info
Memory & State | Short & long-term context | Agents retain cross-document context for consistency; session state for auditing
RAG Infrastructure | Governed embeddings & retrieval | Vector store (Step 10) with access controls, refresh policies, audit trail
Human-in-the-Loop | Approval workflows for flagged extractions | Low-confidence fields (< 85%) escalate to reviewers; feedback improves models
Explainability & Audit | Decision traces & audit logs | Full trace of extraction decisions (why a field was extracted, which LLM and version, confidence score)
Agent Observability | Per-agent metrics & cost tracking | Monitor latency, accuracy, cost per step; identify bottlenecks; optimize pipeline
Multi-Tenancy | Tenant-isolated agents & policies | Insurance Client data isolated; separate policies for claims vs. underwriting agents
Graceful AI Degradation | Fallback when models fail | If the LLM is unavailable, escalate the field to a human rather than failing the entire pipeline
Responsible AI | Bias & fairness enforcement | Ensure extraction is unbiased across customer demographics; policy compliance
Testing & Simulation | Synthetic data for regression testing | Test pipeline with 500K synthetic documents before production; validate accuracy targets
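
To make the A2A Protocol & Routing row concrete, the sketch below builds and handles a JSON-RPC 2.0 message between two extraction agents. Only the envelope (the "jsonrpc", "method", "params", "id", "result", and "error" members and the -32601 "Method not found" code) comes from the JSON-RPC 2.0 specification. The method name "clause.index", the params shape, and the handler are hypothetical; KAF's actual message schema is not described in this document.

```python
import json
from typing import Any, Callable, Dict


def make_request(method: str, params: Dict[str, Any], request_id: int) -> str:
    """Serialize a JSON-RPC 2.0 request."""
    return json.dumps(
        {"jsonrpc": "2.0", "method": method, "params": params, "id": request_id}
    )


def handle_request(raw: str,
                   handlers: Dict[str, Callable[[Dict[str, Any]], Any]]) -> str:
    """Dispatch a request to a registered handler and serialize the response."""
    request = json.loads(raw)
    handler = handlers.get(request["method"])
    if handler is None:
        # -32601 is the code the JSON-RPC 2.0 spec reserves for unknown methods.
        body = {"error": {"code": -32601, "message": "Method not found"}}
    else:
        body = {"result": handler(request["params"])}
    return json.dumps({"jsonrpc": "2.0", **body, "id": request["id"]})


# Hypothetical handler standing in for the Clause Indexer agent (Step 8).
handlers = {"clause.index": lambda params: {"indexed": len(params["clauses"])}}

ok = json.loads(handle_request(
    make_request("clause.index", {"clauses": ["4.1", "4.2"]}, 1), handlers))
unknown = json.loads(handle_request(make_request("clause.delete", {}, 2), handlers))
```

Echoing the request "id" in every response is what lets the calling agent match replies to requests when several extraction steps are in flight at once.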

KAF Governance Summary: Every extraction agent in the 13-step pipeline is governed by KAF controls — identity verification, message routing guarantees, cost tracking, tool permissions, output validation, human escalation, audit trails, and explainability. This enables Insurance Client to trust, control, explain, and safely run extraction at enterprise scale.
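
The human escalation and graceful degradation controls can be illustrated with a small routing function. This is a sketch under stated assumptions: the 85% threshold is taken from the Human-in-the-Loop requirement, while ModelUnavailable, route_field, the status values, and the stub extractors are hypothetical names introduced for illustration.

```python
from typing import Any, Callable, Dict, Tuple

REVIEW_THRESHOLD = 0.85  # from the Human-in-the-Loop requirement (< 85% escalates)


class ModelUnavailable(Exception):
    """Hypothetical error raised when the LLM backend cannot be reached."""


def route_field(name: str,
                extract: Callable[[str], Tuple[str, float]]) -> Dict[str, Any]:
    """Accept a field, or escalate it to a reviewer instead of failing the run."""
    try:
        value, confidence = extract(name)
    except ModelUnavailable:
        # Graceful degradation: one failed field goes to a human, the rest continue.
        return {"field": name, "status": "needs_review",
                "reason": "model_unavailable"}
    if confidence < REVIEW_THRESHOLD:
        return {"field": name, "value": value, "confidence": confidence,
                "status": "needs_review", "reason": "low_confidence"}
    return {"field": name, "value": value, "confidence": confidence,
            "status": "accepted"}


# Stub extractors standing in for the Step 5 Policy Extractor.
def confident(name: str) -> Tuple[str, float]:
    return "POL-000123", 0.97


def unsure(name: str) -> Tuple[str, float]:
    return "POL-000l23", 0.61


def offline(name: str) -> Tuple[str, float]:
    raise ModelUnavailable("LLM endpoint timed out")


decisions = [route_field("policy_number", fn) for fn in (confident, unsure, offline)]
```

A field at exactly 0.85 is accepted here because the requirement is stated as "< 85%"; whether the boundary should be inclusive is a decision for the review workflow owners.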
