AI & Automation · June 13, 2025 · 6 min read

How to Train Your AI Agent on Company Documents (The Right Way)

Dumping a PDF into a vector database is not a knowledge base. Chunking, metadata hygiene, retrieval tuning, and answer grounding each add measurable accuracy. Here's the full process.

Lawrence

Founder, Chatzuri

The fastest way to build an underperforming AI agent is to upload your documents without thinking about how they'll be used at inference time. The vector database doesn't care about your document structure. It will index whatever you give it. Getting good results requires active document preparation, not passive uploading.

Step 1: Decide What Not to Upload

Every irrelevant document uploaded to your knowledge base increases retrieval noise. When the agent is searching for 'refund policy,' it should find your refund policy, not a Board of Directors meeting agenda that contains the word 'refund' in passing. Before upload, go through your document library and remove internal-only documents, meeting notes, financial reports, and anything else that matches support query keywords only incidentally.
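As a rough illustration, the pruning pass can be as simple as an exclusion list applied before anything reaches the uploader. The category labels, the `Document` class, and the sample titles below are assumptions made up for this sketch; they are not part of any particular platform.

```python
from dataclasses import dataclass

# Hypothetical category labels; adapt them to how your library is organised.
EXCLUDED_CATEGORIES = {"internal", "meeting_notes", "financial_report"}


@dataclass
class Document:
    title: str
    category: str


def select_for_upload(docs):
    """Keep only documents whose category is not on the exclusion list."""
    return [d for d in docs if d.category not in EXCLUDED_CATEGORIES]


# Made-up sample library for illustration.
library = [
    Document("Refund Policy", "policy"),
    Document("Board Meeting Agenda", "meeting_notes"),
    Document("Q3 Financial Report", "financial_report"),
    Document("Shipping FAQ", "faq"),
]

to_upload = select_for_upload(library)
print([d.title for d in to_upload])
```

An explicit exclusion list also gives you something to review: if a document type keeps polluting retrieval, you add one label rather than hunting for individual files.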

Step 2: Clean Your Documents Before Upload

PDF exports from presentations carry formatting artefacts that confuse chunk parsers. Word documents carry revision marks and comment text. Spreadsheets convert to tables that lose column context when chunked naively. The right preparation depends on the document type:

Policy documents: export as clean plain text or Markdown; preserve heading hierarchy
FAQs: structure as explicit Q&A pairs with a blank line between each
Product catalogues: convert to a structured format — one product per section, consistent attribute naming
Pricing tables: convert to plain text lists rather than grid tables
Process guides: use numbered steps; each step should be self-contained
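To make the pricing-table advice concrete, here is a minimal sketch that turns each table row into a self-contained line repeating the column names, so a chunk still makes sense when it is retrieved without the header row. The function name and the sample plans and prices are invented for illustration.

```python
def rows_to_text(headers, rows):
    """Turn each table row into one line that repeats the column names."""
    lines = []
    for row in rows:
        pairs = [f"{h}: {v}" for h, v in zip(headers, row)]
        lines.append("- " + "; ".join(pairs))
    return "\n".join(lines)


# Made-up pricing data, not real plans.
headers = ["Plan", "Monthly price", "Messages included"]
rows = [
    ["Starter", "$10", "1,000"],
    ["Growth", "$50", "10,000"],
]

print(rows_to_text(headers, rows))
```

Each output line carries its own column context, which is exactly what a naively chunked grid table loses.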

Step 3: Add Metadata to Every Document

Metadata is what allows you to filter retrieval results: to surface only the currently valid version of a policy, to restrict certain documents to specific agent configurations, or to weight recently updated content more heavily. At minimum, tag every document with its last updated date, its document category (policy, FAQ, product, process), and its market applicability (if you operate across multiple markets).
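A minimal sketch of what those three tags buy you, assuming a hypothetical `DocMeta` record and an `eligible` helper (neither is a real platform API): filter by category and market first, then order by recency so the newest version comes out on top. The document IDs and market codes are made up.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class DocMeta:
    doc_id: str
    category: str        # policy, faq, product, or process
    last_updated: date
    markets: set = field(default_factory=set)  # empty set = applies everywhere


def eligible(docs, category, market):
    """Filter by category and market, then sort newest first."""
    matches = [
        d for d in docs
        if d.category == category and (not d.markets or market in d.markets)
    ]
    return sorted(matches, key=lambda d: d.last_updated, reverse=True)


# Made-up documents and market codes for illustration.
docs = [
    DocMeta("refund-v1", "policy", date(2024, 1, 10), {"KE"}),
    DocMeta("refund-v2", "policy", date(2025, 3, 2), {"KE"}),
    DocMeta("refund-ug", "policy", date(2025, 2, 1), {"UG"}),
    DocMeta("shipping-faq", "faq", date(2025, 4, 5)),
]

current = eligible(docs, "policy", "KE")
print([d.doc_id for d in current])
```

In practice you would likely retire `refund-v1` outright; the recency ordering is a safety net for the cases where an old version slips through.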

Step 4: Test Retrieval Before You Test the Agent

Before evaluating the AI agent's responses, test the retrieval system in isolation. For each of your top 20 query types, check which chunks are being retrieved. If the wrong chunks are coming back, or nothing is coming back, the agent cannot give a good answer regardless of model quality. Most AI platforms, including Chatzuri, expose the retrieved chunks used to generate each response. Use this to debug retrieval separately from response quality.
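One way to run that check is a small hit-rate harness: pair each query with the chunk you expect to see, then count how often it appears in the top results. The `retrieve` function below is a toy word-overlap stand-in written for this sketch; in a real test you would swap in whatever your platform returns as retrieved chunks. All chunk IDs and texts are made up.

```python
def retrieve(query, chunks, k=3):
    """Toy stand-in retriever: rank chunks by number of shared lowercase words."""
    q = set(query.lower().split())
    scored = sorted(
        chunks.items(),
        key=lambda item: len(q & set(item[1].lower().split())),
        reverse=True,
    )
    return [chunk_id for chunk_id, _ in scored[:k]]


def hit_rate(cases, chunks, k=3):
    """Fraction of queries whose expected chunk appears in the top k results."""
    hits = sum(
        1 for query, expected in cases if expected in retrieve(query, chunks, k)
    )
    return hits / len(cases)


# Made-up knowledge base chunks and test queries.
chunks = {
    "refund-policy": "refund requests are accepted within 30 days of purchase",
    "shipping-times": "standard shipping takes three to five business days",
    "password-reset": "reset your password from the account settings page",
}
cases = [
    ("how do I get a refund", "refund-policy"),
    ("how long does shipping take", "shipping-times"),
    ("I forgot my password", "password-reset"),
]

print(hit_rate(cases, chunks, k=1))
```

A low hit rate here points at chunking or metadata problems, not at the language model, which is the whole reason to measure it separately.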

Step 5: Create a Ground Truth Test Set

Before launch, build a test set of 50–100 question-answer pairs: real questions your customers have asked (from historical tickets) with their known-correct answers. Run the agent against this test set and measure accuracy. Your launch threshold should be at least 85% accuracy on this set. Track this number monthly as you update the knowledge base — it's the most reliable ongoing quality signal you have.
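The measurement itself can be sketched in a few lines. Everything here is an assumption for illustration: `fake_agent` stands in for your deployed agent, the three test pairs are invented, and the grader is deliberately crude (it only checks that the expected phrase appears in the answer). Real grading usually needs human review or a stricter rubric.

```python
LAUNCH_THRESHOLD = 0.85  # the minimum accuracy suggested above


def normalise(text):
    """Lowercase and collapse whitespace so comparisons are less brittle."""
    return " ".join(text.lower().split())


def is_correct(answer, expected):
    """Crude grader (assumption): the expected phrase must appear in the answer."""
    return normalise(expected) in normalise(answer)


def accuracy(agent, test_set):
    """Share of test questions the agent answers correctly."""
    correct = sum(
        1 for question, expected in test_set if is_correct(agent(question), expected)
    )
    return correct / len(test_set)


def fake_agent(question):
    """Hypothetical stand-in for a real agent call."""
    canned = {
        "What is the refund window?": "You can request a refund within 30 days.",
        "Do you ship internationally?": "Yes, we ship to most countries.",
    }
    return canned.get(question, "I'm not sure.")


# Made-up test pairs; a real set would come from historical tickets.
test_set = [
    ("What is the refund window?", "within 30 days"),
    ("Do you ship internationally?", "yes"),
    ("How do I reset my password?", "account settings"),
]

score = accuracy(fake_agent, test_set)
print(f"accuracy={score:.2f}, ready={score >= LAUNCH_THRESHOLD}")
```

Storing each month's score next to the date of the knowledge base update makes regressions easy to trace back to a specific change.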

The 24-Hour Cycle

After a major product or policy change, update your knowledge base within 24 hours. Customers who contact support the day of a change and receive outdated information are the most likely to escalate and the least likely to be satisfied.

