How to Train Your AI Agent on Company Documents (The Right Way)
Dumping a PDF into a vector database is not a knowledge base. Chunking, metadata hygiene, retrieval tuning, and answer grounding each measurably improve accuracy. Here's the full process.
Lawrence
Founder, Chatzuri
The fastest way to build an underperforming AI agent is to upload your documents without thinking about how they'll be used at inference time. The vector database doesn't care about your document structure. It will index whatever you give it. Getting good results requires active document preparation, not passive uploading.
Step 1: Decide What Not to Upload
Every irrelevant document uploaded to your knowledge base increases retrieval noise. When the agent is searching for 'refund policy,' it should find your refund policy, not a Board of Directors meeting agenda that contains the word 'refund' in passing. Before upload, go through your document library and remove internal-only documents, meeting notes, financial reports, and anything else that contains only incidental matches for support query keywords.
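A minimal sketch of that pre-upload filter, assuming each document in your library already carries a category label. The `Doc` class, the category names, and the file names are all hypothetical and exist only to illustrate the idea.

```python
from dataclasses import dataclass

# Hypothetical category labels; adjust to however your library is organised.
EXCLUDED_CATEGORIES = {"internal", "meeting_notes", "financial_report"}


@dataclass
class Doc:
    name: str
    category: str


def select_for_upload(docs):
    """Keep only documents whose category is not on the exclusion list."""
    return [d for d in docs if d.category not in EXCLUDED_CATEGORIES]


# Made-up library contents, for illustration only.
library = [
    Doc("refund-policy.md", "policy"),
    Doc("board-agenda-q3.pdf", "meeting_notes"),
    Doc("shipping-faq.md", "faq"),
    Doc("annual-accounts.xlsx", "financial_report"),
]

upload_queue = select_for_upload(library)
print([d.name for d in upload_queue])
```

A category filter like this only catches documents that were labelled correctly. Anything that slips through still needs the manual review described above.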
Step 2: Clean Your Documents Before Upload
PDF exports from presentations carry formatting artefacts that confuse chunk parsers. Word documents carry revision marks and comment text. Spreadsheets convert to tables that lose column context when chunked naively. The right preparation depends on the document type:
- Policy documents: export as clean plain text or Markdown; preserve heading hierarchy
- FAQs: structure as explicit Q&A pairs with a blank line between each
- Product catalogues: convert to a structured format — one product per section, consistent attribute naming
- Pricing tables: convert to plain text lists rather than grid tables
- Process guides: use numbered steps; each step should be self-contained
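To illustrate the pricing-table advice in the list above, here is one possible way to flatten a grid into plain-text lines that survive chunking. It assumes the table is available as CSV. The plan names and prices are invented, and `table_to_lines` is a hypothetical helper rather than part of any platform.

```python
import csv
import io


def table_to_lines(csv_text):
    """Turn each table row into one self-describing line of text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = []
    for row in reader:
        # Repeat the column name next to every value so the row still makes
        # sense if a chunk boundary separates it from the header.
        parts = [f"{col.strip()}: {val.strip()}" for col, val in row.items()]
        lines.append("- " + "; ".join(parts))
    return lines


# Made-up pricing data, for illustration only.
pricing_csv = """Plan,Monthly price,Conversations included
Starter,$10,500
Growth,$40,5000
"""

for line in table_to_lines(pricing_csv):
    print(line)
```

Each output line carries its own column labels, so a chunk containing a single row can still answer "how much is the Growth plan?" without the header row.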
Step 3: Add Metadata to Every Document
Metadata is what allows you to filter retrieval results: to surface only the currently valid version of a policy, to restrict certain documents to specific agent configurations, or to weight recently updated content more heavily. At minimum, tag every document with its last-updated date, its document category (policy, FAQ, product, process), and its market applicability (if you operate across multiple markets).
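The sketch below shows how those three tags could drive a filter that keeps only the newest document per topic for a given market. The `DocMeta` class, the `topic` grouping key, and the market codes are assumptions made for illustration. Real platforms store and query metadata in their own ways.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DocMeta:
    doc_id: str
    category: str        # "policy", "faq", "product" or "process"
    last_updated: date
    markets: tuple       # hypothetical market codes, e.g. ("KE", "UG")
    topic: str           # hypothetical key grouping versions of one document


def current_versions(docs, market):
    """For one market, keep the most recently updated document per topic."""
    latest = {}
    for d in docs:
        if market not in d.markets:
            continue
        best = latest.get(d.topic)
        if best is None or d.last_updated > best.last_updated:
            latest[d.topic] = d
    return sorted(latest.values(), key=lambda d: d.doc_id)


# Made-up documents, for illustration only.
docs = [
    DocMeta("refund-v1", "policy", date(2023, 1, 10), ("KE", "UG"), "refunds"),
    DocMeta("refund-v2", "policy", date(2024, 6, 1), ("KE",), "refunds"),
    DocMeta("delivery-faq", "faq", date(2024, 2, 15), ("KE", "UG"), "delivery"),
]

print([d.doc_id for d in current_versions(docs, "KE")])
print([d.doc_id for d in current_versions(docs, "UG")])
```

In this example the newer refund policy applies only to one market, so the other market keeps seeing the older version. Without the market tag that distinction would be lost.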
Step 4: Test Retrieval Before You Test the Agent
Before evaluating the AI agent's responses, test the retrieval system in isolation. For each of your top 20 query types, check which chunks are being retrieved. If the wrong chunks come back, or nothing comes back, the agent cannot give a good answer regardless of model quality. Most AI platforms, including Chatzuri, expose the retrieved chunks used to generate each response. Use this to debug retrieval separately from response quality.
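A retrieval check of this kind can be scripted. The sketch below uses a toy word-overlap retriever as a stand-in, because the real call depends on your platform. `toy_retrieve`, the chunk IDs, and the queries are all hypothetical and do not represent Chatzuri's API or any vector store.

```python
def toy_retrieve(query, chunks, k=2):
    """Stand-in retriever: rank chunks by how many query words they share.

    A real system would query the vector store here instead.
    """
    q_words = set(query.lower().split())
    scored = []
    for chunk_id, text in chunks.items():
        overlap = len(q_words & set(text.lower().split()))
        if overlap:
            scored.append((overlap, chunk_id))
    # Highest overlap first; ties broken alphabetically by chunk ID.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [chunk_id for _, chunk_id in scored[:k]]


def retrieval_failures(cases, chunks, k=2):
    """Return the queries whose expected chunk is missing from the top k."""
    failures = []
    for query, expected_id in cases:
        if expected_id not in toy_retrieve(query, chunks, k):
            failures.append(query)
    return failures


# Made-up chunks and test queries, for illustration only.
chunks = {
    "refund-policy": "refund requests are accepted within 30 days of purchase",
    "board-agenda": "agenda item four discuss refund volumes for the quarter",
    "delivery-times": "standard delivery takes three to five working days",
}
cases = [
    ("how do i request a refund", "refund-policy"),
    ("how long does delivery take", "delivery-times"),
    ("can i change my password", "account-help"),
]

print(retrieval_failures(cases, chunks, k=2))
print(retrieval_failures(cases, chunks, k=1))
```

With `k=1` the refund query fails in this toy setup because the board agenda ties with the actual policy. That is the kind of noise Step 1 is meant to remove.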
Step 5: Create a Ground Truth Test Set
Before launch, build a test set of 50–100 question-answer pairs: real questions your customers have asked (from historical tickets) with their known-correct answers. Run the agent against this test set and measure accuracy. Your launch threshold should be at least 85% accuracy on this set. Track this number monthly as you update the knowledge base — it's the most reliable ongoing quality signal you have.
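Measuring accuracy against the test set can be as simple as the sketch below. It makes two assumptions: `stub_agent` is a hypothetical placeholder for a call to your deployed agent, and an answer counts as correct when it contains the expected key phrase. That substring check is crude, and many teams will prefer human review or a stricter grader.

```python
LAUNCH_THRESHOLD = 0.85  # the 85% launch bar described above


def normalise(text):
    """Lower-case and collapse whitespace so matching is less brittle."""
    return " ".join(text.lower().split())


def accuracy(test_set, answer_fn):
    """Fraction of questions whose answer contains the expected key phrase."""
    correct = sum(
        1
        for question, expected in test_set
        if normalise(expected) in normalise(answer_fn(question))
    )
    return correct / len(test_set)


def stub_agent(question):
    """Hypothetical stand-in for the deployed agent; replace with a real call."""
    canned = {
        "What is the refund window?": "Refunds are accepted within 30 days.",
        "Do you deliver on Sundays?": "We deliver Monday to Saturday.",
    }
    return canned.get(question, "I'm not sure.")


# Made-up question and answer pairs; a real set would come from past tickets.
test_set = [
    ("What is the refund window?", "within 30 days"),
    ("Do you deliver on Sundays?", "monday to saturday"),
    ("Which cards do you accept?", "visa and mastercard"),
    ("How do I reset my password?", "reset link"),
]

score = accuracy(test_set, stub_agent)
ready = score >= LAUNCH_THRESHOLD
print(f"accuracy={score:.0%}, ready to launch: {ready}")
```

Storing each monthly `score` alongside the date of the knowledge base update makes regressions easy to spot.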
The 24-Hour Cycle
After a major product or policy change, update your knowledge base within 24 hours. Customers who contact support on the day of a change and receive outdated information are the most likely to escalate and the least likely to be satisfied.
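One way to enforce that window is a small staleness check, sketched below. It assumes you keep two timestamps per topic: when the change took effect and when the matching knowledge base document was last updated. The topic names and the `overdue_updates` helper are hypothetical.

```python
from datetime import datetime, timedelta

MAX_LAG = timedelta(hours=24)


def overdue_updates(changes, kb_updated, now):
    """List topics whose knowledge base copy has not caught up in time.

    changes:    topic -> when the product or policy change took effect
    kb_updated: topic -> when the knowledge base document was last updated
    """
    overdue = []
    for topic, changed_at in changes.items():
        updated_at = kb_updated.get(topic)
        refreshed = updated_at is not None and updated_at >= changed_at
        if not refreshed and now - changed_at > MAX_LAG:
            overdue.append(topic)
    return sorted(overdue)


# Made-up timestamps, for illustration only.
now = datetime(2024, 6, 3, 12, 0)
changes = {
    "refunds": datetime(2024, 6, 1, 9, 0),
    "delivery": datetime(2024, 6, 3, 8, 0),
    "pricing": datetime(2024, 6, 1, 9, 0),
}
kb_updated = {
    "refunds": datetime(2024, 5, 1, 10, 0),
    "delivery": datetime(2024, 5, 20, 10, 0),
    "pricing": datetime(2024, 6, 1, 15, 0),
}

print(overdue_updates(changes, kb_updated, now))
```

Here only the refunds document is flagged. Delivery changed four hours ago and is still inside the window, and pricing was refreshed after its change.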