What I Learned Building RAG Search for Healthcare: The Axlow Architecture | Ryan Katayi

Axlow started with a problem that was easy to understand and hard to solve well.

Healthcare professionals needed answers buried inside payer rules, Medicare documentation, policy PDFs, and dense regulatory pages. The information existed. That was not the issue.

The issue was that finding it was slow.

You could spend an hour opening PDFs, searching for terms that may or may not match the wording in the policy, jumping between sections, and still walk away unsure whether you found the right rule. That is frustrating in any field. In healthcare, it is worse because the cost of being wrong is not just wasted time.

The ask sounded simple:

Give users a search box where they can ask a question in plain English and get a useful answer in seconds.

But the real requirement was stricter:

The answer had to be grounded in the source material.

No confident guessing. No "the policy probably says." No answer that looked impressive but could not be traced back to a document.

That is why Axlow used a RAG architecture.

Retrieval-Augmented Generation (RAG) is one of those phrases that can make a simple idea sound more mysterious than it is. The version I care about is this:

Break trusted documents into searchable chunks.
Store those chunks with metadata and embeddings.
Retrieve the most relevant chunks for a user question.
Give the model only that context.
Make the answer cite the sources.

The model is not the source of truth. The documents are.

That distinction changes almost every design decision.

Why a normal chatbot was the wrong product

If you put a generic chatbot in front of healthcare policy documents, it will happily sound helpful.

That is the danger.

For casual questions, a slightly fuzzy answer may be fine. For payer rules, coverage criteria, billing policies, or documentation requirements, "slightly fuzzy" is not acceptable. Users need to know where the answer came from.

So the product was never just "chat with your PDFs."

It was closer to:

"Search this messy policy universe, explain the answer clearly, and show me exactly which source supports it."

That means the citation experience is not a nice extra. It is part of the core feature.

In early RAG prototypes, people often spend all their attention on the generated answer. I think that is backwards for serious domains. The answer matters, obviously. But the source trail matters just as much.

A user should be able to say:

Which document did this come from?
Which section?
Was this Medicare or a commercial payer?
Is the source current?
Can I open the original and verify it myself?

If the system cannot answer those questions, it is not trustworthy enough.

The architecture, in plain English

The Axlow-style pipeline has four major pieces:

ingestion
retrieval
generation
verification

The ingestion pipeline turns documents into structured, searchable data.

The retrieval layer finds the most relevant chunks for a question.

The generation layer writes the answer using those chunks.

The verification layer gives the user and the system a way to judge whether the answer is supported.

Most RAG mistakes happen because one of those layers is treated as "just implementation."

It is not.

Ingestion: boring work that decides answer quality

The first hard part of RAG is not the model.

It is getting the documents into a shape where retrieval can work.

Healthcare policy documents are not clean blog posts. They have headers, footers, tables, weird page breaks, repeated boilerplate, section numbers, cross-references, and important context split across pages.

If ingestion is sloppy, retrieval will be sloppy. If retrieval is sloppy, the model gets bad context. If the model gets bad context, the final answer either misses the point or starts improvising.
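One small piece of that cleanup can be sketched in code. This is a minimal example, not the Axlow implementation: it assumes the parser already returns one string per page, and the helper name `stripRepeatedLines` and the 60% threshold are made up for illustration. It drops lines that repeat on most pages, which are usually headers and footers.

```typescript
// Hypothetical helper: removes lines that appear on most pages (likely headers or footers).
// Assumes `pages` holds the extracted text of each page, one string per page.
function stripRepeatedLines(pages: string[], minShare = 0.6): string[] {
  const pageCounts = new Map<string, number>();

  for (const page of pages) {
    // Count each distinct non-empty line once per page.
    const seen = new Set(
      page
        .split("\n")
        .map((line) => line.trim())
        .filter((line) => line.length > 0)
    );
    seen.forEach((line) => pageCounts.set(line, (pageCounts.get(line) ?? 0) + 1));
  }

  // Never treat a line as boilerplate unless it shows up on at least two pages.
  const threshold = Math.max(2, Math.ceil(pages.length * minShare));

  return pages.map((page) =>
    page
      .split("\n")
      .filter((line) => (pageCounts.get(line.trim()) ?? 0) < threshold)
      .join("\n")
  );
}
```

A rule this blunt can also delete a legitimately repeated sentence, so it is worth spot-checking the removed lines against a few real documents.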

The ingestion pipeline needs to preserve useful metadata:

document title
payer or source
policy type
publication or effective date
page number
section heading
original file URL
chunk order
surrounding context

That metadata becomes important later for filtering, citations, recency, and debugging.

A simplified chunk shape might look like this:

type PolicyChunk = {
  id: string;
  documentId: string;
  sourceName: string;
  documentTitle: string;
  sectionTitle?: string;
  pageNumber?: number;
  chunkIndex: number;
  content: string;
  embedding: number[];
  effectiveDate?: string;
  sourceUrl?: string;
};

The text itself is only part of the record. The context around the text is what lets the product feel trustworthy.

Chunking is not set-and-forget

Chunking sounds like a mechanical problem until you watch retrieval fail.

Chunks that are too small lose context. A tiny paragraph may not include the condition, exception, or definition that makes it meaningful.

Chunks that are too large become noisy. They match too many things and make it harder for the model to find the exact answer.

For dense policy text, I prefer starting with a moderate chunk size (roughly 900 characters) and a modest overlap (roughly 150 characters), then testing against real questions.

Something like this is a starting point, not a universal answer:

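function chunkText(text: string, size = 900, overlap = 150): string[] {
  // Guard: with overlap >= size the window would never advance.
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new Error("Expected size > 0 and 0 <= overlap < size");
  }

  const chunks: string[] = [];
  let start = 0;

  while (start < text.length) {
    const end = Math.min(start + size, text.length);
    chunks.push(text.slice(start, end).trim());
    if (end === text.length) break;
    // Step forward, keeping `overlap` characters from the previous chunk.
    start += size - overlap;
  }

  // Drop chunks that were whitespace only.
  return chunks.filter(Boolean);
}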

In production, I want chunking to respect headings and sections where possible. Splitting blindly by character count can cut a rule in half right where the important exception begins.
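A rough sketch of section-aware chunking, under a loud assumption: headings are detected with a simple numbered-heading pattern such as "4.2 Prior Authorization". Real policy documents will need their own rule, and the names `looksLikeHeading`, `splitIntoSections`, and `chunkBySection` are illustrative only.

```typescript
type Section = { heading: string | null; body: string };
type SectionChunk = { sectionTitle: string | null; content: string };

// Assumption: headings look like "4.2 Prior Authorization". Swap in a rule that fits your documents.
const looksLikeHeading = (line: string): boolean => /^\d+(\.\d+)*\s+\S/.test(line.trim());

function splitIntoSections(text: string, isHeading = looksLikeHeading): Section[] {
  const sections: Section[] = [];
  let current: Section = { heading: null, body: "" };

  const flush = () => {
    if (current.heading !== null || current.body.trim().length > 0) {
      sections.push({ heading: current.heading, body: current.body.trim() });
    }
  };

  for (const line of text.split("\n")) {
    if (isHeading(line)) {
      flush(); // close the previous section before starting a new one
      current = { heading: line.trim(), body: "" };
    } else {
      current.body += line + "\n";
    }
  }
  flush();
  return sections;
}

// Slides a character window inside each section, so no chunk crosses a heading.
function chunkBySection(text: string, size = 900, overlap = 150): SectionChunk[] {
  const step = Math.max(1, size - overlap);
  const chunks: SectionChunk[] = [];

  for (const section of splitIntoSections(text)) {
    for (let start = 0; start < section.body.length; start += step) {
      const content = section.body.slice(start, start + size).trim();
      if (content) chunks.push({ sectionTitle: section.heading, content });
      if (start + size >= section.body.length) break;
    }
  }
  return chunks;
}
```

Each chunk carries its section title, which can feed the `sectionTitle` field on the stored record and later the citation.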

For healthcare search, I also like storing nearby context. Even if retrieval returns one chunk, the UI or answer step may need the previous and next chunk to make the source readable.
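Fetching that nearby context can be as simple as looking up adjacent chunk indexes in the same document. A minimal in-memory sketch, assuming chunks carry `documentId` and `chunkIndex` as in the shape above (the `withNeighbors` helper is hypothetical; in a real system this would be a database query):

```typescript
type StoredChunk = { documentId: string; chunkIndex: number; content: string };

// Returns the hit plus up to `radius` chunks before and after it, in reading order.
function withNeighbors(hit: StoredChunk, all: StoredChunk[], radius = 1): StoredChunk[] {
  return all
    .filter(
      (chunk) =>
        chunk.documentId === hit.documentId &&
        Math.abs(chunk.chunkIndex - hit.chunkIndex) <= radius
    )
    .sort((a, b) => a.chunkIndex - b.chunkIndex);
}
```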

Embeddings make search semantic, not magical

Embeddings turn text into vectors, which makes semantic search possible. That is useful because users do not always ask questions using the exact wording from a policy.

OpenAI's embeddings docs describe embeddings as numerical representations used for things like search, recommendations, clustering, and classification. At the time of writing, their smaller general-purpose model, text-embedding-3-small, produces 1536-dimensional vectors by default, with better quality and lower cost than the older text-embedding-ada-002. In RAG, the common pattern is to embed document chunks, embed the user's question, then compare those vectors to find related chunks.

That gets you past simple keyword matching.
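The comparison step is usually cosine similarity. Here is a small in-memory sketch to make the idea concrete; it assumes the embeddings were already computed elsewhere, and `topK` is an illustrative helper rather than anything from a library. A vector database does the same thing with an index instead of a full scan.

```typescript
type EmbeddedChunk = { id: string; embedding: number[] };

// Cosine similarity: 1 means same direction, 0 means unrelated (orthogonal).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero for empty vectors
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force top-k: score every chunk against the query, keep the best k.
function topK(queryEmbedding: number[], chunks: EmbeddedChunk[], k: number) {
  return chunks
    .map((chunk) => ({ id: chunk.id, similarity: cosineSimilarity(queryEmbedding, chunk.embedding) }))
    .sort((a, b) => b.similarity - a.similarity)
    .slice(0, k);
}
```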

But embeddings do not understand your product requirements by themselves.

They can retrieve related text. They cannot decide whether a source is current, whether the payer matters, whether a policy has been superseded, or whether the retrieved context is enough to answer safely.

That is product logic.
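A sketch of what that product logic might look like as a post-retrieval filter. The fields here are assumptions for illustration: `status` is a hypothetical column, dates are ISO `YYYY-MM-DD` strings, and the similarity floor is something you would tune against real questions.

```typescript
type Candidate = {
  organizationId: string;
  status: "active" | "superseded"; // hypothetical field, not part of the chunk shape above
  effectiveDate?: string; // ISO date, e.g. "2024-01-01"
  similarity: number;
};

type RequestContext = {
  organizationId: string;
  asOf: string; // ISO date the answer should be valid for
  minSimilarity: number;
};

function isEligible(candidate: Candidate, ctx: RequestContext): boolean {
  // Permissions first: never return another organization's documents.
  if (candidate.organizationId !== ctx.organizationId) return false;
  // Superseded policies should not back an answer.
  if (candidate.status !== "active") return false;
  // ISO dates compare correctly as strings; skip policies not yet in effect.
  if (candidate.effectiveDate !== undefined && candidate.effectiveDate > ctx.asOf) return false;
  // Weak matches are dropped so the model is not handed loosely related text.
  return candidate.similarity >= ctx.minSimilarity;
}
```

The permission check is better enforced inside the query itself; this filter is a second line of defense, not a replacement.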

The vector search layer might answer:

"Which chunks are semantically close to this question?"

The application still has to ask:

"Are these chunks allowed for this user, current enough, specific enough, and useful enough to answer?"

pgvector is enough until it is not

For many early RAG products, Postgres plus pgvector is a very reasonable place to start.

Supabase supports pgvector, which lets you store embeddings and perform vector similarity search inside Postgres. That is attractive because your documents, metadata, permissions, and app data can live in one system.

That matters more than people think.

If your authorization model is in Postgres but your vectors live somewhere else, you need to be very careful not to retrieve documents a user should not see. In a healthcare or enterprise setting, retrieval permissions are not optional.

The simple version:

create extension if not exists vector;

create table policy_chunks (
  id uuid primary key default gen_random_uuid(),
  document_id uuid not null,
  organization_id uuid not null,
  content text not null,
  metadata jsonb not null default '{}',
  embedding vector(1536)
);

Then your retrieval query can combine similarity with normal filters:

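-- query_embedding and current_organization_id stand in for bound parameters.
-- <=> is pgvector's cosine distance operator, so 1 - distance is cosine similarity.
select
  id,
  document_id,
  content,
  metadata,
  1 - (embedding <=> query_embedding) as similarity
from policy_chunks
where organization_id = current_organization_id
order by embedding <=> query_embedding
limit 8;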

That is the appeal: semantic search plus relational constraints.
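From application code, the same query would normally be sent with bound parameters rather than string concatenation. A hedged sketch: `buildChunkSearch` is a made-up helper, it assumes the `policy_chunks` table from the example above, and the `{ text, values }` shape is the kind of object a client such as node-postgres accepts. pgvector reads a vector from its text form, like `[0.1,0.2]`, when it is cast with `::vector`.

```typescript
type SqlQuery = { text: string; values: unknown[] };

// Serializes an embedding into pgvector's text format, e.g. "[0.1,0.2,0.3]".
function toVectorLiteral(embedding: number[]): string {
  if (!embedding.every((value) => Number.isFinite(value))) {
    throw new Error("Embedding contains a non-finite number");
  }
  return `[${embedding.join(",")}]`;
}

// Builds a parameterized similarity search scoped to one organization.
function buildChunkSearch(queryEmbedding: number[], organizationId: string, limit = 8): SqlQuery {
  return {
    text: [
      "select id, document_id, content, metadata,",
      "       1 - (embedding <=> $1::vector) as similarity",
      "from policy_chunks",
      "where organization_id = $2",
      "order by embedding <=> $1::vector",
      "limit $3",
    ].join("\n"),
    values: [toVectorLiteral(queryEmbedding), organizationId, limit],
  };
}
```

Keeping the organization filter inside the same statement means the permission check cannot be skipped by a caller that forgets to filter afterwards.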

Dedicated vector databases can absolutely make sense. Pinecone, Qdrant, Weaviate, and others give you scaling, indexing, hybrid retrieval features, and operational tooling. But I do not reach for extra infrastructure until the product needs it.

Start simple enough to evaluate. Scale when the bottleneck is real.

Retrieval is where the product either earns trust or loses it

The easiest RAG demo is:

Embed everything.
Retrieve top 5 chunks.
Send them to the model.
Print the answer.

That can work for a demo. It is usually not enough for serious search.

Healthcare policies include acronyms, codes, payer-specific names, and exact phrases. Pure semantic search can miss exact terminology. Pure keyword search can miss paraphrased questions.

That is why hybrid search is often worth testing: combine vector similarity with keyword or full-text search, then rerank or merge results.
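One common way to merge the two result lists is reciprocal rank fusion (RRF). To be clear, this names a general technique, not necessarily what Axlow shipped. The sketch assumes each search returns chunk ids ordered best first, and uses the conventional constant `k = 60`.

```typescript
// Reciprocal rank fusion: each list contributes 1 / (k + rank) per item, with rank starting at 1.
// Items that rank well in several lists rise to the top.
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();

  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1));
    });
  }

  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map((entry) => entry[0]);
}

// Usage: ids from vector search and from keyword search, each ordered best first.
const fused = reciprocalRankFusion([
  ["a", "b", "c"], // vector results
  ["b", "d", "a"], // keyword results
]);
// fused is ["b", "a", "d", "c"]
```

RRF only needs ranks, not comparable scores, which is convenient because vector similarity and full-text scores live on different scales.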

The retrieval layer should also respect filters:

payer
document type
date
jurisdiction
organization access
policy status

A good answer from the wrong payer is a bad answer.

That seems obvious, but it is exactly the sort of mistake a generic RAG demo will make.

The prompt should be boring and strict

I do not want a creative model in this part of the product.

I want a careful one.

The generation prompt should make the hierarchy clear:

You answer healthcare policy questions using only the provided sources.

Rules:
- Use only the context below.
- If the context does not answer the question, say you do not know.
- Cite every factual claim with the source id.
- Do not invent policy details.
- Do not merge rules from different payers unless the user asked for comparison.
- Keep the answer concise, then list the supporting sources.

Question:
{{question}}

Context:
{{retrieved_chunks}}

This does not eliminate hallucination. Nothing about RAG magically does that.

But it gives the model less room to wander and gives the product a structure to validate.
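Filling that template is plain string assembly. A minimal sketch, assuming retrieved chunks carry an `id` and a `sourceName`; the `buildPrompt` helper and the `[source id]` label format are illustrative choices. It returns `null` when nothing was retrieved, so the caller can answer "I do not know" without calling the model at all.

```typescript
type ContextChunk = { id: string; sourceName: string; content: string };

const RULES = [
  "Use only the context below.",
  "If the context does not answer the question, say you do not know.",
  "Cite every factual claim with the source id.",
  "Do not invent policy details.",
  "Do not merge rules from different payers unless the user asked for comparison.",
  "Keep the answer concise, then list the supporting sources.",
];

function buildPrompt(question: string, chunks: ContextChunk[]): string | null {
  // No context means no grounded answer is possible.
  if (chunks.length === 0) return null;

  // Label every chunk with its id so the model has something concrete to cite.
  const context = chunks
    .map((chunk) => `[source ${chunk.id}] ${chunk.sourceName}\n${chunk.content}`)
    .join("\n\n");

  return [
    "You answer healthcare policy questions using only the provided sources.",
    "",
    "Rules:",
    ...RULES.map((rule) => `- ${rule}`),
    "",
    "Question:",
    question.trim(),
    "",
    "Context:",
    context,
  ].join("\n");
}
```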

For higher-risk answers, I also like forcing the model into a structured response:

type RagAnswer = {
  answer: string;
  confidence: "high" | "medium" | "low";
  citations: {
    sourceId: string;
    quote: string;
    pageNumber?: number;
  }[];
  missingInformation?: string[];
};

When the model has to name missing information, it is less likely to cover uncertainty with confident prose.
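A structured response also gives the application something to check mechanically. The sketch below, with hypothetical names, verifies that every cited quote actually appears in the chunk it claims to come from. Matching ignores case and whitespace differences, which is a deliberately lenient assumption.

```typescript
type Citation = { sourceId: string; quote: string };
type SourceChunk = { id: string; content: string };

// Collapse whitespace and case so line breaks in the PDF text do not cause false alarms.
const normalize = (text: string): string => text.replace(/\s+/g, " ").trim().toLowerCase();

// Returns the citations that cannot be traced back to the retrieved context.
function findUnsupportedCitations(citations: Citation[], sources: SourceChunk[]): Citation[] {
  const contentById = new Map<string, string>();
  for (const source of sources) contentById.set(source.id, normalize(source.content));

  return citations.filter((citation) => {
    const content = contentById.get(citation.sourceId);
    const quote = normalize(citation.quote);
    // Unsupported if the source id is unknown, the quote is empty, or the quote is not in the source.
    return content === undefined || quote.length === 0 || !content.includes(quote);
  });
}
```

If this returns anything, the answer can be downgraded, retried, or shown with a warning instead of being presented as verified.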

Citations are a product feature

The citation UI may be more important than the answer UI.

Users need to inspect the source quickly. If citations are buried, vague, or hard to open, the product asks users to trust the AI. That is the wrong ask.

The better ask is:

"Here is the answer, and here is the evidence."

Good citations should include:

document title
source or payer
section name if available
page number if available
highlighted excerpt
link to open the original document

For Axlow-like products, I want users to move from answer to source without friction. The answer gets them oriented. The source lets them decide.
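On the data side, a citation is mostly metadata plus an excerpt the UI can highlight. A small sketch under assumptions: field names follow the chunk shape from earlier, and both helpers (`formatCitationLabel`, `highlightExcerpt`) are illustrative rather than taken from the product.

```typescript
type CitationSource = {
  documentTitle: string;
  sourceName: string;
  sectionTitle?: string;
  pageNumber?: number;
};

// Builds a short label, skipping fields that are not available.
function formatCitationLabel(source: CitationSource): string {
  const parts = [source.documentTitle, source.sourceName];
  if (source.sectionTitle) parts.push(source.sectionTitle);
  if (source.pageNumber !== undefined) parts.push(`p. ${source.pageNumber}`);
  return parts.join(", ");
}

// Splits chunk text around the quoted passage so the UI can wrap `match` in a highlight.
function highlightExcerpt(content: string, quote: string) {
  const index = quote.length > 0 ? content.indexOf(quote) : -1;
  if (index === -1) return null; // quote not found verbatim, so there is nothing to highlight
  return {
    before: content.slice(0, index),
    match: quote,
    after: content.slice(index + quote.length),
  };
}
```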

That is how you build trust in a domain where trust is earned slowly.

Evaluation should start earlier than feels necessary

If I built the same system again, I would invest in evaluation earlier.

Not fancy evaluation. A simple set of real questions with expected source documents.

For example:

type RetrievalEval = {
  question: string;
  expectedDocumentIds: string[];
  mustIncludeTerms?: string[];
  shouldNotIncludeDocumentIds?: string[];
};

Then I would run that set whenever chunking, embeddings, retrieval filters, or ranking changed.

The questions should include:

easy exact-match questions
paraphrased questions
acronym-heavy questions
payer-specific questions
questions with no answer
questions where two policies look similar but only one applies

The goal is not to make a perfect benchmark. The goal is to notice when an "improvement" makes retrieval worse.
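A runner for that set does not need to be elaborate. The sketch below assumes a `retrieve` function that returns document ids for a question; it is synchronous here only to keep the example short, while a real retriever would be async. The names are illustrative.

```typescript
type RetrievalEval = {
  question: string;
  expectedDocumentIds: string[];
  shouldNotIncludeDocumentIds?: string[];
};

type Retriever = (question: string) => string[];

function runRetrievalEval(cases: RetrievalEval[], retrieve: Retriever) {
  const results = cases.map((testCase) => {
    const retrieved = new Set(retrieve(testCase.question));
    // Expected documents that retrieval failed to surface.
    const missing = testCase.expectedDocumentIds.filter((id) => !retrieved.has(id));
    // Documents that must not appear, e.g. the wrong payer's policy.
    const forbidden = (testCase.shouldNotIncludeDocumentIds ?? []).filter((id) => retrieved.has(id));
    return {
      question: testCase.question,
      passed: missing.length === 0 && forbidden.length === 0,
      missing,
      forbidden,
    };
  });

  const passed = results.filter((result) => result.passed).length;
  return { passRate: results.length === 0 ? 0 : passed / results.length, results };
}
```

A no-answer question is expressed as an empty `expectedDocumentIds` list plus any documents that should stay out. Tracking `passRate` across changes is what exposes a regression.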

Without evaluation, RAG development becomes vibes with embeddings.

What I would do differently next time

Three things.

First, I would build a small retrieval evaluation set before polishing the answer UI.

It is tempting to make the generated answer look beautiful early. But if retrieval is weak, the beautiful answer is just well-formatted uncertainty.

Second, I would preserve more document structure during ingestion.

Headings, section hierarchy, tables, and page numbers matter. If you flatten everything too aggressively, you spend the rest of the project trying to recover context you threw away.

Third, I would build source inspection sooner.

Users do not just want an answer. They want to know whether they can rely on it. The original document viewer, citation highlights, and source metadata are not secondary for healthcare search. They are central.

The biggest lesson

RAG is not magic.

It is a search system with a language model at the end.

The model gets the attention because it writes the sentence users see. But the quality of that sentence depends on everything before it: document parsing, chunking, metadata, embeddings, retrieval filters, ranking, permissions, and citation design.

If those pieces are weak, the model becomes a very articulate way to hide bad search.

If those pieces are strong, the model becomes useful because it is grounded.

That was the real lesson from Axlow.

The goal was never to make healthcare professionals trust an AI answer.

The goal was to help them find the right source faster, understand it faster, and still stay in control of the final judgment.

That is the version of AI search I trust.