Case · 05 / 05
Anonymized
Document AI · Arabic · Tier A · 7 weeks

Arabic RAG pipeline over 100 regulatory PDFs.

Open-source Arabic LLM stack with hierarchical retrieval over Milvus. Bilingual querying, Arabic-first prompting, on-prem-ready by design. Built without commercial model dependencies.

Headline outcome: 100 PDFs indexed on a fully open-weight stack
Engagement: Tier A · 7 weeks
Team: 1 founder + 1 contractor
Stack: Milvus, Llama, Python
Status: Shipped · in production
01 · The problem

A regulatory team needed to query 100 Arabic PDFs without sending data to a US vendor.

The client's compliance team works across a corpus of regulatory and policy documents — most in formal Arabic, some bilingual Arabic-English. Reading the corpus end-to-end was a multi-day exercise; finding the specific clause that answered a specific question was worse. Off-the-shelf RAG built on commercial LLMs would have solved retrieval, but the data couldn't leave the client's infrastructure.

That ruled out OpenAI, Anthropic, and every other US-hosted commercial model. The system had to run on open-weight models, on infrastructure the client controlled, with retrieval that respected document-level access permissions. And it had to be Arabic-first — most open Arabic LLMs at the time of build were tuned on translated English data, which produced answers that read as awkward or culturally off.

The team also wanted bilingual querying. A user might ask a question in English about an Arabic document, or vice versa. The system had to handle either direction without losing context or citation accuracy.

02 · What we built

A hierarchical retrieval pipeline on an open-weight Arabic stack.

Fig. 02.E · Bilingual RAG topology (production)
Query → route → retrieve → answer: AR or EN query → language router + AR embedding → doc-level retrieval (title + summary) → chunk-level retrieval (overlap windows) → Milvus hybrid retrieval → on-prem Llama AR → cited answer.

The pipeline starts with a language router. Every incoming query is classified as Arabic, English, or mixed, then routed through an Arabic-tuned embedding model. English queries get a translation pass before embedding so the vector representation lands in the same space as the Arabic corpus. This is the part most off-the-shelf solutions get wrong, and it's why their bilingual queries usually return weak matches.
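The routing step can be illustrated with a minimal sketch. It assumes the router decides by the share of Arabic versus Latin letters in the query; the source does not say how classification was implemented, so the function names, the 20% threshold, and the handling of mixed queries are all hypothetical.

```python
import unicodedata
from typing import Optional


def script_of(ch: str) -> Optional[str]:
    """Return 'ar' or 'en' for letters; None for digits, spaces, punctuation."""
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    if name.startswith("ARABIC"):
        return "ar"
    if name.startswith("LATIN"):
        return "en"
    return None


def classify_query(text: str, mixed_threshold: float = 0.2) -> str:
    """Label a query 'ar', 'en', or 'mixed' by its share of letters per script.

    mixed_threshold is an illustrative value, not one taken from the project.
    """
    counts = {"ar": 0, "en": 0}
    for ch in text:
        script = script_of(ch)
        if script is not None:
            counts[script] += 1
    total = counts["ar"] + counts["en"]
    if total == 0:
        return "en"  # assumption: symbol-only input falls back to English
    if min(counts.values()) / total >= mixed_threshold:
        return "mixed"
    return "ar" if counts["ar"] > counts["en"] else "en"


def route(text: str) -> dict:
    """Decide whether the query needs a translation pass before embedding."""
    lang = classify_query(text)
    # Only the English case is described in the write-up; how mixed queries
    # were handled is not stated, so this sketch leaves them untranslated.
    return {"lang": lang, "translate_before_embedding": lang == "en"}
```

For example, `route("What are the licensing requirements?")` flags the query for translation, while an Arabic query goes straight to the embedding model.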

Retrieval is hierarchical. A doc-level pass finds the candidate documents: each document's title and machine-generated summary are embedded as a single vector, which makes for a fast first filter. A chunk-level pass then retrieves overlapping windows from the candidate set at sentence-level granularity. Milvus handles both tiers with hybrid search (dense + sparse), and metadata filtering enforces the access permissions the compliance team set on each document.
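The two-tier flow can be sketched in plain Python. This is an in-memory stand-in for the idea only, not the production Milvus code: the `Doc` and `Chunk` types, the `allowed_groups` field, and the cosine ranking are assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    vector: list[float]        # embedding of title + summary (hypothetical)
    allowed_groups: set[str]   # hypothetical per-document permission field


@dataclass
class Chunk:
    doc_id: str
    page: int
    text: str
    vector: list[float]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec, docs, chunks, user_groups, top_docs=5, top_chunks=3):
    """Permission filter, then doc-level pass, then chunk-level pass."""
    # Documents the user may not see are dropped before any ranking happens.
    visible = [d for d in docs if d.allowed_groups & user_groups]
    # Doc-level pass: rank by similarity to the title + summary vector.
    ranked = sorted(visible, key=lambda d: cosine(query_vec, d.vector), reverse=True)
    candidate_ids = {d.doc_id for d in ranked[:top_docs]}
    # Chunk-level pass: only chunks from candidate documents compete.
    pool = [c for c in chunks if c.doc_id in candidate_ids]
    return sorted(pool, key=lambda c: cosine(query_vec, c.vector), reverse=True)[:top_chunks]
```

Because the permission filter runs first, a chunk from a restricted document cannot surface even when it is the closest match to the query.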

The Llama-based Arabic model generates answers from the retrieved chunks. Every answer carries citations back to the source document and page. The model runs on the client's GPU infrastructure — no data leaves their network. We benchmarked half a dozen open Arabic LLMs against the client's own evaluation set and picked the one that scored best on formal-Arabic citation accuracy.
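One way to hold answers to the citation requirement is a post-generation check that every cited document and page was actually among the retrieved chunks. The sketch below assumes a hypothetical tag format, `[doc:ID p:N]`; the source does not specify the format the project used.

```python
import re

# Hypothetical citation tag, e.g. "[doc:REG-012 p:4]".
CITATION_RE = re.compile(r"\[doc:(?P<doc>[A-Za-z0-9_-]+)\s+p:(?P<page>\d+)\]")


def extract_citations(answer: str) -> list[tuple[str, int]]:
    """Pull (doc_id, page) pairs out of a generated answer."""
    return [(m.group("doc"), int(m.group("page"))) for m in CITATION_RE.finditer(answer)]


def check_citations(answer: str, retrieved: set[tuple[str, int]]) -> dict:
    """Flag answers with no citations or with citations outside the retrieved set."""
    cited = extract_citations(answer)
    unsupported = [c for c in cited if c not in retrieved]
    return {
        "cited": cited,
        "unsupported": unsupported,
        "ok": bool(cited) and not unsupported,
    }
```

An answer that fails the check could be regenerated or routed to a human reviewer rather than shown as-is.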

03 · How we built it

Four phases. Seven weeks.

01 · Map

Two-day discovery

Worked with the compliance team to understand the corpus, the question types, and the constraints. Built an evaluation set of 80 question-answer pairs with grounded citations as the project's accuracy target.

Days 1–2
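The evaluation set from the Map phase could be represented roughly as follows. The field names and the validation rules are assumptions; the source only states that each question-answer pair carried grounded citations.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Citation:
    doc_id: str
    page: int


@dataclass
class EvalItem:
    question: str
    query_lang: str            # "ar" or "en" (hypothetical field)
    reference_answer: str
    citations: list[Citation] = field(default_factory=list)


def validate(items: list[EvalItem]) -> list[tuple[int, str]]:
    """Return (index, reason) for items that cannot serve as ground truth."""
    problems = []
    for i, item in enumerate(items):
        if not item.question.strip() or not item.reference_answer.strip():
            problems.append((i, "empty question or answer"))
        elif item.query_lang not in {"ar", "en"}:
            problems.append((i, "unknown query language"))
        elif not item.citations:
            problems.append((i, "no grounded citation"))
    return problems
```

Running `validate` before any tuning keeps ungrounded pairs from inflating or deflating the accuracy target.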
02 · Build

Ingest + retrieval first

Built the document ingest pipeline, the language router, and the hierarchical Milvus retrieval before any generation. Tuned retrieval against the eval set until top-3 chunks contained the answer 90% of the time.

Weeks 1–3
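The retrieval target from the Build phase, the share of questions whose top-3 chunks contain the answer, can be computed with a short helper. This sketch approximates "contains the answer" as "matches a gold (document, page) citation", which is an assumption; the project's exact matching rule is not given.

```python
def top_k_hit_rate(results: dict, gold: dict, k: int = 3) -> float:
    """Fraction of questions with at least one gold citation in the top k.

    results: question id -> ranked list of (doc_id, page) for retrieved chunks
    gold:    question id -> set of (doc_id, page) that ground the answer
    """
    if not gold:
        raise ValueError("gold set is empty")
    hits = 0
    for qid, answers in gold.items():
        top = results.get(qid, [])[:k]
        if any(candidate in answers for candidate in top):
            hits += 1
    return hits / len(gold)
```

Questions missing from `results` count as misses, so the score cannot rise by silently skipping hard queries.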
03 · Wire

Model selection + bilingual layer

Benchmarked open Arabic LLMs against the eval set. Built the bilingual query layer, which translates English queries to Arabic before embedding. Iterated on prompt structure for citation discipline: Arabic answers with English citation tags read cleaner than the inverse.

Weeks 4–5
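The prompt structure from the Wire phase, Arabic instructions and answers with English citation tags, might look something like this. The instruction wording, the `[doc:ID p:N]` tag format, and the `RetrievedChunk` type are all hypothetical; the actual prompt is not published.

```python
from dataclasses import dataclass


@dataclass
class RetrievedChunk:
    doc_id: str
    page: int
    text: str


# Hypothetical instruction: answer in formal Arabic, use only the passages,
# cite with the English tag after each statement, admit when unsure.
SYSTEM_AR = (
    "أجب باللغة العربية الفصحى اعتمادا على المقاطع التالية فقط. "
    "اذكر المصدر بعد كل معلومة بالصيغة [doc:ID p:N]. "
    "إذا لم تجد الإجابة في المقاطع فقل إنك لا تعرف."
)


def build_prompt(question_ar: str, chunks: list[RetrievedChunk]) -> str:
    """Assemble an Arabic-first prompt with each chunk labelled by its tag."""
    context = "\n\n".join(f"[doc:{c.doc_id} p:{c.page}]\n{c.text}" for c in chunks)
    return f"{SYSTEM_AR}\n\nالمقاطع:\n{context}\n\nالسؤال: {question_ar}\nالإجابة:"
```

Labelling each chunk with the same tag the model is asked to emit gives it something concrete to copy, which tends to make citations easier to verify afterwards.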
04 · Ship

On-prem deploy

Deployed entirely inside the client's GPU infrastructure. Handover included a Loom walkthrough in Arabic and English, a runbook, and a 30-day support tail. The compliance team's reviewers had final say on every cited answer for the first month.

Weeks 6–7
04 · Stack & tradeoffs

Why these tools.

Open-weight everything. The client's data couldn't leave their network, which ruled out hosted commercial LLM APIs. We benchmarked Arabic-tuned variants of Llama against the eval set; the winner balanced citation discipline with formal-Arabic fluency. The same family also gave us a path to fine-tuning later if the client wanted to specialize on their corpus.

Milvus for the vector store. The client's IT was comfortable running it on-prem; pgvector was an option but Milvus's hybrid search and metadata filtering made the document-level access controls cleaner to implement. Python for the service layer because the client's existing data team worked in Python, which made post-handoff support and extension straightforward.
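Hybrid search has to merge a dense ranking and a sparse ranking into one list. Reciprocal rank fusion is one common way to do that, and the sketch below shows the idea in plain Python. Whether the project used this fusion method or a weighted one is not stated, and the constant `k=60` is the conventional default rather than a project setting.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk ids into one fused ranking.

    Each list contributes 1 / (k + rank) per item, with rank starting at 1,
    so items ranked well by both retrievers rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id so the output is deterministic.
    return sorted(scores, key=lambda item: (-scores[item], item))
```

With a dense ranking of `["c1", "c2", "c3"]` and a sparse ranking of `["c2", "c4", "c1"]`, the chunks found by both retrievers (`c2`, `c1`) come out ahead of those found by only one.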

Considered and rejected: a translate-everything-to-English approach (lost meaning on legal terms-of-art), a single flat-retrieval index (couldn't enforce per-document access controls), and a smaller Arabic model fine-tuned in-house (timeline too long for the scope). The hierarchical retrieval was the unlock — most "RAG over Arabic PDFs" approaches we evaluated failed on either the language layer or the access-control layer; very few tried to solve both.

05 · Outcomes

What changed after deploy.

Top-3 retrieval accuracy: ~90% on eval set
Vendor lock-in: zero (no commercial model dependency)
Bilingual coverage: AR ↔ EN, either direction

Illustrative ranges. Specific client metrics are confirmed under NDA. Numbers shown reflect reported outcomes at handover.
