Want one of these for your team?
30-min scope call. By the end you'll know what we'd build, in what order, what it costs.
Open-source Arabic LLM stack with hierarchical retrieval over Milvus. Bilingual querying, Arabic-first prompting, on-prem-ready by design. Built without commercial model dependencies.
The client's compliance team works across a corpus of regulatory and policy documents — most in formal Arabic, some bilingual Arabic-English. Reading the corpus end-to-end was a multi-day exercise; finding the specific clause that answered a specific question was worse. Off-the-shelf RAG built on commercial LLMs would have solved retrieval, but the data couldn't leave the client's infrastructure.
That ruled out OpenAI, Anthropic, and every other US-hosted commercial model. The system had to run on open-weight models, on infrastructure the client controlled, with retrieval that respected document-level access permissions. And it had to be Arabic-first — most open Arabic LLMs at the time of build were tuned on translated English data, which produced answers that read as awkward or culturally off.
The team also wanted bilingual querying. A user might ask a question in English about an Arabic document, or vice versa. The system had to handle either direction without losing context or citation accuracy.
The pipeline starts with a language router. Every incoming query gets classified — Arabic, English, or mixed — and routed through an Arabic-tuned embedding model. English queries get a translation pass before embedding so the vector representation lands in the same space as the Arabic corpus. This is the part most off-the-shelf solutions get wrong, and it's why bilingual queries usually return rough results.
Retrieval is hierarchical. A doc-level pass finds the candidate documents — title + machine-summary embedded as a single vector, fast filter. A chunk-level pass then retrieves overlapping windows from the candidate set with sentence-level granularity. Milvus handles both tiers with hybrid search (dense + sparse), and metadata filtering enforces the access permissions the compliance team set on each document.
The Llama-based Arabic model generates answers from the retrieved chunks. Every answer carries citations back to the source document and page. The model runs on the client's GPU infrastructure — no data leaves their network. We benchmarked half a dozen open Arabic LLMs against the client's own evaluation set and picked the one that scored best on formal-Arabic citation accuracy.
Worked with the compliance team to understand the corpus, the question types, and the constraints. Built an evaluation set of 80 question-answer pairs with grounded citations as the project's accuracy target.
Days 1 — 2Built the document ingest pipeline, the language router, and the hierarchical Milvus retrieval before any generation. Tuned retrieval against the eval set until top-3 chunks contained the answer 90% of the time.
Weeks 1 — 3Benchmarked open Arabic LLMs against the eval set. Built the bilingual query layer with English-to-Arabic translation pre-embed. Iterated on prompt structure for citation discipline — Arabic answers with English citation tags read cleaner than the inverse.
Weeks 4 — 5Deployed entirely inside the client's GPU infrastructure. Loom walkthrough in Arabic and English, runbook, 30-day support tail. The compliance team's reviewers had final say on every cited answer for the first month.
Weeks 6 — 7Open-weight everything. The client's data couldn't leave their network — that ruled out commercial LLMs by definition. We benchmarked Arabic-tuned variants of Llama against the eval set; the winner balanced citation discipline with formal-Arabic fluency. The same family also gave us a path to fine-tuning later if the client wanted to specialize on their corpus.
Milvus for the vector store. The client's IT was comfortable running it on-prem; pgvector was an option but Milvus's hybrid search and metadata filtering made the document-level access controls cleaner to implement. Python for the service layer because the client's existing data team worked in Python, which made post-handoff support and extension straightforward.
Considered and rejected: a translate-everything-to-English approach (lost meaning on legal terms-of-art), a single flat-retrieval index (couldn't enforce per-document access controls), and a smaller Arabic model fine-tuned in-house (timeline too long for the scope). The hierarchical retrieval was the unlock — most "RAG over Arabic PDFs" approaches we evaluated failed on either the language layer or the access-control layer; very few tried to solve both.
Illustrative ranges. Specific client metrics are confirmed under NDA. Numbers shown reflect reported outcomes at handover.
30-min scope call. By the end you'll know what we'd build, in what order, what it costs.