Acing Chemistry, Failing Biology
We re-ran OpenAI's FrontierScience benchmark on GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro using elluminate. Here's where the latest frontier models stand on the hardest public science benchmark.
How structured evaluation turned a capable but unreliable AI agent into one that processes PKV claims at 93% accuracy. Same model, same cases, different instructions.
Quality ownership drifts when AI systems fail quietly. Learn why plausible-sounding outputs are dangerous, what ownership actually means for AI teams, and how to scale evaluation rigor with risk.
Running Kubernetes in production on multiple cloud providers means juggling OpenTofu configurations, Helm charts, and deployment pipelines. Here's how we use Claude Code as an infrastructure copilot with safety guardrails, custom skills, and encoded domain knowledge.
We tested 168 sensitive China-related topics across 10 LLMs. One Chinese model matched GPT-5.2 and Claude. Another rewrote the Tiananmen massacre as state-approved fiction.
A complete framework for RAG evaluation covering test set design, targeted criteria for retrieval and generation, experiment analysis, and continuous production monitoring.
Your evaluations say your AI is perfect. You know it's not. Here's how we used MCP to iterate rapidly and surface real limitations.
Import your Langfuse datasets directly into elluminate. Turn production traces into structured evaluations, with no export scripts or CSV wrangling required.
Binary pass/fail evaluations beat Likert scales for LLM and agent evaluation. Here's why, and how to keep nuance without the inconsistency.
How we built a test set for a German health insurer's AI search—from 50 real user queries to 80 cases, 57 experiments, and a pass rate that climbed from 35% to over 80%.
Learn how to systematically test and improve your AI prompts using elluminate's evaluation platform. Walk through a complete example using pizza toppings to understand prompt templates, collections, criteria, and experiments.