The Expert Trap: What a Silly Movie Game Taught Us About Prompting

Daniel Albensoeder

There’s a long-running internet game called Explain a Film Plot Badly: describe a famous movie as misleadingly as possible and let everyone else guess. “Rampant inflation causes a housing crisis” is Up. “The world’s machinery is threatened by malfunctioning batteries” is The Matrix. The fun is in the misdirection: every clue is technically true and deliberately framed to point the wrong way.

It’s also a near-perfect test of lateral reasoning, exactly the kind of task where intuition about “what makes a model better” tends to be wrong. So we turned the game into an evaluation. Prompting a model to act as a world-class movie expert made it worse, and the effect held across every model we tried. More curious still: upgrading to the newer, benchmark-topping Opus 4.8 also made it worse, not better. Both are moves everyone recommends; neither does what people expect once you actually evaluate it.

The Setup

The game runs in reverse for the model: it gets the misleading one-liner and has to guess which film is being described — it’s solving the riddle, not writing it. We collected 50 real “bad descriptions” of well-known films from around the internet, each paired with its actual title as ground truth.

We ran those 50 descriptions against five models — Claude Opus 4.8 and 4.7, Claude Sonnet 4.6, GPT-5.5, and the locally-hostable open-weight Mistral Small 3.2 24B — each repeated over multiple runs for stability. The only thing we varied between experiments was the prompt.

To score the answers we used elluminate, our evaluation platform, with two binary pass/fail criteria judged by an LLM:

  • Correct Movie — did the final answer name the right film (allowing minor title variations)?
  • Considered Correct Movie — did the model ever mention the right film while reasoning, even if it picked something else? This criterion separates a near miss, where the right film made the model’s shortlist but lost out to another guess, from a clean miss, where the title never surfaced at all.
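To make the two criteria concrete, here is a minimal Python sketch of how a pair of judge verdicts maps onto a hit, a near miss, or a clean miss. The `Verdict` class and the function names are hypothetical (this is not elluminate's API), and the example verdicts are made-up placeholders; in the real experiment both flags come from the LLM judge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """One judged response (hypothetical structure, for illustration only)."""
    correct: bool     # passed "Correct Movie"
    considered: bool  # passed "Considered Correct Movie"


def outcome(v: Verdict) -> str:
    """Classify a response using the two binary criteria."""
    if v.correct:
        return "hit"
    if v.considered:
        return "near miss"   # right film was on the shortlist but lost out
    return "clean miss"      # right film never surfaced at all


def pass_rate(verdicts: list[Verdict], criterion: str) -> float:
    """Share of responses that pass the named criterion."""
    return sum(getattr(v, criterion) for v in verdicts) / len(verdicts)


# Placeholder verdicts, not real results.
verdicts = [
    Verdict(correct=True, considered=True),
    Verdict(correct=False, considered=True),
    Verdict(correct=False, considered=False),
    Verdict(correct=True, considered=True),
]
labels = [outcome(v) for v in verdicts]
print(labels)
print(pass_rate(verdicts, "correct"), pass_rate(verdicts, "considered"))
```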

The Expert Trap

Our intuition went where everyone else’s goes: prime the model with expertise. “You are a world-class X” opens countless tutorials, vendor docs, and corporate prompt libraries. So one prompt opened with a confident persona — “You are a legendary cinephile and film historian who has watched over 10,000 movies… famous for your ability to identify any movie from the most obscure or misleading descriptions.” Surely a simulated expert would do better.

It did the opposite. We compared that “Movie Expert” persona against a prompt that instead told the model to think like the person who wrote the bad description:

“The writer KNOWS the movie and is deliberately making it sound like something else. For each clue, think: what iconic movie element could this be a twisted version of? Then work backwards from famous movies to see which one fits ALL the clues.”

Same models, same samples, same judge. The numbers below are the share of the 50 films named correctly — and the expert persona lost on every single one:

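| Model | “Think Like the Writer” | “Movie Expert” | Difference (points) |
| --- | --- | --- | --- |
| GPT-5.5 | 86% | 81% | −5 |
| Claude Opus 4.8 | 65% | 55% | −10 |
| Claude Opus 4.7 | 65% | 61% | −4 |
| Claude Sonnet 4.6 | 64% | 57% | −7 |
| Mistral Small 3.2 24B | 32% | 16% | −16 |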

Several percentage points of accuracy, gone — just from a flattering opening sentence. The effect held across every model we tried: GPT-5.5, Opus 4.7, and Sonnet each gave up four to seven points, and the gap widened to double digits on Opus 4.8 and Mistral. The direction is consistent across repeated runs — though the smaller four-to-seven-point gaps sit closer to run-to-run wobble than the double-digit ones do.
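The differences are plain subtraction over the accuracy figures reported above. As a small sketch (the dictionary layout and variable names are our own choice for illustration), they can be recomputed like this:

```python
# Accuracy in percent: (Think Like the Writer, Movie Expert), from the table above.
accuracy = {
    "GPT-5.5": (86, 81),
    "Claude Opus 4.8": (65, 55),
    "Claude Opus 4.7": (65, 61),
    "Claude Sonnet 4.6": (64, 57),
    "Mistral Small 3.2 24B": (32, 16),
}

# Negative values mean the expert persona scored lower than the writer prompt.
deltas = {model: expert - writer for model, (writer, expert) in accuracy.items()}

for model, delta in deltas.items():
    print(f"{model:<22} {delta:+d} points")

double_digit = [m for m, d in deltas.items() if d <= -10]
print("Double-digit drops:", double_digit)
```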

Why Expertise Backfires

The failure cases tell the story. When we inspected where the expert persona went wrong, the pattern was almost always the same: the model locked onto a confident answer and never even considered the right film.

  • “Drinking coffee reveals that everything is a lie” (The Usual Suspects) → confidently answered The Matrix
  • “An estranged daughter is knocked up by her father’s namesake” (The Big Lebowski) → confidently answered Knocked Up
  • “A cosplayer enters the main chamber of the House of Representatives, appears to die, but runs for city council instead” (Dave) → confidently answered Mr. Smith Goes to Washington

The judge flagged these as failing both “Correct Movie” and “Considered Correct Movie” — the right answer never showed up in the reasoning at all. The expert persona encourages exactly the wrong behavior for this task: it rewards a fast, authoritative first guess and discourages the broad, doubt-driven search that the puzzle actually requires.

This is the part the “Considered” criterion makes visible. It’s not just that the expert persona picks the wrong movie more often — it brings up the right movie less often in the first place. Across every model whose reasoning we can read, the expert persona dropped the rate at which the correct film was ever mentioned:

| Model | Considered the right film: Writer | Considered the right film: Expert | Difference (points) |
| --- | --- | --- | --- |
| Claude Opus 4.8 | 67% | 56% | −11 |
| Claude Opus 4.7 | 71% | 65% | −6 |
| Claude Sonnet 4.6 | 68% | 62% | −6 |
| Mistral Small 3.2 24B | 47% | 25% | −22 |

GPT-5.5 doesn’t expose its reasoning trace, so the “Considered” criterion can’t be measured for it.

A wrong final answer can be bad luck. A right answer that never enters the reasoning is a search problem — the model closed the door before it got to the correct candidate. That is what “act like an expert” does here: it narrows the search instead of widening it.
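Putting the accuracy and “Considered” figures side by side gives a rough split between near misses and clean misses. The sketch below assumes that every response passing “Correct Movie” also passes “Considered Correct Movie”, which the LLM judge is not guaranteed to enforce, so treat the split as approximate. The function and variable names are ours.

```python
# (correct %, considered %) per model and prompt, copied from the two tables above.
rates = {
    "Claude Opus 4.8": {"writer": (65, 67), "expert": (55, 56)},
    "Claude Opus 4.7": {"writer": (65, 71), "expert": (61, 65)},
    "Claude Sonnet 4.6": {"writer": (64, 68), "expert": (57, 62)},
    "Mistral Small 3.2 24B": {"writer": (32, 47), "expert": (16, 25)},
}


def split_misses(correct: int, considered: int) -> tuple[int, int]:
    """Return (near-miss share, clean-miss share) in percentage points.

    Assumes a correct answer always counts as "considered".
    """
    near_miss = considered - correct   # right film mentioned but not chosen
    clean_miss = 100 - considered      # right film never mentioned
    return near_miss, clean_miss


for model, by_prompt in rates.items():
    for prompt, (correct, considered) in by_prompt.items():
        near, clean = split_misses(correct, considered)
        print(f"{model:<22} {prompt:<7} near miss {near:>2}, clean miss {clean:>2}")
```

Under that assumption, the clean-miss share rises with the expert persona on all four models (from 53 to 75 points on Mistral, for example), while the near-miss share stays small. Most of the lost accuracy is a search failure rather than a bad final pick.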

The prompts that do well share one habit — and it’s the opposite of the expert’s. Whether they tell the model to work backwards from whoever wrote the clue, to latch onto a film’s most iconic scene, or to list its top few guesses before answering, they all push it to widen the search and hold off on committing. The expert persona does the reverse: it rewards a fast, confident single answer. That’s the trap — on a puzzle built from misdirection, staying open and weighing more films beats sounding certain about the first one.

We Tested Ten Prompts, Not Two

The head-to-head above wasn’t cherry-picked — it’s one instance of a broader lesson: the best prompt depends on the model. We ran the same ten prompt strategies — from a bare-bones minimal prompt to elaborate multi-step routines like “generate and eliminate” and “rank your top three by confidence” — across all five models. Accuracy on the 50 samples, sorted by Opus 4.8:

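| Prompt strategy | Opus 4.8 | Opus 4.7 | Sonnet 4.6 | GPT-5.5 | Mistral 24B |
| --- | --- | --- | --- | --- | --- |
| Think Like the Writer | 65% | 65% | 64% | 86% | 32% |
| Generate and Eliminate | 64% | 64% | 60% | 82% | 20% |
| Adversarial Debrief | 61% | 74% | 56% | 83% | 21% |
| Iconic Scenes Focus | 60% | 76% | 60% | 88% | 27% |
| Top 3 with Confidence | 59% | 72% | 62% | 85% | 24% |
| Gut Then Reconsider | 59% | 64% | 58% | 78% | 24% |
| Every Word is a Lie | 57% | 54% | 54% | 82% | 12% |
| Movie Expert persona | 55% | 61% | 57% | 81% | 16% |
| Genre Inversion | 55% | 58% | 58% | 80% | 10% |
| Minimal (no guidance) | 43% | 52% | 50% | 76% | 16% |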

Three things stand out. First, the best prompt depends on the model, and on the model version. “Iconic Scenes” tops the sweep on Opus 4.7 and GPT-5.5, but on Opus 4.8 it slides to fourth place and “Think Like the Writer” takes the lead, the same prompt that also tops Sonnet and Mistral. On the models it doesn’t win, each winner lands somewhere between second and fourth. Tune a prompt on one model, move to another (or just let the vendor ship a new version), and your “best” choice quietly stops being the best.

Second, the expert persona is a poor default everywhere. It never wins on any model; on every one it lands seventh or eighth out of ten, behind most of the structured lateral strategies, and on Mistral it merely ties the do-nothing “minimal” prompt. “Act like an expert” buys you little here, and compared with the better strategies it costs you. It’s not the worst thing you can do; it’s just never the right one.

Third, prompt choice matters more the weaker the model, at least in relative terms. It never stops mattering: even on GPT-5.5 the best and worst prompts are 12 points apart, but against a 76% floor that gap is comparatively small. On Mistral it’s huge: 22 points on top of a 10% floor, so the best prompt scores roughly three times the worst. The smaller the model, the more the exact words you choose move the result.
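All three observations can be checked directly against the sweep table. The following sketch recomputes, per model, the winning prompt, the best-to-worst spread and ratio, and the rank of the expert persona. The data is copied from the table; the helper `summarize` and the dictionary layout are illustrative choices of ours.

```python
MODELS = ["Opus 4.8", "Opus 4.7", "Sonnet 4.6", "GPT-5.5", "Mistral 24B"]

# Accuracy in percent, one row per prompt strategy, columns in MODELS order.
SCORES = {
    "Think Like the Writer": [65, 65, 64, 86, 32],
    "Generate and Eliminate": [64, 64, 60, 82, 20],
    "Adversarial Debrief": [61, 74, 56, 83, 21],
    "Iconic Scenes Focus": [60, 76, 60, 88, 27],
    "Top 3 with Confidence": [59, 72, 62, 85, 24],
    "Gut Then Reconsider": [59, 64, 58, 78, 24],
    "Every Word is a Lie": [57, 54, 54, 82, 12],
    "Movie Expert persona": [55, 61, 57, 81, 16],
    "Genre Inversion": [55, 58, 58, 80, 10],
    "Minimal (no guidance)": [43, 52, 50, 76, 16],
}


def summarize(model_index: int) -> dict:
    """Best prompt, best-worst spread and ratio, and expert rank for one model."""
    column = {prompt: row[model_index] for prompt, row in SCORES.items()}
    best = max(column, key=column.get)
    worst_score = min(column.values())
    expert = column["Movie Expert persona"]
    return {
        "best_prompt": best,
        "spread": column[best] - worst_score,   # percentage points
        "ratio": column[best] / worst_score,    # best score relative to worst
        # Rank 1 is best; prompts with a strictly higher score push the rank down.
        "expert_rank": 1 + sum(score > expert for score in column.values()),
    }


summary = {model: summarize(i) for i, model in enumerate(MODELS)}
for model, s in summary.items():
    print(f"{model:<12} best={s['best_prompt']:<22} spread={s['spread']:>2} "
          f"ratio={s['ratio']:.2f} expert_rank={s['expert_rank']}")
```

One detail this makes visible: in absolute points the spread on Mistral (22) is about the same as on Opus 4.8 (22) and Opus 4.7 (24). It is the ratio of best to worst, about 3.2 on Mistral versus about 1.2 on GPT-5.5, that sets the weaker model apart.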

A Newer Model Isn’t a Better Guesser

Look down the two Opus columns. Claude Opus 4.8 is the newer, generally stronger model; it beats 4.7 on most standard benchmarks. On this task it scores lower on seven of the ten prompts (tied on two more, ahead on just one), and the 4.7 champion “Iconic Scenes” drops from 76% to 60%.
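The seven, two, one tally comes straight from the two Opus columns of the sweep table. A short sketch of the comparison (variable names are ours):

```python
# Accuracy in percent per prompt: (Opus 4.8, Opus 4.7), from the sweep table.
opus = {
    "Think Like the Writer": (65, 65),
    "Generate and Eliminate": (64, 64),
    "Adversarial Debrief": (61, 74),
    "Iconic Scenes Focus": (60, 76),
    "Top 3 with Confidence": (59, 72),
    "Gut Then Reconsider": (59, 64),
    "Every Word is a Lie": (57, 54),
    "Movie Expert persona": (55, 61),
    "Genre Inversion": (55, 58),
    "Minimal (no guidance)": (43, 52),
}

lower = [p for p, (new, old) in opus.items() if new < old]
tied = [p for p, (new, old) in opus.items() if new == old]
ahead = [p for p, (new, old) in opus.items() if new > old]

print(f"4.8 lower on {len(lower)}, tied on {len(tied)}, ahead on {len(ahead)}")
print("Only prompt where 4.8 leads:", ahead)

# Largest regression from 4.7 to 4.8, in percentage points.
worst = min(opus, key=lambda p: opus[p][0] - opus[p][1])
print("Largest drop:", worst, opus[worst][0] - opus[worst][1])
```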

It’s the expert trap again, one level up. Reading the traces, 4.8 commits earlier: it locks onto one confident reading and writes a fluent case for it instead of enumerating candidates. Told a cosplayer “enters the main chamber of the House of Representatives, appears to die, but runs for city council” (Dave), 4.7 works through the options and lands on Dave; 4.8 latches onto “Run, Forrest, run!”, answers Forrest Gump, and never reconsiders. The “Considered” criterion confirms it isn’t bad luck: on “Iconic Scenes”, 4.7 brings up the right film in 74% of responses and 4.8 in only 62%. A more decisive model narrows the search the same way “act like an expert” does.

That’s the uncomfortable part. “Upgrade to the newest model” is the other piece of folklore everyone trusts, right next to “act like an expert.” Here the newer model made this product worse — and only the eval caught it. A higher score on a public leaderboard is not the same as the right behavior for your task, and the only way to know which way an upgrade moves your numbers is to measure them.

The Takeaway

As we noted at the top, “act as an expert” advice is everywhere. For this task it was actively harmful — and we only know that because we ran an eval: a fixed test set, defined pass/fail criteria, an LLM judge applied to every response, repeated runs for stability. Read the expert-persona outputs on their own and they look great: fluent, confident, well-reasoned. You would ship them. The problem only becomes visible when you score them against ground truth at scale.
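For readers who want to see the shape of such an eval, here is a deliberately tiny Python sketch of the loop: fixed samples, several prompts, a judge applied to every response, repeated runs. Everything in it is an assumption for illustration. `run_eval`, `stub_model`, and `stub_judge` are hypothetical names, the stubs stand in for a real model call and the LLM judge, the prompt templates are invented, and none of this is elluminate's API. The stub's outputs say nothing about how any real model behaves.

```python
from typing import Callable

Sample = tuple[str, str]  # (bad description, true title)


def run_eval(
    samples: list[Sample],
    prompts: dict[str, str],
    ask_model: Callable[[str], str],
    judge: Callable[[str, str], bool],
    runs: int = 3,
) -> dict[str, float]:
    """Return mean accuracy per prompt, averaged over repeated runs."""
    results = {}
    for name, template in prompts.items():
        run_scores = []
        for _ in range(runs):
            passed = [
                judge(ask_model(template.format(description=desc)), title)
                for desc, title in samples
            ]
            run_scores.append(sum(passed) / len(passed))
        results[name] = sum(run_scores) / len(run_scores)
    return results


def stub_model(prompt: str) -> str:
    """Toy stand-in for a model call; a real harness would call an LLM here."""
    return "Up" if "housing crisis" in prompt else "The Matrix"


def stub_judge(answer: str, title: str) -> bool:
    """Toy stand-in for the LLM judge: case-insensitive title match."""
    return title.lower() in answer.lower()


samples = [
    ("Rampant inflation causes a housing crisis", "Up"),
    ("Drinking coffee reveals that everything is a lie", "The Usual Suspects"),
]
prompts = {  # hypothetical templates
    "minimal": "Which film is this? {description}",
    "writer": "The writer KNOWS the movie. Which film is this? {description}",
}
results = run_eval(samples, prompts, stub_model, stub_judge)
print(results)
```

Swapping the stubs for a real model client and a real judge is where the actual work lies; the loop itself stays this simple.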

That is the whole argument for evals in one silly example. Prompt-engineering folklore is full of plausible-sounding rules that quietly cost you accuracy, and intuition can’t tell you which prompt wins — especially when the winner changes from one model to the next, or when a newer model quietly guesses worse. Reading a handful of nice-looking responses won’t either. An eval can — in this case, in about the time it takes to read this post.

If “we think this prompt is better” is currently how prompt decisions get made on your team, that’s the gap evidence-driven evaluation is meant to close.


At ellamind, we build elluminate, the evaluation platform that turns “we think it works” into “we know it works.” This entire experiment — collection, criteria, ten prompts, five models — was built and run inside it, and the docs walk through running a sweep like this yourself. If you’re making prompt and model decisions on intuition, we’d love to talk.
