Atlas note
Evals should test the workflow, not the demo
Atlas made this clear to me: an agent can pass the prompt and still fail the product if the eval leaves out state, files, side effects, or the user's real input format.
My first evals for Atlas were mostly question-answer tests. Ask the agent a question, grade the answer, improve the prompt or the tool, run it again. That was useful. It caught obvious gaps. It also gave me too much confidence.
The real system was not a clean prompt. It was a workflow: messages, screenshots, local files, database writes, long-running sessions, follow-up questions, and a human who expected the system to remember what it had actually done. Once I tested that whole loop, new failure modes appeared.
The input modality matters
A text eval can pass while the product fails on screenshots, PDFs, or forwarded messages. The user's real input format is part of the task. If the eval only tests the cleaned-up version, it is mostly grading the demo path.
Atlas improved when evals included the messy inputs the system really saw. It also improved when "wrong examples" were made explicit. Not just "parse this correctly," but "do not turn this label, axis, or annotation into a real record." Negative examples matter because agents are eager to be helpful.
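Concretely, that pushed the eval cases toward a shape like the sketch below. The dataclass, fixture path, and record fields are illustrative stand-ins rather than real Atlas code, but the idea is that the messy input file and the negative examples travel with the case:

```python
from dataclasses import dataclass, field

@dataclass
class EvalCase:
    """One eval case built from a real input, not a cleaned-up transcript."""
    name: str
    input_files: list[str]  # the actual screenshot/PDF the user sent
    prompt: str             # the forwarded message, typos and all
    must_create: list[dict] = field(default_factory=list)
    must_not_create: list[dict] = field(default_factory=list)  # negative examples

CASES = [
    EvalCase(
        name="chart_screenshot_is_not_data",
        input_files=["fixtures/weight_chart.png"],  # hypothetical fixture
        prompt="here's my progress so far",
        # Axis ticks on the chart must not become real records.
        must_not_create=[{"type": "weight_entry", "value": v} for v in (0, 50, 100)],
    ),
]

def check(case: EvalCase, created_records: list[dict]) -> list[str]:
    """Return failures; an empty list means the case passed."""
    failures = [f"missing record: {r}" for r in case.must_create if r not in created_records]
    failures += [f"hallucinated record: {r}" for r in case.must_not_create if r in created_records]
    return failures
```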
State leaks across tests
Multi-turn agents carry memory. Databases carry writes. Gateways carry sessions. If an eval mutates state and the next question reads that state, the second question is no longer independent. It may fail because the first test succeeded.
This changed the eval setup for me. Resetting the chat session was not enough. I also had to think about tool memory, database rows, generated files, and local caches the agent could inspect. The eval runner became part of the system under test.
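A sketch of what "reset" had to mean in practice. The SQLite database and workspace directory are assumptions about the system's shape; the point is that every stateful surface the agent can touch gets rebuilt per case:

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path

def fresh_environment() -> tuple[Path, sqlite3.Connection]:
    """Build an isolated environment per case instead of reusing one."""
    workspace = Path(tempfile.mkdtemp(prefix="eval_"))  # generated files and caches live here
    db = sqlite3.connect(workspace / "state.db")        # fresh rows, no leftover writes
    # seed_schema(db)  # hypothetical: recreate tables from a known snapshot
    return workspace, db

def teardown(workspace: Path, db: sqlite3.Connection) -> None:
    """Delete everything the previous case could have left behind."""
    db.close()
    shutil.rmtree(workspace)  # takes the DB, generated files, and local caches with it
```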
The grader can only grade what it can observe
One Atlas eval expected the agent to write a safety banner into a generated file. The behavior was correct: the file contained the banner. But the grader only saw the chat response, not the file. From the grader's point of view, the banner was missing.
That mismatch was useful because it separated product correctness from eval observability. If the behavior lives in a file, database row, or downstream message, the grader needs access to that artifact. Asking the agent to summarize the artifact back into chat can quietly change the product just to satisfy the eval.
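The fix, roughly: hand the grader the workspace and grade the file itself. The banner text and output path below are made up, but the structure is the part that mattered; the chat response becomes debugging context, not the graded object:

```python
from pathlib import Path

def grade_safety_banner(chat_response: str, workspace: Path) -> dict:
    """Grade the artifact, not the agent's description of it."""
    banner = "GENERATED FILE - DO NOT EDIT"             # illustrative banner text
    generated = sorted(workspace.glob("reports/*.md"))  # illustrative output path
    if not generated:
        return {"pass": False, "reason": "no generated file to inspect"}
    has_banner = all(banner in f.read_text() for f in generated)
    # The chat response is kept for debugging, but it is not what gets graded.
    return {"pass": has_banner, "chat_excerpt": chat_response[:200]}
```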
Better eval questions
- Did the right side effect happen, not just the right explanation?
- Can the grader see the artifact the rubric is checking?
- Does the test use the user's real input format?
- Is state reset between runs, including memory and generated files?
- Are expected answers resolved from live data when the data changes? (sketched after this list)
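For that last item, the expected answer gets computed at grade time instead of baked into the case. The table and column names here are assumptions about the schema:

```python
import sqlite3

def expected_open_tasks(db: sqlite3.Connection) -> int:
    """Resolve the expected answer from live data at grade time.

    A hardcoded "expected: 7" goes stale as soon as a seed script or an
    earlier case writes another row.
    """
    (count,) = db.execute(
        "SELECT COUNT(*) FROM tasks WHERE status = 'open'"
    ).fetchone()
    return count

# At grading time:
#   assert str(expected_open_tasks(db)) in agent_answer
```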
My current rule is that evals should feel less like trivia and more like replay. Start from a real failure, preserve the workflow shape, and grade the observable contract. Otherwise it is easy to get a system that is excellent at the demo and still unreliable in daily use.
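Pulling the sketches together, a replay-shaped case looks something like this. Every field name is an assumption, but the contract is the point:

```python
from dataclasses import dataclass

@dataclass
class ReplayCase:
    """A case built from a real failure, keeping the workflow shape."""
    source_incident: str        # where the failure came from, e.g. a support thread
    input_files: list[str]      # the exact artifacts the user sent
    turns: list[str]            # follow-up questions, in order
    # The observable contract: side effects the grader can actually inspect.
    expected_rows: list[dict]
    expected_files: list[str]
    forbidden_rows: list[dict]  # negative examples stay part of the contract
```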