GenAI Zürich 2026
Testing LLM Outputs: Caging the Wind or Just Another Day in the Office?
About this talk
As LLM-based applications scale and teams grow, you can no longer rely on intuition to know whether things work. This talk covers Adobe's journey from a simple LLM app to a sophisticated skills-based system, the shift to rigorous testing with Promptfoo, and lessons learned from managing systems that feel unpredictable.
Key takeaways
- Why testing LLM outputs is different from traditional software testing, and what that means for your workflow
- How to set up evaluation-driven development with Promptfoo to catch regressions before they reach users
- Practical patterns for scaling LLM testing as your application grows from a single prompt to a multi-skill system
- Lessons from running this at Adobe: what worked, what surprised us, and what we'd do differently
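To make the evaluation-driven workflow concrete, here is a minimal Promptfoo config sketch. The prompt, provider, and test case are illustrative placeholders, not material from the talk; Promptfoo reads a `promptfooconfig.yaml` like this and runs each test's assertions against the model's output:

```yaml
# promptfooconfig.yaml — minimal sketch; prompt, provider, and test values are illustrative
prompts:
  - "Summarize the following support ticket in one sentence: {{ticket}}"

providers:
  - openai:gpt-4o-mini

tests:
  - vars:
      ticket: "My export keeps failing with a timeout error after about 30 seconds."
    assert:
      # Deterministic check: the summary must mention the key symptom
      - type: contains
        value: timeout
      # Model-graded check: an LLM judges the output against a rubric
      - type: llm-rubric
        value: The summary is a single sentence and mentions the export failure.
```

Running `npx promptfoo eval` executes the tests and reports pass/fail per assertion, which is how regressions get caught in CI before they reach users.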
Resources
- Promptfoo - LLM evaluation framework
- Claude Agent SDK plugin support - promptfoo/promptfoo#6377
- Redact exported config secrets - promptfoo/promptfoo#7974