Why AI agents fail in production (and it’s rarely the model)

AI agents rarely fail because their models are not “smart enough”. In open source AI systems, failures typically occur later, when demos are deployed to production.

In early demos, a single large language model (LLM), a clean prompt loop, and a cloud endpoint are often enough. However, once that same system is exposed to real workloads with unpredictable latency, cost controls, data locality, and regulatory constraints, the cracks appear quickly. What appeared to be intelligence turns out to be fragility.

One of our speakers on the AI track, Luca Bianchi, mentioned that “one of the biggest problems that we have when moving from demos or POCs to production is the fact that we need performance.” He explained that general-purpose LLMs tend to consume a large number of tokens, and that adding more models and chaining them into pipelines makes the system slower and more expensive.

This is the gap Luca addressed in his OCX session, Small LLMs at the Edge Are the Engine for Open Source Scalable AI Agents. His argument was that the size, number, and kind of models in a system are not givens but architectural decisions, and that there is no one-size-fits-all approach.

Rather than scaling a single model, Luca advocated for using multiple smaller, domain-focused models, known as Small Language Models (SLMs). These models are intentionally narrow. As Luca noted, “if we are building language models tailored for the legal domain, we don’t need that model to be able to talk about chemistry, or maths, or biology.” When deployed at the edge or in hybrid cloud/edge environments, these open source models make agentic systems more predictable, more cost-effective to run, and easier to control.

The problem becomes even more visible when teams attempt to build agentic systems. In production, Luca explained, “you end up building a complex pipeline of many different models,” handling retrieval, re-ranking, filtering, and orchestration. When those pipelines rely on large, centralised models, every stage adds its own delay, and the pipeline as a whole takes a very long time to respond.
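To make the compounding effect concrete, here is a minimal Python sketch of a sequential pipeline in which each stage is a separate model call. The stage names come from Luca's description. The Stage class, the pipeline_totals helper, and every latency and token figure are hypothetical placeholders chosen for illustration, not measurements from the session or from any real model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One model call in a sequential agent pipeline (hypothetical)."""
    name: str
    latency_s: float  # assumed time per call, in seconds
    tokens: int       # assumed tokens consumed per call


def pipeline_totals(stages):
    """Return (total latency, total tokens) for stages run one after another."""
    total_latency = sum(stage.latency_s for stage in stages)
    total_tokens = sum(stage.tokens for stage in stages)
    return total_latency, total_tokens


# Placeholder figures only; they are not measurements of any real model.
STAGE_NAMES = ["retrieval", "re-ranking", "filtering", "orchestration"]
large_pipeline = [Stage(name, latency_s=2.0, tokens=1500) for name in STAGE_NAMES]
small_pipeline = [Stage(name, latency_s=0.25, tokens=300) for name in STAGE_NAMES]

large_latency, large_tokens = pipeline_totals(large_pipeline)
small_latency, small_tokens = pipeline_totals(small_pipeline)

print(f"large models: {large_latency:.2f} s, {large_tokens} tokens")
print(f"small models: {small_latency:.2f} s, {small_tokens} tokens")
```

Because the stages run in sequence, per-call latency and token use are simply summed, so any per-stage saving is multiplied by the number of stages. Real pipelines may run some stages in parallel or cache results, which this sketch does not model.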

This architectural reality highlights a broader theme at OCX 26. In an interview, John Ellis, speaking about his session on software trust, said, “The industry has quietly redefined trust to mean: it passed tests at a point in time, which is not enough.” In AI systems, passing a demo is increasingly mistaken for being production-ready.

Watch the recording of this session on our YouTube channel.