AI Models in Production: What Commerce Companies Are Actually Shipping

Daniel Nguyen 10 June 2024

There's a significant gap between how AI in commerce is described in a product pitch and what the engineering work of getting an ML model into production actually looks like. The demo shows a personalised recommendation feed that surfaces the right product at the right moment. What the demo doesn't show is the feature engineering pipeline that prepares the training data, the latency budget negotiation that determines how much inference time the product team can actually afford, the cold-start handling for new users and new products, the model monitoring infrastructure that detects drift before it becomes a user-visible degradation, and the rollout strategy that allows the system to be A/B tested against a control without the experimental design corrupting the training data. Those are not edge cases. They are the substance of operating an ML system in production commerce environments.
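
To make the drift-monitoring item concrete, here is a minimal sketch of one common check, the Population Stability Index (PSI), which compares the distribution of a feature or model score in live traffic against a reference sample. The function names, the bin count, and the 0.2 alert threshold are assumptions chosen for illustration (0.2 is a widely quoted rule of thumb, not a standard), and nothing here describes a specific team's setup.

```python
import math
from typing import List, Sequence

# Assumed alert level; a commonly quoted rule of thumb, tune per feature.
DRIFT_THRESHOLD = 0.2


def psi(reference: Sequence[float], live: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index between a reference sample and a live sample.

    Bins are equal-width over the range of the reference sample; live values
    outside that range are clipped into the first or last bin.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant samples

    def bucket_shares(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        eps = 1e-6  # floor so empty buckets do not produce log(0)
        return [max(c / len(values), eps) for c in counts]

    ref_shares = bucket_shares(reference)
    live_shares = bucket_shares(live)
    return sum(
        (a - e) * math.log(a / e) for e, a in zip(ref_shares, live_shares)
    )


def is_drifting(reference: Sequence[float], live: Sequence[float]) -> bool:
    """True when the PSI exceeds the assumed alert threshold."""
    return psi(reference, live) > DRIFT_THRESHOLD
```

In practice a check like this would run on a schedule per feature and per model score, so that an alert fires before the shift shows up as a user-visible degradation.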

From conversations with engineering teams in our portfolio and the broader commerce ecosystem, we consistently see that getting from "we have a working model in a notebook" to "we have a model serving production traffic with monitoring and rollback" takes roughly three to five times longer than the initial estimate. Most of the work in that gap is not ML work: it is the data infrastructure, the serving infrastructure, and the organisational alignment around which metrics the model is actually being optimised for. A personalisation model that optimises for click-through rate in the A/B test may degrade revenue per session over a longer horizon because it surfaces novelty-driven content that generates clicks but not purchases. Setting the right objective function requires a conversation between the data science team and the commercial team, and that conversation frequently doesn't happen until the model has been in production long enough for the misalignment to show up in the metrics.
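
The click-through versus revenue tension is easy to see with a small calculation. The sketch below computes both metrics per experiment arm from session-level data. The `Session` fields, the arm names, and the numbers are all invented for illustration; they are toy values, not results from any real experiment.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Session:
    arm: str          # hypothetical experiment arm label
    impressions: int  # recommendations shown in the session
    clicks: int       # recommendations clicked
    revenue: float    # purchase revenue attributed to the session


def summarise(sessions: List[Session], arm: str) -> Dict[str, float]:
    """Click-through rate and revenue per session for one experiment arm."""
    rows = [s for s in sessions if s.arm == arm]
    impressions = sum(s.impressions for s in rows)
    clicks = sum(s.clicks for s in rows)
    revenue = sum(s.revenue for s in rows)
    return {
        "ctr": clicks / impressions,
        "revenue_per_session": revenue / len(rows),
    }


# Toy data: the treatment arm earns more clicks but less revenue.
sessions = [
    Session("control", impressions=20, clicks=2, revenue=30.0),
    Session("control", impressions=20, clicks=1, revenue=0.0),
    Session("treatment", impressions=20, clicks=4, revenue=10.0),
    Session("treatment", impressions=20, clicks=3, revenue=0.0),
]

control = summarise(sessions, "control")
treatment = summarise(sessions, "treatment")
```

Reporting both numbers side by side for every experiment is a cheap way to surface the misalignment early, before the commercial team finds it in the revenue figures.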

The teams in our portfolio that are furthest along in production ML deployment share a few structural characteristics. First, they treat the data pipeline as a first-class engineering investment rather than as a prerequisite that gets built only when something needs it. Second, they have clear ownership at the intersection of ML and product: a role that can translate between "the model's loss function" and "what the product actually needs to do differently." Third, they have instrumentation that lets them understand model behaviour at the level of individual predictions, not just in aggregate A/B metrics. That third capability tends to receive too little investment, yet it is what allows fast iteration on model behaviour without running a new experiment every time a hypothesis needs testing.
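
Prediction-level instrumentation can start out very simple. The following is a minimal sketch, assuming each served prediction is written as one JSON line that records the model version, experiment arm, score, and input features, so that a hypothesis can be checked by querying the log instead of launching a new experiment. The record fields and function names are hypothetical, and a production system would write to a durable event stream, not to an in-memory buffer.

```python
import io
import json
import time
import uuid
from dataclasses import asdict, dataclass
from typing import Callable, Dict, Iterator, TextIO


@dataclass
class PredictionRecord:
    request_id: str
    timestamp: float
    model_version: str
    experiment_arm: str
    user_id: str
    item_id: str
    score: float
    features: Dict[str, float]


def log_prediction(sink: TextIO, *, model_version: str, experiment_arm: str,
                   user_id: str, item_id: str, score: float,
                   features: Dict[str, float]) -> PredictionRecord:
    """Append one prediction to the sink as a JSON line and return the record."""
    record = PredictionRecord(
        request_id=str(uuid.uuid4()),
        timestamp=time.time(),
        model_version=model_version,
        experiment_arm=experiment_arm,
        user_id=user_id,
        item_id=item_id,
        score=score,
        features=features,
    )
    sink.write(json.dumps(asdict(record)) + "\n")
    return record


def read_predictions(
    source: TextIO,
    where: Callable[[PredictionRecord], bool] = lambda r: True,
) -> Iterator[PredictionRecord]:
    """Yield logged predictions that satisfy the given predicate."""
    for line in source:
        if line.strip():
            record = PredictionRecord(**json.loads(line))
            if where(record):
                yield record


# Usage with an in-memory sink and made-up identifiers.
sink = io.StringIO()
log_prediction(sink, model_version="v1", experiment_arm="treatment",
               user_id="u1", item_id="sku-1", score=0.9,
               features={"price": 12.0})
log_prediction(sink, model_version="v1", experiment_arm="treatment",
               user_id="u2", item_id="sku-2", score=0.2,
               features={"price": 80.0})
sink.seek(0)
high_scores = list(read_predictions(sink, lambda r: r.score > 0.5))
```

Storing the features alongside the score is the design choice that matters here: it lets someone ask why a given user saw a given item without having to reconstruct the model's inputs after the fact.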

The challenge specific to APAC commerce ML is data sparsity in markets outside Australia and Singapore. A personalisation model trained primarily on Australian consumer behaviour doesn't transfer cleanly to Indonesian or Filipino consumer behaviour: purchase patterns, price sensitivity thresholds, category preferences, and platform interaction patterns all differ in ways that matter for model performance. The companies that handle this well are building multi-market training architectures that share representation learning where the patterns genuinely transfer (basic recommendation signals, image understanding for product classification) and maintain market-specific fine-tuning where they don't. That is a more complex ML architecture than a single-market model, and it requires enough data volume in each market to support fine-tuning. As a result, the order in which a company expands into new markets has ML data strategy implications that are often not factored into the product roadmap.
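
The shape of such an architecture can be sketched in a few lines. The example below assumes a shared representation layer used by every market, plus an optional market-specific scoring head that is only accepted once a market has enough training examples; markets below that bar fall back to a global head. The class name, the `min_examples` cut-off of 1000, and the weights are all hypothetical, and a real system would learn these parameters with an ML framework instead of hard-coding them.

```python
from typing import Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


class MultiMarketScorer:
    """Shared representation with per-market heads and a global fallback."""

    def __init__(self, shared_weights: List[Vector], global_head: Vector,
                 min_examples: int = 1000) -> None:
        self.shared_weights = shared_weights  # learned across all markets
        self.global_head = global_head        # used when no market head exists
        self.min_examples = min_examples      # assumed data-volume threshold
        self.market_heads: Dict[str, Vector] = {}

    def embed(self, features: Vector) -> Vector:
        """Shared layer: linear projection followed by ReLU."""
        return [max(0.0, dot(row, features)) for row in self.shared_weights]

    def register_market_head(self, market: str, head: Vector,
                             n_examples: int) -> bool:
        """Accept a fine-tuned head only if the market has enough data."""
        if n_examples < self.min_examples:
            return False
        self.market_heads[market] = head
        return True

    def score(self, market: str, features: Vector) -> float:
        head = self.market_heads.get(market, self.global_head)
        return dot(head, self.embed(features))


# Illustrative values only.
scorer = MultiMarketScorer(
    shared_weights=[[1.0, 0.0], [0.0, 1.0]],
    global_head=[1.0, 1.0],
)
accepted_id = scorer.register_market_head("ID", [1.0, 0.0], n_examples=5000)
accepted_ph = scorer.register_market_head("PH", [0.0, 1.0], n_examples=200)
```

The threshold check is where the roadmap dependency becomes visible: a market launched too recently to clear it is served by the global head, whatever the product plan assumed.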

We're cautious about the trend toward large-model fine-tuning as the default architecture choice for commerce AI. Foundation model fine-tuning for specific commerce tasks is attractive because it reduces the need for large proprietary training datasets, but it introduces a dependency on model providers and a fine-tuning lifecycle management problem that isn't trivial at scale. The companies that will have the strongest defensible positions in commerce AI are the ones accumulating proprietary training data from their production deployments — the behavioural signals that no external model can replicate and that compound in value as the data asset grows. Building toward that data moat from the start, even when foundation model fine-tuning would be faster in the short term, is the right long-horizon architecture decision.