How LLMs Are Changing Commerce Search: Early Evidence from the Portfolio

Daniel Nguyen 29 April 2025

The theoretical case for LLM-powered search in retail has been compelling for a while. The ability to understand a query at the intent level, not just the keyword level, and to return results that reflect the shopper's underlying need rather than a string match against product titles addresses exactly the capability gap that has plagued keyword search systems. The question was always whether LLM inference latency and cost could be brought into a range that made production retail search viable. Over the course of 2024 and into 2025, production deployments have increasingly answered that question in the affirmative, and the early data from companies operating these systems is worth examining carefully.

The pattern we've observed across portfolio companies using LLM-enriched search is that the performance gains are strongest in two specific query types. The first is long-tail or descriptive queries — searches like "casual dress for a beach wedding in Queensland in January" — where keyword systems fail to decompose the query into meaningful ranking signals and LLM understanding of the multi-attribute intent produces significantly more relevant results. The second is negative and exclusion queries — "running shoes but not Nike" or "a wine gift under $80 that isn't a red" — where keyword systems have traditionally struggled to represent negative constraints cleanly in their ranking models. LLMs handle both of these query patterns substantially better than BM25-based systems with query expansion. Conversion rate improvement in these query segments runs materially ahead of the overall search conversion improvement, which suggests the gains are genuine and attributable to the query understanding capability rather than general ranking quality improvements.
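To make the exclusion case concrete, the sketch below contrasts naive term matching with filtering on a structured query. It assumes the LLM's role is to turn the raw query into something like the hypothetical ParsedQuery shown here. The class names, fields, and matching logic are illustrative assumptions, not a description of any portfolio company's system.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Product:
    title: str
    brand: str
    price: float


@dataclass
class ParsedQuery:
    """Hypothetical structured form that an LLM query parser might emit."""
    keywords: List[str]
    exclude_brands: Set[str] = field(default_factory=set)
    max_price: Optional[float] = None


def keyword_match(raw_query: str, products: List[Product]) -> List[Product]:
    """Naive term overlap: the word 'nike' counts as a match, not an exclusion."""
    terms = set(raw_query.lower().split())
    return [p for p in products
            if terms & set(f"{p.brand} {p.title}".lower().split())]


def constrained_match(parsed: ParsedQuery, products: List[Product]) -> List[Product]:
    """Apply negative and price constraints as hard filters, then keyword overlap."""
    excluded = {b.lower() for b in parsed.exclude_brands}
    wanted = {k.lower() for k in parsed.keywords}
    results = []
    for p in products:
        if p.brand.lower() in excluded:
            continue  # negative constraint: drop the excluded brand outright
        if parsed.max_price is not None and p.price > parsed.max_price:
            continue  # price ceiling
        if wanted & set(p.title.lower().split()):
            results.append(p)
    return results
```

With "running shoes but not Nike", keyword_match still returns Nike products because the brand name appears in the query, whereas constrained_match drops them once the exclusion has been parsed into exclude_brands.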

The implementation architecture that's performing best in production is a two-stage hybrid retrieval model: a fast first-stage retrieval using dense vector embeddings (where the query and product descriptions are both embedded in a shared semantic space), followed by an LLM-based cross-encoder re-ranking step that operates on the top-N candidates from the first stage. The full LLM inference runs only on a small candidate set, which keeps the end-to-end latency within acceptable limits for commerce search, typically under 200 milliseconds for the combined pipeline. Running LLM inference over a full catalogue on every query is still cost-prohibitive for mid-market retailers, which means first-stage retrieval quality and the value of N for re-ranking are important tuning parameters: a relevant product that the first stage misses can never be recovered by the re-ranker, while a larger N raises both latency and inference cost.
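The two-stage flow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: embed is a toy bag-of-words stand-in for a real dense embedding model, rerank_score is a caller-supplied placeholder for the LLM cross-encoder, and the function names and the n and k parameters are ours rather than those of any particular library.

```python
import math
from collections import Counter
from typing import Callable, Dict, List


def embed(text: str) -> Counter:
    """Toy stand-in for a dense embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def first_stage(query: str, catalogue: Dict[str, str], n: int) -> List[str]:
    """Cheap retrieval over the whole catalogue; returns the top-n product ids."""
    q = embed(query)
    ranked = sorted(catalogue,
                    key=lambda pid: cosine(q, embed(catalogue[pid])),
                    reverse=True)
    return ranked[:n]


def hybrid_search(query: str,
                  catalogue: Dict[str, str],
                  rerank_score: Callable[[str, str], float],
                  n: int = 3,
                  k: int = 2) -> List[str]:
    """Re-rank only the n first-stage candidates with the expensive scorer."""
    candidates = first_stage(query, catalogue, n)
    reranked = sorted(candidates,
                      key=lambda pid: rerank_score(query, catalogue[pid]),
                      reverse=True)
    return reranked[:k]
```

The expensive scorer is called once per candidate, so its cost scales with n rather than with catalogue size. Anything the first stage leaves out of the top n never reaches the re-ranker.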

The caution we'd add is that LLM search performance is heavily dependent on catalogue quality. A dense retrieval model that embeds product descriptions in a semantic space can only return relevant results if the product descriptions contain meaningful semantic content. A product described as "Item BLK-034-XL, black, available in XS-XL" doesn't provide the semantic richness that LLM retrieval can exploit. The retailers seeing the strongest search performance improvements from LLM-native search are the ones who also invested in catalogue enrichment — structured attribute extraction, AI-generated product descriptions where human-written descriptions are thin, and consistent category taxonomy. The LLM search layer and the catalogue quality investment are complementary, and the retailers who treat them as independent projects typically see less of the available improvement than those who address both together.
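As a rough illustration of the catalogue-quality point, the sketch below flags thin descriptions and folds structured attributes into the text that would be embedded. The token heuristic, the min_tokens threshold of 8, and the function names are all assumptions made for the example. A production enrichment pipeline would use its own criteria.

```python
import re
from typing import Dict, List


def descriptive_tokens(description: str) -> List[str]:
    """Keep purely alphabetic words longer than two characters.

    SKU-like codes ("BLK-034-XL") and size ranges ("XS-XL") drop out because
    they contain digits or hyphens.
    """
    tokens = re.findall(r"[A-Za-z0-9-]+", description)
    return [t for t in tokens if t.isalpha() and len(t) > 2]


def needs_enrichment(description: str, min_tokens: int = 8) -> bool:
    """Flag descriptions with too few descriptive words (threshold is illustrative)."""
    return len(descriptive_tokens(description)) < min_tokens


def embedding_text(description: str, attributes: Dict[str, str]) -> str:
    """Prefix structured attributes so they contribute to the embedded text."""
    attribute_text = ". ".join(f"{name}: {value}" for name, value in attributes.items())
    return f"{attribute_text}. {description}" if attribute_text else description
```

Applied to the example above, "Item BLK-034-XL, black, available in XS-XL" yields only three descriptive tokens and is flagged for enrichment, while a sentence-length description of fabric, fit, and occasion is not.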

The roadmap item we're watching most closely in portfolio companies is the shift from LLM-enriched search — where the LLM improves a query-response interaction — to LLM-native discovery, where the interaction model itself changes. A consumer who can maintain a conversational context across multiple search queries within a shopping session ("show me the ones that would work for an office environment" as a follow-on to an initial query) is engaging with the discovery surface in a qualitatively different way from keyword search. The infrastructure requirements for that conversational state management — session context, history-aware re-ranking, reference resolution across turns — are non-trivial, and the companies building toward that interaction model are the ones we expect to define the category over the next several years.
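A very small sketch of the session-state idea follows, assuming that a follow-up such as "show me the ones that..." refers to the previous result set. SearchSession, retrieve, and keep are hypothetical names. In a real system retrieve would be the search pipeline and keep would be a history-aware model call rather than a simple predicate.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SearchSession:
    """Holds the state needed to interpret follow-up queries within one session."""
    history: List[str] = field(default_factory=list)
    last_results: List[str] = field(default_factory=list)

    def search(self, query: str, retrieve: Callable[[str], List[str]]) -> List[str]:
        """Run a fresh query and remember both the query and its results."""
        self.history.append(query)
        self.last_results = retrieve(query)
        return self.last_results

    def refine(self, follow_up: str,
               keep: Callable[[List[str], str], bool]) -> List[str]:
        """Resolve "the ones" to the previous results and filter them in context."""
        self.history.append(follow_up)
        context = list(self.history)  # full turn history passed to the judge
        self.last_results = [item for item in self.last_results
                             if keep(context, item)]
        return self.last_results
```

Even in this toy form, the follow-up turn operates on stored results and the full turn history, not on the follow-up text alone, and that is where the additional infrastructure cost comes from.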