10.6x Lower Token Cost with Knowledge Graphs

Summary

Why this benchmark exists

Browser agents are usually evaluated on completion: did the agent eventually reach the right answer? For production use, completion is not enough. The runner also has to stay affordable, predictable, and fast enough to use in a real workflow.

Knowledge graphs change how the agent operates. StableBrowse builds site knowledge at runtime as the agent observes pages, controls, and responses, then lets the LLM plan against a compact graph of what the browser has learned.

Results

Average cost and speed

Lower is better for both metrics. Token cost captures how much context the runtime asks the model to process. Speed is reported as elapsed seconds for a completed browser flow, so a lower number means a faster run.

Average token cost

StableBrowse: 12k
Codex: 70k
StageHand: 140k
Playwright MCP: 173k

Average speed (seconds)

StableBrowse: 25s
Playwright MCP: 72s
Codex: 84s
StageHand: 90s

Runner	Average token cost	Average speed (seconds)	Tokens saved by StableBrowse	Time saved by StableBrowse	Runtime pattern
StableBrowse	12k	25s	Baseline	Baseline	LLM plans against reusable site knowledge graphs
Codex	70k	84s	58k fewer tokens	59s faster	General browser agent loop
StageHand	140k	90s	128k fewer tokens	65s faster	Instruction-driven browser automation
Playwright MCP	173k	72s	161k fewer tokens	47s faster	Tool-driven browser control through MCP

Average tokens saved: 116k

Mean reduction versus Codex, StageHand, and Playwright MCP.

Average time saved: 57s

Mean time saved versus the same three runners.

Average time taken: 25s

StableBrowse average elapsed runtime per benchmark flow.
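The summary figures can be re-derived from the per-runner averages. The short Python check below uses only the averages reported in the table; the variable names are ours. Reading the 10.6x headline as the mean competitor token cost divided by the StableBrowse cost is our inference, not something the report states.

```python
# Per-runner averages from the results table (tokens in thousands, time in seconds).
results = {
    "StableBrowse": {"tokens_k": 12, "seconds": 25},
    "Codex": {"tokens_k": 70, "seconds": 84},
    "StageHand": {"tokens_k": 140, "seconds": 90},
    "Playwright MCP": {"tokens_k": 173, "seconds": 72},
}

baseline = results["StableBrowse"]
competitors = {name: r for name, r in results.items() if name != "StableBrowse"}

# Per-runner savings relative to StableBrowse.
tokens_saved = {n: r["tokens_k"] - baseline["tokens_k"] for n, r in competitors.items()}
time_saved = {n: r["seconds"] - baseline["seconds"] for n, r in competitors.items()}

avg_tokens_saved = sum(tokens_saved.values()) / len(tokens_saved)
avg_time_saved = sum(time_saved.values()) / len(time_saved)

# Assumed reading of the headline: mean competitor cost over StableBrowse cost.
avg_competitor_tokens = sum(r["tokens_k"] for r in competitors.values()) / len(competitors)
cost_ratio = avg_competitor_tokens / baseline["tokens_k"]

print(tokens_saved)  # {'Codex': 58, 'StageHand': 128, 'Playwright MCP': 161}
print(round(avg_tokens_saved), round(avg_time_saved), round(cost_ratio, 1))  # 116 57 10.6
```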

Methodology

What we counted

We used browser tasks that require real navigation and state changes. A flow could involve search, filters, product detail pages, availability checks, structured extraction, map/list views, or stopping before checkout.

Task unit

One natural-language instruction executed end-to-end on one live website.

Token metric

Average model tokens consumed by the runtime while completing the flow.

Speed metric

Average elapsed seconds from task start to final answer.

Success boundary

The runner must return the requested structured answer or clearly reach the requested state.
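As a minimal sketch of how these definitions could be turned into per-runner averages, the Python below records one entry per task unit and averages over flows that met the success boundary. The `FlowRecord` type, its field names, and the choice to exclude failed flows are assumptions made for illustration; the report does not describe its harness.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class FlowRecord:
    """One task unit: one instruction run end-to-end on one live website."""
    runner: str
    tokens: int          # model tokens consumed while completing the flow
    started_at: float    # task start, in seconds
    finished_at: float   # final answer returned, in seconds
    succeeded: bool      # whether the success boundary was met

    @property
    def elapsed(self) -> float:
        return self.finished_at - self.started_at


def summarize(records: list[FlowRecord]) -> dict[str, dict[str, float]]:
    """Average tokens and elapsed seconds per runner, over completed flows only."""
    by_runner: dict[str, list[FlowRecord]] = {}
    for rec in records:
        if rec.succeeded:  # assumption: failed flows are left out of the averages
            by_runner.setdefault(rec.runner, []).append(rec)
    return {
        runner: {
            "avg_tokens": mean(r.tokens for r in recs),
            "avg_seconds": mean(r.elapsed for r in recs),
        }
        for runner, recs in by_runner.items()
    }


# Placeholder values for illustration only; these are not benchmark measurements.
demo = [
    FlowRecord("runner-a", 1000, 0.0, 10.0, True),
    FlowRecord("runner-a", 3000, 0.0, 20.0, True),
    FlowRecord("runner-a", 9000, 0.0, 60.0, False),
]
summary = summarize(demo)
print(summary)
```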

Graph Coverage

Representative KG-backed websites

The public list below includes only domains with local knowledge graph artifacts in the StableBrowse index. It mixes ecommerce, marketplaces, real estate, travel, local search, developer sites, and information retrieval.

Prompt Mix

Example prompts

Prompts were intentionally varied so the benchmark would exercise different page types and control patterns. These are representative examples, not the full prompt set.

Beauty retail

Find a mineral sunscreen under $40, prefer sensitive-skin products, and return three options with price and SPF.

Furniture

Search for a queen bed frame, filter to storage beds if available, and compare price, color, and delivery signal.

Footwear

Find men's running shoes in size 10.5, sort or filter where possible, and return visible sale options.

Real estate

On a listings page, narrow to two-bedroom rentals and return three visible options with price and neighborhood.

Electronics

Search for mirrorless cameras, narrow to a price band, and return product names with body-only or kit indicators.

Office supplies

Find ink cartridges for a specific printer family and return in-stock products with pack size and price.

Marketplace

Search used desk chairs, prefer local pickup, and return listings with price, condition, and location when visible.

Local search

Find coffee shops near a neighborhood, compare ratings and review counts, and return three options.

Auditability

What can be inspected

One reason knowledge graphs are useful operationally is that they leave behind artifacts. Instead of a browser agent improvising through a long hidden chain of screenshots and retries, the graph records what the system believes about a site: page states, capabilities, stable controls, and tested transitions.

Site graph

The graph captures reusable knowledge about search, filters, product cards, detail pages, carts, maps, listings, and other high-value surfaces.

Capability traces

Each promoted action can be inspected as a capability rather than buried inside a one-off prompt or browser transcript.

LLM handoff

Execution can be reviewed as a handoff from natural-language intent to a smaller set of graph-backed actions.
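To make the idea of inspectable artifacts concrete, here is a small Python sketch of what a site graph might look like as data. The `SiteGraph` and `Capability` types, their fields, the `shop.example` domain, and the selectors are all hypothetical; the report does not publish the StableBrowse schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Capability:
    """A promoted action: where it applies, what it touches, where it leads."""
    name: str        # e.g. "apply_filter"
    page_state: str  # page state in which the action is available
    control: str     # stable control reference (a selector in this sketch)
    leads_to: str    # page state after the tested transition


@dataclass
class SiteGraph:
    domain: str
    page_states: set[str] = field(default_factory=set)
    capabilities: list[Capability] = field(default_factory=list)

    def add(self, cap: Capability) -> None:
        self.page_states.update({cap.page_state, cap.leads_to})
        self.capabilities.append(cap)

    def available_in(self, page_state: str) -> list[str]:
        """Names of capabilities the model may call from a given page state."""
        return [c.name for c in self.capabilities if c.page_state == page_state]

    def audit_rows(self) -> list[str]:
        """Human-readable view of every recorded transition."""
        return [
            f"{c.page_state} --{c.name}[{c.control}]--> {c.leads_to}"
            for c in self.capabilities
        ]


graph = SiteGraph(domain="shop.example")
graph.add(Capability("search", "home", "input[name=q]", "search_results"))
graph.add(Capability("apply_filter", "search_results", "#filters", "search_results"))
graph.add(Capability("open_product", "search_results", ".product-card a", "product_detail"))

print(graph.available_in("search_results"))  # ['apply_filter', 'open_product']
for row in graph.audit_rows():
    print(row)
```

Because every action is a plain record, a reviewer can list what the system believes about a site without replaying a browser transcript.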

Failure Modes

What the graph removes from the hot path

General browser agents often spend their budget rediscovering the same facts: which input is search, which filters are active, whether a modal is blocking the page, where product cards begin, and which controls are safe to click.

Repeated page interpretation

Without a graph, each run asks the model to infer site structure from fresh snapshots. That increases token usage even when the task is routine.

Ambiguous controls

Commerce sites reuse labels like Sort, Size, Color, Close, Apply, and Add. The graph stores control context so the model is not guessing from a flat page.

Stateful flows

Filters, variants, zip-code gates, carts, maps, and date pickers all change page state. StableBrowse promotes these transitions into known capabilities.
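The ambiguous-controls case can be illustrated with a short Python sketch. It assumes a hypothetical index keyed by page state, surrounding container, and visible label; the keys and selectors are invented for the example. A flat lookup by label returns several candidates, while a lookup with context returns exactly one.

```python
# Hypothetical control index: (page state, container context, label) -> selector.
ControlKey = tuple[str, str, str]

controls: dict[ControlKey, str] = {
    ("search_results", "filter_panel", "Apply"): "#filters button.apply",
    ("search_results", "zip_code_modal", "Apply"): "#zip-modal button.apply",
    ("product_detail", "size_guide", "Close"): "#size-guide .close",
}


def candidates(label: str) -> list[str]:
    """What a flat page offers: every control that shares the label."""
    return sorted(sel for (_, _, lbl), sel in controls.items() if lbl == label)


def resolve(page_state: str, context: str, label: str) -> str:
    """What stored context offers: one control, or an explicit failure."""
    key = (page_state, context, label)
    if key not in controls:
        raise LookupError(f"no known control for {key}")
    return controls[key]


print(candidates("Apply"))
# ['#filters button.apply', '#zip-modal button.apply']
print(resolve("search_results", "filter_panel", "Apply"))
# #filters button.apply
```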

Architecture

Runtime knowledge graph construction

StableBrowse builds the graph while browser work is happening. The runtime turns observed page states, controls, and successful transitions into a compact model the LLM can use during the same flow and across repeat workflows.

1. Observe live surfaces

Visit home, search, category, product, cart, and booking surfaces as the agent encounters them.

2. Identify page states

Cluster result pages, PDPs, filters, date pickers, variant selectors, modals, and account boundaries.

3. Promote capabilities

Expose stable actions such as search, apply filter, extract cards, inspect variants, or read availability.

4. Verify with an LLM

Validate that an LLM can call the graph and complete a task without re-reading the whole page at every step.
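Steps 1 to 3 can be sketched as a small accumulator that records observations and promotes a transition once it has succeeded often enough. The `RuntimeGraphBuilder` class, the `promote_after` threshold, and the state and action names are assumptions for illustration. Step 4, the LLM verification, is outside the scope of this sketch.

```python
from collections import Counter

Transition = tuple[str, str, str]  # (page state, action, resulting page state)


class RuntimeGraphBuilder:
    """Hypothetical builder that promotes repeated successful transitions."""

    def __init__(self, promote_after: int = 2) -> None:
        self.promote_after = promote_after          # assumed promotion threshold
        self.page_states: set[str] = set()          # step 2: identified page states
        self.capabilities: set[Transition] = set()  # step 3: promoted capabilities
        self._successes: Counter = Counter()

    def observe(self, state: str, action: str, next_state: str, succeeded: bool) -> None:
        """Step 1: record one attempted action seen on a live surface."""
        self.page_states.add(state)
        if not succeeded:
            return  # failed attempts never count toward promotion
        self.page_states.add(next_state)
        transition = (state, action, next_state)
        self._successes[transition] += 1
        if self._successes[transition] >= self.promote_after:
            self.capabilities.add(transition)


builder = RuntimeGraphBuilder(promote_after=2)
builder.observe("home", "search", "search_results", succeeded=True)
builder.observe("home", "search", "search_results", succeeded=True)
builder.observe("search_results", "apply_filter", "filtered_results", succeeded=True)
builder.observe("search_results", "open_cart", "cart", succeeded=False)

print(sorted(builder.capabilities))  # [('home', 'search', 'search_results')]
```

Only the search transition is promoted here: the filter action has succeeded once, which is below the assumed threshold, and the failed cart attempt is ignored.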

Economics

Why runtime graphs reduce spend

Because StableBrowse turns live observations into reusable graph context, the model can work from known page capabilities instead of repeatedly loading large DOM snapshots and inferring the same actions.

runtime graph context = observed states + verified capabilities

In the benchmark, StableBrowse averaged 12k tokens per flow. Codex averaged 70k, Playwright MCP averaged 173k, and StageHand averaged 140k. Across those three comparisons, StableBrowse saved an average of 116k tokens per flow and completed tasks an average of 57 seconds faster.

The savings come from reducing repeated inference. As StableBrowse observes stable controls and page states, it keeps the decision surface smaller and gives the model more direct ways to act.
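The relation "runtime graph context = observed states + verified capabilities" can be sketched as a compact payload handed to the model. In the Python below, the payload shape, the four-characters-per-token estimate, and the synthetic page snapshot are all assumptions. The sizes it prints illustrate the mechanism only and say nothing about the benchmark numbers.

```python
import json


def build_graph_context(observed_states: set[str], verified_capabilities: list[dict]) -> str:
    """Serialize observed states plus verified capabilities into a compact payload."""
    payload = {
        "states": sorted(observed_states),
        "capabilities": verified_capabilities,
    }
    return json.dumps(payload, separators=(",", ":"))


def rough_tokens(text: str) -> int:
    """Crude estimate (about 4 characters per token); real tokenizers differ."""
    return max(1, len(text) // 4)


context = build_graph_context(
    {"home", "search_results", "product_detail"},
    [
        {"name": "search", "state": "home", "args": ["query"]},
        {"name": "apply_filter", "state": "search_results", "args": ["facet", "value"]},
        {"name": "extract_cards", "state": "search_results", "args": []},
    ],
)

# Synthetic stand-in for a raw page snapshot; not taken from any real site.
synthetic_snapshot = "<div class='product-card'><a href='#'>Item</a></div>" * 400

print(rough_tokens(context), rough_tokens(synthetic_snapshot))
```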

Interpretation

How to read the result

This benchmark does not claim that every website becomes easy once a graph exists. It shows that a reusable site representation changes the shape of the problem. The browser still has to deal with live pages, popups, inventory changes, and anti-automation behavior. The difference is that the LLM starts with a map.

That map is why StableBrowse can spend fewer tokens per run. The model is not repeatedly asked to infer the same selectors, content regions, and safe actions. It receives a smaller decision surface and calls capabilities that the runtime has observed and verified.

Takeaway

Production agents need reusable website memory.

Browser agents can already operate websites directly. StableBrowse changes the cost structure of doing so by storing verified site knowledge in a graph and letting the LLM call known capabilities.

Benchmark Scope

Disclaimer

This is an internal benchmark study, not a third-party certification or a leaderboard submission. The setup is designed to measure the effect of reusable website knowledge on browser-agent cost and latency.

StableBrowse uses external knowledge injection through site-specific knowledge graphs, plus runtime orchestration that exposes verified capabilities to the LLM. That is the point of the benchmark: we are measuring a production architecture where the agent is allowed to remember websites, not a constrained setting where every runner must start from a blank browser state with no prior site memory.

Results should therefore be read as an operational comparison for repeated website workflows. They should not be interpreted as a claim that StableBrowse would meet the rules of benchmarks that prohibit internet access, external tools, client-side harnesses, or hand-authored knowledge during evaluation.