StableBrowse Research
10.6x lower token cost with knowledge graphs
We ran browser-agent flows across 150 sites and compared StableBrowse's knowledge graph runtime against Codex, StageHand, and Playwright MCP. The benchmark measured whether a runtime could complete realistic browsing tasks while controlling two production costs: tokens and wall-clock time.
150 sites in benchmark suite
12k avg tokens per StableBrowse flow
25s avg StableBrowse time taken
116k avg tokens saved vs other runners
Summary
Why this benchmark exists
Browser agents are usually evaluated on completion: did the agent eventually reach the right answer? For production use, completion is not enough. The runner also has to stay affordable, predictable, and fast enough to use in a real workflow.
Knowledge graphs change how a runner meets those requirements. StableBrowse builds site knowledge at runtime as the agent observes pages, controls, and responses, then lets the LLM plan against a compact graph of what the browser has learned.
Results
Average cost and speed
Lower is better for both metrics. Token cost captures how much context the runtime asks the model to process. Speed captures elapsed time for a completed browser flow.
| Runner | Average token cost | Average speed (seconds) | Tokens saved by StableBrowse | Time saved by StableBrowse | Runtime pattern |
|---|---|---|---|---|---|
| StableBrowse | 12k | 25s | Baseline | Baseline | LLM plans against reusable site knowledge graphs |
| Codex | 70k | 84s | 58k fewer tokens | 59s faster | General browser agent loop |
| StageHand | 140k | 90s | 128k fewer tokens | 65s faster | Instruction-driven browser automation |
| Playwright MCP | 173k | 72s | 161k fewer tokens | 47s faster | Tool-driven browser control through MCP |
10.6x mean token reduction versus Codex, StageHand, and Playwright MCP.
About 3.3x mean speedup versus the same competitor set.
25s StableBrowse average elapsed runtime per benchmark flow.
Methodology
What we counted
We used browser tasks that require real navigation and state changes. A flow could involve search, filters, product detail pages, availability checks, structured extraction, map/list views, or stopping before checkout.
Flow: one natural-language instruction executed end-to-end on one live website.
Token cost: average model tokens consumed by the runtime while completing the flow.
Speed: average elapsed seconds from task start to final answer.
Completion: the runner must return the requested structured answer or clearly reach the requested state.
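As a rough illustration of how per-runner averages could be derived from individual flow records, here is a minimal sketch. The `FlowResult` record, its field names, and the policy of averaging only over completed flows are assumptions made for illustration; the study does not state how incomplete flows were treated, and the sample values are invented.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class FlowResult:
    """One flow: one instruction run end-to-end on one live site (hypothetical record)."""
    runner: str
    site: str
    tokens: int       # model tokens consumed during the flow
    seconds: float    # elapsed time from task start to final answer
    completed: bool   # structured answer returned or requested state reached


def summarize(results: list[FlowResult], runner: str) -> dict[str, float]:
    """Average tokens and seconds over a runner's completed flows (assumed policy)."""
    done = [r for r in results if r.runner == runner and r.completed]
    if not done:
        raise ValueError(f"no completed flows for {runner!r}")
    return {
        "flows": len(done),
        "avg_tokens": mean(r.tokens for r in done),
        "avg_seconds": mean(r.seconds for r in done),
    }


# Invented sample values, only to show the shape of the calculation.
sample = [
    FlowResult("runner-a", "shop.example", 10_000, 20.0, True),
    FlowResult("runner-a", "listings.example", 14_000, 30.0, True),
    FlowResult("runner-a", "travel.example", 50_000, 120.0, False),
]
summary = summarize(sample, "runner-a")
```

Excluding incomplete flows keeps a runner from looking cheap by giving up early, but it also hides failure cost; a fuller report would publish the completion rate alongside the averages.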
Graph Coverage
Representative KG-backed websites
The public list below only includes domains with local knowledge graph artifacts in the StableBrowse index. It mixes ecommerce, marketplaces, real estate, travel, local search, developer sites, and information retrieval.
Prompt Mix
Example prompts
Prompts were intentionally varied so the benchmark would exercise different page types and control patterns. These are representative examples, not the full prompt set.
Find a mineral sunscreen under $40, prefer sensitive-skin products, and return three options with price and SPF.
Search for a queen bed frame, filter to storage beds if available, and compare price, color, and delivery signal.
Find men's running shoes in size 10.5, sort or filter where possible, and return visible sale options.
On a listings page, narrow to two-bedroom rentals and return three visible options with price and neighborhood.
Search for mirrorless cameras, narrow to a price band, and return product names with body-only or kit indicators.
Find ink cartridges for a specific printer family and return in-stock products with pack size and price.
Search used desk chairs, prefer local pickup, and return listings with price, condition, and location when visible.
Find coffee shops near a neighborhood, compare ratings and review counts, and return three options.
Auditability
What can be inspected
One reason knowledge graphs are useful operationally is that they leave behind artifacts. Instead of a browser agent improvising through a long hidden chain of screenshots and retries, the graph records what the system believes about a site: page states, capabilities, stable controls, and tested transitions.
Site graph
The graph captures reusable knowledge about search, filters, product cards, detail pages, carts, maps, listings, and other high-value surfaces.
Capability traces
Each promoted action can be inspected as a capability rather than buried inside a one-off prompt or browser transcript.
LLM handoff
Execution can be reviewed as a handoff from natural-language intent to a smaller set of graph-backed actions.
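To make the idea of an inspectable artifact concrete, the sketch below serializes a small site graph to JSON. The schema (`SiteGraphArtifact`, `CapabilityTrace`, and every field name) is hypothetical and is not StableBrowse's actual format; it only shows the kind of record a reviewer could diff or audit.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class CapabilityTrace:
    """One promoted action, kept as an inspectable record (hypothetical schema)."""
    name: str          # e.g. "search"
    from_state: str    # page state where the action is available
    to_state: str      # page state observed after the action
    control: str       # description of the stable control used
    verified_runs: int = 0


@dataclass
class SiteGraphArtifact:
    """Reviewable snapshot of what the runtime believes about one site."""
    domain: str
    page_states: list[str] = field(default_factory=list)
    capabilities: list[CapabilityTrace] = field(default_factory=list)

    def to_json(self) -> str:
        # Sorted keys keep diffs between graph versions readable.
        return json.dumps(asdict(self), indent=2, sort_keys=True)


artifact = SiteGraphArtifact(
    domain="shop.example",
    page_states=["home", "search_results"],
    capabilities=[
        CapabilityTrace("search", "home", "search_results", "header search input", 3),
    ],
)
serialized = artifact.to_json()
```

Writing the graph out as plain JSON means a failed run can be compared against the last known-good artifact without replaying the browser session.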
Failure Modes
What the graph removes from the hot path
General browser agents often spend their budget rediscovering the same facts: which input is search, which filters are active, whether a modal is blocking the page, where product cards begin, and which controls are safe to click.
Repeated page interpretation
Without a graph, each run asks the model to infer site structure from fresh snapshots. That increases tokens even when the task is routine.
Ambiguous controls
Commerce sites reuse labels like Sort, Size, Color, Close, Apply, and Add. The graph stores control context so the model is not guessing from a flat page.
Stateful flows
Filters, variants, zip-code gates, carts, maps, and date pickers all change page state. StableBrowse promotes these transitions into known capabilities.
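The ambiguous-controls problem can be sketched as a lookup keyed by context rather than by label alone. The `ControlIndex` class, the region names, and the CSS selectors below are assumptions for illustration, not StableBrowse internals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ControlKey:
    """Context that distinguishes two controls sharing the same visible label."""
    page_state: str   # e.g. "search_results"
    region: str       # e.g. "filter_panel" or "promo_modal"
    label: str        # visible label, e.g. "Apply"


class ControlIndex:
    def __init__(self) -> None:
        self._selectors: dict[ControlKey, str] = {}

    def record(self, page_state: str, region: str, label: str, selector: str) -> None:
        """Store the selector observed for a control in a specific context."""
        self._selectors[ControlKey(page_state, region, label)] = selector

    def candidates(self, page_state: str, label: str) -> dict[str, str]:
        """All controls with this label in this page state, keyed by region."""
        return {
            key.region: selector
            for key, selector in self._selectors.items()
            if key.page_state == page_state and key.label == label
        }

    def resolve(self, page_state: str, region: str, label: str) -> str:
        """Return the one selector for a fully specified context (KeyError if unknown)."""
        return self._selectors[ControlKey(page_state, region, label)]


index = ControlIndex()
index.record("search_results", "filter_panel", "Apply", "#filters button.apply")
index.record("search_results", "promo_modal", "Apply", "#promo button.apply")

ambiguous = index.candidates("search_results", "Apply")
chosen = index.resolve("search_results", "filter_panel", "Apply")
```

With only the label, "Apply" matches two controls; adding the region narrows it to one, which is the disambiguation the graph is described as providing.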
Architecture
Runtime knowledge graph construction
StableBrowse builds the graph while browser work is happening. The runtime turns observed page states, controls, and successful transitions into a compact model the LLM can use during the same flow and across repeat workflows.
Visit home, search, category, product, cart, and booking surfaces as the agent encounters them.
Cluster result pages, PDPs, filters, date pickers, variant selectors, modals, and account boundaries.
Expose stable actions such as search, apply filter, extract cards, inspect variants, or read availability.
Validate that an LLM can call the graph and complete a task without re-reading the whole page every step.
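The four steps above can be sketched as a small in-memory graph: observe states, record transitions, and expose an action as a capability only once it has succeeded often enough. The class name, method names, and the `promote_after` threshold are assumptions for illustration; the source does not describe how StableBrowse decides that a transition is verified.

```python
from collections import defaultdict


class RuntimeSiteGraph:
    """Toy runtime graph: page states are nodes, verified actions are edges."""

    def __init__(self, promote_after: int = 2) -> None:
        # Number of successful runs before an action is exposed (assumed policy).
        self.promote_after = promote_after
        self.states: set[str] = set()
        self._successes: dict[tuple[str, str, str], int] = defaultdict(int)

    def observe_state(self, state: str) -> None:
        """Step 1: note a page state as the agent encounters it."""
        self.states.add(state)

    def record_transition(self, source: str, action: str, target: str, succeeded: bool) -> None:
        """Steps 2 and 3: count successful transitions toward promotion."""
        self.states.add(source)
        if succeeded:
            self.states.add(target)
            self._successes[(source, action, target)] += 1

    def capabilities(self, state: str) -> list[str]:
        """Step 4: actions the LLM may call from this state without re-reading the page."""
        return sorted(
            action
            for (source, action, _target), count in self._successes.items()
            if source == state and count >= self.promote_after
        )


graph = RuntimeSiteGraph(promote_after=2)
graph.observe_state("home")
graph.record_transition("home", "search", "search_results", succeeded=True)
graph.record_transition("home", "search", "search_results", succeeded=True)
graph.record_transition("home", "open_cart", "cart", succeeded=True)   # seen once only
graph.record_transition("home", "sign_in", "account", succeeded=False)  # never verified

exposed = graph.capabilities("home")
```

In this sketch only `search` is exposed from `home`, because it is the only action that reached the threshold; `open_cart` stays unpromoted until it succeeds again.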
Economics
Why runtime graphs reduce spend
Because StableBrowse turns live observations into reusable graph context, the model can work from known page capabilities instead of repeatedly loading large DOM snapshots and inferring the same actions.
runtime graph context = observed states + verified capabilities
In the benchmark, StableBrowse averaged 12k tokens per flow. Codex averaged 70k, StageHand averaged 140k, and Playwright MCP averaged 173k. Across those three comparisons, StableBrowse saved an average of about 116k tokens per flow and completed tasks an average of 57 seconds faster.
The savings come from reducing repeated inference. As StableBrowse observes stable controls and page states, it keeps the decision surface smaller and gives the model more direct ways to act.
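The averages quoted above can be reproduced from the per-runner figures with simple arithmetic. The sketch below uses only the benchmark's reported averages; the function and variable names are illustrative.

```python
# Reported benchmark averages: tokens in thousands, time in seconds.
avg_tokens_k = {"StableBrowse": 12, "Codex": 70, "StageHand": 140, "Playwright MCP": 173}
avg_seconds = {"StableBrowse": 25, "Codex": 84, "StageHand": 90, "Playwright MCP": 72}


def compare(metric: dict[str, float], baseline: str = "StableBrowse") -> dict:
    """Per-runner savings, mean saving, and mean ratio relative to the baseline."""
    base = metric[baseline]
    others = {name: value for name, value in metric.items() if name != baseline}
    mean_other = sum(others.values()) / len(others)
    return {
        "saved": {name: value - base for name, value in others.items()},
        "mean_saved": mean_other - base,
        "ratio": mean_other / base,
    }


tokens = compare(avg_tokens_k)
timing = compare(avg_seconds)

print(tokens["saved"])                  # per-runner token savings, in thousands
print(round(tokens["mean_saved"], 1))   # mean token saving, in thousands
print(round(tokens["ratio"], 1))        # mean token-cost ratio
print(timing["mean_saved"])             # mean seconds saved
```

Under these inputs the per-runner token savings are 58k, 128k, and 161k, the mean saving is roughly 115.7k (rounded to 116k in the text), the mean token ratio is about 10.6x, and the mean time saving is 57 seconds.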
Interpretation
How to read the result
This benchmark is not saying every website becomes easy once a graph exists. It is showing that a reusable site representation changes the shape of the problem. The browser still has to deal with live pages, popups, inventory changes, and anti-automation behavior. The difference is that the LLM starts with a map.
That map is why StableBrowse can spend fewer tokens per run. The model is not repeatedly asked to infer the same selectors, content regions, and safe actions. It receives a smaller decision surface and calls capabilities that the runtime has observed and verified.
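One way to picture the smaller decision surface is as a compact context string handed to the model, plus a check that the model's choice is one of the exposed capabilities. This is a minimal sketch under assumed names (`build_context`, `validate_choice`) and an assumed JSON shape; it does not describe StableBrowse's actual prompt format.

```python
import json


def build_context(current_state: str, capabilities: dict[str, list[str]]) -> str:
    """Render only the current state and its verified actions, not the full page."""
    actions = sorted(capabilities.get(current_state, []))
    return json.dumps({"state": current_state, "actions": actions}, separators=(",", ":"))


def validate_choice(context: str, chosen_action: str) -> str:
    """Reject any action the model proposes that the graph did not expose."""
    allowed = json.loads(context)["actions"]
    if chosen_action not in allowed:
        raise ValueError(f"{chosen_action!r} is not an exposed capability: {allowed}")
    return chosen_action


# Hypothetical capabilities per page state.
known = {
    "search_results": ["extract_cards", "apply_filter"],
    "product_detail": ["read_availability", "inspect_variants"],
}

context = build_context("search_results", known)
action = validate_choice(context, "apply_filter")
```

The context here is a few dozen characters regardless of how large the underlying page is, which is the property the token savings are attributed to.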
Takeaway
Production agents need reusable website memory.
Browser agents can operate websites directly. StableBrowse changes the cost structure by storing verified site knowledge in a graph and letting the LLM call known capabilities.
Benchmark Scope
Disclaimer
This is an internal benchmark study, not a third-party certification or a leaderboard submission. The setup is designed to measure the effect of reusable website knowledge on browser-agent cost and latency.
StableBrowse uses external knowledge injection through site-specific knowledge graphs, plus runtime orchestration that exposes verified capabilities to the LLM. That is the point of the benchmark: we are measuring a production architecture where the agent is allowed to remember websites, not a constrained setting where every runner must start from a blank browser state with no prior site memory.
Results should therefore be read as an operational comparison for repeated website workflows. They should not be interpreted as a claim that StableBrowse would meet the rules of benchmarks that prohibit internet access, external tools, client-side harnesses, or hand-authored knowledge during evaluation.