A lot of routing work evaluates isolated prompts, but real agent systems are fundamentally multi-step and budget-constrained. Cool to see benchmarks moving toward execution-grounded, end-to-end evaluation instead of just token-level proxies.
TwinRouterBench is a strong step toward realistic agentic routing evaluation — especially the separation between static supervision and dynamic SWE-bench execution. Excited to see where this goes!
Excited to share that TwinRouterBench has been accepted to the #RLEval Workshop at #CAIS2026 🎉
As LLM apps become long-horizon agents, one request can trigger many model calls across planning, tool use, retrieval, coding, and verification.
That makes per-step LLM routing a
Great to see TwinRouterBench accepted to the #RLEval Workshop at #CAIS2026!
Per-step routing is quickly becoming essential infrastructure for agentic systems: each planning, coding, retrieval, and verification call should use the cheapest sufficient model without hurting final task success.
Proud to open-source TwinRouterBench and contribute a practical benchmark for this problem.
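To make "cheapest sufficient model" concrete, here is a minimal sketch of per-step routing. Everything in it is an assumption for illustration: the model names, prices, capability tiers, and the step-to-tier table are hypothetical, and this is not how TwinRouterBench or any particular router decides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    cost_per_call: float  # hypothetical price, arbitrary units
    capability: int       # hypothetical tier, higher means stronger


# Hypothetical model pool, not taken from the benchmark.
MODELS = [
    Model("small", 1.0, 1),
    Model("medium", 4.0, 2),
    Model("large", 15.0, 3),
]

# Hypothetical minimum tier each kind of step needs.
REQUIRED_TIER = {"retrieval": 1, "planning": 2, "verification": 2, "coding": 3}


def route_step(step_kind: str) -> Model:
    """Return the cheapest model whose tier meets the step's requirement."""
    required = REQUIRED_TIER[step_kind]
    sufficient = [m for m in MODELS if m.capability >= required]
    return min(sufficient, key=lambda m: m.cost_per_call)


# One request fans out into several model calls.
steps = ["planning", "retrieval", "coding", "verification"]
chosen = [route_step(s) for s in steps]

routed_cost = sum(m.cost_per_call for m in chosen)
strongest = max(MODELS, key=lambda m: m.capability)
baseline_cost = len(steps) * strongest.cost_per_call

print([m.name for m in chosen], routed_cost, baseline_cost)
```

A real router has to estimate "sufficient" from the step's context rather than read it from a table, which is the hard part a benchmark needs to measure.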
Conflict of interest? Acknowledged.
We know our router (UncommonRoute) currently leads the leaderboard.
Open submissions, locked pricing, public scoring code. If a different router wins, the leaderboard will say so.
How do you evaluate an LLM router fairly?
Most benchmarks score isolated prompts, but routers operate at the level of individual agent steps. A router that saves money but breaks the task can be worse than no router at all.
We open-sourced TwinRouterBench to measure this honestly.
🧵
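A rough way to see why cheaper can still be worse: compare cost per solved task rather than raw spend. The numbers below are invented for illustration, and this metric is an assumption of the sketch, not TwinRouterBench's scoring code.

```python
def cost_per_solved_task(total_cost: float, solved: int) -> float:
    """Total spend divided by tasks solved; infinite if nothing was solved."""
    if solved == 0:
        return float("inf")
    return total_cost / solved


# Hypothetical results over the same 100 tasks.
no_router = {"total_cost": 600.0, "solved": 50}     # always call the strongest model
cheap_router = {"total_cost": 300.0, "solved": 20}  # halves spend, breaks many tasks

baseline_score = cost_per_solved_task(**no_router)
router_score = cost_per_solved_task(**cheap_router)

# The router spends less overall but pays more per task it actually solves.
print(baseline_score, router_score)
```

In this made-up case the router cuts total spend in half yet costs more per solved task, which is the failure mode a prompt-level benchmark cannot see.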
Run Claude Code with Commonstack in 4 steps:
- generate an API key
- set 4 environment variables
- run claude
- /status to verify
Set it up now in 5 minutes with @alex_mirran.
GPT-5.5 is live on Commonstack.ai! 🚀🚀
Use the strong reasoning and coding capabilities of GPT-5.5 in your application or with your favorite agentic harness.
DeepSeek-V4-Flash
🔹 Reasoning capabilities closely approach V4-Pro.
🔹 Performs on par with V4-Pro on simple Agent tasks.
🔹 Smaller parameter size, faster response times, and highly cost-effective API pricing.
3/n