Excited to share that members of our lab co-authored six papers accepted at #ICML2026: three Main Track and three Workshop papers 🔥🚀
📄 Accepted papers:
▪️ OpenTSLM: Time-Series Language Models for Reasoning over Multivariate Medical Text- and Time-Series Data [Main Track]
▪️ Auditing Emotion-Vector-Steered Political Bias in Open-Weight LLMs [AI4GOOD Workshop]
▪️ Reinforcement Learning of Karma Bidding Strategies [NExT-Game Workshop]
▪️ Cinematic Source Separation with Dialogue-Driven Sidechain Ducking [Workshop on Machine Learning for Audio]
▪️ Reinforcement Learning for Tool-Calling Agents in Fast Healthcare Interoperability Resources (FHIR) [Main Track]
▪️ Preserving Plasticity in Continual Learning via Dynamical Isometry [Main Track]
@robertjakob @PatrickLanger20 @gaborhollbeck @f14wn @DerRiehl @atoof_sh @deepqlearning @kev_osull @cs06thegreat @nzuma0 @maxrosenblattl
Huge congratulations to everyone involved. We are looking forward to presenting these works, reconnecting with colleagues, and meeting new friends in Seoul 🇰🇷
🔗 Full links in the comments.
This is only the starting point. Obvious extensions:...
- Frontier models at full reasoning budget
- Memory and scratchpad ablations
- Allow agents to chat
- Comparison against humans
- Using the traces not just to rank agents, but to train them via self-play and RL
Static benchmarks measure what a model knows.
Interactive benchmarks measure what a model keeps doing when its own actions change the state of the world.
Cattle Trade is intended to be a first step in that direction.
Paper: arxiv.org/abs/2605.14537 @ETH_agent_lab @robertjakob
The lesson learned is that agent evaluation needs environments where capabilities compose under pressure.
Memory alone is not enough.
Math alone is not enough.
Tool use alone is not enough.
The question is whether the policy remains coherent when the world pushes back.
Why this matters:
Agents for procurement, negotiation, sales, portfolio management, auctions, scheduling, and resource allocation: none of them solve one-shot questions.
They act in interactive systems where opponents react, incentives conflict, and state-tracking errors compound.
LLMs had many of the basic pieces. They could bid, trade, bluff, follow rules, complete sets.
They didn't reliably compose these abilities into a stable policy when other agents reacted and the game state changed after every move.
Different incentives for bidding spirals:
NEED: price out the auctioneer to close your quartet
EXTRACT: auctioneer pays you via buy-right
BLOCK: win a card to stop a rival's quartet
BID UP: make a rival pay the inflated price
PUNT: speculatively win a card nobody collects yet
EXTRACT is the most cognitively demanding spiral.
You're not trying to win the card. You're modeling the auctioneer's willingness to keep it, pushing past their threshold so they pay you to walk away.
Gemini 3 Flash and Gemini 3.1 Flash Lite EXTRACT most and perform best.
Gemini 2.5 Flash Lite challenged for a goose with zero money cards. Opponent countered with 10 coins and won for free.
Gemini 3 Flash paid far above face value for a fourth quartet. Multiplicative scoring made it worth +1,800 points.
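A minimal sketch of why a late quartet can be worth far more than its face value. It assumes the scoring rule of the original Kuhhandel card game (sum of the face values of completed quartets, multiplied by the number of completed quartets). Whether Cattle Trade uses exactly this rule is an assumption, and the face values below are hypothetical placeholders, not figures from the paper.

```python
def final_score(quartet_values):
    """Score = (sum of face values of completed quartets) * (number of quartets).

    Assumed rule, borrowed from the original Kuhhandel card game.
    """
    return sum(quartet_values) * len(quartet_values)


def marginal_gain(current_values, new_value):
    """Points gained by completing one more quartet with face value `new_value`."""
    return final_score(current_values + [new_value]) - final_score(current_values)


# Hypothetical face values: three completed quartets and a fourth candidate.
held = [100, 200, 300]
candidate = 150

gain = marginal_gain(held, candidate)
print(f"face value {candidate}, marginal gain {gain}")
```

Under this assumed rule the gain equals `new_value * (n + 1) + sum(current_values)`, so a bid well above face value can still be rational once several quartets are already complete.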
Strong agents shift gears across the game.
Gemini 3 Flash preserves cash early and ramps bid intensity nearly 10x by late game, timed to quartet-completion pressure.
Gemini 2.5 Flash Lite bids aggressively from turn one and stays flat. It skips the accumulation phase entirely.
What separates winners from losers is strategic coherence: spending efficiency, resource discipline, phase-adapted bidding, composed together.
Top agents end games broke with 4 quartets. Cash hoarders flatline at 2.
Holding money isn't the goal. Converting it is.
The strongest result isn't "LLMs are bad at games."
It's that broad reasoning scores don't measure strategic integration.
Same 7 LLMs, three benchmarks, three orderings. Gemini 3.1 Flash Lite: 2nd on Cattle Trade, 3rd on Chatbot Arena, last on AAI Index. GPT-5.4 Nano runs opposite.
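One way to quantify how much two leaderboards disagree is a rank correlation. The sketch below computes Spearman's rho for two tie-free rankings in plain Python. The model names and rank positions are placeholders for illustration only, not the rankings reported in the paper.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation for two rankings of the same items, without ties.

    Each argument maps item name -> rank position (1 = best).
    """
    if rank_a.keys() != rank_b.keys():
        raise ValueError("both rankings must cover the same items")
    n = len(rank_a)
    if n < 2:
        raise ValueError("need at least two items")
    d_squared = sum((rank_a[k] - rank_b[k]) ** 2 for k in rank_a)
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Placeholder rankings of four hypothetical models on two benchmarks.
interactive = {"model_a": 1, "model_b": 2, "model_c": 3, "model_d": 4}
static = {"model_a": 4, "model_b": 3, "model_c": 2, "model_d": 1}

print(spearman_rho(interactive, static))
```

A value near 1 means the two benchmarks order the models the same way; a value near -1 means the orderings are reversed.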
Most LLM benchmarks ask: can the model answer the question?
We ask whether a model stays coherent across 60 turns of bidding, bluffing, bargaining, hidden offers, liquidity management, and adversarial play.
Inspired by the German card game Kuhhandel. Strategy under imperfect info.
A few hundred lines of code. No pretraining. No language. No "reasoning." Just structured state tracking, cash estimates, simple bidding rules.
Across 242 games, it outranked 6 of 7 cost-efficient LLMs on TrueSkill. It doesn't forget who has money or what it's bidding against.
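To make "structured state tracking" and "cash estimates" concrete, here is a rough sketch of a ledger that updates per-player cash estimates from public auction results. It is not the paper's agent: the class, its method names, the player names, and the starting cash of 90 are all hypothetical, and it assumes the winning price moves from buyer to seller.

```python
from dataclasses import dataclass, field


@dataclass
class CashLedger:
    """Point estimates of each player's cash, updated from public auction results."""

    estimates: dict = field(default_factory=dict)

    def record_auction(self, buyer, seller, price):
        # Assumption: the buyer pays `price` to the seller. The buyer's estimate
        # is clamped at zero in case the earlier estimate was too low.
        self.estimates[buyer] = max(0, self.estimates.get(buyer, 0) - price)
        self.estimates[seller] = self.estimates.get(seller, 0) + price

    def can_afford(self, player, price):
        """True if the estimated cash of `player` covers `price`."""
        return self.estimates.get(player, 0) >= price

    def max_rival_bid(self, me):
        """Largest bid any rival is estimated to be able to make."""
        rivals = [cash for name, cash in self.estimates.items() if name != me]
        return max(rivals, default=0)


# Hypothetical three-player example with an assumed starting cash of 90 each.
ledger = CashLedger({"p1": 90, "p2": 90, "p3": 90})
ledger.record_auction(buyer="p2", seller="p1", price=60)
print(ledger.estimates, ledger.max_rival_bid("p1"))
```

Even a point estimate like this is enough to avoid bidding against a rival who cannot pay, or challenging one who can easily outspend you.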
Public cards. Hidden money offers. Zero-value bluff cards. No change for overpayment. Buy-rights in auctions. Multiplicative scoring.
You need memory, arithmetic, opponent modeling, and timing.
All at once.
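The arithmetic behind "no change for overpayment" can be sketched as a small subset search: pick the money cards whose total covers the price with the least waste. The function name and the card denominations are hypothetical, and brute force is only reasonable because hands are small.

```python
from itertools import combinations


def cheapest_payment(cards, price):
    """Return the subset of money cards with the smallest total that covers `price`.

    No change is given, so anything paid above `price` is lost. Returns None
    when the hand cannot cover the price.
    """
    best = None
    for size in range(1, len(cards) + 1):
        for combo in combinations(cards, size):
            total = sum(combo)
            if total >= price and (best is None or total < sum(best)):
                best = combo
    return best


hand = [10, 10, 50, 100]  # hypothetical denominations
payment = cheapest_payment(hand, 30)
overpaid = sum(payment) - 30
print(payment, overpaid)
```

With this hand, a price of 30 cannot be met exactly, so the cheapest legal payment still wastes cash. That is the kind of detail an agent has to fold into its bid before it commits.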
Why do models that score well on standard benchmarks still make bad strategic decisions?
In Cattle Trade, accepted at @MALGAI_ICLR2026, LLMs play a 4-player, 50+ turn game of auctions, hidden offers, bluffing, and limited cash.
Some LLMs lose to simple Python agents.
🧵
🤯 We cracked RLVR with... Random Rewards?!
Training Qwen2.5-Math-7B with our Spurious Rewards improved MATH-500 by:
- Random rewards: +21%
- Incorrect rewards: +25%
- (FYI) Ground-truth rewards: +28.8%
How could this even work⁉️ Here's why: 🧵
Blogpost: tinyurl.com/spurious-rewar…
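For readers unfamiliar with the setup, here is a hedged sketch of what the three reward signals could look like as functions. The definitions are assumptions made for illustration: exact string match for correctness, a coin flip with probability 0.5 for the random reward, and agreement with a deliberately wrong label for the incorrect reward. The blogpost is the authority on the actual definitions.

```python
import random


def ground_truth_reward(answer, label):
    """1.0 if the model's final answer matches the correct label, else 0.0."""
    return 1.0 if answer.strip() == label.strip() else 0.0


def random_reward(answer, rng, p=0.5):
    """Coin-flip reward that ignores the answer entirely (p is an assumed value)."""
    return 1.0 if rng.random() < p else 0.0


def incorrect_reward(answer, wrong_label):
    """1.0 if the answer matches a deliberately wrong label, else 0.0."""
    return 1.0 if answer.strip() == wrong_label.strip() else 0.0


# The random reward carries no information about the answer: its mean is
# close to p whatever the model outputs.
rng = random.Random(0)
rewards = [random_reward("42", rng) for _ in range(1000)]
mean_random = sum(rewards) / len(rewards)
print(mean_random)
```

The point of the comparison is that the last two signals say nothing useful about correctness, yet the post reports gains from training on them.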
We design, develop, and evaluate AI systems that run the real world, and shape the researchers and founders who will define its future. Based at ETH Zurich.