We looked into how rankings change by task. Some models perform strongly across the board: Claude 4.5 Opus Thinking ranks #1 in the Code Arena and #4 in the Text Arena. But we also see big shifts between Text and Code in both directions, with Grok 4.1 dropping 29 places and MiniMax M2.1 rising 19.
GPT-5.2, GLM 4.7, DeepSeek v3.2, and MiMo v2 Flash also gained in the Code Arena rankings versus their Text performance.
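For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of how a Text-to-Code rank shift could be computed. It assumes the shift is simply the Text rank minus the Code rank, so a positive number means the model places higher in Code. Only the Claude 4.5 Opus Thinking ranks come from the post above; `placeholder-a`, `placeholder-b`, and their ranks are invented for illustration and are not real leaderboard entries.

```python
def rank_shift(text_rank: dict[str, int], code_rank: dict[str, int]) -> dict[str, int]:
    """Return text rank minus code rank for models present in both leaderboards.

    Positive values mean the model ranks higher (closer to #1) in Code than in Text.
    """
    shared = text_rank.keys() & code_rank.keys()
    return {model: text_rank[model] - code_rank[model] for model in shared}


# Claude's ranks are taken from the post; the placeholder rows are hypothetical.
text = {"Claude 4.5 Opus Thinking": 4, "placeholder-a": 10, "placeholder-b": 30}
code = {"Claude 4.5 Opus Thinking": 1, "placeholder-a": 25, "placeholder-b": 12}

shifts = rank_shift(text, code)

# List the biggest drops first and the biggest gains last.
for model, delta in sorted(shifts.items(), key=lambda kv: kv[1]):
    print(f"{model}: {delta:+d}")
```

Models that appear in only one arena are skipped rather than given a default rank, since a missing entry says nothing about where the model would have placed.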
2026 AI Benchmark & Capability Forecast
Core LLM Benchmarks
* ARC-AGI 2: 90% (Signaling high-level reasoning maturity)
* SWE-bench (Base): 90% (Near-complete automation of basic PRs)
* SWE-bench Pro: 73% – 74% (High-tier autonomous software engineering)
* GPQA: 94% – 99% (Saturated; exceeding expert human level)
* SimpleBench: 90% (Solid common-sense grounding) - this bench is really good at showcasing jaggedness
* LiveCodeBench: 95% – 100% (Saturated)
* MMLU Pro: 95% – 100% (Saturated)
Advanced Reasoning & Math
* Humanity’s Last Exam: 60% (Achieved without massive test-time scaling)
* FrontierMath (Tiers 1-3): 60%
* FrontierMath (Tier 4): 48%
MMMU (Multimodal): This becomes the new "GPQA-style" hurdle.
* MMMU (Val): 92%
* MMMU (Pro): 83% – 84%
Agentic & Industrial Milestones
* METR Time Horizons: Models successfully managing 12–14 hour long-form tasks (at 50–80% success rates).
* Artificial Analysis Index: Top models hitting the 80–83 range.
* Software Engineering: We reach a tipping point in coding automation, setting the stage for a "noticeable takeoff" in 2027.
* Continuous Learning: The first light implementations of models that learn "on the fly" begin appearing late in the year.
Macro Predictions & Industry Shifts
* The Labs & AGI: No lab will claim "AGI reached," but OpenAI, Anthropic, and xAI will state they have the exact blueprint to get there within a few years.
* DeepMind: Genie 4 will debut and shock the public with its world modeling capabilities.
* Tesla & Robotics: FSD is solved with rapid urban expansion; the first autonomous humanoid robots enter homes (though they remain slow and limited initially).
* Public Perception: A massive "disconnect" remains. The public will focus on AI art and video critiques, largely missing how powerful the underlying coding and agentic logic has become.
My Timeline
* Personal AGI Arrival Date: 2029.
* Reflection: Briefly considered moving it to 2028 after seeing Claude 4.5 capabilities, but currently standing firm at 2029.
My Biggest Wish: A true Desktop Agent that can autonomously navigate a PC and use various apps to handle complex knowledge work.
[1/7]
Excited to share our new survey on Latent Reasoning! The field is buzzing with methods—looping, recurrence, continuous thoughts—but how do they all relate? We saw a need for a unified conceptual map. 🧵
📄 Paper: arxiv.org/abs/2507.06203
💻 Github: github.com/multimodal-art…
Really excited to work with @AndrewYNg and @DeepLearningAI on this new course on post-training of LLMs—one of the most creative and fast-moving areas in LLM development.
We cover the key techniques that turn pre-trained models into helpful assistants: SFT, DPO, and online RL.
Post-training is evolving fast: from scaling SFT to scaling RL, and from human preferences to verifiable rewards in math, coding, knowledge reasoning, agents, and instruction following. The field has become a cornerstone of powerful language models, and this course offers a practical overview of these techniques.
We hope it’s a solid starting point for anyone looking to understand post-training or customize their own models.
Check it out here: bit.ly/4knRg33
New Course: Post-training of LLMs
Learn to post-train and customize an LLM in this short course, taught by @BanghuaZ, Assistant Professor at the University of Washington @UW, and co-founder of @NexusflowX.
Training an LLM to follow instructions or answer questions has two key
Accuracy takes time: LLMs with higher benchmark scores take longer to run. Anyone who has compared models knows this intuitively, but what exactly is the trade-off? Here’s what we found!
For friends of open source: imo the highest leverage thing you can do is help construct a high diversity of RL environments that help elicit LLM cognitive strategies. To build a gym of sorts. This is a highly parallelizable task, which favors a large community of collaborators.
Free, useful guides on model distillation:
1. Model Distillation guide from @OpenAI
2. Knowledge Distillation tutorial by @PyTorch
3. Jetson Introduction to Knowledge Distillation by @nvidia
4. Tutorial on Knowledge Distillation with @kerasteam
5. @huggingface's guides:
- Knowledge Distillation
- Knowledge Distillation for Computer Vision
Save this list and check out the links below 👇
(1/8) LLM Model Evals 💪vs LLM Task Evals 🥊
Evals are all the rage 🔥, but they mean different things to different people.
The biggest confusion is that there are actually 2 different categories of evals.
1⃣Model evals (ex: HellaSwag, MMLU, TruthfulQA etc)
2⃣Task evals (ex: Q&A from Phoenix Evals: github.com/Arize-ai/phoen…)
Model Evals vs Task Evals is the difference between measuring "generalized fitness" 💪 and "specialized fitness" 🥊
Most of us would like to have generalized fitness because it allows us to do a variety of everyday activities well. But if sumo wrestling were your dream, you would obviously prefer to have a much larger body mass.
The problem is, most practitioners today are focusing on generalized fitness and getting crushed in the ring ☠️
🧵 on the differences
Tagging folks working on the LLM Model or Task Eval space!
@rown @universeinanegg @ybisk @YejinChoinka @allen_ai @haileysch__ @lintangsutawika @hendrycks @markchen90 @MillionInt @HenriquePonde @Shahules786 @karlcobbe @mobav0 @lukaszkaiser
🎓LLM Course
This is such a beautiful and comprehensive resource on LLMs.
It includes notebooks, key references, and roadmaps.
There is something to learn for everyone. For students, researchers, and practitioners.
The Prompt Engineering Guide is also referenced, which is cool to see.
One observation as I was reviewing the references is how much hard work the ML community dedicates toward open and high-quality education. This resource does a great job of organizing all those incredible LLM educational resources that exist out there.
One topic I would add is LLMOps. But to be fair, the majority of the topics are roughly covered in the LLM Engineer Roadmap.
Highly recommended!
And last but not least, many thanks to @maximelabonne for releasing this excellent resource. 👏