BerkeleyNLP @BerkeleyNLP

We work on natural language processing, machine learning, linguistics, and deep learning. PIs: Dan Klein, @alsuhr, @sewon__min
nlp.cs.berkeley.edu
Berkeley, California
Joined September 2019

Tweets: 125
Followers: 7K
Following: 37
Likes: 131

Sewon Min @sewon__min

2 weeks ago

Really amazing results analyzing what's creative/novel vs. what's copied from Internet data, enabled by the amazing @liujc1998's Infini-gram! infini-gram.io This is also enabled in @allen_ai's OlmoTrace allenai.org/blog/olmotrace where anyone can find matching n-grams between LLM-generated text and its training data.

Alex Imas @alexolegimas

3 weeks ago

This from @TuhinChakr is brilliant. That prize winning story from Granta? Turns out it's just a bunch of random whole phrases taken directly from existing text on the internet. Tool allows you to trace those n-grams directly to their source, which is mostly random fanfiction.

37 114 801 342K 580

1 10 83 17K 54


Negar Arabzadeh @NegarEmpr

4 weeks ago

1/ Thrilled to introduce T³: a corpus for RAG over reasoning tasks, built from thinking traces. We show that, surprisingly, RAG can improve reasoning with the right corpus. RAG with Transformed Thinking Traces (T³) gains up to 43.9% on AIME 2025-2026. 🔗 arxiv.org/abs/2605.03344 🧵

11 31 210 473K 123


Ryan Yixiang Wang @RyanYixiang

a month ago

8/ For reproducibility and to enable further study of modularity in MoEs, we’re releasing EMO, baselines, and code: Models: hf.co/collections/al… Blog: allenai.org/blog/emo Code: github.com/allenai/EMO Viz: emovisualization.netlify.app Shoutout to @AkshitaB93 @sewon__min for making this possible!

1 3 14 1K 8


Sewon Min @sewon__min

a month ago

Full details + results: arxiv.org/abs/2605.06663 Also check out emovisualization.netlify.app: our model specializes in a qualitatively different way (capability-level rather than lexical) -- this emerged naturally even though we didn't expose any domain prior!

0 3 29 2K 8


Sewon Min @sewon__min

a month ago

As MoEs grow larger and sparser, they become memory-bottlenecked. What if experts were actually composable - so you only keep the subset relevant to your task? We show that this doesn't emerge in standard MoEs (their training makes this hard), but you can pre-train MoEs to support this kind of modularity! I hope everyone sees the right figure from @RyanYixiang 's original post - I was so excited when I saw this result!!

Ryan Yixiang Wang @RyanYixiang

a month ago

MoEs are everywhere in frontier models, and they are deployed as monolithic systems. But many applications only need a narrow slice of capabilities, e.g., math, code, biomedical, etc. So what if "modularity" is actually the missing opportunity for MoEs? Today, we're releasing

7 73 529 114K 322

4 41 323 48K 222


Ryan Yixiang Wang @RyanYixiang

a month ago

Ai2 @allen_ai

a month ago

Today we’re releasing EMO, a new mixture-of-experts (MoE) model trained so modular structure emerges directly from data without human-defined priors. EMO can use a small subset of its experts for a given task while keeping near full-model performance. 🧵

13 57 405 88K 234

7 73 529 114K 322


Sewon Min @sewon__min

2 months ago

I will give two talks at ICLR workshops!! 🇧🇷 Sunday 9:40–10:10: "LLMs for Distributed Data Use" @ Workshop on Data Problems in Foundation Models (Room 203 A/B) Monday 15:30–16:05: "Are Mixture-of-Experts Modular? Why It Matters and How to Fix It" @ ICBINB Workshop (Room 201 C) Both happen to be related to MoEs, but tackle two completely different questions → come say hi!

2 9 130 13K 18


Sanjay Adhikesaven @sadhikesaven

2 months ago

Imagine you fully post-trained "YourModel v1". Then, you've got better data — math, code, tool use, safety — and you want to improve it. Today, that usually means retraining the whole model. But what if new data could be added modularly, with a fixed cost each time?

Ai2 @allen_ai

2 months ago

Last year, we introduced FlexOlmo, a novel way to train parts of a model independently then combine them later. BAR builds on that idea for a harder problem: how to keep improving a model without having to retrain each time. 🧵

7 20 199 70K 144

5 18 139 20K 89


Sewon Min @sewon__min

4 months ago

Exciting results on open-source models for IMO-level problems - congratulations to @aviral_kumar2 and everyone involved!! Great to see @wenjie_ma's ProofGrader (proofgrader.github.io) integrated into the development ✨

Lewis Tunstall @_lewtun

4 months ago

We trained a tiny 4B model to reason for millions of tokens through IMO-level problems. Heaps excited to share our new blog post covering the full pipeline, from distilling the 🐳 to augmenting RL with a reasoning cache that unlocks extreme inference-time scaling for theorem

24 129 824 163K 874

0 11 83 12K 42


Sewon Min @sewon__min

6 months ago

Really excited about this work!! As a retrieval person, having a pre-training-scale retrieval index in an academic setting has long been a dream, and I thought it would be too difficult / infeasible. Collaborating with systems experts made it possible much earlier than I expected. Huge thanks to the students driving this: @YichuanM and @jinjianliuu !

Yichuan Wang @YichuanM

6 months ago

5 52 174 67K 112

5 16 120 23K 48


Yichuan Wang @YichuanM

6 months ago

(1/N) 🚀 DS-Serve is a framework for efficient, scalable neural retrieval — it turns any in-house dataset (<1T tokens) into a high-throughput (up to 10,000 QPS), low-latency (<100ms), memory-efficient (<200GB RAM) retrieval system with a web UI and API. With DS-Serve, we publicly deployed a 400B-token datastore of high-quality LLM pretraining data (2B vectors), spanning academic resources — and it matches commercial search endpoints on our benchmarks at extremely low latency and high throughput. Try it out: api.ds-serve.org:30888/ui Blog: berkeley-large-rag.github.io/RAG-DS-Serve Work from UC Berkeley ( @BerkeleyNLP & @BerkeleySky) with collaborators at UW & UIUC!

5 52 174 67K 112


Jiaxin Ge @aomaru_21490

8 months ago

✨Introducing ECHO, the newest in-the-wild image generation benchmark! You’ve seen new image models and new use cases discussed on social media, but old benchmarks don’t test them! We distilled this qualitative discussion into a structured benchmark. 🔗 echo-bench.github.io

4 31 129 47K 47


Sewon Min @sewon__min

8 months ago

Super excited about @wenjie_ma's work on verifying math proofs! ✅ 24 competitions, 3 SoTAs (o3, Gemini-2.5-Pro, R1) ✅ Strong evaluator -- a carefully designed evaluator with simple ensemble beats agentic ones ✅ Strong best-of-n performance Check out the paper & website!

Wenjie Ma @wenjie_ma

8 months ago

9 37 196 60K 96

3 12 116 32K 41


Wenjie Ma @wenjie_ma

8 months ago

LLMs solving math benchmarks with verifiable answers like AIME? ✅ LLMs solving math proofs? ❌ Still an open problem. RL works great for final-answer problems, but proofs are different: - Often no single checkable answer - Correct answers can hide flawed reasoning The key bottleneck: reliable proof evaluation. Without a good evaluator, we can't automatically evaluate or train better "provers." Our new work tackles this challenge step by step. 🧵 📄 Paper: arxiv.org/pdf/2510.13888

9 37 196 60K 96


Kayo Yin @kayo_yin

a year ago

Happy to announce the first workshop on Pragmatic Reasoning in Language Models — PragLM @ COLM 2025! 🧠🎉 How do LLMs engage in pragmatic reasoning, and what core pragmatic capacities remain beyond their reach? 🌐 sites.google.com/berkeley.edu/p… 📅 Submit by June 23rd

6 20 94 58K 23


Ruiqi Zhong @ZhongRuiqi

a year ago

Last day of PhD! I pioneered using LLMs to explain datasets & models. It's used by interp at @OpenAI and societal impact at @AnthropicAI. Tutorial here. It's a great direction & someone should carry the torch :) Thesis available, if you wanna read my acknowledgement section =P

30 35 540 58K 265


Nicholas Tomlin @NickATomlin

a year ago

The long-term goal of AI is to build models that can handle arbitrary tasks, not just ones they’ve been trained on. We hope our new *benchmark generator* can help measure progress toward this vision

Vivek Verma @vcubingx

a year ago

🎮 Excited to announce gg-bench, a fully synthetic benchmark for LLMs consisting of games generated entirely by LLMs!! This benchmark centers around the fact that LLMs are capable of generating complex tasks that they themselves cannot even solve. 📄: arxiv.org/abs/2505.07215

3 23 145 38K 90

4 29 179 26K 122



Nicholas Tomlin @NickATomlin

a year ago

I'm incredibly excited to share that I'll be joining @TTIC_Connect as an assistant professor in Fall 2026! Until then, I'm wrapping up my PhD at Berkeley, and after that I'll be a faculty fellow at @NYUDataScience

33 10 201 17K 11


Ruiqi Zhong @ZhongRuiqi

a year ago

Finished my dissertation!!! (scalable oversight, link below) Very fortunate to have @JacobSteinhardt and Dan Klein as my advisors! Words can't describe my gratitude, so I used a pic of Frieren w/ her advisor :) Thanks for developing my research mission, and teaching me magic