未达之风 @MLostWind

Joined September 2014

Tweets: 12
Followers: 16
Following: 435
Likes: 0

Dustin @r0ck3t23

2 months ago

Ilya Sutskever just told the AI industry why scaling is finished. One word built it. One word is about to break it.
Sutskever: “Scaling is just one word, but it’s such a powerful word because it informs people what to do.”
For five years, that single word replaced an entire research culture. Nobody needed breakthroughs. They needed bigger checks.
Sutskever: “If you mix some compute with some data into a neural net of a certain size, you will get results, and you will know that it will be better if you just scale the recipe up.”
That’s not science. That’s a recipe.
Sutskever: “Companies love this because it gives you a very low risk way of investing your resources.”
The most transformative technology in human history ran on the same logic used to franchise a restaurant chain. More locations. More ingredients. Same recipe. Predictable returns. You didn’t need researchers who could see around corners. You needed accountants who could approve purchase orders. But recipes expire.
Sutskever: “At some point though, pre-training will run out of data. The data is very clearly finite.”
Five years of infrastructure. Five years of hiring. Five years of investor decks. All built on top of something temporary.
Sutskever: “I don’t think that’s true.”
The co-founder of OpenAI. The mind behind the breakthroughs that made this entire era possible. Saying more money won’t solve it.
Sutskever: “In some sense we are back to the age of research.”
Most of the companies racing to build AGI were never research companies. They were scaling companies. They hired for execution. Not discovery. They optimized for throughput. Not insight. The talent pipelines. The investor pitches. The board decks. All built around one assumption. That the recipe would never expire.
It’s expiring. And the companies that spent five years perfecting the art of spending money are about to discover something. The next era demands what capital can’t purchase. An original idea.

66 141 1K 150K 1K


SemiAnalysis @SemiAnalysis_

2 months ago

NVIDIA vLLM NVL72 ADVANTAGE: GB200 NVL72 delivers up to 3x performance compared to B200 on @Kimi_Moonshot 's Kimi K2.5. This is enabled by GB200's scale-up network which allows for frontier inference optimizations like wide expert parallelism. Great work to @rogerw0108 @NVIDIAAIDev @vllm_project @inferact @simon_mo_ ! 🚀 Not only is SGLang optimized for disagg+wideEP but vLLM is optimized too!

2 25 169 52K 46


LMSYS Org @lmsysorg

2 months ago

🚀 Congrats @Alibaba_Qwen on releasing Qwen3.6-35B-A3B: day-0 support is now live in SGLang! The first open-weight Qwen3.6: 35B total params (3B active), same hybrid architecture as Qwen3.5, with major upgrades in agentic coding & thinking preservation.
🔧 Agentic Coding: frontend workflows & repo-level reasoning with greater fluency
🧠 Thinking Preservation: retains reasoning context from historical messages
⚡ Gated DeltaNet + Sparse MoE (256 experts, 8+1 active): high throughput, low latency
📏 262K native context, extensible to 1M
Cookbook: cookbook.sglang.io/autoregressive…
Launch with SGLang:

Qwen @Alibaba_Qwen

2 months ago

⚡ Meet Qwen3.6-35B-A3B: Now Open-Source! 🚀🚀 A sparse MoE model, 35B total params, 3B active. Apache 2.0 license.
🔥 Agentic coding on par with models 10x its active size
📷 Strong multimodal perception and reasoning ability
🧠 Multimodal thinking + non-thinking modes
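A rough illustration of why a sparse MoE model has far fewer active than total parameters: each token is routed to only a few experts. The sketch below is generic top-k routing and does not reproduce Qwen's actual router. The function name `route_token`, the softmax over the selected logits, and the reading of "8+1 active" (from the LMSYS post above) as 8 routed experts plus 1 always-on shared expert are all assumptions made for illustration.

```python
import math

NUM_EXPERTS = 256   # from the LMSYS post; everything else here is assumed
TOP_K = 8           # routed experts per token (assumed meaning of "8+1")
SHARED_EXPERTS = 1  # always-active shared expert (assumed meaning of "+1")


def route_token(gate_logits, top_k=TOP_K):
    """Pick the top_k experts by gate logit and softmax-normalise their weights."""
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:top_k]
    peak = max(gate_logits[i] for i in top)  # subtract max for numerical stability
    exps = [math.exp(gate_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Hypothetical gate logits for one token: expert i gets logit 0.01 * i.
logits = [0.01 * i for i in range(NUM_EXPERTS)]
routed = route_token(logits)

# Fraction of experts touched per token under the assumptions above.
active_fraction = (TOP_K + SHARED_EXPERTS) / NUM_EXPERTS
```

Under these assumptions only 9 of 256 experts run per token. That is the mechanism behind a small active parameter count, although the 3B figure also includes non-expert layers that this sketch ignores.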

445 2K 12K 2.7M 5K

3 5 36 3K 8


SemiAnalysis @SemiAnalysis_

2 months ago

Curious what's in the PR of almost 1400 kernels? Here we walk through a simple batched GEMM kernel:
🟠 Tile size: M128, N16, K256
🟠 W4A16: matrix A is INT4 with a BF16 scaling factor for every 32 elements; matrix B is BF16
🟠 3 pipeline stages
🟠 1 CTA MMA
🟠 Static scheduler
This warp-specialized kernel has the following warp roles:
🟠 Load A
🟠 Load A scaling factor (SF)
🟠 Load B
🟠 Cast A: Dequantize INT4 to BF16. Waits on Load A and Load A SF
🟠 MMA: Performs matmul. Waits on Cast A and Load B
🟠 Epilogue: Performs activation computation. Waits on MMA
An interesting thing about this kernel is that its MMA uses TS mode, because dequantizing matrix A requires CUDA cores, which work on registers instead of TMEM. As shown in our microbenchmarking article, TS mode has slightly lower throughput due to an SMEM bandwidth bottleneck. In addition, @cursor_ai has also shown that the CUDA core / Tensor Core compute gap creates bottlenecks. To mitigate these issues, the kernel uses pipelining, similar to what Cursor did.
Microbenchmarking article: newsletter.semianalysis.com/p/dissecting-n…
Cursor blog post: cursor.com/blog/kernels
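To make the "Cast A" and "MMA" steps concrete, here is a minimal pure-Python sketch of the arithmetic only: group-wise dequantization of INT4 values with one scale per 32 elements, followed by a plain matmul. It is not the kernel from the PR and models no tiling, pipelining, or warp specialization. The names `dequantize_row` and `w4a16_matmul` are hypothetical, the signed INT4 range and grouping along K are assumptions, and Python floats stand in for BF16.

```python
# Sketch of W4A16 arithmetic: dequantize INT4 matrix A, then multiply by B.
# Assumptions (not from the post): signed INT4 codes in [-8, 7], one scale
# per 32 consecutive elements along K, Python floats standing in for BF16.

GROUP_SIZE = 32  # elements of A that share one scaling factor


def dequantize_row(q_row, scales):
    """Expand one row of INT4 codes into floats using per-group scales."""
    assert len(q_row) == len(scales) * GROUP_SIZE
    assert all(-8 <= q <= 7 for q in q_row)
    return [q * scales[i // GROUP_SIZE] for i, q in enumerate(q_row)]


def w4a16_matmul(q_a, a_scales, b):
    """Compute dequant(A) @ B, where A is M x K (INT4 codes) and B is K x N."""
    a = [dequantize_row(row, s) for row, s in zip(q_a, a_scales)]  # "Cast A"
    k_dim, n_dim = len(b), len(b[0])
    return [  # "MMA" step, written as a naive triple loop
        [sum(a_row[k] * b[k][j] for k in range(k_dim)) for j in range(n_dim)]
        for a_row in a
    ]


# Tiny made-up example: M=1, K=64 (two scale groups), N=1.
q_a = [[1] * 32 + [2] * 32]
a_scales = [[0.5, 0.25]]
b = [[1.0] for _ in range(64)]
result = w4a16_matmul(q_a, a_scales, b)
```

In the kernel described above the dequantization and the matmul run in different warps and overlap through pipelining. Here they simply run one after the other.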

Alex Zhurkevich @cudagdb

3 months ago

Trtllmgen kernels are now open. Fastest prefill and decode kernels for our target workloads. We wrote these to win InferenceX, MLPerf, other benchmarks. Powering some of today’s top served models. Dive in, learn, use them, or level up your own. Enjoy. github.com/flashinfer-ai/…

13 51 334 149K 261

0 6 56 18K 43


SambaNova @SambaNovaAI

2 months ago

Faster AI chips alone don't fix slow inference. The real bottleneck is data movement. In the decode era, how well your architecture moves data determines speed, throughput, and cost. Here's why Dataflow matters more than ever 👇 sambanova.ai/blog/why-dataf…

1 2 2 320 1


DAIR.AI @dair_ai

2 months ago

Banger paper from NVIDIA. Agentic reasoning needs models that are not just capable, but efficient at long-context inference. The agent model layer is moving toward open, long-context, high-throughput architectures. This paper introduces Nemotron 3 Super, an open 120B parameter model with 12B active parameters, built as a hybrid Mamba-Attention Mixture-of-Experts architecture. The headline numbers are strong: up to 1M context length, comparable accuracy on common benchmarks, and up to 2.2x higher throughput than GPT-OSS-120B and 7.5x higher throughput than Qwen3.5-122B. The model combines several efficiency bets, including NVFP4 pretraining, LatentMoE for accuracy per FLOP and per parameter, and MTP layers for native speculative decoding. It is trained on 25 trillion tokens, then post-trained with supervised fine-tuning and RL. Paper: arxiv.org/abs/2604.12374 Learn to build effective AI agents in our academy: academy.dair.ai

8 37 222 21K 169


GitTrend @GitTrend0x

2 months ago

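Karpathy drops some code, and programmers across the internet collectively evolve! The master is at it again: he tosses out a minimalist repo/gist, and the community treats it as a foundational skeleton and builds a pile of production-grade tools on top of it. This is not simple forking. It is real evolution from the ground up, from educational toys into serious tools that can do research automatically, build knowledge bases automatically, and train a ChatGPT in 4 hours. I picked 4 "Karpathy-lineage evolutions" that are flooding X right now. Programmers will fall silent; AI enthusiasts will be ecstatic:
1️⃣ autoresearch (github.com/karpathy/autor…) 630 lines of code that let an AI agent modify code, train models, score the results, and keep the best, all on its own. Humans sleep; it evolves. "Tuning hyperparameters by hand until you go bald? Leave it to the machine!" (Someone has already remixed it into an ooda version that handles A/B testing and copy optimization.)
2️⃣ The llmwiki family (github.com/lucasastorian/… and others), evolved from Karpathy's LLM Wiki gist: the LLM is no longer a search engine but a "programmer" inside Obsidian that automatically summarizes, cross-references, and maintains a snowballing knowledge base. RAG is crying in the bathroom, and Twitter is exclaiming that "the knowledge base grew up on its own."
3️⃣ nanochat (github.com/karpathy/nanoc…) The master's latest "unhinged" work: the full-stack evolution of nanoGPT. A single GPU, 4 hours, and $100 produce a ChatGPT clone that can chat, write poetry, and solve problems. "Maximally forkable" is written into its design goals, so it is a likely candidate for the next research harness!
4️⃣ micrograd / nanoGPT derivative playgrounds (silicon-more, napagrad, and others), evolved from the foundations of the Zero to Hero course: the computation graph + training loop have been remixed in every way imaginable and have become the AI starting point and benchmark base for countless people.
// Why are these projects so hot? Karpathy never hands you a black-box framework. He gives you a minimalist, readable, hackable skeleton. When you fork it, you are not copying homework; you are recursively self-improving alongside a master. This is the highest form of open source: one person's code becomes the whole world's evolutionary tree.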

11 177 961 66K 1K


nash_su - e/acc @nash_su

2 months ago

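After Karpathy, YC's CEO @garrytan has also shared his own knowledge management method: gist.github.com/garrytan/49c88… Karpathy's: gist.github.com/karpathy/442a6… I visualized the core architecture, logic, and concepts of each approach so they are easier to follow.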

17 318 2K 145K 2K


NVIDIA AI Infrastructure @NVIDIAAIInfra

2 months ago

We deliver the lowest token cost through extreme co-design. As NVIDIA software optimizations increase token throughput, the value of your NVIDIA GPUs grows from the moment you invest in them. Learn more ➡️ nvda.ws/4me7HBr

14 18 235 40K 32


Deep Learning Weekly @dl_weekly

2 months ago

🤖 From this week's issue: A practical guide to LLM inference optimization framed around the "efficient frontier" concept — five techniques that move production systems toward the latency/throughput Pareto boundary without additional hardware spend. cloud.google.com/blog/topics/de…
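As a small illustration of the "efficient frontier" idea, the sketch below filters a set of serving configurations down to those on the latency/throughput Pareto boundary. A configuration is kept if no other one is at least as good on both axes and strictly better on one. The function name `pareto_frontier`, the configuration names, and all numbers are made up for illustration; none come from the linked guide.

```python
def pareto_frontier(configs):
    """Return configs not dominated on (lower latency, higher throughput).

    Each config is a (name, latency_ms, tokens_per_s) tuple.
    """
    frontier = []
    for name, lat, tput in configs:
        dominated = any(
            (l2 <= lat and t2 >= tput) and (l2 < lat or t2 > tput)
            for _, l2, t2 in configs
        )
        if not dominated:
            frontier.append((name, lat, tput))
    return sorted(frontier, key=lambda c: c[1])  # order by latency


# Hypothetical measurements: (name, latency in ms, throughput in tokens/s).
measurements = [
    ("config_a", 120, 900),
    ("config_b", 200, 2400),
    ("config_c", 90, 1100),
    ("config_d", 210, 2000),
]
frontier = pareto_frontier(measurements)
```

With these made-up numbers, `config_c` and `config_b` remain on the frontier. `config_a` and `config_d` are each beaten on both axes by another configuration, so an optimization that moves a system "toward the frontier" is one that turns a dominated point into a non-dominated one.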