Today we release Token Superposition Training (TST), a modification to the standard LLM pretraining loop that produces a 2-3× wall-clock speedup at matched FLOPs without changing the model architecture, optimizer, tokenizer, or training data.
During the first third of training, the model reads and predicts contiguous bags of tokens, averaging their embeddings on the input side and predicting the next bag with a modified cross-entropy on the output side. For the remainder of the run, it trains normally on next-token prediction. The inference-time model is identical to one produced by conventional pretraining.
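The bag-of-tokens phase described above can be sketched roughly as follows. This is a minimal, dependency-free illustration and not the released implementation. The helper names (`make_bags`, `average_embedding`, `bag_cross_entropy`, `in_superposition_phase`) are hypothetical. The exact form of the modified cross-entropy is not given here, so the sketch assumes the target is a uniform distribution over the tokens in the next bag.

```python
import math

def make_bags(token_ids, bag_size):
    """Split a sequence into contiguous, non-overlapping bags (a ragged tail is dropped)."""
    n_full = len(token_ids) // bag_size
    return [token_ids[i * bag_size:(i + 1) * bag_size] for i in range(n_full)]

def average_embedding(embedding_table, bag):
    """Input side: mean of the embedding vectors of the tokens in one bag."""
    dim = len(embedding_table[0])
    return [sum(embedding_table[t][d] for t in bag) / len(bag) for d in range(dim)]

def bag_cross_entropy(logits, target_bag):
    """Output side (ASSUMED form): mean negative log-probability of the next bag's tokens."""
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(logits[t] - log_z for t in target_bag) / len(target_bag)

def in_superposition_phase(step, total_steps, fraction=1 / 3):
    """True while training is still in the bag phase (the first third by default)."""
    return step < fraction * total_steps

# Toy usage: a vocabulary of 4 tokens with 2-dimensional embeddings.
embedding_table = [[0.0, 2.0], [2.0, 4.0], [1.0, 1.0], [3.0, 5.0]]
bags = make_bags([0, 1, 2, 3, 0], bag_size=2)           # two full bags, tail dropped
model_input = average_embedding(embedding_table, bags[0])
loss = bag_cross_entropy([0.0, 0.0, 0.0, 0.0], bags[1])  # uniform logits as a stand-in
```

Under this assumed loss, a bag size of 1 reduces both sides to ordinary next-token prediction: the averaged embedding is the plain token embedding and the loss is standard cross-entropy, which is consistent with the switch to normal training for the rest of the run.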
TST has been validated at 270M, 600M, and 3B dense scales, and at 10B-A1B MoE.
The work on TST was led by @bloc97_, @gigant_theo, and @theemozilla.