Distributed query engine providing simple and reliable data processing for any modality and scale (https://t.co/IN219tFqrN)
daft.ai
Joined September 2022
🚢 Daft v0.7.15 just shipped.
try_cast() converts types without crashing your pipeline — invalid values become null instead of throwing a runtime error.
Also in this release: LZ4 flight shuffle compression, UUIDv7 partition transforms, PostgreSQL source.
daft.ai/blog/daft-v0-7…
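To make the "invalid values become null" behavior concrete, here is a minimal plain-Python sketch of the idea. It does not call Daft; the helper name `try_cast` and its signature below are hypothetical stand-ins used only to illustrate the semantics described in the post.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def try_cast(value: object, caster: Callable[[object], T]) -> Optional[T]:
    """Hypothetical helper: return caster(value), or None if conversion fails."""
    if value is None:
        return None  # nulls stay null
    try:
        return caster(value)
    except (ValueError, TypeError):
        # A strict cast would raise here and stop the pipeline;
        # the "try" variant swallows the failure and yields a null.
        return None


raw = ["42", "7", "not-a-number", None, "3.5"]
as_ints = [try_cast(v, int) for v in raw]
print(as_ints)
```

The rows that cannot be parsed as integers come back as `None`, so downstream steps can filter or count them instead of the whole job failing.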
On a TPC-H repartition across 32 workers: at 10 TB the object-store shuffle runs the head node out of memory past 1,000 partitions, while Flight Shuffle completes every partition count and runs 3.6 to 4.7x faster where both finish.
If a distributed query has to materialize more than a few terabytes of data, there's one operation that will dominate: the shuffle.
Shuffling data at scale has been a real bottleneck for Daft users, so we took the time to fix the root cause and rebuild the shuffle from scratch.
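For readers new to the term, a shuffle redistributes rows so that all rows with the same key end up in the same output partition. The sketch below is a generic, in-memory hash repartition in plain Python. It is not Daft's Flight Shuffle or its object-store shuffle, and the function names are illustrative assumptions.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, int]


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash (CRC32 chosen for illustration)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle(worker_rows: Iterable[List[Row]], num_partitions: int) -> Dict[int, List[Row]]:
    """Route every row from every worker to the partition that owns its key."""
    partitions: Dict[int, List[Row]] = defaultdict(list)
    for rows in worker_rows:
        for key, value in rows:
            partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


# Two "workers", each holding a mix of keys.
workers = [
    [("a", 1), ("b", 2), ("c", 3)],
    [("a", 4), ("c", 5), ("d", 6)],
]
result = shuffle(workers, num_partitions=4)
```

Every worker may send data to every partition, which is why the cost grows so quickly with data size and partition count in a real distributed engine.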
Vision-Language-Action (VLA) model submissions at ICLR grew 18x in a single year, but World Action Models (WAMs) are showing more promising results when it comes to inference speed and adaptability.
ICLR is the premier gathering of professionals dedicated to the advancement of the branch of artificial intelligence called representation learning.
As Physical AI has gone mainstream, a ton of research has focused on leveraging VLAs to translate the intelligence of LLMs into robotics tasks.
But VLAs are slow, and a WAM such as Shengshu's MotuBrain achieved 96% on RoboTwin 2.0 with an architecture that supports policy learning, world modeling, video generation, inverse dynamics, and joint video-action prediction in a single model.
"These results show that unified world action models can scale in generality, predictive accuracy, and real-world deployability."
It's crazy that MotuBrain runs at 11 Hz and adapts to new humanoid embodiments with only 50–100 trajectories!
Link in the comments
First-class observability in Daft.
Operators, Tasks, Rows, and Memory are all surfaced in a dashboard that ships with the install.
+ OTel endpoints for your existing collector.
+ Stuck detection.
+ DAFT_TRACE for console debugging.
~45 PRs across the observability stack.
daft.ai/blog/first-cla…
daft.VideoFile is perfect for Physical AI.
Open X-Embodiment aggregates over a million episodes. DROID alone runs 350+ hours of multi-camera 60fps footage. That's hundreds of millions of frames across a single dataset, and most action-model training doesn't need them all.
- read_video_frames — filter on keyframes; supports S3, GCS, & YouTube URLs.
- video_metadata — resolution, fps, duration, frame count from file headers.
- video_frames(start_time, end_time) — decode a 10-second window from a 90-minute file.
Frames land as Image columns in the same DataFrame.
Feed them to a vision model, compute embeddings, and write to Iceberg.
Check out the blog
daft.ai/blog/daft-vide…
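A quick back-of-the-envelope check of the frame counts above, as a rough sketch. Two assumptions not stated in the post: the 350 hours is treated as wall-clock duration recorded by 3 cameras, and the 90-minute file is treated as 60 fps.

```python
FPS = 60
SECONDS_PER_HOUR = 3600

# DROID: 350 hours at 60 fps, per camera stream.
droid_hours = 350
frames_per_stream = droid_hours * SECONDS_PER_HOUR * FPS

# ASSUMPTION: 3 camera streams per episode (not stated in the post).
assumed_cameras = 3
total_frames = frames_per_stream * assumed_cameras

# Decoding a 10-second window from a 90-minute file (ASSUMED 60 fps).
window_frames = 10 * FPS
full_file_frames = 90 * 60 * FPS
fraction_decoded = window_frames / full_file_frames

print(f"{frames_per_stream:,} frames per stream")    # 75,600,000
print(f"{total_frames:,} frames across cameras")     # 226,800,000
print(f"{window_frames} of {full_file_frames:,} frames decoded")  # 600 of 324,000
```

Under these assumptions the window touches 1/540 of the file, which is why decoding only the requested time range matters.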
VLAs are dead, long live World Action Models
So declares @DrJimFan, the most credible researcher in robotics today.
daft.ai/blog/vlas-are-…
👆We just published a short blog where @ykdojo breaks down the video. It certainly helped me correct my mental model.
So turns out I'm not the only one who builds on @daftengine 😆
In fact, there's a TON of projects that leverage Daft natively to power their AI & data processing.
Daft is the Data Engine for AI.
> I say it because it's true.
> I keep saying it because the Daft community keeps giving back!
Check out all these projects! (link in the comments)
Probably my favorite episode yet!
Just finished filming our latest episode of Zero Shot Espresso with @danimberman, who is an @ApacheAirflow PMC member, developed the @kubernetesio executor, and now helps technical teams ship production AI as a consultant.
🚢 Daft v0.7.10
30 contributors (a release record!)
41 new features and functions.
Distributed as_of joins, SimHash dedupe, temporal arithmetic, C++ extensions.
daft.ai/blog/daft-v071…
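For context on the SimHash dedupe mentioned above, here is a minimal sketch of the general SimHash technique in plain Python. It is not Daft's implementation: the whitespace tokenizer, the MD5-based token hash, and the 64-bit width are all illustrative assumptions.

```python
import hashlib

BITS = 64  # fingerprint width (assumed for this sketch)


def token_hash(token: str) -> int:
    """Stable 64-bit hash of a token (first 8 bytes of MD5, chosen for illustration)."""
    return int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")


def simhash(text: str) -> int:
    """Each token votes +1/-1 on every bit; the sign of each total sets that bit."""
    weights = [0] * BITS
    for token in text.lower().split():
        h = token_hash(token)
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


doc_a = "daft is a distributed query engine for multimodal data"
doc_b = "daft is a distributed query engine for multimodal datasets"
print(hamming(simhash(doc_a), simhash(doc_b)))
```

Documents whose fingerprints differ in only a few bits are treated as near-duplicates; the distance threshold is a tuning choice that depends on the data.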
The fastest H3 geospatial indexing in Daft wasn't written by the Daft team.
Developed by Garrett Weaver, daft-h3 runs 3–16x faster than simply wrapping h3-py in a Python UDF. That speedup is thanks to Daft's Native Extensions powered by Apache Arrow's C Data Interface.
Most image embedding pipelines are actually two pipelines stitched together.
Script one: PySpark reads images from S3, resizes them, joins with metadata, writes to Delta Lake.
Script two: PyTorch loads ResNet, generates embeddings on GPU, writes back to Delta Lake.
Two frameworks. Two sets of dependencies. Two GPU configs. Serialization overhead at every boundary.
With Daft, it's one script. download → resize → join → embed → write. daft.cls handles GPU placement and batching. No handoff.