ericmichael @ericmichael

Data + AI lead @ UTHealth RGV / UTRGV School of Medicine, BJJ Black Belt @ 80/20 Jiu-Jitsu
Edinburg, TX
Joined July 2008

Tweets: 169 · Followers: 808 · Following: 716 · Likes: 1K

ericmichael @ericmichael

6 months ago

@rankintweets iPad, Tailscale SSH, Tmux. It’s great.

0 0 1 294 1

ericmichael @ericmichael

7 months ago

Biggest issue with Claude Code with Opus 4.5 is still that after a few context compressions it’s a huge risk to continue working in the conversation.

0 0 0 59 0

ericmichael @ericmichael

7 months ago

@jeremyphoward It’s a great and fast model! Simple little tricks like adding a no-op scratchpad tool to let it “plan” in-context can improve tool calling while keeping things super fast.

0 0 0 287 0
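
The no-op scratchpad trick mentioned in the post above could look roughly like the sketch below. It assumes an OpenAI-compatible function-calling schema; the tool name `scratchpad`, its `thoughts` parameter, and the `handle_tool_call` dispatcher are hypothetical names chosen for illustration, not anything taken from the post.

```python
import json

# Hypothetical tool definition in the OpenAI-compatible "tools" format.
# The tool has no side effects; it only gives the model a place to plan.
SCRATCHPAD_TOOL = {
    "type": "function",
    "function": {
        "name": "scratchpad",
        "description": "Write down a short plan before calling other tools. Has no side effects.",
        "parameters": {
            "type": "object",
            "properties": {
                "thoughts": {
                    "type": "string",
                    "description": "Free-form planning notes.",
                }
            },
            "required": ["thoughts"],
        },
    },
}


def handle_tool_call(name: str, arguments: str) -> str:
    """Dispatch a tool call. The scratchpad branch deliberately does nothing."""
    if name == "scratchpad":
        json.loads(arguments)  # only checks that the arguments are valid JSON
        return "ok"
    raise ValueError(f"unknown tool: {name}")
```

The plan stays in the conversation as the tool call's arguments, so the model can read it back on the next turn without any extra infrastructure.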

SAM 3D Body does a great job at reconstructing 3D scenes from 2D Brazilian Jiu-Jitsu images. It struggles a bit with entanglement/occlusion. Initial tests of using out-of-the-box SAM 3D embeddings for image -> position-embedding similarity show promise compared to CLIP and could have a ton of useful applications for coaches and athletes. Experimenting with fine-tuning an embedding model with contrastive loss to improve performance for position similarity.

0 0 2 76 0

ericmichael @ericmichael

7 months ago

@mervenoyann Replace detectron2/SAM2 with SAM3 + “person” prompt!

0 0 0 73 1

ericmichael @ericmichael

7 months ago

Very close to recreating the SAM3D Body playground. FastAPI + React + Three.JS. Vibe coded. 1. Use SAM3 prompted to identify people. Expose UI for user to select which identified people you want meshes for. 2. Pass masks to SAM3D Body to generate meshes and poses. 3. Animate with three.js. FB repo for SAM3D has lots of references but much better if you swap the repo’s human detection / masking with SAM3 (promptable) instead of detectron or SAM2.

0 0 2 159 0

ericmichael @ericmichael

7 months ago

@jxnlco 3D Jiu-Jitsu RAG course when?

0 0 0 16 0

ericmichael @ericmichael

7 months ago

Is it just me or is Playwright MCP the only useful MCP server out there? Everything else seems like noise. Or that using the MCP server would be more limited than just using their CLI interface (chainable and composable through bash).

0 0 1 59 0

ericmichael @ericmichael

7 months ago

@AnthropicAI “Tools. Models have access to a wide array of software tools (often via the open standard Model Context Protocol).” Something about this line doesn’t sit right. There are so few useful MCP servers that throwing this line into the article feels like a poorly placed ad.

0 0 2 4K 0

ericmichael @ericmichael

8 months ago

claude hates uv

tuna🍣 @tunahorse21

8 months ago

NO CLAUDE I SAID USE UV

56 159 3K 127K 190

0 0 2 67 0

ericmichael @ericmichael

8 months ago

I think we’re actually in agreement if I’m grokking this correctly. What I’m doing in practice, for example, is what I think you described in your summary: collecting a large number of diverse traces and hand-labeling them. Splitting the dataset into train/dev/test and optimizing the judge (allowing it to receive signal from dev human annotations) until TPR/TNR reach an acceptable threshold. The human signal is a key part of the judge training. Then optimizing the LLM (using GEPA) against the GEPA-trained judge.

0 0 0 65 0
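
The TPR/TNR check on the judge described in the post above can be sketched as follows. This is a minimal illustration, assuming binary pass/fail labels; the `Example` dataclass, the function names, and the 0.9 threshold are hypothetical and not taken from the post.

```python
from dataclasses import dataclass


@dataclass
class Example:
    human_pass: bool  # hand label from a domain expert
    judge_pass: bool  # verdict from the LLM judge on the same trace


def tpr_tnr(examples):
    """Return (true positive rate, true negative rate) of the judge vs. human labels."""
    tp = sum(e.human_pass and e.judge_pass for e in examples)
    fn = sum(e.human_pass and not e.judge_pass for e in examples)
    tn = sum(not e.human_pass and not e.judge_pass for e in examples)
    fp = sum(not e.human_pass and e.judge_pass for e in examples)
    tpr = tp / (tp + fn) if tp + fn else float("nan")
    tnr = tn / (tn + fp) if tn + fp else float("nan")
    return tpr, tnr


def judge_is_aligned(examples, threshold=0.9):
    """True when both rates meet the (assumed) threshold on the given split."""
    tpr, tnr = tpr_tnr(examples)
    return tpr >= threshold and tnr >= threshold
```

In this setup the judge would be tuned against the dev split and the final rates reported once on the held-out test split.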

ericmichael @ericmichael

8 months ago

GEPA prompt optimization: You don’t need a judge but it’s super helpful.
GEPA is going to produce a bunch of experiments on your prompt and you need a way of scoring the _output_ of those runs for GEPA to provide good reflective feedback.
If you have human-labeled data, the labels only apply to the specific prompt you used at that time. You need an automated way of scoring outputs with each prompt iteration to get the full power of GEPA approaches (IMO). You can do that with a well-aligned judge. Or you could do it with a code-based check.
Your human-labeled data went stale the second a new prompt variation was mutated. Humans would need to go relabel outputs for each prompt variation. Aligning a judge allows you to automate this on _unseen_ prompt/output combos so your experts don’t have to relabel the outputs for every mutated prompt.
Any clever way you have of using an LLM to match the output versus human-labeled data _is_ a judge whether you realize it or not.
Don’t want to make and align a judge? Write a code-based check!

1 0 2 567 5
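
A code-based check of the kind the post above recommends might look like this sketch. It assumes a hypothetical task where the model must return a JSON object with `answer` and `sources` keys; the function name, the keys, and the partial score of 0.5 are illustrative assumptions, and no GEPA API is used here. The check returns a score plus a short text explanation, since the optimizer needs something to reflect on.

```python
import json

# Hypothetical output contract for the task being optimized.
REQUIRED_KEYS = {"answer", "sources"}


def code_based_check(output: str) -> tuple[float, str]:
    """Score one model output and explain the score in plain text."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError as exc:
        return 0.0, f"Output is not valid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return 0.0, "Output must be a JSON object."
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        return 0.5, "Missing keys: " + ", ".join(sorted(missing))
    return 1.0, "All checks passed."
```

Because the check only looks at the output, it can be rerun unchanged on every mutated prompt, which is the property the hand labels lack.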

ericmichael @ericmichael

8 months ago

@gooby_esq I’d be curious to know what are these classes of problems that we expect an LLM to be able to solve but not be able to verify. It is likely the case that the judge needs access to data, tools, context to verify the solution.

1 0 0 35 0

Mitchell Hashimoto @mitchellh

8 months ago

Some reactions to my post are "wow, I'll never use Ghostty since you use AI." That's fine, I really don't care. But my friends, if you plan on avoiding all software that had any AI assistance in its dev, I have really bad news for you about the general software ecosystem.

97 104 3K 237K 162

ericmichael @ericmichael

9 months ago

I’m all for evals but with coding agents a lot of the eval work happened in post-training already and got built into the model. Not true for other domains.
The number of tools required for effective coding agents is really small: bash, apply_diff. Every other tool is really just a nice-to-have.
Most of the code for building a coding agent goes to controlling execution / UI / approvals more tightly. All of which would fall under the scope of automated testing rather than evals. So I can see why they would delay evals until vibe iteration runs out.
I just made my own coding agent and the out-of-the-box performance of GPT-5 high is really strong. Devs have to weigh the time it takes to eval correctly vs waiting for the next coding RL’d model to be released.
For any other domain it’s unlikely that the developer is also religiously using it and also a domain expert.

0 0 0 720 0
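
To illustrate how small the tool surface in the post above can be, here is a rough sketch of a `bash` tool gated by an approval callback. The `run_bash` name, the `approve` callback, and the output format are hypothetical; `apply_diff` is left out, and a real agent would need sandboxing beyond this.

```python
import subprocess


def run_bash(command: str, approve, timeout: int = 30) -> str:
    """Run a shell command for the agent, but only if the approval callback allows it."""
    if not approve(command):
        return "denied: command was not approved"
    result = subprocess.run(
        ["bash", "-c", command],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    # Return exit code plus combined output so the model can see failures.
    return f"exit={result.returncode}\n{result.stdout}{result.stderr}"
```

The approval callback is ordinary code, so it can be covered by plain unit tests rather than evals, which matches the point made in the post.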

ericmichael @ericmichael

9 months ago

- level 1: Chatbots, prompting (identity, goal, starting context, guidelines), using OpenAI compatible APIs
- level 2: Single Agent + Basic Retrieval Tools, Tracing, Human Evals
- level 3: Single Agent + Action Oriented Tools, Guardrails, Approvals, Reference-Based and Reference-Free Automated Evals
- level 4: Deployment, CI/CD, Privacy, Compliance, BAAs, Enterprise considerations
- level 5: Multi-Agent Architectures (handoffs, agents as tools), DSPy / Prompt Optimization using Eval data
- level 6: Workflows vs Agents, Hybrid Approaches
- level 7: Measuring ROI and making business case for AI applications
- level 8: Measuring OSS performance vs proprietary across evals, cost, performance metrics
Hot takes:
- Agentic RAG > vector embedding based retrieval, easier to create and better out of box performance
- Caching beyond what API providers already provide is premature optimization
- Single Agent systems > Multi-Agent systems based on knowledge complexity and bang for buck performance
- Generic metrics like ROUGE, BLEU, etc. harm more than they help. Evals should be human-centered and focused on measuring the prevalence of _observed failure modes_ specific to the application domain instead of context-unaware metrics
- Inference, self-hosting, OSS not advisable until measuring the prevalence of failure modes through evaluations shows that OSS performance is worth it when compared to other factors such as cost (GPUs, salaries, development complexity), latency, etc.
- De-identification (PII redaction) for regulated industries like healthcare is more challenging and imposes more legal risk than simply building HIPAA compliant environments that are allowed to just process and store the PHI

0 5 16 3K 31
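
Measuring the prevalence of observed failure modes, as suggested in the hot takes above, can be as simple as the sketch below. It assumes each reviewed trace has been annotated with zero or more failure-mode labels; the function name and the example labels are hypothetical.

```python
from collections import Counter


def failure_mode_prevalence(annotations):
    """Fraction of traces showing each failure mode.

    `annotations` holds one collection of labels per reviewed trace;
    an empty collection means no failure was observed in that trace.
    """
    total = len(annotations)
    if total == 0:
        return {}
    # set() ensures a label repeated within one trace is counted once.
    counts = Counter(mode for modes in annotations for mode in set(modes))
    return {mode: count / total for mode, count in counts.items()}


# Example with made-up labels from four reviewed traces.
traces = [["hallucinated_dose"], [], ["hallucinated_dose", "missed_handoff"], []]
prevalence = failure_mode_prevalence(traces)
```

Tracking these fractions across model or prompt changes gives an application-specific number to compare against cost and latency.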

ericmichael @ericmichael

9 months ago

@iScienceLuvr I think this would be really useful in a healthcare setting where folks are worried about agentic web search leaking PHI via search queries. Seems to me like it would be a much safer solution than any kind of automated de-identification scheme.