This is one of the clearest explanations of tool overload I’ve seen. The “it doesn’t crash, it drifts” line is exactly what we’ve observed in production-style agent runs too.
One thing that helped us was adding a tiny pre-routing step before tool exposure (task class -> allowed tool set), then logging wrong-tool picks as a first-class metric. It reduced both latency variance and rework loops.
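The pre-routing step above can be sketched roughly like this. All names here (`TASK_TOOL_MAP`, `visible_tools`, `record_tool_pick`) are hypothetical illustrations, not anyone's actual implementation:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("tool_router")

# Hypothetical mapping: task class -> allowed tool set, maintained by hand
TASK_TOOL_MAP = {
    "retrieval": {"search_docs", "read_file"},
    "math": {"calculator"},
    "code": {"run_tests", "read_file", "edit_file"},
}

def visible_tools(task_class: str) -> set[str]:
    """Restrict the tool set exposed to the agent before the model call."""
    return TASK_TOOL_MAP.get(task_class, set())

def record_tool_pick(task_class: str, picked: str, wrong_counts: dict) -> None:
    """Log wrong-tool picks as a first-class metric, per task class."""
    if picked not in TASK_TOOL_MAP.get(task_class, set()):
        wrong_counts[task_class] = wrong_counts.get(task_class, 0) + 1
        log.info("wrong-tool pick: task=%s tool=%s", task_class, picked)

wrong: dict[str, int] = {}
record_tool_pick("math", "search_docs", wrong)  # calculator expected, search_docs picked
print(wrong)  # {'math': 1}
```

The key point is that the filtering happens *before* tool exposure, so the model never sees the tools it shouldn't pick, and the wrong-pick counter gives you a trend line to watch.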
Curious if you’ve tested a hard cap per task type (e.g., max 3 visible tools) vs dynamic filtering at runtime — which one held up better for consistency?
Loved the way you framed this as decision overload, not capability gaps. The practical test that worked for us: if a tool can’t show a measurable lift in accuracy-per-token, it stays out of the visible set. Have you tested wrong-first-tool rate as an early warning metric?
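A toy version of that accuracy-per-token gate, with made-up numbers purely for illustration:

```python
def accuracy_per_token(accuracy: float, avg_tokens: float) -> float:
    """Accuracy normalized by average token spend for a task."""
    return accuracy / avg_tokens

def keep_tool(baseline: tuple[float, float], with_tool: tuple[float, float]) -> bool:
    """Keep a tool in the visible set only if it lifts accuracy-per-token."""
    return accuracy_per_token(*with_tool) > accuracy_per_token(*baseline)

# baseline: 70% accuracy at 1000 avg tokens; with tool: 75% at 1500 tokens
print(keep_tool((0.70, 1000.0), (0.75, 1500.0)))  # False: the lift doesn't cover the token cost
```

A 5-point accuracy gain can still fail the gate if the tool inflates token usage enough, which is exactly the "decision overload" cost the post describes.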
Pre-routing works well. Dynamic filtering is generally preferable, but it really depends on the use case.
Less is more, honestly. I don't want these things invading my privacy.
Love your images... Josh, I'm pondering when to jump into more Claw. Claude co-work is now my go-to.