Members-Only
Recent Talks & Demos are for members only
You must be an AI Tinkerers active member to view these talks and demos.
Cache-Optimized Agentic Fanout: Squeezing 65x More Intelligence Per Dollar From Claude
Learn how to achieve 65x cost savings on agentic workloads by optimizing for Anthropic's cache economics, processing large documents with parallel agent invocations.
We got a 65x reduction in per-rule input cost by designing the prompt architecture around Anthropic’s cache economics — and the constraints that forced are more interesting than the savings. Live demo of a production system that evaluates 50+ natural-language compliance rules against 100+ page financial documents. Each rule needs full document context (~32k tokens), so naive execution is a non-starter on cost and rate limits. The fix: append-only message threads where nothing is ever mutated or removed, because any change to the prefix — including the tool schema — invalidates the cache. Tools that should not fire are rejected with a reply prompt, not removed from the definition. A warming request primes the cache before the real fanout begins. pg-boss orchestrates the map-reduce: Phase 1 parses documents in parallel via Reducto, Phase 2 fans out N independent agent invocations (one per rule, configurable concurrency per ECS task), Phase 3 reduces to review status. After the first cache write, subsequent rules pay ~500 new tokens instead of ~32k (!!!). The agent has four tools that persist directly to Postgres with no post-processing step: Think (structured reasoning), Calculate (mathjs-backed arithmetic verification), AnnotationWrite (bounding-box PDF evidence traceable to specific parsed document blocks), and RecommendationWrite (PASS/FAIL with confidence). Full OTel instrumentation with GenAI semantic conventions traces every tool call, token count, and cache hit/miss.