Seven stories.
Teams that ship.
Real engagements with real metrics. Names are abstracted where confidentiality requires, and we're happy to walk through any of these in detail on a call.
The problem. A high-growth B2B SaaS was generating 1,400 inbound leads per month from content, ads, and partner referrals. Five SDRs were spending 60% of their time qualifying leads, half of which were obviously unfit. High-intent leads were sitting in the queue for hours before getting touched.
What we built
A Slack-native triage agent that ingests every new lead in real time. It enriches the contact and account against Apollo and a custom intent feed, scores against an ICP rubric we built with their VP Sales, and either books a meeting directly via Calendly (for clearly qualified leads) or escalates to an SDR with a 3-bullet brief and recommended outreach.
The hard parts
The interesting engineering wasn't the LLM; it was the eval set. We built a labeled corpus of 800 historical leads with the actual sales outcome (won, lost, no-fit, etc.) and used it to tune the scoring rubric until the agent matched human reviewer judgment in 91% of cases. We rebuild and re-run those evals every time the prompt or model changes.
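For the technically curious, here is a minimal sketch of the kind of agreement check an eval set like this supports. It is an illustration, not the production harness: the `LabeledLead` type, the label names, and the sample data are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledLead:
    lead_id: str
    human_label: str  # reviewer judgment, e.g. "qualified" or "no-fit" (hypothetical labels)
    agent_label: str  # what the agent decided for the same lead


def agreement_rate(leads):
    """Share of leads where the agent's label matches the human label."""
    if not leads:
        raise ValueError("eval set is empty")
    matches = sum(1 for lead in leads if lead.human_label == lead.agent_label)
    return matches / len(leads)


def disagreements(leads):
    """Leads worth reading by hand when tuning the rubric."""
    return [lead for lead in leads if lead.human_label != lead.agent_label]


# Tiny made-up sample; a real eval set would be loaded from labeled history.
sample = [
    LabeledLead("a", "qualified", "qualified"),
    LabeledLead("b", "no-fit", "no-fit"),
    LabeledLead("c", "qualified", "no-fit"),
    LabeledLead("d", "no-fit", "no-fit"),
]
print(f"agreement: {agreement_rate(sample):.0%}")
```

Re-running a check like this on every prompt or model change is what turns "it seems better" into a number you can compare release to release.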
What's running now
The agent has triaged 4,200 leads over 90 days. SDRs went from spending 60% of their time on qualification to 16%, and the freed hours were redeployed into pipeline acceleration. Demo conversion on agent-booked meetings is 2.4× higher than the prior baseline. They moved into a $5K/month retainer at the end of the included 30-day support window.
The problem. A growing DTC brand was getting buried in support tickets — order status, returns, sizing, product questions. Their support team was a mix of full-time and seasonal contractors, with high variance in reply quality and slow response times during peak.
What we built
A Zendesk-integrated copilot that runs retrieval-augmented generation (RAG) over their help center, past tickets, and live order data from Shopify. For straightforward tickets (order status, return requests, sizing questions) it drafts and sends replies autonomously. For nuanced or sensitive tickets, it drafts a reply that an agent reviews, edits, and approves in one click.
The hard parts
The product catalog has 1,200 SKUs with frequent updates, and the help center had drift — old policies, contradictory articles, missing edge cases. Before the agent could ship, we ran a knowledge-base audit and surfaced 47 documents that needed updating. The customer ops team rewrote those during weeks 2–3. Without that, the agent would have hallucinated confidently. With it, it doesn't.
What's running now
68% of tickets are fully deflected without a human in the loop. For the remaining 32%, the agent drafts a reply that's accepted (with edits) in 84% of cases. Support headcount has stayed flat through 3× volume growth over two peak seasons. CSAT improved 6 points. They moved into a $6K/month retainer.
The problem. A mid-size firm was spending a large share of associate time on first-pass contract review: NDAs, vendor agreements, master services agreements. The work was repetitive and the markups were highly templated, but partners couldn't approve a fully automated solution without an audit trail.
What we built
A document review agent trained on five years of the firm's own redlines. It reads incoming contracts, identifies clauses that deviate from firm-standard positions, and drafts a redline that an associate edits and finalizes. Every suggestion is logged with the source document and reasoning, so partners can audit any decision retroactively.
The hard parts
Legal sign-off was the entire problem. We spent three weeks of the engagement just building the audit trail and reviewing it with the firm's COO and managing partner. The agent's outputs are intentionally never sent to a client without human review — the win is associate speed, not headcount. That framing made the build defensible.
What's running now
Associates save 4.5 hours per matter on first-pass review. Throughput is up ~35% across the contract-heavy practice. Partner sign-off on the audit-trail workflow held up under a malpractice insurer's review. They moved into a $4K/month retainer focused on extending the agent to new contract types.
The problem. A growing marketplace had a moderation queue that couldn't keep up. New listings sat unreviewed for 18+ hours during peak, hurting conversion and seller experience. Fraud and policy violations were getting through because human moderators were rushing.
What we built
A multimodal agent that reviews each new listing's text and images against the marketplace's policies. Clean listings auto-approve in under five seconds. Borderline cases route to a human moderator with the agent's reasoning attached. Every moderator override feeds back into the eval set, and the agent re-tunes weekly.
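A minimal sketch of the routing step described above, under stated assumptions: the `Verdict` shape, the `route` function, and the 0.95 confidence threshold are hypothetical, chosen only to make the auto-approve versus human-review split concrete.

```python
from dataclasses import dataclass

# Hypothetical threshold for illustration; not a figure from the engagement.
AUTO_APPROVE_MIN_CONFIDENCE = 0.95


@dataclass(frozen=True)
class Verdict:
    listing_id: str
    violations: tuple  # policy categories the agent flagged; empty if none
    confidence: float  # agent's confidence in its own assessment, 0.0 to 1.0
    reasoning: str     # explanation shown to the moderator


def route(verdict):
    """Auto-approve clean, high-confidence listings; send the rest to a human."""
    is_clean = (
        not verdict.violations
        and verdict.confidence >= AUTO_APPROVE_MIN_CONFIDENCE
    )
    if is_clean:
        return {"listing_id": verdict.listing_id, "action": "auto_approve"}
    # Borderline or flagged: a moderator decides, with the reasoning attached.
    return {
        "listing_id": verdict.listing_id,
        "action": "human_review",
        "reasoning": verdict.reasoning,
    }
```

In this sketch the agent never rejects a listing on its own; anything it is not sure about goes to a moderator, whose decision can then be logged as a new eval example.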
The hard parts
Policy itself was a moving target. We worked with the Trust & Safety lead to formalize 23 separate policy categories into a structured spec, capturing the judgment calls that the policy documents alone didn't spell out. The agent's first-week performance was 78% agreement with moderators; by week four it was 94%. The improvement came almost entirely from the eval feedback loop, not from prompt engineering.
What's running now
31K listings reviewed in the first overnight batch test. Now running on every new listing in real time. 94% agreement with moderator decisions on hold-out test sets. 19 hours/day of moderator time saved, redeployed into investigation, policy refinement, and seller education. They moved into an $8K/month retainer.
The problem. A mid-market SaaS had a Salesforce instance that had degraded over five years of growth — duplicate accounts, stale contact data, deals stuck in stages with no recent activity. Forecasting was unreliable. AEs were spending Monday mornings cleaning their own records.
What we built
A scheduled agent that runs nightly across the CRM. It detects and merges duplicate contacts and accounts using firmographic and identity matching. It re-enriches stale records via Clearbit and Apollo. It flags deals with no activity in 21+ days for AE follow-up. It writes a weekly summary to the RevOps lead.
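As one concrete example, the stale-deal check can be sketched in a few lines. This is an illustration under assumptions: the `stale_deals` function and the dict shape of a deal (`id`, `last_activity`) are made up for the example and are not the Salesforce schema.

```python
from datetime import date, timedelta

STALE_AFTER_DAYS = 21  # from the description above: no activity in 21+ days


def stale_deals(deals, today):
    """Return ids of deals whose last activity was 21 or more days ago.

    `deals` is a list of dicts with hypothetical keys "id" and
    "last_activity" (a datetime.date).
    """
    cutoff = today - timedelta(days=STALE_AFTER_DAYS)
    return [deal["id"] for deal in deals if deal["last_activity"] <= cutoff]
```

Passing `today` in explicitly, rather than calling `date.today()` inside the function, keeps a nightly job like this easy to test and to re-run for a past date.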
The hard parts
Duplicate merging is the kind of thing that can do real damage. We built in a quarantine mode where every proposed merge was human-reviewed for the first three weeks before the agent gained autonomous merge authority — and even now, AEs can roll back any merge with one click. The trust came from never doing irreversible things without a clear undo.
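A minimal sketch of the quarantine-plus-undo pattern, assuming an in-memory store. The `MergeManager` class and its field-merging rule are hypothetical; a real implementation would sit on top of the CRM's own API and persistence.

```python
from dataclasses import dataclass, field


@dataclass
class MergeManager:
    records: dict            # record_id -> dict of field values
    quarantine: bool = True  # True: proposed merges wait for human approval
    pending: list = field(default_factory=list)
    history: dict = field(default_factory=dict)  # merge_id -> pre-merge snapshot
    _next_id: int = 1

    def propose(self, keep_id, drop_id):
        """Queue the merge in quarantine mode; otherwise apply it directly."""
        if self.quarantine:
            self.pending.append((keep_id, drop_id))
            return None
        return self.apply(keep_id, drop_id)

    def apply(self, keep_id, drop_id):
        """Merge drop_id into keep_id, snapshotting both records first."""
        merge_id = self._next_id
        self._next_id += 1
        keep, drop = self.records[keep_id], self.records[drop_id]
        self.history[merge_id] = (keep_id, dict(keep), drop_id, dict(drop))
        # Surviving record: its own non-empty values win, gaps come from the duplicate.
        merged = dict(drop)
        merged.update({k: v for k, v in keep.items() if v is not None})
        self.records[keep_id] = merged
        del self.records[drop_id]
        return merge_id

    def rollback(self, merge_id):
        """Restore both records exactly as they were before the merge."""
        keep_id, keep_before, drop_id, drop_before = self.history.pop(merge_id)
        self.records[keep_id] = keep_before
        self.records[drop_id] = drop_before
```

The design choice worth copying is the snapshot: because every merge stores the full before-state of both records, undo is a restore rather than a best-effort reconstruction.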
What's running now
The initial cleanup found 38% of records had stale or missing critical fields and surfaced $1.2M in revisitable pipeline that had gone cold. AE Monday pipeline reviews shortened from 60 minutes to 25. They're now expanding the engagement into a Custom Platform that combines hygiene with intelligent next-best-action recommendations.
The problem. A fast-growing apparel brand had 3,200 SKUs on Shopify, most with supplier-written descriptions — inconsistent tone, missing SEO keywords, no brand voice. Seasonal launches kept adding SKUs faster than the two-person content team could handle. Product pages were their top organic entry point, but conversion was leaking because the copy wasn't doing its job.
What we built
A Shopify-integrated content agent trained on three years of the brand's best-performing copy. It ingests product attributes, images, and category context, then produces a description, SEO title, meta description, and six bullet points per SKU. A brand guidelines module runs a self-review pass before each output is staged for the team to review and publish.
The hard parts
Voice consistency at 3,200 SKUs is where most off-the-shelf tools fail. We spent two weeks building a brand voice eval set — 120 rated examples across tone, specificity, and prohibited language — and tuned the agent against it before running the full catalog. Without that foundation, the output would have been generic. With it, the content director accepted 94% of first drafts with minor edits.
What's running now
All 3,200 SKUs were rewritten in two weeks. New SKUs are now content-ready on day of launch, not two weeks after. Organic product page traffic improved 22% in the following quarter. The rewrite freed up eight months of copywriter time, which was redeployed into campaign and editorial work.
The problem. A B2B SaaS in a competitive vertical had a one-person content team that was the bottleneck for everything: blog, email newsletters, LinkedIn, and sales enablement. The content director was spending 70% of her time writing and 30% on strategy. The pipeline stalled whenever she was unavailable.
What we built
A content pipeline agent integrated into their Notion workflow. The content director submits a brief (topic, audience, goal, key points) and the agent researches the topic, pulls the competitive landscape, and produces a publish-ready blog post, email version, and three LinkedIn variants. A brand voice module — trained on two years of their highest-performing content — runs before delivery.
The hard parts
The brand voice was the product. We spent the first two weeks of the engagement doing a content audit: rating 200 articles against a rubric for specificity, tone, depth, and differentiation. That corpus became the eval set. The agent's first outputs scored 71% on brand alignment; after two weeks of tuning it was at 92%, which the content director called "better than most freelancers I've worked with."
What's running now
The team went from 4 posts per month to 16 with no new headcount. The content director now spends her time on editing, strategy, and new formats. Organic traffic grew 34% in four months. They've since extended the agent to produce sales one-pagers and case study first drafts from call transcripts.
Yours is next.
If any of these patterns rhyme with your business, the Audit is the fastest way to find out whether the same playbook works for you.
Start the conversation →