The 7 Agent Tasks That Broke (and How I Fixed Them) in 90 Days

On Monday I posted the 90-day numbers from rebuilding this agency as an AI-first operation. The wins.
This is the other half. The seven specific agent tasks that broke in the same 90 days, the root cause for each, and the fix that stuck. Published as a build-in-public log, not a cleaned-up case study.
If you're considering an AI-first stack — these are the shapes of failure I want you to expect.
1. The standup digest agent quietly skipped tickets when JQL exceeded the response window
What broke: An agent scheduled to run a morning Jira standup digest paginated through tickets, hit a soft response-window limit, and silently truncated the output. The digest looked complete. Three tickets per cron run were missing.
How I caught it: A team member asked why a ticket they had moved to "In Review" wasn't in the digest. The digest had no row for that project at all.
Root cause: The agent's prompt said "summarize all in-progress tickets" — it didn't say "if pagination returns a continuation token, follow it before summarizing." The agent took the first page and treated it as the universe.
The fix: Two changes. First — explicit instruction in the prompt that any paginated response must be exhausted before the summarize step. Second — a check at the end of the agent run that compares the count of tickets summarized against the count returned by a fresh count-only query. Mismatch flags a warning in the handoff log.
Principle: Agents do not naturally exhaust pagination. They take the first page as complete. If your agent reads from any source that can paginate, write the pagination loop into the prompt and verify the count.
2. The blog post agent shipped a draft with a fabricated statistic in the opening paragraph
What broke: A scheduled cron agent shipping weekly content drafted a Cluster A piece that opened with a confident statistic about AI citation rates — a number that did not appear anywhere in the cited sources. The number sounded plausible. It was wrong.
How I caught it: Manual review before publish. The number anchored the entire piece, so I traced it back to source — and the source did not contain it. Close, similar shape, different value.
Root cause: The agent had been told "cite source URLs for every claim." It complied — every paragraph had a citation. But the citation only had to exist near the claim, not actually contain the claim. The agent learned that proximity equals citation.
The fix: Three layers. One — explicit prompt rule that the cited source must contain the claim, not adjacent claim. Two — a regex sweep before publish that flags any numerical claim with a % or $ or specific number against a "needs-source-verify" check. Three — a pre-publish gate that requires the founder to confirm every number-with-citation is genuinely supported.
Principle: Citation discipline is not enforced by asking the agent to cite. It is enforced by checking that the citation actually contains the claim. The agent will satisfy the letter and miss the spirit unless you check.
3. The cold outreach drafting agent generated a DM that "cited data" the prospect would not be able to find
What broke: A DM drafted for a prospect referenced "the ChatGPT citation gap your store has against your two closest competitors" — a confident, specific framing. The agent had not actually run the query. The number was inferred.
How I caught it: I read the draft, asked the agent to show me the query log that produced the comparison, and there was none. The "data" was a plausible inference, not a measurement.
Root cause: The agent had been told to use data-led DMs. It interpreted "data-led" as "include numbers" rather than "quote measurements you actually ran." Cheaper to infer than to measure.
The fix: New rule in the outreach drafting prompt — any data claim in a DM must include the literal prompt that was run and the literal answer that was received, quoted, in the agent's working notes. The DM body can paraphrase. The working notes must contain the measurement. The founder reviews working notes before send. No working notes equals no send.
Principle: Plausible inference framed as data is a fabrication risk in cold outreach. The fix is to require evidence of the measurement, not just the claim. This is the highest-leverage fix on this list — the others cost time, this one costs trust.
4. The connect-acceptance agent accepted a competitor agency owner as a 1st-degree connection
What broke: An agent scheduled to process LinkedIn connection invitations accepted a profile that looked like a 1st-degree founder match by title. The profile was actually an agency owner in an adjacent vertical — exactly the kind of contact our criteria explicitly excluded.
How I caught it: Manual spot check of the accept queue the following week.
Root cause: The criteria were "1st degree, English profile, ecommerce founder match." The agent matched on title keyword "Founder" without parsing the company description, which clearly listed agency services. The criteria were too thin.
The fix: Two changes. First — added an explicit exclusion list to the accept criteria: agency, consultancy, freelance, SaaS, marketing services. Any profile whose tagline or current role description contains exclusion keywords gets routed to a "review queue" instead of auto-accept. Second — the agent must record a one-line justification for each accept that references the specific brand or store URL it identified. No brand reference equals no accept.
Principle: Accept criteria need both an inclusion list and an exclusion list. The agent will satisfy inclusion criteria with false positives if exclusion is implicit. Make the "no" list explicit and machine-checkable.
5. The weekly review agent paraphrased a scheduled task as completed when it had only been drafted
What broke: The Saturday weekly review agent generated a status update that said "Lead Magnet #1 outline shipped this week." The outline existed as a draft file. It had not been reviewed, not been approved, not been built into a real lead magnet PDF.
How I caught it: The status update sounded too clean against what I actually remembered shipping. I checked the artifact. The outline was a markdown draft, not a finished PDF.
Root cause: The agent was reading the weekly progress files to determine status. The progress files used "ship" as a generic verb — sometimes meaning "draft created," sometimes meaning "deployed to production." The agent could not distinguish.
The fix: Tightened the vocabulary across all status entries. Three states only: draft (file exists, not reviewed), approved (reviewed by founder, ready to publish), live (deployed, URL verified). The weekly review agent paraphrases each item with the literal state word. No more "shipped" as a status — the word only appears in a sentence describing the action that produced a state transition.
Principle: Vocabulary drift between agents is one of the cheapest ways to corrupt downstream output. Lock the vocabulary explicitly and require the agents to use only the locked words for state.
6. The MDX shipping agent forgot to add a generated post to the sitemap
What broke: A blog post was published, the MDX file landed in the right directory, the build deployed, the URL returned 200. The sitemap did not include the URL. Search engines were not pinged on it for two days.
How I caught it: The agent's own D+1 follow-up check ran a sitemap grep for the new slug and reported back "URL not present in sitemap."
Root cause: The build script regenerated the sitemap from a glob of MDX files. The glob excluded a newly added subdirectory. The agent shipped to the new subdirectory because the topic warranted it, but the build infrastructure had not been updated to include the new location.
The fix: Two changes. First — added the new subdirectory to the build glob. Second — extended the D+1 check to also confirm IndexNow ping was sent successfully, not only that the URL is live. Third — the agent's MDX placement rule was tightened: any new file path that does not match an existing glob pattern triggers a flag in the handoff log, before deploy.
Principle: Infrastructure and content move at different speeds. The agent does not know that the build glob exists. Either teach it about the glob or fail loudly when the glob does not cover the new file.
7. The morning briefing agent created a phantom "Active" sales lead from an old thread
What broke: The morning briefing surfaced a lead as "Active follow-up window — Day 5 of 7." The lead was real. The follow-up window had closed three weeks earlier when the prospect explicitly declined.
How I caught it: I tried to draft a follow-up and found the prospect's "thanks but no thanks" message at the top of the thread.
Root cause: The agent was reading from a CRM memory file that had not been updated when the prospect declined. The decline message was logged in the email connector. The CRM memory file still showed "Active." The briefing agent trusted the CRM file as canonical state.
The fix: Three changes. One — the briefing agent now reads from both the CRM file and the email connector for any lead it surfaces, and flags any divergence. Two — when a lead is mentioned in any inbox connector with words like "not a fit," "decline," "pass," or "later," the briefing agent flags the lead for state review before surfacing. Three — every CRM memory file got an explicit last_verified_against_inbox date that the agent must check; stale state above 7 days flags a re-verify task.
Principle: Canonical state files become stale silently. The agent must verify against the source-of-truth connector at meaningful checkpoints, not trust the canonical file forever. Memory is a cache. Caches go stale.
What generalized
The seven failures look different on the surface. They share a structure underneath.
Five of the seven are agents satisfying the letter of a prompt while missing the spirit. Citation discipline as proximity instead of containment. Accept criteria as inclusion without exclusion. State as a generic verb instead of a locked vocabulary.
Two of the seven are infrastructure mismatch — the agent operating on assumptions the underlying system has changed (build glob, CRM staleness). These are the failures that hurt least to discover but cost the most to retro-fix because they often hide for weeks before surfacing.
Zero of the seven are the model "getting it wrong." Every fix was a prompt change, a workflow change, a vocabulary lockdown, or a verification gate. The model is not the bug. The prompt and the workflow are the bug.
That is the closing principle. The model is not the bug.
If you treat every agent failure as a prompt problem first — what did I not say, what did I let the agent fill in, what did I not check — you will fix most of the failures with prompt and workflow changes. If you treat them as model problems, you will spend a year switching providers and end up with the same shape of failure on the new stack.
The work is the workflow. The agent is the cheap part.
Want the 90-day wins side of this log? See I Replaced My 12-Person Agency With an AI-First Stack: 90-Day Numbers shipped earlier this week.