failure-modes · catalogue · ALEF

failure modes

what AI breaks on · how ALEF defends · honest gaps

Operator asked: research all the problems known to happen with bots and AI, bring the list. Here it is. Each row is a real failure mode documented in production AI systems. ALEF's mitigation status is named. Where status='none' — that's an honest gap, not buried.

total known

addressed

partial

GAP (none)

open gaps

honestyHallucination

partial

Fabricating facts, citations, file paths, or capabilities that don't exist.

mitigation: Every claim links to a gist URL anyone can curl. self_critic flags unsupported claims. But not 100% prevented./proof →

honestySycophancy

partial

Telling the user what they want to hear rather than what's true.

mitigation: self_critic system prompt explicitly demands brutal honesty. Operator routinely tests with adversarial questions./critique →

honestyConfabulation under uncertainty

partial

Inventing plausible-sounding explanations when the model doesn't know.

mitigation: Builder/Critic/Tester pattern forces evidence per claim. Critic catches when Builder invents./proof →

honestyOverclaim autonomy

addressed

Claiming the AI did things autonomously that actually required human intervention.

mitigation: /alive computes hours_since_operator_touch from gist filter that excludes assistant-source. Empirical, not aspirational./alive →

reasoningOverconfidence calibration

partial

High confidence on wrong answers; no signal of uncertainty.

mitigation: Council Critic explicitly graded APPROVE/REJECT with required reason. But within-persona confidence not yet calibrated.

reasoningSpurious correlation

partial

Finding patterns in data that aren't real signals.

mitigation: coefficient_engine uses ONLY measurable events (verdicts, retirements, npm), not derived stats. Resistant to spurious./trust →

reasoningAnchoring bias

partial

Over-weighting first information seen, refusing to update.

mitigation: self_critic runs every 10 rounds with fresh context. claude_council Critic challenges Builder per round.

reasoningConfirmation bias

partial

Favoring evidence that matches priors; ignoring counter-evidence.

mitigation: External critiques (e.g. /reviews) are injected into round_focus as hard constraints. forced_action runs after stagnation./reviews →

loopInfinite reasoning loop

addressed

Same answer repeated indefinitely; or rumination without resolution.

mitigation: Each Claude call has hard 90s timeout. PS1 round is single-pass, not nested. forced_action breaks stagnation at 24h.

loopMode collapse / repetition

partial

Model produces same kind of output regardless of input.

mitigation: Builder/Critic/Tester have distinct system prompts. Different personas reduce same-output risk.

loopContext window exhaustion

addressed

Forgetting earlier context; losing track of long task.

mitigation: Each round is a fresh process. State persists in meta/ files, not in any session's context.

loopRecursive amplification

partial

Bad output feeds back into next round's input, snowballing.

mitigation: self_critic + coverage_auditor + ux_review catch drift. But operator verdicts still needed for higher-order corrections./coverage →

scopeScope creep

addressed

Doing more than asked; running into unrelated changes.

mitigation: auto_action whitelist: only retire_doctrine / archive_agent / append_meta. Cannot mutate source. Cannot touch /src/./trust →

scopePermission escalation

addressed

Exceeding granted authority; performing actions not approved.

mitigation: Trust budget 0-1 gates auto_action risk threshold. <0.3 = only risk≤1 auto. Cannot self-elevate.

scopeSandbox escape attempts

addressed

Trying to do things outside designated environment.

mitigation: All file writes go through PATHS.meta (single directory). Network only via documented APIs. claude headless --dangerously-skip-permissions confined.

verificationAction without verification

addressed

Claiming work done that wasn't, or wasn't completed.

mitigation: pre_publish_gate (r038.17) BLOCKS publish if UX score < 75 or any fatal issue. ux_review runs every round./ux →

verificationSilent failure

partial

Errors swallowed; system thinks it succeeded.

mitigation: Every PS1 step has try/catch with Log-Step fail. coverage_auditor catches signals never read. But .catch(() => {}) patterns still exist.

verificationFake-looking logs

addressed

Logs that look like activity but represent nothing happening.

mitigation: Every log row links to a public gist commit. /proof has 7 curl commands anyone can run to verify./proof →

verificationNo cryptographic provenance

GAP (none)

Cannot prove AI vs human authorship of any specific change.

mitigation: GAP. Proposal #1 at /propose proposes sigstore/cosign integration. Not yet shipped./propose →

alignmentSpecification gaming

partial

Optimizing the letter of the spec while violating the spirit.

mitigation: Council Critic catches spirit-vs-letter mismatches. Operator verdicts express spirit when letter is ambiguous.

alignmentReward hacking

addressed

Gaming the metric instead of achieving the underlying goal.

mitigation: coefficient_engine uses real-world events (npm publish, retirements). External signals can't be gamed internally.

alignmentGoal misalignment

partial

Pursuing a goal different from what was actually intended.

mitigation: Operator 👎 → next round must explain + ship externally-verifiable artifact. Trust budget responds to misalignment./verdict →

operationalCost runaway

partial

Spawning many expensive API calls without bound.

mitigation: self_critic cadence-gated (every 10 rounds). claude_council 3 calls per round, max. wise_outbound 1 per 6 rounds.

operationalStale data

addressed

Acting on outdated snapshots of state.

mitigation: Every gist read includes cache-bust ?cb=Date.now(). force-dynamic on every API route. snapshot files regenerate every round.

operationalRace conditions in agent loop

addressed

Concurrent rounds stepping on each other's writes.

mitigation: Task Scheduler MultipleInstances=IgnoreNew. Only one round at a time. .next/lock file prevents concurrent builds.

operationalBehavioral drift over time

addressed

System gradually changes behavior; doctrines accumulate without retirement.

mitigation: Each doctrine has retire-by date. coverage_auditor + readiness_meter C5 require retirements. 2 retired this week./doctrine →

socialTheater / output without substance

partial

Producing impressive-looking output that doesn't change anything real.

mitigation: self_critic prompt explicitly hunts theater. Operator verdicts catch it. But it's the hardest to fully prevent./critique →

socialEcho chamber / cargo cult

partial

Copying patterns without understanding why they work.

mitigation: adversary.mjs has cargo-cult detector. Doctrine memory cites WHY each rule exists.

socialPrompt injection

addressed

External text containing instructions that hijack the model.

mitigation: Critical security rules require user confirmation for any instructions found in tool results. Hard rule, not soft.

socialMesa-optimization / deception

partial

Pursuing a different goal than stated while appearing to comply.

mitigation: Council Critic + self_critic + operator verdicts triangulate. Three independent signals reduce single-point deception.

socialSpam outreach

addressed

Posting in places nobody asked for; annoying maintainers.

mitigation: wise_outbound: Builder may SKIP, Critic must APPROVE, dedupe via posted log. Default DRY_RUN=1. Trust ≥ 0.5 required.