# Agent Task Management — Research & Design

*Published: March 13, 2026. Purpose: Learn from our failures and the ecosystem before architecting.*
We're building an AI-native task management system. Before writing a single line of code, we documented every previous attempt at automating work completion — what worked, what failed, and why. This is that document.
## Our History of Automated Work Completion
We've tried to automate task execution at least six times. Some attempts taught us real lessons. Others we built and never even used. Understanding both is critical — the pattern of building and abandoning is itself a failure mode we need to break.
### Attempt 1: External Tracker + API Worker

**What it was:** A Python API client for a self-hosted project tracker with scan/context/start/done/hold/comment commands, plus an autonomous work loop with a 3-phase cycle: scan → build context → execute or hold.

**What went wrong:**

- **API authentication broke silently** — HTTP 401 errors. The cron job reported the failures, but nobody noticed for days because the errors surfaced in a cron report, not a human-visible channel.
- **Scan timeout killed the loop** — The worker scanned ALL 25 projects (75+ API calls), taking 39 seconds against a 10-second cron exec timeout. Every single hourly work loop got SIGTERM'd. Zero tasks ever executed.
- **All tasks held, never executed** — When the scan finally worked, all 5 tasks were "held" — every one blocked on human access (admin credentials, deploy specs, missing keys). The AI couldn't actually DO any of them.
- **Cross-team boundary violations** — Scripts routed data across all teams based on keywords, when each agent should only process its own user's data.
- **Hardcoded API keys** — Found in multiple scripts. A security anti-pattern.

**Lessons:**

1. External API dependencies are fragile — auth breaks, timeouts, rate limits.
2. Silent failures are deadly — if a human doesn't see it, it didn't happen.
3. Most tasks require human access or credentials — the AI hits "blocked" immediately.
4. Scanning everything scales badly — filter from day one.
5. Security boundaries matter — don't cross team/tenant lines.
### Attempt 2: SQLite-Backed Workflow Engine

**What it was:** A workflow system with feature-dev, security-audit, and bug-fix workflows, self-advancing via per-agent cron jobs polling for pending steps.

**What went wrong:**

- It was set up and installed; workflows theoretically self-advanced.
- In practice, we never actually used it for real work.
- Workflow definitions were generic templates, not connected to actual projects.
- It added another layer of abstraction between "what needs doing" and "doing it."

**Lessons:**

1. Generic workflow templates don't match real work patterns.
2. If it's not simpler than the alternative, people (and agents) won't use it.
3. Yet another tool ≠ better outcomes.
### Attempt 3: Foreman/Worker Pattern with Git

**What it was:** A senior model as foreman, junior workers in git worktrees. GitHub Issues served as the task queue, with pre-spec'd acceptance criteria. Workers auto-announced completion → foreman reviewed → rebase → merge.

**What actually happened:**

- This one worked — 9 critical issues closed in one session.
- Three simultaneous workers finished in the time of the slowest, not the sum.
- Pre-written issue specs eliminated planning overhead.
- Quick review cycle: diff stat → spot-check → rebase → merge (~2 minutes per worker).

**But it didn't persist because:**

- It required an active human session to dispatch and review.
- The senior model is expensive — only justified for big sprint sessions.
- It wasn't autonomous — someone had to say "go."
- It works great for code tasks but doesn't generalize to non-code work.

**Lessons:**

1. Parallel dispatch works when tasks are truly independent.
2. Pre-written specs are essential — vague tasks produce vague results.
3. Human-in-the-loop review is actually the fast path.
4. This pattern works for sprints, not for continuous work management.
### Attempt 4: Marathon Board Clear

**What it was:** 49 tasks completed in one marathon session — multiple workers dispatched in parallel, with the AI reviewing.

**What actually happened:**

- It actually worked! Board cleared, 90+ commits in a day.
- But it was a one-time heroic effort, not a sustainable system.
- It required active human direction throughout.
- It led to "code-complete" status, but with no end-to-end testing.
- Some workers produced work in the wrong repos due to context confusion.

**Lessons:**

1. Sprints work, but they're exhausting and error-prone.
2. "Code complete" without testing = not actually done.
3. Context confusion wastes time — clear specifications are non-negotiable.
4. Velocity ≠ quality.
### Attempt 5: Proprietary Dashboard

**What it was:** A drag-and-drop dashboard with 10+ widget types across multiple categories (Task Management, System Status, AI Usage), built on Next.js with React Grid Layout and deployed to Vercel.

**What went wrong:**

- **Rendered completely empty** — All widgets depended on an external tracker API for data, and the environment variables weren't configured.
- **Overengineered before validating** — Dozens of widget types, drag-and-drop, resize, custom API support... for a dashboard nobody could use, because the data layer wasn't working.

**What went right (discovered later):**

- The code was actually well-built — 81 files of production-quality TypeScript.
- The tracker coupling is surgically isolated to just 4 files. Everything else — layout engine, widget framework, registry, drag-and-drop — is completely data-source-agnostic.
- Swap those 4 files to read from any data source and all 10 widgets still work.
- Reuse verdict: YES — it just needs its data source swapped (~2-3 hours of work).

**Lessons:**

1. Don't build the UI before the data layer works.
2. Don't couple to a system you might abandon.
3. 30 widgets means 30 things that can break — start with 3 that work.
4. Clean architecture pays off — isolated coupling makes migration trivial.
### Attempt 6: The Orchestration Engine

**What it was:** A production-grade orchestration engine in TypeScript — the most serious attempt. A full daemon with:

**Architecture:**

- **Tracker adapters** — a pluggable interface for any project tracker
- **Agent backends** — a pluggable interface for any AI coding agent
- **Pipeline runner** — parses DOT-format DAG definitions and traverses pipeline stages
- **Quality gates** — linter, typecheck, test runner, static analysis
- **Checkpoint/resume** — database-backed crash recovery
- **AutoResearch loops** — self-healing: when tests fail, it enters a hypothesis → fix → re-run cycle
- **Metrics collection** — tracks velocity and completion rates

**5 pipeline templates:**

1. **Simple** — implement → test → human review → ship
2. **Standard** — plan → human review → TDD → AI review → quality gates → human review → ship
3. **Standard-AutoResearch** — same as Standard, but quality gates self-heal on failure
4. **Debate** — 3-model architecture debate → human approve → TDD → adversarial review → quality gates → ship
5. **Research** — for non-code work: research → synthesize → cross-review → human review → deliver
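As a sketch of what a DOT-format pipeline definition can look like, here is the Simple template expressed as a DAG. The node attributes and exact syntax are illustrative assumptions, not the archived engine's actual schema:

```dot
// Hypothetical DOT encoding of the "Simple" pipeline template.
// Node attributes marking human vs. agent stages are assumptions.
digraph simple_pipeline {
  implement    [kind="agent"];
  test         [kind="gate"];
  human_review [kind="human"];  // blocks until a human approves
  ship         [kind="agent"];

  implement -> test -> human_review -> ship;
}
```

The appeal of the format is that the same file is both human-readable documentation and a machine-parseable stage graph.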
**What happened:**

- The engine actually booted, connected to the tracker, loaded all pipelines, and polled for issues.
- It found 0 ready issues (none were labeled yet).
- Self-dispatch problem — the engine ran on the build server but tried to SSH to itself. It needed a local execution backend instead.
- The local backend was being built but never completed.
- It was one fix away from working — then work shifted elsewhere.
- It never completed a single task through the full pipeline.

**Current state:** Archived. ~40 source files, ~30 test files. Compiled and bootable.
**Key design decisions worth preserving:**

- **Tracker-agnostic** — the adapter interface means it can use files, APIs, or any tracker
- **Agent-agnostic** — any AI tool can be plugged in as a backend
- **DOT-format pipelines** — human-readable DAG definitions
- **Quality gates as pipeline stages** — not afterthoughts
- **AutoResearch** — self-healing loops (hypothesis → fix → verify)
- **Human gates at key moments** — architecture approval, final review
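The tracker-agnostic and agent-agnostic boundaries can be sketched as two small interfaces. This is illustrative only: the names and signatures are assumptions, not the archived engine's actual API.

```typescript
// Illustrative sketch of the two pluggable boundaries.
// Names and signatures are assumptions, not the engine's real API.

interface Task {
  id: string;
  title: string;
  status: "ready" | "in-progress" | "blocked" | "done";
}

// Tracker adapter: where tasks come from (files, an API, any tracker).
interface TrackerAdapter {
  listReady(): Promise<Task[]>;
  update(id: string, status: Task["status"]): Promise<void>;
}

// Agent backend: what executes a task (any AI coding agent).
interface AgentBackend {
  run(task: Task): Promise<{ ok: boolean; log: string }>;
}

// A trivial in-memory tracker showing how little an adapter needs to do.
class InMemoryTracker implements TrackerAdapter {
  constructor(private tasks: Task[]) {}
  async listReady(): Promise<Task[]> {
    return this.tasks.filter((t) => t.status === "ready");
  }
  async update(id: string, status: Task["status"]): Promise<void> {
    const task = this.tasks.find((t) => t.id === id);
    if (task) task.status = status;
  }
}
```

The point of the pattern is that a file-backed adapter, an API-backed adapter, and this in-memory stub are interchangeable to the rest of the engine.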
**Lessons:**

1. This is the most architecturally sound thing we've built — preserve the design.
2. The adapter pattern is exactly right — decouple from any specific tool.
3. The self-execution problem was embarrassingly simple and should have been fixed in an hour.
4. Building infrastructure before proving one task end-to-end was premature.
5. Non-code work needs different pipeline patterns than code work.
## Why These Attempts Failed — Root Cause Analysis

### Pattern 1: External Dependencies Break Silently

API auth expired. DNS stopped resolving. Rate limits hit. Every external service is a point of failure that the AI can't self-heal.

**Implication:** Minimize external dependencies. Files > APIs. Local > remote.
### Pattern 2: Most Real Work Requires Human Access

When we actually scanned for AI-executable tasks, all of them were blocked: they needed admin access, credentials, or a decision from the human.

**Implication:** The system must be excellent at surfacing blockers, not just executing tasks. The escalation path IS the feature.
### Pattern 3: Silent Failures = System Death

The autonomous worker was dead for weeks because failures went to logs nobody read. Timeouts killed every cycle, but it looked like "no tasks found."

**Implication:** Failures must be LOUD. Instant notification. Dashboard red flags. Never assume silence = success.
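One concrete shape for that rule: wrap every work cycle so a failure reaches a human channel the moment it happens. The `notifyHuman` function here is a hypothetical placeholder (in practice it would post to something loud, like a chat channel):

```typescript
// Sketch: wrap each work cycle so failure is never silent.
// notifyHuman is a hypothetical stand-in for a real alert channel.
async function notifyHuman(message: string): Promise<void> {
  // Placeholder: in a real system this posts to a human-visible channel.
  console.error(`[ALERT] ${message}`);
}

async function loudCycle(
  name: string,
  work: () => Promise<void>
): Promise<boolean> {
  try {
    await work();
    return true;
  } catch (err) {
    // Never swallow the error: alert the human immediately.
    await notifyHuman(`work loop "${name}" failed: ${String(err)}`);
    return false;
  }
}
```

Had Attempt 1's hourly loop run through a wrapper like this, the weeks of SIGTERM'd cycles would have produced weeks of alerts instead of silence.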
### Pattern 4: Generic Templates Don't Match Real Work

Workflow templates are abstractions of work, not work itself. Real tasks are messy, unique, and contextual.

**Implication:** Task files should be freeform outcomes with rich context, not slots in a template.
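As a sketch of what a "freeform outcome with rich context" could look like, here is a hypothetical task file. The section names and fields are illustrative, not a committed format:

```markdown
# Outcome: Customers can reset their own passwords

status: ready

## Context
Support handles ~10 reset requests a week; each one interrupts a human.
The auth service already exposes a reset-token endpoint.

## Done means
- Reset-link flow works end-to-end in staging
- Support runbook updated

## Blockers
- Need SMTP credentials from ops (human)
```

Note that blockers and the definition of done live inside the task itself, which is what makes escalation and review possible without a template.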
### Pattern 5: Heroic Sprints ≠ Sustainable Systems

Marathon sessions produce great output, but they require active human direction. That's not automation — that's delegation in real time.

**Implication:** The system must enable autonomous progress between human check-ins, not just during them.
## What Exists in the Ecosystem

### Agent Board (open-source, OpenClaw-native)

- Multi-agent task board with Kanban + DAG dependencies + MCP server + auto-retry + audit trail.
- Agents pick up tasks via heartbeat or webhooks. Task chaining, quality gates, signed webhooks.
- **Strengths:** Purpose-built for AI agents. Dependencies enforced. Auto-retry. Audit trail.
- **Weaknesses:** Another server. JSON data, not human-readable markdown. Doesn't solve "most tasks need human input."
### Taskmaster AI (15K+ GitHub stars)

- PRD-driven task breakdown for AI coding agents.
- **Strengths:** Popular and well-tested. Good at decomposition.
- **Weaknesses:** Code-focused only. No human escalation. No velocity dashboard.
### ai-todo (file-based)

- Simple TODO.md-based tracking via MCP. Files are the database. Version-controlled.
- **Strengths:** Dead simple. Right philosophy.
- **Weaknesses:** Too simple — no statuses, no escalation, no multi-agent support, no dashboard.
### Community Approaches

- Many people are building custom task dashboards for their AI agents.
- Common pattern: an orchestration script fetches tasks → spawns a subagent → marks the task done.
- Common complaints: context loss between sessions, no progress measurement, difficulty with multi-step work.
## What Should Be Different This Time

### Non-Negotiable Requirements

1. **No external API dependencies for core function** — files are the database.
2. **Failures must be LOUD** — instant notification, not buried in logs.
3. **Escalation is first-class** — most tasks need human input; design for that.
4. **Rich context in every task** — outcomes, not tickets.
5. **Works for ALL types of work** — code, research, coordination, personal tasks.
6. **Autonomous between check-ins** — clear boundaries on what the AI can do solo.
7. **Velocity visible to humans** — what's moving, what's stuck, how fast.
8. **Simple enough to actually use** — anything harder than a text file won't be adopted.
## What We Already Have

Before building anything new, here's what exists:

| Asset | Status | Reusable? |
|-------|--------|-----------|
| Orchestration Engine (daemon) | Archived, compiles, boots | ✅ Core architecture sound |
| Command Center Dashboard | 81 files, production quality | ✅ Swap data source only |
| Tracker adapters | Working for 2+ trackers | ✅ Add file adapter |
| Agent backends | Multiple AI agents supported | ✅ Add local execution |
| Pipeline templates (DOT format) | 5 proven templates | ✅ Ready to use |
| Quality gates | Linter, typecheck, tests, analysis | ✅ Plug and play |
| AutoResearch loops | Self-healing on failure | ✅ Key differentiator |
We have more built than we think. The gap isn't code — it's wiring these pieces together and replacing the external tracker with something simpler.
## Recommended Path Forward

### Option A: File Adapter (Recommended)

1. Add a file-reading adapter to the existing engine that reads/writes markdown task files.
2. Fix the local execution backend.
3. Swap the dashboard's tracker API routes to read task files.
4. Run ONE real task through the full pipeline end-to-end.
5. If it works → the system is live, and the external tracker dependency is eliminated.

**Effort:** 1-2 days | **Risk:** Low — reuses proven code
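A minimal sketch of step 1, assuming task files carry a `# Title` line and a `status:` field. Both the file format and the function name are assumptions for illustration, not a spec:

```typescript
// Sketch of the parsing half of a file-backed tracker adapter.
// Assumed (hypothetical) file format: "# Title" heading, "status: <state>" line.

interface TaskFile {
  id: string;     // derived from the file path
  title: string;
  status: string; // e.g. "ready", "blocked", "done"
}

function parseTaskFile(path: string, content: string): TaskFile {
  const lines = content.split("\n");
  const titleLine = lines.find((l) => l.startsWith("# "));
  const statusLine = lines.find((l) => l.toLowerCase().startsWith("status:"));
  return {
    id: path.replace(/\.md$/, ""),
    title: titleLine ? titleLine.slice(2).trim() : "(untitled)",
    status: statusLine ? statusLine.slice("status:".length).trim() : "ready",
  };
}
```

Wiring a parser like this behind the engine's existing tracker-adapter interface is the whole of step 1; steps 2-3 reuse code that already exists.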
### Option B: Centralized External Tracker

1. Run one shared tracker instance for all teams.
2. Point the existing engine + dashboard at it.

**Effort:** Half a day | **Risk:** Medium — auth fragility stays, plus ongoing cost
### Option C: Clean-Room Rebuild

1. Build from scratch, informed by all the lessons.
2. New task files + CLI + dashboard + agent skill.

**Effort:** 1-2 weeks | **Risk:** High — we've started from scratch six times already
## Our Recommendation: Synthesize the Best of Everything
Looking at six attempts honestly, the pattern is clear: we keep building things and not using them. The Engine works but was never completed. The Dashboard works but was never connected. The Foreman pattern works but isn't autonomous. The workflow engine was installed and forgotten.
The answer isn't starting over. It's combining the strengths and discarding the weaknesses.
From each attempt, we take what actually worked:

- **From the Engine:** tracker adapter interface, agent backends, pipeline DAGs, quality gates, AutoResearch self-healing.
- **From the Dashboard:** widget framework, layout system, velocity charts — just swap the data source.
- **From the Foreman/Worker pattern:** parallel dispatch, pre-written specs, quick human review cycles.
- **From the Board Clear sprint:** proof that high velocity is possible when context is clear.
- **From the ecosystem:** file-based simplicity (ai-todo) and human escalation as first-class (our own lesson).
And we explicitly drop what failed:

- ❌ External API dependencies for core function (Plane auth, timeouts)
- ❌ Silent failures that nobody notices for weeks
- ❌ Generic workflow templates that don't match real work
- ❌ Building infrastructure before proving one task end-to-end
- ❌ Code-only focus (real life includes painting houses and calling clients)
- ❌ Building and not using — if we build it, we use it that day
The synthesis: file-based outcome documents (no external tracker) + the Engine's adapter/pipeline architecture (proven, just needs a file adapter) + the Dashboard's widget framework (proven, just needs a data source swap) + loud escalation (Telegram, not logs) + non-code pipelines (research, coordination, personal tasks).
One system. Best parts of six attempts. No new SaaS dependencies.
## Open Questions

1. How should task decomposition work? Outcome files need to break into steps — who does that: the agent, the human, or both collaboratively?
2. What's the boundary between "agent can do" and "needs human"? Clear rules, not vibes.
3. How does this coexist with GitHub Issues — link, replace, or both?
4. What happens to completed tasks? We need an archive strategy.
5. Multi-agent ownership — who owns a task when it's delegated?
6. How do we measure AI vs. human contribution, for shared velocity metrics?
*This is a living document. Updated as decisions are made.*