Authored By: Sumeet Vij, VP and Head of Insights Engineering and AI
Anthropic recently published their harness design for long-running agentic systems, and it does a good job of codifying principles around a set of patterns that many engineering teams have been converging on independently.
For many, it provided a helpful roadmap. Anthropic’s approach defines principles that anyone building agentic systems should internalize: separate the judge from the builder, define success before writing code, communicate through files rather than shared context, and calibrate your evaluator relentlessly.
For us at Vantor, it was useful in a different way.
We had been building exactly this kind of system—independently and iteratively—for many months before the paper dropped. That system is Vantor’s Agentic Software Development Life Cycle (SDLC).
Our SDLC takes a Jira ticket from initial requirements through to a reviewed, production-ready branch. When we compared it to Anthropic’s approach, we found that it extends their principles in several meaningful ways, accelerating the engineering cycle for production-ready software delivery.
We’re sharing our approach here in the spirit of transparency—what’s working, where it’s breaking, and where we’re still figuring things out—because this space is evolving quickly and the most useful insights tend to come from systems that are actually running in production.
From prompt chains to systems
A lot of early agent workflows are essentially structured prompt chains. You define roles, add a review step, iterate a few times, and try to steer the system toward something acceptable.
That model works reasonably well for bounded tasks. It starts to break once you extend it across something like software development, where the work is multi-phase, stateful, and difficult to evaluate in a single pass.
The issues tend to show up in predictable ways. You don’t have a precise definition of “correct” before generation begins, so evaluation becomes subjective. Iterations lose context, so decisions made in one step don’t reliably carry forward. When the system produces conflicting outputs, there isn’t a clear mechanism for resolving them. And when a human steps in, they’re often reconstructing intent from partial information.
At that point, you’re not really working with a loop anymore. You’re dealing with a system that needs structure around it.
Anthropic’s harness design moves in that direction. But building a harness that works in a research demo and running one in production across a real engineering team are different problems. The gap between them is where Vantor’s agentic SDLC lives.
The Agentic SDLC we built at Vantor
At a high level, our Agentic SDLC still centers on a generate/evaluate loop. That part is familiar.
Where it differs is in how early we introduce structure, and how strictly we enforce it.
Work doesn’t begin with generation. It begins with definition. Before any code is written, the system requires:
- A set of requirements that can be tested
- A Test Driven Development (TDD) scaffold guided by a Technical Design that expresses expected behavior
- Phase-level specifications that limit what the generator is allowed to do
That step is easy to skip, but in practice it removes a large amount of ambiguity later in the process.
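To make the definition-first gate concrete, here is a minimal sketch in Python of what a phase-level contract could look like. The `PhaseSpec` name and its fields are illustrative assumptions, not Vantor’s actual schema; the point is only that generation is blocked until every definitional artifact exists.

```python
from dataclasses import dataclass, field

@dataclass
class PhaseSpec:
    """Pre-generation contract for one phase (illustrative schema, not Vantor's)."""
    phase_id: str
    testable_requirements: list = field(default_factory=list)  # e.g. "REQ-001: ..."
    tdd_scaffold: list = field(default_factory=list)           # failing tests, written first
    allowed_scope: list = field(default_factory=list)          # what the generator may touch

    def ready_for_generation(self) -> bool:
        # No code is written until all three definitional artifacts are present.
        return all([self.testable_requirements, self.tdd_scaffold, self.allowed_scope])

spec = PhaseSpec(
    "PHASE-1",
    testable_requirements=["REQ-001: reject expired tokens"],
    tdd_scaffold=["test_rejects_expired_token"],
    allowed_scope=["src/auth/"],
)
```

An empty spec would fail the gate: `PhaseSpec("PHASE-2").ready_for_generation()` returns `False`, which is the whole point of requiring definition before generation.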
Once generation begins, the system moves through iterations in a controlled way. Each phase has its own scope, its own acceptance criteria, and its own evaluation cycle. That sounds conventional, but it turns out to be important when you’re trying to keep the system from drifting as it progresses.
Where structure actually matters
The differences become more visible once you look at how the system behaves under pressure: when outputs aren’t obviously correct, or when different parts of the system disagree.
Separation of generation and evaluation
One of the first decisions we made was to separate generation and evaluation across two different models. There’s a version of this pattern where a single model is prompted to play both roles, but in practice it tends to converge on its own reasoning. Even with careful prompting, it’s difficult to get truly independent evaluation from the same system that produced the output.
Vantor uses genuinely different AI systems. Our Generator (Augment Code) writes the code. Our Evaluator (Codex) reviews it. Different training, different weights, no shared session. They communicate exclusively through markdown files. This eliminates self-evaluation bias at the infrastructure level, not through prompt engineering. That discipline matters even more for platform-scale systems like Tensorglobe.
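As a rough illustration of that file-only channel, the sketch below stands in for the two models with plain functions. The real system invokes Augment Code and Codex; the directory layout, file names, and finding text here are invented.

```python
import tempfile
from pathlib import Path

# A shared directory is the only channel between the two models: no shared
# session, no shared context window, just markdown files on disk.
WORKDIR = Path(tempfile.mkdtemp())

def generator_submit(phase: str, summary: str) -> Path:
    """Stand-in for the Generator: writes its output as a markdown file."""
    out = WORKDIR / f"{phase}-submission.md"
    out.write_text(f"# Submission: {phase}\n\n{summary}\n")
    return out

def evaluator_review(submission: Path) -> Path:
    """Stand-in for the Evaluator: reads only the file, writes findings beside it."""
    text = submission.read_text()
    findings = "- [HIGH] unresolved TODO in diff" if "TODO" in text else "- no findings"
    review = submission.with_name(submission.name.replace("submission", "review"))
    review.write_text(f"# Review of {submission.name}\n\n{findings}\n")
    return review

review = evaluator_review(generator_submit("PHASE-1", "Implemented token expiry checks"))
```

Because the evaluator sees only the file, nothing of the generator’s internal reasoning leaks across; the bias removal is structural, not prompt-based.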
Full SDLC pipeline before any code
We don’t depend on a brief prompt to define a sprint contract; a single prompt is just one planning step in what is really a multi-stage problem. Vantor’s planning pipeline runs through structured requirements analysis (eight dimensions, gap IDs for every finding), a Technical Design Document, and sequenced phase decomposition with test-first task ordering. All human-approved before any agent touches code.
Each phase becomes a Jira Story with sub-tasks carrying explicit acceptance criteria and test definitions. We invest human time in planning and reclaim it through automation in implementation and review.
Treating evaluation as a system
Another thing that becomes apparent quickly is that evaluation can’t just be a pass/fail step in an AI software engineering workflow. If you’re running multiple iterations, you need to understand not just what failed, but why it failed and how that decision was made. Otherwise, the system loses coherence as it progresses.
We handle that by tying every finding back to a specific requirement or test, assigning a severity level, and recording the outcome in a way that can be revisited later. It adds some overhead, but without it, you end up re-litigating the same issues in slightly different forms across phases.
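A minimal sketch of such a finding record follows; the field names and status labels are our invention, not Vantor’s schema. The key properties are traceability (every finding carries a requirement ID) and an append-only outcome log that can be revisited in later phases.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Finding:
    """One evaluator finding, traceable and revisitable (illustrative schema)."""
    finding_id: str
    requirement_id: str       # every finding ties back to a requirement or test
    severity: str             # CRITICAL / HIGH / MEDIUM / LOW
    description: str
    resolution: str = "OPEN"  # later: FIXED / JUSTIFIED / ESCALATED

log: list[Finding] = []
f = Finding("F-001", "REQ-001", "HIGH", "expired tokens accepted")
# The outcome is recorded as a new immutable entry, not overwritten in place,
# so the decision trail survives across phases.
log.append(replace(f, resolution="FIXED"))
```

Freezing the dataclass and logging a copy rather than mutating is one way to get the "can be revisited later" property the text describes.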
The justification protocol
Real engineering involves legitimate trade-offs—a system that forces a “fix” for every finding will thrash between contradictory constraints. Vantor’s Generator can push back with a written justification. Codex evaluates it against severity-based rules: CRITICAL/HIGH security findings always require a fix; lower severities accept spec-based or architectural rationale. The full reasoning trail is preserved across all cycles, providing a structured negotiation protocol.
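The severity rules just described can be expressed as a small ruling function. This is a sketch of the stated policy only; the return labels and the `category` parameter are invented for illustration.

```python
def rule_on_justification(severity: str, category: str, justification: str = "") -> str:
    """Apply the severity-based negotiation rules (illustrative, not Vantor's code)."""
    if category == "security" and severity in {"CRITICAL", "HIGH"}:
        # Security findings at these severities are never negotiable.
        return "FIX_REQUIRED"
    if justification:
        # Spec-based or architectural rationale can stand for lower severities.
        return "JUSTIFICATION_ACCEPTED"
    # No justification offered: the finding must be fixed.
    return "FIX_REQUIRED"
```

For example, `rule_on_justification("HIGH", "security", "spec allows it")` still returns `"FIX_REQUIRED"`, while a `"MEDIUM"` style finding with an architectural rationale is accepted.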
Explicit escalation and human signal
There are also cases where the system can’t resolve the disagreement on its own. When that happens, the task is moved into a BLOCKED state, and the full history of the interaction—findings, responses, and decisions—is preserved. A human can then step in with the context intact, rather than trying to reconstruct it after the fact. This is less about exception handling and more about acknowledging that some level of human intervention is inevitable and designing for it up front.
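One way to picture the escalation rule is as a three-outcome decision per cycle. The cycle cap below is invented (the post doesn’t state an actual threshold); what matters is that the history travels with the BLOCKED task.

```python
MAX_CYCLES = 3  # invented cap; the real threshold, if any, isn't stated

def resolve_or_escalate(history: list[str], agreed: bool) -> str:
    """If the agents still disagree after the cap, block the task with history intact."""
    if agreed:
        return "RESOLVED"
    if len(history) >= MAX_CYCLES:
        # The task enters BLOCKED; `history` (findings, responses, decisions)
        # is preserved so a human steps in with full context.
        return "BLOCKED"
    return "CONTINUE"  # run another generate/evaluate cycle
```

Designing the BLOCKED state in from the start, rather than as exception handling, is what lets the human intervene without reconstructing intent.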
Verifying completion
A similar pattern shows up at the end of the workflow. It’s relatively easy for a system to produce something that looks correct, especially if the evaluation criteria are loosely defined. Over time, that leads to a drift between what the system produces and what the task requires.
We address that by enforcing acceptance criteria at both the sub-task and phase level. A task is only considered complete once those criteria have been explicitly verified. It’s a simple rule, but it prevents several subtle failure modes. This is our direct answer to the “context anxiety” problem for agents: the tendency of models to rush completion as context fills. Our system structurally cannot skip verification to finish faster.
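The completion gate reduces to a strict conjunction. A sketch, assuming criteria are tracked as named booleans (an assumption of ours, not Vantor’s representation):

```python
def phase_complete(subtask_criteria: dict, phase_criteria: dict) -> bool:
    """A phase is done only when every criterion at both levels is verified.

    There is no partial-credit path, so a model running low on context
    cannot declare victory early to finish faster.
    """
    if not subtask_criteria or not phase_criteria:
        return False  # no criteria defined means nothing was verified
    return all(subtask_criteria.values()) and all(phase_criteria.values())
```

Note the empty-dict guard: a task with no defined criteria is treated as unverified, not trivially complete, which closes the "looks correct" loophole.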
Separating communication layers
There’s a practical consideration around how information is surfaced. Agent-to-agent communication tends to be verbose and highly structured. Human-facing communication needs to be more concise and easier to interpret. Mixing the two creates friction for both.
We keep those layers separate. The agents communicate through structured files that preserve their reasoning, while humans interact with the system through Jira, where the state of the work and the key decisions are visible without exposing every intermediate step.
Compound learning across phases and features
Vantor’s system implements two-tier learning that compounds knowledge over the lifetime of a project:
- Tier 1 (feature-level): After each phase, HIGH/MEDIUM patterns and accepted justifications are extracted into a feature-scoped learnings file. The next phase reads it before starting. Phase 2 actively avoids what the AI Evaluator flagged in Phase 1.
- Tier 2 (repository-level): After all phases complete, a distillation step merges feature learnings into a repo-wide file (80-line cap, frequency-weighted). Accepted justifications are permanent and injected into the AI Evaluator’s first cycle, so the evaluator skips pre-approved trade-offs rather than re-disputing them.
- The compound interest effect: Review cycle counts decrease across features. Each new feature starts with richer pitfall awareness than the last, allowing institutional knowledge to be captured in the repository.
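As a rough sketch of the Tier 2 distillation step: the frequency weighting and the 80-line cap come from the description above, but the function shape and treatment of ties are our assumptions.

```python
from collections import Counter

REPO_LEARNINGS_CAP = 80  # line cap on the repo-wide file, per the scheme above

def distill(feature_learnings: list) -> list:
    """Merge feature-scoped learning lines into one repo-wide list.

    Lines are ranked by how many features reported them (most frequent
    first, ties in first-seen order), then truncated to the cap. In the
    real system, accepted justifications would be kept unconditionally.
    """
    counts = Counter(line for feature in feature_learnings for line in feature)
    ranked = [line for line, _ in counts.most_common()]
    return ranked[:REPO_LEARNINGS_CAP]
```

Frequency weighting means a pitfall flagged across many features floats to the top of the repo-wide file, while one-off observations fall away once the cap is hit.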
What the Agentic SDLC enables—and where we’re still working
The most noticeable effect isn’t just an improvement in output quality, but in consistency across a long-running agentic system. You can run multiple phases without losing track of why decisions were made. You can trace a change back to the requirement or finding that drove it. And when a human needs to intervene, they can do so without disrupting the overall flow of the system. That combination is what makes the system usable beyond simple cases.
There are still areas where the system is incomplete. Evaluator calibration is one of them. The severity rules we use today are static, and they don’t yet incorporate feedback from past disputes. Ideally, the evaluator should improve over time based on those outcomes.
Specification review is the second. While specs are reviewed before implementation, they don’t yet go through the same structured evaluation process as code.
These aren’t fundamental limitations, but they’re areas where the system can become more adaptive. What’s becoming clear is that multiple teams are working toward similar solutions.
The initial focus on generation is giving way to a more balanced view that includes evaluation, memory, and control. Systems need to be able to carry context across iterations, handle disagreement in a structured way, and integrate human input without collapsing.
Anthropic’s harness design is one expression of that shift. Our Agentic SDLC is another, shaped by the constraints of running it as part of a real development workflow and a broader AI-ready living globe roadmap.
For related Vantor engineering and platform context, explore the Tensorglobe platform, Vantor product architecture, unified satellite tasking API, the AI-ready living globe, mission-relevant geospatial AI, and Google Earth AI integration posts.