The Missing Layer in Autonomous Development

There’s a moment in every engineering organization’s AI journey that goes something like this: someone hooks up an AI agent to a codebase, it writes some code that looks plausible, and a room full of smart people starts asking uncomfortable questions. Is this even what we asked for? Did the agent have the context it needed, or was it guessing? Who approved the plan, and what did it cost? Can we stop it mid-run if it goes wrong? Is there an audit trail? What happens when it hallucinates a database migration?

The industry is moving fast on AI-powered code generation. Agents that can read a ticket, plan an implementation, write the code, and open a pull request — that’s not science fiction anymore. The capability is real. But capability without a harness around it is a demo. It’s not something you can hand to a VP of Engineering and say “trust this with your production codebase.”

That gap, between what AI agents can do and what organizations will actually let them do, is where I’ve been spending most of my time. And it’s why I started Artificer Digital.

A bit about me

I’m Tim Schiller. I’ve spent my entire eighteen-year career at ThoughtFarmer, a collaboration platform used by organizations around the world. I started there fresh out of university as their first support hire and junior developer, then moved into professional services doing client-facing customization work, briefly ran the support team, and for the past few years I’ve been a senior developer on the core product. I’m also an AWS Certified Developer. Seeing one product from that many angles (the 2 AM incident calls, the client-specific customizations, the slow accretion of technical debt and the even slower work of paying it down) has shaped how I think about what AI agents need in order to actually work in production.

Over the course of the last year, I started noticing that the way I used AI in my development work was evolving from “help me write this function” to something much more interesting: orchestrating multi-step autonomous workflows where AI agents do substantive engineering work under human oversight. Not autocomplete. Actual task execution.

That evolution is what led me to found Artificer Digital. The company has two threads running in parallel.

The first is Artificer Forge, an autonomous software development platform I’m building for engineering teams at mid-to-large enterprises. Forge connects to your issue tracker (GitHub Issues or Jira), generates an implementation plan for human approval, then autonomously writes the code and opens a pull request. The interesting part isn’t the code generation. It’s everything around it: per-task budget tracking, context boundaries between tasks, kill switches, audit trails, and real-time execution streaming. The harness layer.
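
To make that concrete, here’s a rough sketch of the task lifecycle in TypeScript. Everything here is illustrative (the type and function names are mine for this post, not Forge’s actual API); the point is the shape: execution can’t begin until a human has approved the plan.

```ts
// Illustrative sketch of the task lifecycle, not Forge's actual API.
type TaskState =
  | { kind: "planned"; plan: string }                      // plan generated from the ticket
  | { kind: "approved"; plan: string; approver: string }   // human sign-off recorded
  | { kind: "executing"; plan: string; spentUsd: number }  // agent writing code
  | { kind: "done"; pullRequestUrl: string }               // PR opened for review
  | { kind: "halted"; reason: "budget" | "kill-switch" | "error" };

// The approval gate: the only path into execution runs through a human.
function approve(task: TaskState, approver: string): TaskState {
  if (task.kind !== "planned") throw new Error("only a planned task can be approved");
  return { kind: "approved", plan: task.plan, approver };
}
```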

The second is The Artificer’s Grimoire, a weekly newsletter covering developments in autonomous AI agents, context engineering, and orchestration. It’s produced by an AI pipeline I built (crawling sources, evaluating signal, and drafting each issue), with me reviewing the output rather than hand-curating it. It’s harness engineering applied to a different problem: agents doing the work inside a system I designed, with me in the loop at the points that matter. If you’re a practitioner in this space, it’s a way to stay current without drowning in noise.
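
For the curious, the pipeline’s shape is roughly this. The names and the threshold are stand-ins (the real pipeline isn’t public); what matters is that the model’s output lands in front of me, not straight in your inbox.

```ts
// Simplified sketch: crawl -> score -> draft -> human review. All names are hypothetical.
interface Item { url: string; title: string; body: string }

async function buildDraftIssue(
  crawl: () => Promise<Item[]>,
  score: (item: Item) => Promise<number>,    // model-evaluated signal, 0..1
  draft: (items: Item[]) => Promise<string>,
): Promise<string> {
  const items = await crawl();
  const scored = await Promise.all(items.map(async i => ({ i, s: await score(i) })));
  const keep = scored.filter(x => x.s >= 0.7).map(x => x.i); // threshold illustrative
  return draft(keep);  // the draft goes to human review, never straight to publish
}
```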

The harness layer

A year ago, most of the public conversation about AI coding agents was about capability benchmarks. Can the agent pass this coding interview? Solve this competitive programming problem? Build an app from a single prompt? That conversation hasn’t gone away — but alongside it, something much more interesting has been emerging.

A community of researchers, tool builders, and practitioners has started converging around a different set of questions. The ones that actually matter when you’re trying to put this in front of a real engineering organization. How do you steer the agent before it acts? How do you catch problems after it does? How do you manage context so the model doesn’t degrade halfway through a task? How do you know, with evidence, that the system is behaving inside the parameters you specified?

They’ve given the discipline a name: harness engineering. The practice of building control systems around AI coding agents.

Harness engineering is where I’ve been living, professionally and intellectually, for the last year. Honestly, it’s where I’ve become fascinated to the point of obsession. Bockeler’s framework of guides and sensors (feedforward controls that steer the agent before it acts, feedback controls that catch it when it drifts) is one of the clearest formulations I’ve read of what this discipline actually is. Horthy and the folks at HumanLayer have done excellent empirical work on context degradation and Research-Plan-Implement patterns, showing that model performance collapses once context windows fill past roughly forty percent. Anthropic’s writing on harness design for long-running development and OpenAI’s treatment of context as a scarce resource have both shaped how I think about the craft. And Rombaut’s work on scaffold primitives (ReAct, generate-test-repair, plan-execute, multi-attempt retry, tree search) has given the field a shared vocabulary for the control loops inside the harness.
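
To ground one of those primitives: here’s what a bare-bones generate-test-repair loop looks like. `generate` and `runTests` are stand-ins for the model call and the harness’s test runner; the feedback control is the test result flowing back into the next prompt.

```ts
// Minimal generate-test-repair loop. `generate` and `runTests` are stand-ins.
async function generateTestRepair(
  generate: (prompt: string) => Promise<string>,
  runTests: (code: string) => Promise<{ pass: boolean; log: string }>,
  prompt: string,
  maxAttempts = 3,
): Promise<string> {
  let code = await generate(prompt);
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const result = await runTests(code);   // the sensor: feedback after the agent acts
    if (result.pass) return code;
    // feed the failure back in as the guide for the next attempt
    code = await generate(
      `${prompt}\n\nThe previous attempt failed these tests:\n${result.log}\nFix the code.`
    );
  }
  throw new Error("exhausted repair attempts");
}
```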

I’m not claiming to have invented any of this. I’ve been sitting with it, absorbing it, arguing with it in my head, applying it to real systems, and watching my own methodology (Initiative-Driven Development; more on that below) formalize out of the intersection. The community is where my ideas have taken shape and where they’ve gotten sharper.

So when I talk about “the missing layer,” I don’t mean missing from the conversation. The conversation is vibrant. I mean missing from most deployments. And on the deployment side, if you spend any time on Reddit or the developer forums, the same few varieties of skepticism keep surfacing.

The first is the slop complaint. Someone asks an AI agent to do a major refactor across a codebase (no plan, no context, no methodology), pulls the lever like a slot machine, and of course it produces slop. They conclude the tool is a joke.

The second is the vibe-coding sneer, aimed at non-engineers who generate working code with AI and start calling themselves developers. There’s a real Dunning-Kruger dynamic buried in some of those cases, so the critique isn’t baseless. But it’s gotten stretched into a blanket dismissal — “all AI-assisted coding is vibe coding” — and that’s where it goes wrong.

The third is the grunt-work framing. AI is useful for writing unit tests, generating boilerplate, the stuff nobody wants to do anyway. Not wrong, exactly, but it badly undersells what’s possible. The ceiling isn’t set by the model. It’s set by how people are scoping the task.

What all three patterns have in common is that they’re about tool literacy, not tool capability. When you treat an agent like a slot machine, you get slop. When you bolt it onto no process and expect wizardry, you get the thing that earned “vibe coding” its bad name. When you only hand it grunt work, you only get grunt work back. Used properly (with the right methodology upstream and a real harness around the execution), an agent can produce better code than the most senior engineer, faster. And with the harness in place, you’re not babysitting a single agent. You have a team of them, doing exactly what you specified, inside parameters you trust. No slop allowed.

That’s the gap Forge is built to close. And it’s what I find myself coming back to, over and over: not whether the model can write the code, but whether the system around it can be trusted with production.

The governance side of that harness (the part an engineering leader actually has to sign off on) comes down to a handful of things; there’s a code sketch of how they compose after the list:

Cost control. If an agent gets stuck in a loop, how much money does it burn before someone notices? Forge tracks token spend per task and enforces hard budget ceilings. When the budget is hit, the task stops. No surprises on the bill.

Human authority. The agent should never be the final decision-maker. Forge generates implementation plans that humans review and approve before any code is written. The agent proposes; the human decides.

Context boundaries. Agents lose coherence as context windows fill up. Performance degrades, decisions drift, earlier reasoning gets crowded out by later tokens. Forge enforces hard context boundaries between tasks so each one starts fresh, with only the information it needs. This is where harness engineering crosses into context engineering, and it’s a layer most deployments skip entirely.

Observability. When an agent is working on a task, you need to see what it’s doing in real time, not after the fact. Forge streams execution events as they happen so you can watch the agent think, and intervene if it’s heading in the wrong direction.

Audit trails. Every decision the agent makes, every file it touches, every API call, logged and traceable. When something goes wrong (and it will), you need to be able to reconstruct exactly what happened and why.

Kill switches. Sometimes you just need to stop everything, right now. Not “please finish your current step and then stop.” Now.
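
Here’s the promised sketch of how those controls might compose around a single task run. It’s illustrative, not Forge’s implementation, but the mechanics (a hard budget ceiling, an abort signal threaded through every agent call, one event stream feeding both the live view and the audit log) are the real shape of the harness layer.

```ts
// Illustrative harness around one task run; not Forge's actual implementation.
interface AuditEvent { at: Date; kind: string; detail: string }

class TaskHarness {
  private spentUsd = 0;
  private readonly abort = new AbortController();  // the kill switch
  private readonly audit: AuditEvent[] = [];       // the audit trail

  constructor(
    private readonly budgetUsd: number,
    private readonly onEvent: (e: AuditEvent) => void,  // real-time stream
  ) {}

  kill(): void { this.abort.abort(); }  // stop now, not "after this step"

  record(kind: string, detail: string): void {
    const e = { at: new Date(), kind, detail };
    this.audit.push(e);   // every decision and file touch is reconstructable
    this.onEvent(e);      // ...and observable as it happens
  }

  charge(costUsd: number): void {
    this.spentUsd += costUsd;
    if (this.spentUsd > this.budgetUsd) {
      this.record("halt", `budget ceiling $${this.budgetUsd} exceeded`);
      this.abort.abort(); // cost control: hard stop, no surprises on the bill
    }
  }

  // Threaded through every model/tool call so a kill takes effect immediately.
  get signal(): AbortSignal { return this.abort.signal; }
}
```

Context boundaries don’t show up in that sketch because they live a level up: each task gets a fresh harness and a fresh context window, never a continuation of the last one.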

None of this sounds glamorous on a slide. But watching it actually run — a ticket becoming a plan becoming a container becoming a pull request, with every step observable and every boundary enforced — is one of the most exciting systems I’ve built. It’s the difference between a toy and a tool, and it’s most of what I think about.

A brief detour, because this matters to me

I want to step sideways for a moment, because there’s a personal thread running through all of this that’s worth naming.

When I was thirteen or fourteen, my dad had an old motorcycle that had sat broken for a decade. It would start and run, but it wouldn’t shift into gear. One afternoon when he was out, I decided I was going to figure out what was wrong with it.

I had no real idea what I was doing. But I knew roughly where the clutch and gearbox were, and he’d taught me the basics of how to use tools. So I disassembled the clutch assembly, carefully, laying every part out on a clean table in the order I’d removed it. Studied it. And eventually I found it: the clutch push rod, worn down at the end from years of use, no longer long enough to fully disengage the clutch when the lever was pulled.

My dad came home ready to be furious. He saw the table, parts in order, my explanation of what was wrong, and he wasn’t. He was impressed.

I tell that story because it’s the closest I can come to explaining why harness engineering feels like home. The discipline of it — observe the system, take it apart, study how the pieces interact, find the worn part, put it back together better — is how I’ve approached almost everything in my life. Software has always been an extension of that instinct. With a function I can build a gear; with a class, a transmission; with a project, a working digital machine that does exactly what I want it to. Agentic AI is a new level of abstraction, a new kind of power. But the mindset I bring to it is the same one I brought to that motorcycle.

I think that’s also where the most interesting human work goes from here. As models get better at keystrokes, the creativity and ingenuity shift upward: toward designing the harness (the guides, the sensors, the boundaries) that lets a fleet of agents do great work inside parameters we can trust. That’s the part I don’t think gets automated, and it’s the part I find genuinely fascinating.

What to expect from this site

This site is where I write about what I’m learning and building. The posts will generally fall into a few categories:

Technical deep dives on AI orchestration architecture, context engineering, and the infrastructure patterns that make autonomous development work at a systems level. AWS Step Functions, Bedrock, CDK, observability pipelines. Real implementation details, not hand-waving.

Industry commentary on where autonomous development is heading, what’s working, what’s hype, and what the implications are for engineering organizations and the people who work in them.

Building in public, the honest experience of going from senior IC at an established company to building a product and a business, including the decisions, trade-offs, and mistakes along the way.

Practical guides born from real implementation work, not theory. The kind of thing I wish I could have found when I was figuring it out.

Coming soon: Initiative-Driven Development

The first substantial thing I’ll be publishing here is the methodology I mentioned above: Initiative-Driven Development (IDD). It’s a harness engineering methodology aimed at a problem that anyone who has tried to use AI agents for large-scale development work has run into: the agent loses coherence across sessions.

You can get an AI agent to do excellent work inside a single session. But real engineering initiatives (the kind that span weeks, touch multiple bounded contexts, and require architectural decisions) don’t fit in one session. Context windows reset. The agent forgets what it decided yesterday. Work drifts from the plan.

IDD addresses this with a structured document hierarchy (Initiatives, Phases, Milestones, and Waves) where each level acts as a context reset boundary and a feedforward guide. It sits squarely in the harness engineering tradition: progressive disclosure, persistent planning artifacts, human review gates, and control-loop primitives at the edges. It’s how I plan and execute everything at Artificer Digital, and it’s been transformative for the kind of multi-session AI-assisted work that most people find frustrating.
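
As a preview, here’s the hierarchy sketched as plain data. The field names are illustrative (the full write-up will go deeper), but the key move is visible: each wave starts from a fresh context window seeded only with the persistent artifacts it needs, never the raw transcript of earlier sessions.

```ts
// Illustrative shape of the IDD hierarchy; field names are mine, for sketching.
interface Wave      { goal: string; contextSeed: string }  // smallest unit of work
interface Milestone { name: string; waves: Wave[] }
interface Phase     { name: string; reviewGate: boolean; milestones: Milestone[] }
interface Initiative {
  name: string;
  decisions: string[];  // persistent planning artifacts that survive context resets
  phases: Phase[];
}

// Each wave begins with a fresh context, seeded via progressive disclosure:
// the initiative's standing decisions plus this wave's goal, nothing else.
function seedContext(init: Initiative, wave: Wave): string {
  return [init.name, ...init.decisions, wave.goal, wave.contextSeed].join("\n");
}
```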

More on that soon. If you want to be notified when it drops, the best way is to follow me on LinkedIn or subscribe to The Artificer’s Grimoire.


If you’re working in this space (building AI agents, integrating them into engineering workflows, or just trying to figure out what autonomous development means for your team), I’d love to hear from you. You can find me on GitHub, LinkedIn, or reach me at tim@artificerdigital.com.

