Why a two-week shape

Two weeks is short enough that the agency can afford to lose the time if the pilot fails, and long enough to survive the first round of real client work hitting the new workflow. Anything shorter and you only measure the honeymoon; anything longer and the pilot starts to compete with live delivery for attention and loses.

The other reason is staffing. A five-person team cannot dedicate anyone full-time to an AI experiment. Two weeks of part-time attention from one or two people is roughly the budget that is actually available, once client commitments are honoured.

Narrowing the brief to one metric

"We want to use AI more" is not a brief — it is a feeling. Before any tool is chosen we spend the first conversation converting the feeling into a single number the agency already tracks, or could track without new tooling.

For a client-services agency, the metrics that tend to move first are:

  • First-reply time on inbound enquiries — the delay between a customer email arriving and a human response going out.
  • First-draft time on a standard deliverable — the hours from brief received to a draft good enough for internal review.
  • Proposal throughput — the number of proposals sent per week, holding quality constant.

We pick one. Not two. A pilot that tries to move two metrics needs two interventions, and once both are running no one can say which change caused which movement; the agency leaves the pilot no wiser than when it started.

Setting the baseline before anything changes

Half of a good pilot happens before the intervention does. We spend the first three days measuring the chosen metric under the agency's current, unchanged workflow. Spreadsheets are fine. No instrumentation project, no dashboards.
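
If the spreadsheet is nothing more than two timestamps per enquiry, the baseline number is one short script away. A minimal sketch in Python, assuming a hypothetical enquiries.csv with received_at and replied_at columns holding ISO-8601 timestamps:

    # baseline.py: median first-reply time from a simple timestamp log.
    import csv
    from datetime import datetime
    from statistics import median

    def reply_minutes(path="enquiries.csv"):
        """Minutes between each enquiry arriving and the reply going out."""
        gaps = []
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                received = datetime.fromisoformat(row["received_at"])
                replied = datetime.fromisoformat(row["replied_at"])
                gaps.append((replied - received).total_seconds() / 60)
        return gaps

    if __name__ == "__main__":
        gaps = reply_minutes()
        print(f"enquiries logged: {len(gaps)}")
        print(f"median first-reply time: {median(gaps):.0f} minutes")

Median rather than mean, on the theory that one enquiry left to sit over a weekend should not swamp the whole week.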

The baseline matters because otherwise, at the end of the pilot, the team will remember the pilot week as faster regardless of what actually happened. Memory flatters interventions; numbers do not.

Choosing the intervention last

Most agencies come to us with the intervention already chosen. They have heard about a particular tool, seen a demo, and want help implementing it. We reverse the order on purpose. The metric is picked first, the baseline is measured second, and only then do we choose the intervention — because the intervention has to be the smallest one that could plausibly move the chosen metric.

For first-reply time, that is almost never a full chatbot. It is more often a prompt template that drafts a first response from the enquiry email, which a human then edits before sending. Five minutes instead of thirty. The team stays in the loop on tone, pricing, and anything that commits the agency legally — but they skip the blank-page step.
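
What such a template can look like, as a minimal sketch: the wording, the model name, and the draft_reply helper are all illustrative, and the call assumes the OpenAI Python client, though any chat-completion API would do.

    # draft_reply.py: turn an inbound enquiry into a first draft for a
    # human to edit. Template wording and model choice are illustrative.
    from openai import OpenAI

    TEMPLATE = """You draft first replies to client enquiries for a small agency.
    Write a short, warm reply to the enquiry below. Acknowledge the request and
    ask at most two clarifying questions. Do not quote prices, timelines, or
    anything that commits the agency; a human will edit this before it is sent.

    Enquiry:
    {enquiry}"""

    def draft_reply(enquiry: str) -> str:
        client = OpenAI()  # reads OPENAI_API_KEY from the environment
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative; any capable model works
            messages=[{"role": "user", "content": TEMPLATE.format(enquiry=enquiry)}],
        )
        return response.choices[0].message.content

The one hard rule sits inside the template itself, no prices and no commitments, because the human, not the model, stays accountable for anything binding.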

What the two weeks actually look like

Days 1–3: baseline measurement. No workflow changes. The team logs start and end times for the chosen metric, and we note the common shape of the work — what kinds of enquiries come in, how they cluster, how long the current responses tend to be.

Days 4–5: we draft the intervention. For a first-reply-time pilot, that is typically one prompt template plus a short written protocol: when to use it, when to override it, what to do if the draft is wrong. Nothing more elaborate.

Days 6–12: the team uses it on live client work. We check in briefly each day — usually fifteen minutes — to capture which drafts were usable, which had to be rewritten, and why. The agency keeps doing its job; the pilot rides on top of the real work rather than replacing it.

Days 13–14: we compare the metric against the baseline, and write a short report. One page. What moved, what did not, and whether the workflow survives removing the pilot-period support.
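
The day-13 comparison is the baseline arithmetic run twice. A sketch, reusing the hypothetical reply_minutes helper from the baseline script and assuming the pilot week was logged to a second CSV:

    # compare.py: pilot week versus baseline, for the one-page report.
    from statistics import median
    from baseline import reply_minutes  # the helper sketched earlier

    baseline_median = median(reply_minutes("baseline.csv"))
    pilot_median = median(reply_minutes("pilot.csv"))
    change = (pilot_median - baseline_median) / baseline_median * 100

    print(f"baseline median: {baseline_median:.0f} minutes")
    print(f"pilot median:    {pilot_median:.0f} minutes")
    print(f"change:          {change:+.1f} per cent")

For first-reply time the report is hoping for a negative number, but a flat one is still an answer, as the next section argues.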

What a successful pilot looks like

A good pilot outcome is not "AI saved us twelve hours a week." It is "the chosen metric moved by X per cent, in a way the team is confident they can sustain without us." Sustainability matters more than the size of the improvement — a smaller, repeatable gain is more valuable than a larger one that depends on daily hand-holding.

Equally valid: the metric did not move. That is still a useful result. It tells the agency that this particular intervention, on this particular workflow, is not where the leverage lives — and saves them from buying a licence they would not have used.

What we do not do in two weeks

We do not build integrations, train custom models, stand up dashboards, or migrate data. A pilot that touches more than one workflow is not a pilot any more; it is a programme of work, and it needs the corresponding budget and governance. The point of keeping the shape narrow is that the agency can say yes, try it, and say no — all in the space of a single sprint.

— Zheng Zhong, Founder, SOLO TECH LTD