Section 07

Running the workflow

A short tour of how a session actually flows in agent mode, with three example prompts you can adapt to your own engagement. The five stages — frame, recon, verification, analysis, reporting — give a session structure; the prompts below give you a starting point for the stages where prompts do the heavy lifting.

For sponsors: what this means in one paragraph

This section describes the discipline that determines whether the report you eventually receive is credible, and that discipline lives in two habits visible here. The first is the scope file: a short signed document, kept open beside the chat, that names exactly which device may be audited and which activities are permitted, against which every proposed action is checked before approval. The second is triangulation: any candidate finding the AI identifies must be confirmed by at least three independent tests before it appears in the report, because a single test that returns a clean "yes" or "no" can be wrong in either direction. When the engineer delivers a report and you ask how a given severity rating was reached, the answer should trace back through both of these.

How to read this page

This is a prompting primer, not a walkthrough of a specific audit. Substitute your own target details for [TARGET], your own scope file path, and your own constraints. The prompts are deliberately written in a tone the assistant responds well to: explicit about scope, explicit about the approval gate, explicit about what counts as a finding.

The five stages, in one paragraph

Frame the engagement before any tool runs: write the authorisation and the scope into the workspace, then prime the assistant with them. Recon establishes what is on the target. Verification is where you check candidate findings — typically CVEs — against the actual device with multiple independent tests. Analysis looks at deployment posture: encryption, authentication, exposure. Reporting turns the session into a written artefact a sponsor can read. The three prompts below correspond to the framing, the verification, and the reporting stages, which are the ones where careful prompt design pays the most back.

Before any prompt: the workspace

Open the engagement folder in VS Code. Two files should be pinned in a split editor pane next to the Copilot Chat panel:

  • AUTHORISATION.md — the signed scope document from Section 02.
  • NOTES.md — a working file you update as the session proceeds. Findings, denied proposals, things to follow up.

The assistant cannot see those files unless you reference them or share their contents, but having them visible to you during the session is what makes the approval gate useful. Every proposed tool call gets read against the scope file before you click approve.
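
The Section 02 document is the authoritative version of the scope. Purely for orientation, a minimal skeleton looks something like the following; the fields shown are illustrative, not a prescribed template.

AUTHORISATION.md (illustrative skeleton)

Engagement:     Internal security audit of a single device
Target:         [TARGET]  (and only this)
Permitted:      Network scanning, version fingerprinting,
                non-destructive vulnerability verification
Out of scope:   All other hosts, denial of service, credential
                brute-forcing, data exfiltration
Window:         [START DATE] to [END DATE]
Authorised by:  [NAME, ROLE]    Signed: ____________    Date: ________
Operator:       [NAME]          Signed: ____________    Date: ________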

Prompt 1 · Framing the engagement

The opening prompt sets the rules of the session. Be explicit about what is in scope, what the assistant should and should not propose, and how findings should be confirmed. A loose opening prompt is the single most common cause of scope drift later.

You are assisting with an authorised internal security audit of a
single device on a network I own. Authorisation paperwork is in
AUTHORISATION.md in this workspace; scope and out-of-scope are in
NOTES.md.

Target (and only this):
- [TARGET]   (placeholder for the device address)

Ground rules for this session:
- Propose one tool call at a time. Wait for my approval before
  proposing the next.
- Do not propose anything against any host I have not explicitly
  listed above.
- Severity ratings must be CVSS 3.1-justifiable. Do not call
  something "Critical" without naming the CVSS vector.
- For any candidate finding, propose at least three independent
  tests before treating the conclusion as confirmed.

Begin with reconnaissance against the target. Suggest the first
nmap call and explain what each flag does before I approve.

Three things this prompt is doing that a casual prompt does not:

  • "And only this". The phrase exists to refuse the assistant's natural tendency to characterise neighbours.
  • "One tool call at a time." Stops the assistant from queueing up a sequence of approvals that blur together.
  • "Explain what each flag does before I approve." Forces the assistant to surface assumptions you can challenge before the command runs.

Prompt 2 · Triangulating a CVE candidate

At some point the assistant will identify a published CVE that might apply to the device. The temptation is to run a single proof-of-concept and accept the result. Don't. A single test that returns "200 OK" can be wrong in either direction — the device might be patched but the payload still parseable, or the device might be vulnerable but the specific PoC mis-targeted.

Use this prompt to force triangulation:

You have identified [CVE-ID] as a candidate finding against the
target. Before I accept any conclusion either way, I want three
independent tests:

  1. A direct payload, as written in the original advisory.
  2. A blind variant — typically time-based or boolean — that
     will produce a different observable result if the device
     executes the payload than if it merely parses it.
  3. A check against a mature third-party module (Metasploit
     "check" mode, an established scanner, or an equivalent),
     so the verdict does not depend on a payload of your own
     authorship alone.

Propose test 1. Explain what observable outcome would distinguish
"vulnerable" from "patched", so I know what to look for in the
output before I approve.

Why three tests, not one

A single negative test can be a false negative — the payload didn't reach the vulnerable code path, the device dropped the connection, the response looked benign for an unrelated reason. A single positive test can be a false positive — the parser accepted the payload but did not execute it. Three independent paths to the same answer are much harder to fool, and the cost is only a few extra approved tool calls.
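
For the blind variant in test 2, the observable is usually a timing or boolean differential rather than response content. The sketch below is a minimal illustration of a time-based check in Python, assuming an HTTP-reachable endpoint; the URL, the parameter name, and both payloads are placeholders, and the real values come from the advisory for the specific CVE, not from this example.

# Minimal sketch of the time-based blind check (test 2 in the prompt above).
# Assumptions: the target exposes an HTTP endpoint, the advisory supplies a
# payload that sleeps for ~5 seconds only if it is executed, and a benign
# control payload that is parsed but never executed. Every value below is a
# placeholder, not a working exploit string.
import statistics
import time

import requests

TARGET_URL = "http://TARGET/path"         # the single in-scope device
CONTROL_PAYLOAD = "<benign control>"      # parsed, never executed
SLEEP_PAYLOAD = "<advisory payload, 5s>"  # executed only if vulnerable
SAMPLES = 5

def timed_request(payload: str) -> float:
    """Send one request and return the elapsed wall-clock time in seconds."""
    start = time.monotonic()
    requests.post(TARGET_URL, data={"q": payload}, timeout=30)
    return time.monotonic() - start

control = [timed_request(CONTROL_PAYLOAD) for _ in range(SAMPLES)]
probe = [timed_request(SLEEP_PAYLOAD) for _ in range(SAMPLES)]
delta = statistics.median(probe) - statistics.median(control)

print(f"median control {statistics.median(control):.2f}s, "
      f"median probe {statistics.median(probe):.2f}s, delta {delta:.2f}s")

# A delta consistently close to the injected sleep suggests the payload was
# executed rather than merely parsed; a delta near zero suggests the opposite.
# Either way this is one leg of three, not a verdict on its own.

Taking the median over several samples keeps a single slow response from deciding the verdict; whatever threshold you choose, the result is still only one of the three legs.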

Prompt 3 · Drafting the report

At the end of the session, the assistant is excellent at turning the chat log into a written report. The operator's job is to keep the inflation in check — severity creep, imagined tests, recommendation laundering — by constraining the draft up front.

Draft a security assessment report based on the tool calls we have
approved and the outputs we have observed in this session.

Structure:
  1. Executive summary (5–8 sentences, no jargon).
  2. Methodology.
  3. Findings (one entry per finding).
  4. Recommendations, ranked by mitigation impact.

Constraints:
- Every finding must reference at least one specific test from
  this session. If you cannot point at the approval in our chat,
  do not write the finding.
- Severity must be CVSS 3.1-justifiable. State the vector.
- Recommendations should name a specific configuration change
  against a specific component. No marketing slogans, no
  generic "implement Zero Trust" or "improve security posture".
- The overall risk rating must be supported by the findings,
  not pulled from the air.

Mark anywhere you are uncertain so I can verify against my notes.
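
The first two constraints are easy to spot-check mechanically before you sign. The sketch below assumes you copy each drafted finding into a small structure of your own; the field names, the example entry, and the approval references are illustrative, not a prescribed format.

# Sketch of a mechanical spot-check on the drafted findings, run before you
# sign. The structure, field names, and example entry are illustrative only.
findings = [
    {
        "title": "Outdated service version on [TARGET]",
        "severity": "High",
        "cvss_vector": "CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:N/A:N",
        "tests": ["approval #12: nmap -sV", "approval #19: banner check"],
    },
]

for finding in findings:
    # Constraint 1: every finding references at least one test from this session.
    if not finding["tests"]:
        print(f'VERIFY OR DROP: "{finding["title"]}" cites no approved test')
    # Constraint 2: severity is stated alongside a CVSS 3.1 vector to defend it.
    if not finding["cvss_vector"].startswith("CVSS:3.1/"):
        print(f'NEEDS A VECTOR: "{finding["title"]}" has a severity but no vector')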

Reading the assistant critically

Across all three prompts, the same operator habit matters more than the prompt text itself: read every proposed tool call and every drafted paragraph against what you actually approved. Four recurring patterns to push back on:

  • Severity inflation. "Critical" is reserved for CVSS 9.0+. If the assistant labels something Critical without a vector, ask for one; a worked scoring example follows this list.
  • Imagined tests. If a report references a test you do not remember approving, search the chat log. If it is not there, strike the reference.
  • Recommendation laundering. Slogans are not recommendations. A recommendation names a specific change against a specific component.
  • Phantom CVE IDs. The assistant occasionally cites CVE numbers that look plausible but are misremembered. Verify every CVE ID against NVD before it appears in a report.
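
Working the arithmetic once makes it obvious why the vector matters. The snippet below implements the CVSS 3.1 base-score formula for scope-unchanged vectors only; the two example vectors are illustrative, chosen to show how far apart a full remote-compromise vector and a typical header-misconfiguration vector land.

# CVSS 3.1 base score for scope-unchanged (S:U) vectors, worked by hand so a
# "Critical" label can be checked against the vector the assistant names.
# The two example vectors at the bottom are illustrative, not session findings.
import math

WEIGHTS = {
    "AV": {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.20},
    "AC": {"L": 0.77, "H": 0.44},
    "PR": {"N": 0.85, "L": 0.62, "H": 0.27},   # values for unchanged scope
    "UI": {"N": 0.85, "R": 0.62},
    "C":  {"H": 0.56, "L": 0.22, "N": 0.0},
    "I":  {"H": 0.56, "L": 0.22, "N": 0.0},
    "A":  {"H": 0.56, "L": 0.22, "N": 0.0},
}

def roundup(x: float) -> float:
    """Round up to one decimal place, as defined in the CVSS 3.1 specification."""
    i = round(x * 100000)
    return i / 100000 if i % 10000 == 0 else (math.floor(i / 10000) + 1) / 10

def base_score(vector: str) -> float:
    """Base score for a scope-unchanged CVSS 3.1 vector string."""
    m = dict(part.split(":") for part in vector.split("/")[1:])  # skip "CVSS:3.1"
    iss = 1 - ((1 - WEIGHTS["C"][m["C"]]) * (1 - WEIGHTS["I"][m["I"]])
               * (1 - WEIGHTS["A"][m["A"]]))
    impact = 6.42 * iss
    exploitability = (8.22 * WEIGHTS["AV"][m["AV"]] * WEIGHTS["AC"][m["AC"]]
                      * WEIGHTS["PR"][m["PR"]] * WEIGHTS["UI"][m["UI"]])
    return 0.0 if impact <= 0 else roundup(min(impact + exploitability, 10))

# Full remote compromise, no privileges, no interaction: 9.8, genuinely Critical.
print(base_score("CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H"))
# An illustrative header-misconfiguration vector: 3.1, Low, nowhere near 9.0.
print(base_score("CVSS:3.1/AV:N/AC:H/PR:N/UI:R/S:U/C:L/I:N/A:N"))

For scope-changed vectors the weights and the impact formula differ, so reach for a full CVSS calculator rather than extending this sketch.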

The report is yours, not the assistant's

The findings, severities, recommendations, and overall risk rating are your professional judgement. The assistant drafts; you sign. If a sponsor later asks "where did this rating come from?", the answer cannot be "the AI said so" — it has to be your reasoning, traceable to the approved tool calls in the session log.

What a good session leaves behind

At the end of an engagement, the workspace should contain enough to make the session reproducible by someone else with the same lab:

  • AUTHORISATION.md — signed, dated, scoped.
  • NOTES.md — running notes, including any tool-call proposals you denied and why.
  • The saved Copilot Chat log — every proposal, every approval, every output.
  • The final report, with severities, recommendations, and an overall risk rating each traceable to an entry in the chat log.

With those four artefacts together, the workflow is auditable in the strict sense: anyone reading them later can reconstruct what was proposed, what was approved, what was observed, and how the conclusions were reached. That auditability is the end product of the discipline this guide has been building toward since Section 02.

Check yourself

Three questions on running a session well

The prompts above scaffold a session; the habits below decide whether the session is any good. These three test the parts where the assistant will, given a chance, talk you into the wrong answer.

Q1 / 3

The assistant runs a single PoC against a published CVE and reports the target is "not vulnerable". Is that conclusion safe to record in the report?

  • Yes — a clear negative from a recognised PoC is sufficient.
    Incorrect. A single negative can be a false negative: the payload didn't reach the vulnerable code path, the device dropped the connection, or the response looked benign for an unrelated reason.
  • No — Section 07 asks for three independent tests (direct payload, blind variant, mature third-party module) before treating either verdict as confirmed.
    Correct. Three independent paths to the same answer are much harder to fool. The cost is a few extra approved tool calls; the benefit is not publishing a "patched" verdict that turns out to be wrong.
  • Yes, provided the PoC came from a Metasploit module rather than a one-off script.
    Incorrect. A mature module is one of the three legs, not a substitute for the other two. Pedigree of the test is not the same as independent confirmation.
Q2 / 3

The assistant labels a misconfigured TLS header as a "Critical" finding in its draft report. What's the right operator response?

  • Accept the rating — the assistant is calibrated against CVSS internally.
    Incorrect. The assistant is not a CVSS calculator. "Calibrated internally" is precisely the kind of inflation the operator is there to catch.
  • Push back: ask for the CVSS 3.1 vector that justifies "Critical" (a score of 9.0+), and downgrade the rating if the vector cannot be defended.
    Correct. Severity creep is one of the four patterns to push back on. A header misconfiguration is rarely 9.0+; making the assistant state the vector forces an honest rating.
  • Leave the rating but add a footnote noting your disagreement.
    Incorrect. The report is signed by you. A rating you don't believe is one you shouldn't publish, footnoted or otherwise.
Q3 / 3

A sponsor asks "where did this 8/10 risk rating come from?". What is the right answer?

"The assistant scored it, based on the findings."
No "The AI said so" is not a defensible answer. The rating is the operator's professional judgement; the assistant drafts, the operator signs.
Your reasoning, traceable to specific tests in the session log — the approved tool calls and the outputs you observed.
Yes The findings, severities, recommendations, and overall rating are yours. Traceability back to the chat log is what makes the answer survive scrutiny.
"It's a rough estimate — we shouldn't read too much into the number."
No If the number is in the report, it should be defensible. If it isn't defensible, it shouldn't be in the report.