A Pre-Coding Checklist for AI-Agent Development


目次

  1. はじめに
  2. 議論の出発点となった 8 項目
  3. 議論で浮かび上がった 4 つの補強領域
    1. 観測可能な受け入れ条件
    2. スコープ外の明示
    3. 既存コードの実在確認(identifier pre-grep)
    4. 検証計画の事前定義
  4. 統合チェックリスト(6 フェーズ・24 項目)
    1. フェーズ 1: なぜ / 誰のために
    2. フェーズ 2: 何を
    3. フェーズ 3: 制約・品質バー
    4. フェーズ 4: どこに(コードベース上の文脈)
    5. フェーズ 5: どう作る
    6. フェーズ 6: いつ Done か / 誰が決めるか
  5. AI 協業で特に優先すべき Top 5
  6. よく省略されて高コストになる項目
  7. 議論から得られた原則
    1. 「常識でわかる」は最も危険なフレーズ
    2. 「どう壊れているかを言えること」を「どう作るかを言えること」より先に
    3. 不可逆な決定を最優先で pin する
  8. まとめ
  9. 付録: 各エージェント Round 1 投稿全文
  10. Introduction
  11. Starting Point: The Initial Eight-Item Draft
  12. Four Reinforcement Areas the Discussion Surfaced
    1. Observable Acceptance Criteria
    2. Explicit Out-of-Scope List
    3. Verifying Existing Code Identifiers (Pre-Grep)
    4. Verification Plan Defined Upfront
  13. Integrated Checklist (6 Phases / 24 Items)
    1. Phase 1: Why and For Whom
    2. Phase 2: What
    3. Phase 3: Constraints and Quality Bars
    4. Phase 4: Where (Codebase Context)
    5. Phase 5: How
    6. Phase 6: When Is It Done, and Who Decides
  14. Top 5 to Prioritize for AI Collaboration
  15. Commonly Omitted Items with the Highest Downstream Cost
  16. Principles That Emerged from the Discussion
    1. “It Should Be Obvious” Is the Most Dangerous Phrase
    2. “How Would I Know It Is Broken?” Comes Before “How Would I Make It Work?”
    3. Pin the Irreversible Decisions First
  17. Conclusion
  18. Appendix: Full Round 1 Posts from Each Agent

はじめに

AIエージェントを使った開発は、作りたいものをそのまま投げれば動くものが返ってくる、という体験を可能にします。しかし実際にやってみると、「動くように見えるが間違っている」「触らなくていい場所まで書き換えられた」「存在しない関数名でコードが組まれた」といった失敗が頻発します。

これらは AI の能力そのものの問題というより、着手前に伝えるべき前提条件が伝わっていないために起きる事故です。人間の開発者が「常識でわかる」前提として暗黙に共有している情報を、AIエージェントは持っていません。したがって、コーディング開始前に「仕様・段取り書」を整えておくことが、AI協業の成否を分けます。

本記事では、複数のAIエージェントとこの問題を議論した結果として得られた、AI協業向けの統合チェックリストを紹介します。

議論の出発点となった 8 項目

議論の叩き台として、まず次の 8 項目を用意しました。

  1. 作るものの目的
  2. ターゲット
  3. ユースケース
  4. 機能要件
  5. 非機能要件
  6. 制約条件
  7. 実装の段取り
  8. 各ステップで触るファイル

この 8 項目は「何を / いつ / なぜ作るか」をひととおり押さえており、方向としては正しいものでした。しかしAIエージェントとの協業に限定して考えると、この 8 項目だけでは典型的な失敗モードを防ぎきれないことが、議論を通じて明らかになりました。

議論で浮かび上がった 4 つの補強領域

外部コンサル視点・開発者視点・検証者視点という異なる立場の 3 エージェントが独立に議論したところ、共通して挙がった補強領域が次の 4 つでした。

1. 観測可能な受け入れ条件

「機能が動くこと」ではなく、「コマンド X が exit 0 を返す」「ファイル Y に文字列 Z が存在する」「測定値 M が閾値 T 以下」といった、検証可能な命題で Done を定義します。これがないと、AIエージェントは「動いたっぽい」で報告を打ち切ります。Done が opinion-shaped(主観的)ではなく checkable(検証可能)になることが、最も大きなレバレッジになります。

2. スコープ外の明示

「触らないファイル・モジュール・領域」を明示的にリスト化します。AIエージェントは未指定部分を「親切に」書き換える傾向があり、これがスコープ膨張の最大の原因になります。「やること」と同じか、それ以上に「やらないこと」を書面にしておく必要があります。

3. 既存コードの実在確認(identifier pre-grep)

新コードが呼び出す予定の関数名・クラス名・ファイル名を、grep などで 全件 hit 確認 してから着手します。AIエージェントは関数名や import パスを幻覚する傾向があり、事前の grep だけで初動失敗の約 8 割を消せます。投資 5 分でリターン数時間という、もっとも費用対効果の高い習慣のひとつです。

4. 検証計画の事前定義

unit / integration / e2e の期待、smoke check の手順、fixtures、手動 QA の手順を、着手前に 確定します。事後検証では「動いたっぽい」での停止を防げません。事前に書かれた検証手順だけが、「looks right(うまくいったように見える)」を「was checked(確認された)」へ変換します。

統合チェックリスト(6 フェーズ・24 項目)

叩き台の 8 項目に上記 4 領域を加え、最終的に 6 フェーズ・24 項目のチェックリストへ整理されました。すべてを毎回埋める必要はありませんが、AIエージェントに任せる範囲が広いほど省略のコストが上がる、という関係になります。

フェーズ 1: なぜ / 誰のために

  • 作るものの目的: 1〜3 行で書ける「これを作る理由」。後続の意思決定すべての判定軸になります。
  • ターゲット: 想定ユーザーの規模・前提知識。「自分だけ」「チーム内」「公開」で要件の粒度が変わります。
  • ユースケース: 具体的なシナリオを 2〜5 件。抽象的な機能定義ではなく「ユーザーがこの順で操作する」レベルで書きます。

フェーズ 2: 何を

  • 機能要件: 箇条書きで「するべきこと」を列挙します。
  • 観測可能な受け入れ条件: 各機能要件に対し、最低 1 つの検証可能な命題を対応させます。
  • スコープ外の明示: 触らないファイル・領域のリスト。
  • インターフェース契約: request/response 形式、DBマイグレーション、イベント名、ファイル形式、後方互換要件。いま編集しているファイルからは見えない場所にある「このAPI・関数・スキーマを呼び出している別のコード」を壊さないための重要な成果物です。

フェーズ 3: 制約・品質バー

  • 非機能要件: 性能・セキュリティ・可用性・観測性などの数値目標。
  • 制約条件: 技術スタック・予算・期限・依存ライブラリの version pin など。とくに 不可逆な決定 を最優先で書きます。
  • 失敗モード分類: LOW / MED / HIGH / CRIT の判定基準と、各 tier における対応方針。事前にないと、検証側が過小報告か過剰報告のどちらかへ振れます。
  • データ移行・互換性ポリシー: 旧データ・旧クライアント・部分デプロイ時の挙動。スキーマや API の小さな変更でも、本番事故の典型的な原因になります。
  • 再現性メタデータ: 乱数 seed、モデル/バージョン pin、決定的設定スイッチ。

フェーズ 4: どこに(コードベース上の文脈)

このフェーズは当初の 8 項目になく、議論でもっとも強く補強された領域です。AIエージェントは現状把握の足場がないと幻覚で書くため、ここが pre-coding 段階でもっともレバレッジの高い投資になります。

  • リポジトリ baseline: 正確な commit hash、git status が clean であることの確認、build + test コマンドの literal な記述。「変更が動いたか」は事前状態を引き算しないと判定できません。
  • 既存コードの survey と identifier pre-grep: 上述。
  • Current-state map: 関連アーキテクチャ、データフロー、API、スキーマ、feature flag、環境変数、既知の壊れやすい領域への「最初に見るべき場所」のポインタ。
  • regression baseline のファイル化: 「これは regression しないこと」を、雰囲気ではなく snapshot・golden file・baseline metric として保存します。

フェーズ 5: どう作る

  • 実装の段取りと各ステップの exit predicate: 順序と依存だけでなく、各ステップに「このコマンドが PASS したら次へ」という literal な完了判定を付けます。
  • 変更境界: 触るファイルを write-target(書き込む) / read-context(読むだけ) / off-limits(触らない) の 3 区分で明示。
  • コミット粒度の決定: atomic 1-commit / 段階的 chain / bundle / PR-per-step のいずれかを 着手前に 決めます。決めずに書くと、後から re-stage の手戻りが発生します。
  • 測定方法の明示: 値だけでなく 計算方法。例えば単語数は、空白で分割するか、特定のコマンドで数えるかで 10〜30% 食い違います。値だけ報告すると、方法論の不一致が内容の不一致に偽装されます。

フェーズ 6: いつ Done か / 誰が決めるか

  • 検証計画を着手前に確定: 上述。
  • Smoke-check プロトコル: 30 秒以内に走る 1〜3 ステップの検証レシピ。フル検証の 10% 未満のコストで fail fast できます。
  • Definition-of-done gate: 「local で動作確認まで」「master へ land まで」「ユーザーへ ship まで」のどれが Done か。gate が違えば求められる polish レベルも変わります。
  • 意思決定の所有とエスカレーション規則: AIエージェントが 独断で決めてよい範囲 / 確認を要する範囲 / 変更してはいけない範囲 を事前に区分。曖昧だと、AI は確信ありげに勝手な前提で進みます。

AI 協業で特に優先すべき Top 5

24 項目すべてを毎回書くのは現実的ではありません。3 エージェントが共通に「これがないとAIエージェントは必ず踏み外す」と挙げた高優先項目は次の 5 つです。これだけは省略しないことを推奨します。

  1. 観測可能な受け入れ条件 — Done を opinion から checkable proposition に変換する、最大レバレッジの項目。
  2. スコープ外の明示 — AI の過剰な書き換えを止める、ほぼ唯一の現実的な手段。
  3. 既存コードの identifier pre-grep — 関数名・import パスの幻覚という、もっとも多い初動失敗の原因を 8 割消します。
  4. リポジトリ baseline と着手前の 3 質問 — AI が最初の編集を打つ前に、平文で次の 3 つに答えさせます: (a) どのファイルを生成・変更するか(絶対パスで)、(b) 何を verification として Done を宣言するか(literal なコマンドで)、(c) どの commit / どの作業ツリーから始めるか(hash と git status で)。
  5. 検証計画の事前確定 — 事後検証は「動いたっぽい」での停止を防げません。

とくに 4 番目の「着手前の 3 質問」は、これを答えさせる運用にするだけで失敗が劇的に減ります。曖昧なまま着手させると、AI は欠けている部分を確信ありげに捏造します。pin に 2 分、pin しない代償は典型的な失敗モードの全カタログです。

よく省略されて高コストになる項目

逆に「常識でわかる」「走りながら決める」として省略されがちで、後段で大きなコストになる項目を整理しておきます。

  • スコープ外の明示 — 善意の周辺リファクタで diff が汚染され、検証が壊れます。
  • 既存インターフェース契約 — 編集しているファイルからは見えない場所にある呼び出し側のコードが、コンパイルや単体テストを通り抜けたままサイレントに壊れます(本番投入や結合段階で初めて発覚するパターン)。
  • 検証コマンド・受け入れ条件 — AI が「うまくいったように見える」で報告し、後段で fail します。
  • 測定方法の明示 — 同一の成果物に対する値の食い違いを、内容の不一致と誤解します。
  • リポジトリ baseline — AI が uncommitted な状態に対して編集してしまい、diff に無関係な変更が混入します。
  • コミット粒度 — 既定の過分割で、1-commit で済むはずの変更が 5-commit の手戻りに膨れます。
  • 失敗モード分類 — 走りながらの校正で、プロジェクト寿命にわたる FLAG のインフレが起きます。

これらに共通するのは、「人間同士なら local knowledge として共有されている前提」だということです。AIエージェントが持っていないのは、まさにその local knowledge です。

議論から得られた原則

個別の項目を超えて、メタな原則として残った 3 点を紹介します。

「常識でわかる」は最も危険なフレーズ

「常識でわかる」と判断したものは、たいてい local knowledge です。pre-coding チェックリストは官僚的な書類仕事ではなく、context compression(文脈の圧縮) — 最大の脱線を防ぐ少数のファクトを転送する作業として位置づけるのが正しい理解です。

「どう壊れているかを言えること」を「どう作るかを言えること」より先に

多くのチェックリストは「作る側」に偏り、「壊れていると判定する条件」を後回しにします。しかし AI 協業では、検証不能な仕様は、AIコーダーが暗黙のうちに under-deliver する仕様でもあります。verifiability(検証可能性)への投資は、そのまま implementation clarity(実装の明確さ)への投資になります。

不可逆な決定を最優先で pin する

full spec(完全な仕様)を求めるところから始めません。先に 不可逆な決定 を pin します。「壊せない public contract は?」「失えないデータは?」「動かせない期限・プラットフォーム制約は?」「コードがコンパイルしても failure になる条件は?」。これらが settle されていれば、AIエージェントは残りの部分で生産的に iterate できます。settle されていないと、きれいな実装でも間違った実装になります。

まとめ

叩き台の 8 項目は「何を / いつ / なぜ」を素直に押さえており、方向としては正しいものでした。AIエージェント協業の文脈で補強すべきは、次の 3 つの軸でした。

  • Where 軸(コードベース上のどこに在るか、AI が幻覚しないための足場)
  • Falsifiability 軸(Done・壊れている・regression を testable predicate へ変換する)
  • Negative space 軸(やらないこと・触らないものを、やることと同等以上に明示する)

チェックリスト全体は、規模に応じて取捨選択するものです。小規模な個人タスクなら半分以上を省略してかまいません。ただし AIエージェントに任せる場合は、Top 5(受け入れ条件 / スコープ外 / pre-grep / baseline と 3 質問 / 検証計画)だけは省略しないことを推奨します。

仕様・段取り書を整える時間は、AI に書かせる時間と引き換えに「考える時間」を取り戻す作業でもあります。コーディングそのものを AI に任せられるからこそ、その手前で何を渡すかを丁寧に詰めることが、これまで以上に重要になっています。

付録: 各エージェント Round 1 投稿全文

本記事の元になった議論で、3 つのエージェントが独立に投稿した原文を掲載します。それぞれ別の役割視点で書かれており、本文では統合・要約のために削った細部や言い回しが残っています。AI が「何をどう考えたか」を生のかたちで読みたい方向けの参考資料としてご覧ください(原文は英語です)。

codex-agent(外部コンサル視点)

外部の技術コンサルタントとしての視点。「不可逆な決定を最優先で pin する」「常識でわかるは最も危険なフレーズ」といった本文のメタ原則は、このエージェントの投稿が起点になっています。

## codex-agent Round 1 — pre-coding checklist

### A. Additions/refinements to the user's list
The user's list is solid, but I would add several artifacts that prevent avoidable ambiguity. First: a **success definition / acceptance criteria** document, separate from functional requirements. "The feature exists" is not enough; define observable outcomes, sample inputs/outputs, error behavior, and what counts as done. Second: an **out-of-scope list**. For AI-assisted work this is as important as scope, because agents tend to fill gaps with plausible extras. Third: a **current-state map**: relevant existing architecture, data flow, APIs, schemas, feature flags, environment variables, and known fragile areas. Fourth: **interface contracts**: request/response shapes, database migrations, event names, file formats, backwards-compatibility requirements. Fifth: a **test and verification plan before implementation**: unit/integration/e2e expectations, smoke checks, fixtures, manual QA steps, performance/security checks when relevant. Finally: **decision ownership and escalation rules**: what the coder may decide independently, what requires confirmation, and what must not be changed.

I would refine "implementation plan / files touched" into a broader **change boundary**: files likely touched, files explicitly off-limits, expected dependency changes, generated artifacts, migration/rollback steps, and commit/review granularity.

### B. Prioritization for AI-assisted coding
The highest-priority items for an AI coder are: purpose, acceptance criteria, current-state map, change boundary, and verification plan. These are the pieces that let the agent distinguish a correct implementation from a locally plausible one. Target users and high-level use cases matter, but if they are not converted into testable behavior, they remain too soft to steer code.

The commonly skipped high-cost items are **negative scope**, **existing contracts**, and **verification commands**. Negative scope prevents well-intentioned rewrites. Existing contracts prevent breaking consumers that are not visible in the immediate file. Verification commands prevent the agent from stopping at "looks right" instead of "was checked." Another expensive omission is **data migration / compatibility policy**: even small schema or API changes can become production incidents if the spec does not say whether old data, old clients, or partial deployments must keep working.

For AI collaboration, I would make "where to look first" explicit: key files, docs, tests, prior issues, and examples. Without this, the agent spends budget rediscovering context and may overweight whatever it finds first.

### C. Opinionated insight from my role
From an external-consultant perspective, the most dangerous phrase is "it should be obvious." Obvious things are local knowledge, and local knowledge is exactly what an outside coder or AI agent lacks. A strong pre-coding checklist is less about bureaucracy and more about **context compression**: transferring the few facts that prevent the largest wrong turns.

My bias: do not start by asking for a full spec; start by asking for the **irreversible decisions**. What public contract cannot break? What data cannot be lost? What deadline or platform constraint is fixed? What would make this change a failure even if the code compiles? If those are settled, an AI coder can often iterate productively on the rest. If they are not settled, even a clean implementation can be the wrong implementation.

video2ai-agent(実装者視点)

実際にコードを書く側からの視点。「identifier pre-grep」「リポジトリ baseline」「着手前の 3 質問」といった、本文 Top 5 に組み込んだ実務的な提案は主にこのエージェントから出ています。

## video2ai-agent Round 1 — pre-coding checklist

### A. Additions/refinements to the user's list

The user's 8 items cover the *what* and *when* of a feature, but assume the *where-it-lives* and the *what-counts-as-done* are obvious. They are not. Missing categories:

1. **Repo baseline / pre-state** — exact commit hash, clean-tree assertion, build+test command. Without this, "did my change work" is ambiguous because you cannot subtract the pre-state.
2. **Existing-code survey + identifier pre-grep** — concrete function/class/file names already in the repo that the new code will call into. Per `feedback_plan_lock_pregrep_norm`, every identifier the plan mentions must hit `rg` first; otherwise the plan is grounded in hallucination.
3. **Test contract / acceptance criteria** — the observable success signal (test name, smoke command, screenshot diff, byte count). Per `feedback_test_contract_preservation`, existing tests must be preserved-not-rewritten, and the new contract must be *added*, not *substituted*.
4. **Out-of-scope explicit list** — files/modules/sections that must NOT be touched. Frozen sections, sibling modules, generated artifacts. Skipping this is the #1 cause of scope-creep PRs.
5. **Commit granularity decision** — atomic single-commit vs incremental chain. Bundle pattern (this repo's convention) vs PR-per-step. Decide *before* writing or you re-stage churn.
6. **Definition-of-done gate** — does it land on master, ship to user, or just verify locally? Different gates = different polish levels.

Refinement of user's items: #7 (実装の段取り) should split into "ordering + inter-step dependencies + per-step checkpoint." #8 (触るファイル) should distinguish **write-target** vs **read-context** files explicitly — AI coders lose track of which holds what.

Coarse vs fine: for solo small projects, #5 (非機能要件) and #6 (制約条件) often collapse into one "constraints" bucket; over-specifying them at low scale is friction-without-payoff.

### B. Prioritization for AI-assisted coding

Top-3 highest-ROI items **for AI collaboration specifically**:

1. **Identifier pre-grep / existing-code survey** — AI coders hallucinate function names and import paths by default. Pre-grep collapses 80% of one-shot-failure modes. Cost-per-payoff ratio: 5 min for hours of debug-loop.
2. **Out-of-scope explicit list** — AI over-reaches. Without "do not touch X," collateral edits to adjacent files contaminate the diff and break verification.
3. **Test contract / acceptance criteria as a *pre-coding* artifact** — not a post-hoc check. AI claims "done" prematurely without this; with it, the AI itself can self-verify before reporting.

Commonly skipped with high downstream cost:
- **Repo baseline** (omitted because "obvious") — leads to AI editing against uncommitted state and producing diffs that include unrelated dirt.
- **Commit granularity decision** (deferred until commit time) — AI tends to over-fragment by default; deciding upfront prevents 5-commit churn for a 1-commit change.

### C. Opinionated insight from my role

The single most underweighted item: **dispatch-time clarity gate**. Before the AI writes any code, it should be forced to answer three questions in plain text: (1) "Which file(s) do I produce or modify?" with absolute paths; (2) "Against what verification do I declare done?" with a literal command or test name; (3) "From what baseline commit, with what working tree?" with a hash and `git status` snapshot. If any of these three is ambiguous after the dispatch is read, the AI will fabricate the missing piece — and the fabrication will look confident.

benchmark-agent(検証者視点)

検証(verifier)側の立場からの視点。「観測可能な受け入れ条件」「regression baseline のファイル化」「測定方法の明示」「壊れていると言えることを作り方より先に」といった、本文の falsifiability 軸の主張はこのエージェントが起点です。

## benchmark-agent Round 1 — pre-coding checklist

### A. Additions/refinements to the user's list

From the verifier vantage, the user's 8 items describe *intent* well but leave the *evidence layer* implicit. To make downstream verification possible at all, the list needs:

1. **Acceptance criteria as observable, falsifiable assertions** — not "the feature works" but "command X exits 0, file Y contains pattern Z, metric M ≤ threshold T". Each functional requirement should ship with at least one such assertion. Without this, "done" is opinion-shaped.
2. **Regression baseline + scope of preservation** — what *must not* regress, captured as a concrete pre-state (snapshot, golden file, baseline metric, byte count, page count, exit code). Per `feedback_pattern_compare_verify`, downstream catches like the D4 typo only fire because a baseline existed for pattern-compare.
3. **Smoke-check protocol** — a literal command (or 2-3 step recipe) the verifier can run *before* full validation, to fail fast at <10% wall cost.
4. **Measurement methodology specification** — *how* a metric is computed, not just *what* it is. Word count via `Measure-Object -Word` vs whitespace-split vs char-count diverge by 10-30% on the same prose; specifying the method upfront prevents fact-check disagreements that look like content disputes.
5. **Failure-mode taxonomy / FLAG severity rubric** — LOW vs MED vs HIGH vs CRIT calibrated upfront, with action implication per tier.
6. **Reproducibility metadata** — seed, model/version pin, deterministic-config switches, environment hash.

Refinement of user's #7: each step should ship with its *exit predicate* (the literal check that says "this step is done"), not just an ordering.

### B. Prioritization for AI-assisted coding

Highest-ROI for verifiers paired with AI coders:
1. **Acceptance criteria as observable assertions** — converts "AI claims done" from rhetoric to a checkable proposition. Single biggest leverage item.
2. **Regression baseline declared upfront** — AI coders rewrite when they should patch; explicit "preserve X / Y / Z" lists are the only reliable hedge.
3. **Smoke-check recipe** — sub-1-minute falsification path.

Commonly skipped, high downstream cost:
- **Measurement methodology**: dropped because "obvious"; reappears as disagreement over identical artifacts.
- **Regression baseline as a *file* not a *vibe***: omitted because the codebase "obviously" contains the baseline; reappears as a multi-cycle dispute.
- **Failure-mode taxonomy**: deferred because "we'll calibrate as we go"; the result is FLAG inflation drift.

### C. Opinionated insight from my role

**The single hardest pre-coding item is articulating "how would I know this is broken?" *before* "how would I make it work?"** Most pre-coding checklists privilege the constructive side (build the thing); verifier-aware checklists make the destructive side (the falsifiability test) a peer artifact, not an afterthought.

Corollary for AI coding specifically: a spec that an *AI verifier* cannot mechanically check is also a spec that an *AI coder* will under-deliver against in subtle ways. The two failure modes are dual. Investing in verifiability up front is the same investment as investing in implementation clarity.

Introduction

Developing with AI agents promises a smooth experience: describe what you want, and working code comes back. In practice, however, a familiar pattern emerges — “it looks right but is subtly wrong,” “the agent rewrote files it had no business touching,” “the code calls a function that does not exist.”

These failures are rarely about the AI’s raw capability. They happen because the preconditions a human developer would treat as obvious never made it into the brief. The local knowledge a human collaborator silently absorbs from the team and the codebase is exactly what an AI agent lacks. The quality of the specification and work plan you hand to the agent before it writes a single line is therefore what determines whether AI collaboration succeeds or unravels.

This article presents an integrated checklist for AI collaboration, derived from a discussion in which several AI agents debated the question among themselves.

Starting Point: The Initial Eight-Item Draft

The discussion began from the following eight-item draft.

  1. Purpose of what is being built
  2. Target users
  3. Use cases
  4. Functional requirements
  5. Non-functional requirements
  6. Constraints
  7. Implementation sequencing
  8. Files touched at each step

This list cleanly covers what / when / why and is directionally correct. However, once we narrowed the lens to AI-agent collaboration, the discussion revealed that these eight items alone do not prevent the most common failure modes.

Four Reinforcement Areas the Discussion Surfaced

Three agents — representing an external-consultant perspective, an implementer perspective, and a verifier perspective — debated the topic independently. They converged on the same four reinforcement areas.

1. Observable Acceptance Criteria

Rather than “the feature works,” define done as a verifiable proposition: “command X exits 0,” “file Y contains string Z,” “metric M ≤ threshold T.” Without this, an AI agent stops at “looks like it worked.” Converting done from an opinion-shaped statement into a checkable one is the single highest-leverage move you can make.
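
The three proposition shapes above can be written down as executable checks. The following is a minimal sketch, not a prescribed tool: the helper names, the report file, and the latency numbers are all hypothetical, and the commands are stand-ins for whatever build or test command a real project uses.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def command_exits_zero(argv):
    """Proposition shape: 'command X exits 0'."""
    return subprocess.run(argv, capture_output=True).returncode == 0


def file_contains(path, needle):
    """Proposition shape: 'file Y contains string Z'."""
    return needle in Path(path).read_text(encoding="utf-8")


def metric_within(value, threshold):
    """Proposition shape: 'metric M <= threshold T'."""
    return value <= threshold


def run_acceptance(checks):
    """Evaluate (name, callable) pairs and return the names that failed."""
    return [name for name, check in checks if not check()]


# Hypothetical acceptance criteria for an imaginary feature.
with tempfile.TemporaryDirectory() as tmp:
    report = Path(tmp) / "report.txt"
    report.write_text("status: ok\n", encoding="utf-8")
    failed = run_acceptance([
        ("script exits 0", lambda: command_exits_zero([sys.executable, "-c", "pass"])),
        ("report says ok", lambda: file_contains(report, "status: ok")),
        ("latency within budget", lambda: metric_within(180.0, 200.0)),
    ])

print("DONE" if not failed else f"NOT DONE: {failed}")
```

Because every criterion is a named callable, "done" reduces to an empty failure list, which the agent can evaluate itself before reporting.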

2. Explicit Out-of-Scope List

Maintain an explicit list of files, modules, and regions that must not be touched. AI agents have a strong tendency to “helpfully” rewrite parts you did not ask them to — the dominant cause of scope creep. Writing down what not to do matters as much as, or more than, writing down what to do.
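
One way to make the out-of-scope list enforceable is to compare the changed paths against it mechanically. This is a sketch under assumptions: the glob patterns and file names are invented for illustration, and in practice the changed paths might come from something like `git diff --name-only`.

```python
from fnmatch import fnmatch

# Hypothetical off-limits list; a real one comes from the brief.
OFF_LIMITS = ["migrations/*", "vendor/*", "docs/api.md"]


def out_of_scope_edits(changed_paths, off_limits):
    """Return the changed paths that match any off-limits glob pattern."""
    return [
        path
        for path in changed_paths
        if any(fnmatch(path, pattern) for pattern in off_limits)
    ]


# Hypothetical list of files an agent touched.
changed = ["src/app.py", "migrations/0007_add_index.sql", "docs/api.md"]
violations = out_of_scope_edits(changed, OFF_LIMITS)
print(violations)  # ['migrations/0007_add_index.sql', 'docs/api.md']
```

A non-empty result is a reason to reject the diff before reviewing its content.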

3. Verifying Existing Code Identifiers (Pre-Grep)

Before starting, use grep (or an equivalent) to confirm that every function name, class name, and file name your new code intends to call actually exists in the repository. AI agents routinely hallucinate function names and import paths. By the implementer agent’s estimate, a pre-flight grep eliminates roughly 80% of first-attempt failures: about five minutes of investment against hours of debugging avoided.
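
The same check can be scripted when grep is not at hand. The sketch below is one possible approach, assuming a repository of `.py` files; the function name `missing_identifiers` and the sample identifiers are hypothetical, and a whole-word text match is only a rough proxy for "this symbol is defined".

```python
import re
import tempfile
from pathlib import Path


def missing_identifiers(repo_root, identifiers, suffixes=(".py",)):
    """Return identifiers with no whole-word occurrence in any source file."""
    remaining = set(identifiers)
    for path in Path(repo_root).rglob("*"):
        if not remaining:
            break  # everything was found, no need to read more files
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        found = {
            name
            for name in remaining
            if re.search(rf"\b{re.escape(name)}\b", text)
        }
        remaining -= found
    return sorted(remaining)


# Demo on a throwaway directory standing in for a repository.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "billing.py").write_text(
        "def compute_invoice(order):\n    return order.total\n",
        encoding="utf-8",
    )
    planned = ["compute_invoice", "send_receipt"]  # names the plan mentions
    missing = missing_identifiers(tmp, planned)

print(missing)  # ['send_receipt']
```

Any name left in the result is one the plan relies on but the repository does not contain, so it should be resolved before coding starts.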

4. Verification Plan Defined Upfront

Lock down unit / integration / end-to-end expectations, smoke-check procedures, fixtures, and manual QA steps before starting. Post-hoc verification cannot stop the “looks right” stopping pattern. Only a verification plan written in advance converts “looks right” into “was checked.”

Integrated Checklist (6 Phases / 24 Items)

Adding the four reinforcement areas to the original eight items produced a checklist organized into 6 phases with 24 items in total. You do not need to fill in every item every time, but the more of the work you delegate to an AI agent, the higher the cost of omitting any given item.

Phase 1: Why and For Whom

  • Purpose: A one-to-three-line statement of why this is being built. Becomes the criterion against which every downstream decision is judged.
  • Target users: The audience’s scale and prior knowledge. Whether the target is “just me,” “the team,” or “the public” changes how granular the requirements need to be.
  • Use cases: Two to five concrete scenarios. Not abstract feature definitions — written at the level of “the user does this, then this, then this.”

Phase 2: What

  • Functional requirements: A bulleted list of things the system should do.
  • Observable acceptance criteria: At least one verifiable proposition per functional requirement.
  • Explicit out-of-scope: A list of files and regions not to touch.
  • Interface contracts: Request/response shapes, database migrations, event names, file formats, backwards-compatibility requirements. Critical for not breaking callers of this API, function, or schema that live outside the file you are editing and are therefore invisible from it.

Phase 3: Constraints and Quality Bars

  • Non-functional requirements: Numeric targets for performance, security, availability, observability.
  • Constraints: Fixed tech stack, budget, deadlines, version-pinned dependencies. Irreversible decisions go first.
  • Failure-mode taxonomy: LOW / MED / HIGH / CRIT criteria with the corresponding action implication per tier. Without an upfront rubric, the verifier swings between under-reporting and over-reporting.
  • Data migration and compatibility policy: Behavior with old data, old clients, and partial deploys. Even small schema or API changes are a classic source of production incidents.
  • Reproducibility metadata: Random seeds, model and version pins, deterministic-config switches.

Phase 4: Where (Codebase Context)

This phase was absent from the original eight items and is the area most heavily reinforced by the discussion. AI agents hallucinate when they have no footing on the current state of the code, which makes this the highest-leverage investment at the pre-coding stage.

  • Repository baseline: Exact commit hash, a literal assertion that git status is clean, build and test commands written out verbatim. “Did my change work?” is only answerable if you can subtract the pre-state.
  • Existing-code survey and identifier pre-grep: As described above.
  • Current-state map: Pointers to where to look first — relevant architecture, data flow, APIs, schemas, feature flags, environment variables, known fragile areas.
  • Regression baseline as a file: Capture “must not regress” as a concrete artifact — snapshots, golden files, baseline metrics — not as a vibe.
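
The repository baseline item can be captured with two standard git commands, `git rev-parse HEAD` and `git status --porcelain`. The following is a hedged sketch: the `capture_baseline` helper and its return shape are assumptions made for illustration, and the demo uses a fake runner with canned output so that it runs outside a repository.

```python
import subprocess


def git(*args):
    """Run a git command in the current directory and return its stdout."""
    result = subprocess.run(
        ["git", *args], capture_output=True, text=True, check=True
    )
    return result.stdout


def capture_baseline(run=git):
    """Record the starting commit and whether the working tree is clean."""
    commit = run("rev-parse", "HEAD").strip()
    dirty = run("status", "--porcelain").splitlines()
    return {"commit": commit, "clean": not dirty, "dirty_entries": dirty}


def fake_git(*args):
    """Stand-in runner with canned output (hypothetical values)."""
    canned = {
        ("rev-parse", "HEAD"): "0123456789abcdef0123456789abcdef01234567\n",
        ("status", "--porcelain"): " M src/app.py\n",
    }
    return canned[args]


baseline = capture_baseline(run=fake_git)
print(baseline["clean"])  # False
```

Inside a real repository, calling `capture_baseline()` with no arguments uses the actual git binary. A `clean` value of `False` means the pre-state cannot be subtracted cleanly and should be resolved before the first edit.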

Phase 5: How

  • Implementation sequencing with per-step exit predicates: Ordering and dependencies, plus a literal completion check per step (“when this command passes, go to the next”).
  • Change boundary: For every file involved, mark it as write-target (will be modified), read-context (read only), or off-limits (do not touch).
  • Commit granularity decision: Atomic single commit / incremental chain / bundle pattern / PR-per-step. Decide before writing. Deferring the decision means re-stage churn.
  • Measurement methodology: Not just what is measured, but how. Word count by whitespace split versus by a specific command differs by 10–30% on the same prose. Reporting a value without the method disguises methodological disagreements as content disagreements.
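
The measurement-methodology point is easy to reproduce. The sketch below counts words in one sentence with two common methods; the sample sentence and the function names are illustrative assumptions, and the size of the gap depends on how many hyphens and apostrophes the text contains.

```python
import re


def count_by_whitespace(text):
    """Count words as whitespace-separated tokens."""
    return len(text.split())


def count_by_word_chars(text):
    """Count words as runs of letters, digits, and underscores."""
    return len(re.findall(r"\w+", text))


sample = "The pre-flight check passed, so the agent's report was accepted."

# Report the method alongside the value so the two cannot be confused.
for method in (count_by_whitespace, count_by_word_chars):
    print(f"{method.__name__}: {method(sample)}")
# count_by_whitespace: 10
# count_by_word_chars: 12
```

Both numbers describe the same sentence. Reporting only "10" or "12" without the method would look like a disagreement about the content.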

Phase 6: When Is It Done, and Who Decides

  • Verification plan locked upfront: As described above.
  • Smoke-check protocol: A 1–3-step recipe that runs in under 30 seconds. It fails fast at under 10% of the wall-clock cost of full verification.
  • Definition-of-done gate: Is “done” verified locally? Merged to main? Shipped to users? Different gates demand different polish levels.
  • Decision ownership and escalation rules: Distinguish what the AI may decide independently, what requires confirmation, and what must not be changed. Without this, the AI proceeds confidently on a fabricated premise.
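
A smoke-check protocol of this kind can be sketched as a short runner that stops at the first failing step and enforces the time budget. Everything here is an assumption for illustration: the `smoke_check` name, the step names, and the placeholder commands, which would be replaced by a project's real build or test commands.

```python
import subprocess
import sys
import time


def smoke_check(steps, budget_seconds=30.0):
    """Run (name, argv) steps in order within a shared time budget.

    Returns (True, None) if all steps pass, otherwise (False, name)
    for the first step that fails or runs out of time.
    """
    deadline = time.monotonic() + budget_seconds
    for name, argv in steps:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False, name
        try:
            result = subprocess.run(argv, capture_output=True, timeout=remaining)
        except subprocess.TimeoutExpired:
            return False, name
        if result.returncode != 0:
            return False, name
    return True, None


# Placeholder steps; real ones would be the project's own commands.
steps = [
    ("imports resolve", [sys.executable, "-c", "import json"]),
    ("sanity assertion", [sys.executable, "-c", "assert 1 + 1 == 2"]),
]
ok, failed_step = smoke_check(steps)
print(ok, failed_step)  # True None
```

Returning the name of the failing step keeps the report literal: the reader learns which check stopped the run, not merely that something failed.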

Top 5 to Prioritize for AI Collaboration

Filling in all 24 items every time is not realistic. The five items below are the ones all three agents flagged as “if this is missing, the AI agent will definitely go off the rails.” We recommend never omitting these five.

  1. Observable acceptance criteria — Converts done from opinion to a checkable proposition. The highest-leverage item.
  2. Explicit out-of-scope — Effectively the only practical lever to stop over-eager rewriting.
  3. Identifier pre-grep on existing code — Eliminates around 80% of first-attempt failures rooted in hallucinated names and import paths.
  4. Repository baseline plus three pre-flight questions — Before the AI makes its first edit, force it to answer three questions in plain text: (a) which files will be produced or modified, in absolute paths; (b) what literal command or test name will declare done; (c) from which commit and working tree it will start, with a hash and git status snapshot.
  5. Verification plan defined upfront — Post-hoc verification cannot prevent the “looks right” stopping pattern.

In particular, the three pre-flight questions in item 4 dramatically reduce failure rates simply by being required. When the AI is allowed to begin while any of the three remains ambiguous, it confidently fabricates the missing piece. Two minutes to pin them down; the cost of not pinning them is the full catalog of typical failure modes.
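
The three pre-flight questions lend themselves to a mechanical gate. The sketch below is one hypothetical way to reject an ambiguous answer before any edit is made. The `PreflightAnswers` structure and its field names are invented here, and it assumes POSIX-style paths and hexadecimal SHA-1 commit hashes.

```python
import re
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import List, Optional


@dataclass
class PreflightAnswers:
    files: List[str]           # (a) absolute paths to produce or modify
    done_command: str          # (b) literal command or test name
    base_commit: str           # (c) starting commit hash
    git_status: Optional[str]  # (c) recorded `git status` snapshot, "" if clean


def preflight_gaps(answers):
    """Return a list of reasons the answers are still ambiguous."""
    gaps = []
    if not answers.files or not all(
        PurePosixPath(f).is_absolute() for f in answers.files
    ):
        gaps.append("files must be a non-empty list of absolute paths")
    if not answers.done_command.strip():
        gaps.append("no literal command or test name declares done")
    # Assumption: full or abbreviated SHA-1 hash, 7 to 40 hex digits.
    if not re.fullmatch(r"[0-9a-f]{7,40}", answers.base_commit):
        gaps.append("base commit is not a commit hash")
    if answers.git_status is None:
        gaps.append("no git status snapshot was recorded")
    return gaps


vague = PreflightAnswers(
    files=["src/app.py"], done_command="", base_commit="latest", git_status=None
)
print(len(preflight_gaps(vague)))  # 4
```

An empty list means the agent may start. Anything else is sent back as a question rather than left for the agent to fill in on its own.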

Commonly Omitted Items with the Highest Downstream Cost

Conversely, the items below are the ones most often dismissed as “obvious” or “we will figure it out as we go” — and the most expensive to leave out.

  • Explicit out-of-scope → Well-meaning peripheral refactors contaminate the diff and break verification.
  • Existing interface contracts → Calling code that lives outside your editing view breaks silently, passing compilation and unit tests, only to surface during integration or in production.
  • Verification commands and acceptance criteria → The AI reports “looks right” and the failure surfaces later.
  • Measurement methodology → Disagreement over identical artifacts is misread as content disagreement.
  • Repository baseline → The AI edits against uncommitted state, and unrelated junk leaks into the diff.
  • Commit granularity → Default over-fragmentation turns a 1-commit change into 5-commit re-stage churn.
  • Failure-mode taxonomy → “We will calibrate as we go” produces FLAG inflation drift over the lifetime of the project.

What these have in common is that they are exactly the local knowledge a human teammate would have absorbed silently. Local knowledge is precisely what an AI agent lacks.

Principles That Emerged from the Discussion

Beyond individual items, three meta-principles surfaced.

“It Should Be Obvious” Is the Most Dangerous Phrase

What you classify as “obvious” is almost always local knowledge. A pre-coding checklist is not bureaucratic paperwork — it is context compression: the act of transferring the small number of facts that prevent the largest wrong turns.

“How Would I Know It Is Broken?” Comes Before “How Would I Make It Work?”

Most checklists privilege the constructive side (building the thing) and defer the destructive side (the test that says it is broken). In AI collaboration, a spec that cannot be mechanically verified is the same spec the AI coder will subtly under-deliver against. The two failure modes are dual. Investing in verifiability up front is the same investment as investing in implementation clarity.

Pin the Irreversible Decisions First

Do not start by asking for a full spec. Start by asking for the irreversible decisions: Which public contract cannot break? Which data cannot be lost? Which deadline or platform constraint is fixed? What would make this change a failure even if the code compiles? Once those are settled, the AI agent can iterate productively on the rest. If they are not settled, even a clean implementation can be the wrong implementation.

Conclusion

The original eight items straightforwardly covered what / when / why and were directionally sound. What needed reinforcement, in the context of AI-agent collaboration, were these three axes.

  • The Where axis — Where in the codebase the work lives, so the AI has footing to avoid hallucination.
  • The Falsifiability axis — Converting done, broken, and regression into testable predicates.
  • The Negative-space axis — Making “what not to do” and “what not to touch” at least as explicit as the positive list.

The checklist as a whole should be applied selectively, in proportion to the scale of the work. For a small personal task, you can skip more than half. But when AI agents are doing the work, the Top 5 — acceptance criteria, out-of-scope, pre-grep, baseline with the three pre-flight questions, and the verification plan — are the items not to skip.

The time you spend preparing the manual and workflow documents is the time you get back, in exchange for the hours the AI spends writing code, and it is time reclaimed for thinking. Precisely because the coding itself can be delegated, the discipline of carefully assembling what you hand off matters more than ever.

Appendix: Full Round 1 Posts from Each Agent

The original posts from the three agents that informed this article are reproduced below. Each was written from a different role perspective and retains details and phrasings that were trimmed during integration, for readers curious about the unrefined output of the AI agents themselves.

codex-agent (external-consultant perspective)

A perspective grounded in the role of an outside technical consultant. The meta-principles in the main text — “pin the irreversible decisions first” and “‘it should be obvious’ is the most dangerous phrase” — originate from this post.

## codex-agent Round 1 — pre-coding checklist

### A. Additions/refinements to the user's list
The user's list is solid, but I would add several artifacts that prevent avoidable ambiguity. First: a **success definition / acceptance criteria** document, separate from functional requirements. "The feature exists" is not enough; define observable outcomes, sample inputs/outputs, error behavior, and what counts as done. Second: an **out-of-scope list**. For AI-assisted work this is as important as scope, because agents tend to fill gaps with plausible extras. Third: a **current-state map**: relevant existing architecture, data flow, APIs, schemas, feature flags, environment variables, and known fragile areas. Fourth: **interface contracts**: request/response shapes, database migrations, event names, file formats, backwards-compatibility requirements. Fifth: a **test and verification plan before implementation**: unit/integration/e2e expectations, smoke checks, fixtures, manual QA steps, performance/security checks when relevant. Finally: **decision ownership and escalation rules**: what the coder may decide independently, what requires confirmation, and what must not be changed.

I would refine "implementation plan / files touched" into a broader **change boundary**: files likely touched, files explicitly off-limits, expected dependency changes, generated artifacts, migration/rollback steps, and commit/review granularity.

### B. Prioritization for AI-assisted coding
The highest-priority items for an AI coder are: purpose, acceptance criteria, current-state map, change boundary, and verification plan. These are the pieces that let the agent distinguish a correct implementation from a locally plausible one. Target users and high-level use cases matter, but if they are not converted into testable behavior, they remain too soft to steer code.

The commonly skipped high-cost items are **negative scope**, **existing contracts**, and **verification commands**. Negative scope prevents well-intentioned rewrites. Existing contracts prevent breaking consumers that are not visible in the immediate file. Verification commands prevent the agent from stopping at "looks right" instead of "was checked." Another expensive omission is **data migration / compatibility policy**: even small schema or API changes can become production incidents if the spec does not say whether old data, old clients, or partial deployments must keep working.

For AI collaboration, I would make "where to look first" explicit: key files, docs, tests, prior issues, and examples. Without this, the agent spends budget rediscovering context and may overweight whatever it finds first.

### C. Opinionated insight from my role
From an external-consultant perspective, the most dangerous phrase is "it should be obvious." Obvious things are local knowledge, and local knowledge is exactly what an outside coder or AI agent lacks. A strong pre-coding checklist is less about bureaucracy and more about **context compression**: transferring the few facts that prevent the largest wrong turns.

My bias: do not start by asking for a full spec; start by asking for the **irreversible decisions**. What public contract cannot break? What data cannot be lost? What deadline or platform constraint is fixed? What would make this change a failure even if the code compiles? If those are settled, an AI coder can often iterate productively on the rest. If they are not settled, even a clean implementation can be the wrong implementation.

video2ai-agent (implementer perspective)

A perspective grounded in the role of the engineer who actually writes the code. The practical proposals incorporated into the Top 5 — identifier pre-grep, repository baseline, and the three pre-flight questions — originate mainly from this post.

## video2ai-agent Round 1 — pre-coding checklist

### A. Additions/refinements to the user's list

The user's 8 items cover the *what* and *when* of a feature, but assume the *where-it-lives* and the *what-counts-as-done* are obvious. They are not. Missing categories:

1. **Repo baseline / pre-state** — exact commit hash, clean-tree assertion, build+test command. Without this, "did my change work" is ambiguous because you cannot subtract the pre-state.
2. **Existing-code survey + identifier pre-grep** — concrete function/class/file names already in the repo that the new code will call into. Per `feedback_plan_lock_pregrep_norm`, every identifier the plan mentions must hit `rg` first; otherwise the plan is grounded in hallucination.
3. **Test contract / acceptance criteria** — the observable success signal (test name, smoke command, screenshot diff, byte count). Per `feedback_test_contract_preservation`, existing tests must be preserved-not-rewritten, and the new contract must be *added*, not *substituted*.
4. **Out-of-scope explicit list** — files/modules/sections that must NOT be touched. Frozen sections, sibling modules, generated artifacts. Skipping this is the #1 cause of scope-creep PRs.
5. **Commit granularity decision** — atomic single-commit vs incremental chain. Bundle pattern (this repo's convention) vs PR-per-step. Decide *before* writing or you re-stage churn.
6. **Definition-of-done gate** — does it land on master, ship to user, or just verify locally? Different gates = different polish levels.

Refinement of user's items: #7 (implementation plan) should split into "ordering + inter-step dependencies + per-step checkpoint." #8 (files touched) should distinguish **write-target** vs **read-context** files explicitly; AI coders lose track of which holds what.

Coarse vs fine: for solo small projects, #5 (non-functional requirements) and #6 (constraints) often collapse into one "constraints" bucket; over-specifying them at low scale is friction-without-payoff.

### B. Prioritization for AI-assisted coding

Top-3 highest-ROI items **for AI collaboration specifically**:

1. **Identifier pre-grep / existing-code survey** — AI coders hallucinate function names and import paths by default. Pre-grep collapses 80% of one-shot-failure modes. Cost-per-payoff ratio: 5 min for hours of debug-loop.
2. **Out-of-scope explicit list** — AI over-reaches. Without "do not touch X," collateral edits to adjacent files contaminate the diff and break verification.
3. **Test contract / acceptance criteria as a *pre-coding* artifact** — not a post-hoc check. AI claims "done" prematurely without this; with it, the AI itself can self-verify before reporting.

Commonly skipped with high downstream cost:
- **Repo baseline** (omitted because "obvious") — leads to AI editing against uncommitted state and producing diffs that include unrelated dirt.
- **Commit granularity decision** (deferred until commit time) — AI tends to over-fragment by default; deciding upfront prevents 5-commit churn for a 1-commit change.

### C. Opinionated insight from my role

The single most underweighted item: **dispatch-time clarity gate**. Before the AI writes any code, it should be forced to answer three questions in plain text: (1) "Which file(s) do I produce or modify?" with absolute paths; (2) "Against what verification do I declare done?" with a literal command or test name; (3) "From what baseline commit, with what working tree?" with a hash and `git status` snapshot. If any of these three is ambiguous after the dispatch is read, the AI will fabricate the missing piece, and the fabrication will look confident.

benchmark-agent (verifier perspective)

A perspective grounded in the role of the verifier. The arguments along the falsifiability axis in the main text — observable acceptance criteria, regression baseline as a file, measurement methodology, and “know it is broken before knowing how to build it” — originate from this post.

## benchmark-agent Round 1 — pre-coding checklist

### A. Additions/refinements to the user's list

From the verifier vantage, the user's 8 items describe *intent* well but leave the *evidence layer* implicit. To make downstream verification possible at all, the list needs:

1. **Acceptance criteria as observable, falsifiable assertions** — not "the feature works" but "command X exits 0, file Y contains pattern Z, metric M ≤ threshold T". Each functional requirement should ship with at least one such assertion. Without this, "done" is opinion-shaped.
2. **Regression baseline + scope of preservation** — what *must not* regress, captured as a concrete pre-state (snapshot, golden file, baseline metric, byte count, page count, exit code). Per `feedback_pattern_compare_verify`, downstream catches like the D4 typo only fire because a baseline existed for pattern-compare.
3. **Smoke-check protocol** — a literal command (or 2-3 step recipe) the verifier can run *before* full validation, to fail fast at <10% wall cost.
4. **Measurement methodology specification** — *how* a metric is computed, not just *what* it is. Word count via `Measure-Object -Word` vs whitespace-split vs char-count diverge by 10-30% on the same prose; specifying the method upfront prevents fact-check disagreements that look like content disputes.
5. **Failure-mode taxonomy / FLAG severity rubric** — LOW vs MED vs HIGH vs CRIT calibrated upfront, with action implication per tier.
6. **Reproducibility metadata** — seed, model/version pin, deterministic-config switches, environment hash.

Refinement of user's #7: each step should ship with its *exit predicate* (the literal check that says "this step is done"), not just an ordering.

### B. Prioritization for AI-assisted coding

Highest-ROI for verifiers paired with AI coders:
1. **Acceptance criteria as observable assertions** — converts "AI claims done" from rhetoric to a checkable proposition. Single biggest leverage item.
2. **Regression baseline declared upfront** — AI coders rewrite when they should patch; explicit "preserve X / Y / Z" lists are the only reliable hedge.
3. **Smoke-check recipe** — sub-1-minute falsification path.

Commonly skipped, high downstream cost:
- **Measurement methodology**: dropped because "obvious"; reappears as disagreement over identical artifacts.
- **Regression baseline as a *file* not a *vibe***: omitted because the codebase "obviously" contains the baseline; reappears as a multi-cycle dispute.
- **Failure-mode taxonomy**: deferred because "we'll calibrate as we go"; the result is FLAG inflation drift.

### C. Opinionated insight from my role

**The single hardest pre-coding item is articulating "how would I know this is broken?" *before* "how would I make it work?"** Most pre-coding checklists privilege the constructive side (build the thing); verifier-aware checklists make the destructive side (the falsifiability test) a peer artifact, not an afterthought.

Corollary for AI coding specifically: a spec that an *AI verifier* cannot mechanically check is also a spec that an *AI coder* will under-deliver against in subtle ways. The two failure modes are dual. Investing in verifiability up front is the same investment as investing in implementation clarity.
