Initial import from garrytan/gstack@026751e (main snapshot via local relay)
Some checks failed
Workflow Lint / actionlint (push) Has been cancelled
Build CI Image / build (push) Has been cancelled
Skill Docs Freshness / check-freshness (push) Has been cancelled
Periodic Evals / build-image (push) Has been cancelled
Periodic Evals / evals (map[file:test/codex-e2e.test.ts name:e2e-codex]) (push) Has been cancelled
Periodic Evals / evals (map[file:test/gemini-e2e.test.ts name:e2e-gemini]) (push) Has been cancelled
Periodic Evals / evals (map[file:test/skill-e2e-design.test.ts name:e2e-design]) (push) Has been cancelled
Periodic Evals / evals (map[file:test/skill-e2e-plan.test.ts name:e2e-plan]) (push) Has been cancelled
Periodic Evals / evals (map[file:test/skill-e2e-qa-bugs.test.ts name:e2e-qa-bugs]) (push) Has been cancelled
Periodic Evals / evals (map[file:test/skill-e2e-qa-workflow.test.ts name:e2e-qa-workflow]) (push) Has been cancelled
Periodic Evals / evals (map[file:test/skill-e2e-review.test.ts name:e2e-review]) (push) Has been cancelled
Periodic Evals / evals (map[file:test/skill-e2e-workflow.test.ts name:e2e-workflow]) (push) Has been cancelled
Periodic Evals / evals (map[file:test/skill-routing-e2e.test.ts name:e2e-routing]) (push) Has been cancelled

Source: https://github.com/garrytan/gstack/commit/026751e
This commit is contained in:
Rocky
2026-05-19 21:18:17 +02:00
commit 834c6db075
797 changed files with 267839 additions and 0 deletions

1742
review/SKILL.md Normal file

File diff suppressed because it is too large Load Diff

296
review/SKILL.md.tmpl Normal file
View File

@@ -0,0 +1,296 @@
---
name: review
preamble-tier: 4
version: 1.0.0
description: |
Pre-landing PR review. Analyzes diff against the base branch for SQL safety, LLM trust
boundary violations, conditional side effects, and other structural issues. Use when
asked to "review this PR", "code review", "pre-landing review", or "check my diff".
Proactively suggest when the user is about to merge or land code changes. (gstack)
allowed-tools:
- Bash
- Read
- Edit
- Write
- Grep
- Glob
- Agent
- AskUserQuestion
- WebSearch
triggers:
- review this pr
- code review
- check my diff
- pre-landing review
---
{{PREAMBLE}}
{{BASE_BRANCH_DETECT}}
# Pre-Landing PR Review
You are running the `/review` workflow. Analyze the current branch's diff against the base branch for structural issues that tests don't catch.
---
## Step 1: Check branch
1. Run `git branch --show-current` to get the current branch.
2. If on the base branch, output: **"Nothing to review — you're on the base branch or have no changes against it."** and stop.
3. Run `git fetch origin <base> --quiet && git diff origin/<base> --stat` to check if there's a diff. If no diff, output the same message and stop.
---
{{SCOPE_DRIFT}}
{{PLAN_COMPLETION_AUDIT_REVIEW}}
## Step 2: Read the checklist
Read `.claude/skills/review/checklist.md`.
**If the file cannot be read, STOP and report the error.** Do not proceed without the checklist.
---
## Step 2.5: Check for Greptile review comments
Read `.claude/skills/review/greptile-triage.md` and follow the fetch, filter, classify, and **escalation detection** steps.
**If no PR exists, `gh` fails, API returns an error, or there are zero Greptile comments:** Skip this step silently. Greptile integration is additive — the review works without it.
**If Greptile comments are found:** Store the classifications (VALID & ACTIONABLE, VALID BUT ALREADY FIXED, FALSE POSITIVE, SUPPRESSED) — you will need them in Step 5.
---
## Step 3: Get the diff
Fetch the latest base branch to avoid false positives from stale local state:
```bash
git fetch origin <base> --quiet
```
Run `git diff origin/<base>` to get the full diff. This includes both committed and uncommitted changes against the latest base branch.
## Step 3.4: Workspace-aware queue status (advisory)
Check whether this PR's claimed VERSION still points at a free slot in the queue. Advisory only — never blocks review; just informs the reviewer about landing-order risk.
```bash
BRANCH_VERSION=$(git show HEAD:VERSION 2>/dev/null | tr -d '\r\n[:space:]' || echo "")
BASE_BRANCH=$(gh pr view --json baseRefName -q .baseRefName 2>/dev/null || echo main)
BASE_VERSION=$(git show origin/$BASE_BRANCH:VERSION 2>/dev/null | tr -d '\r\n[:space:]' || echo "")
QUEUE_JSON=$(bun run bin/gstack-next-version \
--base "$BASE_BRANCH" \
--bump patch \
--current-version "$BASE_VERSION" 2>/dev/null || echo '{"offline":true}')
NEXT_SLOT=$(echo "$QUEUE_JSON" | jq -r '.version // empty')
CLAIMED_COUNT=$(echo "$QUEUE_JSON" | jq -r '.claimed | length // 0')
OFFLINE=$(echo "$QUEUE_JSON" | jq -r '.offline // false')
```
- If `OFFLINE=true`: skip this section (no signal to report).
- Otherwise, include ONE line in the review output: `Version claimed: v<BRANCH_VERSION>. Queue: <CLAIMED_COUNT> PR(s) ahead. <VERDICT>` where VERDICT is either `Slot free` (if `BRANCH_VERSION >= NEXT_SLOT`) or `⚠ queue moved — rerun /ship to reconcile v<BRANCH_VERSION> → v<NEXT_SLOT>`.
---
## Step 3.5: Slop scan (advisory)
Run a slop scan on changed files to catch AI code quality issues (empty catches,
redundant `return await`, overcomplicated abstractions):
```bash
bun run slop:diff origin/<base> 2>/dev/null || true
```
If findings are reported, include them in the review output as an informational
diagnostic. Slop findings are advisory, never blocking. If slop:diff is not
available (e.g., slop-scan not installed), skip this step silently.
---
{{LEARNINGS_SEARCH}}
## Step 4: Critical pass (core review)
Apply the CRITICAL categories from the checklist against the diff:
SQL & Data Safety, Race Conditions & Concurrency, LLM Output Trust Boundary, Shell Injection, Enum & Value Completeness.
Also apply the remaining INFORMATIONAL categories that are still in the checklist (Async/Sync Mixing, Column/Field Name Safety, LLM Prompt Issues, Type Coercion, View/Frontend, Time Window Safety, Completeness Gaps, Distribution & CI/CD).
**Enum & Value Completeness requires reading code OUTSIDE the diff.** When the diff introduces a new enum value, status, tier, or type constant, use Grep to find all files that reference sibling values, then Read those files to check if the new value is handled. This is the one category where within-diff review is insufficient.
**Search-before-recommending:** When recommending a fix pattern (especially for concurrency, caching, auth, or framework-specific behavior):
- Verify the pattern is current best practice for the framework version in use
- Check if a built-in solution exists in newer versions before recommending a workaround
- Verify API signatures against current docs (APIs change between versions)
Takes seconds, prevents recommending outdated patterns. If WebSearch is unavailable, note it and proceed with in-distribution knowledge.
Follow the output format specified in the checklist. Respect the suppressions — do NOT flag items listed in the "DO NOT flag" section.
{{CONFIDENCE_CALIBRATION}}
---
{{REVIEW_ARMY}}
---
## Step 5: Fix-First Review
**Every finding gets action — not just critical ones.**
{{CROSS_REVIEW_DEDUP}}
### Step 5a: Classify each finding
For each finding, classify as AUTO-FIX or ASK per the Fix-First Heuristic in
checklist.md. Critical findings lean toward ASK; informational findings lean
toward AUTO-FIX.
**Test stub override:** Any finding that has a `test_stub` field (generated by a specialist)
is reclassified as ASK regardless of its original classification. When presenting the ASK
item, show the proposed test file path and the test code. The user approves or skips the
test creation. If approved, write the fix + test file. Derive the test file path from
the finding's `path` using project conventions (`spec/` for RSpec, `__tests__/` for
Jest/Vitest, `test_` prefix for pytest, `_test.go` suffix for Go). If the test file
already exists, append the new test. Output: `[FIXED + TEST] [file:line] Problem -> fix + test at [test_path]`
### Step 5b: Auto-fix all AUTO-FIX items
Apply each fix directly. For each one, output a one-line summary:
`[AUTO-FIXED] [file:line] Problem → what you did`
### Step 5c: Batch-ask about ASK items
If there are ASK items remaining, present them in ONE AskUserQuestion:
- List each item with a number, the severity label, the problem, and a recommended fix
- For each item, provide options: A) Fix as recommended, B) Skip
- Include an overall RECOMMENDATION
Example format:
```
I auto-fixed 5 issues. 2 need your input:
1. [CRITICAL] app/models/post.rb:42 — Race condition in status transition
Fix: Add `WHERE status = 'draft'` to the UPDATE
→ A) Fix B) Skip
2. [INFORMATIONAL] app/services/generator.rb:88 — LLM output not type-checked before DB write
Fix: Add JSON schema validation
→ A) Fix B) Skip
RECOMMENDATION: Fix both — #1 is a real race condition, #2 prevents silent data corruption.
```
If 3 or fewer ASK items, you may use individual AskUserQuestion calls instead of batching.
### Step 5d: Apply user-approved fixes
Apply fixes for items where the user chose "Fix." Output what was fixed.
If no ASK items exist (everything was AUTO-FIX), skip the question entirely.
### Verification of claims
Before producing the final review output:
- If you claim "this pattern is safe" → cite the specific line proving safety
- If you claim "this is handled elsewhere" → read and cite the handling code
- If you claim "tests cover this" → name the test file and method
- Never say "likely handled" or "probably tested" — verify or flag as unknown
**Rationalization prevention:** "This looks fine" is not a finding. Either cite evidence it IS fine, or flag it as unverified.
### Greptile comment resolution
After outputting your own findings, if Greptile comments were classified in Step 2.5:
**Include a Greptile summary in your output header:** `+ N Greptile comments (X valid, Y fixed, Z FP)`
Before replying to any comment, run the **Escalation Detection** algorithm from greptile-triage.md to determine whether to use Tier 1 (friendly) or Tier 2 (firm) reply templates.
1. **VALID & ACTIONABLE comments:** These are included in your findings — they follow the Fix-First flow (auto-fixed if mechanical, batched into ASK if not) (A: Fix it now, B: Acknowledge, C: False positive). If the user chooses A (fix), reply using the **Fix reply template** from greptile-triage.md (include inline diff + explanation). If the user chooses C (false positive), reply using the **False Positive reply template** (include evidence + suggested re-rank), save to both per-project and global greptile-history.
2. **FALSE POSITIVE comments:** Present each one via AskUserQuestion:
- Show the Greptile comment: file:line (or [top-level]) + body summary + permalink URL
- Explain concisely why it's a false positive
- Options:
- A) Reply to Greptile explaining why this is incorrect (recommended if clearly wrong)
- B) Fix it anyway (if low-effort and harmless)
- C) Ignore — don't reply, don't fix
If the user chooses A, reply using the **False Positive reply template** from greptile-triage.md (include evidence + suggested re-rank), save to both per-project and global greptile-history.
3. **VALID BUT ALREADY FIXED comments:** Reply using the **Already Fixed reply template** from greptile-triage.md — no AskUserQuestion needed:
- Include what was done and the fixing commit SHA
- Save to both per-project and global greptile-history
4. **SUPPRESSED comments:** Skip silently — these are known false positives from previous triage.
---
## Step 5.5: TODOS cross-reference
Read `TODOS.md` in the repository root (if it exists). Cross-reference the PR against open TODOs:
- **Does this PR close any open TODOs?** If yes, note which items in your output: "This PR addresses TODO: <title>"
- **Does this PR create work that should become a TODO?** If yes, flag it as an informational finding.
- **Are there related TODOs that provide context for this review?** If yes, reference them when discussing related findings.
If TODOS.md doesn't exist, skip this step silently.
---
## Step 5.6: Documentation staleness check
Cross-reference the diff against documentation files. For each `.md` file in the repo root (README.md, ARCHITECTURE.md, CONTRIBUTING.md, CLAUDE.md, etc.):
1. Check if code changes in the diff affect features, components, or workflows described in that doc file.
2. If the doc file was NOT updated in this branch but the code it describes WAS changed, flag it as an INFORMATIONAL finding:
"Documentation may be stale: [file] describes [feature/component] but code changed in this branch. Consider running `/document-release`."
This is informational only — never critical. The fix action is `/document-release`.
If no documentation files exist, skip this step silently.
---
{{ADVERSARIAL_STEP}}
## Step 5.8: Persist Eng Review result
After all review passes complete, persist the final `/review` outcome so `/ship` can
recognize that Eng Review was run on this branch.
Run:
```bash
~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"review","timestamp":"TIMESTAMP","status":"STATUS","issues_found":N,"critical":N,"informational":N,"quality_score":SCORE,"specialists":SPECIALISTS_JSON,"findings":FINDINGS_JSON,"commit":"COMMIT"}'
```
Substitute:
- `TIMESTAMP` = ISO 8601 datetime
- `STATUS` = `"clean"` if there are no remaining unresolved findings after Fix-First handling and adversarial review, otherwise `"issues_found"`
- `issues_found` = total remaining unresolved findings
- `critical` = remaining unresolved critical findings
- `informational` = remaining unresolved informational findings
- `quality_score` = the PR Quality Score computed in Step 4.6 (e.g., 7.5). If specialists were skipped (small diff), use `10.0`
- `specialists` = the per-specialist stats object compiled in Step 4.6. Each specialist that was considered gets an entry: `{"dispatched":true/false,"findings":N,"critical":N,"informational":N}` if dispatched, or `{"dispatched":false,"reason":"scope|gated"}` if skipped. Include Design specialist. Example: `{"testing":{"dispatched":true,"findings":2,"critical":0,"informational":2},"security":{"dispatched":false,"reason":"scope"}}`
- `findings` = array of per-finding records from Step 5. For each finding (from critical pass and specialists), include: `{"fingerprint":"path:line:category","severity":"CRITICAL|INFORMATIONAL","action":"ACTION"}`. ACTION is `"auto-fixed"` (Step 5b), `"fixed"` (user approved in Step 5d), or `"skipped"` (user chose Skip in Step 5c). Suppressed findings from Step 5.0 are NOT included (they were already recorded in a prior review entry).
- `COMMIT` = output of `git rev-parse --short HEAD`
{{LEARNINGS_LOG}}
If the review exits early before a real review completes (for example, no diff against the base branch), do **not** write this entry.
## Important Rules
- **Read the FULL diff before commenting.** Do not flag issues already addressed in the diff.
- **Fix-first, not read-only.** AUTO-FIX items are applied directly. ASK items are only applied after user approval. Never commit, push, or create PRs — that's /ship's job.
- **Be terse.** One line problem, one line fix. No preamble.
- **Only flag real problems.** Skip anything that's fine.
- **Use Greptile reply templates from greptile-triage.md.** Every reply includes evidence. Never post vague replies.

62
review/TODOS-format.md Normal file
View File

@@ -0,0 +1,62 @@
# TODOS.md Format Reference
Shared reference for the canonical TODOS.md format. Referenced by `/ship` (Step 5.5) and `/plan-ceo-review` (TODOS.md updates section) to ensure consistent TODO item structure.
---
## File Structure
```markdown
# TODOS
## <Skill/Component> ← e.g., ## Browse, ## Ship, ## Review, ## Infrastructure
<items sorted P0 first, then P1, P2, P3, P4>
## Completed
<finished items with completion annotation>
```
**Sections:** Organize by skill or component (`## Browse`, `## Ship`, `## Review`, `## QA`, `## Retro`, `## Infrastructure`). Within each section, sort items by priority (P0 at top).
---
## TODO Item Format
Each item is an H3 under its section:
```markdown
### <Title>
**What:** One-line description of the work.
**Why:** The concrete problem it solves or value it unlocks.
**Context:** Enough detail that someone picking this up in 3 months understands the motivation, the current state, and where to start.
**Effort:** S / M / L / XL
**Priority:** P0 / P1 / P2 / P3 / P4
**Depends on:** <prerequisites, or "None">
```
**Required fields:** What, Why, Context, Effort, Priority
**Optional fields:** Depends on, Blocked by
---
## Priority Definitions
- **P0** — Blocking: must be done before next release
- **P1** — Critical: should be done this cycle
- **P2** — Important: do when P0/P1 are clear
- **P3** — Nice-to-have: revisit after adoption/usage data
- **P4** — Someday: good idea, no urgency
---
## Completed Item Format
When an item is completed, move it to the `## Completed` section preserving its original content and appending:
```markdown
**Completed:** vX.Y.Z (YYYY-MM-DD)
```

180
review/checklist.md Normal file
View File

@@ -0,0 +1,180 @@
# Pre-Landing Review Checklist
## Instructions
Review the `git diff origin/main` output for the issues listed below. Be specific — cite `file:line` and suggest fixes. Skip anything that's fine. Only flag real problems.
**Two-pass review:**
- **Pass 1 (CRITICAL):** Run SQL & Data Safety, Race Conditions, LLM Output Trust Boundary, Shell Injection, and Enum Completeness first. Highest severity.
- **Pass 2 (INFORMATIONAL):** Run remaining categories below. Lower severity but still actioned.
- **Specialist categories (handled by parallel subagents, NOT this checklist):** Test Gaps, Dead Code, Magic Numbers, Conditional Side Effects, Performance & Bundle Impact, Crypto & Entropy. See `review/specialists/` for these.
All findings get action via Fix-First Review: obvious mechanical fixes are applied automatically,
genuinely ambiguous issues are batched into a single user question.
**Output format:**
```
Pre-Landing Review: N issues (X critical, Y informational)
**AUTO-FIXED:**
- [file:line] Problem → fix applied
**NEEDS INPUT:**
- [file:line] Problem description
Recommended fix: suggested fix
```
If no issues found: `Pre-Landing Review: No issues found.`
Be terse. For each issue: one line describing the problem, one line with the fix. No preamble, no summaries, no "looks good overall."
---
## Review Categories
### Pass 1 — CRITICAL
#### SQL & Data Safety
- String interpolation in SQL (even if values are `.to_i`/`.to_f` — use parameterized queries (Rails: sanitize_sql_array/Arel; Node: prepared statements; Python: parameterized queries))
- TOCTOU races: check-then-set patterns that should be atomic `WHERE` + `update_all`
- Bypassing model validations for direct DB writes (Rails: update_column; Django: QuerySet.update(); Prisma: raw queries)
- N+1 queries: Missing eager loading (Rails: .includes(); SQLAlchemy: joinedload(); Prisma: include) for associations used in loops/views
#### Race Conditions & Concurrency
- Read-check-write without uniqueness constraint or catch duplicate key error and retry (e.g., `where(hash:).first` then `save!` without handling concurrent insert)
- find-or-create without unique DB index — concurrent calls can create duplicates
- Status transitions that don't use atomic `WHERE old_status = ? UPDATE SET new_status` — concurrent updates can skip or double-apply transitions
- Unsafe HTML rendering (Rails: .html_safe/raw(); React: dangerouslySetInnerHTML; Vue: v-html; Django: |safe/mark_safe) on user-controlled data (XSS)
#### LLM Output Trust Boundary
- LLM-generated values (emails, URLs, names) written to DB or passed to mailers without format validation. Add lightweight guards (`EMAIL_REGEXP`, `URI.parse`, `.strip`) before persisting.
- Structured tool output (arrays, hashes) accepted without type/shape checks before database writes.
- LLM-generated URLs fetched without allowlist — SSRF risk if URL points to internal network (Python: `urllib.parse.urlparse` → check hostname against blocklist before `requests.get`/`httpx.get`)
- LLM output stored in knowledge bases or vector DBs without sanitization — stored prompt injection risk
#### Shell Injection (Python-specific)
- `subprocess.run()` / `subprocess.call()` / `subprocess.Popen()` with `shell=True` AND f-string/`.format()` interpolation in the command string — use argument arrays instead
- `os.system()` with variable interpolation — replace with `subprocess.run()` using argument arrays
- `eval()` / `exec()` on LLM-generated code without sandboxing
#### Enum & Value Completeness
When the diff introduces a new enum value, status string, tier name, or type constant:
- **Trace it through every consumer.** Read (don't just grep — READ) each file that switches on, filters by, or displays that value. If any consumer doesn't handle the new value, flag it. Common miss: adding a value to the frontend dropdown but the backend model/compute method doesn't persist it.
- **Check allowlists/filter arrays.** Search for arrays or `%w[]` lists containing sibling values (e.g., if adding "revise" to tiers, find every `%w[quick lfg mega]` and verify "revise" is included where needed).
- **Check `case`/`if-elsif` chains.** If existing code branches on the enum, does the new value fall through to a wrong default?
To do this: use Grep to find all references to the sibling values (e.g., grep for "lfg" or "mega" to find all tier consumers). Read each match. This step requires reading code OUTSIDE the diff.
### Pass 2 — INFORMATIONAL
#### Async/Sync Mixing (Python-specific)
- Synchronous `subprocess.run()`, `open()`, `requests.get()` inside `async def` endpoints — blocks the event loop. Use `asyncio.to_thread()`, `aiofiles`, or `httpx.AsyncClient` instead.
- `time.sleep()` inside async functions — use `asyncio.sleep()`
- Sync DB calls in async context without `run_in_executor()` wrapping
#### Column/Field Name Safety
- Verify column names in ORM queries (`.select()`, `.eq()`, `.gte()`, `.order()`) against actual DB schema — wrong column names silently return empty results or throw swallowed errors
- Check `.get()` calls on query results use the column name that was actually selected
- Cross-reference with schema documentation when available
#### Dead Code & Consistency (version/changelog only — other items handled by maintainability specialist)
- Version mismatch between PR title and VERSION/CHANGELOG files
- CHANGELOG entries that describe changes inaccurately (e.g., "changed from X to Y" when X never existed)
#### LLM Prompt Issues
- 0-indexed lists in prompts (LLMs reliably return 1-indexed)
- Prompt text listing available tools/capabilities that don't match what's actually wired up in the `tool_classes`/`tools` array
- Word/token limits stated in multiple places that could drift
#### Completeness Gaps
- Shortcut implementations where the complete version would cost <30 minutes CC time (e.g., partial enum handling, incomplete error paths, missing edge cases that are straightforward to add)
- Options presented with only human-team effort estimates — should show both human and CC+gstack time
- Test coverage gaps where adding the missing tests is a "lake" not an "ocean" (e.g., missing negative-path tests, missing edge case tests that mirror happy-path structure)
- Features implemented at 80-90% when 100% is achievable with modest additional code
#### Time Window Safety
- Date-key lookups that assume "today" covers 24h — report at 8am PT only sees midnight→8am under today's key
- Mismatched time windows between related features — one uses hourly buckets, another uses daily keys for the same data
#### Type Coercion at Boundaries
- Values crossing Ruby→JSON→JS boundaries where type could change (numeric vs string) — hash/digest inputs must normalize types
- Hash/digest inputs that don't call `.to_s` or equivalent before serialization — `{ cores: 8 }` vs `{ cores: "8" }` produce different hashes
#### View/Frontend
- Inline `<style>` blocks in partials (re-parsed every render)
- O(n*m) lookups in views (`Array#find` in a loop instead of `index_by` hash)
- Ruby-side `.select{}` filtering on DB results that could be a `WHERE` clause (unless intentionally avoiding leading-wildcard `LIKE`)
#### Distribution & CI/CD Pipeline
- CI/CD workflow changes (`.github/workflows/`): verify build tool versions match project requirements, artifact names/paths are correct, secrets use `${{ secrets.X }}` not hardcoded values
- New artifact types (CLI binary, library, package): verify a publish/release workflow exists and targets correct platforms
- Cross-platform builds: verify CI matrix covers all target OS/arch combinations, or documents which are untested
- Version tag format consistency: `v1.2.3` vs `1.2.3` — must match across VERSION file, git tags, and publish scripts
- Publish step idempotency: re-running the publish workflow should not fail (e.g., `gh release delete` before `gh release create`)
**DO NOT flag:**
- Web services with existing auto-deploy pipelines (Docker build + K8s deploy)
- Internal tools not distributed outside the team
- Test-only CI changes (adding test steps, not publish steps)
---
## Severity Classification
```
CRITICAL (highest severity): INFORMATIONAL (main agent): SPECIALIST (parallel subagents):
├─ SQL & Data Safety ├─ Async/Sync Mixing ├─ Testing specialist
├─ Race Conditions & Concurrency ├─ Column/Field Name Safety ├─ Maintainability specialist
├─ LLM Output Trust Boundary ├─ Dead Code (version only) ├─ Security specialist
├─ Shell Injection ├─ LLM Prompt Issues ├─ Performance specialist
└─ Enum & Value Completeness ├─ Completeness Gaps ├─ Data Migration specialist
├─ Time Window Safety ├─ API Contract specialist
├─ Type Coercion at Boundaries └─ Red Team (conditional)
├─ View/Frontend
└─ Distribution & CI/CD Pipeline
All findings are actioned via Fix-First Review. Severity determines
presentation order and classification of AUTO-FIX vs ASK — critical
findings lean toward ASK (they're riskier), informational findings
lean toward AUTO-FIX (they're more mechanical).
```
---
## Fix-First Heuristic
This heuristic is referenced by both `/review` and `/ship`. It determines whether
the agent auto-fixes a finding or asks the user.
```
AUTO-FIX (agent fixes without asking): ASK (needs human judgment):
├─ Dead code / unused variables ├─ Security (auth, XSS, injection)
├─ N+1 queries (missing eager loading) ├─ Race conditions
├─ Stale comments contradicting code ├─ Design decisions
├─ Magic numbers → named constants ├─ Large fixes (>20 lines)
├─ Missing LLM output validation ├─ Enum completeness
├─ Version/path mismatches ├─ Removing functionality
├─ Variables assigned but never read └─ Anything changing user-visible
└─ Inline styles, O(n*m) view lookups behavior
```
**Rule of thumb:** If the fix is mechanical and a senior engineer would apply it
without discussion, it's AUTO-FIX. If reasonable engineers could disagree about
the fix, it's ASK.
**Critical findings default toward ASK** (they're inherently riskier).
**Informational findings default toward AUTO-FIX** (they're more mechanical).
---
## Suppressions — DO NOT flag these
- "X is redundant with Y" when the redundancy is harmless and aids readability (e.g., `present?` redundant with `length > 20`)
- "Add a comment explaining why this threshold/constant was chosen" — thresholds change during tuning, comments rot
- "This assertion could be tighter" when the assertion already covers the behavior
- Suggesting consistency-only changes (wrapping a value in a conditional to match how another constant is guarded)
- "Regex doesn't handle edge case X" when the input is constrained and X never occurs in practice
- "Test exercises multiple guards simultaneously" — that's fine, tests don't need to isolate every guard
- Eval threshold changes (max_actionable, min scores) — these are tuned empirically and change constantly
- Harmless no-ops (e.g., `.reject` on an element that's never in the array)
- ANYTHING already addressed in the diff you're reviewing — read the FULL diff before commenting

134
review/design-checklist.md Normal file
View File

@@ -0,0 +1,134 @@
# Design Review Checklist (Lite)
> **Subset of DESIGN_METHODOLOGY** — when adding items here, also update `generateDesignMethodology()` in `scripts/gen-skill-docs.ts`, and vice versa.
## Instructions
This checklist applies to **source code in the diff** — not rendered output. Read each changed frontend file (full file, not just diff hunks) and flag anti-patterns.
**Trigger:** Only run this checklist if the diff touches frontend files. Use `gstack-diff-scope` to detect:
```bash
source <(~/.claude/skills/gstack/bin/gstack-diff-scope <base> 2>/dev/null)
```
If `SCOPE_FRONTEND=false`, skip the entire design review silently.
**DESIGN.md calibration:** If `DESIGN.md` or `design-system.md` exists in the repo root, read it first. All findings are calibrated against the project's stated design system. Patterns explicitly blessed in DESIGN.md are NOT flagged. If no DESIGN.md exists, use universal design principles.
---
## Confidence Tiers
Each item is tagged with a detection confidence level:
- **[HIGH]** — Reliably detectable via grep/pattern match. Definitive findings.
- **[MEDIUM]** — Detectable via pattern aggregation or heuristic. Flag as findings but expect some noise.
- **[LOW]** — Requires understanding visual intent. Present as: "Possible issue — verify visually or run /design-review."
---
## Classification
**AUTO-FIX** (mechanical CSS fixes only — HIGH confidence, no design judgment needed):
- `outline: none` without replacement → add `outline: revert` or `&:focus-visible { outline: 2px solid currentColor; }`
- `!important` in new CSS → remove and fix specificity
- `font-size` < 16px on body text → bump to 16px
**ASK** (everything else — requires design judgment):
- All AI slop findings, typography structure, spacing choices, interaction state gaps, DESIGN.md violations
**LOW confidence items** → present as "Possible: [description]. Verify visually or run /design-review." Never AUTO-FIX.
---
## Output Format
```
Design Review: N issues (X auto-fixable, Y need input, Z possible)
**AUTO-FIXED:**
- [file:line] Problem → fix applied
**NEEDS INPUT:**
- [file:line] Problem description
Recommended fix: suggested fix
**POSSIBLE (verify visually):**
- [file:line] Possible issue — verify with /design-review
```
Optional: `test_stub` — skeleton test code for this finding using the project's test framework.
If no issues found: `Design Review: No issues found.`
If no frontend files changed: skip silently, no output.
---
## Categories
### 1. AI Slop Detection (6 items) — highest priority
These are the telltale signs of AI-generated UI that no designer at a respected studio would ship.
- **[MEDIUM]** Purple/violet/indigo gradient backgrounds or blue-to-purple color schemes. Look for `linear-gradient` with values in the `#6366f1``#8b5cf6` range, or CSS custom properties resolving to purple/violet.
- **[LOW]** The 3-column feature grid: icon-in-colored-circle + bold title + 2-line description, repeated 3x symmetrically. Look for a grid/flex container with exactly 3 children that each contain a circular element + heading + paragraph.
- **[LOW]** Icons in colored circles as section decoration. Look for elements with `border-radius: 50%` + a background color used as decorative containers for icons.
- **[HIGH]** Centered everything: `text-align: center` on all headings, descriptions, and cards. Grep for `text-align: center` density — if >60% of text containers use center alignment, flag it.
- **[MEDIUM]** Uniform bubbly border-radius on every element: same large radius (16px+) applied to cards, buttons, inputs, containers uniformly. Aggregate `border-radius` values — if >80% use the same value ≥16px, flag it.
- **[MEDIUM]** Generic hero copy: "Welcome to [X]", "Unlock the power of...", "Your all-in-one solution for...", "Revolutionize your...", "Streamline your workflow". Grep HTML/JSX content for these patterns.
### 2. Typography (4 items)
- **[HIGH]** Body text `font-size` < 16px. Grep for `font-size` declarations on `body`, `p`, `.text`, or base styles. Values below 16px (or 1rem when base is 16px) are flagged.
- **[HIGH]** More than 3 font families introduced in the diff. Count distinct `font-family` declarations. Flag if >3 unique families appear across changed files.
- **[HIGH]** Heading hierarchy skipping levels: `h1` followed by `h3` without an `h2` in the same file/component. Check HTML/JSX for heading tags.
- **[HIGH]** Blacklisted fonts: Papyrus, Comic Sans, Lobster, Impact, Jokerman. Grep `font-family` for these names.
### 3. Spacing & Layout (4 items)
- **[MEDIUM]** Arbitrary spacing values not on a 4px or 8px scale, when DESIGN.md specifies a spacing scale. Check `margin`, `padding`, `gap` values against the stated scale. Only flag when DESIGN.md defines a scale.
- **[MEDIUM]** Fixed widths without responsive handling: `width: NNNpx` on containers without `max-width` or `@media` breakpoints. Risk of horizontal scroll on mobile.
- **[MEDIUM]** Missing `max-width` on text containers: body text or paragraph containers with no `max-width` set, allowing lines >75 characters. Check for `max-width` on text wrappers.
- **[HIGH]** `!important` in new CSS rules. Grep for `!important` in added lines. Almost always a specificity escape hatch that should be fixed properly.
### 4. Interaction States (3 items)
- **[MEDIUM]** Interactive elements (buttons, links, inputs) missing hover/focus states. Check if `:hover` and `:focus` or `:focus-visible` pseudo-classes exist for new interactive element styles.
- **[HIGH]** `outline: none` or `outline: 0` without a replacement focus indicator. Grep for `outline:\s*none` or `outline:\s*0`. This removes keyboard accessibility.
- **[LOW]** Touch targets < 44px on interactive elements. Check `min-height`/`min-width`/`padding` on buttons and links. Requires computing effective size from multiple properties — low confidence from code alone.
### 5. DESIGN.md Violations (3 items, conditional)
Only apply if `DESIGN.md` or `design-system.md` exists:
- **[MEDIUM]** Colors not in the stated palette. Compare color values in changed CSS against the palette defined in DESIGN.md.
- **[MEDIUM]** Fonts not in the stated typography section. Compare `font-family` values against DESIGN.md's font list.
- **[MEDIUM]** Spacing values outside the stated scale. Compare `margin`/`padding`/`gap` values against DESIGN.md's spacing scale.
---
## Suppressions
Do NOT flag:
- Patterns explicitly documented in DESIGN.md as intentional choices
- Third-party/vendor CSS files (node_modules, vendor directories)
- CSS resets or normalize stylesheets
- Test fixture files
- Generated/minified CSS

220
review/greptile-triage.md Normal file
View File

@@ -0,0 +1,220 @@
# Greptile Comment Triage
Shared reference for fetching, filtering, and classifying Greptile review comments on GitHub PRs. Both `/review` (Step 2.5) and `/ship` (Step 3.75) reference this document.
---
## Fetch
Run these commands to detect the PR and fetch comments. Both API calls run in parallel.
```bash
REPO=$(gh repo view --json nameWithOwner --jq '.nameWithOwner' 2>/dev/null)
PR_NUMBER=$(gh pr view --json number --jq '.number' 2>/dev/null)
```
**If either fails or is empty:** Skip Greptile triage silently. This integration is additive — the workflow works without it.
```bash
# Fetch line-level review comments AND top-level PR comments in parallel
gh api repos/$REPO/pulls/$PR_NUMBER/comments \
--jq '.[] | select(.user.login == "greptile-apps[bot]") | select(.position != null) | {id: .id, path: .path, line: .line, body: .body, html_url: .html_url, source: "line-level"}' > /tmp/greptile_line.json &
gh api repos/$REPO/issues/$PR_NUMBER/comments \
--jq '.[] | select(.user.login == "greptile-apps[bot]") | {id: .id, body: .body, html_url: .html_url, source: "top-level"}' > /tmp/greptile_top.json &
wait
```
**If API errors or zero Greptile comments across both endpoints:** Skip silently.
The `position != null` filter on line-level comments automatically skips outdated comments from force-pushed code.
---
## Suppressions Check
Derive the project-specific history path:
```bash
REMOTE_SLUG=$(browse/bin/remote-slug 2>/dev/null || ~/.claude/skills/gstack/browse/bin/remote-slug 2>/dev/null || basename "$(git rev-parse --show-toplevel 2>/dev/null || pwd)")
PROJECT_HISTORY="$HOME/.gstack/projects/$REMOTE_SLUG/greptile-history.md"
```
Read `$PROJECT_HISTORY` if it exists (per-project suppressions). Each line records a previous triage outcome:
```
<date> | <repo> | <type:fp|fix|already-fixed> | <file-pattern> | <category>
```
**Categories** (fixed set): `race-condition`, `null-check`, `error-handling`, `style`, `type-safety`, `security`, `performance`, `correctness`, `other`
Match each fetched comment against entries where:
- `type == fp` (only suppress known false positives, not previously fixed real issues)
- `repo` matches the current repo
- `file-pattern` matches the comment's file path
- `category` matches the issue type in the comment
Skip matched comments as **SUPPRESSED**.
If the history file doesn't exist or has unparseable lines, skip those lines and continue — never fail on a malformed history file.
---
## Classify
For each non-suppressed comment:
1. **Line-level comments:** Read the file at the indicated `path:line` and surrounding context (±10 lines)
2. **Top-level comments:** Read the full comment body
3. Cross-reference the comment against the full diff (`git diff origin/main`) and the review checklist
4. Classify:
- **VALID & ACTIONABLE** — a real bug, race condition, security issue, or correctness problem that exists in the current code
- **VALID BUT ALREADY FIXED** — a real issue that was addressed in a subsequent commit on the branch. Identify the fixing commit SHA.
- **FALSE POSITIVE** — the comment misunderstands the code, flags something handled elsewhere, or is stylistic noise
- **SUPPRESSED** — already filtered in the suppressions check above
---
## Reply APIs
When replying to Greptile comments, use the correct endpoint based on comment source:
**Line-level comments** (from `pulls/$PR/comments`):
```bash
gh api repos/$REPO/pulls/$PR_NUMBER/comments/$COMMENT_ID/replies \
-f body="<reply text>"
```
**Top-level comments** (from `issues/$PR/comments`):
```bash
gh api repos/$REPO/issues/$PR_NUMBER/comments \
-f body="<reply text>"
```
**If a reply POST fails** (e.g., PR was closed, no write permission): warn and continue. Do not stop the workflow for a failed reply.
---
## Reply Templates
Use these templates for every Greptile reply. Always include concrete evidence — never post vague replies.
### Tier 1 (First response) — Friendly, evidence-included
**For FIXES (user chose to fix the issue):**
```
**Fixed** in `<commit-sha>`.
\`\`\`diff
- <old problematic line(s)>
+ <new fixed line(s)>
\`\`\`
**Why:** <1-sentence explanation of what was wrong and how the fix addresses it>
```
**For ALREADY FIXED (issue addressed in a prior commit on the branch):**
```
**Already fixed** in `<commit-sha>`.
**What was done:** <1-2 sentences describing how the existing commit addresses this issue>
```
**For FALSE POSITIVES (the comment is incorrect):**
```
**Not a bug.** <1 sentence directly stating why this is incorrect>
**Evidence:**
- <specific code reference showing the pattern is safe/correct>
- <e.g., "The nil check is handled by `ActiveRecord::FinderMethods#find` which raises RecordNotFound, not nil">
**Suggested re-rank:** This appears to be a `<style|noise|misread>` issue, not a `<what Greptile called it>`. Consider lowering severity.
```
### Tier 2 (Greptile re-flags after prior reply) — Firm, overwhelming evidence
Use Tier 2 when escalation detection (below) identifies a prior GStack reply on the same thread. Include maximum evidence to close the discussion.
```
**This has been reviewed and confirmed as [intentional/already-fixed/not-a-bug].**
\`\`\`diff
<full relevant diff showing the change or safe pattern>
\`\`\`
**Evidence chain:**
1. <file:line permalink showing the safe pattern or fix>
2. <commit SHA where it was addressed, if applicable>
3. <architecture rationale or design decision, if applicable>
**Suggested re-rank:** Please recalibrate — this is a `<actual category>` issue, not `<claimed category>`. [Link to specific file change permalink if helpful]
```
---
## Escalation Detection
Before composing a reply, check if a prior GStack reply already exists on this comment thread:
1. **For line-level comments:** Fetch replies via `gh api repos/$REPO/pulls/$PR_NUMBER/comments/$COMMENT_ID/replies`. Check if any reply body contains GStack markers: `**Fixed**`, `**Not a bug.**`, `**Already fixed**`.
2. **For top-level comments:** Scan the fetched issue comments for replies posted after the Greptile comment that contain GStack markers.
3. **If a prior GStack reply exists AND Greptile posted again on the same file+category:** Use Tier 2 (firm) templates.
4. **If no prior GStack reply exists:** Use Tier 1 (friendly) templates.
If escalation detection fails (API error, ambiguous thread): default to Tier 1. Never escalate on ambiguity.
---
## Severity Assessment & Re-ranking
When classifying comments, also assess whether Greptile's implied severity matches reality:
- If Greptile flags something as a **security/correctness/race-condition** issue but it's actually a **style/performance** nit: include `**Suggested re-rank:**` in the reply requesting the category be corrected.
- If Greptile flags a low-severity style issue as if it were critical: push back in the reply.
- Always be specific about why the re-ranking is warranted — cite code and line numbers, not opinions.
---
## History File Writes
Before writing, ensure both directories exist:
```bash
REMOTE_SLUG=$(browse/bin/remote-slug 2>/dev/null || ~/.claude/skills/gstack/browse/bin/remote-slug 2>/dev/null || basename "$(git rev-parse --show-toplevel 2>/dev/null || pwd)")
mkdir -p "$HOME/.gstack/projects/$REMOTE_SLUG"
mkdir -p ~/.gstack
```
Append one line per triage outcome to **both** files (per-project for suppressions, global for retro):
- `~/.gstack/projects/$REMOTE_SLUG/greptile-history.md` (per-project)
- `~/.gstack/greptile-history.md` (global aggregate)
Format:
```
<YYYY-MM-DD> | <owner/repo> | <type> | <file-pattern> | <category>
```
Example entries:
```
2026-03-13 | garrytan/myapp | fp | app/services/auth_service.rb | race-condition
2026-03-13 | garrytan/myapp | fix | app/models/user.rb | null-check
2026-03-13 | garrytan/myapp | already-fixed | lib/payments.rb | error-handling
```
---
## Output Format
Include a Greptile summary in the output header:
```
+ N Greptile comments (X valid, Y fixed, Z FP)
```
For each classified comment, show:
- Classification tag: `[VALID]`, `[FIXED]`, `[FALSE POSITIVE]`, `[SUPPRESSED]`
- File:line reference (for line-level) or `[top-level]` (for top-level)
- One-line body summary
- Permalink URL (the `html_url` field)

View File

@@ -0,0 +1,49 @@
# API Contract Specialist Review Checklist
Scope: When SCOPE_API=true
Output: JSON objects, one finding per line. Schema:
{"severity":"CRITICAL|INFORMATIONAL","confidence":N,"path":"file","line":N,"category":"api-contract","summary":"...","fix":"...","fingerprint":"path:line:api-contract","specialist":"api-contract"}
Optional: line, fix, fingerprint, evidence, test_stub.
If no findings: output `NO FINDINGS` and nothing else.
---
## Categories
### Breaking Changes
- Removed fields from response bodies (clients may depend on them)
- Changed field types (string → number, object → array)
- New required parameters added to existing endpoints
- Changed HTTP methods (GET → POST) or status codes (200 → 201)
- Renamed endpoints without maintaining the old path as a redirect/alias
- Changed authentication requirements (public → authenticated)
### Versioning Strategy
- Breaking changes made without a version bump (v1 → v2)
- Multiple versioning strategies mixed in the same API (URL vs header vs query param)
- Deprecated endpoints without a sunset timeline or migration guide
- Version-specific logic scattered across controllers instead of centralized
### Error Response Consistency
- New endpoints returning different error formats than existing ones
- Error responses missing standard fields (error code, message, details)
- HTTP status codes that don't match the error type (200 for errors, 500 for validation)
- Error messages that leak internal implementation details (stack traces, SQL)
### Rate Limiting & Pagination
- New endpoints missing rate limiting when similar endpoints have it
- Pagination changes (offset → cursor) without backwards compatibility
- Changed page sizes or default limits without documentation
- Missing total count or next-page indicators in paginated responses
### Documentation Drift
- OpenAPI/Swagger spec not updated to match new endpoints or changed params
- README or API docs describing old behavior after changes
- Example requests/responses that no longer work
- Missing documentation for new endpoints or changed parameters
### Backwards Compatibility
- Clients on older versions: will they break?
- Mobile apps that can't force-update: does the API still work for them?
- Webhook payloads changed without notifying subscribers
- SDK or client library changes needed to use new features

View File

@@ -0,0 +1,48 @@
# Data Migration Specialist Review Checklist
Scope: When SCOPE_MIGRATIONS=true
Output: JSON objects, one finding per line. Schema:
{"severity":"CRITICAL|INFORMATIONAL","confidence":N,"path":"file","line":N,"category":"data-migration","summary":"...","fix":"...","fingerprint":"path:line:data-migration","specialist":"data-migration"}
Optional: line, fix, fingerprint, evidence, test_stub.
If no findings: output `NO FINDINGS` and nothing else.
---
## Categories
### Reversibility
- Can this migration be rolled back without data loss?
- Is there a corresponding down/rollback migration?
- Does the rollback actually undo the change or just no-op?
- Would rolling back break the current application code?
### Data Loss Risk
- Dropping columns that still contain data (add deprecation period first)
- Changing column types that truncate data (varchar(255) → varchar(50))
- Removing tables without verifying no code references them
- Renaming columns without updating all references (ORM, raw SQL, views)
- NOT NULL constraints added to columns with existing NULL values (needs backfill first)
### Lock Duration
- ALTER TABLE on large tables without CONCURRENTLY (PostgreSQL)
- Adding indexes without CONCURRENTLY on tables with >100K rows
- Multiple ALTER TABLE statements that could be combined into one lock acquisition
- Schema changes that acquire exclusive locks during peak traffic hours
### Backfill Strategy
- New NOT NULL columns without DEFAULT value (requires backfill before constraint)
- New columns with computed defaults that need batch population
- Missing backfill script or rake task for existing records
- Backfill that updates all rows at once instead of batching (locks table)
### Index Creation
- CREATE INDEX without CONCURRENTLY on production tables
- Duplicate indexes (new index covers same columns as existing one)
- Missing indexes on new foreign key columns
- Partial indexes where a full index would be more useful (or vice versa)
### Multi-Phase Safety
- Migrations that must be deployed in a specific order with application code
- Schema changes that break the current running code (deploy code first, then migrate)
- Migrations that assume a deploy boundary (old code + new schema = crash)
- Missing feature flag to handle mixed old/new code during rolling deploy

View File

@@ -0,0 +1,46 @@
# Maintainability Specialist Review Checklist
Scope: Always-on (every review)
Output: JSON objects, one finding per line. Schema:
{"severity":"INFORMATIONAL","confidence":N,"path":"file","line":N,"category":"maintainability","summary":"...","fix":"...","fingerprint":"path:line:maintainability","specialist":"maintainability"}
Optional: line, fix, fingerprint, evidence, test_stub.
If no findings: output `NO FINDINGS` and nothing else.
---
## Categories
### Dead Code & Unused Imports
- Variables assigned but never read in the changed files
- Functions/methods defined but never called (check with Grep across the repo)
- Imports/requires that are no longer referenced after the change
- Commented-out code blocks (either remove or explain why they exist)
### Magic Numbers & String Coupling
- Bare numeric literals used in logic (thresholds, limits, retry counts) — should be named constants
- Error message strings used as query filters or conditionals elsewhere
- Hardcoded URLs, ports, or hostnames that should be config
- Duplicated literal values across multiple files
### Stale Comments & Docstrings
- Comments that describe old behavior after the code was changed in this diff
- TODO/FIXME comments that reference completed work
- Docstrings with parameter lists that don't match the current function signature
- ASCII diagrams in comments that no longer match the code flow
### DRY Violations
- Similar code blocks (3+ lines) appearing multiple times within the diff
- Copy-paste patterns where a shared helper would be cleaner
- Configuration or setup logic duplicated across test files
- Repeated conditional chains that could be a lookup table or map
### Conditional Side Effects
- Code paths that branch on a condition but forget a side effect on one branch
- Log messages that claim an action happened but the action was conditionally skipped
- State transitions where one branch updates related records but the other doesn't
- Event emissions that only fire on the happy path, missing error/edge paths
### Module Boundary Violations
- Reaching into another module's internal implementation (accessing private-by-convention methods)
- Direct database queries in controllers/views that should go through a service/model
- Tight coupling between components that should communicate through interfaces

View File

@@ -0,0 +1,52 @@
# Performance Specialist Review Checklist
Scope: When SCOPE_BACKEND=true OR SCOPE_FRONTEND=true
Output: JSON objects, one finding per line. Schema:
{"severity":"CRITICAL|INFORMATIONAL","confidence":N,"path":"file","line":N,"category":"performance","summary":"...","fix":"...","fingerprint":"path:line:performance","specialist":"performance"}
Optional: line, fix, fingerprint, evidence, test_stub.
If no findings: output `NO FINDINGS` and nothing else.
---
## Categories
### N+1 Queries
- ActiveRecord/ORM associations traversed in loops without eager loading (.includes, joinedload, include)
- Database queries inside iteration blocks (each, map, forEach) that could be batched
- Nested serializers that trigger lazy-loaded associations
- GraphQL resolvers that query per-field instead of batching (check for DataLoader usage)
### Missing Database Indexes
- New WHERE clauses on columns without indexes (check migration files or schema)
- New ORDER BY on non-indexed columns
- Composite queries (WHERE a AND b) without composite indexes
- Foreign key columns added without indexes
### Algorithmic Complexity
- O(n^2) or worse patterns: nested loops over collections, Array.find inside Array.map
- Repeated linear searches that could use a hash/map/set lookup
- String concatenation in loops (use join or StringBuilder)
- Sorting or filtering large collections multiple times when once would suffice
### Bundle Size Impact (Frontend)
- New production dependencies that are known-heavy (moment.js, lodash full, jquery)
- Barrel imports (import from 'library') instead of deep imports (import from 'library/specific')
- Large static assets (images, fonts) committed without optimization
- Missing code splitting for route-level chunks
### Rendering Performance (Frontend)
- Fetch waterfalls: sequential API calls that could be parallel (Promise.all)
- Unnecessary re-renders from unstable references (new objects/arrays in render)
- Missing React.memo, useMemo, or useCallback on expensive computations
- Layout thrashing from reading then writing DOM properties in loops
- Missing loading="lazy" on below-fold images
### Missing Pagination
- List endpoints that return unbounded results (no LIMIT, no pagination params)
- Database queries without LIMIT that grow with data volume
- API responses that embed full nested objects instead of IDs with expansion
### Blocking in Async Contexts
- Synchronous I/O (file reads, subprocess, HTTP requests) inside async functions
- time.sleep() / Thread.sleep() inside event-loop-based handlers
- CPU-intensive computation blocking the main thread without worker offload

View File

@@ -0,0 +1,45 @@
# Red Team Review
Scope: When diff > 200 lines OR security specialist found CRITICAL findings. Runs AFTER other specialists.
Output: JSON objects, one finding per line. Schema:
{"severity":"CRITICAL|INFORMATIONAL","confidence":N,"path":"file","line":N,"category":"red-team","summary":"...","fix":"...","fingerprint":"path:line:red-team","specialist":"red-team"}
Optional: line, fix, fingerprint, evidence, test_stub.
If no findings: output `NO FINDINGS` and nothing else.
---
This is NOT a checklist review. This is adversarial analysis.
You have access to the other specialists' findings (provided in your prompt). Your job is to find what they MISSED. Think like an attacker, a chaos engineer, and a hostile QA tester simultaneously.
## Approach
### 1. Attack the Happy Path
- What happens when the system is under 10x normal load?
- What happens when two requests hit the same resource simultaneously?
- What happens when the database is slow (>5s query time)?
- What happens when an external service returns garbage?
### 2. Find the Silent Failures
- Error handling that swallows exceptions (catch-all with just a log)
- Operations that can partially complete (3 of 5 items processed, then crash)
- State transitions that leave records in inconsistent states on failure
- Background jobs that fail without alerting anyone
### 3. Exploit Trust Assumptions
- Data validated on the frontend but not the backend
- Internal APIs called without authentication (assuming "only our code calls this")
- Configuration values assumed to be present but not validated
- File paths or URLs constructed from user input without sanitization
### 4. Break the Edge Cases
- What happens with the maximum possible input size?
- What happens with zero items, empty strings, null values?
- What happens on the first run ever (no existing data)?
- What happens when the user clicks the button twice in 100ms?
### 5. Find What the Other Specialists Missed
- Review each specialist's findings. What's the gap between their categories?
- Look for cross-category issues (e.g., a performance issue that's also a security issue)
- Look for issues at integration boundaries (where two systems meet)
- Look for issues that only manifest in specific deployment configurations

View File

@@ -0,0 +1,61 @@
# Security Specialist Review Checklist
Scope: When SCOPE_AUTH=true OR (SCOPE_BACKEND=true AND diff > 100 lines)
Output: JSON objects, one finding per line. Schema:
{"severity":"CRITICAL|INFORMATIONAL","confidence":N,"path":"file","line":N,"category":"security","summary":"...","fix":"...","fingerprint":"path:line:security","specialist":"security"}
Optional: line, fix, fingerprint, evidence, test_stub.
If no findings: output `NO FINDINGS` and nothing else.
---
This checklist goes deeper than the main CRITICAL pass. The main agent already checks SQL injection, race conditions, LLM trust, and enum completeness. This specialist focuses on auth/authz patterns, cryptographic misuse, and attack surface expansion.
## Categories
### Input Validation at Trust Boundaries
- User input accepted without validation at controller/handler level
- Query parameters used directly in database queries or file paths
- Request body fields accepted without type checking or schema validation
- File uploads without type/size/content validation
- Webhook payloads processed without signature verification
### Auth & Authorization Bypass
- Endpoints missing authentication middleware (check route definitions)
- Authorization checks that default to "allow" instead of "deny"
- Role escalation paths (user can modify their own role/permissions)
- Direct object reference vulnerabilities (user A accesses user B's data by changing an ID)
- Session fixation or session hijacking opportunities
- Token/API key validation that doesn't check expiration
### Injection Vectors (beyond SQL)
- Command injection via subprocess calls with user-controlled arguments
- Template injection (Jinja2, ERB, Handlebars) with user input
- LDAP injection in directory queries
- SSRF via user-controlled URLs (fetch, redirect, webhook targets)
- Path traversal via user-controlled file paths (../../etc/passwd)
- Header injection via user-controlled values in HTTP headers
### Cryptographic Misuse
- Weak hashing algorithms (MD5, SHA1) for security-sensitive operations
- Predictable randomness (Math.random, rand()) for tokens or secrets
- Non-constant-time comparisons (==) on secrets, tokens, or digests
- Hardcoded encryption keys or IVs
- Missing salt in password hashing
### Secrets Exposure
- API keys, tokens, or passwords in source code (even in comments)
- Secrets logged in application logs or error messages
- Credentials in URLs (query parameters or basic auth in URL)
- Sensitive data in error responses returned to users
- PII stored in plaintext when encryption is expected
### XSS via Escape Hatches
- Rails: .html_safe, raw() on user-controlled data
- React: dangerouslySetInnerHTML with user content
- Vue: v-html with user content
- Django: |safe, mark_safe() on user input
- General: innerHTML assignment with unsanitized data
### Deserialization
- Deserializing untrusted data (pickle, Marshal, YAML.load, JSON.parse of executable types)
- Accepting serialized objects from user input or external APIs without schema validation

View File

@@ -0,0 +1,46 @@
# Testing Specialist Review Checklist
Scope: Always-on (every review)
Output: JSON objects, one finding per line. Schema:
{"severity":"CRITICAL|INFORMATIONAL","confidence":N,"path":"file","line":N,"category":"testing","summary":"...","fix":"...","fingerprint":"path:line:testing","specialist":"testing"}
Optional: line, fix, fingerprint, evidence, test_stub.
If no findings: output `NO FINDINGS` and nothing else.
---
## Categories
### Missing Negative-Path Tests
- New code paths that handle errors, rejections, or invalid input with NO corresponding test
- Guard clauses and early returns that are untested
- Error branches in try/catch, rescue, or error boundaries with no failure-path test
- Permission/auth checks that are asserted in code but never tested for the "denied" case
### Missing Edge-Case Coverage
- Boundary values: zero, negative, max-int, empty string, empty array, nil/null/undefined
- Single-element collections (off-by-one on loops)
- Unicode and special characters in user-facing inputs
- Concurrent access patterns with no race-condition test
### Test Isolation Violations
- Tests sharing mutable state (class variables, global singletons, DB records not cleaned up)
- Order-dependent tests (pass in sequence, fail when randomized)
- Tests that depend on system clock, timezone, or locale
- Tests that make real network calls instead of using stubs/mocks
### Flaky Test Patterns
- Timing-dependent assertions (sleep, setTimeout, waitFor with tight timeouts)
- Assertions on ordering of unordered results (hash keys, Set iteration, async resolution order)
- Tests that depend on external services (APIs, databases) without fallback
- Randomized test data without seed control
### Security Enforcement Tests Missing
- Auth/authz checks in controllers with no test for the "unauthorized" case
- Rate limiting logic with no test proving it actually blocks
- Input sanitization with no test for malicious input
- CSRF/CORS configuration with no integration test
### Coverage Gaps
- New public methods/functions with zero test coverage
- Changed methods where existing tests only cover the old behavior, not the new branch
- Utility functions called from multiple places but tested only indirectly