PROOF Series #1 — Pilot Study Plan

Drafted: May 14, 2026
Status: Pre-execution. Optional methodology-validation pass before the full study runs.
Estimated execution time: 5–10 days end-to-end
Estimated cost: $200–400 (API + compute)

Purpose

The full PROOF Series #1 study (200 firms, 60 days, 4 engines, 50 prompts) is a significant commitment. A small pilot validates the methodology before the full investment lands. The pilot is also a way to surface unforeseen practical problems — broken data extraction logic, prompt sets that consistently fail to elicit firm names, engine response formats that don't parse — while there's still time to fix them.

The pilot is NOT a publishable substitute for the full study. Sample size is too small for statistical inference. It is purely a methodology de-risking pass.

Pilot design

Sample

20 firms total drawn from one practice area (recommendation: personal injury — highest expected citation volume, biggest competitive set, cleanest prompts).

10 firms classified Bin A (materially compliant per the coding rubric v1.0)
10 firms classified Bin C (materially non-compliant)

Sampling: from the candidate pool already screened for the full study, randomly select 10 firms from each Bin within the personal-injury stratum. The selection uses a fixed random seed, and both the seed and the resulting draw are documented.
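A minimal sketch of a reproducible draw, assuming the screened pool is available as a mapping from Bin label to firm identifiers. The function name, the seed value, and the firm identifiers are illustrative assumptions, not part of the pre-registration.

```python
import random

def draw_pilot_sample(pool, per_bin=10, seed=20260514):
    """Draw `per_bin` firms from each Bin using a fixed, documented seed.

    `pool` maps a Bin label ("A" or "C") to a list of firm identifiers.
    The seed value here is a placeholder; record whichever seed is actually used.
    """
    rng = random.Random(seed)  # local generator, independent of global state
    sample = {}
    for bin_label in sorted(pool):        # fixed Bin order
        firms = sorted(pool[bin_label])   # fixed input order before sampling
        sample[bin_label] = rng.sample(firms, per_bin)
    return sample

# Hypothetical candidate pool, for illustration only.
pool = {
    "A": [f"firm-A-{i:02d}" for i in range(30)],
    "C": [f"firm-C-{i:02d}" for i in range(30)],
}
pilot = draw_pilot_sample(pool)
```

Sorting the pool before sampling means the same seed gives the same draw regardless of the order in which the pool was loaded.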

Engines

All four engines from the full study: ChatGPT, Perplexity, Google AI Overviews, Claude.

Prompts

8 prompts — the personal-injury subset of the full study's prompt list. These are the same prompts that will be used in the full study, so the pilot exercises the actual prompt set rather than a separate one.

Cadence

Once per day for 7 consecutive days. Total observations: 20 firms × 8 prompts × 4 engines × 7 days = 4,480 binary citation observations.
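The observation count can be checked by enumerating the collection grid. This is a small sketch; the variable names are illustrative, and it assumes each query session (one prompt on one engine on one day) yields one cited / not-cited observation per firm.

```python
from itertools import product

ENGINES = ["ChatGPT", "Perplexity", "Google AI Overviews", "Claude"]
N_FIRMS, N_PROMPTS, N_DAYS = 20, 8, 7

# One query session = one prompt sent to one engine on one day.
sessions = list(product(range(1, N_DAYS + 1), range(N_PROMPTS), ENGINES))

# Each session is scored once per firm: cited or not cited.
observations = len(sessions) * N_FIRMS

print(len(sessions) // N_DAYS)  # 32 sessions per day
print(len(sessions))            # 224 sessions over the pilot
print(observations)             # 4480 binary citation observations
```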

What the pilot tests

Data extraction pipeline. Does the firm-name extraction routine correctly identify firms in the engines' generated answers and cited-sources lists? Acceptance threshold: precision and recall of at least 95% each, measured against a hand-coded subset of 100 responses.
Coder reliability. Do the two coders applying the rubric to the 20 firms agree at acceptable kappa? Target: Cohen's κ ≥ 0.70. If lower, the rubric is revised before the full study.
Prompt elicitation. Do the prompts produce parseable, firm-name-yielding responses on each engine? Specifically watching for prompts that consistently return "I can't help with that" refusals, or whose answers draw only on the engine's pre-trained content rather than on retrieval.
Engine behavior variation. How much do engine outputs vary day to day for the same prompt? If variance is high enough that 60 days of data won't stabilize the rate estimate within ±5 percentage points, the cadence may need to increase.
Bin separation. Is there directional separation between Bin A and Bin C citation rates in the pilot data? Pilot results are not statistically interpretable, but a flat or reversed pattern would be a flag to revisit either the rubric or the hypothesis before scaling.
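The extraction check in the first item above can be scored along these lines. This is a sketch under stated assumptions: each response is reduced to a set of firm names, scores are micro-averaged across responses, and the 95% threshold is applied to precision and recall separately. The function name and the sample data are hypothetical.

```python
def precision_recall(extracted, hand_coded):
    """Micro-averaged precision and recall across responses.

    Both arguments map a response id to the set of firm names found in it.
    `hand_coded` is treated as ground truth.
    """
    tp = fp = fn = 0
    for response_id, truth in hand_coded.items():
        found = extracted.get(response_id, set())
        tp += len(found & truth)   # firms the pipeline got right
        fp += len(found - truth)   # firms the pipeline invented
        fn += len(truth - found)   # firms the pipeline missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical three-response sample, for illustration only.
hand_coded = {"r1": {"Firm X", "Firm Y"}, "r2": {"Firm Z"}, "r3": set()}
extracted = {"r1": {"Firm X"}, "r2": {"Firm Z", "Firm Q"}, "r3": set()}

precision, recall = precision_recall(extracted, hand_coded)
passes = precision >= 0.95 and recall >= 0.95
```

In this toy sample the pipeline misses one firm and adds one spurious firm, so both scores are 2/3 and the check fails.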

What the pilot does NOT test

The full study's statistical power (sample too small)
Cross-practice-area patterns (only one practice area)
The 60-day longitudinal stabilization curve (only 7 days)
The Bin A vs. Bin C citation-rate effect in any inferential sense

These remain for the full study.

Execution timeline

Day 0 (prep):
- Confirm 20 firms drawn from candidate pool
- Confirm prompt list locked
- Confirm data pipeline operational and validated against test queries
- Confirm coding spreadsheet and coder access
- Coders begin compliance coding (target: 5 firms per coder per day = 4 days to code 20 firms)

Days 1–7 (data collection):
- Run 8 prompts × 4 engines × 1 day = 32 queries per day
- 7 days × 32 = 224 query sessions total
- Each query session captures and stores: prompt text, engine, timestamp, full response, cited-sources list, derived firm-name extraction
- Each session is approximately 15 minutes of human-supervised execution if running manually; potentially 30 min/day automated
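One possible shape for the stored query-session record, covering the six fields listed above. The class name, field names, and the JSON-lines storage format are assumptions for illustration; the plan does not specify a storage format.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

@dataclass
class QuerySession:
    prompt_text: str
    engine: str
    timestamp: str                     # ISO 8601, UTC
    full_response: str
    cited_sources: list = field(default_factory=list)
    extracted_firms: list = field(default_factory=list)

# Placeholder values, for illustration only.
record = QuerySession(
    prompt_text="(prompt text from the locked list)",
    engine="Perplexity",
    timestamp=datetime.now(timezone.utc).isoformat(),
    full_response="(raw engine answer)",
    cited_sources=["https://example.com/source"],
    extracted_firms=["Example Firm LLP"],
)

# One JSON object per line keeps the raw capture append-only and easy to re-parse.
line = json.dumps(asdict(record))
```

Storing the full response alongside the derived extraction lets the extraction logic be fixed and re-run later without repeating the queries.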

Days 8–9 (analysis):
- Coder reliability check (Cohen's kappa)
- Data extraction validation (hand-coded sample comparison)
- Descriptive analysis: Bin A vs. Bin C citation rates per engine and per prompt
- Variance analysis: day-to-day stability of per-firm citation rate
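The coder reliability check can be computed directly from the two coders' Bin labels. This is a minimal sketch of the standard two-rater Cohen's kappa; the function name and the example labels are illustrative, not pilot data.

```python
import math
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders labelling the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items where the coders gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each coder's marginal rate, summed over labels.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return math.nan  # undefined when both coders use one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical labels for 20 firms: the coders disagree on one firm.
coder_1 = ["A"] * 10 + ["C"] * 10
coder_2 = ["A"] * 9 + ["C"] * 11

kappa = cohens_kappa(coder_1, coder_2)  # roughly 0.90
meets_target = kappa >= 0.70
```

With only 20 firms, a single disagreement moves kappa by about 0.10, so the pilot value is a coarse indicator rather than a precise estimate.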

Day 10 (decision):
- Go / no-go on the full study based on:
  - Coder reliability ≥ 0.70 → proceed; if lower, revise rubric and re-run coding
  - Data extraction precision/recall ≥ 0.95 → proceed; if lower, fix extraction logic
  - Bin A / Bin C directional separation in pilot data → proceed; if reversed, revisit hypothesis with the team
  - No unexpected technical blockers
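The Day 10 criteria can be written down as a simple checklist function so the decision is applied the same way it was specified. This is a sketch only: the function name, the argument names, and the encoding of separation as "expected", "flat", or "reversed" are assumptions. Following the criteria above, only a reversed pattern blocks; how to treat a flat pattern is left to the team's judgment.

```python
def go_no_go(kappa, precision, recall, separation, blockers):
    """Apply the Day 10 criteria and return (decision, required_actions).

    separation: "expected", "flat", or "reversed" (illustrative encoding).
    blockers: list of short descriptions of unexpected technical blockers.
    """
    actions = []
    if kappa < 0.70:
        actions.append("revise rubric and re-run coding")
    if precision < 0.95 or recall < 0.95:
        actions.append("fix extraction logic")
    if separation == "reversed":
        actions.append("revisit hypothesis with the team")
    for blocker in blockers:
        actions.append(f"resolve technical blocker: {blocker}")
    decision = "go" if not actions else "no-go"
    return decision, actions

# Hypothetical pilot outcomes, for illustration only.
clean = go_no_go(0.82, 0.97, 0.96, "expected", [])
flagged = go_no_go(0.65, 0.97, 0.90, "reversed", [])
```

Returning the list of required actions alongside the decision makes the internal pilot report easier to write, since each failed criterion maps to a named follow-up.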

Deliverables from the pilot

Internal pilot report (5–8 pages). Methodology validation findings, any methodology revisions made, coder reliability stats, data extraction precision/recall, descriptive citation-rate observations. Not for public publication.
Updated rubric (v1.1 if needed). If pilot kappa is below 0.70, the rubric is revised based on coder disagreement patterns. The revised version is published as v1.1.
Updated pre-registration (v1.1 if needed). Any methodology changes prompted by pilot findings are documented as a pre-registration revision before the full study begins.
A short public methodology note. Approximately 800 words, published at wtt.digital/proof/cal-bar-citation-study/methodology-note, describing what the pilot validated and what was learned about engine behavior. No firm-specific data; methodology only. This is a public-facing artifact that demonstrates ongoing rigor.

Cost and resource estimate

Researcher time: Roughly 15–20 hours of focused work over 10 days (coding, pipeline supervision, analysis, write-up)
Coder time: 6–10 hours per coder × 2 coders = 12–20 hours total
API and compute costs: Approximately $200–400 (ChatGPT API + Perplexity API + Claude API + automated browser sessions for Google AI Overviews)
Total cost: Roughly $300–500 in direct expenses plus the researcher's own time

These costs count toward the full study rather than sitting on top of it: the pipeline, rubric, and coding work built for the pilot carry over. If the pilot validates the methodology cleanly, the full study runs with minimal additional infrastructure work.

Risk acknowledgment

The pilot may reveal that the methodology has a fatal flaw — for example:

The four engines may respond too variably day-to-day for any reasonable sample size to produce stable rate estimates
The coding rubric may have inter-rater reliability issues that can't be resolved with revision
The firm-name extraction may fail catastrophically on Chinese-language or otherwise non-standard responses
The hypothesis may be directionally wrong, in which case the full study still runs (the null result is publishable) but expectations shift

If the pilot reveals a fatal flaw that requires a fundamental methodology rewrite, the full study is delayed and the pre-registration is revised. The delay is publicly disclosed. The intent is to ship a real result, not a fast result.

Decision: run the pilot, or skip directly to the full study?

Argument for running the pilot: De-risks the full $2,000–4,000 commitment. Surfaces problems while they're cheap to fix. Builds incremental credibility through public methodology notes. Generates preliminary data that can be cited (carefully) in pre-publication conversations.

Argument for skipping the pilot: Adds 2–3 weeks to the timeline. The methodology is already specified at high fidelity. The data pipeline can be validated on test queries without a formal pilot. Pilot results are uninterpretable for the hypothesis anyway.

Recommendation: run the pilot. The added 2–3 weeks is small relative to the protection against discovering a methodology flaw mid-full-study. The public methodology note also gives PROOF Series #1 an interim artifact to point at between pre-registration (May 14) and publication (target September 1), which keeps the research visible during the data-collection period.

Pilot study plan v1.0 — May 14, 2026. This is a working document; revisions are versioned. The full pre-registration is published separately and is the binding methodology document for the study.