Itinerary v2 Prompt - 11-way Doability Benchmark

2026-06-25 - OLD vs NEW vs 9 research-grounded variants - 20 real-saves scenarios - 220 Sonnet generations - blind Opus panel

Verdict (two rounds). Round 1 (11-way) showed that the soft-priority graft beats both OLD (5.96) and NEW (7.01) outright: all 9 grafts scored higher than either reference. Round 2 (champion round: the 4 strongest variants head-to-head) settled the port config: the anchors-pool structural split is essential, and V9 (anchors split + intensity tags + rhythm + variety + clustering) wins the decisive comparison at 8.11/10 (avg rank 2.07). V4, which topped the wide 11-way field, drops to LAST head-to-head, which suggests its rhythm rule benefited from a weak field rather than from a strong structure.

Which prompt builds the best, most doable trips - using saves wisely (no forced cramming), no repeats, in-city except labeled day-trips?

Key findings

Soft-priority is the foundational win. Every one of the 9 grafts beats NEW's hard "every save MUST appear" rule on the overall judge score - even the minimal V1 (7.59) clears NEW (7.01) and OLD (5.96). The biggest single lift is just replacing the must-include rule with soft-priority + no-dup + same-city + don't-overload.
Doability is where OLD/NEW actually fail. OLD scores 5.16 on doability and NEW 6.29, against 7.89 for the best graft. NEW's hard must-include rule over-packs days; OLD invents places (6.2 gap-fills per plan) and schedules restaurants as activities (1.15 per plan, vs ~0.3 for pool-driven prompts).
The anchors-pool structural split is the load-bearing ingredient. The champion round (the 4 strongest variants head-to-head) is decisive: V4 (rhythm rule alone, the 11-way winner) drops to LAST (7.28) because it lacks the split, while every variant that HAS it (V7/V10/V9) scores 7.77+. V4 won the wide 11-way field only because weak singletons diluted it; head-to-head it is exposed.
V9 wins the decisive head-to-head: 8.11 overall, 17 firsts, avg rank 2.07, with a clean monotonic ordering of V9 (8.11) > V10 (7.93) > V7 (7.77). Layering rhythm + variety + clustering ON TOP of the anchors split keeps improving the trip; the earlier "kitchen-sink over-constrains" read was an artifact of the wide field.
V10 (anchors + rhythm) is a near-tie with V9 and the cleanest on the deterministic metrics: 93% priority coverage (tied best), 0.05 duplicates (lowest), 0.8 gap-fill (lowest). The combination beats rhythm-alone head-to-head (V10 over V4 in 12/20 scenarios), confirming the levers are additive, not redundant.

Method

Real data. 20 scenarios, each a real prod user's actual saves for a real destination set (8 human collections + corpus). Shapes span short/medium/long, single/multi-city, dense/thin/zero-save, hearted-heavy/all-iconic, family-kids/solo/couple/friends, chill/balanced/packed, 4 seasons, day-trip-heavy, domestic + 12 countries.
11 variants from identical pools (NEW + 9 surgical-override grafts; OLD = main names-only). System prompt byte-identical (4,023 bytes), so each A/B isolates the user-message lever.
Generation: 220 Claude Sonnet subagents as the production model (8192-tok, strict schema) - no product/vendor API.
Judging: per scenario, 3 Claude Opus judges blind-rank all 11 (anonymized, rotated) on 5 dims + overall; 20/20 scenarios returned verdicts. Plus a deterministic 7-dim objective scorer.
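
The rank aggregation described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual harness: the function names, the "Plan A/B/..." labels, and the input shape (one best-first ordering per judge) are all assumptions.

```python
from collections import defaultdict
from statistics import mean


def rotate_labels(variants, shift):
    """Map anonymous labels to variants, rotating the order by `shift`.

    Hypothetical helper: rotating per judge or per scenario keeps any one
    variant from always appearing in the same slot.
    """
    shift %= len(variants)
    rotated = variants[shift:] + variants[:shift]
    return {f"Plan {chr(ord('A') + i)}": v for i, v in enumerate(rotated)}


def aggregate_rankings(rankings):
    """Turn judge orderings (best first) into average rank and first-place counts."""
    ranks = defaultdict(list)
    firsts = defaultdict(int)
    for order in rankings:
        for position, variant in enumerate(order, start=1):
            ranks[variant].append(position)
        firsts[order[0]] += 1
    return {
        variant: {"avg_rank": round(mean(r), 2), "firsts": firsts[variant]}
        for variant, r in ranks.items()
    }
```

A lower average rank is better; counting firsts separately shows whether a variant wins outright or is merely a consistent runner-up.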

What makes a good trip (research-grounded): don't over-schedule (cardinal sin); cluster by area; intensity rhythm; variety/anti-fatigue; time-of-day sequencing; light arrival/departure; personalization; feasibility is the bottleneck (academic LLM-planning benchmarks: best models ~33% feasibility, collapsing past ~10 coupled constraints).

The 11 variants

Variant	Lever	Delta on top of NEW pool
OLD	reference	main names-only prompt (USER SAVES / PRIORITY SAVES / KNOWN PLACES; model picks + invents)
NEW	reference	Ashwin branch as-is: collection pool, every HEARTED/SAVED/ICONIC MUST appear
V1	rules (minimal)	soft-priority + no-dup + same-city + don't-overload (pool unchanged)
V2	rules (pacing)	V1 + hard per-day caps by pace + full-day = +1 stop + light arrival/departure
V3	rules (clustering)	V1 + cluster each day by neighbourhood, minimize backtracking
V4	rules (rhythm)	V1 + alternate heavy/light days, mid-trip light day on 5d+
V5	rules (variety)	V1 + no two same-type anchors in a row, mix types (anti-fatigue)
V6	pool tags	V1 + per-item [FULL-DAY]/[HALF-DAY]/[QUICK] tags + time-of-day ordering
V7	structural	V1 + pool split ANCHORS vs IF TIME PERMITS (hearts/iconics never trimmed)
V8	combo	doability-max: pacing + clustering + time-of-day + intensity tags + ROUTING block
V9	combo	best-of: anchors split + intensity tags + rhythm + variety + clustering
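
As an illustration of the V7 structural lever, the sketch below splits a tagged pool into an ANCHORS section and an IF TIME PERMITS section. The section headers come from the table above; the function name, the (name, tag) input shape, and the exact rendered line format are assumptions, not the real prompt renderer.

```python
# Tags treated as anchors, following the V7 row ("hearts/iconics never trimmed").
ANCHOR_TAGS = {"HEARTED", "ICONIC"}


def render_split_pool(pool):
    """Render a list of (name, tag) pairs as two prompt sections.

    Hypothetical format: one "- name [TAG]" line per item.
    """
    anchors = [(name, tag) for name, tag in pool if tag in ANCHOR_TAGS]
    optional = [(name, tag) for name, tag in pool if tag not in ANCHOR_TAGS]
    lines = ["ANCHORS:"]
    lines += [f"- {name} [{tag}]" for name, tag in anchors]
    lines.append("IF TIME PERMITS:")
    lines += [f"- {name} [{tag}]" for name, tag in optional]
    return "\n".join(lines)
```

The point of the split is that trimming pressure lands only on the second section, so a dense pool can shrink without dropping hearted or iconic items.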

Anatomy of each variant (real prompt content)

Every variant shares an identical system prompt, trip-context, day-allocation, seasonality, and output-schema block - they differ ONLY in how the user's saves are delivered. Excerpts below are the real rendered text from scenario s6_japan_epic (Japan 10d, dense + 1 hearted).

OLD (main) - names-only

USER SAVES (collection, weave in where natural, ignore mismatches):
PRIORITY SAVES (the user hearted these - include them in the itinerary where they fit the day's city + flow, ahead of other saves): "Nishiki Market"
KNOWN PLACES IN OUR DATABASE (real, popular places we have photos + details for - STRONGLY PREFER these EXACT names when they fit the day's city + theme, so the trip shows rich detail; you may still add other well-known places where these don't cover the day):

Loose name lists; the model picks/invents and may ignore saves ("ignore mismatches").

NEW (Ashwin) - hard must-include

RULES: pick the best 3-4 items PER CITY PER DAY from that city's pool, ordered sensibly through the day. Every [HEARTED] item and every [SAVED]/[ICONIC] item MUST appear somewhere in the trip. Do NOT invent places outside the pool except thin gap-fill when a day is short - and write " (gap-fill)" at the end of the note for anything you add. Copy pool names EXACTLY.

Pool-driven, but every SAVED/ICONIC must appear - this is what over-packs dense days.

Shared soft-priority graft (the new baseline, in ALL of V1-V9)

RULES:
- PRIORITY (no forced use): schedule [HEARTED] first, then [ICONIC], then [SAVED]/[GEM]. Cover as many high-priority items as fit a COMFORTABLE day. It is better to leave a low-priority [SAVED] out than to overload a day - do NOT force every pool item in.
- ONE PLACE ONCE: never schedule the same place on two days, or twice in a day.
- STAY IN THE DAY'S CITY: every item must be in that day's city, or a well-known day-trip you return from (label any day-trip in the note).
- DON'T OVERLOAD: respect the PACE; cramming is the cardinal mistake. Order each day as a sensible route with minimal backtracking.
- GAP-FILL: invent a place only to round out a thin day, only in that day's city, never a duplicate; append " (gap-fill)" to its note. Copy pool names EXACTLY.
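
Several of these rules can be checked mechanically on a generated plan. The sketch below counts duplicates, labeled gap-fills, unlabeled out-of-pool items, and priority coverage. It is not the benchmark's 7-dim objective scorer: the plan shape (days as lists of name/note dicts), the metric names, and the definition of priority coverage as the share of [HEARTED] and [ICONIC] items scheduled are all assumptions. The same-city rule is left out because it needs location data.

```python
GAP_FILL_SUFFIX = " (gap-fill)"  # label required by the GAP-FILL rule
PRIORITY_TAGS = ("HEARTED", "ICONIC")  # assumed definition of "priority"


def check_plan(days, pool):
    """Check a plan against the pool.

    days: list of days, each a list of {"name": str, "note": str} items.
    pool: dict mapping exact pool name -> tag.
    """
    seen = set()
    duplicates = gap_fills = unlabeled_inventions = 0
    for day in days:
        for item in day:
            name = item["name"]
            if name in seen:
                duplicates += 1  # violates ONE PLACE ONCE
            seen.add(name)
            if item.get("note", "").endswith(GAP_FILL_SUFFIX):
                gap_fills += 1
            elif name not in pool:
                unlabeled_inventions += 1  # invented but not labeled
    priority = {name for name, tag in pool.items() if tag in PRIORITY_TAGS}
    coverage = len(priority & seen) / len(priority) if priority else 1.0
    return {
        "duplicates": duplicates,
        "gap_fills": gap_fills,
        "unlabeled_inventions": unlabeled_inventions,
        "priority_coverage": coverage,
    }
```

Because pool names must be copied EXACTLY, a plain string comparison is enough here; a looser prompt would need fuzzy name matching.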