2026-06-25 - OLD vs NEW vs 9 research-grounded variants - 20 real-saves scenarios - 220 Sonnet generations - blind Opus panel
Which prompt builds the best, most doable trips - using saves wisely (no forced cramming), no repeats, in-city except labeled day-trips?
What makes a good trip (research-grounded): don't over-schedule (cardinal sin); cluster by area; intensity rhythm; variety/anti-fatigue; time-of-day sequencing; light arrival/departure; personalization; feasibility is the bottleneck (academic LLM-planning benchmarks: best models ~33% feasibility, collapsing past ~10 coupled constraints).
| Variant | Lever | Delta on top of NEW pool |
|---|---|---|
| OLD | reference | main names-only prompt (USER SAVES / PRIORITY SAVES / KNOWN PLACES; model picks + invents) |
| NEW | reference | Ashwin branch as-is: collection pool, every HEARTED/SAVED/ICONIC MUST appear |
| V1 | rules (minimal) | soft-priority + no-dup + same-city + don't-overload (pool unchanged) |
| V2 | rules (pacing) | V1 + hard per-day caps by pace + full-day = +1 stop + light arrival/departure |
| V3 | rules (clustering) | V1 + cluster each day by neighbourhood, minimize backtracking |
| V4 | rules (rhythm) | V1 + alternate heavy/light days, mid-trip light day on 5d+ |
| V5 | rules (variety) | V1 + no two same-type anchors in a row, mix types (anti-fatigue) |
| V6 | pool tags | V1 + per-item [FULL-DAY]/[HALF-DAY]/[QUICK] tags + time-of-day ordering |
| V7 | structural | V1 + pool split ANCHORS vs IF TIME PERMITS (hearts/iconics never trimmed) |
| V8 | combo | doability-max: pacing + clustering + time-of-day + intensity tags + ROUTING block |
| V9 | combo | best-of: anchors split + intensity tags + rhythm + variety + clustering |
Every variant shares an identical system prompt, trip-context, day-allocation, seasonality, and output-schema block - they differ ONLY in how the user's saves are delivered. Excerpts below are the real rendered text from scenario s6_japan_epic (Japan 10d, dense + 1 hearted).
USER SAVES (collection, weave in where natural, ignore mismatches):
PRIORITY SAVES (the user hearted these - include them in the itinerary where they fit the day's city + flow, ahead of other saves):
"Nishiki Market"
KNOWN PLACES IN OUR DATABASE (real, popular places we have photos + details for - STRONGLY PREFER these EXACT names when they fit the day's city + theme, so the trip shows rich detail; you may still add other well-known places where these don't cover the day):
Loose name lists; the model picks/invents and may ignore saves ("ignore mismatches").
RULES: pick the best 3-4 items PER CITY PER DAY from that city's pool, ordered sensibly through the day. Every [HEARTED] item and every [SAVED]/[ICONIC] item MUST appear somewhere in the trip. Do NOT invent places outside the pool except thin gap-fill when a day is short - and write " (gap-fill)" at the end of the note for anything you add. Copy pool names EXACTLY.
Pool-driven, but every SAVED/ICONIC must appear - this is what over-packs dense days.
RULES:
- PRIORITY (no forced use): schedule [HEARTED] first, then [ICONIC], then [SAVED]/[GEM]. Cover as many high-priority items as fit a COMFORTABLE day. It is better to leave a low-priority [SAVED] out than to overload a day - do NOT force every pool item in.
- ONE PLACE ONCE: never schedule the same place on two days, or twice in a day.
- STAY IN THE DAY'S CITY: every item must be in that day's city, or a well-known day-trip you return from (label any day-trip in the note).
- DON'T OVERLOAD: respect the PACE; cramming is the cardinal mistake. Order each day as a sensible route with minimal backtracking.
- GAP-FILL: invent a place only to round out a thin day, only in that day's city, never a duplicate; append " (gap-fill)" to its note.
Copy pool names EXACTLY.
- DAY BUDGET: chill = 2-3 items/day; balanced = 3-4; packed = 4-5. A full-day site (theme park, big fort/palace, national park, safari) fills a day - pair it with at most 1 light nearby stop.
- LIGHT EDGES: the arrival day and the departure day are light (1-2 items); never pack a travel day.
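The DAY BUDGET and LIGHT EDGES caps are concrete enough to check mechanically. The sketch below is one way to do that and is not the report's own "Overloaded days" metric, whose definition is not given here. The `Day` shape, the `has_full_day_site` flag, and the assumption that the first and last days are the arrival and departure days are all hypothetical.

```python
from dataclasses import dataclass

# Upper per-day item caps taken from the DAY BUDGET rule text.
PACE_CAPS = {"chill": 3, "balanced": 4, "packed": 5}
LIGHT_EDGE_CAP = 2  # LIGHT EDGES: arrival/departure days hold 1-2 items
FULL_DAY_CAP = 2    # a full-day site plus at most 1 light nearby stop


@dataclass
class Day:
    items: list                      # scheduled item names for the day
    has_full_day_site: bool = False  # hypothetical flag, set by the caller


def overloaded_days(days, pace):
    """Return 1-based day numbers whose item count exceeds the rule caps.

    Assumes days[0] is the arrival day and days[-1] is the departure day.
    """
    flagged = []
    last = len(days) - 1
    for i, day in enumerate(days):
        limit = PACE_CAPS[pace]
        if i == 0 or i == last:
            limit = min(limit, LIGHT_EDGE_CAP)
        if day.has_full_day_site:
            limit = min(limit, FULL_DAY_CAP)
        if len(day.items) > limit:
            flagged.append(i + 1)
    return flagged
```

A mid-trip travel day between cities would need its own flag; this sketch only treats the trip's two edge days as light.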
- CLUSTER BY AREA: build each day around ONE neighbourhood/zone so lunch and the afternoon are near the morning anchor. Minimize cross-town backtracking.
- INTENSITY RHYTHM: alternate heavier and lighter days - energy does not reset overnight, so avoid two heavy days back-to-back. On trips of 5+ days make one mid-trip day deliberately light.
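The rhythm rule can also be expressed as a small check over per-day load. This is a rough sketch under stated assumptions: the prompt does not define "heavy" or "light" numerically, so the `heavy_at` and `light_at` thresholds (items per day) are illustrative, and "mid-trip" is taken to mean any day other than the first or last.

```python
def rhythm_issues(day_loads, heavy_at=4, light_at=2):
    """Flag violations of the INTENSITY RHYTHM rule.

    day_loads: number of scheduled items per day, in trip order.
    heavy_at / light_at: hypothetical thresholds, not taken from the prompt.
    """
    issues = []
    # Two heavy days back-to-back.
    for i in range(1, len(day_loads)):
        if day_loads[i - 1] >= heavy_at and day_loads[i] >= heavy_at:
            issues.append(f"days {i} and {i + 1} are both heavy")
    # Trips of 5+ days need at least one deliberately light interior day.
    if len(day_loads) >= 5:
        interior = day_loads[1:-1]
        if not any(load <= light_at for load in interior):
            issues.append("no light mid-trip day")
    return issues
```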
- VARIETY: do not stack two same-type anchors in a row (e.g. two temples, two museums). Mix each day across types - a sight, some nature/outdoors, a neighbourhood walk, a hands-on experience - to avoid sightseeing fatigue.
- TIME-OF-DAY: order each day by fit - markets and sunrise spots early; viewpoints/sunsets late; indoor museums for midday. Each pool item is tagged [FULL-DAY]/[HALF-DAY]/[QUICK]; budget the day by those (one [FULL-DAY] is most of a day).
Tokyo (days 1,2,3,4): [ICONIC] [HALF-DAY] "Tsukiji Outer Market" [ICONIC] [HALF-DAY] "Senso-ji Temple" [SAVED] [HALF-DAY] "Akihabara Electric Town" [SAVED] [HALF-DAY] "Robot Restaurant Show — Shinjuku" [SAVED] [HALF-DAY] "Golden Gai — Shinjuku" [SAVED] [HALF-DAY] "teamLab Borderless" [SAVED] [HALF-DAY] "Shibuya Crossing & Hachiko" [GEM] [HALF-DAY] "Kyu-Furukawa Gardens" [GEM] [HALF-DAY] "Ukima Park"
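For anyone post-processing these excerpts, the tagged pool format is regular enough to parse with a short regular expression. The sketch below assumes each item is a run of bracketed upper-case tags followed by a double-quoted name, as in the excerpt above; the function name and the returned dictionary shape are hypothetical.

```python
import re

# One pool item: one or more [TAG] markers, then a double-quoted name.
ITEM_RE = re.compile(r'((?:\[[A-Z-]+\]\s*)+)"([^"]+)"')
TAG_RE = re.compile(r"\[([A-Z-]+)\]")

PRIORITIES = {"HEARTED", "ICONIC", "SAVED", "GEM"}
INTENSITIES = {"FULL-DAY", "HALF-DAY", "QUICK"}


def parse_pool(text):
    """Extract pool items from a rendered pool block (one line or many)."""
    items = []
    for tag_blob, name in ITEM_RE.findall(text):
        tags = TAG_RE.findall(tag_blob)
        items.append({
            "name": name,
            # First matching tag of each kind; None when the pool omits it.
            "priority": next((t for t in tags if t in PRIORITIES), None),
            "intensity": next((t for t in tags if t in INTENSITIES), None),
        })
    return items
```

The same function handles the untagged pools, where the intensity simply comes back as `None`.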
- ANCHORS FIRST: each city's pool is split into ANCHORS (do these first) and IF TIME PERMITS. Fill each day from ANCHORS first; pull from IF TIME PERMITS only to round out a lighter day. Never exceed a comfortable day to cram more in.
Tokyo (days 1,2,3,4):
ANCHORS (schedule these first):
[ICONIC] "Tsukiji Outer Market"
[ICONIC] "Senso-ji Temple"
[ICONIC] "Kinkaku-ji"
[ICONIC] "Yasaka Shrine"
[SAVED] "Akihabara Electric Town"
[SAVED] "Robot Restaurant Show — Shinjuku"
[SAVED] "Golden Gai — Shinjuku"
[SAVED] "teamLab Borderless"
- DAY BUDGET: chill = 2-3 items/day; balanced = 3-4; packed = 4-5. A full-day site (theme park, big fort/palace, national park, safari) fills a day - pair it with at most 1 light nearby stop.
- LIGHT EDGES: the arrival day and the departure day are light (1-2 items); never pack a travel day.
- CLUSTER BY AREA: build each day around ONE neighbourhood/zone so lunch and the afternoon are near the morning anchor. Minimize cross-town backtracking.
- TIME-OF-DAY: order each day by fit - markets and sunrise spots early; viewpoints/sunsets late; indoor museums for midday. Each pool item is tagged [FULL-DAY]/[HALF-DAY]/[QUICK]; budget the day by those (one [FULL-DAY] is most of a day).
Tokyo (days 1,2,3,4): [ICONIC] [HALF-DAY] "Tsukiji Outer Market" [ICONIC] [HALF-DAY] "Senso-ji Temple" [SAVED] [HALF-DAY] "Akihabara Electric Town" [SAVED] [HALF-DAY] "Robot Restaurant Show — Shinjuku" [SAVED] [HALF-DAY] "Golden Gai — Shinjuku" [SAVED] [HALF-DAY] "teamLab Borderless" [SAVED] [HALF-DAY] "Shibuya Crossing & Hachiko" [GEM] [HALF-DAY] "Kyu-Furukawa Gardens" [GEM] [HALF-DAY] "Ukima Park"
ROUTING & FEASIBILITY (build a DOABLE day, then fill it):
- For each day first pick the anchor: one [FULL-DAY], or up to two [HALF-DAY] sights in the same area.
- Then add nearby [QUICK] stops that are on the way - cluster by zone, minimize backtracking.
- Sequence by time-of-day (markets AM, viewpoints/sunset PM) and respect opening hours.
- Honour SEASONALITY: avoid weather-exposed sites in the worst window for the month; keep heavy days off arrival/departure.
- ANCHORS FIRST + CLUSTER: fill each day from ANCHORS first (IF TIME PERMITS only rounds out a light day), and build the day around one zone to minimize backtracking.
- INTENSITY RHYTHM: alternate heavier and lighter days - energy does not reset overnight, so avoid two heavy days back-to-back. On trips of 5+ days make one mid-trip day deliberately light.
- VARIETY: do not stack two same-type anchors in a row (e.g. two temples, two museums). Mix each day across types - a sight, some nature/outdoors, a neighbourhood walk, a hands-on experience - to avoid sightseeing fatigue.
Tokyo (days 1,2,3,4):
ANCHORS (schedule these first):
[ICONIC] [HALF-DAY] "Tsukiji Outer Market"
[ICONIC] [HALF-DAY] "Senso-ji Temple"
[ICONIC] [HALF-DAY] "Kinkaku-ji"
[ICONIC] [HALF-DAY] "Yasaka Shrine"
[SAVED] [HALF-DAY] "Akihabara Electric Town"
[SAVED] [HALF-DAY] "Robot Restaurant Show — Shinjuku"
[SAVED] [HALF-DAY] "Golden Gai — Shinjuku"
[SAVED] [HALF-DAY] "teamLab Borderless"
| # | Variant | Overall | Doability | Geo-efficiency | Rhythm+Variety | Selection | Narrative | Borda | 1st | avgRank |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | V4 rules (rhythm) | 7.92 | 7.89 | 8.06 | 8.04 | 7.62 | 8.01 | 415 | 7 | 4.08 |
| 2 | V9 combo | 7.76 | 7.59 | 7.65 | 7.80 | 7.73 | 8.03 | 349 | 9 | 5.18 |
| 3 | V7 structural | 7.75 | 7.54 | 7.76 | 7.59 | 7.91 | 7.96 | 406 | 9 | 4.23 |
| 4 | V8 combo | 7.71 | 7.81 | 7.82 | 7.79 | 7.27 | 7.86 | 354 | 8 | 5.10 |
| 5 | V1 rules (minimal) | 7.59 | 7.35 | 7.57 | 7.58 | 7.54 | 7.92 | 335 | 7 | 5.42 |
| 6 | V2 rules (pacing) | 7.59 | 7.46 | 7.90 | 7.52 | 7.20 | 7.86 | 299 | 5 | 6.02 |
| 7 | V3 rules (clustering) | 7.49 | 7.15 | 7.57 | 7.42 | 7.39 | 7.92 | 306 | 5 | 5.90 |
| 8 | V5 rules (variety) | 7.44 | 7.17 | 7.52 | 7.45 | 7.15 | 7.92 | 280 | 5 | 6.33 |
| 9 | V6 pool tags | 7.43 | 7.07 | 7.68 | 7.28 | 7.25 | 7.89 | 273 | 3 | 6.45 |
| 10 | NEW reference | 7.01 | 6.29 | 7.12 | 6.74 | 7.08 | 7.84 | 221 | 2 | 7.32 |
| 11 | OLD reference | 5.96 | 5.16 | 5.86 | 6.13 | 5.22 | 7.43 | 62 | 0 | 9.97 |
Overall = mean of the 5 dimension means. Borda = rank points summed across all judges (higher better). avgRank = mean placement (1 best).
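The three aggregates can be reproduced with a few lines of Python. This is a minimal sketch, not the harness that produced the table: the function names and the input shape (each ranking as a best-to-worst list of variants) are hypothetical, and the Borda rule of awarding N - place points per ranking is an inference. The published figures are consistent with that rule applied over 60 rankings (for V4, 60 × (11 - 4.08) ≈ 415), but the report does not state it.

```python
from statistics import mean


def overall(dimension_means):
    """Overall = mean of the five per-dimension means."""
    return mean(dimension_means)


def borda_and_avg_rank(rankings):
    """Aggregate rankings into {variant: (borda_points, avg_rank)}.

    rankings: list of lists, each ordering all variants best to worst.
    Assumed scoring: a variant in place p of N earns N - p points.
    """
    n = len(rankings[0])
    points, places = {}, {}
    for ranking in rankings:
        for place, variant in enumerate(ranking, start=1):
            points[variant] = points.get(variant, 0) + (n - place)
            places.setdefault(variant, []).append(place)
    return {v: (points[v], mean(places[v])) for v in points}
```

For example, `overall([7.89, 8.06, 8.04, 7.62, 8.01])` rounds to 7.92, matching the V4 row.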
| Variant | Priority cov | Saves used | Dups | Rest-as-activity | City-lock | Empty days | Overloaded days | Gap-fill | Out-of-city |
|---|---|---|---|---|---|---|---|---|---|
| OLD | 81% | 66% | 0.2 | 1.15 | 100% | 0 | 0.15 | 6.2 | 0.4 |
| NEW | 90% | 86% | 0.15 | 0.3 | 100% | 0 | 0.1 | 0.7 | 0.2 |
| V1 | 91% | 80% | 0.1 | 0.35 | 100% | 0 | 0.05 | 1.3 | 0.15 |
| V2 | 86% | 74% | 0.05 | 0.35 | 100% | 0 | 0 | 1 | 0.1 |
| V3 | 89% | 79% | 0.05 | 0.35 | 100% | 0 | 0.05 | 1.45 | 0.25 |
| V4 | 86% | 78% | 0.05 | 0.35 | 100% | 0 | 0.1 | 1.3 | 0.15 |
| V5 | 86% | 77% | 0 | 0.4 | 100% | 0 | 0 | 1.8 | 0.2 |
| V6 | 87% | 75% | 0.1 | 0.35 | 100% | 0 | 0 | 1.1 | 0.1 |
| V7 | 94% | 78% | 0.1 | 0.35 | 100% | 0 | 0 | 1.35 | 0.15 |
| V8 | 86% | 73% | 0.05 | 0.3 | 100% | 0 | 0.1 | 0.65 | 0.1 |
| V9 | 92% | 75% | 0.2 | 0.4 | 100% | 0 | 0 | 1.5 | 0 |
Priority cov = % of HEARTED+ICONIC saves scheduled. Saves used = % of all day-eligible saves scheduled (context, not a target - smart omission is good). Rest-as-activity (restaurant-as-activity) = saved eateries wrongly scheduled as day items (the OLD bug). Lower is better for Dups, Rest-as-activity, Empty days, Overloaded days, Gap-fill, and Out-of-city.
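As an illustration of how the first three columns could be computed for a single trip, here is a small sketch. It assumes exact name matching between the pool and the itinerary (the prompts require pool names to be copied exactly) and that the pool passed in contains only day-eligible saves. The `trip_metrics` name and its input shapes are hypothetical.

```python
from collections import Counter


def trip_metrics(pool, scheduled):
    """Compute coverage metrics for one generated trip.

    pool: {place name: priority tag}, day-eligible saves only (assumed).
    scheduled: every scheduled item name, across all days, in order.
    """
    used = set(scheduled)
    priority = {name for name, tag in pool.items() if tag in ("HEARTED", "ICONIC")}
    counts = Counter(scheduled)
    return {
        # None when the scenario has no hearted/iconic saves at all.
        "priority_cov": len(priority & used) / len(priority) if priority else None,
        "saves_used": len(used & set(pool)) / len(pool) if pool else None,
        # Each extra appearance of an already scheduled place counts once.
        "dups": sum(c - 1 for c in counts.values() if c > 1),
    }
```

Averaging these per-trip values over scenarios would give table-style figures, though whether the report aggregates this way is not stated.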
The decisive differentiator is the hearted must-have The Ancient City (Muang Boran), which only P9 schedules - and it does so sanely, isolating it on day 3 with just Chinatown and the evening flight south, while still covering Wat Saket, Jim Thompson, BACC, Chatuchak, Grand Palace, all three Phuket iconics, and both Krabi anchors; its only real flaw is scheduling Cafe 8 98 (a cafe) as a day item, which costs it some selection points but not the top spot given it alone honors the full hearted set. P8, P6, and P3 form a tight second tier: each covers every must-have EXCEPT The Ancient City, keeps days well-clustered and well-paced (light arrival/departure, beach-and-temple alternation, no restaurants scheduled), and writes specific non-cliche summaries - P8/P6 edge P3 slightly on rhythm. P11 and P5 match that coverage and P11 even adds IconSiam, but P11 also schedules Cafe 8 98 as a day item. The bottom of the field fails on doability and coverage: P1 is weakest - it omits THREE must-haves (The Ancient City hearted, Freedom Beach iconic, Tiger Cave hearted) and crams a ferry, a longtail Railay run, and the 1,237-step Tiger Cave climb onto a single travel day. P2 is nearly as weak, missing two hearted items (Wat Saket and The Ancient City) and overloading its departure day with a dawn Tiger Cave climb plus a full James Bond Island day-trip. P7 misses Wat Tham Suea as a distinct item and squeezes a sunset Tiger Cave climb into a ferry-arrival day, while P4 and P10 each drop an iconic (Karon Beach) and The Ancient City. Across the board the strongest plans separated themselves by must-have completeness and by refusing to bolt full-day boat trips or step-climbs onto arrival/departure days.
The central trap is Villa del Balbianello: a hearted must-have nominally filed under Rome but physically on Lake Como (4+ hours each way), so a Rome day-trip is infeasible. The strongest plans either omit it cleanly while staying well-paced (P2, P5) or include it with honest acknowledgment of its infeasibility (P4 flags it as a pre-trip add-on rather than faking a normal day). P2 leads because every Amalfi/Capri must-have lands (Duomo, Marina Grande Beach, Gardens of Augustus, plus the saved Piazzetta/Faraglioni/Marina Piccola), pace is balanced with a light arrival day, no restaurant is mis-scheduled, and routing is feasible; its only blemish is the omitted Villa. P5 mirrors this with correct Salerno-ferry routing. P4 honestly handles Villa (rewarded on selection_judgment) but pays in doability with a 4-item d2 and a sprawling Ravello+Vietri+Amalfi d4. The weakest tier collapses on doability and coverage: P7, P9, and P10 each drop TWO must-haves (Villa plus the iconic Marina Grande Beach), and P10/P9 pair that with sprawling 4-item Amalfi days. P11 is worst by a wide margin -- it crams five anchors into d2 (full Vatican Museums + Sistine + St Peter's + Piazza Navona + Pantheon + Spanish Steps), an exhausting and infeasible day, AND it schedules two restaurants as day items (Ristorante Belvedere on d4, the Michelin Il Riccio on d6), a direct violation of the food-rail rule, while still missing Villa. P1 and P6 sit mid-low because they fabricate Villa feasibility -- P1 falsely calls Como "a feasible day trip from Rome by high-speed train" and P6 calls it "Rome's garden escapes... a short transfer out of the city centre" -- which is a serious doability/geography fault even though it preserves nominal must-have coverage.
The single strongest signal is the three ICONIC must-haves (Osaka Aquarium Kaiyukan, Umeda Sky Building, Universal Studios Japan). Only P3, P7, P9 and P11 schedule all three; the rest (P1, P2, P4, P5, P6, P8, P10) drop USJ entirely - a full-day iconic anchor - which is a heavy must-have miss that caps them. P9 wins because it not only covers all three iconics but treats USJ correctly: a dedicated Day 2 with only Shinsekai appended (light, feasible), a genuinely balanced Day 3 (Kaiyukan, Tempozan, America Mura, Kuromon), and a sane 3-stop final Kyoto day - the best pace/coverage balance, with smart omission of low-priority saves rather than cramming. P7 and P3 also cover all iconics but pack their USJ day with extra stops (P7 adds Sumiyoshi+Kuromon; P3 adds Umeda+Nakanoshima+America Mura), slightly straining the theme-park day. P11 has the richest must-have coverage (all iconics plus Amazing Pass, Abeno Harukas, teamLab) but pays for it in doability: it crams USJ together with Kaiyukan and Tempozan on one waterfront day and stacks a five-item Day 3, the exact over-scheduling the rubric penalizes. The weakest is P10: it omits USJ, invents out-of-pool filler (Osaka Castle, an entire "Dotonbori Nightlife"/"Dotonbori Street Food" pseudo-item, Tsutenkaku Tower, Yasaka Pagoda), and bloats Day 3 to five heavy anchors including Abeno Harukas, Umeda, Tempozan and Kaiyukan back-to-back - poor doability, weak selection judgment, and the most generic narrative. No plan scheduled a restaurant/cafe/bar as a day item, so none was penalized on that axis.
All eleven plans honor the city/day allocation and schedule every ICONIC must-have (Hawa Mahal, Mehrangarh, Umaid Bhawan, Jaisalmer Fort, Sam Dunes Camel Safari), and none schedule a restaurant as a day item, so differentiation came down to pacing, day-trip sanity, and restraint with SAVED items. P2 leads: it covers all five iconics plus a sensible saved subset (bazaars, Amber, Blue City walk, Bishnoi, Suryagarh, stargazing), keeps arrival/departure light, and its notes are the most concrete (distances, gate times, why-now reasoning) without cramming. P10, P7, and P11 are nearly identical in quality with clean clustering and balanced days. P5 dips slightly because day 5 stacks a full-day Bishnoi safari onto a 5.5-hour evening drive arriving "by night," compressing the rhythm. P9 is decisively the weakest: it loads day 6 with Patwon Ki Haveli + War Museum + camel safari, inserts a 65km Osian detour on a travel day, and most damningly schedules Tanot Mata Temple (150km each way toward the Pakistan border, a ~300km round trip) plus a hot-air-balloon and a City Palace on a 7-day budget couple's departure day — geographically and time-wise infeasible, and it forces in many unrequested heavy stops rather than exercising the smart omission the brief rewards.
The corpus is unusually tight on must-haves: nearly every plan correctly groups the Chiba/Mount Nokogiri saved cluster (Tokyo Wan Ferry, Ropeway, Jigoku Nozoki, Hyaku-shaku Kannon, Nihon-ji, Hamakanaya) and the northern-Kyoto Tango cluster (Amanohashidate, Viewland, Ine, Chionji), so ranking turns on (a) how cleanly those clusters sit on a single sane round-trip day and (b) whether the iconic/gem anchors (TeamLab Planets, Edo Museum, Shin-Fuji Station, Nishiki Market, Ine) all land. P7 leads: it covers all 18 must-haves including the full Tango cluster on one well-paced Day 7, keeps anchors intact, and is on-voice; its only real flaw is splitting Nokogiri across two Tokyo days (two bay crossings). P2 is the cleanest single-day execution of both clusters with strong pacing and an honest "save Ine for next time" hedge, dropping only three low-priority saved (Nike, Hamakanaya, Hyaku-shaku). P11 and P9 also cover the clusters well on dedicated days but each slips on one iconic-ish item (P11 drops Shin-Fuji + Hamakanaya; P9 crams TeamLab + the ferry + Hamakanaya onto the arrival day). P3 is the most realistic, beautifully paced plan on the board but it is the cautionary case the rubric punishes: it silently DROPS the gem Ine, the entire Amanohashidate cluster, AND iconic Shin-Fuji in favor of generic Fushimi/Philosopher's-Path filler, so its selection_judgment cripples it despite top doability. The weakest plans fail on substance, not polish: P6 crams a Harajuku-Nike morning with a south-bay Nokogiri ferry day-trip (geographic whiplash) and loses Ine/Nihon-ji/Ropeway; P10 and P5 pair the far-west Edo museum with the far-southeast Chiba ferry on one day and strand Ine on a departure day; and P8 is disqualifying — it breaks the day allocation (3 Tokyo / 2 Hakone), omits the ENTIRE Nokogiri cluster, Shin-Fuji, Ine, and the Tango cluster, and pads with generic Kinkaku-ji/Kiyomizu/Byodo-in/kimono-rental tourism, scoring lowest on must-have coverage and selection.
Coverage of the 12 critical hearted/iconic anchors (Tokyo: Tsukiji, Senso-ji; Kyoto: Nishiki Market, Fushimi Inari, Kinkaku-ji, Yasaka Pagoda, Kiyomizu-dera, Byodo-in; Osaka: Dotonbori, Umeda Sky, USJ, Osaka Castle) is the headline. P11, P3 and P10 are the strongest: each respects the user's 4/3/3 city split, hits all or all-but-one iconic anchor, keeps USJ as a clean dedicated full day, paces arrival/departure lightly, and uses a labeled Uji day-trip for Byodo-in correctly. P11 edges ahead on clean per-day clustering (Higashiyama lanes grouped, Uji as a single afternoon hop) and full anchor coverage; P3 mirrors it with sunrise Fushimi Inari and a tidy Uji trip; P10 matches but slightly over-stacks day 7 (Kinkaku-ji + Sanjusangen-do + a full Uji day trip is a stretch). P1 is solid but drops the iconic Byodo-in/Uji entirely. The weakest are P7 and P9. P7 breaks the allocation outright (Kyoto 4 / Osaka 2), which forces it to drop the iconic Universal Studios Japan completely and pad days with non-saved filler (Osaka Aquarium, Abeno Harukas, Tempozan Ferris Wheel, Ryoan-ji, Arashiyama) - a major must-have miss plus over-reach. P9 crams its full-day theme park: day 9 stacks Umeda Sky Building, Shinsekai AND a half-day Universal Studios visit, which is unrealistic, while omitting the iconic Kiyomizu-dera and duplicating Shibuya (Crossing + Scramble Crossing on the same day). Between them P9 ranks above P7 only because it at least keeps USJ on the itinerary and honors the 4/3/3 split. No plan scheduled a sit-down restaurant as a day item (Dotonbori/Tsukiji/Nishiki are food-streets/markets, correctly allowed), so no plan was penalized on that axis.
All 11 plans correctly treat the saved places as activities (no restaurants are mis-scheduled) and stay within London, so the spread comes down to must-have coverage (15 distinct score-2 iconic+gem places, treating "Big Ben" and "Big Ben & Houses of Parliament" as one), pacing on a packed solo trip, and geographic sanity. P4 wins: 14/15 must-haves (only Conduit Mews is absent, and it actually includes Conduit on day 3), tight area clusters (Westminster, South Bank+West End lanes, royal-parks-to-Little-Venice, north heights, then a deliberately light 3-stop departure day grouping St Dunstan/Wapping/St Katharine), with no cross-city zigzags. P5 is a close second - 13/15, clean clusters, and the only plan besides P11 that builds in a labeled mid-trip "rest" day, which suits a packed itinerary's rhythm. P11 has the best raw coverage tie (14/15, missing only St Dunstan) plus a rest day, but its day-5 departure crams 7 items across Hampstead → Bayswater mews → Hyde/Green Park, dragging doability. The weakest plans fail on coverage and pacing: P2 misses four must-haves (Buckingham Palace, St Dunstan, Wapping, Hertford Union Canal) for the lowest selection_judgment; P6 only schedules 12 must-haves (it buries Conduit and Bathurst Mews inside a Paddington note rather than as items, drops Wapping, and stacks a 6-stop "departure" day ending with Buckingham Palace as the final pre-Heathrow stop, which is geographically backwards); P9 is well-paced but also leaves four must-haves unscheduled. P7 has strong 14/15 coverage but is dragged down by a brutal final-day zigzag (far-north Hampstead → west Bayswater mews → far-east Wapping/St Katharine, 7 items), the worst geographic efficiency of the top-coverage group.
All eleven plans share the same correct city/day skeleton, so ranking turns on must-have coverage, pacing, and clean execution. P4 leads: it captures all 5 Rome iconics (incl. the Spanish Steps), all 6 Dolomites iconics, all 4 Positano iconics, and 5 of 6 Como villas, while explicitly flagging Day 7 as a lighter "rest" day after the heavy Tre Cime circuit and keeping arrival/departure days reasonable - the textbook balance the "balanced" pace asks for. P2 and P8 are nearly as strong: both achieve full Rome/Dolomites/Positano iconic coverage with deliberately light arrival days (P2's Day-5 Val Badia settle-in, P8's two-item Day-1) and sensible villa subsets; P8 loses a hair for stacking four lakes including the Sorapis hike on Day 6. The mid-pack (P1, P7, P6) hit all the Rome iconics but pay for it with heavier days - P1's three-major Day 6 plus a 7-8h transfer day, P7's four-anchor Rome Day 2 - and P6 commits a real error by scheduling Rome's Belvedere Cederna as a Lake Como item (out-of-city). The weakest tier drops required iconics: P3 omits the Colosseum and Dobbiaco; P9, P10 and P11 all skip the Spanish Steps (an iconic) and Arienzo, with P10 additionally missing Lago di Sorapis. P5 is clearly last - it breaks the prescribed day allocation (only 2 Dolomites days vs. 3, padding Positano to 3), overloads Rome Day 2 with five marquee sites, crams four Dolomite anchors into a single day while still missing Sorapis, drops the Spanish Steps, and schedules a "Lakeside Aperitivo" as a day item (a food/bar entry that should never be a scheduled activity). Restaurant-rail discipline was otherwise clean across the field; none of the legitimately ranked plans forced low-value saved items, and smart omission of unplaceable noise (the Florence-based Officina perfumery mis-tagged under Dolomites, the redundant sixth Como villa) was correctly rewarded rather than penalized.
The decisive axes here are coverage of the 7 ICONIC must-haves (Kandy View Point; Udawalawe NP, Ravana Falls, Nine Arches; Labookellie Tea Estate, Hanuman Temple, Ramboda Falls) and day-4 pacing around the Udawalawe safari. P1 leads: it carries all seven ICONIC, places Udawalawe alone with only an easy Little Adam's Peak on the safari day, keeps the Kandy and Nuwara Eliya days clustered and varied, schedules no restaurants as day items, and reads on-voice without brochure cliches. P3 and P2 are close behind with full ICONIC coverage and the same sane Udawalawe + Little Adam's day-4 structure (P3 slightly cleaner geographically; P2 mislabels the Ella-to-Nuwara-Eliya train ride as "Ella Rock"). The middle pack (P6, P8, P9, P5, P7) all cover the ICONIC set but stack a 3-4h Ella Rock hike with the Udawalawe safari (and sometimes Little Adam's too) on the same day, which is genuinely exhausting and the headline doability fault. The weakest plans fail on must-haves or overload: P11 crams two hikes plus a FULL-day Udawalawe safari into day 4 (physically impossible as written); P10 omits Udawalawe entirely (a missing ICONIC) while doubling up hikes; and P4 is worst, missing two ICONIC (Udawalawe and Hanuman Temple) while substituting a Horton Plains/World's End full-day padded with Hakgala and Peradeniya, over-scheduling its national-park day and ignoring the user's actual saved set.
All eleven share the same correct three-city structure (Ubud/Canggu/Uluwatu, 2-2-2) and most lead with a light arrival, so differentiation comes from iconic must-have coverage, correct placement, and respecting the chill pace for a couple. The decisive must-have is Blue Point Beach, which is ICONIC in Uluwatu (on the Bukit Peninsula, NOT Canggu). P9 is strongest: it is the only plan that covers ALL eight distinct iconic must-haves (Tirta Empul, Tegallalang, Monkey Forest, The Lawn, Potato Head, Savaya, Padang Padang, Uluwatu/Kecak) AND places Blue Point in Uluwatu correctly, with chill, well-clustered days and clean narrative (its only ding is a 3-item departure day). P10/P2/P8/P7/P11 all cover the seven non-Blue-Point iconic items cleanly at a relaxed pace, differing mainly in whether Tanah Lot is folded into the Canggu drive vs a full day; they only miss the lower-priority Blue Point. The weak tail is clear: P5 and P6 mis-place Blue Point in Canggu (an out-of-city error since it sits on the southern peninsula), costing geographic efficiency; P4 covers everything including Blue Point but cripples the chill brief by stacking a 4am Mount Batur trek + quad biking + a river canyon on a single Day 2, the antithesis of a couple's slow trip; and P3 is weakest of all, omitting TWO iconic must-haves (Potato Head and Savaya) and substituting non-pool venues (La Brisa, Old Man's, Rock Bar, Nyang Nyang) that the user never saved, plus Old Man's reads as a surf bar pulled in as a day item. None scheduled true restaurants as day items, and the beach clubs are legitimate iconic anchors here, so the food-rail rule was not triggered.
This is a THIN case with zero hearted/iconic must-haves, so ranking turns on doability, smart subset selection without cramming, and avoiding the three concrete failure modes: full-day-trip overload (Cu Chi half-day and My Son half-day each effectively eat a half/whole day), duplicate places, and restaurants/bars scheduled as day items. P7 wins because it is the only plan that deliberately builds in a lighter mid-trip day (Day 6 typed "rest": My Son trip + a single relaxed night market), keeps Cu Chi+War Remnants as a sane standalone history day, runs light arrival/departure days, never duplicates, and schedules no food items. P4 and P1 are nearly as clean - balanced, no duplicates, no restaurants, coherent geographic clusters - P4 just slightly edges P1 on rhythm and P1 keeps a couple of three-item days that lean fuller. P5 is solid but its Day 8 crams a half-day Cu Chi trip with the War Remnants Museum and the Opera House. The weakest plans fail on concrete grounds: P2 is worst - it schedules a cafe (Egg Coffee at Giang Cafe) and a food hall (Ben Thanh Street Food Market) plus a bar street (Bui Vien) as day items, doubles up two pottery villages on one overloaded four-item Hoi An day, and front-loads a five-item history slog on Day 2. P6 and P8 both schedule a redundant second "Coconut Boat Tour" gap-fill the day after a Bay Mau basket-boat tour, and P6 additionally crams a half-day Cu Chi trip onto the departure-day flight. P3 stacks a near-duplicate "War Museum" gap-fill alongside the War Remnants Museum AND Cu Chi on one day. P9 and P10 each push a heavy three-anchor war day or a Cu Chi run onto a travel/departure day. Across all eleven, narrative quality is uniformly strong and on-voice with little brochure cliche, so it barely separates the field; doability and cram-avoidance do the real sorting.
Both ICONIC must-haves are The Dubai Mall & Aquarium and Dubai Frame; the rest are saved items where smart subsetting is fine. The two structural traps are (a) cramming the full-day desert safari (afternoon pickup, returns ~10pm) or the Atlantis Aquaventure water park with extra heavy stops, and (b) burying an iconic. P6 is the clear winner: both iconics present, every saved item placed, and clean area-clustered days that give the desert safari its own day (old-Dubai morning + afternoon pickup) and Atlantis its own Palm day, with a light beach/Frame departure. P1, P3 and P10 are close behind with both iconics, sane safari days, and well-paced light arrival/departure days. By contrast, P5 and P11 collapse on doability: P5 stacks Atlantis (2-3h) AND the full desert safari into a single day, and P11 pairs the Atlantis water park with a metro hop to Deira's Gold Souk and Al Fahidi - both over-cram an anchor and force backtracking across the city. P7 dilutes Atlantis into a lobby/exterior stroll and runs two 4-item days (including a safari day padded with a Marina loop), weakening doability. P4 is the worst on selection_judgment because it omits the iconic Dubai Frame entirely, a heavy penalty despite otherwise sound pacing. P2, P8 and P9 schedule the desert safari on the departure day (returns late, then fly home) which is the only blemish on otherwise complete, well-clustered itineraries. No plan scheduled a restaurant/cafe/bar as a day item; the fountain shows and abra crossing are legitimate free attractions, not penalized.
All four iconic must-haves (Jewel Changi, Gardens by the Bay, Supertree Grove, Merlion Park) are the deciding axis, and no plan schedules a restaurant/cafe as a day item, so judging turns on pace, clustering, and avoiding cram. The strongest plans, P3 and P5, hit all four iconics, keep arrival/departure days deliberately light (P3 runs 2-item bookend days; P5 marks its lone gap-fill honestly), and exercise smart omission of low-priority saved items rather than forcing all eight per city - matching the balanced, kids-of-two brief in June heat where cooled conservatories are sensibly mid-day anchors. P9 is similarly disciplined (light bookends, all iconics) but trims a touch more aggressively. P1, P7, and P8 cover all iconics with rich saved coverage but pay a doability tax for stacking two 90-minute museums (ArtScience + teamLab) onto a departure day. P4 and P10 weaken on geographic efficiency by pairing a "Sentosa Merlion" with mainland Marina Bay items in the same day, implying real back-and-forth. P6 actually omits "Gardens by the Bay" as a named iconic and over-stuffs its transition day. The clear weakest is P11: it violates the 2+2 day allocation (3 days city + 1 Sentosa), drops the iconic Merlion Park entirely and never names Gardens by the Bay, invents an entire wildlife day (Zoo, River Wonders, Night Safari, Botanic Gardens) absent from the user's pool, and crams Universal Studios with a beach, Cloud Forest and Supertree onto a single departure day - an exhausting, off-brief, must-have-missing day that no balanced family could realistically execute.
All 11 plans cover the 5 Cairo iconics (Pyramids, Sphinx, Egyptian Museum, Citadel, Khan el-Khalili) and the easy Luxor iconics; the discriminators are the awkward 6th Luxor iconic (Sahara Desert Safari), out-of-city errors, and departure/arrival-day overloading. The strongest plan, P3, schedules all 5 Cairo iconics plus 5/6 Luxor iconics (only the ill-fitting Sahara Safari is omitted), commits zero out-of-city errors, uses genuinely on-route gap-fills (Colossi of Memnon on the VoK return road, Corniche walk between morning Karnak and night Luxor Temple), and paces temple days with correct early/dusk timing and a light departure day. P4, P9, P6 and P2 are close behind: clean geography, alternating intensity, and smart omission of the low-value Sahara - they trade only that one odd iconic for excellent doability. P7 and P11 are the only plans (besides flawed P1) to land all 6 Luxor iconics, but pay for the Sahara by overloading - P11 crams an Edfu day-trip plus a desert safari onto the day-7 departure, and P7 stacks Karnak + a half-day 4WD safari then a VoK+Hatshepsut+Edfu mega-day. The weakest plans fail on feasibility and geography: P1 hallucinates "Giza Necropolis" as a Luxor West Bank viewpoint (a city away) and crams Edfu + Sahara onto departure day; P5 and P8 both misplace Cairo's National Museum of Egyptian Civilization as a Luxor item and P5 also front-loads VoK+Hatshepsut+a Cairo museum onto the Luxor arrival day; and P10 lands at the bottom by missing the Sahara iconic entirely while inventing a north-bound Dendera day-trip on the departure day and overloading day 6 with Karnak + Luxor Museum + a 3-hour Edfu round-trip. No plan scheduled a restaurant as a day item (Nile Dinner Cruise is a legitimate evening experience), so that rule did not separate them.
All 11 plans cover every one of the 11 hearted/iconic must-haves and correctly treat "Lima Ceviche & Food Tour" as an experience (not a penalizable restaurant day-item), so differentiation comes down to doability, duplicate-avoidance, and Rainbow Mountain restraint. The strongest plans (P4, P6) keep the brutal 5,200m Vinicunca day SOLO (or with only a light on-route stop), honor a genuinely light Cusco arrival/acclimatization day, run Pisac as a clean Sacred Valley round-trip, and carry no duplicate sites; P4 edges P6 with the most disciplined Rainbow day and zero overload. P1 is nearly as clean. The weakest plans actively hurt the traveler: P8 mislabels a Pisac DAY-TRIP day as "rest," then stacks three items (Vinicunca + Qoricocha Lagoon + San Blas workshops) onto the hardest day at altitude and crams the arrival/departure days; P10 bolts a full second major Sacred Valley site (a scenic drive plus Ollantaytambo Fortress) onto an already-full Pisac day-trip, turning it into an exhausting valley marathon. P5, P7, P2, P3 and P11 each schedule the SAME fortress twice (listing both "Sacsayhuaman Fortress" and the seeded "Saqsaywaman" entry on different days), a real duplicate that costs geographic/selection points, and several of them tack an extra stop onto the post-Vinicunca return. Narrative quality is broadly good and on-voice across the field, so it barely separates the pack; doability and pacing decide the order.
All eleven plans correctly schedule the single iconic must-have (Eravikulam National Park) plus the two saved Kochi anchors (Chinese Fishing Nets, Kathakali) and Munnar Tea Gardens, and all sensibly omit the strenuous Meesapulimala full-day trek given the chill pace, young kid, and February timing, so differentiation comes down to doability, duplicates, and coverage of the remaining saved item, Top Station. The top tier (P4, P11, P7, P5) pairs clean geographic clustering and a properly light arrival/departure rhythm with full saved coverage including Top Station, while avoiding duplicates or restaurant-as-day-items; P4 edges ahead on the richest, best-balanced variety, and P11/P5 are the most chill-appropriate (2-3 well-spaced items per day). P2, P6, and P3 are clean and well-paced but drop the saved Top Station, a minor selection ding. The bottom is clear: P1 schedules Cherai Beach twice as two near-identical entries on Day 3, P9 stacks both "Eravikulam National Park" and its "Nilgiri Tahr Trail" duplicate on the same morning, and P8 is the weakest by a wide margin - it breaks the Kochi day-3 allocation entirely by inserting an Alleppey overnight houseboat (out of region, then driving Alleppey-to-Munnar on Day 4) and schedules a "Kerala Sadya Feast" meal as a day item, both of which the rubric explicitly penalizes.
All 11 plans nail the three iconic Cherrapunji anchors (Mawsmai, Double Decker, Wei Sawdong) and both Shillong iconics (Laitlum, Don Bosco), so differentiation comes down to (a) capturing the third Dawki iconic Krang Shuri Falls without wrecking the departure day, and (b) keeping the brutal Day-4 Nongriat trek (3,500 steps) from being over-stacked. P9 wins: it schedules all eight must-haves including Krang Shuri as a sane on-route stop toward Guwahati, keeps Day 4 to the trek + Wei Sawdong + an easy gap-fill cave, varies intensity well, and writes specific, cliche-free copy. P2, P5, and P11 are tightly clustered behind it - clean pacing and good clustering, but each omits the iconic Krang Shuri (a must-have miss) in exchange for a lighter, very doable trip. The weakest plans over-stack the hardest day or cram the departure day: P8 piles Rainbow Falls Trek ON TOP of the Double Decker descent (an exhausting beyond-the-bridge extension) AND still misses Krang Shuri, while P10 does the same Rainbow Falls + Double Decker + Wei Sawdong triple-stack on one trek day and also misses the iconic. P7 turns the final day into a five-item Dawki + Shnongpdeng + far-flung Krang Shuri (~30km) marathon after the long drive, hurting doability. P4 and P6 earn high selection_judgment by capturing Krang Shuri but lose doability points for crowding the trek/departure days; overall the spread is narrow because the corpus is thin and every plan respects the city-day allocation with no out-of-city or restaurant-as-day-item errors.
All 11 plans cover both ICONIC must-haves (Leh Palace + Shanti Stupa, with Shanti often via the "Acclimatisation Walk" variant) and most saved items (bazaar, rafting, Khardung La, Alchi, Magnetic Hill), so differentiation comes almost entirely from doability. The strongest plan, P4, respects altitude (Khardung La pushed to day 4, day-1/2 kept low and in-town), keeps every day a sane Leh-based round-trip, never schedules the un-doable Tso Moriri, mixes types well, and ends with a genuinely light departure day. P8 and P11 are close behind: both pace acclimatization correctly and keep transitions realistic; P11 earns top rhythm marks for an explicit mid-trip rest day (day 5) but loses a little for dangling an optional Tso Moriri on the departure day. The weakest plans cram an un-doable Tso Moriri (~220km/4hr+ each way) as a single-day Leh round-trip on the arrival-adjacent or departure day: P1 puts Tso Moriri on the DEPARTURE day before an evening flight (and also stacks Khardung La day-6 + Tso Moriri day-7 back-to-back as two exhausting 5,000m+ days), and P7 strings together Khardung La (day 5), Chang La + Tso Moriri (day 6), then a cooking class far from town on the departure day - three brutal high-altitude days running with no recovery, plus the closing-day items scattered across Leh, Alchi-area and Spituk. P6 routes a realistic Nubra/Pangong overnight loop but mislabels every item as city "Leh" (Diskit, Hunder, Pangong are not in Leh), tanking geographic efficiency and breaking the single-base brief. No plan scheduled a restaurant as a pure day item (the Alchi cooking class reads as an experience, correctly paired with the Alchi day-trip), and duplicate Shanti Stupa/Acclimatisation-walk pairings were mostly split across distinct days rather than stacked, so duplicate penalties were minor.
All 11 plans cover the 8 iconic must-haves (4 Seoul + 4 Busan) and most also fold in the 4 Seoul saved items (Jimjilbang, K-pop Gangnam, DMZ, Hongdae), so ranking turns on doability, geographic clustering, and avoiding over-cramming. P2 leads: it sequences the heaviest Seoul day cleanly (morning DMZ to far north, afternoon jimjilbang, night Hongdae), clusters Busan day 5 west (Gamcheon+Jagalchi+BIFF) and day 6 northeast (Haedong then Haeundae before departure), keeps arrival/departure light, and reads on-voice without brochure cliches. P8 is nearly identical in quality with the same disciplined 3-item Busan day and clean DMZ/K-pop/Jimjilbang chain. P10 earns points for an explicit mid-trip "rest" day (DMZ + Hongdae) that genuinely alternates intensity. P3 and P6 are clean but under-fill for a "packed" traveler (P3's departure day is a single Gamcheon stop; P6 has several 2-item days). The weakest plans cram or backtrack: P11 routes Gangnam-to-DMZ-to-Yeouido-to-Ewha across one day (huge north-south-northwest backtracking) and loads a far-north mountain temple (Beomeosa) onto the departure day; P9 crams temple+village+market into Busan day 5 and pushes iconic Haeundae onto a travel-day evening. P5 is the clear last: it is massively over-scheduled (six items on multiple days for an already exhausting pace) and, critically, schedules restaurants as day items - notably "Korean BBQ in Gangnam" plus food-market stops framed as eating stops - which violates the food-rail rule and tanks both doability and selection judgment.
The deciding tension is iconic coverage (Dudhsagar, Basilica, Palolem on the Panaji side; Cubana, Curlies, Anjuna on the Arpora side) versus doability over a chill 3-day window with 12 must-haves that cannot all sanely fit. P3 wins by capturing all six iconics while keeping pacing honest: it pairs Dudhsagar (a full-day SE jeep-safari) with Palolem framed as an evening arrival rather than a crammed midday stop, light arrival/island days bookend it, and day 3 is a clean Anjuna-Curlies-Cubana nightlife cluster. P2 is the most geographically disciplined (tight Dudhsagar+Spice inland day, light arrival) but drops two iconics (Palolem and Curlies), which costs it the top spot. P1 also covers all six iconics but loads the arrival day with Spice and stacks Dudhsagar+Palolem (far-apart SE/south anchors) on one day, hurting doability. The large P6/P7/P9/P10/P11 family is highly doable and clean but each omits both Palolem and Curlies (two iconics) in favor of the safe Dudhsagar+Spice / Anjuna+SatMarket+Cubana template, so they cluster mid-pack. P8 swaps Cubana for Curlies but then misses both Palolem and the iconic Cubana. The weakest are P4 and P5: P4 violates the day allocation (Panaji only day 1) and misses three iconics (Dudhsagar, Palolem, Curlies) while detouring 1.5h south to Cola on arrival day; P5 is the most exhausting, cramming Dudhsagar+Palolem+a 21:30 club on day 2 and FIVE anchors including three back-to-back clubs (Sat Market, Chronicle, Cubana) on day 3, gutting doability and selection judgment despite raw coverage.
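Several of the verdicts above penalize the same site scheduled under two spellings (for example "Sacsayhuaman Fortress" alongside the seeded "Saqsaywaman" entry). A minimal sketch of how such duplicates might be counted deterministically follows. The `ScheduledItem` shape, the `ALIASES` table, and the function names are all hypothetical; the benchmark's actual duplicate metric is not shown in this report.

```typescript
// Hypothetical shape of one scheduled stop; the field names are assumptions.
interface ScheduledItem {
  day: number;
  name: string;
}

// Illustrative alias table mapping a variant spelling to a canonical key.
// The two entries mirror duplicates called out in the verdicts; a real table
// would have to come from the place database.
const ALIASES = new Map<string, string>([
  ["saqsaywaman", "sacsayhuaman fortress"],
  ["nilgiri tahr trail", "eravikulam national park"],
]);

// Lowercase, trim, and collapse whitespace, then resolve known aliases.
function canonicalKey(name: string): string {
  const normalized = name.trim().toLowerCase().replace(/\s+/g, " ");
  return ALIASES.get(normalized) ?? normalized;
}

// Count every scheduled item whose canonical key was already seen earlier in the plan.
function countDuplicates(items: ScheduledItem[]): number {
  const seen = new Set<string>();
  let dups = 0;
  for (const item of items) {
    const key = canonicalKey(item.name);
    if (seen.has(key)) {
      dups += 1;
    } else {
      seen.add(key);
    }
  }
  return dups;
}
```

Exact-match normalization alone would miss the Sacsayhuaman/Saqsaywaman pair, which is why the sketch leans on an explicit alias table rather than string similarity.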
The 11-way winner V4 (a rhythm rule with no structural change) looked suspicious - it might have ridden a wide field thinned by weak singletons. So the 4 strongest distinct mechanisms were re-run head-to-head on the same 20 scenarios (60 fresh blind Opus verdicts): V4 (rhythm only), V7 (anchors split only), V10 (the V4+V7 union), V9 (anchors + intensity tags + rhythm + variety + clustering).
| # | Variant | Overall | Doability | Selection | avgRank | 1st | Priority cov | Dups | Gap-fill |
|---|---|---|---|---|---|---|---|---|---|
| 1 | V9 anchors split + intensity tags + rhythm + variety + clustering | 8.11 | 8.03 | 7.99 | 2.07 | 17 | 91% | 0.15 | 0.95 |
| 2 | V10 union: anchors split + rhythm rule | 7.93 | 7.71 | 7.96 | 2.38 | 18 | 93% | 0.05 | 0.8 |
| 3 | V7 anchors-pool structural split only | 7.77 | 7.38 | 8.01 | 2.35 | 18 | 93% | 0.15 | 1.05 |
| 4 | V4 rhythm rule only (no anchors split) - the 11-way winner | 7.28 | 6.79 | 6.97 | 3.20 | 7 | 86% | 0.05 | 1.3 |
The ordering is cleanly monotonic, V9 > V10 > V7, and V4 (no anchors split) falls to last: the structural split is the load-bearing piece. Head-to-head, V10 beats V4 in 12/20 scenarios (V4 wins 6, 2 ties), so adding the anchors split to the rhythm rule helps, and V10's 7.93 overall against V7's 7.77 shows the rhythm rule still adds something on top of the split: rhythm + anchors are additive. On the deterministic metrics V10 is cleanest (dups 0.05, gap-fill 0.8); V9 is most selective (saves-util 74% - smart omission - and best restaurant discipline 0.3).
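A head-to-head count like the 12/6/2 split quoted for V10 vs V4 can be tallied from per-scenario scores. The sketch below assumes each scenario has already been reduced to one overall score per variant (for instance the mean of the panel's verdicts). That reduction, the `ScenarioScores` type, and the `headToHead` name are assumptions, not the benchmark's actual code.

```typescript
// One scenario's scores, keyed by variant id. Hypothetical shape.
type ScenarioScores = { [variant: string]: number | undefined };

interface HeadToHead {
  winsA: number;
  winsB: number;
  ties: number;
}

// Compare two variants scenario by scenario; scenarios missing either score are skipped.
function headToHead(scenarios: ScenarioScores[], a: string, b: string): HeadToHead {
  const tally: HeadToHead = { winsA: 0, winsB: 0, ties: 0 };
  for (const scores of scenarios) {
    const scoreA = scores[a];
    const scoreB = scores[b];
    if (scoreA === undefined || scoreB === undefined) continue;
    if (scoreA > scoreB) tally.winsA += 1;
    else if (scoreB > scoreA) tally.winsB += 1;
    else tally.ties += 1;
  }
  return tally;
}

// Made-up scores for two placeholder variants, purely to show the call shape.
const exampleScenarios: ScenarioScores[] = [
  { A: 8.0, B: 7.0 },
  { A: 6.5, B: 7.5 },
  { A: 7.0, B: 7.0 },
];
const exampleTally = headToHead(exampleScenarios, "A", "B");
console.log(exampleTally);
```

Counting ties separately keeps the three numbers summing to the scenario count, matching how the 12 + 6 + 2 = 20 figure is reported.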
V9 (the recommended port) vs OLD vs NEW on three scenarios. Watch how OLD invents/over-lists, NEW over-packs, and V9 stays doable.
(prompts-v2.ts) - replaces NEW's hard "every save MUST appear". The single biggest win across both rounds; fixes over-packing plus OLD's restaurant-as-activity + invention bugs.
(candidate-pool.ts) - the load-bearing ingredient. Split each city's pool into ANCHORS (hearts+iconics, never trimmed) vs IF-TIME-PERMITS. Every champion leader has it; the one without it (V4) came last.
(prompts-v2.ts) - layered on the split, these took V9 to the top (8.11). Additive, not redundant.
repairCityCoverage as a floor the benchmark omits, so prompts are tested on their own merit.
Generated with Claude Code.
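The ANCHORS vs IF-TIME-PERMITS split can be sketched as a per-city partition of the save pool. This is a minimal illustration under stated assumptions: the `PoolItem` and `SplitPool` types, the function names, and the rendered header wording are hypothetical and are not taken from candidate-pool.ts or prompts-v2.ts.

```typescript
// Hypothetical types; the real candidate-pool.ts shapes are not shown in this report.
type SaveKind = "HEARTED" | "ICONIC" | "SAVED";

interface PoolItem {
  name: string;
  city: string;
  kind: SaveKind;
}

interface SplitPool {
  anchors: PoolItem[]; // hearts + iconics: never trimmed
  ifTimePermits: PoolItem[]; // remaining saves: optional
}

// Group the pool by city, then place each item into one of the two tiers.
function splitPoolByCity(pool: PoolItem[]): Map<string, SplitPool> {
  const byCity = new Map<string, SplitPool>();
  for (const item of pool) {
    let split = byCity.get(item.city);
    if (!split) {
      split = { anchors: [], ifTimePermits: [] };
      byCity.set(item.city, split);
    }
    if (item.kind === "HEARTED" || item.kind === "ICONIC") {
      split.anchors.push(item);
    } else {
      split.ifTimePermits.push(item);
    }
  }
  return byCity;
}

// Render one city's block. The header wording is illustrative, not the production prompt text.
function renderCityPool(city: string, split: SplitPool): string {
  const names = (items: PoolItem[]): string =>
    items.map((i) => `"${i.name}"`).join(", ") || "(none)";
  return [
    `${city} ANCHORS (never trim): ${names(split.anchors)}`,
    `${city} IF TIME PERMITS (optional): ${names(split.ifTimePermits)}`,
  ].join("\n");
}
```

Keeping the two tiers as separate lists (rather than a priority number on each item) mirrors the report's finding that the structural separation, not a softer ranking hint, is what carried the result.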