Deterministic Scaffolding for VLM Image Generation

Marcus Webb

Senior Backend Analyst

The Pitch

Frontier models like Gemini 3.0 Pro and GPT-5 still cannot natively handle complex spatial tasks such as numbering a 50-step spiral game board (source: samcollins.blog). The Underdrawing Method uses deterministic SVG or Python scripts to create a structural scaffold before any pixels are generated. By separating logic from aesthetics, developers fix text and numbering in code, where they are correct by construction, instead of relying on native one-shot prompting, which still fails at this in May 2026.

Under the Hood

Gemini 3.0 Pro and ChatGPT Images 2 consistently fail to number 50 consecutive items in a spiral when prompted natively (source: samcollins.blog). Asking GPT-5 to number a spiral is currently the quickest way to turn a logic problem into a surrealist painting. The method works around the hallucination with a two-phase workflow: Layer 1 is a deterministic SVG or Python-based outline that fixes positions and labels, and Layer 2 uses generative Image-to-Image models to apply textures on top of it (source: samcollins.blog).
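
To make Layer 1 concrete, here is a minimal sketch of a scaffold generator for the 50-step spiral board. It is not the original author's script: the function names (spiral_positions, build_scaffold_svg, scaffold_labels), the Archimedean spiral layout, and every size and spacing value are assumptions chosen for illustration. It uses only the Python standard library, and it covers Layer 1 only; handing the result to an Image-to-Image model is left out because that step depends on the model you use.

```python
import math
import xml.etree.ElementTree as ET


def spiral_positions(n, cx=400.0, cy=400.0, spacing=42.0, ring_gap=48.0):
    """Return n (x, y) cell centres along an Archimedean spiral.

    spacing is the approximate arc length between neighbouring cells and
    ring_gap is the radial distance between successive turns (both assumed).
    """
    b = ring_gap / (2 * math.pi)   # radius gained per radian
    theta = 2 * math.pi            # start one turn out so inner cells do not collide
    points = []
    for _ in range(n):
        r = b * theta
        points.append((cx + r * math.cos(theta), cy + r * math.sin(theta)))
        theta += spacing / r       # arc length is roughly r * d_theta
    return points


def build_scaffold_svg(n=50, size=800, cell_radius=18):
    """Build the Layer 1 underdrawing: n outlined cells numbered 1..n."""
    svg = ET.Element(
        "svg",
        {
            "xmlns": "http://www.w3.org/2000/svg",
            "width": str(size),
            "height": str(size),
            "viewBox": f"0 0 {size} {size}",
        },
    )
    ET.SubElement(svg, "rect", {"width": "100%", "height": "100%", "fill": "white"})
    centres = spiral_positions(n, cx=size / 2, cy=size / 2)
    for number, (x, y) in enumerate(centres, start=1):
        ET.SubElement(
            svg,
            "circle",
            {
                "cx": f"{x:.1f}",
                "cy": f"{y:.1f}",
                "r": str(cell_radius),
                "fill": "none",
                "stroke": "black",
                "stroke-width": "2",
            },
        )
        label = ET.SubElement(
            svg,
            "text",
            {
                "x": f"{x:.1f}",
                "y": f"{y:.1f}",
                "text-anchor": "middle",
                "dominant-baseline": "central",
                "font-family": "sans-serif",
                "font-size": "14",
            },
        )
        # The number comes from the loop counter, never from a model.
        label.text = str(number)
    return ET.tostring(svg, encoding="unicode")


def scaffold_labels(svg_text):
    """Read the labels back out of the SVG, in document order."""
    root = ET.fromstring(svg_text)
    return [int(el.text) for el in root.iter() if el.tag.endswith("text")]
```

Calling build_scaffold_svg(50) returns the SVG as a string, which can be written to disk and rasterized with whatever tool you already use before Layer 2. The scaffold_labels helper is a cheap sanity check that the labels are exactly 1 through 50 in order; it says nothing about whether the generative pass preserves them, which still needs to be inspected separately.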

Research from WACV 2026 suggests that current AI editors fulfill only about 33% of precise editing requests correctly (source: WACV 2026 Paper #2231-2241). This points to a persistent gap in the 2026 stack, one that external geometric constraints are meant to close. The Hacker News community views the method as a sophisticated evolution of early Stable Diffusion img2img workflows, now adapted for VLM reasoning (source: HN comment by vunderba).

Current limitations and unknowns:
- High technical friction requiring knowledge of SVG, Python, or Mermaid.
- Potential "Prompt Neglect" where models ignore descriptive style adjectives (source: HN).
- Increased agentic latency due to the multi-step code-and-vision execution.
- No public library yet exists to automate Layer 1 for non-engineers.
- Performance deltas between Claude 4.5 Opus and Gemini 3.0 Pro are currently undocumented.

Marcus's Take

This is the only viable way to ship production assets involving data visualization or precise spatial layouts in May 2026. If your product relies on GPT-5's intuition to place 50 numbers correctly, you are shipping broken features. It is a cumbersome workflow that increases latency and friction, but until vision models can actually count, you must use it for any project where accuracy is non-negotiable.

Ship clean code,
Marcus.

Marcus Webb - Senior Backend Analyst at UsedBy.ai
