The system was working. Articles were good. But every time I opened the final Markdown, I saw the same thing: a wall of text. This is Part 3 — where I added image generation, discovered bugs that had been hiding since Day 1, and learned that pipeline debt compounds faster than you'd think.
The Gap I Kept Ignoring

Every article the system produced was good. Sometimes really good. But every time I opened the final Markdown file, I noticed the same thing: a wall of text.
That's fine for a draft. It's not fine for something you'd actually publish. Real articles have images — a hero shot that sets the mood, a few visuals that break up the reading experience and anchor the reader in specific moments. I'd listed "Image Suggester" as a future enhancement in the original design and promptly forgotten about it.
Then I ran the pipeline on a piece about why cats and dogs are humanity's best friends, and the output was genuinely moving — and still just a wall of text. That was the push I needed.
The Design: One More Agent, Two New Phases
The image pipeline had to integrate cleanly with what already existed. I didn't want to redesign the orchestrator — I wanted to slot new phases between Harmonization and Export.
The new flow: Harmonization → Image Planning (structured prompts plus a human approval gate) → Image Generation (images plus placement) → Export.
The key design constraint: existing articles should be unaffected. If you don't pass --main-image or --content-images, none of this runs. The entire image phase is opt-in.
```shell
# Before (unchanged)
agentic-writer run "topic" --words 2000

# After (opt-in)
agentic-writer run "topic" --words 2000 --main-image --content-images 3
```
The Image Planner Agent

The core problem with naive image generation is the prompts. Most AI-assisted image workflows produce one-liner prompts: "a cat and a dog sitting together". These produce generic stock-photo results.
What makes a good image generation prompt is specificity: composition, lighting, color palette, mood, what to avoid, whether to use photorealistic or cinematic style. A skilled photographer or art director thinks about all of these. An LLM can too — if you ask it to.
So the ImagePlannerAgent reads the final harmonized article and produces 80–150 word prompts per image, structured like a cinematographer's shot description:
```text
Primary subject: Two animals in silhouette against a golden sunset window
Composition: Wide shot, rule of thirds, animals at left frame
Lighting: Warm backlight, dust motes visible in beam
Color palette: Amber, ochre, deep shadow
Style: Cinematic documentary photography
Mood: Intimate, still, ancient companionship
Setting: Simple domestic interior, minimal furniture
Avoid: Text, logos, human faces, clutter
```
I wrote the system prompt to enforce this structure — not just "describe the image" but specifically enumerate each visual dimension. The difference in output quality between a one-liner and a structured 120-word prompt is not subtle.
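As a sketch of what that enforcement can look like, here is a hypothetical version of such a system prompt — the exact wording of mine differs, but the field list mirrors the shot description above:

```python
# Illustrative only: a structure-enforcing system prompt for the planner.
# The real prompt text in the project is different; the enumerated
# dimensions match the shot-description format shown above.
IMAGE_PROMPT_SYSTEM = """You are an art director writing image generation prompts.
For each image, write an 80-150 word prompt that covers EVERY dimension below:
- Primary subject
- Composition (framing, rule of thirds, camera distance)
- Lighting
- Color palette
- Style (photorealistic, cinematic, documentary, ...)
- Mood
- Setting
- Avoid (elements that must not appear)
Never output a one-line prompt."""
```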
The agent uses the placement field to anchor each image:
"Article header"→ hero image, always 16:9"After: The Bond"→ content image after that section, always 4:3
The Third Human Gate
The system already had two human gates: query review and angle selection. The image prompts needed a third — not because I don't trust the agent, but because image generation has real cost and real irreversibility. Once you've generated four images, you've spent the money regardless of whether you like them.
The approval UI shows a table of all proposed images with a preview of each prompt, and supports:
| Command | What it does |
|---|---|
| `a` | Approve all, generate |
| `1 3` | Toggle specific images on/off |
| `s 2` | Show the full 120-word prompt for image #2 |
| `e 2` | Edit the prompt inline before generating |
| `r` | Discard all — call the agent again with a fresh article read |
| `q` | Skip image generation entirely |
The r option was the one I thought I wouldn't need. I used it on every test run. Seeing the prompts in the table almost always made me want to tweak the angle for at least one image. Having the agent regenerate from scratch occasionally produced a completely different framing that was better. The loop — generate prompts, review, regenerate if needed, approve — felt right.
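The command grammar above is simple enough to sketch. This `parse_command` helper is hypothetical — the actual UI code is different — but it shows the shape of the dispatch:

```python
def parse_command(raw: str, num_images: int) -> tuple[str, list[int]]:
    """Parse an approval-gate command like 'a', '1 3', 's 2', 'e 2', 'r', 'q'.

    Returns (action, image_indices). Bare numbers mean 'toggle'.
    Illustrative sketch, not the project's actual parser.
    """
    parts = raw.strip().lower().split()
    if not parts:
        return ("noop", [])
    if parts[0] in {"a", "r", "q"}:
        return (parts[0], [])
    if parts[0] in {"s", "e"}:
        return (parts[0], [int(p) for p in parts[1:] if p.isdigit()])
    # Bare numbers toggle specific images on/off.
    indices = [int(p) for p in parts if p.isdigit() and 1 <= int(p) <= num_images]
    return ("toggle", indices)
```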
Gemini Flash for Image Generation
The provider choice was easy: I already use Google's genai SDK for Gemini text, and Gemini Flash's image generation is available through the same SDK with the same API key. No new dependency, no new credential management.
```python
response = await client.aio.models.generate_content(
    model="gemini-2.0-flash-preview-image-generation",
    contents=[prompt],
    config=types.GenerateContentConfig(
        safety_settings=[...],
    ),
)
for part in response.parts:
    if image := part.as_image():
        return image.image_bytes
```
The output goes to images/main.png, images/content_1.png, etc. The original final/article.md is never modified.
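Writing the files out is straightforward. This `save_images` helper is an illustrative sketch of the layout just described, not the project's real code:

```python
from pathlib import Path

def save_images(project_path: Path, images: dict[str, bytes]) -> list[Path]:
    """Write generated image bytes to images/<id>.png.

    final/article.md is never touched. Hypothetical helper for illustration.
    """
    out_dir = project_path / "images"
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for image_id, data in images.items():
        path = out_dir / f"{image_id}.png"
        path.write_bytes(data)
        written.append(path)
    return written
```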
The image placement is actually a second LLM call inside the ImagePlannerAgent — not a regex over headings. After generating the prompts, the agent sends the full article back to the model with a layout brief: place <!-- IMAGE:id --> tags between paragraphs at natural visual break points, never in the middle of a sentence, distribute evenly. The model returns the article with tags inserted — stored as article_with_placeholders. Then after the images are actually generated, those placeholder tags get replaced with real Markdown image links to produce final/article_with_images.md. The hero image ends up at the top; content images land where the article naturally pauses — which an LLM judges better than any heading-distance heuristic would.
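The final placeholder swap is plain string work. A minimal sketch, assuming the tag format shown above (`insert_images` is a hypothetical helper name):

```python
import re

def insert_images(article_with_placeholders: str, image_paths: dict[str, str]) -> str:
    """Replace <!-- IMAGE:id --> tags with Markdown image links.

    Tags with no generated image (e.g. ones toggled off at the gate) are stripped.
    Illustrative sketch, not the actual implementation.
    """
    def replace(match: re.Match) -> str:
        image_id = match.group(1)
        path = image_paths.get(image_id)
        return f"![{image_id}]({path})" if path else ""

    return re.sub(r"<!--\s*IMAGE:(\w+)\s*-->", replace, article_with_placeholders)
```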
The Pipeline Debt I Discovered

Here's what I didn't expect: adding the image phase exposed bugs in the resume system that had been hiding since the beginning.
Bug 1: The review loop had no checkpoint guard.
Every pipeline phase had a flag: `queries_approved`, `merged_research`, `plan`, `assembled_draft`, `final_article`. When a resume call came in, each phase checked its flag and skipped if already done.
The review/revision loop had no such flag. It just... ran every time. I'd never noticed because orch.resume() had its own DB-based early-exit that short-circuited on completed jobs. But when I added --from images (which calls orch.run() directly to bypass that short-circuit), suddenly every resume was re-running three supervisors and potentially triggering another revision round before reaching the image phase.
The fix was a revision_complete: bool field in state:
```python
if not state.revision_complete:
    for round_num in range(1, self._max_revision_rounds + 1):
        ...
    state.revision_complete = True
    state.save_to(state.project_path)
```
One field. But finding it required understanding why the resume flow had two separate entry points (run vs resume) and what each one checked.
Bug 2: orch.resume() checked the DB, not the state.
When you add --from images, you clear the image checkpoint flags in memory and want the pipeline to re-run from that phase. But orch.resume() checks the SQLite job history for its resume point — and if the job shows as completed in the DB, it returns immediately without running anything.
The fix: when --from is specified, call orch.run() directly instead of orch.resume(). The state flags are the source of truth; the DB is just the index.
```python
run_fn = orch.run if from_stage else orch.resume
await run_fn(state, job.id)
```
Both bugs existed from the beginning. They were invisible until I started doing multi-stage resumes. Pipeline debt is real — it just doesn't show up until you add the feature that stresses the seam.
What Else Changed in the UX
Beyond images, this session added three things that changed how the pipeline feels to use.
The voting breakdown table.
Before this session, the review phase told you: "Sections to revise: section_4_better_than_humans." That's it. You had no idea which supervisor flagged it, what their score was, or why.
Now you get a full table after every round:
```text
Section              anthropic   google   openai   Avg
──────────────────────────────────────────────────────────────
Introduction         ✓ 8.5       ✓ 9.2    ✓ 8.9    8.9   pass
Better Than Humans   ✗ 7.5       ✓ 8.2    ✗ 7.8    7.8   revise
```
And during revision, you see exactly who flagged it and what they said:
```text
→ Revising Better Than Humans via section_writer (gemini-2.5-pro)
  ✗ anthropic [7.5]: Claims too absolute — needs citations…
  ✗ openai    [7.8]: Disconnected from the emotional tone…
```
This should have been there from the start. When the system revises something, you should know why.
`--from` for targeted re-runs.
```shell
agentic-writer resume <job-id> --from images              # redo images, keep article
agentic-writer resume <job-id> --from harmonize           # redo harmonize onwards
agentic-writer resume <job-id> --from images --archive    # move old images to archive first
```
The --archive flag moves existing output to a timestamped subfolder before regenerating. Non-destructive by default, with an easy escape hatch if you want to compare old and new versions.
Cross-session usage tracking.
Each pipeline run appends its token and image usage to projects/<job>/usage.json. When a job spans multiple sessions, the end-of-run summary shows both:
```text
This Session
  google     gemini-2.0-flash      4 images    $0.1600

All Sessions Combined (2)
  anthropic  claude-sonnet-4-6    24 calls     $1.0382
  google     gemini-2.5-pro       16 calls     $0.4795
  google     gemini-2.0-flash      8 images    $0.3200
────────────────────────────────────────────────────────
  Total                                        $1.8377
```
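The append-only tracking can be sketched in a few lines, assuming a JSON list of per-session records (`append_usage` is a hypothetical helper name):

```python
import json
from pathlib import Path

def append_usage(project_path: Path, record: dict) -> list[dict]:
    """Append one session's usage record to <project>/usage.json.

    Returns the full cross-session history. Illustrative sketch only.
    """
    usage_file = project_path / "usage.json"
    history = json.loads(usage_file.read_text()) if usage_file.exists() else []
    history.append(record)
    usage_file.write_text(json.dumps(history, indent=2))
    return history
```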
Gemini Flash image generation costs $0.04/image, so four images cost $0.16. Worth knowing before you run the pipeline.
The Lesson From Part 3
The first two parts built a pipeline that produced good articles. This part made it produce publishable ones.
But the more interesting lesson was about where complexity hides. The image pipeline itself was straightforward — new agent, new provider, new CLI flags. What wasn't straightforward was the resume system, which had accumulated assumptions about how it would be called that broke the moment I added a new calling pattern.
Every time you add a feature to a pipeline, you're also stress-testing the assumptions baked into every earlier feature. Sometimes those assumptions hold. Sometimes you find out they were always wrong — they just never mattered before.
That's not a reason to over-engineer things upfront. It's a reason to have good checkpointing, clear state semantics, and tests for your resume paths. The pipeline debt was real, but it was also fixable — and finding it made the system meaningfully more robust.
The system was now producing articles with images, reviewing them with visible reasoning, and resuming cleanly from any phase. The last thing it couldn't do: turn a project folder full of Markdown and PNGs into a deploy-ready MDX file and publish it. That's Part 4.
The articles this pipeline produces are published at blueandyeliwrite.com.
