Building an AI Pipeline That Generates Webpage Content

Sketch of a focused man working at a desk with a laptop and creative elements around him.

Open Table of contents

The Problem with AI Content
What We’re Trying to Build
Final thoughts

The Problem with AI Content

We live in an age where online cats no longer fall from a wardrobe, or slap each other, or steal icecream from their owners. With the capabilities of AI, cats now do the dishes, ride motorcycles and fight sharks.

Those kinds of visuals are easy to make with some prompt-tinkering. But text is a bit trickier: LLMs have a distinctive style that’s often unpleasant to read: Verbose, jargon-packed, and wrapped in the dull docility of marketing language. It lacks bite.

Compare this:

Leveraging cutting-edge algorithms, our innovative solution synergizes deep learning methodologies […]

With a paragraph written by a software blogger:

Value: The concept of a value is a bit abstract. It’s a “thing”. A value to JavaScript is what a number is to math, or what a point is to geometry. When your program runs, its world is full of values. […]

However, AI-generated content isn’t impossible to get right. The usual tips - precise style instructions and strong examples - can give you a head start. But what is most crucial to understand is that AI doesn’t create value - it only helps to refine preexisting value.

So if you want to write original content with AI, you are going to need original data.

And if you don’t have original data, you can scrape it from the web.

What We’re Trying to Build

The idea is simple: You build a web crawler that loops over a list of URLs and dumps relevant data into JSON files or a database. The raw data is then processed by a series of prompts, which transform it into a fully readable article. At its core, this is a glorified Object-to-Paragraph / JSON-to-Article pipeline:

{
  "title": "Ikiru",
  "year": 1952,
  "director": "Akira Kurosawa",
  "genre": ["Drama"],
  "summary": "A bureaucrat with a terminal illness searches for meaning in his final days."
}

Ikiru (1952), directed by Akira Kurosawa, is a quiet, haunting drama about a dying bureaucrat’s search for meaning. Faced with his mortality, he tries to break free from the routines of his life and do something that matters - if only once.

The Tech Stack

Because you want your website to rank highly on google, it’s important to optimize for SEO. Textually, this mostly means using clear headings and relevant keywords. Performance wise, you need to be careful: excessive client-side rendering and large bundles can slow a page down. Static site generators like Astro help by offloading the costly rendering process from the client’s browser. Instead this is done at build time, so that at runtime, the server’s compute is utilized for quick serving.
A database is essential for storing and querying all the scraped data. SQLite works well for a quick local setup, but if scalability is a concern, PostgreSQL is the better choice.
The LLM is the backbone of content generation, and while OpenAI is the obvious go-to, AI Inference Providers (like DeepInfra or OpenRouter) allow you to experiment quickly with different open source models.

Component	Tool	Why
Frontend	Astro	Fast static sites, great for SEO
Database	SQLite / PostgreSQL	Local dev vs scalable setup
LLM	OpenAI / DeepInfra	Content generation options

Scraping, Selecting, and Scrubbing Data

Websites that are ideal for scraping include raw data, alongside metadata and metrics: Categories, tags, and likes. These make it much easier to organize and prioritize your precious scrap heap.

You can also generate some metadata using LLM magic. LLMs are great at categorization, but things get risky when you try to generate metrics. Results intensify LLM biases, and are often unreliable or wacky.

The collected data needs to be scrubbed clean and organized so we can make some reasonable assumptions about its structure. Essentially, we need to schematize the scrap soup. This requires some reformatting and type standardization.

Another challenge is determining whether a scrap of data is legitimately informative - where does it fit on the page? Does it add value or fluff?

Prompting

I’m generally in favor of Prompt-Tinkering rather than Prompt-Engineering. Prompting isn’t the art it is sometimes made out to be. With that in mind:

Allow the LLM some creative freedom - just enough to make the writing feel natural, but not so much that it loses structure. Instead of asking for a paragraph, it’s better to request a response in Markdown format with headings, lists, and tables.
Instructing the LLM to write conversationally, occasionally ask questions, and avoid excessive jargon improves the readability. Simplifying language as much as possible forces the LLM to avoid jargon.
Steal other people’s prompts! (Check out r/ChatGPTPromptGenuis)
Only give the LLM the information it needs to perform its token prediction task - anything extra adds confusion.

Final thoughts

Hope you had fun with this. Once you’ve tweaked the knobs and start getting results you’re happy with, it can be very satisfying to see it all come together.