
Building an AI Pipeline That Generates Webpage Content

Published at 02:00 PM

Sketch of a focused man working at a desk with a laptop and creative elements around him.


The Problem with AI Content

We live in an age where online cats no longer fall out of wardrobes, slap each other, or steal ice cream from their owners. With the capabilities of AI, cats now do the dishes, ride motorcycles, and fight sharks.

Those kinds of visuals are easy to make with some prompt-tinkering. But text is trickier. LLMs have a distinctive style that’s often unpleasant to read: verbose, jargon-packed, and wrapped in the dull docility of marketing language. It lacks bite.

Compare this:

> Leveraging cutting-edge algorithms, our innovative solution synergizes deep learning methodologies […]

With a paragraph written by a software blogger:

> Value: The concept of a value is a bit abstract. It’s a “thing”. A value to JavaScript is what a number is to math, or what a point is to geometry. When your program runs, its world is full of values. […]

However, AI-generated content isn’t impossible to get right. The usual tips - precise style instructions and strong examples - can give you a head start. But the most crucial thing to understand is that AI doesn’t create value; it only helps refine preexisting value.

So if you want to write original content with AI, you are going to need original data.

And if you don’t have original data, you can scrape it from the web.


What We’re Trying to Build

The idea is simple: You build a web crawler that loops over a list of URLs and dumps relevant data into JSON files or a database. The raw data is then processed by a series of prompts, which transform it into a fully readable article. At its core, this is a glorified Object-to-Paragraph / JSON-to-Article pipeline:

```json
{
  "title": "Ikiru",
  "year": 1952,
  "director": "Akira Kurosawa",
  "genre": ["Drama"],
  "summary": "A bureaucrat with a terminal illness searches for meaning in his final days."
}
```

> Ikiru (1952), directed by Akira Kurosawa, is a quiet, haunting drama about a dying bureaucrat’s search for meaning. Faced with his mortality, he tries to break free from the routines of his life and do something that matters - if only once.
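A minimal sketch of that JSON-to-Article step in Python. The function only builds the prompt; the style instructions are examples, and the commented-out call at the bottom assumes an OpenAI-style chat client (swap in whatever you actually use):

```python
import json

def build_article_prompt(record: dict) -> str:
    """Turn a scraped JSON record into a generation prompt."""
    return (
        "Write a short, opinionated paragraph about this film. "
        "Avoid marketing language; keep it concrete.\n\n"
        f"Data:\n{json.dumps(record, indent=2)}"
    )

record = {
    "title": "Ikiru",
    "year": 1952,
    "director": "Akira Kurosawa",
    "genre": ["Drama"],
    "summary": "A bureaucrat with a terminal illness searches for meaning in his final days.",
}

prompt = build_article_prompt(record)
# The prompt then goes to your LLM client of choice, e.g.:
# client.chat.completions.create(
#     model="gpt-4o-mini",
#     messages=[{"role": "user", "content": prompt}],
# )
```

Keeping the prompt-building step as a pure function makes it trivial to unit-test and to swap providers later.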

The Tech Stack

| Component | Tool                | Why                              |
| --------- | ------------------- | -------------------------------- |
| Frontend  | Astro               | Fast static sites, great for SEO |
| Database  | SQLite / PostgreSQL | Local dev vs scalable setup      |
| LLM       | OpenAI / DeepInfra  | Content generation options       |

Scraping, Selecting, and Scrubbing Data

Websites that are ideal for scraping offer raw data alongside metadata and metrics: categories, tags, and likes. These make it much easier to organize and prioritize your precious scrap heap.
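Here is a toy scraper for that kind of metadata, using only the standard library’s `html.parser`. The CSS class names (`tag`, `likes`) are made up for the sketch; a real site will need its own selectors:

```python
from html.parser import HTMLParser

class MetaScraper(HTMLParser):
    """Collect tags and a like count from a (hypothetical) item page."""

    def __init__(self):
        super().__init__()
        self.tags: list[str] = []
        self.likes: int | None = None
        self._capture: str | None = None  # what the next text node means

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        if "tag" in classes:
            self._capture = "tag"
        elif "likes" in classes:
            self._capture = "likes"

    def handle_data(self, data):
        if self._capture == "tag":
            self.tags.append(data.strip())
        elif self._capture == "likes":
            self.likes = int(data.strip())
        self._capture = None

html = '<span class="tag">Drama</span><span class="tag">Classic</span><span class="likes">42</span>'
scraper = MetaScraper()
scraper.feed(html)
# scraper.tags -> ["Drama", "Classic"], scraper.likes -> 42
```

For anything beyond a toy, a dedicated library like Beautiful Soup is the more comfortable tool, but the shape of the job stays the same: pull structured scraps out of markup.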

You can also generate some metadata using LLM magic. LLMs are great at categorization, but things get risky when you try to generate metrics: the results amplify LLM biases and are often unreliable or plain wacky.
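One way to keep LLM categorization safe is to constrain it to a closed label set and validate the answer before it touches your database. A sketch, with an illustrative genre list:

```python
# Closed label set: anything the model invents outside this set is dropped.
ALLOWED_GENRES = {"Drama", "Comedy", "Horror", "Documentary"}

def validate_genres(llm_output: str) -> list[str]:
    """Parse a comma-separated LLM answer, keeping only known labels."""
    candidates = [g.strip().title() for g in llm_output.split(",")]
    return [g for g in candidates if g in ALLOWED_GENRES]

validate_genres("drama, Tragicomedy, comedy")
# -> ["Drama", "Comedy"]  ("Tragicomedy" is not in the allowed set)
```

The same gatekeeping pattern does not rescue generated metrics, though: a number that passes a type check can still be made up.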

The collected data needs to be scrubbed clean and organized so we can make some reasonable assumptions about its structure. Essentially, we need to schematize the scrap soup. This requires some reformatting and type standardization.
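Schematizing mostly means coercing every record into the same shape and types. A minimal normalizer along those lines (the field names mirror the film example above; adapt them to your data):

```python
def normalize_record(raw: dict) -> dict:
    """Coerce a scraped record into a consistent schema."""
    genre = raw.get("genre", [])
    if isinstance(genre, str):  # "Drama, Comedy" -> ["Drama", "Comedy"]
        genre = [g.strip() for g in genre.split(",")]
    return {
        "title": str(raw.get("title", "")).strip(),
        "year": int(raw["year"]) if raw.get("year") else None,
        "genre": genre,
        "summary": str(raw.get("summary", "")).strip(),
    }

normalize_record({"title": " Ikiru ", "year": "1952", "genre": "Drama"})
# -> {"title": "Ikiru", "year": 1952, "genre": ["Drama"], "summary": ""}
```

For larger pipelines a schema library such as Pydantic does this with less boilerplate, but the principle is the same: one canonical shape in, one canonical shape out.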

Another challenge is determining whether a scrap of data is legitimately informative - where does it fit on the page? Does it add value or fluff?
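You can get surprisingly far on that question with cheap heuristics before involving an LLM. A sketch of such a filter; the word threshold and filler list are arbitrary knobs to tune:

```python
def looks_informative(text: str, min_words: int = 8) -> bool:
    """Cheap heuristic: long enough, and not mostly boilerplate filler."""
    words = text.split()
    if len(words) < min_words:
        return False
    filler = {"click", "here", "subscribe", "cookie", "accept"}
    filler_ratio = sum(w.lower().strip(".,!") in filler for w in words) / len(words)
    return filler_ratio < 0.3

looks_informative("Accept cookies and subscribe here!")  # too short -> False
looks_informative(
    "A bureaucrat with a terminal illness searches for meaning in his final days."
)  # -> True
```

Scraps that survive the heuristic can then go to an LLM for the fuzzier value-or-fluff judgment, which keeps your token bill down.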

Prompting

I’m generally in favor of Prompt-Tinkering rather than Prompt-Engineering. Prompting isn’t the art it is sometimes made out to be. With that in mind:
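In that tinkering spirit, here is the shape the prompting usually takes: a short chain of small prompts rather than one clever mega-prompt. `call_llm` stands in for whichever client you use, and the wording of both prompts is exactly the kind of knob you tweak:

```python
def run_chain(facts: str, call_llm) -> str:
    """Two prompts in sequence: draft from data, then a style pass."""
    draft = call_llm(
        "Using only the facts below, draft a three-sentence film blurb.\n\n" + facts
    )
    return call_llm(
        "Rewrite this blurb in a plain, direct voice. No buzzwords.\n\n" + draft
    )

# With a stub in place of a real model, the chain just threads text through:
result = run_chain("Ikiru (1952), dir. Kurosawa", lambda p: p.splitlines()[-1])
```

Splitting drafting from styling means you can fiddle with the voice without re-prompting the facts, and swap either step independently.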


Final thoughts

Hope you had fun with this. Once you’ve tweaked the knobs and started getting results you’re happy with, it can be very satisfying to see it all come together.

