Optimising the build pipeline
At some point this site will have a lot more posts than it does now, and I got curious about whether the build would hold up or quietly become a problem. So I sat down and read through the build code properly, which is something I had not done in a while.
Two things got fixed. A couple of others got thought about and left alone. This is a writeup of both.
What the build does
The build is a Go program. It scans the content directory, parses every markdown file, renders it to HTML, and writes everything to a public directory. It also generates a search index, sitemap, Atom feed, and a few other things. No incremental state: every build starts from scratch and rebuilds everything.
At twenty posts it runs in well under a second. That is fine. The question is what happens at two hundred.
Parallel markdown rendering
Every post gets parsed and syntax-highlighted on each build. What I had not really noticed before is that this was happening serially, one post at a time. There is no reason for that. Nothing about rendering one post depends on the result of any other. It just happened to be written that way.
Making it parallel in Go is straightforward, but there was a question about the markdown parser. It was a single global instance, and I was not sure whether sharing it across goroutines was safe. Goldmark probably handles this fine internally, but I did not want to rely on behaviour that is not documented and might change. Three options: put a mutex around every call (safe, but serialises everything again and defeats the point), create a new instance per goroutine (safe, but wasteful), or use sync.Pool. The pool felt right. Goroutines check out an instance, use it, and return it. New ones get created as needed.
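A minimal sketch of the check-out and return pattern, not the actual build code. The `renderer` type and `renderPost` function are hypothetical stand-ins so the example runs with the standard library alone. In the real build, the pool's `New` function would construct the markdown parser instance instead.

```go
package main

import (
	"bytes"
	"fmt"
	"html"
	"sync"
)

// renderer is a hypothetical stand-in for the real markdown parser instance.
type renderer struct {
	buf bytes.Buffer // per-instance scratch space, not safe to share
}

// render wraps the escaped source in a paragraph. A real build would call
// the markdown parser here.
func (r *renderer) render(src string) string {
	r.buf.Reset()
	r.buf.WriteString("<p>")
	r.buf.WriteString(html.EscapeString(src))
	r.buf.WriteString("</p>")
	return r.buf.String()
}

// rendererPool hands out renderer instances and creates new ones on demand.
var rendererPool = sync.Pool{
	New: func() any { return new(renderer) },
}

// renderPost checks out an instance, uses it, and returns it to the pool.
func renderPost(src string) string {
	r := rendererPool.Get().(*renderer)
	defer rendererPool.Put(r)
	return r.render(src)
}

func main() {
	sources := []string{"one", "two", "a < b"}
	out := make([]string, len(sources))

	var wg sync.WaitGroup
	for i, src := range sources {
		wg.Add(1)
		go func(i int, src string) {
			defer wg.Done()
			out[i] = renderPost(src) // each goroutine writes only its own slot
		}(i, src)
	}
	wg.Wait()
	fmt.Println(out)
}
```

No goroutine ever holds an instance that another goroutine is using, so the question of whether the parser is safe to share never comes up.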
The rendering itself uses a bounded worker pool rather than spawning one goroutine per page. With a large number of posts, unlimited goroutines just add scheduling overhead. The pool caps at the number of CPU cores.
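Roughly what a bounded pool looks like, as a sketch under assumptions rather than the code in the build. `renderAll` and its parameters are names made up for illustration, and `strings.ToUpper` stands in for the real render function.

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
	"sync"
)

// renderAll runs render over every source using a fixed number of workers.
// Results keep the input order because each job writes to its own index.
func renderAll(sources []string, workers int, render func(string) string) []string {
	if workers < 1 {
		workers = 1
	}
	out := make([]string, len(sources))
	jobs := make(chan int)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs { // exits once jobs is closed and drained
				out[i] = render(sources[i])
			}
		}()
	}

	for i := range sources {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return out
}

func main() {
	posts := []string{"first", "second", "third"}
	// Cap the pool at the number of CPU cores, as described above.
	rendered := renderAll(posts, runtime.NumCPU(), strings.ToUpper)
	fmt.Println(rendered)
}
```

Sending indexes rather than results over the channel means the output slice needs no lock: no two workers ever touch the same element.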
Reference lookups
Some posts link to other posts via a references field in their frontmatter, which the build resolves into titles and URLs for the “see also” section. The way this was working: for every reference on every page, scan the entire page list to find a match. That is O(n) per lookup, so the total cost grows with the number of pages multiplied by the number of references, which adds up quickly on a large site.
In hindsight it is obvious that this should be a map. Build it once after the content scan, then every lookup is a direct key access. I am not sure why it was written as a linear scan in the first place, probably just because it was easy and at the time there were not enough pages for it to matter.
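The change amounts to something like the following sketch. The `Page` struct, its field names, and the choice of a slug as the key are assumptions for illustration. The real build may key on something else.

```go
package main

import "fmt"

// Page holds the fields needed to resolve a reference (assumed names).
type Page struct {
	Slug, Title, URL string
	References       []string
}

// indexBySlug builds the lookup map once, after the content scan.
func indexBySlug(pages []Page) map[string]*Page {
	idx := make(map[string]*Page, len(pages))
	for i := range pages {
		idx[pages[i].Slug] = &pages[i]
	}
	return idx
}

// resolve returns the referenced pages found in idx, plus any slugs that
// did not match a page, so that broken references can be reported.
func resolve(p Page, idx map[string]*Page) (found []*Page, missing []string) {
	for _, slug := range p.References {
		if ref, ok := idx[slug]; ok {
			found = append(found, ref)
		} else {
			missing = append(missing, slug)
		}
	}
	return found, missing
}

func main() {
	pages := []Page{
		{Slug: "intro", Title: "Introduction", URL: "/intro/"},
		{Slug: "build", Title: "The build", URL: "/build/", References: []string{"intro", "gone"}},
	}
	idx := indexBySlug(pages)

	found, missing := resolve(pages[1], idx)
	for _, ref := range found {
		fmt.Println("see also:", ref.Title, ref.URL)
	}
	fmt.Println("unresolved:", missing)
}
```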
What I looked at and left alone
Incremental builds
The most obvious thing for long-term build performance is not rebuilding pages that have not changed. Hash the content, cache the output, skip anything with a matching hash on the next run.
The problem is that a page’s output is not just a function of its markdown. It also depends on the templates, the site config, and the titles of any pages it references. A template change should invalidate everything. Getting the cache key right means accounting for all of that, which is a meaningful amount of complexity for a tool that currently has none.
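To make the cache key problem concrete, here is a sketch of what such a key might look like if it were ever built. Nothing like this exists in the build today. `cacheKey` and the inputs passed to it are hypothetical, and a real version would still need to work out which templates and referenced pages apply to each page.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// cacheKey hashes every input that can change a page's output.
// Each part is length-prefixed so that ("ab", "c") and ("a", "bc")
// produce different keys.
func cacheKey(parts ...[]byte) string {
	h := sha256.New()
	for _, p := range parts {
		fmt.Fprintf(h, "%d:", len(p))
		h.Write(p)
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	// Illustrative inputs only.
	markdown := []byte("# Hello")
	templates := []byte("<html>{{.}}</html>")
	config := []byte("title: example")
	refTitles := []byte("Another post")

	before := cacheKey(markdown, templates, config, refTitles)

	// Editing the template changes the key even though the markdown did not.
	templates = []byte("<html><body>{{.}}</body></html>")
	after := cacheKey(markdown, templates, config, refTitles)

	fmt.Println(before != after)
}
```

Every input left out of the key is a stale page waiting to happen, which is where the complexity comes from.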
There is also the CI side of it. Each pipeline starts clean, so any cache would need to be stored as an artifact and retrieved at the start of the next run. That is extra configuration and an extra thing to break.
Honestly it just felt premature. At a few hundred posts the parallelism should be more than enough. If build time ever becomes a genuine problem I will add incremental logic then, when the need is clear rather than hypothetical.
Pre-building and committing generated files
Another thought was committing some generated outputs to the repository so the build does not have to produce them. syntax.css was the obvious candidate since it barely ever changes. Committing it and removing the generation step would be a small simplification.
The thing that put me off is that it creates a manual step: remember to regenerate and commit when the library version bumps. That is easy to forget, and the result is a file in the repo that no longer matches what the current library would produce. For most other generated outputs like the search index and sitemap the case is even weaker: they change on every post addition, so every new post would come with a noisy diff of regenerated files.
Left it as is.
Was it worth it
At twenty posts, not measurably. The map lookup is just the right way to do it, so that was worth fixing regardless of scale. The parallelism is harder to justify right now, but I think it is the kind of thing that is easier to do when the code is simple and nothing else depends on the order things run. With the pool capped at the core count, the rendering stage should now take roughly the total rendering work divided by the number of cores, and never less than the slowest individual post, rather than the sum of every post. That feels like a reasonable property to have baked in early.