This blog has an RSS feed

2020-06-27 (last modified 2020-06-28)

I recently added an RSS feed to my blog, prompted by a friend asking if I had one. I write these posts in HTML. I don’t use a static site generator. So I assumed I wouldn’t be able to generate a feed using existing software. (Later, I realized there exist tools that build an RSS feed from a given webpage. I looked at one and saw the feed it built didn’t have anywhere near the format I wanted.)

I wrote a small TypeScript program to generate the RSS feed. It uses htmlparser2 to parse the posts and rss to generate the RSS feed from the contents.

The right tool for the job

I soon realized htmlparser2 wasn’t the easiest tool to use. The library uses a callback interface: It lets you provide functions that it then calls when it encounters opening and closing tags, text nodes, and other document features. It’s up to the code using the library to track its location in the parse tree.

Instead, I could have used a library like parse5, which parses the HTML into a JavaScript object and makes it easy to access elements with known locations in the parse tree. It’s a tradeoff, though: If I’d written the script using that approach, small changes in the layout of the HTML would have broken it. In fact, I think it would have been easiest to write this script using browser APIs like Document#querySelector. For that reason, maybe I should have used a DOM implementation like jsdom.

Parsing the posts

In any case, solving this problem with htmlparser2 was a fun challenge. I wanted to extract three pieces of information from each post: its title, its creation date, and its body. First, I worked on extracting the title.

I added callbacks for opening tags, closing tags, and text nodes. htmlparser2 fires these callbacks in document order, which amounts to a pre-order traversal of the parse tree: once you identify the calls to the opening and closing tag callbacks for a given element, any calls to the text node callback in between must be for text nodes inside that element. Each post’s title tag contains the title, so all I had to do was extract the text node inside that tag. Since each post has only one title tag, I did this by setting an inTitle variable to true when I saw an opening title tag, storing the contents of any text node seen while inTitle was true in a second variable, and setting inTitle back to false on a closing title tag.
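Here is a minimal sketch of that flag-based approach, not the actual script. The callback names (onopentag, ontext, onclosetag) mirror htmlparser2's handler interface, but makeTitleHandler and getTitle are hypothetical names, and the events are fed in by hand so the example runs without the library.

```typescript
// Callback names mirror htmlparser2's handler interface.
interface TitleHandler {
  onopentag(name: string, attribs: Record<string, string>): void;
  ontext(text: string): void;
  onclosetag(name: string): void;
  getTitle(): string; // hypothetical accessor, not part of htmlparser2
}

function makeTitleHandler(): TitleHandler {
  let inTitle = false; // true between <title> and </title>
  let title = "";
  return {
    onopentag(name) {
      if (name === "title") inTitle = true;
    },
    ontext(text) {
      // Append rather than assign, in case the parser splits the text into chunks.
      if (inTitle) title += text;
    },
    onclosetag(name) {
      if (name === "title") inTitle = false;
    },
    getTitle() {
      return title;
    },
  };
}

// Events a parser would emit, in document order, for:
// <title>This blog has an RSS feed</title><p>Hello</p>
const titleHandler = makeTitleHandler();
titleHandler.onopentag("title", {});
titleHandler.ontext("This blog has an RSS feed");
titleHandler.onclosetag("title");
titleHandler.onopentag("p", {});
titleHandler.ontext("Hello");
titleHandler.onclosetag("p");

console.log(titleHandler.getTitle()); // This blog has an RSS feed
```

In a real script, the three callbacks would be passed to htmlparser2's Parser instead of being called by hand.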

I did something similar to extract the creation date, which is contained in a p tag with the class timestamp. The opening tag callback takes a map of HTML attributes as well as a tag name, so it was easy to recognize this particular p tag by its class. The closing tag callback doesn’t provide the tag’s attributes, but I realized I could set inDate to false on any closing p tag, since the paragraph containing the timestamp doesn’t have any paragraphs inside it.
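The same idea, sketched for the timestamp. Again the callback names follow htmlparser2's handler interface, while makeDateHandler, getDate, and the sample markup are assumptions made for illustration; the events are fed in by hand.

```typescript
function makeDateHandler() {
  let inDate = false; // true while inside <p class="timestamp">
  let date = "";
  return {
    onopentag(name: string, attribs: Record<string, string>): void {
      // The opening tag callback receives the attributes, so the class is visible here.
      if (name === "p" && attribs["class"] === "timestamp") inDate = true;
    },
    ontext(text: string): void {
      if (inDate) date += text;
    },
    onclosetag(name: string): void {
      // No attributes are available here, so any closing p resets the flag.
      // This is safe only because the timestamp paragraph has no nested paragraphs.
      if (name === "p") inDate = false;
    },
    getDate(): string {
      return date; // hypothetical accessor
    },
  };
}

// Events for: <p class="timestamp">2020-06-27</p><p>Body text</p>
const dateHandler = makeDateHandler();
dateHandler.onopentag("p", { class: "timestamp" });
dateHandler.ontext("2020-06-27");
dateHandler.onclosetag("p");
dateHandler.onopentag("p", {});
dateHandler.ontext("Body text");
dateHandler.onclosetag("p");

console.log(dateHandler.getDate()); // 2020-06-27
```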

Getting the post contents

Extracting the contents of each post was more difficult. The body of each blog post lives in a section tag. It can contain HTML itself; so far I’ve mainly used a and code tags. However, htmlparser2 doesn’t let you read the contents of a specific HTML element. It just calls your callback functions when it sees opening tags, closing tags, and text nodes. I needed to rebuild the HTML inside the section tag using only the information from these function calls.

To keep track of whether it’s inside the section tag, the script uses an inSection Boolean variable, similar to how it keeps track of whether it’s inside the title tag. It appends any text it sees inside the section tag to a content string variable that starts off empty. When the script sees an opening tag inside the section tag, it uses the tag name and map of attributes to build a string containing an HTML opening tag, then appends it to content. If the tag is self-closing (a void element such as br or img), this step closes it immediately; any other tag is left open. On encountering a closing tag inside the section, the script appends a closing tag to content, as long as the tag isn’t self-closing. The pre-order traversal ensures the script appends tags and text to content in the same order they appear in the original post.
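A rough sketch of that reconstruction follows. The names makeContentHandler, getContent, and VOID_ELEMENTS are hypothetical, and the events are fed in by hand. Two assumptions are baked in: the parser reports a closing tag even for void elements such as br, and text and attribute values arrive with their entities still escaped, so they can be appended verbatim.

```typescript
// Void (self-closing) elements never get a separate closing tag.
const VOID_ELEMENTS = ["area", "base", "br", "col", "embed", "hr", "img",
  "input", "link", "meta", "source", "track", "wbr"];

function isVoidElement(tagName: string): boolean {
  return VOID_ELEMENTS.indexOf(tagName) !== -1;
}

function makeContentHandler() {
  let inSection = false;
  let content = "";
  return {
    onopentag(tagName: string, attribs: Record<string, string>): void {
      if (tagName === "section") {
        inSection = true; // the section tag itself is not part of the content
        return;
      }
      if (!inSection) return;
      // Assumption: attribute values are still escaped, so no re-escaping here.
      const attrs = Object.keys(attribs)
        .map(function (key) { return " " + key + '="' + attribs[key] + '"'; })
        .join("");
      content += "<" + tagName + attrs + (isVoidElement(tagName) ? " />" : ">");
    },
    ontext(text: string): void {
      if (inSection) content += text;
    },
    onclosetag(tagName: string): void {
      if (tagName === "section") {
        inSection = false;
        return;
      }
      // Void elements were already closed when they were opened.
      if (inSection && !isVoidElement(tagName)) content += "</" + tagName + ">";
    },
    getContent(): string {
      return content;
    },
  };
}

// Events for:
// <section>See <a href="/feed.xml">the feed</a>.<br>Run <code>ts-node</code></section>
const contentHandler = makeContentHandler();
contentHandler.onopentag("section", {});
contentHandler.ontext("See ");
contentHandler.onopentag("a", { href: "/feed.xml" });
contentHandler.ontext("the feed");
contentHandler.onclosetag("a");
contentHandler.ontext(".");
contentHandler.onopentag("br", {});
contentHandler.onclosetag("br");
contentHandler.ontext("Run ");
contentHandler.onopentag("code", {});
contentHandler.ontext("ts-node");
contentHandler.onclosetag("code");
contentHandler.onclosetag("section");

console.log(contentHandler.getContent());
// See <a href="/feed.xml">the feed</a>.<br />Run <code>ts-node</code>
```

Like the Boolean flag described above, this sketch does not handle a section nested inside another section.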

Generating the RSS feed

Generating the feed was simpler. Each post lives in its own folder, so the script gets a list of those folders, reads and parses each post, adds an item to the feed for each parsed post, and writes the feed to an XML file. After running the script, I commit the updated file to my website’s Git repository and push to publish.
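To show the shape of the output, here is a hand-rolled RSS 2.0 serializer standing in for the rss package. It is a sketch, not how the script does it: the script delegates this step to rss. ParsedPost, buildFeed, escapeXml, and the example.com URLs are all made up for illustration, and reading the folders and writing the XML file are left out.

```typescript
interface ParsedPost {
  title: string;
  date: string;    // ISO date, e.g. "2020-06-27"
  content: string; // HTML rebuilt from the post's section tag
  url: string;
}

function escapeXml(text: string): string {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

function buildFeed(
  siteTitle: string,
  siteUrl: string,
  description: string,
  posts: ParsedPost[]
): string {
  const items = posts.map(function (post) {
    return [
      "    <item>",
      "      <title>" + escapeXml(post.title) + "</title>",
      "      <link>" + escapeXml(post.url) + "</link>",
      "      <guid>" + escapeXml(post.url) + "</guid>",
      // RSS expects RFC 822 dates; toUTCString produces a compatible format.
      "      <pubDate>" + new Date(post.date).toUTCString() + "</pubDate>",
      // CDATA avoids escaping the HTML; assumes the content never contains "]]>".
      "      <description><![CDATA[" + post.content + "]]></description>",
      "    </item>",
    ].join("\n");
  });
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<rss version="2.0">',
    "  <channel>",
    "    <title>" + escapeXml(siteTitle) + "</title>",
    "    <link>" + escapeXml(siteUrl) + "</link>",
    "    <description>" + escapeXml(description) + "</description>",
  ].concat(items, ["  </channel>", "</rss>"]).join("\n");
}

const feedXml = buildFeed("My blog", "https://example.com", "Posts from my blog", [
  {
    title: "This blog has an RSS feed",
    date: "2020-06-27",
    content: "<p>I recently added an RSS feed to my blog.</p>",
    url: "https://example.com/rss-feed/",
  },
]);

console.log(feedXml);
```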

Why TypeScript?

I wrote this script in TypeScript but didn’t benefit much from doing so compared to writing it in JavaScript. Most of the compile errors I encountered would have been runtime errors in JS, but the cause would have been equally obvious. On the other hand, I didn’t waste much time using TS. It didn’t take long to install ts-node and type definitions for the libraries I was using, and I don’t expect ts-node was much slower than Node for such a small script. Plus, I would have benefited more from TS with a different development environment. I wrote this script using Vim, without even TS syntax highlighting. If I’d used an IDE with type-aware code completion and integrated type-checking, or installed Vim plugins for those features, TS would have been an improvement over JS.

Why RSS?

So why RSS? I’m a big consumer of RSS feeds myself and I know I’m not alone, even if they’re not as popular as they used to be. RSS is a simple way to share new content with people interested in your work. Of course, it’s not the only way. I could share my posts over Twitter or Facebook, but I started using RSS in the first place to get off social media. (A topic for another post.) Substack and Mailchimp make it easy to share content over email, but I like having full control over the distribution of my posts. I’d rather not give that control to a third party, at least for a passion project that I don’t plan to monetize, like this blog. I’d definitely consider using Substack for a paid newsletter to avoid integrating with a payment processor.

P.S.: Why not an SSG or a CMS?

It’s also worth examining why I don’t use a static site generator or a CMS for my blog. I certainly could have built it more quickly using one of those tools. Well, again, I like having full control over my website, both its appearance and its code. A tool like WordPress doesn’t give you that control without a lot of customization. I also appreciate that my website’s code is simple and human-readable. Finally, I enjoy solving problems and writing software. I don’t see the time spent working on projects like this one as wasted. I even enjoy the process of turning each post into an HTML document. It’s a good Vim exercise and the process somehow makes me feel prouder of my posts.

Edit: The day after publishing this blog post, I decided to move my posts into YAML files and generate both the HTML for them and my blog's RSS feed from those files. Each post shares a lot of code and it's painful to change something in all of them (e.g. adding a link to the header). Now that each post is based on a template HTML file, it's easy to make these kinds of changes. It's also a lot easier to parse YAML than HTML. And I still get to hand-edit each post's HTML! (Although I might look into generating them from Markdown instead.)