I’d like to read and write more this year. So I’m subscribing to some feeds.

Feeds are great! The reader has control over their subscriptions, and how they read the content.

Rather than use services like Feedly or IFTTT, I wanted a simple and fast reader.

There are some cool self-hosted options like Planet, but I decided to implement my own.

Rendering feeds quickly

One benefit of using Feedly is that your feed loads pretty quickly, even if you’ve subscribed to loads of blogs.

That can be tricky to achieve if you implement a simple reader yourself.

Unlike GOV.UK, lots of blogs don’t limit the content returned in feeds (e.g. by using pagination).

If the author has written a lot over the years, that can make for a very heavy piece of XML one needs to request and parse.

I’ve added some optimisations to my feed reader so it doesn’t slow down as I subscribe to lots of blogs.

Caching

One obvious optimisation is caching. The page is cached both by my site (using the rack-cache gem) and by CloudFlare for 60 seconds, by providing the following header:

Cache-Control: public, must-revalidate, max-age=60

The Cache-Control response header is interpreted by clients and proxies (and can be ignored!):

  • public indicates that any cache between my server and the user can cache the page.
  • max-age tells caches (CloudFlare in my case) to treat the page as fresh for at most 60 seconds.
  • must-revalidate tells caches that once the page is stale (after max-age seconds), they shouldn't serve it without revalidating against my server, i.e. they should behave in a read-through manner.

I’m not expecting the feed to update more frequently than once a minute, so this works:

use Rack::Cache

before do
  expires 60, :public, :must_revalidate
end

get('/following') {
  erb :index, layout: :blog_layout, locals: { feed: { entries: Feed.me } }
}

I had a read of how the Rack::Cache gem handles caching. By default it has an in-memory heap store that uses a hash as a simple cache.

Further, the heap storage provides no mechanism for purging unused entries, so memory use is guaranteed to exceed what's available, given enough time and utilisation.

Using an in-memory cache has some downsides. For one thing, every deploy would purge the cache!

Rather than caching in application memory, I use memcached. I spun up a dockerised memcached instance on the same machine as the feed reader, and amended the Rack::Cache code as follows:

use Rack::Cache,
  metastore:    "memcached://localhost:11211/meta",
  entitystore:  "memcached://localhost:11211/body"

Now only one request a minute will load slowly.

Unfortunately, my usage pattern is going to be to read the posts every once in a while, not every minute. This doesn’t help much if I read posts every 30 minutes.

Concurrent requests

The main bottleneck is the time taken to fetch all of the feeds. Some feeds are pretty slow to load!

For example, some people I follow return all of their content in their RSS feeds. That can add several seconds to the request time when I’m only looking for their most recent few posts.

The next optimisation was to request each feed in a thread, so we can fetch all of the feeds concurrently. That’s simple to do with the concurrent-ruby gem.

Before (synchronous requests):

FEEDS = %w(
  https://www.bilbof.com/feed.xml
  https://technology.blog.gov.uk/feed/
)

posts = FEEDS.flat_map do |url|
  xml = HTTParty.get(url).body
  feed = Feedjira.parse(xml)
  feed.entries.first(20).map { |item|
    { title: item.title, link: item.url }
  }
end

After (using promises to make requests asynchronously):

def fetch_posts(feed_url)
  Concurrent::Promises.future(feed_url.dup) do |url|
    xml = HTTParty.get(url).body
    feed = Feedjira.parse(xml)
    feed.entries.first(20).map { |item|
      { title: item.title, link: item.url }
    }
  end
end

promised_posts = FEEDS.map { |url| fetch_posts(url) }

# zip waits for all the promises; flatten merges the per-feed arrays into one list
entries = Concurrent::Promises.zip(*promised_posts).value!.flatten

This is substantially faster; now the reader is pretty much only as slow as the slowest feed you subscribe to.

Making it faster (smarter caching)

Now we’re only as slow as the slowest feed request. But if one feed takes several seconds to load, that’s still going to have a big impact on how quickly I can load my feed.

There are other web acceleration techniques we can employ to speed this thing up, such as:

  • validating content with ETags
  • implementing a read-ahead cache, so that the latest content is always cached
  • caching some feeds for longer than others

I ended up implementing a read-ahead cache, and caching some feeds for longer than others.

In my mind a read-ahead caching approach is to request and cache data before a user requests it, and periodically refresh that cache (hence, reading ahead).

This is in contrast to a read-through cache, where you request and cache the data only when the user asks for it, which has the downside of making the user wait if you don’t already have the content they want. But correct me if I’m wrong here!

The first thing to do when implementing a read-ahead cache is to set up a cron job that periodically requests the results. That's easy to do with the whenever gem.
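Here's roughly what that looks like in whenever's config/schedule.rb; the interval and the rake task name are just placeholders for whatever refreshes the cached feeds:

# config/schedule.rb (whenever DSL)
# Re-fetch and cache the feeds every 10 minutes (interval and task name are illustrative).
every 10.minutes do
  rake "feeds:refresh"
end

Running whenever --update-crontab then writes the matching crontab entry.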

One option for caching the responses is to route my outgoing feed requests through something like Varnish and have it cache everything. But I couldn't find a nice way to tune Varnish to do that (I'm sure it's possible!).

Instead I re-used the memcached instance I’m already running to cache posts from each site. This allows me to cache some blogs for variable lengths of time, depending on how often they blog.
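Roughly, the shape of it looks like this (a sketch using the dalli gem to talk to memcached; the method names, TTL values and FEED_TTLS constant are illustrative rather than my exact code):

require "dalli"
require "httparty"
require "feedjira"

CACHE = Dalli::Client.new("localhost:11211")

# Per-feed TTLs: blogs that post often get refreshed more often (values are illustrative).
FEED_TTLS = {
  "https://technology.blog.gov.uk/feed/" => 15 * 60,
  "https://www.bilbof.com/feed.xml"      => 24 * 60 * 60,
}

# Called from the cron job: fetch each feed and cache its latest posts.
def refresh_feeds
  FEED_TTLS.each do |url, ttl|
    feed = Feedjira.parse(HTTParty.get(url).body)
    posts = feed.entries.first(20).map { |item| { title: item.title, link: item.url } }
    CACHE.set(url, posts, ttl)
  end
end

# Called when rendering /following: read whatever the cron job cached last.
def cached_posts(url)
  CACHE.get(url) || []
end

Because the cron job does the fetching, a request to /following only ever reads from memcached, so it stays fast even when one of the feeds is slow.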

Elasticsearch-backed feeds

This isn’t the first RSS/Atom/feed-y thing I’ve implemented this year.

In my work at GDS I helped add some Atom feeds powered by our Search API, which uses Elasticsearch.

Using Elasticsearch as a backing for Atom feeds is nice, since you can allow users to do lots of search-y things with feeds, like filtering and paginating.

For example, you can get a feed for News about Brexit from the UK Prime Minister.

You can even build your own feed on top of the Search API, such as an Atom feed for a query about micropigs. Though that’s not necessary, because there’s already a micropig feed.

Essentially it’s just presenting search results in Atom format (they’re also available in JSON), but the use of open standards gives a lot of opportunity to others. Want to be first to hear about an official update? Use the Atom feed - you’ll get updates quicker than you would by subscribing via email.

If you’d like to read more about micropigs and search at GOV.UK, my colleague Michael has written a post on GOV.UK search. We’re planning to write some more blog posts about search soon so stay tuned.

Conclusion

The reader is now running at /following. It’s a separate Sinatra application, and requests to /following are proxied to the application using Nginx.

The rest of this site is generated by Jekyll and the static files are served by Nginx.

I had initially thought about running the Sinatra app only as an API, called with Ajax from a static Jekyll page, but it seemed nicer to keep this blog fast and JS-free by rendering the feed server-side.

The code for the reader is available on GitHub at bilbof/feed.