memory-reduced planet?

Mon Jul 31 21:59:27 EST 2006

Richard Dawe wrote:
> 
> I've spent a bit of time trying to work out where Planet was spending
> its time. I got fed up with it chewing up 100% of my CPU and
> interrupting my mp3 playback. ;)
> 
> From some profiling ("python -m profile ...") I noticed that it was
> spending about 30% of its time in the getters/setters for the cache.
> These getters/setters go through a couple of layers of subroutine calls
> before returning data. Subroutine calls are apparently slow in Python.
> (I'm new-ish to Python.)

I'd like to propose a radical rewrite.  One with an entirely different 
cache implementation.

To illustrate, it might be helpful to refer to the Atom feed produced by 
a planet, for example: http://planet.intertwingly.net/atom.xml

Note that each entry has an id and an updated date (even if Planet has
to generate one or both of these) and is self-contained (the source
element carries the information about the original feed), and that
author, language and xml:base information are preserved.

And that all this information is in a format that we already have a 
parser for.

At the moment, planet stores all of the information about a single feed 
in a single file, using dbhash.  The name of the file is based on the 
URI used to fetch the feed.

The proposal is to instead have one file per entry, named based on the 
id.  The file update date in the file system would be set to match the 
updated date of the entry.  The format would be an Atom entry.

A normal planet run would consist of two phases, each of which could be 
run separately if desired.  The first phase would simply fetch feeds and 
write entries.  The second phase would get a directory listing, sort by 
date, and read only the "n" most recent entries into memory, and would 
run templates based on this information.

There is a small amount of per-feed information (like permanent
redirects) that won't fit into such a structure.  This can be handled by
treating the config.ini as read/write.  For those who want one, a
web-based front end for managing this could be defined, implemented as
a Python server.

But in any case, at this point, all information would be in text 
formats, and Planet would have no required external dependencies.

For those that wonder about such a thing, my weblog is based on 
Blosxom's file format - every entry and every comment is a separate 
file.  I currently have over 18,000 files.  The planet cache need not 
ever grow that big.  We could have it automatically limit the number of 
cache files to some small multiple of the number of entries on the first 
page.

- Sam Ruby