memory-reduced planet?
Sam Ruby
rubys at intertwingly.net
Mon Jul 31 21:59:27 EST 2006
Richard Dawe wrote:
>
> I've spent a bit of time trying to work out where Planet was spending
> its time. I got fed up with it chewing up 100% of my CPU and
> interrupting my mp3 playback. ;)
>
> From some profiling ("python -m profile ...") I noticed that it was
> spending about 30% of its time in the getters/setters for the cache.
> These getters/setters go through a couple of layers of subroutine calls
> before returning data. Subroutine calls are apparently slow in Python.
> (I'm new-ish to Python.)
I'd like to propose a radical rewrite. One with an entirely different
cache implementation.
To illustrate, it might be helpful to refer to the Atom feed produced by
a planet, for example: http://planet.intertwingly.net/atom.xml
Note that each entry has an id and an updated date (even if Planet has
to generate one or both of these), and that each entry is self-contained:
the source information describes the original feed, and the author,
language and xml:base information is preserved.
All of this information is also in a format that we already have a
parser for.
At the moment, planet stores all of the information about a single feed
in a single file, using dbhash. The name of the file is based on the
URI used to fetch the feed.
The proposal is to instead have one file per entry, named based on the
id. The file update date in the file system would be set to match the
updated date of the entry. The format would be an Atom entry.
A normal planet run would consist of two phases, each of which could be
run separately if desired. The first phase would simply fetch feeds and
write entries. The second phase would get a directory listing, sort by
date, and read only the "n" most recent entries into memory, and would
run templates based on this information.
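The second phase could look roughly like the following. This is a
sketch under the assumption that the cache directory holds nothing but
entry files whose modification times were set to the entries' updated
dates; most_recent_entries is a hypothetical name, and the raw file
contents stand in for whatever parsed form the templates would need:

```python
import os


def most_recent_entries(cache_dir, n):
    """Return the contents of the n most recently updated entry files."""
    paths = [os.path.join(cache_dir, name) for name in os.listdir(cache_dir)]
    paths = [p for p in paths if os.path.isfile(p)]
    # Newest first, using the modification time as the updated date.
    paths.sort(key=os.path.getmtime, reverse=True)
    entries = []
    for path in paths[:n]:
        # Only the n newest files are ever opened and held in memory.
        with open(path, encoding="utf-8") as f:
            entries.append(f.read())
    return entries
```

Memory use then depends on n rather than on the total size of the
cache, which is the point of the exercise.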
There is a small amount of per-feed information (like permanent
redirects) that won't fit into such a structure. This can be handled by
treating the config.ini as read/write. For those that desire such a
thing, a web-based front end could be defined for managing this,
implemented as a Python server.
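As an illustration of treating config.ini as read/write, here is a
sketch that records a permanent redirect by renaming a feed's section.
It assumes each feed is a section named after its URI and that the file
has no [DEFAULT] section; record_redirect is a hypothetical helper.
Note that the standard library's configparser drops comments when it
rewrites a file, so a real implementation might want something gentler:

```python
import configparser


def record_redirect(config_path, old_url, new_url):
    """Move a feed's settings from section old_url to section new_url."""
    config = configparser.ConfigParser(interpolation=None)
    config.read(config_path, encoding="utf-8")
    if not config.has_section(old_url) or config.has_section(new_url):
        return False
    # Assumption: no [DEFAULT] section, so items() yields only the
    # options that belong to this feed.
    options = config.items(old_url)
    config.add_section(new_url)
    for key, value in options:
        config.set(new_url, key, value)
    config.remove_section(old_url)
    with open(config_path, "w", encoding="utf-8") as f:
        config.write(f)
    return True
```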
But in any case, at this point, all information would be in text
formats, and Planet would have no required external dependencies.
For those that wonder about such a thing, my weblog is based on
Blosxom's file format - every entry and every comment is a separate
file. I currently have over 18,000 files. The planet cache need not
ever grow that big. We could have it automatically limit the number of
cache files to some small multiple of the number of entries on the first
page.
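Limiting the cache could be as simple as the sketch below. The name
prune_cache and the default multiple of 3 are assumptions made for
illustration, as is the idea that everything in the directory is an
entry file:

```python
import os


def prune_cache(cache_dir, entries_on_first_page, multiple=3):
    """Delete all but the newest entries_on_first_page * multiple files."""
    keep = entries_on_first_page * multiple
    paths = [os.path.join(cache_dir, name) for name in os.listdir(cache_dir)]
    paths = [p for p in paths if os.path.isfile(p)]
    # Newest first; everything past the first `keep` files is removed.
    paths.sort(key=os.path.getmtime, reverse=True)
    removed = paths[keep:]
    for path in removed:
        os.remove(path)
    return removed
```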
- Sam Ruby