http://planete.websemantique.org/ and user defined content filtering

Sam Ruby rubys at intertwingly.net
Thu Oct 12 21:27:29 EST 2006


Eric van der Vlist wrote:
> Hi,
> 
> We have created a new planet for semantic web oriented blogs in French:
> http://planete.websemantique.org/
> 
> Right now, this is just a plain vanilla installation of the venus
> flavor. The install has been really trouble free and I'd like to thank
> you for the quality of this software.
> 
> Our users have two requests that involve filtering the blog entries that
> appear on the planet.
> 
> The first one is generic to most planet sites: most of the blogs
> are not focused on a single topic, and planets have a lot of entries that
> are irrelevant to the planet's main topic.
> 
> I personally find that this is a feature more than a bug, since it's nice
> to have a broad vision of what bloggers write outside the scope of the
> planet topic, but I can also understand that it would be useful to give
> users the ability to get a filtered view of the planet (even if I am not
> so keen on it myself).
> 
> The other one is more specific to multilingual planets. Although the
> planet is primarily targeted to French speaking visitors, most of the
> blogs that we federate contain both French and English posts. Again, I
> personally prefer to see entries in both languages but can also
> understand that some users might prefer to see only posts in a single
> language.
> 
> The main issue is that we'd like to add these features on top of existing
> feeds. These feeds do not always include subjects or categories that can
> be used for topic filtering, and none of them include any specific
> information about the language.
> 
> My idea would be to detect that an item is relevant to the planet's main
> topic by checking a number of keywords (for the semantic web, this seems
> quite doable), but other algorithms could be used, including Bayesian
> filters like those used by anti-spam systems, though these would require
> a training phase.
> 
> For the language detection, I would try to find an open source system to
> do that. Another option would be to spell-check the text against different
> languages (in our case there are only two) and pick the language for
> which the fewest errors are detected.
> 
> I have taken a look at the filter mechanism, and all of that seems to be
> pretty easy to implement (BTW, I am wondering if there is already a
> generic XSLT filter). The only thing which is still somewhat mysterious
> to me is the options mechanism, but I can probably find out by myself how
> it works...
> 
> The difference between these filters and the other similar content
> filters that I have found in the list archives is that these ones
> would not remove entries but would add new metadata based on their
> findings. This metadata could then be copied into the XHTML pages, and a
> piece of JavaScript would hide entries based on user preferences.
> 
> This seems quite simple and obvious to implement, but I am wondering if
> this has already been done and, if not, whether you have any advice for me.

My guess is that you have figured out pretty much everything you need to 
know already, but here's an overview:

Filters are simple Unix pipes.  Input comes in on stdin, parameters come 
from the config file, and output goes to stdout.  Anything written to 
stderr is logged as an ERROR message.  If no stdout is produced, the 
entry is not written to the cache or processed further.
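To make the pipe model concrete, here's a minimal sketch of a topic filter along the lines Eric describes (the keyword and the matching logic are purely illustrative — a real filter would parse the Atom XML rather than substring-match):

```python
#!/usr/bin/env python
# Hypothetical Venus filter sketch: reads one normalized Atom entry on
# stdin, and echoes it to stdout only if it mentions a keyword.
# Producing no stdout output causes the entry to be dropped.
import sys

def filter_entry(entry_xml, keyword='semantic'):
    """Return the entry unchanged if it mentions the keyword, else None."""
    if keyword.lower() in entry_xml.lower():
        return entry_xml
    return None  # no output => entry is not cached or processed further

if __name__ == '__main__' and not sys.stdin.isatty():
    result = filter_entry(sys.stdin.read())
    if result is not None:
        sys.stdout.write(result)
```

Anything the filter prints to stderr would show up as an ERROR in the planet's log, which is handy for debugging.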

Input to filters is an aggressively normalized entry.  Everything is 
converted to Atom 1.0, XHTML, and UTF-8, meaning that you don't have to 
worry about funky feeds, tag soup, or encodings.  If a feed is RSS 1.0 
with 10 items, your filter will be called ten times, each time with a 
single Atom 1.0 entry.
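In other words, regardless of what the source feed looked like, your filter sees something along these lines (the ids, dates, and content below are made up for illustration):

```xml
<entry xmlns="http://www.w3.org/2005/Atom">
  <id>http://example.com/blog/1</id>
  <title>Un billet sur le web sémantique</title>
  <updated>2006-10-12T12:00:00Z</updated>
  <content type="xhtml">
    <div xmlns="http://www.w3.org/1999/xhtml"><p>...</p></div>
  </content>
</entry>
```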

There is a small set of example filters in the 'filters' directory. 
coral_cdn_filter will change links to images in the entry itself.  The 
filters in the stripAd directory are conceptually similar.

excerpt is closest to what you describe.  It adds metadata (in the form 
of a planet:excerpt element) to the feed itself.  You can see examples 
of how parameters are passed to this program in either 
tests/data/filter/excerpt-images.ini or examples/opml-top100.ini.  Note: 
templates written using htmltmpl currently only have access to a fixed 
set of fields, whereas xslt templates have access to everything.
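The ini files mentioned above are the authoritative reference, but the general shape is that each filter gets its parameters from a config section named after it — something like this (parameter values here are illustrative):

```ini
[Planet]
filter_directories = filters
filters = excerpt.py

[excerpt.py]
width = 500
omit = img p br
```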

xpath_sifter is a variation of the above, including or excluding feeds 
based on the presence (or absence) of data specified by xpath expressions.
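For the topic-filtering use case, a sketch of an xpath_sifter configuration might look like the following (the xpath expression is illustrative — adjust it to whatever your feeds actually carry):

```ini
[Planet]
filters = xpath_sifter.py

[xpath_sifter.py]
require = //atom:category[@term='semweb']
```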

Final notes:

  * the file extension of the filter is significant.  .py invokes python.
    .xslt invokes xslt.  sed and tmpl (a.k.a. htmltmpl) are also options.
    If you wanted, say, perl or ruby or class/jar (java), these would be
    easy to add.

  * at the moment, xslt based filters don't have access to parameters.
    This is definitely doable, just not yet implemented.  The change
    would be to planet/shells/xslt.py.

  * Any filters listed in the [planet] section of your config.ini will
    be invoked on all feeds.  Filters listed in individual [feed]
    sections will only be invoked on those feeds.

  * If you list multiple filters, they are simply invoked in the order
    you list them (think unix pipes).
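Putting those two points together, a config might look like this (the filter names are made up; the point is the global-vs-per-feed placement and the left-to-right ordering):

```ini
[Planet]
# applied to every feed, in the order listed (like a pipe)
filters = detect_language.py filter_by_topic.py

[http://example.com/blog/atom.xml]
# applied only to this feed
filters = strip_ads.py
```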

  * You mention javascript, and that's definitely doable.  Another,
    more low-tech but effective, approach is to use multiple output
    templates to produce varying results.  If you look at
    themes/mobile/config.ini, you will see that both
    index.html.xslt and mobile.html.xslt are listed.
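That is, one run of the planet can render the same set of entries through several templates at once — roughly:

```ini
[Planet]
template_files = index.html.xslt mobile.html.xslt
```

Each template could then apply its own filtering logic (say, French-only vs. everything), giving users a choice of views with no javascript at all.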

Hopefully, this will be enough to get you started.  But mostly, have 
fun!  And if you have any questions, ask them here.

- Sam Ruby
