Limit Posting Length

Baz brian.ewins at gmail.com
Fri Mar 3 04:24:19 EST 2006


(Sorry Minh, meant this for the list)

Here's an attempt at doing this. This is the first time I've written any
python to speak of, and I picked most of it up from the manual or from
reading the feedparser code, so feel free to barf. I stuck this in
'truncate.py' and tested it like so:

>>> import feedparser
>>> d = feedparser.parse("http://www.rumble.net/blog/?flav=rss")
>>> import truncate
>>> t = truncate._TruncateHTMLProcessor(500)
>>> t.feed(d.entries[1].summary)
>>> trunc = t.output()
>>> len(trunc)
>>> trunc

... but it really belongs in the planet code (and consider it licensed
as per planet); it needs to be reviewed before it's bundled into a patch,
though.

I used the _BaseHTMLProcessor in feedparser as the basis for truncating
the html, since it knows which tags need to be closed and which don't. I
didn't look at this too hard; there are other parsers in feedparser
which may be a better superclass. It seemed like the obvious thing to
do though: feedparser produces output we're happy with right now, and
incorporating something else might give unexpected results.

I'm using textwrap to find an appropriate word break at which to truncate.
I only count 'plain' text towards the length limit; entities are passed
through uncounted, since I imagine textwrap might wreck them by treating
them as punctuation. Textwrap also normalises whitespace (newlines and
tabs become spaces, and whitespace at the break is dropped), and tags
aren't counted at all, so the truncated html is usually significantly
longer than the text length you've asked for. I didn't handle
truncation for non-western languages (even assuming textwrap is ok
for most of the western ones), but I noticed there's some code to
handle chinese on ActiveState:
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/358117
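
To make the textwrap behaviour concrete, here is a minimal sketch using only
the standard library. The helper name first_line and the sample strings are
made up for illustration and are not part of truncate.py. It shows that
wrap() breaks at a word boundary, treats newlines as spaces, and measures an
entity by its source text rather than by the single character it renders as.

```python
import textwrap

def first_line(text, width):
    """Return the first wrapped line of text, or '' when there is none."""
    # first_line is a hypothetical helper, not part of truncate.py
    lines = textwrap.TextWrapper(width=width).wrap(text)
    if lines:
        return lines[0]
    return ''

# The break falls on a word boundary, not inside a word that fits.
print(repr(first_line("one two three four five", 10)))

# Newlines count as ordinary spaces, so the result is a single line.
print(repr(first_line("one\ntwo three", 10)))

# An entity is measured by its source text: '&amp;' costs five
# characters here, so less fits than with a literal '&'.
print(repr(first_line("a &amp; b", 6)))
print(repr(first_line("a & b", 6)))
```

This is why the class below keeps entities away from the wrapper and only
feeds it plain text.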


import feedparser
import textwrap
# truncate a chunk of html. based on feedparser so I inherit
# its understanding of html, and textwrap for its understanding
# of word breaks. I override 'handle_data' to figure out at
# what point I should stop emitting text, and override unknown*tag
# to make sure tags get closed. only lightly tested.
class _TruncateHTMLProcessor(feedparser._BaseHTMLProcessor):

    def __init__(self, length, encoding='utf-8'):
        self.encoding = encoding
        self.tags = []
        self.recording = True
        self.depth = 0
        self.length = length
        self.wrapper = textwrap.TextWrapper(width=length)
        self.text_so_far = ''
        feedparser._BaseHTMLProcessor.__init__(self, encoding)

    def reset(self):
        self.pieces = []
        feedparser._BaseHTMLProcessor.reset(self)
        self.tags = []
        self.recording = True
        self.depth = 0
        # also forget the text counted so far, so the instance can be reused
        self.text_so_far = ''

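    def unknown_starttag(self, tag, attrs):
        # remember every start tag that expects an end tag, and whether
        # it was emitted, so unknown_endtag closes exactly what we opened
        if tag not in self.elements_no_end_tag:
            self.tags.append((tag, self.recording))
            self.depth = len(self.tags)
        if self.recording:
            feedparser._BaseHTMLProcessor.unknown_starttag(self, tag, attrs)

    def unknown_endtag(self, tag):
        # ignore end tags that match nothing we have seen opened
        if tag not in [t for (t, emitted) in self.tags]:
            return
        # ignore omitted end tags
        while self.tags[-1][0] != tag:
            self.tags.pop()
        (tag, emitted) = self.tags.pop()
        self.depth = len(self.tags)
        if emitted:
            feedparser._BaseHTMLProcessor.unknown_endtag(self, tag)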

    def handle_charref(self, ref):
        # textwrapper might not handle entities properly.
        if self.recording:
            feedparser._BaseHTMLProcessor.handle_charref(self, ref)

    def handle_entityref(self, ref):
        # textwrapper might not handle entities properly.
        if self.recording:
            feedparser._BaseHTMLProcessor.handle_entityref(self, ref)

    def handle_data(self, text):
        if self.recording:
            current = self.text_so_far + text
            # wrap everything seen so far *including* this chunk; a second
            # line means this chunk is the one that crosses the limit
            lines = self.wrapper.wrap(current)
            if len(lines) > 1:
                self.recording = False
                # truncated text: keep only the part of this chunk that
                # still fits on the first wrapped line
                l1 = len(lines[0])
                l2 = len(self.text_so_far)
                if l1 > l2:
                    text = text[0:(l1 - l2)]
                else:
                    text = ''
                text = text + '&hellip;'
            else:
                self.text_so_far = current + ' '
            feedparser._BaseHTMLProcessor.handle_data(self, text)

    def handle_comment(self, text):
        if self.recording:
            feedparser._BaseHTMLProcessor.handle_comment(self, text)

    def handle_pi(self, text):
        if self.recording:
            feedparser._BaseHTMLProcessor.handle_pi(self, text)

    def handle_decl(self, text):
        if self.recording:
            feedparser._BaseHTMLProcessor.handle_decl(self, text)
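
For comparison, here is a rough standalone sketch of the same idea that does
not depend on feedparser internals. It is not the class above: it assumes
Python 3 and the standard html.parser module, cuts at the last space that
fits instead of using textwrap, and counts each entity as one character. The
names TruncatingParser, truncate_html and VOID are hypothetical, and the VOID
list is only an illustrative subset of elements that take no end tag.

```python
from html.parser import HTMLParser

# Elements that never take an end tag (assumed subset, for illustration).
VOID = {'area', 'base', 'br', 'col', 'hr', 'img', 'input', 'link',
        'meta', 'param'}

class TruncatingParser(HTMLParser):
    """Hypothetical stand-in that truncates after `length` text characters."""

    def __init__(self, length):
        HTMLParser.__init__(self, convert_charrefs=False)
        self.length = length      # budget of plain-text characters
        self.used = 0             # plain-text characters emitted so far
        self.open_tags = []       # emitted start tags still awaiting a close
        self.pieces = []          # output fragments, in order
        self.recording = True

    def handle_starttag(self, tag, attrs):
        if not self.recording:
            return
        if tag not in VOID:
            self.open_tags.append(tag)
        self.pieces.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # self-closing tags such as <br/> are copied through unchanged
        if self.recording:
            self.pieces.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.recording and tag in self.open_tags:
            # close elements whose end tags were omitted, then this one
            while True:
                top = self.open_tags.pop()
                self.pieces.append('</%s>' % top)
                if top == tag:
                    break

    def handle_entityref(self, name):
        if self.recording:
            self.used += 1
            self.pieces.append('&%s;' % name)

    def handle_charref(self, name):
        if self.recording:
            self.used += 1
            self.pieces.append('&#%s;' % name)

    def handle_data(self, data):
        if not self.recording:
            return
        room = self.length - self.used
        if len(data) <= room:
            self.used += len(data)
            self.pieces.append(data)
            return
        # cut at the last space that fits, then stop recording
        cut = data.rfind(' ', 0, room + 1)
        self.pieces.append(data[:max(cut, 0)] + '&hellip;')
        self.recording = False

    def output(self):
        # close whatever is still open, innermost first
        closers = ['</%s>' % t for t in reversed(self.open_tags)]
        return ''.join(self.pieces + closers)

def truncate_html(html, length):
    parser = TruncatingParser(length)
    parser.feed(html)
    parser.close()
    return parser.output()

print(truncate_html('<div><p>hello brave new world</p><p>more</p></div>', 11))
```

Once the limit is hit, the sketch ignores the rest of the input and closes
the still-open tags itself in output(), which avoids having to match up end
tags that arrive after the cut.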

On 3/2/06, Minh Nguyen <mxn at mxn.f2o.org> wrote:
> For me, the issue isn't only that the page gets too long, but also that
> the page gets too large and takes far too long to load. Hiding the text
> via CSS doesn't solve the latter issue; it has to be done on the
> server-side. I read somewhere that a major problem with clipping entries
> is that you might cut off the entry mid-character (especially a problem
> with scripts like Chinese) and thus mangle the rest of the page.
>
> Is there perhaps an installable Python package that can handle stuff
> like this for us?

