Too long a filename...

Thu Oct 5 19:33:38 EST 2006

Amit Chakradeo (अमित चक्रदेव) wrote:
> Hi,
> 
>    I am seeing some errors about file names being too long. How do I 
> deal with this?
> 
> Example feed: http://www.expressindia.com/syndications/ei.xml
> 
> (Those are only advertisements anyway, so I can just skip them, but can 
> I just change the spider.filename function to truncate the name if it is 
> too long?)

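I'm concerned that plain truncation would cause duplicates, since two long 
names that share a prefix would map to the same cache file.  I would prefer 
something like the following, which keeps a readable prefix and replaces 
the remainder with its MD5 hash: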

   === modified file 'planet/spider.py'
   --- planet/spider.py
   +++ planet/spider.py
   @@ -34,6 +34,16 @@
        filename = re_initial_cruft.sub("", filename)
        filename = re_final_cruft.sub("", filename)

   +    # limit length of filename
   +    if len(filename)>250:
   +        parts=filename.split(',')
   +        for i in range(len(parts),0,-1):
   +            if len(','.join(parts[:i])) < 220:
   +                import md5
   +                filename = ','.join(parts[:i]) + ',' + \
   +                    md5.new(','.join(parts[i:])).hexdigest()
   +                break
   +
        return os.path.join(directory, filename)

    def write(xdoc, out):
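
For readers who want to try the idea outside of spider.py, here is a 
minimal standalone sketch of the same approach.  It is not the patch 
itself: the function name shorten and the two constants are hypothetical, 
hashlib stands in for the older md5 module, and the final fallback (for a 
name with no sufficiently short comma-delimited prefix) is an assumption 
that the patch above does not cover.

```python
import hashlib

MAX_LEN = 250      # same threshold as the patch above
PREFIX_LEN = 220   # leaves room for a comma plus 32 hex digits


def shorten(filename):
    """Keep a readable prefix and replace the remainder with its MD5 hex digest."""
    if len(filename) <= MAX_LEN:
        return filename
    parts = filename.split(',')
    # Try the longest comma-delimited prefix first, as the patch does.
    for i in range(len(parts), 0, -1):
        prefix = ','.join(parts[:i])
        if len(prefix) < PREFIX_LEN:
            rest = ','.join(parts[i:])
            return prefix + ',' + hashlib.md5(rest.encode('utf-8')).hexdigest()
    # Assumption beyond the patch: no comma-delimited prefix is short enough,
    # so cut at a fixed offset and hash whatever follows.
    head, rest = filename[:PREFIX_LEN - 1], filename[PREFIX_LEN - 1:]
    return head + ',' + hashlib.md5(rest.encode('utf-8')).hexdigest()
```

Two long names that differ only near the end keep the same prefix but get 
different hashes, which is the property plain truncation lacks.  The result 
is at most 252 characters long.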

> Is there a general way to skip something in a feed? I looked at the scrub 
> function, which looks at the ignore_in_feed config options, but that is 
> probably for skipping some field within a post. Maybe we can just add a 
> regexp for a particular feed in the config which, when matched against an 
> entry, will make the spider skip that entry?

That's exactly what filters are designed for.  Filters are arbitrary 
programs written in the programming language of your choice.  Each time 
they are invoked, they are passed a single entry which has been 
sanitized and normalized to UTF-8, XHTML, and Atom 1.0.

Normally, filters copy stdin to stdout, and they can of course modify the 
data in transit.  More importantly for you, if a filter outputs zero bytes, 
the entry is ignored.

Filters can be defined at the [planet] level, or at the individual 
[feed] level in the configuration file.
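
As a rough illustration, a filter that drops the advertisement entries 
might look something like the sketch below.  It assumes the entry arrives 
on stdin as a single Atom 1.0 entry element, and that the ad entries carry 
the banner host in their id or in a link href; the names BLOCKED, 
filter_entry and main are hypothetical.

```python
import sys
import xml.etree.ElementTree as ET

ATOM = '{http://www.w3.org/2005/Atom}'
# Host taken from the error message in this thread; adjust as needed.
BLOCKED = 'banners.expressindia.com'


def filter_entry(entry_bytes):
    """Return the entry unchanged, or b'' if it points at the blocked host."""
    if not entry_bytes.strip():
        return b''
    root = ET.fromstring(entry_bytes)
    # Assumption: the ad URL shows up in the entry id or in a link href.
    candidates = [el.text or '' for el in root.iter(ATOM + 'id')]
    candidates += [el.get('href', '') for el in root.iter(ATOM + 'link')]
    if any(BLOCKED in value for value in candidates):
        return b''   # zero bytes of output: the entry is ignored
    return entry_bytes


def main():
    # Copy stdin to stdout unless the entry is filtered out.
    sys.stdout.buffer.write(filter_entry(sys.stdin.buffer.read()))
```

To use this as an actual filter script, call main() at the bottom of the 
file and reference the script from the configuration file.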

> Thanks!
> Amit
> 
> 
> Errors:
> ERROR:planet.runner:IOError: [Errno 36] File name too long: u'examples/cache/banners.expressindia.com,adsnew,adclick.php,bannerid=2161&zoneid=&source=&dest=https%3A%2F%2Fwww.online.citibank.co.in%2Fportal%2Fcitiinforms.jsp%3Fform_id%3DfrmRcaEnglish%26Site%3DExpressindia%26Creative%3DTextlink%26Section%3DROS%26Agency_Code%3DDBS%26Campaign_Code%3DRCAO%26Product_Code%3DRCA%26eOfferCode%3DIEX03TXT'
> ERROR:planet.runner:  File "/home/.caterina/amitc/planet/intertwingly/venus/planet/spider.py", line 275, in spiderPlanet
>     spiderFeed(feed)
> ERROR:planet.runner:  File "/home/.caterina/amitc/planet/intertwingly/venus/planet/spider.py", line 224, in spiderFeed
>     write(output, cache_file)
> ERROR:planet.runner:  File "/home/.caterina/amitc/planet/intertwingly/venus/planet/spider.py", line 41, in write
>     file = open(out,'w')
> 

- Sam Ruby