
Debian's UDD feeds freak out feedparser

Open anarcat opened this issue 6 years ago • 8 comments

My personal UDD todo list breaks feedparser. If you add the tests to the "illformed" directory, tox says:

GLOB sdist-make: /home/anarcat/dist/feedparser/setup.py
py27 create: /home/anarcat/dist/feedparser/.tox/py27
py27 inst: /home/anarcat/dist/feedparser/.tox/dist/feedparser-5.2.1.zip
py27 installed: feedparser==5.2.1,pkg-resources==0.0.0
py27 runtests: PYTHONHASHSEED='1353716627'
py27 runtests: commands[0] | /home/anarcat/dist/feedparser/.tox/py27/bin/python tests/runtests.py
Traceback (most recent call last):
  File "tests/runtests.py", line 835, in <module>
    runtests()
  File "tests/runtests.py", line 789, in runtests
    description, evalString, skipUnless = getDescription(xmlfile, data)
  File "tests/runtests.py", line 740, in getDescription
    raise RuntimeError("can't parse %s" % xmlfile)
RuntimeError: can't parse ./tests/illformed/udd.xml
ERROR: InvocationError: '/home/anarcat/dist/feedparser/.tox/py27/bin/python tests/runtests.py'
py35 create: /home/anarcat/dist/feedparser/.tox/py35
py35 inst: /home/anarcat/dist/feedparser/.tox/dist/feedparser-5.2.1.zip
py35 installed: feedparser==5.2.1,pkg-resources==0.0.0,sgmllib3k==1.0.0
py35 runtests: PYTHONHASHSEED='1353716627'
py35 runtests: commands[0] | /home/anarcat/dist/feedparser/.tox/py35/bin/python tests/runtests.py
Traceback (most recent call last):
  File "tests/runtests.py", line 835, in <module>
    runtests()
  File "tests/runtests.py", line 789, in runtests
    description, evalString, skipUnless = getDescription(xmlfile, data)
  File "tests/runtests.py", line 740, in getDescription
    raise RuntimeError("can't parse %s" % xmlfile)
RuntimeError: can't parse ./tests/illformed/udd.xml
ERROR: InvocationError: '/home/anarcat/dist/feedparser/.tox/py35/bin/python tests/runtests.py'
_______________________________________________________________________________ summary ________________________________________________________________________________
ERROR:   py27: commands failed
ERROR:   py35: commands failed

the problem seems to be there is no guid field and an empty link field on some entries, which breaks (reasonable) expectations from feedparser...
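For illustration only, here is a rough sketch of how one might scan a feed for the two conditions described above (no guid, empty link) using just the standard library. The sample feed and the `find_suspect_items` helper are made up for this example; the sample is not the actual UDD output, and this is not how feedparser itself parses feeds.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, NOT the real UDD output.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>example todo list</title>
    <item>
      <title>first task</title>
      <link></link>
    </item>
    <item>
      <title>second task</title>
      <link>https://example.org/2</link>
      <guid>https://example.org/2</guid>
    </item>
  </channel>
</rss>
"""


def find_suspect_items(rss_text):
    """Return (title, problems) for each item with no guid or an empty link."""
    root = ET.fromstring(rss_text)
    suspects = []
    for item in root.iter("item"):
        problems = []
        if item.find("guid") is None:
            problems.append("missing guid")
        link = item.find("link")
        # An absent <link> and a present-but-blank <link> are treated alike.
        if link is None or not (link.text or "").strip():
            problems.append("empty link")
        if problems:
            suspects.append((item.findtext("title", default=""), problems))
    return suspects


if __name__ == "__main__":
    for title, problems in find_suspect_items(SAMPLE_RSS):
        print(title, "->", ", ".join(problems))
```

Running something like this against the downloaded feed could help confirm whether those two conditions are really what the affected entries have in common.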

anarcat avatar Sep 06 '17 14:09 anarcat

What behavior do you expect from feedparser in this case? Should the invalid entries be silently ignored? Should feedparser produce entries without a link?

Maybe UDD should be fixed? That feed is not valid.

twm avatar Jan 13 '18 23:01 twm

it should:

  1. not crash

  2. make an educated guess at a UID

I do this in feed2exec:

        if not item.get('id'):
            item['id'] = item.get('title')

it's just a dumb heuristic, but it works better than crashing on an arbitrary feed.

at the very least, i would want feedparser to be robust (ie. not crash) on bad content. delivering a non-empty feed is extra...

anarcat avatar Jan 15 '18 02:01 anarcat

Hmm, that heuristic would work in this particular case but in the wild repeated entry titles are pretty common (e.g., http://www.pusheen.com/rss) so I wouldn't want it built into feedparser except on an opt-in basis. As a feedparser user I'd rather have no ID than a heuristic that I can't fix.

My first inclination for a heuristic would have been to use the item date as a final fall-back, but that doesn't work for this feed either. :-/ So maybe skipping 'id' or making it the empty string is best in this case. Then you can add heuristics on top (e.g., a more robust one would be to hash all the item fields in cases like this).
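A minimal sketch of that "hash all the item fields" idea, assuming items are plain dicts as in the feed2exec snippet above. The `fallback_id` and `ensure_id` names are hypothetical; nothing like this exists in feedparser, and it is meant as something a caller could layer on top.

```python
import hashlib
import json


def fallback_id(item):
    """Derive an ID by hashing every field of the item except 'id' itself."""
    fields = {key: value for key, value in item.items() if key != "id"}
    # sort_keys keeps the digest independent of key order;
    # default=str handles values json cannot serialize natively.
    payload = json.dumps(fields, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def ensure_id(item):
    """Set item['id'] only when the feed did not supply a usable one."""
    if not item.get("id"):
        item["id"] = fallback_id(item)
    return item


if __name__ == "__main__":
    entry = {"title": "fix package foo", "link": "", "description": "todo"}
    print(ensure_id(entry)["id"])
```

The trade-offs are the obvious ones: two entries that are identical in every field still collide, and an entry's ID changes whenever the feed edits any of its fields, so this is a last resort and not a replacement for a real guid.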

twm avatar Jan 19 '18 06:01 twm

yep, i don't mind rolling my own heuristics here... i guess what i need here is for feedparser to ... er... not crash. :)

anarcat avatar Jan 23 '18 19:01 anarcat

@anarcat, are you still seeing this behavior? If so, I'll jump in on this and work to get feedparser to quit crashing.

Re: GUID heuristics, feedparser won't be updated to inject GUIDs, but you're right, feedparser shouldn't be crashing!! =)

kurtmckee avatar Apr 27 '18 20:04 kurtmckee

i still get the same error as originally reported. should i send a PR to get the failing unit test in place?

to reproduce, you simply need to do this:

wget -O tests/illformed/udd.xml 'https://udd.debian.org/dmd/?email1=anarcat%40debian.org&email2=&email3=&packages=&ignpackages=photofloat&nosponsor1=on&format=rss#todo'

and run the test suite.

anarcat avatar May 07 '18 13:05 anarcat

Perfect, I'll try to get this fixed.


kurtmckee avatar May 07 '18 17:05 kurtmckee

FYI: There is also another problem with Debian-related feeds: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=926074

Please open a bug report with Debian against the tracker.debian.org package and post the link here. Thanks.

buhtz avatar Jul 14 '19 20:07 buhtz