infoqscraper
Scraping fails due to metadata changes
Found in version 0.1.5
As of March 2019, scraping presentations no longer works due to format changes in the presentation HTML page.
Traceback (most recent call last):
  File "/usr/local/bin/infoqscraper", line 33, in <module>
    sys.exit(main.main())
  File "/usr/local/lib/python2.7/dist-packages/infoqscraper/main.py", line 374, in main
    return module.main(infoq_client, args.module_args)
  File "/usr/local/lib/python2.7/dist-packages/infoqscraper/main.py", line 194, in main
    return command.main(infoq_client, args.command_args)
  File "/usr/local/lib/python2.7/dist-packages/infoqscraper/main.py", line 314, in main
    builder.create_presentation()
  File "/usr/local/lib/python2.7/dist-packages/infoqscraper/convert.py", line 82, in create_presentation
    video = self.download_video()
  File "/usr/local/lib/python2.7/dist-packages/infoqscraper/convert.py", line 103, in download_video
    rvideo_path = self.presentation.metadata['video_path']
  File "/usr/local/lib/python2.7/dist-packages/infoqscraper/scrap.py", line 171, in metadata
    'title': get_title(pres_div),
  File "/usr/local/lib/python2.7/dist-packages/infoqscraper/scrap.py", line 91, in get_title
    return pres_div.find('h1', class_="general").div.get_text().strip()
AttributeError: 'NoneType' object has no attribute 'find'
In fact, the fields that scrap.py is looking for are metadata and are not used by the main application. Removing them allows the presentation to be grabbed correctly.
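A less drastic alternative to removing the fields would be to treat them as optional. The following is only a minimal sketch, not the project's actual code: get_title repeats the lookup shown in the traceback, while optional_field and build_metadata are hypothetical names introduced here for illustration, and the metadata dict is reduced to the single 'title' key.

```python
def optional_field(extractor, node):
    """Run extractor(node); return None if the expected markup is missing."""
    try:
        return extractor(node)
    except (AttributeError, TypeError):
        # Either node is None or a lookup in the chain returned None.
        return None


def get_title(pres_div):
    # Same lookup as in the traceback; raises AttributeError
    # when pres_div (or the h1 element) is None.
    return pres_div.find('h1', class_="general").div.get_text().strip()


def build_metadata(pres_div):
    # Hypothetical, reduced version of the metadata dict: only 'title'.
    return {'title': optional_field(get_title, pres_div)}
```

With this shape, a page whose layout changed yields 'title': None instead of aborting the whole download, as long as the fields the downloader really needs (such as 'video_path') can still be found.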
Thanks for the report (I haven't used infoqscraper for a while). I'm kind of busy these days, but I will fix that.
No problem, I have a 'fix' (just removed the unused metadata fields) on my fork (https://github.com/andreweacott/infoqscraper/tree/bugfix/resolve_scraper_failure), but I haven't been able to get the tests to complete, so I didn't want to raise a PR. The fixed app works for me though.
Even with the fixed fork by @andreweacott I keep getting the following error:
> ~/.local/bin/infoqscraper presentation download soa-without-esb
Failed to create presentation soa-without-esb.avi: Failed to download video at rtmpe://video.infoq.com/cfx/st/: rtmpdump exited with -11.
Output:
b''
This happens with both older videos (like the one above) and new ones (e.g. work-purpose). Using Gentoo Linux with RTMPDump 2.4 (version dated 2016/12/10) and Python 2.7.15 / 3.6.5 (not sure which this program runs on). The outcome seems a little weird as rtmpdump's source only ever seems to exit with 0, 1, 2 or 3 (i.e. one of the RD_* constants), and infoqscraper's subprocess.check_output call should pass the child's exit code as-is. I'm not a Python person, but it seems the forked infoqscraper invokes rtmpdump here with the equivalent of
rtmpdump -q -e -r rtmpe://video.infoq.com/cfx/st/ -y mp4:presentations/qcon08-howbigismybus.mp4 -o temp_video.avi
which I can only get to return 1 – it fails to get the last keyframe and closes the connection. If I omit the -e flag, it connects and handshakes but then invariably segfaults with a resulting exit code of 139.
Sorry for just dumping this here, but there's no issue tracker for the fork and this is my first time using rtmpdump directly. Do you have any idea if my problem is in infoqscraper, in my version of rtmpdump itself or perhaps some misunderstanding?
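A note on the two numbers above: on POSIX systems Python's subprocess module reports a child that was terminated by signal N as return code -N, while a shell reports the same event as 128 + N. Signal 11 is SIGSEGV on Linux, so -11 and 139 may well describe the same crash in rtmpdump rather than a value rtmpdump chose to exit with. The helper below is only an illustrative sketch: describe_returncode is a hypothetical name, signal.Signals needs Python 3.5 or later, and the "128 + N" reading is a shell convention, not a guarantee.

```python
import signal


def describe_returncode(returncode):
    """Interpret a return code as reported by subprocess or by a shell."""
    if returncode < 0:
        # subprocess convention (POSIX): -N means "terminated by signal N".
        sig = -returncode
        try:
            name = signal.Signals(sig).name
        except ValueError:
            name = "unknown"
        return "killed by signal {} ({})".format(sig, name)
    if returncode > 128:
        # Shell convention: 128 + N usually means "terminated by signal N",
        # although a program could also exit with such a status on its own.
        return "possibly shell-style 128 + signal {}".format(returncode - 128)
    return "exited normally with status {}".format(returncode)
```

Calling describe_returncode(-11) and describe_returncode(139) both point at signal 11, which would be consistent with the segfault seen when running rtmpdump by hand.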