Scraping fails due to metadata changes

Open andreweacott opened this issue 5 years ago • 3 comments

Found in version 0.1.5

As of March 2019, scraping presentations no longer works due to format changes in the presentation HTML page.

Traceback (most recent call last):
  File "/usr/local/bin/infoqscraper", line 33, in <module>
    sys.exit(main.main())
  File "/usr/local/lib/python2.7/dist-packages/infoqscraper/main.py", line 374, in main
    return module.main(infoq_client, args.module_args)
  File "/usr/local/lib/python2.7/dist-packages/infoqscraper/main.py", line 194, in main
    return command.main(infoq_client, args.command_args)
  File "/usr/local/lib/python2.7/dist-packages/infoqscraper/main.py", line 314, in main
    builder.create_presentation()
  File "/usr/local/lib/python2.7/dist-packages/infoqscraper/convert.py", line 82, in create_presentation
    video = self.download_video()
  File "/usr/local/lib/python2.7/dist-packages/infoqscraper/convert.py", line 103, in download_video
    rvideo_path = self.presentation.metadata['video_path']
  File "/usr/local/lib/python2.7/dist-packages/infoqscraper/scrap.py", line 171, in metadata
    'title': get_title(pres_div),
  File "/usr/local/lib/python2.7/dist-packages/infoqscraper/scrap.py", line 91, in get_title
    return pres_div.find('h1', class_="general").div.get_text().strip()
AttributeError: 'NoneType' object has no attribute 'find'

In fact, the fields that scrap.py is looking for here are metadata that the main application does not use. Removing them allows the presentation to be grabbed correctly.
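
As an alternative to deleting the unused fields, the lookup could be made tolerant of layout changes. The following is only a minimal sketch of that idea: optional_field is a hypothetical helper that does not exist in infoqscraper, and get_title here is a stand-in that copies the line from the traceback so the failure can be reproduced without the real page.

```python
def optional_field(extractor, node, default=None):
    """Run extractor(node); fall back to default if the page layout changed.

    Hypothetical helper, not part of infoqscraper.
    """
    try:
        return extractor(node)
    except (AttributeError, TypeError, IndexError, KeyError):
        # The expected element is missing, so treat the field as absent.
        return default


def get_title(pres_div):
    # Stand-in for the extractor shown in the traceback above.
    return pres_div.find('h1', class_="general").div.get_text().strip()


# pres_div is None when the expected container is not found on the page,
# which is the situation that raised AttributeError in the report.
title = optional_field(get_title, None, default="")
print(repr(title))
```

With this approach a missing title yields an empty string instead of aborting the download, while fields the application really needs (such as video_path) can still be looked up strictly.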

andreweacott avatar Jul 09 '19 20:07 andreweacott

Thanks for the report (I haven't used infoqscraper for a while). I'm kind of busy these days, but I will fix that.

cykl avatar Jul 09 '19 21:07 cykl

No problem, I have a 'fix' (just removed the unused metadata fields) on my fork (https://github.com/andreweacott/infoqscraper/tree/bugfix/resolve_scraper_failure) but I've not been able to get the tests to complete so didn't want to raise a PR. The fixed app works for me though.

andreweacott avatar Jul 09 '19 21:07 andreweacott

Even with the fixed fork by @andreweacott I keep getting the following error:

> ~/.local/bin/infoqscraper presentation download soa-without-esb
Failed to create presentation soa-without-esb.avi: Failed to download video at rtmpe://video.infoq.com/cfx/st/: rtmpdump exited with -11.
	Output:
b''

This happens with both older videos (like the one above) and new ones (e.g. work-purpose). Using Gentoo Linux with RTMPDump 2.4 (version dated 2016/12/10) and Python 2.7.15 / 3.6.5 (not sure which this program runs on). The outcome seems a little weird as rtmpdump's source only ever seems to exit with 0, 1, 2 or 3 (i.e. one of the RD_* constants), and infoqscraper's subprocess.check_output call should pass the child's exit code as-is. I'm not a Python person, but it seems the forked infoqscraper invokes rtmpdump here with the equivalent of

rtmpdump -q -e -r rtmpe://video.infoq.com/cfx/st/ -y mp4:presentations/qcon08-howbigismybus.mp4 -o temp_video.avi

which I can only get to return 1 – it fails to get the last keyframe and closes the connection. If I omit the -e flag, it connects and handshakes but then invariably segfaults with a resulting exit code of 139.
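
A note on the two numbers: on POSIX systems, Python's subprocess module reports a child that was killed by signal N with a return code of -N, whereas a shell reports the same event as 128 + N. Signal 11 is SIGSEGV, so "exited with -11" and the shell's 139 would both describe a segfault in rtmpdump rather than one of its own exit codes. The sketch below illustrates the convention; describe_exit is a hypothetical helper, the child process is a throwaway Python interpreter that sends itself SIGSEGV (not rtmpdump), and signal.Signals requires Python 3.5 or later.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Translate a subprocess return code into a readable description."""
    if returncode < 0:
        # POSIX convention in subprocess: -N means "terminated by signal N".
        sig = signal.Signals(-returncode)
        return "killed by %s (a shell would report %d)" % (sig.name, 128 - returncode)
    return "exited with status %d" % returncode


# Child process that kills itself with SIGSEGV, to mimic a crashing tool.
child = [sys.executable, "-c",
         "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"]
try:
    subprocess.check_output(child)
    code = 0
except subprocess.CalledProcessError as exc:
    code = exc.returncode

print(describe_exit(code))
```

If that reading is right, the -11 from infoqscraper and the 139 seen when running rtmpdump by hand without -e may be the same crash observed through two different conventions.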

Sorry for just dumping this here, but there's no issue tracker for the fork and this is my first time using rtmpdump directly. Do you have any idea if my problem is in infoqscraper, in my version of rtmpdump itself or perhaps some misunderstanding?

skrinakron avatar Oct 26 '19 11:10 skrinakron