autocards
Support client-side rendered content
Many sites aren't rendered server-side and so are unusable with consume_web, for example all the articles on KhanAcademy: https://www.khanacademy.org/humanities/world-history/medieval-times/cross-cultural-diffusion-of-knowledge/a/the-golden-age-of-islam
Integration with Selenium, Splash, etc. would be one way to fix this.
Hi! Thanks for your interest in autocards.
I've contributed quite a lot to autocards through PRs (see, for example, the pending PR), but sadly I'm terrible at web design, so I very probably won't do this myself.
If you provide a clean way to get the text data from a URL, I can integrate it into the codebase very quickly if you want.
Have a nice day!
I looked into it and it seems that basically every solution requires either 1) integration with a web browser or 2) a paid service (which probably uses 1 under the hood).
Here's one working example:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver import FirefoxOptions
# need Firefox installed, and the corresponding Firefox driver
# see https://selenium-python.readthedocs.io/installation.html#drivers
opts = FirefoxOptions()
# I'm using WSL, so I need this option
opts.add_argument("--headless")
url = "https://www.khanacademy.org/humanities/world-history/medieval-times/cross-cultural-diffusion-of-knowledge/a/the-golden-age-of-islam"
driver = webdriver.Firefox(options=opts)
driver.get(url)
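# name the parser explicitly; without it, bs4 guesses one and emits a warning
soup = BeautifulSoup(driver.page_source, "html.parser")
# quit() ends the whole browser session; close() would only close the current window
driver.quit()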
Unfortunately, it requires having Firefox installed and the corresponding web driver on your PATH. There is also requests-html, which is supposed to be a drop-in replacement for requests. It supports 'rendering' the JS in the page, but it also seems to work by just downloading a Chromium instance the first time you call it. And I'm getting an error with it anyway (maybe WSL-related).
This is to say that all of these methods are brittle, and trying to support them in the library itself would be a pain. But including instructions on how to do it somewhere might be useful.
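For such instructions, one possible shape is sketched below. It is not part of autocards: `html_to_text` and `fetch_rendered_text` are hypothetical helper names, and the caller supplies the browser driver, so nothing here depends on Selenium being installed. The text extraction uses only the standard library instead of BeautifulSoup, purely to keep the sketch dependency-free.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script>, <style> and <noscript> contents."""

    SKIPPED = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Hypothetical helper: reduce an HTML string to newline-separated text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def fetch_rendered_text(url, make_driver):
    """Hypothetical helper: load `url` in a caller-supplied browser driver.

    `make_driver` is any zero-argument callable returning an object with
    get(url), page_source and quit(), which is the subset of the Selenium
    WebDriver interface used in the example above.
    """
    driver = make_driver()
    try:
        driver.get(url)
        return html_to_text(driver.page_source)
    finally:
        # Always shut the browser down, even if loading the page failed.
        driver.quit()
```

With the Firefox setup from the example above, this would be called as `fetch_rendered_text(url, lambda: webdriver.Firefox(options=opts))`, and the resulting string could then be handed to autocards as plain text. Pages that load content after the initial render may still need an explicit wait before `page_source` is read; the sketch does not handle that.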
Yes, that's my conclusion as well. I think dynamic websites can be exported to PDF or just copied and pasted into autocards, so that's "fine" :/
Thanks for looking into this!