Support client-side rendered content

Open deklanw opened this issue 3 years ago • 3 comments

Many sites aren't rendered server-side, so they are unusable with consume_web. One example is every article on Khan Academy, such as https://www.khanacademy.org/humanities/world-history/medieval-times/cross-cultural-diffusion-of-knowledge/a/the-golden-age-of-islam

Integrating with Selenium, Splash, or a similar tool would be one way to fix this.

deklanw avatar Sep 03 '21 14:09 deklanw

Hi! Thanks for your interest in autocards.

I've contributed quite a lot to autocards through PRs (see, for example, the pending PR), but sadly I'm terrible at web design, so I will very probably not do this myself.

That said, if you can provide a clean way to get the text from a URL, I can integrate it into the codebase very quickly if you want.

Have a nice day!

thiswillbeyourgithub avatar Sep 03 '21 23:09 thiswillbeyourgithub

I looked into it and it seems that basically every solution either requires 1) integration with a web browser or 2) using a paid service (which probably uses 1 under the hood).

Here's one working example:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver import FirefoxOptions

# need Firefox installed, and the corresponding Firefox driver
# see https://selenium-python.readthedocs.io/installation.html#drivers
opts = FirefoxOptions()

# I'm using WSL, so I need this option
opts.add_argument("--headless")

url = "https://www.khanacademy.org/humanities/world-history/medieval-times/cross-cultural-diffusion-of-knowledge/a/the-golden-age-of-islam"

driver = webdriver.Firefox(options=opts)
driver.get(url)

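# name the parser explicitly; otherwise bs4 warns and picks whichever parser is installed
soup = BeautifulSoup(driver.page_source, "html.parser")

# quit() ends the whole browser session; close() only closes the current window
driver.quit()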

Unfortunately, this requires having Firefox installed and the corresponding web driver available on your PATH. There is also requests-html, which is supposed to be a drop-in replacement for requests. It supports 'rendering' the JS in the page, but it also seems to work by just downloading a Chromium instance the first time you call it. I'm getting an error with it anyway (maybe WSL-related).

All of this is to say that these methods are brittle, and trying to support them in the library itself would be a pain. Including instructions somewhere on how to do it might be useful, though.
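As a rough idea of what such instructions could contain, here is a minimal sketch that wraps the Selenium approach above in a function returning plain text from a URL. The names `html_to_text`, `render_with_firefox`, and `rendered_text` are hypothetical and not part of autocards. The text extraction uses only the standard library's `html.parser` instead of BeautifulSoup, and the browser step is passed in as a parameter so it can be swapped for another renderer. It assumes Firefox and its driver are installed, and it does not wait for content that loads after the initial page load, which some sites may need.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script>, <style>, and <noscript> bodies."""

    SKIPPED = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html):
    """Return the text nodes of an HTML string, one per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def render_with_firefox(url):
    """Return the HTML of `url` after the browser has loaded it.

    Assumes Firefox and its web driver are installed and on PATH.
    """
    # Imported lazily so html_to_text works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver import FirefoxOptions

    opts = FirefoxOptions()
    opts.add_argument("--headless")
    driver = webdriver.Firefox(options=opts)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Always shut the browser down, even if loading the page failed.
        driver.quit()


def rendered_text(url, render=render_with_firefox):
    """Fetch `url` with the given renderer and return its visible text."""
    return html_to_text(render(url))
```

The resulting string could then be handed to autocards the same way as pasted text. Keeping the renderer as a parameter means the browser dependency stays outside the library, in line with the point about brittleness.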

deklanw avatar Sep 06 '21 15:09 deklanw

Yes, that's my conclusion as well. I think dynamic websites can be exported to PDF, or their text can just be copied and pasted into autocards, so that's "fine" :/

Thanks for looking into this!

thiswillbeyourgithub avatar Sep 06 '21 16:09 thiswillbeyourgithub