yt-videos-list Before you continue to YouTube

Before you continue to YouTube - Cookie consent

Open milosb793 opened this issue 3 years ago • 7 comments

Hello there,

I'm facing an issue with the YouTube consent page and getting this message:

The page did not load elements! If you've scraped many channels within a short period of time, please try rerunning the program after waiting to make sure YouTube isn't throttling your IP address! For further debugging, this was the exact error message (might also be blank):
Message: 

Traceback (most recent call last):
  File "venv/lib/python3.8/site-packages/yt_videos_list/execute.py", line 142, in logic
    wait.until(EC.element_to_be_clickable((By.XPATH, '//yt-formatted-string[@class="style-scope ytd-channel-name"]')))
  File "venv/lib/python3.8/site-packages/selenium/webdriver/support/wait.py", line 80, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:

I'm running this piece of code, which works great on my local machine:

from yt_videos_list import ListCreator

my_driver = 'firefox'

lc = ListCreator(csv=True,
                 md=False,
                 txt=False,
                 headless=True,
                 driver=my_driver,
                 scroll_pause_time=1,
                 reverse_chronological=True)

print(lc.create_list_for("https://www.youtube.com/channel/<channel id>", True))

but on the server, it fails. After a lot of debugging, I found that the browser gets redirected to the "Before you continue to YouTube" page. I confirmed this by running the following code sample, which simulates the code from the create_list_for function:

import selenium
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support   import expected_conditions as EC
from selenium.webdriver.common.by import By

url = "https://www.youtube.com/channel/<channel id>/videos"

options = Options()
options.headless = True

driver = webdriver.Firefox(options=options)

driver.get(url)
driver.set_window_size(780, 800)
driver.set_window_position(0, 0)
wait = selenium.webdriver.support.ui.WebDriverWait(driver, 9)

print(driver.title)

wait.until(EC.element_to_be_clickable((By.XPATH, '//yt-formatted-string[@class="style-scope ytd-channel-name"]')))

print("Done")

and the output is Before you continue to YouTube, followed by the same error as above.

Is there any way to cover this case, or am I doing something wrong?

May 05 '21 00:05 milosb793

Hey milosb793, thanks for filing this issue!

Don't worry, you aren't doing anything wrong. 🙂 This is a new problem associated with YouTube's privacy compliant tracking rollout that requires users to indicate how they want to be tracked, and I'll provide some workarounds below on how to get the program running again. Also note, I'll make a future release (that'll probably incorporate the changes I suggest below) to enable the yt_videos_list program to handle the consent form automatically (or avoid it altogether if run with the user profile), so I'll add those changes when I get the chance to test everything properly.

A simple workaround would be to include a check like the following to see if YouTube is asking for cookie consent and accept the form if it does:

if 'consent.youtube.com' in driver.current_url:
    driver.find_element_by_xpath('//button[@aria-label="Agree to the use of cookies and other data for the purposes described"]').click()

before this wait.until line in dev/logic.py (and your sample code):

wait.until(EC.element_to_be_clickable((By.XPATH, '//yt-formatted-string[@class="style-scope ytd-channel-name"]')))
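For illustration, the check above could be wrapped in a small helper. This is only a sketch under assumptions: the names is_consent_redirect, accept_consent_if_needed, and AGREE_BUTTON_XPATH are made up for this example and are not part of yt_videos_list. It uses find_element(by, value), which exists in both Selenium 3 and Selenium 4, whereas find_element_by_xpath was removed in later Selenium 4 releases. The string "xpath" is the value of By.XPATH, so the helper itself needs no selenium import.

```python
from urllib.parse import urlparse

# XPath copied from the workaround above; YouTube may change this label at any time.
AGREE_BUTTON_XPATH = (
    '//button[@aria-label="Agree to the use of cookies '
    'and other data for the purposes described"]'
)


def is_consent_redirect(url):
    """Return True when the URL's host is YouTube's consent host."""
    return urlparse(url).hostname == "consent.youtube.com"


def accept_consent_if_needed(driver):
    """Click the 'agree' button only if the consent page is showing.

    `driver` is any object with a `current_url` attribute and a
    `find_element(by, value)` method, e.g. a Selenium WebDriver.
    Returns True if the button was clicked, False otherwise.
    """
    if not is_consent_redirect(driver.current_url):
        return False
    # "xpath" is the value of selenium.webdriver.common.by.By.XPATH
    driver.find_element("xpath", AGREE_BUTTON_XPATH).click()
    return True
```

Comparing the parsed hostname (rather than searching the whole URL for a substring) avoids a false positive when consent.youtube.com only appears inside a query parameter. You would call accept_consent_if_needed(driver) right before the wait.until line.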

To see if these changes work, do the following:

git clone git@github.com:slow-but-steady/yt-videos-list.git

cd yt-videos-list
cd python

# make the changes you want in dev/logic.py (more details below)
# make sure you're still in the yt-videos-list/python/ path, then

# run minifier.py to bundle code from yt-videos-list/python/dev/
# into the yt-videos-list/python/yt_videos_list directory
python3 minifier.py   # macOS/Linux
python  minifier.py   # Windows

# install the changes you made locally with
pip3 install .   # macOS/Linux
pip  install .   # Windows
# NOTE the dot after "install" is required!

When making changes, you might also need to add some sleep timers to wait for the page to load before/after agreeing to the cookies, so the code in logic.py might look something like the following after you make changes:

            driver.get(url)
            driver.set_window_size(780, 800)
            driver.set_window_position(0, 0)
            wait = selenium.webdriver.support.ui.WebDriverWait(driver, 9)
            try:
                # might need a sleep timer here to wait for the consent page to load
                # time.sleep(3)
                if 'consent.youtube.com' in driver.current_url: # THIS IS THE CHECK
                    driver.find_element_by_xpath('//button[@aria-label="Agree to the use of cookies and other data for the purposes described"]').click() # THIS ACCEPTS THE COOKIE CONSENT FORM
                wait.until(EC.element_to_be_clickable((By.XPATH, '//yt-formatted-string[@class="style-scope ytd-channel-name"]')))
            except selenium.common.exceptions.TimeoutException as error_message:
...
... # rest of code probably unchanged
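As an alternative to a fixed time.sleep(3), the wait could poll for a condition and continue as soon as it holds. The helper below is a minimal, library-independent sketch of that idea; wait_until is a hypothetical name, not something yt_videos_list or selenium provides (selenium's own WebDriverWait does similar polling internally).

```python
import time


def wait_until(predicate, timeout=9.0, poll_interval=0.5):
    """Call predicate() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without the condition being met.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)
```

For example, something like wait_until(lambda: driver.execute_script("return document.readyState") == "complete") could replace the commented-out sleep before the consent check, so fast page loads are not slowed down and slow ones are still given up to the full timeout.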

After you make the changes you want in /dev/logic.py, make sure to run minifier.py with python3 minifier.py (python minifier.py on Windows) and install the changes with pip3 install . (pip install . on Windows), then run yt_videos_list on a YouTube channel you want to scrape to see if the changes worked.

Another workaround you can use involves setting your user profile for the driver (firefox, opera, chrome, etc.) as mentioned in discussion #14 Problem with cookies.

Note that you shouldn't face the problem I described in this comment following commit d90c29f7b8d117a9a1f600219d286ca3240c8207. Doing what sirodus describes in this comment (I explained what the code there is doing in this comment from the thread above) should therefore be as simple as going to the configure_{SPECIFIC}driver() function for the driver you're working with (firefox, opera, chrome, etc.) in dev/logic.py and adding your personal user profile for the browser you're using.

Using the sample code you provided as an example, this would look something like:

import selenium
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support   import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.firefox_profile import FirefoxProfile   # NOTE this new import!

url = "https://www.youtube.com/channel/<channel id>/videos"

options = Options()
options.headless = True

# set FirefoxProfile using ONLY the line that matches your operating system
# (the alternatives are commented out so a later assignment doesn't overwrite yours)

# setting FirefoxProfile on Windows:
profile = FirefoxProfile('C:\\Users\\USERNAME\\AppData\\Roaming\\Mozilla\\Firefox\\Profiles\\CHARACTERS.EXTENSION')

# setting FirefoxProfile on macOS:
# profile = FirefoxProfile('/Users/USERNAME/Library/Application Support/Firefox/Profiles/CHARACTERS.EXTENSION')

# setting FirefoxProfile on Linux:
# profile = FirefoxProfile('/home/USERNAME/.mozilla/firefox/CHARACTERS.EXTENSION/')

### NOTE: you might have multiple profiles, so you'll need to check them ###
### individually to figure out which directory actually corresponds to the ###
### actual user profile for your browser - the most recently modified ###
### directory is probably the one, but this isn't guaranteed ###

# NOTE: the following does NOT launch selenium in headless mode,
# since figuring out whether the FirefoxProfile you set in profile is the
# right one is difficult to do when the browser is invisible :)
### also NOTE: launching the selenium driver with the user profile is kind of slow ###
driver = webdriver.Firefox(firefox_profile=profile)

# to launch in headless mode once you figure out the FirefoxProfile
# path, comment the line above and uncomment the line below:
# driver = webdriver.Firefox(firefox_profile=profile, options=options)

driver.get(url)
driver.set_window_size(780, 800)
driver.set_window_position(0, 0)
wait = selenium.webdriver.support.ui.WebDriverWait(driver, 9)

print(driver.title)

wait.until(EC.element_to_be_clickable((By.XPATH, '//yt-formatted-string[@class="style-scope ytd-channel-name"]')))

print("Done")
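To help with the "which profile directory is mine?" question from the note in the sample above, here is a small sketch that lists candidate profile directories with the most recently modified first. The function names are hypothetical, and the default locations are the usual per-OS paths mentioned in the sample; a non-standard install may store profiles elsewhere, so treat the result as a starting point rather than a guarantee.

```python
import sys
from pathlib import Path


def default_profiles_root(platform=sys.platform, home=None):
    """Best-guess location of the Firefox profiles folder for this OS."""
    home = Path(home) if home is not None else Path.home()
    if platform.startswith("win"):
        return home / "AppData" / "Roaming" / "Mozilla" / "Firefox" / "Profiles"
    if platform == "darwin":
        return home / "Library" / "Application Support" / "Firefox" / "Profiles"
    return home / ".mozilla" / "firefox"  # typical Linux location


def profiles_by_recency(root):
    """Return the directories under `root`, most recently modified first.

    Returns an empty list if `root` does not exist. On Linux the folder
    can also contain non-profile directories, so check the names.
    """
    root = Path(root)
    if not root.is_dir():
        return []
    dirs = [p for p in root.iterdir() if p.is_dir()]
    return sorted(dirs, key=lambda p: p.stat().st_mtime, reverse=True)
```

Printing profiles_by_recency(default_profiles_root()) gives a list of paths to try, in order, as the argument to FirefoxProfile in the test code.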

Once you verify the user profile works using some test code as above, you can try adding these changes to the configure_{SPECIFIC}driver() function in /dev/logic.py, run python3 minifier.py (python minifier.py on Windows) again, install the local changes with pip3 install . (pip install . on Windows), then run yt_videos_list on a channel you want to scrape. If you're using firefox, you would add these changes under the configure_firefoxdriver() function.

Also keep in mind that the exact changes needed to make selenium use your personal browser settings (via the user profile) instead of the empty profile it uses by default vary based on which driver/browser (firefox, opera, chrome, etc.) you use, so here are some references:

https://chromium.googlesource.com/chromium/src/+/master/docs/user_data_dir.md
https://www.guru99.com/firefox-profile-selenium-webdriver.html
https://stackoverflow.com/questions/45521012/how-to-start-firefox-with-with-specific-profile-selenium-python-geckodriver
https://stackoverflow.com/questions/50321278/how-to-load-firefox-profile-with-python-selenium
https://stackoverflow.com/questions/55130791/how-to-enable-built-in-vpn-in-operadriver (shows how to use the Opera user profile with webdriver.ChromeOptions() for webdriver.Opera())

If you have any questions or something doesn't work properly, please add to this thread below! 🙂 Also if you have any suggestions for any other additions/modifications, feel free to include that as well. One thing I can think of that sounds like a good idea would be to opt out of all cookies if the consent.youtube.com page comes up, but this might cause problems since agreeing to the cookies is easy, but opting out takes you to a different page where you need to click more options and then submit the form.

The issue probably wouldn't be with clicking the boxes, but rather with the timing (if the page to opt out of cookies takes a long time to load, or if redirecting to the channel after opting out takes a long time). Do you think it might be useful to add this (opt out of cookies) option to yt_videos_list as well?

May 05 '21 09:05 shailshouryya

Man, THANK YOU SO MUCH for all this effort! Really appreciate it!

I still haven't had enough time to test the given solutions, but I'll definitely post the feedback once I test it.

May 08 '21 19:05 milosb793

Added support for the program to block cookies/accept cookies in Release v0.5.7. You should be able to download these changes and run the program with the updated code using

pip3 install -U yt-videos-list   # macOS/Linux
pip  install -U yt-videos-list   # Windows

# run your yt-videos-list code as you normally do
# ListCreator is instantiated with cookie_consent=False by default (blocks cookies)
# so you shouldn't need to modify anything to get this functionality,
# but if you want to accept cookies, you'd need to add the cookie_consent argument to the instantiation:
# lc = ListCreator(cookie_consent=True)

Let me know if this works, or if you have any problems with anything.

I'll work on adding support to enable the program to use your user profile to allow selenium to run with your personal browser settings instead of the empty profile selenium uses by default next!

May 17 '21 01:05 shailshouryya

Hi slow-but-steady, I tested the feature with cookie_consent=True (I tested False too), but the consent page is still shown, and you must click the "I Agree" button manually.

lc = ListCreator(cookie_consent=True, driver=firefox, scroll_pause_time=0.8, headless=False, csv=False, md=False)

Thanks in advance

This is the error after the timeout (if the button is clicked all works fine):

===>ERROR!<===
The page did not load elements! If you've scraped many channels within a short period of time, please try rerunning the program after waiting to make sure YouTube isn't throttling your IP address! For further debugging, this was the exact error message (might also be blank):
Message:

Traceback (most recent call last):
  File "/home/pi/.local/lib/python3.7/site-packages/yt_videos_list/logic.py", line 126, in run_scraper
    wait.until(EC.element_to_be_clickable((By.XPATH, '//yt-formatted-string[@class="style-scope ytd-channel-name"]')))
  File "/usr/local/lib/python3.7/dist-packages/selenium/webdriver/support/wait.py", line 80, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:

May 17 '21 20:05 tfmotu

Hi slow-but-steady, I added the if statement that you showed previously around line 126 of the logic.py file, just before the wait.until call:

...snip
try:
    if 'consent.youtube.com' in driver.current_url: # THIS IS THE CHECK
        driver.find_element_by_xpath('//button[@aria-label="Agree to the use of cookies and other data for the purposes described"]').click() # THIS ACCEPTS THE COOKIE CONSENT FORM
    wait.until(EC.element_to_be_clickable((By.XPATH, '//yt-formatted-string[@class="style-scope ytd-channel-name"]')))
except selenium.common.exceptions.TimeoutException as error_message:
...snip

I tested it with headless set to both True and False. It seems to work OK. Thank you very much

May 17 '21 20:05 tfmotu

Hi asiergda,

Thanks so much for writing up the error and the workaround! I looked into the problem using the information you provided, and (hopefully) fixed the issues with Release 0.5.8. The linked release page references specific commits with a more comprehensive explanation of the problem and the fix, but here's a short summary of the relevant problems:

- The create_list_for() method of ListCreator passed cookie_consent as the last argument to logic.execute() in release 0.5.7, but the execute() function expected the last argument to be _execution_type.
  - Since the argument order passed into execute() was incorrect, the program did not correctly block or accept cookies using the cookie_consent boolean attribute as intended (see commit cd65c5c73d945db487743b4679f2997a6f1d06e4 for more details).
- Commit 0d4d2180712ee5439b23e930f980824ef5639c42 incorrectly returned the error message as a string instead of printing it (and also started printing a log message instead of logging it); see commit b41081485c3d599856f4431bcee01e6bb79146da for the fix. As a result, the traceback error you saw was not as descriptive as it should have been; i.e. you saw

===>ERROR!<===
The page did not load elements! If you've scraped many channels within a short period of time, please try rerunning the program after waiting to make sure YouTube isn't throttling your IP address! For further debugging, this was the exact error message (might also be blank):
Message:

Traceback (most recent call last):
  File "/home/pi/.local/lib/python3.7/site-packages/yt_videos_list/logic.py", line 126, in run_scraper
    wait.until(EC.element_to_be_clickable((By.XPATH, '//yt-formatted-string[@class="style-scope ytd-channel-name"]')))
  File "/usr/local/lib/python3.7/dist-packages/selenium/webdriver/support/wait.py", line 80, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:

instead of

### start of missing message ###
YouTube is redirecting to consent.youtube.com, but you entered an invalid argument for the cookie_consent instance attribute!
Please use cookie_consent=True or cookie_consent=False.
Your current setting is: cookie_consent={cookie_consent}     # this line also would have helped debug the cookie_consent/_execution_type argument mix up
### end of missing message ###

===>ERROR!<===
The page did not load elements! If you've scraped many channels within a short period of time, please try rerunning the program after waiting to make sure YouTube isn't throttling your IP address! For further debugging, this was the exact error message (might also be blank):
Message:

Traceback (most recent call last):
  File "/home/pi/.local/lib/python3.7/site-packages/yt_videos_list/logic.py", line 126, in run_scraper
    wait.until(EC.element_to_be_clickable((By.XPATH, '//yt-formatted-string[@class="style-scope ytd-channel-name"]')))
  File "/usr/local/lib/python3.7/dist-packages/selenium/webdriver/support/wait.py", line 80, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:
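
For anyone curious how an argument-order mix-up like the cookie_consent/_execution_type one can slip through silently, here is a generic sketch. The functions and the "module" value are hypothetical stand-ins for illustration only; they are not the actual signatures or values used in yt_videos_list.

```python
def execute_positional(url, cookie_consent, execution_type):
    """Hypothetical callee that relies on positional argument order."""
    return {
        "url": url,
        "cookie_consent": cookie_consent,
        "execution_type": execution_type,
    }


def execute_keyword_only(url, *, cookie_consent, execution_type):
    """Same callee, but everything after `*` must be passed by name."""
    return {
        "url": url,
        "cookie_consent": cookie_consent,
        "execution_type": execution_type,
    }


# The caller accidentally swaps the last two arguments. Python raises no
# error; cookie_consent just ends up holding the execution type.
swapped = execute_positional(
    "https://www.youtube.com/channel/<channel id>", "module", True
)

# With keyword-only parameters, the order at the call site no longer matters,
# and passing these values positionally would raise a TypeError instead.
correct = execute_keyword_only(
    "https://www.youtube.com/channel/<channel id>",
    execution_type="module",
    cookie_consent=True,
)
```

Making such parameters keyword-only is one common way to turn this kind of silent bug into an immediate error.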

Hopefully this fixed the problem, but let me know if it didn't!

Also, as mentioned earlier in the thread above, I'll eventually add an argument to allow the program to use your personal browser settings via the user profile instead of the empty profile selenium uses by default, so that should be available after I test and (properly 😅) verify everything works!

May 25 '21 03:05 shailshouryya

Hi slow-but-steady, All the kudos for your work, thank you very much.

May 28 '21 14:05 tfmotu
