dailyblink
Daily Blink Page Layout has changed - IndexError: list index out of range
The layout and URL of the free Daily page have changed.
New URL: https://www.blinkist.com/en/content/daily
The locator attribute values for BeautifulSoup have to be updated accordingly; the previous values are no longer valid and cause an IndexError:
    def _create_blink_info(response_text):
        soup = BeautifulSoup(response_text, "html.parser")
>       daily_book_href = soup.find_all("a", {"class": "daily-book__cta"})[0]["href"]
E       IndexError: list index out of range
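Until the locators are updated, a defensive lookup would at least fail with a clear message instead of a bare IndexError. A minimal sketch (get_daily_book_href is just an illustrative helper name, not part of dailyblink):

from bs4 import BeautifulSoup

def get_daily_book_href(response_text):
    soup = BeautifulSoup(response_text, "html.parser")
    # find() returns None instead of raising when the element is missing,
    # so we can report the layout change explicitly.
    cta = soup.find("a", class_="daily-book__cta")
    if cta is None:
        raise RuntimeError("'daily-book__cta' not found; the page layout has probably changed")
    return cta["href"]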
Confirmed, I've been having this since 22.05.2022 as well, because the last folder I have in my library is:
'2022-05-21 - Finde den Weg zu deiner inneren Mitte'/
root@banane:~# python3 -m dailyblink
dailyblink v1.2.1, Python 3.9.2, Linux armv7l 32bit ELF
Downloading the free daily Blinks on 2022-06-04 22:47:32...
Traceback (most recent call last):
File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/root/.local/lib/python3.9/site-packages/dailyblink/__main__.py", line 67, in <module>
main()
File "/root/.local/lib/python3.9/site-packages/dailyblink/__main__.py", line 63, in main
blinkist_scraper.download_daily_blinks(args.language, base_path)
File "/root/.local/lib/python3.9/site-packages/dailyblink/core.py", line 37, in download_daily_blinks
self._attempt_daily_blinks_download(languages, base_path)
File "/root/.local/lib/python3.9/site-packages/dailyblink/core.py", line 56, in _attempt_daily_blinks_download
self._download_daily_blinks(language_code, base_path)
File "/root/.local/lib/python3.9/site-packages/dailyblink/core.py", line 63, in _download_daily_blinks
blink_info = self._get_daily_blink_info(language=language_code)
File "/root/.local/lib/python3.9/site-packages/dailyblink/core.py", line 126, in _get_daily_blink_info
return _create_blink_info(response.text)
File "/root/.local/lib/python3.9/site-packages/dailyblink/core.py", line 171, in _create_blink_info
daily_book_href = soup.find_all("a", {"class": "daily-book__cta"})[0]["href"]
IndexError: list index out of range
root@banane:~#
Yep, same here. How can this be fixed?
I was able to retrieve audio and text content for the free daily by calling Blinkist's API the way the frontend does. I prefer this over BeautifulSoup because it's more direct and the new DOM lacks descriptive classes/IDs. However, I haven't integrated my approach with this codebase, and I'm not sure if it works the same for arbitrary books on Blinkist Premium. If anyone's interested, I'll post my code tomorrow. :)
> If anyone's interested, I'll post my code tomorrow. :)
Perfect, please let me know!
Here you go. :)
⚠️ Update: I've created a repo with updated code here
Again, I haven't tried other values for `User-Agent` yet, and I can't check whether this approach will work for Premium content.
import cloudscraper
from datetime import datetime
from pathlib import Path
import requests
from rich import print
from rich.progress import track

BASE_URL = 'https://www.blinkist.com/'
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:101.0) Gecko/20100101 Firefox/101.0',
    'x-requested-with': 'XMLHttpRequest',
}
LOCALES = ['en', 'de']
DOWNLOAD_DIR = Path.home() / 'Musik' / 'Blinkist'

scraper = cloudscraper.create_scraper()


def get_book_dir(book):
    return DOWNLOAD_DIR / f"{datetime.today().strftime('%Y-%m-%d')} – {book['slug']}"


def get_free_daily(locale):
    # see also: https://www.blinkist.com/en/content/daily
    response = scraper.get(
        BASE_URL + 'api/free_daily',
        params={'locale': locale},
    )
    return response.json()


def get_chapters(book_slug):
    url = f"{BASE_URL}/api/books/{book_slug}/chapters"
    response = requests.get(url, headers=HEADERS)
    response.raise_for_status()
    return response.json()['chapters']


def get_chapter(book_id, chapter_id):
    url = f"{BASE_URL}/api/books/{book_id}/chapters/{chapter_id}"
    response = requests.get(url, headers=HEADERS)
    response.raise_for_status()
    return response.json()


def download_chapter_audio(book, chapter_data):
    book_dir = get_book_dir(book)
    # parents=True so the download dir is created on a fresh machine, too
    book_dir.mkdir(parents=True, exist_ok=True)
    file_path = book_dir / f"chapter_{chapter_data['order_no']}.m4a"
    if file_path.exists():
        print(f"Skipping existing file: {file_path}")
        return
    assert 'm4a' in chapter_data['signed_audio_url']
    response = scraper.get(chapter_data['signed_audio_url'])
    assert response.status_code == 200
    file_path.write_bytes(response.content)
    print(f"Downloaded chapter {chapter_data['order_no']}")


for locale in LOCALES:
    free_daily = get_free_daily(locale=locale)
    book = free_daily['book']
    print(f"Today's free daily in {locale} is: “{book['title']}”")

    # list of chapters without their content
    chapter_list = get_chapters(book['slug'])

    # fetch chapter content
    chapters = [
        get_chapter(book['id'], chapter['id'])
        for chapter in track(chapter_list, description='Fetching chapters…')
    ]

    # download audio
    for chapter in track(chapters, description='Downloading audio…'):
        download_chapter_audio(book, chapter)

    # write markdown
    # excluded for brevity – just access chapter['text'] etc.
    # markdown_text = download_book_md(book, chapters)
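For completeness, the elided markdown step could look roughly like this. download_book_md is only the placeholder name from the comment above; the chapter 'title' field is an assumption, and chapter['text'] is likely HTML rather than plain markdown:

def download_book_md(book, chapters):
    # Hypothetical sketch: write all chapter texts into one markdown file.
    # 'title' on a chapter is an assumed field; 'text' is the field
    # mentioned in the comment above.
    lines = [f"# {book['title']}", ""]
    for chapter in chapters:
        title = chapter.get('title') or f"Chapter {chapter['order_no']}"
        lines.append(f"## {title}")
        lines.append("")
        lines.append(chapter['text'])
        lines.append("")
    markdown_text = "\n".join(lines)
    (get_book_dir(book) / 'book.md').write_text(markdown_text)
    return markdown_text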
@NicoWeio does your code work straight out of the box, or does it have to be integrated into core.py?
Would this approach work on a Windows machine?
> @NicoWeio does your code work straight out of the box, or does it have to be integrated into core.py?
See my earlier comment:
> However, I haven't integrated my approach with this codebase, and I'm not sure if it works the same for arbitrary books on Blinkist Premium.
Assuming you have `cloudscraper` installed, my script works out of the box, and it should download the audio just fine. However, it does not generate a text or cover image file, does not set the audio's metadata, and does not precisely follow `dailyblink`'s naming conventions.
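If you need the metadata, something like the following sketch with mutagen could be bolted onto download_chapter_audio. It is not part of my script, and the book's 'author' field name is an assumption:

from mutagen.mp4 import MP4

def tag_chapter_audio(file_path, book, chapter_data):
    # Sketch only: set basic iTunes-style MP4 atoms on the downloaded .m4a.
    # '\xa9nam' = title, '\xa9alb' = album, '\xa9ART' = artist.
    audio = MP4(str(file_path))
    audio['\xa9nam'] = [f"Chapter {chapter_data['order_no']}"]
    audio['\xa9alb'] = [book['title']]
    audio['\xa9ART'] = [book.get('author', 'Unknown')]  # 'author' field assumed
    audio.save()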
> Would this approach work on a Windows machine?
If `dailyblink` worked on Windows before, yes. That goes both for my approach using Blinkist's API and for the current approach using BeautifulSoup.
@ptrstn Is there a fix/update coming? You said by Sunday, and then you removed your answer.
> @ptrstn Is there a fix/update coming? You said by Sunday, and then you removed your answer.
This change requires some refactoring and a little more time than initially expected. I'll see what I can do. I can't guarantee when, though, since I've got other things in life to take care of first.
> This change requires some refactoring and a little more time than initially expected. I'll see what I can do. I can't guarantee when, though, since I've got other things in life to take care of first.
Sure, you're right about that.
Executing this code on Google Colab, I am getting a 403 Forbidden error on line 70 when calling get_chapters. After troubleshooting, I found that response.raise_for_status() is what raises the error, since the URL can't be accessed. How can I resolve this?
@NicoWeio
@rajeshbhavikatti I just published my code here, so we can keep this issue clean from further discussions. Notice the double slash in the URL? That might be the cause, although it didn't cause issues for me. Maybe because of a different requests version? Anyway, I fixed the double slashes in my code. Plus, I've added CI to my repo, and it works just fine there, too.
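To spell out the double-slash point for anyone following along (just a sketch; the actual fix lives in my repo):

BASE_URL = 'https://www.blinkist.com/'
book_slug = 'some-book'  # example value

# BASE_URL already ends with '/', so the extra slash in the f-string
# produced 'https://www.blinkist.com//api/books/...':
broken_url = f"{BASE_URL}/api/books/{book_slug}/chapters"

# Dropping that slash yields the intended URL:
fixed_url = f"{BASE_URL}api/books/{book_slug}/chapters"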
> This change requires some refactoring and a little more time than initially expected. I'll see what I can do. I can't guarantee when, though, since I've got other things in life to take care of first.
Hi Peter @ptrstn, do you have any updates on this?
> Hi Peter @ptrstn, do you have any updates on this?
I'll be able to work on it starting at the beginning of October, since I'm still busy with private matters.
> I'll be able to work on it starting at the beginning of October, since I'm still busy with private matters.
Any news for us?
Hi, I have made some updates based on this repo. Feel free to reach out to me about any changes or updates, and check out my notebook here.
@rajeshbhavikatti Nice work, but you don't fetch the mp3 files.
@Erik262 Yes, that's because the Notion API doesn't support it yet.