Unable to get date
While using the package and the Substack tools, I noticed that the date is either incorrect or cannot be found. I have not found a single post where the date is correct.
I tried debugging it but had no success. Any chance anyone has expertise here?
Could you give us the URL of the publication that has this problem? I'll fix it as soon as possible. Thanks! @greenforestpath
@Firevvork I'm facing the same error with all posts when scraping two separate premium publications I'm subscribed to. All the md/html files have the "date not found" error.
@caliammaps Could you give me the URL of a post that has the problem? Thanks!
Sent you an email 🙏
Change `extract_post_data` slightly so that it falls back to the JSON-LD metadata, and it should work:
def extract_post_data(self, soup: BeautifulSoup) -> Tuple[str, str, str, str, str]:
    """
    Converts a Substack post soup to markdown; returns metadata and content.
    """
    import json
    from datetime import datetime

    # Extract title (when a video is present, the title is demoted to h2)
    title = soup.select_one("h1.post-title, h2").text.strip()

    # Extract subtitle
    subtitle_element = soup.select_one("h3.subtitle")
    subtitle = subtitle_element.text.strip() if subtitle_element else ""

    # Extract date from the rendered byline first. These class names are
    # generated by Substack, so this selector may stop matching at any time.
    date_element = soup.select_one("div.pencraft.pc-reset.color-pub-secondary-text-hGQ02T.line-height-20-t4M0El.font-meta-MWBumP.size-11-NuY2Zx.weight-medium-fw81nC.transform-uppercase-yKDgcq.reset-IxiVJZ.meta-EgzBVA")
    date = date_element.text.strip() if date_element else ""

    if not date:
        # Fall back to the JSON-LD metadata embedded in the page
        script_tag = soup.find("script", {"type": "application/ld+json"})
        if script_tag and script_tag.string:
            try:
                metadata = json.loads(script_tag.string)
                if "datePublished" in metadata:
                    date_str = metadata["datePublished"]
                    date_obj = datetime.fromisoformat(date_str.replace("Z", "+00:00"))
                    date = date_obj.strftime("%b %d, %Y")
            except (json.JSONDecodeError, ValueError, KeyError, AttributeError):
                pass

    # Covers every failure path: no script tag, no datePublished, or a parse error
    if not date:
        date = "Date not found"

    # Extract like count
    like_count_element = soup.select_one("a.post-ufi-button .label")
    like_count = (
        like_count_element.text.strip()
        if like_count_element and like_count_element.text.strip().isdigit()
        else "0"
    )

    # Extract and convert content
    content = str(soup.select_one("div.available-content"))
    md = self.html_to_md(content)
    md_content = self.combine_metadata_and_content(title, subtitle, date, like_count, md)
    return title, subtitle, like_count, date, md_content
Had the same problem. The updated function by @LuqianSun worked great. Maybe open a PR for that?
The div containing the publish date uses dynamically generated class names that change frequently on Substack, which makes them unreliable for scraping. I haven't tested the JSON-LD fallback myself. Thanks @LuqianSun for implementing it, and let's hope it works so we don't need to update the div selector every time Substack changes it.
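For anyone who wants to check the JSON-LD route on its own before relying on it, here is a minimal standalone sketch. It is not the package's code: `date_from_json_ld` and `_LdJsonCollector` are hypothetical names, and it uses only the standard library instead of BeautifulSoup so it can be run without the scraper. It assumes the page embeds a `<script type="application/ld+json">` tag with an ISO 8601 `datePublished` field, as the function above does.

```python
import json
from datetime import datetime
from html.parser import HTMLParser
from typing import List, Optional


class _LdJsonCollector(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self._in_ld_json = False
        self.blocks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_ld_json = True
            self.blocks.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_ld_json = False

    def handle_data(self, data):
        # Script content may arrive in several chunks, so append rather than assign
        if self._in_ld_json:
            self.blocks[-1] += data


def date_from_json_ld(html: str) -> Optional[str]:
    """Return datePublished formatted like 'Jan 02, 2024', or None if unavailable."""
    collector = _LdJsonCollector()
    collector.feed(html)
    for block in collector.blocks:
        try:
            metadata = json.loads(block)
        except json.JSONDecodeError:
            continue
        # JSON-LD may be a single object or a list of objects
        candidates = metadata if isinstance(metadata, list) else [metadata]
        for item in candidates:
            if not isinstance(item, dict):
                continue
            published = item.get("datePublished")
            if not isinstance(published, str):
                continue
            try:
                parsed = datetime.fromisoformat(published.replace("Z", "+00:00"))
            except ValueError:
                continue
            return parsed.strftime("%b %d, %Y")
    return None
```

Saving a post's HTML to a file and passing its contents to `date_from_json_ld` should show whether the metadata is present for a given publication; a return value of `None` means the fallback would also end in "Date not found" for that post.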