Unable to get date

Open greenforestpath opened this issue 1 year ago • 6 comments

While using the package and Substack tools, I noticed that the date is incorrect or cannot be found. I have not found a single post where the date is correct.

I tried debugging it but had no success. Any chance anyone has expertise here?

greenforestpath avatar Nov 06 '24 00:11 greenforestpath

Could you give us the URL of the publication that has this problem? I'll fix it as soon as possible. Thanks! @greenforestpath

Firevvork avatar Nov 07 '24 08:11 Firevvork

@Firevvork I'm facing the same error with all posts when scraping 2 separate premium publications I'm subscribed to. All the md/html files have the "date not found" error.

caliammaps avatar Dec 22 '24 14:12 caliammaps

@caliammaps Could you give me the URL of a post that has the problem? Thanks!

Firevvork avatar Dec 23 '24 16:12 Firevvork

> @caliammaps Could you give me the URL of a post that has the problem? Thanks!

Sent you an email 🙏

caliammaps avatar Dec 25 '24 09:12 caliammaps

Slightly change the code as follows, and it should work.

def extract_post_data(self, soup: BeautifulSoup) -> Tuple[str, str, str, str, str]:
        """
        Converts substack post soup to markdown, returns metadata and content
        """
        # Extract title
        title = soup.select_one("h1.post-title, h2").text.strip()  # When a video is present, the title is demoted to h2

        # Extract subtitle
        subtitle_element = soup.select_one("h3.subtitle")
        subtitle = subtitle_element.text.strip() if subtitle_element else ""

        # Extract date (the hashed class names in this selector are generated by Substack and may change)
        date_element = soup.select_one("div.pencraft.pc-reset.color-pub-secondary-text-hGQ02T.line-height-20-t4M0El.font-meta-MWBumP.size-11-NuY2Zx.weight-medium-fw81nC.transform-uppercase-yKDgcq.reset-IxiVJZ.meta-EgzBVA")
        date = date_element.text.strip() if date_element else ""
        if not date:
            # Fall back to the JSON-LD metadata embedded in the page
            import json
            from datetime import datetime
            script_tag = soup.find('script', {'type': 'application/ld+json'})
            if script_tag and script_tag.string:
                try:
                    metadata = json.loads(script_tag.string)
                    if isinstance(metadata, dict) and 'datePublished' in metadata:
                        date_str = metadata['datePublished']
                        date_obj = datetime.fromisoformat(date_str.replace('Z', '+00:00'))
                        date = date_obj.strftime('%b %d, %Y')
                except (json.JSONDecodeError, ValueError, AttributeError):
                    pass  # Malformed metadata; handled by the check below
        if not date:
            # Covers every failure path: no matching div, no script tag, or no usable datePublished
            date = "Date not found"
        
        # Extract like count
        like_count_element = soup.select_one("a.post-ufi-button .label")
        like_count = (
            like_count_element.text.strip()
            if like_count_element and like_count_element.text.strip().isdigit()
            else "0"
        )

        # Extract and convert content
        content = str(soup.select_one("div.available-content"))
        md = self.html_to_md(content)
        md_content = self.combine_metadata_and_content(title, subtitle, date, like_count, md)
        
        return title, subtitle, like_count, date, md_content

LuqianSun avatar Feb 04 '25 22:02 LuqianSun
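
The JSON-LD fallback in the function above can also be sketched as a standalone helper, which makes it easier to test without fetching a page. This is a minimal sketch, not code from the project: `date_from_json_ld` is a hypothetical name, and it assumes the script body is either a single JSON object or a list of objects, one of which carries `datePublished` in ISO 8601 form.

```python
import json
from datetime import datetime


def date_from_json_ld(raw: str, fallback: str = "Date not found") -> str:
    """Return datePublished from a JSON-LD string, formatted like 'Nov 06, 2024'.

    Hypothetical helper: `raw` is the text content of a
    <script type="application/ld+json"> tag.
    """
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return fallback

    # JSON-LD may be one object or a list of objects; normalise to a list.
    candidates = data if isinstance(data, list) else [data]
    for item in candidates:
        if not isinstance(item, dict):
            continue
        value = item.get("datePublished")
        if not isinstance(value, str):
            continue
        try:
            # Older Python versions reject a trailing 'Z' in fromisoformat.
            parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
        except ValueError:
            continue
        return parsed.strftime("%b %d, %Y")
    return fallback
```

Inside `extract_post_data`, a helper like this could be called with `script_tag.string` whenever the CSS selector returns nothing, so every failure path ends in the same fallback string.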

Had the same problem. The updated function by @LuqianSun worked great. Maybe open a PR for that?

foxblock avatar Mar 15 '25 20:03 foxblock

The div containing the publish date uses dynamically generated class names that change frequently on Substack, making them unreliable for scraping. I haven't tested the JSON-LD fallback myself. Thanks @LuqianSun for implementing it; let's hope it works, so the selector doesn't need updating every time Substack changes the div.

timf34 avatar Sep 24 '25 19:09 timf34
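
One way to make class-based matching less brittle, if the JSON-LD route ever fails, is to compare only the readable part of each class name and ignore the generated suffix. The sketch below is untested against live Substack pages and rests on two assumptions: that generated names look like `<readable-prefix>-<6-character hash>` (as in `meta-EgzBVA` from the selector quoted earlier in this thread), and that the readable prefixes stay the same when the hashes change. The function names and the choice of required prefixes are hypothetical.

```python
import re

# Assumption: the generated part is a 6-character alphanumeric suffix,
# as in "meta-EgzBVA". Ordinary 6-letter endings such as "-button"
# would be stripped as well, which is a known limitation of this sketch.
_HASH_SUFFIX = re.compile(r"-[A-Za-z0-9_]{6}$")


def stable_prefixes(class_attr: str) -> set:
    """Strip the generated suffix from each class token in a class attribute."""
    return {_HASH_SUFFIX.sub("", token) for token in class_attr.split()}


def looks_like_date_meta(class_attr: str) -> bool:
    """True if the class list carries the readable prefixes of the date div."""
    # Hypothetical choice of prefixes, taken from the selector in this thread.
    required = {"meta", "transform-uppercase", "color-pub-secondary-text"}
    return required <= stable_prefixes(class_attr)
```

With BeautifulSoup, the predicate could be applied to `" ".join(tag.get("class", []))` for each `div` in the page, keeping the first match as the date element.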