Substack2Markdown
Add option to download images
I'm using this tool to mirror some of my Substack posts to my website, and as part of that process I'd really like to host my own images instead of having them link to the Substack CDN!
In case this will help someone else, here's a PR 🙂
Here's a list of some tweaks I made to get that to happen:
- Add an `--images` flag that will download images for all posts being scraped into a `substack_images/` folder
- Add an option to download a single post (by passing in a `--url` in the format `https://example.substack.com/p/postname`)
- When downloading images, Substack nests them like `[![](/path)](/path)`. Change these to just be `![](/path)` so clicking on the images doesn't link to itself.
- Add some tests, to prove to myself this code works the way I expect it to
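
To make the image un-nesting concrete, here's a minimal sketch of the idea using a regular expression. It's not the code from this PR: the names `NESTED_IMAGE` and `unnest_images` are made up for illustration, and it assumes the URLs contain no `)` and the alt text contains no `]`.

```python
import re

# Matches a Markdown image wrapped in a link: [![alt](img_url)](link_url)
# Assumption (illustrative only): URLs contain no ")" and alt text contains no "]".
NESTED_IMAGE = re.compile(r"\[(!\[[^\]]*\]\([^)]*\))\]\([^)]*\)")


def unnest_images(markdown: str) -> str:
    """Replace [![alt](img)](link) with just ![alt](img)."""
    # Group 1 is the inner image; the outer link wrapper is dropped.
    return NESTED_IMAGE.sub(r"\1", markdown)


sample = "Intro [![](/path)](/path) and [a normal link](https://example.com)."
print(unnest_images(sample))
```

Ordinary links are left alone because the pattern only matches when the link text itself starts with `![`.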
As a bonus, the progress bars reflect image downloads (since they can take a while)! As an example:
Scraping posts: 100%|██████████| 2/2 [00:30<00:00, 15.00s/post]
Downloading images for test-post: 100%|██████████| 7/7 [00:14<00:00, 2.00s/image]
Downloading images for another-post: 100%|██████████| 4/4 [00:08<00:00, 2.00s/image]