snscrape
Scrape images, video, and post forwarding information for Telegram
A small enhancement that adds some additional information from Telegram channel posts.
Makes sense to me. I don't have a timeline for when we'd be able to make those changes -- there's a few high priority things happening right now -- but we've been using our fork for a while and I wanted to open a PR to remember to merge it upstream at some point.
I implemented the requested changes:
- Made attachment handling similar to Twitter's: dataclasses for Image, Video, and Gif.
- Added capability to scrape multiple Videos from a single message
- Added attribute for the full forwarded URL and made the forwarded attribute have type Channel
- Added capability to scrape number of views for messages
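For readers skimming the thread, here is a rough sketch of what the attachment dataclasses described above could look like. It is not the PR's actual code: every field name (`thumbnailUrl`, `forwardedUrl`, `media`, and so on) and the `video_count` helper are assumptions made up for illustration.

```python
import dataclasses
import typing


# All class layouts and field names below are hypothetical, not taken from the PR.
@dataclasses.dataclass
class Image:
    url: str


@dataclasses.dataclass
class Video:
    thumbnailUrl: str
    url: typing.Optional[str] = None


@dataclasses.dataclass
class Gif:
    thumbnailUrl: str
    url: typing.Optional[str] = None


@dataclasses.dataclass
class Channel:
    username: str


@dataclasses.dataclass
class Post:
    url: str
    # None (rather than an empty list) when the post has no attachments.
    media: typing.Optional[typing.List[typing.Union[Image, Video, Gif]]] = None
    forwarded: typing.Optional[Channel] = None
    forwardedUrl: typing.Optional[str] = None
    views: typing.Optional[int] = None


def video_count(post: Post) -> int:
    """Count the Video attachments on a post (hypothetical helper)."""
    return sum(isinstance(item, Video) for item in post.media or [])
```

With a layout like this, a message carrying several videos is just a `media` list with more than one `Video` entry, and `video_count(post) > 0` answers "does this post contain a video".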
Additional changes:
- Telegram seems to have changed their interface somehow such that the `tme_messages_more` `data-before` tag often doesn't appear on some pages. To deal with this, I added a default that decrements the `before` query parameter by 20. This requires a few additional changes to handle edge cases:
  - If the querystring doesn't contain the `before` parameter, get the canonical url tag in the page
  - Added a termination condition: if the first `tgme_widget_message_date` has an href to the first post (t.me/CHANNEL/1), terminate the scraping loop
- Moved attachment extraction out of the `if (message := post.find('div', class_ = 'tgme_widget_message_text')):` clause, since some attachments are in messages without text, so they weren't being added to the media list
- I also added a responseOkCallback function to retry the request if we get a 5xx response.
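The pagination fallback described above can be sketched roughly as follows. This is a standalone illustration, not the code in the PR: the function names are hypothetical, it assumes the canonical URL carries a `before` parameter, the guard against `before` dropping below 1 is an extra safety check not mentioned in the description, and the `(ok, message)` return shape of the callback is an assumption about what the caller expects.

```python
import urllib.parse

PAGE_STEP = 20  # decrement used when the page offers no data-before value


def _before_from(url):
    """Return the integer `before` query parameter of a URL, or None."""
    query = urllib.parse.parse_qs(urllib.parse.urlsplit(url).query)
    if 'before' not in query:
        return None
    return int(query['before'][0])


def next_page_url(current_url, canonical_url=None):
    """Build the URL of the next (older) page, or return None if it can't be derived."""
    before = _before_from(current_url)
    if before is None and canonical_url is not None:
        # Fall back to the canonical URL found in the page (assumed to have `before`).
        before = _before_from(canonical_url)
    if before is None:
        return None
    next_before = before - PAGE_STEP
    if next_before < 1:
        # Extra guard, not from the PR description: never request a non-positive offset.
        return None
    parts = urllib.parse.urlsplit(current_url)
    query = urllib.parse.parse_qs(parts.query)
    query['before'] = [str(next_before)]
    return urllib.parse.urlunsplit(parts._replace(query=urllib.parse.urlencode(query, doseq=True)))


def is_first_post(href):
    """True if a message link points at post 1 of a channel (t.me/CHANNEL/1)."""
    segments = [s for s in urllib.parse.urlsplit(href).path.split('/') if s]
    return len(segments) >= 2 and segments[-1] == '1'


def retry_on_server_error(response):
    """Hypothetical callback: report failure on any 5xx status so the request is retried."""
    if 500 <= response.status_code < 600:
        return False, f'server error {response.status_code}'
    return True, None
```

In a scraping loop, `is_first_post` would be checked against the href of the first `tgme_widget_message_date` on each page, and the loop would stop once it returns `True` or `next_page_url` returns `None`.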
Hm, should this be rebased? 25 commits is a lot, but I'm not sure on @JustAnotherArchivist's policy on that.
Pasting something from the PR to the fork that I think is relevant:
I got frustrated with the slowness of the scraping so I changed the forwarding Channel method by modifying the Channel definition so that it only requires the username, rather than retrieving the full forwarded channel information for every forwarded message.
The changes sound good so far, though I haven't reviewed the code thoroughly yet. Some quick comments on things I noticed at a glance:
- I don't mind the number of commits. The merges make the history slightly messy, but that's alright.
- The 'thin' Channel change is fine; the Twitter module does that as well, only including data that is already available e.g. for replied-to users.
- The functions at the bottom should be prefixed with an underscore to mark them as private API.
- `views` attribute: `parse_num` returns an `IntWithGranularity`, not an `int`.
- `outlinks`, `mentions`, etc. should be `None` if there aren't any, not an empty list. Related to that: `typing.Optional` is missing on a couple in the class definition.
- The changes to the VK module should be a separate PR.
- Do you have an example of a channel page that often lacks the `before=` link? I haven't noticed this before.
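To make the `views` and `None` review comments above concrete, here is a small self-contained sketch. The `IntWithGranularity` class below is a simplified stand-in written for this example, not the library's own definition, and `parse_count` and `none_if_empty` are hypothetical helpers (in particular, `parse_count` is not snscrape's `parse_num`). It assumes view counts look like `42`, `1.2K`, or `3M`.

```python
import typing


class IntWithGranularity(int):
    """Simplified stand-in: an int that remembers how coarsely it was counted."""

    def __new__(cls, value, granularity):
        obj = super().__new__(cls, value)
        obj.granularity = granularity
        return obj


_SUFFIXES = {'K': 1000, 'M': 1000000}


def parse_count(text: str) -> IntWithGranularity:
    """Parse an abbreviated count such as '1.2K' (hypothetical helper)."""
    text = text.strip()
    multiplier = 1
    if text and text[-1].upper() in _SUFFIXES:
        multiplier = _SUFFIXES[text[-1].upper()]
        text = text[:-1]
    if '.' in text:
        whole, frac = text.split('.', 1)
        scale = 10 ** len(frac)
        if multiplier % scale != 0:
            raise ValueError('more fractional digits than the suffix allows')
        # Each fractional digit makes the count ten times finer.
        granularity = multiplier // scale
        value = int(whole or '0') * multiplier + int(frac) * granularity
    else:
        granularity = multiplier
        value = int(text) * multiplier
    return IntWithGranularity(value, granularity)


def none_if_empty(items: typing.Optional[list]) -> typing.Optional[list]:
    """Return None instead of an empty list, so 'nothing found' is always None."""
    return items or None
```

The point of the first comment is that a count shown as `1.2K` is only known to the nearest 100, so annotating the attribute as a plain `int` would hide that. The second helper shows the convention for list attributes, which then need `typing.Optional[...]` in the class definition.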
This is an example of a channel page with no `tme_messages_more` `data-before` attribute: https://t.me/s/proudboysusa?before=8033
I only started noticing such pages after I had started working on this fork, so maybe Telegram changed something in their web interface in the last few months.
Incorporated your changes, let me know if there are other issues you'd like me to address
@JustAnotherArchivist Any additional changes you want us to make? We've been using this quite a bit and would love to see it get merged.
Also, while testing this and looking around for odd cases, I discovered that Telegram supports 'round videos'. Example: https://t.me/s/memes/9641 Support for those doesn't need to be part of this PR, but I thought it'd be appropriate to mention it here in case you do want to handle it. Else I'll add it after this is merged.
Hello. I'd love to get this PR merged. Is there anything I can do to help?
Please finish and merge this; it's quite a useful feature. If there's anything I can do to help, I'd be more than happy to.
Just so you know: I've created a Python library called tchan that scrapes Telegram public channels and does not have the current problems snscrape has regarding this PR (it's still missing some features, like scraping polls).
What I need (and what I'm pretty sure this PR provides) is an easy way to check if a post contains one or more videos.
Hey, I recently did a Bellingcat workshop which used this fork -- I'd love to close the gap and get it merged. I'll try to cut something soon, @JustAnotherArchivist, and will let you know if I have any questions on how you'd like it!
I wrapped up my changes to all your comments, @JustAnotherArchivist !
Asked @.loganwilliams for a review over there, but if you'd like to preemptively call out any issues with my changes, I'd love to get em fixed ahead of time and only do one merge process 👍