Scrape images, video, and post forwarding information for Telegram

Open loganwilliams opened this issue 3 years ago • 15 comments

A small enhancement that adds some additional information from Telegram channel posts.

loganwilliams avatar Feb 24 '22 14:02 loganwilliams

Makes sense to me. I don't have a timeline for when we'd be able to make those changes -- there's a few high priority things happening right now -- but we've been using our fork for a while and I wanted to open a PR to remember to merge it upstream at some point.

loganwilliams avatar Mar 09 '22 07:03 loganwilliams

I implemented the requested changes:

  • Made attachment handling similar to Twitter's: dataclasses for Image, Video, and Gif.
  • Added capability to scrape multiple Videos from a single message
  • Added attribute for the full forwarded URL and made the forwarded attribute have type Channel
  • Added capability to scrape number of views for messages
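
As a rough illustration of the attachment model described in this list, here is a minimal sketch. It is not the code from the PR: the class layout, the field names (`thumbnailUrl`, `forwardedUrl`, `media`), and the plain `int` for `views` are all assumptions made for the example.

```python
import dataclasses
import typing


# Hypothetical attachment types, one dataclass per kind of media.
@dataclasses.dataclass
class Image:
    url: str


@dataclasses.dataclass
class Video:
    url: str
    thumbnailUrl: typing.Optional[str] = None


@dataclasses.dataclass
class Gif:
    url: str
    thumbnailUrl: typing.Optional[str] = None


# A 'thin' channel reference that only needs the username.
@dataclasses.dataclass
class Channel:
    username: str


Medium = typing.Union[Image, Video, Gif]


# Hypothetical post item; the real item class has more fields.
@dataclasses.dataclass
class TelegramPost:
    url: str
    media: typing.Optional[typing.List[Medium]] = None
    forwarded: typing.Optional[Channel] = None
    forwardedUrl: typing.Optional[str] = None
    views: typing.Optional[int] = None  # simplification: plain int


# A single message carrying two videos, forwarded from another channel.
post = TelegramPost(
    url='https://t.me/s/example/3',
    media=[Video(url='https://example.org/a.mp4'), Video(url='https://example.org/b.mp4')],
    forwarded=Channel(username='otherchannel'),
    forwardedUrl='https://t.me/otherchannel/7',
    views=1200,
)
```

Keeping each attachment kind as its own type lets callers filter with `isinstance` instead of inspecting URLs.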

Additional changes:

  • Telegram seems to have changed their interface so that the tme_messages_more data-before attribute is missing on some pages. To deal with this, I added a fallback that decrements the before query parameter by 20. This requires a few additional changes to handle edge cases:
    • If the querystring doesn't contain the before parameter, get the canonical url tag in the page
    • Added a termination condition: if the first tgme_widget_message_date has an href to the first post (t.me/CHANNEL/1), terminate the scraping loop
  • Moved attachment extraction out of if (message := post.find('div', class_ = 'tgme_widget_message_text')): clause, since some attachments are in messages without text, so they weren't being added to the media list
  • I also added a responseOkCallback function to retry the request if we get a 5xx response.
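
The pagination fallback and termination condition above could look roughly like the sketch below. The function names, the `PAGE_STEP` constant, and the assumption that the canonical URL carries a `before` parameter are illustrative only; the PR's actual code may differ.

```python
import typing
import urllib.parse

PAGE_STEP = 20  # the decrement mentioned in the comment above


def _before_param(url: str) -> typing.Optional[int]:
    """Return the integer value of the 'before' query parameter, if present."""
    values = urllib.parse.parse_qs(urllib.parse.urlsplit(url).query).get('before')
    return int(values[0]) if values else None


def next_before(page_url: str, canonical_url: typing.Optional[str] = None) -> typing.Optional[int]:
    """Compute the next 'before' value when the page offers no explicit link.

    Falls back to the page's canonical URL (assumed to carry 'before')
    when the requested URL has no such parameter.
    """
    before = _before_param(page_url)
    if before is None and canonical_url is not None:
        before = _before_param(canonical_url)
    if before is None:
        return None
    return before - PAGE_STEP


def reached_first_post(first_date_href: str, channel: str) -> bool:
    """True if the first date link on the page points at t.me/CHANNEL/1."""
    path = urllib.parse.urlsplit(first_date_href).path.rstrip('/')
    return path.lower() == f'/{channel}/1'.lower()
```

For example, `next_before('https://t.me/s/example?before=8033')` yields 8013, and the loop would stop once `reached_first_post` returns true for the first date link on the page.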

trislee avatar May 25 '22 06:05 trislee

Hm, should this be rebased? 25 commits is a lot, but I'm not sure on @JustAnotherArchivist's policy on that.

TheTechRobo avatar May 25 '22 12:05 TheTechRobo

Pasting something from the PR to the fork that I think is relevant:

I got frustrated with the slowness of the scraping so I changed the forwarding Channel method by modifying the Channel definition so that it only requires the username, rather than retrieving the full forwarded channel information for every forwarded message.

TheTechRobo avatar May 25 '22 12:05 TheTechRobo

The changes sound good so far, though I haven't reviewed the code thoroughly yet. Some quick comments on things I noticed at a glance:

  • I don't mind the number of commits. The merges make the history slightly messy, but that's alright.
  • The 'thin' Channel change is fine; the Twitter module does that as well, only including data that is already available e.g. for replied-to users.
  • The functions at the bottom should be prefixed with an underscore to mark them as private API.
  • views attribute: parse_num returns an IntWithGranularity, not an int.
  • outlinks, mentions, etc. should be None if there aren't any, not an empty list. Related to that: typing.Optional is missing on a couple in the class definition.
  • The changes to the VK module should be a separate PR.
  • Do you have an example of a channel page that often lacks the before= link? I haven't noticed this before.

JustAnotherArchivist avatar May 29 '22 07:05 JustAnotherArchivist

This is an example of a channel page with no tme_messages_more data-before attribute: https://t.me/s/proudboysusa?before=8033 I only started noticing such pages after I had started working on this fork, so maybe Telegram changed something in their web interface in the last few months.

trislee avatar Jun 23 '22 20:06 trislee

Incorporated your changes, let me know if there are other issues you'd like me to address

trislee avatar Jun 23 '22 20:06 trislee

@JustAnotherArchivist Any additional changes you want us to make? We've been using this quite a bit and would love to see it get merged.

trislee avatar Dec 02 '22 13:12 trislee

Also, while testing this and looking around for odd cases, I discovered that Telegram supports 'round videos'. Example: https://t.me/s/memes/9641 Support for those doesn't need to be part of this PR, but I thought it'd be appropriate to mention it here in case you do want to handle it. Else I'll add it after this is merged.

JustAnotherArchivist avatar Dec 20 '22 00:12 JustAnotherArchivist

Hello. I'd love to get this PR merged. Is there anything I can do to help?

turicas avatar Feb 06 '23 01:02 turicas

Please finish and merge this; it's quite a useful feature. If there's anything I can do to help, I'd be more than happy to.

Demmenie avatar Jul 10 '23 21:07 Demmenie

Just so you know: I've created a Python library called tchan that scrapes Telegram public channels and does not have the problems snscrape currently has regarding this PR (it's still missing some features, like scraping polls).

turicas avatar Jul 10 '23 22:07 turicas

What I need (and what I'm pretty sure this PR provides) is an easy way to check if a post contains one or more videos.
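
Assuming attachments are exposed as typed objects in a media list (as described earlier in the thread), such a check might be sketched like this. The `Video` and `Image` classes here are stand-ins defined for the example, not imports from snscrape.

```python
import dataclasses
import typing


# Stand-in attachment types for the example only.
@dataclasses.dataclass
class Image:
    url: str


@dataclasses.dataclass
class Video:
    url: str


def count_videos(media: typing.Optional[typing.Sequence[object]]) -> int:
    """Count Video attachments; a missing media list counts as zero."""
    return sum(isinstance(m, Video) for m in (media or []))


def has_video(media: typing.Optional[typing.Sequence[object]]) -> bool:
    """True if the post carries at least one video."""
    return count_videos(media) > 0
```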

Demmenie avatar Jul 11 '23 18:07 Demmenie

Hey, I recently did a Bellingcat workshop which used this fork -- I'd love to close the gap and get it merged. I'll try to cut something soon, @JustAnotherArchivist, and will let you know if I have any questions on how you'd like it!

john-osullivan avatar Dec 31 '23 00:12 john-osullivan

I wrapped up my changes to all your comments, @JustAnotherArchivist !

Asked @loganwilliams for a review over there, but if you'd like to preemptively call out any issues with my changes, I'd love to get em fixed ahead of time and only do one merge process 👍

john-osullivan avatar Feb 22 '24 06:02 john-osullivan