
Content of tweet includes non written mentions

Open enzoferey opened this issue 2 years ago • 4 comments

Describe the bug

When scraping the following tweet, the content returned starts with "@GitHubCopilot @tabnine @Replit @vercel Have you tried them ?" instead of just "Have you tried them ?" as expected.

How to reproduce

Use the TwitterTweetScraper and pass the tweet id 1674020720458776576.

Expected behaviour

There should be no non-written mentions at the beginning of the content.

Screenshots and recordings

No response

Operating system

macOS 13.4.1

Python version: output of python3 --version

3.9

snscrape version: output of snscrape --version

0.7.0.20230622

Scraper

TwitterTweetScraper

How are you using snscrape?

Module (import snscrape.modules.something in Python code)

Backtrace

No response

Log output

No response

Dump of locals

No response

Additional context

No response

enzoferey avatar Jun 30 '23 02:06 enzoferey

These mentions are technically part of the tweet text. This is exactly what Twitter returns:

...['tweet_results']['result']['legacy']['full_text'] = '@GitHubCopilot @tabnine @Replit @vercel Have you tried them ? What’s your opinion ? We read you 👀'

There is however also a display_text_range field. That should probably be taken into account for the renderedContent.
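As a rough illustration of how display_text_range could be applied, here is a minimal sketch. It is not snscrape's implementation. The helper name visible_text and the sample payload are hypothetical, and it assumes the range is a [start, end] pair counted in Unicode code points, which is how Python indexes strings.

```python
def visible_text(legacy: dict) -> str:
    """Return the slice of full_text covered by display_text_range.

    Falls back to the whole text when the field is absent.
    Assumption: the range offsets count Unicode code points.
    """
    text = legacy["full_text"]
    start, end = legacy.get("display_text_range", (0, len(text)))
    return text[start:end]


# Hypothetical payload modelled on the tweet in this issue; the range values
# are computed here for illustration, not copied from Twitter's response.
hidden_prefix = "@GitHubCopilot @tabnine @Replit @vercel "
body = "Have you tried them ?"
legacy = {
    "full_text": hidden_prefix + body,
    "display_text_range": [len(hidden_prefix), len(hidden_prefix) + len(body)],
}

print(visible_text(legacy))
```

Slicing by the range, rather than stripping leading @-mentions with a regular expression, keeps mentions that the author actually typed at the start of the tweet body.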

JustAnotherArchivist avatar Jun 30 '23 04:06 JustAnotherArchivist

Thanks for pointing it out @JustAnotherArchivist 🙏🏻

I did not realize that all accounts mentioned in a tweet are internally included in its replies (since you get notified about replies it makes sense 😄).

This might be a good opportunity for me to ask as well about the differences between content, renderedContent, and rawContent ?

enzoferey avatar Jun 30 '23 11:06 enzoferey

Forget that content exists; it's a deprecated alias from the early days that will be removed eventually. (It emits a warning if you try to use it.)

rawContent is the exact tweet text Twitter returns, while renderedContent is (roughly) the text as it would be rendered on Twitter's web interface. Currently the only difference between the two is the replacement of links, so renderedContent doesn't exactly match the web rendering. For example, replies start with a mention of the replied-to user, which gets rendered separately on the web interface.
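To make the link replacement concrete, here is a minimal sketch of how a rendered text could be derived from a raw one. It is an illustration under stated assumptions, not snscrape's actual code: the function name render_links is hypothetical, and the entity fields (url, expanded_url, display_url, indices) are assumed to follow the shape of Twitter's entities.urls payload.

```python
def render_links(raw_text: str, url_entities: list) -> str:
    """Replace each t.co URL in raw_text with its display_url.

    Entities are processed from the end of the string backwards so that
    the indices of earlier entities stay valid after each replacement.
    """
    text = raw_text
    for entity in sorted(url_entities, key=lambda e: e["indices"][0], reverse=True):
        start, end = entity["indices"]
        text = text[:start] + entity["display_url"] + text[end:]
    return text


# Hypothetical example data; the URLs are placeholders.
raw = "Docs: https://t.co/abc123 and more"
tco = "https://t.co/abc123"
start = raw.index(tco)
entities = [
    {
        "url": tco,
        "expanded_url": "https://example.com/docs",
        "display_url": "example.com/docs",
        "indices": [start, start + len(tco)],
    }
]

print(render_links(raw, entities))
```

If the entities are available, expanded_url in this assumed shape would already hold the original target, so following each t.co redirect may not be necessary.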

JustAnotherArchivist avatar Jun 30 '23 17:06 JustAnotherArchivist

By link replacement, you mean the https://t.co ones instead of the originals, right? I'm using Puppeteer to navigate those and get the actual URLs.

So as far as I understand, I should be using renderedContent, and there needs to be a fix so that it does not include mentions on replies. Is this right ?

enzoferey avatar Jun 30 '23 18:06 enzoferey