How to retrieve only a single Tweet, not subtweets

Open cguess opened this issue 2 years ago • 5 comments

Running the sample code for getting a tweet by a single ID actually scrapes dozens of tweets, takes 30+ seconds, and returns a massive list. Is it possible to get just the first tweet object and then stop, so the run is more time-efficient?

I've tried messing around with the context, but I can't wrap my head around how it's actually used in this project. I may just be being dense, though, and any clarification would be greatly appreciated.

cguess avatar Jan 30 '23 22:01 cguess

@cguess is this what you were looking for?

search_tweets_task = stweet.SearchTweetsTask(from_username=tweet_author, replies_filter=RepliesFilter.ONLY_ORIGINAL)

https://github.com/markowanga/stweet/blob/fe34e98254dc7646bde6e083b5f6f745a0ee8cb6/stweet/search_runner/search_tweets_task.py#L29

https://github.com/markowanga/stweet/blob/fe34e98254dc7646bde6e083b5f6f745a0ee8cb6/stweet/search_runner/replies_filter.py#L6-L10

oneroyalace avatar Mar 05 '23 01:03 oneroyalace

@oneroyalace are you aware of any ways to apply such filters for TweetsByIdTask? I have not found any options in the code and tried playing with the context but it seems to work as a counter for stats and doesn't seem to provide any options for limiting the number of replies/requests.

ataniz avatar Apr 28 '23 12:04 ataniz

I've done some quick research, and it seems the TweetsByIdContext object is exactly what you are looking for. Here:

https://github.com/markowanga/stweet/blob/fe34e98254dc7646bde6e083b5f6f745a0ee8cb6/stweet/tweets_by_ids_runner/tweets_by_id_runner.py#L71-L77

parsed_list is a list of UserTweetRaw and Cursor objects: the Cursors tell the runner where to go next, and the UserTweetRaws are the downloaded tweets. The runner decides whether the scrape has finished by checking whether the cursor attribute of the context is None.

If you only want the exact tweet you asked for, you can force the cursor to be None, like this:

parsed_list = get_all_tweets_from_json(response.text)
# cursors = [it for it in parsed_list if isinstance(it, Cursor)]
# cursor = cursors[0] if len(cursors) > 0 else None
cursor = None # force the cursor to be None
user_tweet_raw = [it for it in parsed_list if isinstance(it, UserTweetRaw)]
self.tweets_by_id_context.add_downloaded_tweets_count_in_request(len(user_tweet_raw))
self.tweets_by_id_context.cursor = cursor
self._process_new_tweets_to_output(user_tweet_raw)

With this edit, the while loop in the run method calls _is_end_of_scrapping; because ctx.cursor is None, the loop condition is no longer satisfied, so the loop ends and the scrape stops.
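To make the termination logic concrete, here is a toy model of the request loop. It does not import stweet: ToyContext, run_toy_scrape and the page list are hypothetical stand-ins, and the rule "stop once at least one request has run and no cursor remains" is an assumption about how the real check behaves, not a copy of the library's code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyContext:
    """Hypothetical stand-in for the runner's context object."""
    cursor: Optional[str] = None
    requests_count: int = 0


def run_toy_scrape(pages: List[Optional[str]], force_single_request: bool) -> int:
    """Simulate the request loop and return how many requests were made.

    pages[i] is the cursor that the i-th response would hand back
    (None means there is no further page).
    """
    ctx = ToyContext()

    def is_end_of_scraping() -> bool:
        # Assumption: stop once at least one request ran and no cursor remains.
        return ctx.requests_count > 0 and ctx.cursor is None

    while not is_end_of_scraping():
        next_cursor = pages[ctx.requests_count]
        ctx.requests_count += 1
        # Forcing the cursor to None mirrors the one-line edit above.
        ctx.cursor = None if force_single_request else next_cursor
    return ctx.requests_count


pages = ["page-2", "page-3", None]
print(run_toy_scrape(pages, force_single_request=False))  # 3
print(run_toy_scrape(pages, force_single_request=True))   # 1
```

In this model the unmodified loop follows every cursor, while the forced version stops after the first response.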

This still results in a "one-level" scrape, so if the tweet has replies, some (or all) of them will still be included. You can then pick out the desired tweet by checking "id_str" in the raw dictionary. Even so, this dramatically increases the speed, especially for tweets with a large number of interactions.
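A minimal sketch of that filtering step, assuming each collected tweet has already been decoded into a dict with a top-level "id_str" key. The helper name pick_tweet_by_id and the sample IDs are made up for illustration; if the raw payload nests the ID deeper, the lookup would need adjusting.

```python
from typing import Any, Dict, Iterable, Optional


def pick_tweet_by_id(
    raw_tweets: Iterable[Dict[str, Any]], tweet_id: str
) -> Optional[Dict[str, Any]]:
    """Return the first raw tweet whose "id_str" equals tweet_id, else None."""
    wanted = str(tweet_id)
    for raw in raw_tweets:
        # Assumption: the ID sits at the top level of the decoded dict.
        if str(raw.get("id_str")) == wanted:
            return raw
    return None


# Made-up sample: the requested tweet plus two replies from the same response.
sample = [
    {"id_str": "1001", "full_text": "a reply"},
    {"id_str": "1000", "full_text": "the tweet that was asked for"},
    {"id_str": "1002", "full_text": "another reply"},
]
match = pick_tweet_by_id(sample, "1000")
print(match["full_text"] if match else "not found")
```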

junyilou avatar May 07 '23 13:05 junyilou

I've actually found a much simpler approach that doesn't require changing the source code.

from typing import Any

from stweet.tweets_by_ids_runner.tweets_by_id_context import TweetsByIdContext

class DummyContext(TweetsByIdContext):
    def __setattr__(self, __name: str, __value: Any) -> None:
        # Discard whatever cursor the runner tries to store.
        if __name == "cursor":
            __value = None
        return super().__setattr__(__name, __value)

Create a DummyContext instance and use it in place of a TweetsByIdContext instance. The only difference is that whenever the runner tries to set the cursor attribute, the __setattr__ method silently replaces the value with None.
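The override can be tried out without the library. In the sketch below, StandInContext is a hypothetical dataclass standing in for TweetsByIdContext (its fields are assumptions, not the real class definition), and PinnedCursorContext applies the same __setattr__ trick to it.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class StandInContext:
    """Hypothetical stand-in for TweetsByIdContext; fields are assumed."""
    cursor: Optional[str] = None
    requests_count: int = 0


class PinnedCursorContext(StandInContext):
    def __setattr__(self, name: str, value: Any) -> None:
        # Any write to "cursor" is replaced with None; other attributes pass through.
        if name == "cursor":
            value = None
        super().__setattr__(name, value)


ctx = PinnedCursorContext()
ctx.cursor = "next-page-token"  # silently discarded
ctx.requests_count += 1         # stored as usual
print(ctx.cursor)          # None
print(ctx.requests_count)  # 1
```

Because the interception happens on assignment, the code doing the writing needs no changes; it simply never sees a cursor to follow.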

Example Usage:

stweet.TweetsByIdRunner(
    tweets_by_id_task=task,
    raw_data_outputs=[output],
    tweets_by_ids_context=DummyContext()
).run()

junyilou avatar May 07 '23 14:05 junyilou

Many thanks, @junyilou! I have had success with the strategy you provided. The results still need filtering, as you foresaw, but speed is unharmed.

ataniz avatar May 16 '23 11:05 ataniz