stweet
How to retrieve only a single Tweet, not subtweets
Running the sample code for getting a tweet by a single id actually scrapes dozens of tweets, takes 30+ seconds, and returns a massive list. Is it possible to get just the first tweet object and then stop, so the run is more time-efficient?
I've tried messing around with the context, but I can't wrap my head around how it's actually used in this project. I may just be being dense, though, and any clarification would be greatly appreciated.
@cguess is this what you were looking for?
search_tweets_task = stweet.SearchTweetsTask(from_username=tweet_author, replies_filter=RepliesFilter.ONLY_ORIGINAL)
https://github.com/markowanga/stweet/blob/fe34e98254dc7646bde6e083b5f6f745a0ee8cb6/stweet/search_runner/search_tweets_task.py#L29
https://github.com/markowanga/stweet/blob/fe34e98254dc7646bde6e083b5f6f745a0ee8cb6/stweet/search_runner/replies_filter.py#L6-L10
@oneroyalace are you aware of any ways to apply such filters for TweetsByIdTask? I have not found any options in the code and tried playing with the context but it seems to work as a counter for stats and doesn't seem to provide any options for limiting the number of replies/requests.
I did some quick research, and it seems the TweetsByIdContext object is exactly what you're looking for. Here:
https://github.com/markowanga/stweet/blob/fe34e98254dc7646bde6e083b5f6f745a0ee8cb6/stweet/tweets_by_ids_runner/tweets_by_id_runner.py#L71-L77
parsed_list is a list of UserTweetRaw and Cursor objects: the Cursor items tell the runner where to go next, and the UserTweetRaw items are the downloaded tweets. The runner decides whether the scrape has finished by checking whether the cursor attribute of the context is None.
If you only want the exact tweet you asked for, you can force the cursor to be None, like this:
parsed_list = get_all_tweets_from_json(response.text)
# cursors = [it for it in parsed_list if isinstance(it, Cursor)]
# cursor = cursors[0] if len(cursors) > 0 else None
cursor = None # force the cursor to be None
user_tweet_raw = [it for it in parsed_list if isinstance(it, UserTweetRaw)]
self.tweets_by_id_context.add_downloaded_tweets_count_in_request(len(user_tweet_raw))
self.tweets_by_id_context.cursor = cursor
self._process_new_tweets_to_output(user_tweet_raw)
With this edit, when the while loop in the run method calls _is_end_of_scrapping, ctx.cursor is None, so the loop condition is no longer satisfied. That ends the loop and therefore the scrape.
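To make the stopping behaviour concrete, here is a toy model of that loop with no stweet code involved. ToyContext, run, pages and force_no_cursor are all made up for illustration, and the stop rule (at least one request made and no cursor left) is my assumption about what _is_end_of_scrapping checks, so treat it as a sketch rather than the library's actual logic.

```python
from typing import List, Optional


class ToyContext:
    """Hypothetical stand-in for the runner's context object."""

    def __init__(self) -> None:
        self.cursor: Optional[str] = None
        self.requests_count = 0


def run(pages: List[Optional[str]], force_no_cursor: bool) -> int:
    """Return how many requests the toy loop makes.

    pages[i] is the cursor that request i would return (None = last page).
    """
    ctx = ToyContext()

    def is_end_of_scraping() -> bool:
        # Assumed stop rule: at least one request made and no cursor left.
        return ctx.requests_count > 0 and ctx.cursor is None

    while not is_end_of_scraping():
        next_cursor = pages[ctx.requests_count]
        ctx.requests_count += 1
        # Forcing None here mirrors the "cursor = None" edit in the snippet.
        ctx.cursor = None if force_no_cursor else next_cursor
    return ctx.requests_count


pages = ["c1", "c2", "c3", None]
print(run(pages, force_no_cursor=False))  # follows every cursor: 4 requests
print(run(pages, force_no_cursor=True))   # stops after the first request: 1
```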
This still results in a "one-level" scrape, so if the tweet has replies, some (or all) of them will still be included. You can then pick out the desired tweet by checking the "id_str" in the raw dictionary. Even so, this dramatically improves speed, especially for tweets with a large number of interactions.
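A minimal sketch of that filtering step, assuming each scraped item is available as a JSON string with a top-level "id_str" key. The helper name keep_only_tweet and the sample data are made up, and how you pull the raw strings out of stweet's output objects is left out here.

```python
import json
from typing import Iterable, List


def keep_only_tweet(raw_json_items: Iterable[str], tweet_id: str) -> List[dict]:
    """Return the decoded items whose "id_str" equals tweet_id."""
    matches = []
    for raw in raw_json_items:
        item = json.loads(raw)
        # Compare as strings: "id_str" is the string form of the tweet id.
        if item.get("id_str") == tweet_id:
            matches.append(item)
    return matches


# Made-up sample data standing in for one request's worth of results.
scraped = [
    json.dumps({"id_str": "100", "full_text": "the tweet I asked for"}),
    json.dumps({"id_str": "101", "full_text": "a reply"}),
]
print(keep_only_tweet(scraped, "100"))
```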
I've actually found a much easier implementation that doesn't require changing the source code.
from typing import Any

from stweet.tweets_by_ids_runner.tweets_by_id_context import TweetsByIdContext


class DummyContext(TweetsByIdContext):
    def __setattr__(self, name: str, value: Any) -> None:
        # Discard any cursor the runner tries to store, so it never pages further.
        if name == "cursor":
            value = None
        super().__setattr__(name, value)
Create a DummyContext instance and use it in place of a TweetsByIdContext instance. The only difference is that when the runner tries to set the cursor attribute, the __setattr__ method silently replaces the value with None.
Example Usage:
stweet.TweetsByIdRunner(
    tweets_by_id_task=task,
    raw_data_outputs=[output],
    tweets_by_ids_context=DummyContext(),
).run()
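If you want to check the __setattr__ trick on its own, without making any network requests, something like the following should work. StandInContext and NoCursorContext are hypothetical stand-ins rather than stweet classes. I'm assuming only that the real context is a plain object whose cursor attribute the runner assigns to.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class StandInContext:
    """Hypothetical stand-in for TweetsByIdContext; not the real class."""

    cursor: Optional[str] = None
    requests_count: int = 0


class NoCursorContext(StandInContext):
    def __setattr__(self, name: str, value: Any) -> None:
        # Every assignment to "cursor" is rewritten to None;
        # all other attributes are stored unchanged.
        if name == "cursor":
            value = None
        super().__setattr__(name, value)


ctx = NoCursorContext()
ctx.cursor = "scroll:abc123"   # what a runner would do after a request
ctx.requests_count += 1
print(ctx.cursor, ctx.requests_count)  # None 1
```

Because the check lives in __setattr__, it also covers the assignment made inside the constructor, so a cursor passed at creation time is dropped as well.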
Many thanks @junyilou! I have had success with the strategy you provided. Results still need filtering, as you foresaw, but speed is unharmed.