twitter_scraping

Parameterizing the dateinc field, changing the default date increment

Open rush00121 opened this issue 8 years ago • 7 comments

Instead of pulling data for every single day, this extends the search filter parameter to cover a larger date range. This should make the scraping process faster.
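To make the idea concrete, here is a minimal sketch of splitting a date range into windows of a configurable size. The names `date_windows` and `days_per_window` are hypothetical and not taken from the PR, and treating each window as half-open (start included, end excluded) is an assumption; how the pairs map onto the actual search filter is not shown in this thread.

```python
from datetime import date, timedelta


def date_windows(start, end, days_per_window=1):
    """Yield (since, until) pairs that cover [start, end) without gaps or overlap.

    Hypothetical helper: days_per_window=1 mimics a day-by-day search,
    larger values mimic a wider search filter.
    """
    if days_per_window < 1:
        raise ValueError("days_per_window must be at least 1")
    step = timedelta(days=days_per_window)
    since = start
    while since < end:
        # Clamp the last window so it never runs past the requested end date.
        until = min(since + step, end)
        yield since, until
        since = until


if __name__ == "__main__":
    for since, until in date_windows(date(2017, 1, 1), date(2017, 2, 1), 10):
        print(since.isoformat(), until.isoformat())
```

For a range of 2010-01-01 to 2017-03-01, a 50-day window produces 53 windows, compared with 2616 when going one day at a time.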

rush00121 avatar Mar 29 '17 18:03 rush00121

In my experience, you actually get fewer results when you use an interval, as opposed to going day by day. Have you tried running this on a user with a lot of tweets (20K+) and comparing the total against the interval method?

bpb27 avatar Mar 29 '17 18:03 bpb27

I was trying to get tweets from realdonaldtrump, who has more than 30k tweets. I refactored the code and ran it with a date interval of 50 days. It was much faster than fetching one day at a time. I did not record metrics to prove this, but it definitely sped up the scraping process for me.

rush00121 avatar Mar 29 '17 18:03 rush00121

By fewer results I mean you only collect 28K total tweets (with the interval method) instead of 30K total tweets (with the day-by-day method).
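One way to check this, beyond comparing totals, is to diff the tweet IDs collected by each method. The sketch below assumes each run yields an iterable of ID strings; the function name `compare_runs` and the sample IDs are hypothetical.

```python
def compare_runs(daily_ids, interval_ids):
    """Summarize how two scraping runs differ, ignoring duplicate IDs.

    Hypothetical helper: both arguments are iterables of tweet ID strings.
    """
    daily, interval = set(daily_ids), set(interval_ids)
    return {
        "daily_total": len(daily),
        "interval_total": len(interval),
        "only_in_daily": sorted(daily - interval),
        "only_in_interval": sorted(interval - daily),
    }


if __name__ == "__main__":
    # Made-up IDs purely for illustration.
    report = compare_runs(["101", "102", "103"], ["102", "103", "104", "104"])
    print(report)
```

Looking at the IDs that appear in only one run shows whether one method is a strict subset of the other or whether each misses different tweets.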

bpb27 avatar Mar 29 '17 18:03 bpb27

I did not test the number of tweets scraped. Let me test it and see whether both approaches give the same results.

rush00121 avatar Mar 29 '17 18:03 rush00121

I tested for dates from 2010-01-01 to 2017-03-01.

With the previous code, I got: total tweet count: 26141

With my modifications, I got: total tweet count: 27357

I also took a smaller sample:

dates from 2017-01-01 to 2017-02-01.

Both runs gave me total tweet count: 204

I am not sure why, in the first run, my code gave me more results. Is it a timeout issue with the Twitter page, or something else?

But in both cases, I got a significant speed improvement.

rush00121 avatar Mar 29 '17 22:03 rush00121

I also ran this PR branch and got 14479 IDs for 2010-01-01 to 2017-06-30. This may well be an artifact of how the pages load on my machine, a factor that seems like it would be an issue regardless, but it is definitely worth considering.

ryanbateman avatar Jul 06 '17 17:07 ryanbateman

Increasing the page load time to 2 netted 15687 ids for that same time period.
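Since the count seems to depend on how long the page is given to load, one possible approach is to keep loading more results until the set of IDs stops growing for a few rounds in a row. This is only a sketch under assumptions: `read_ids` and `load_more` are hypothetical callables standing in for whatever the scraper does to read the IDs currently on the page and to trigger more results, and `delay` is assumed to be in seconds.

```python
import time


def collect_until_stable(read_ids, load_more, delay=1.0,
                         max_idle_rounds=3, sleep=time.sleep):
    """Collect IDs until no new ones show up for max_idle_rounds consecutive rounds.

    read_ids and load_more are hypothetical callables supplied by the caller;
    delay is the pause between rounds (assumed to be seconds).
    """
    seen = set()
    idle_rounds = 0
    while idle_rounds < max_idle_rounds:
        before = len(seen)
        seen.update(read_ids())
        # Reset the idle counter whenever a round finds something new.
        idle_rounds = idle_rounds + 1 if len(seen) == before else 0
        load_more()
        sleep(delay)
    return seen


if __name__ == "__main__":
    # Simulated page where one batch of results arrives late.
    batches = [["1", "2"], ["3"], [], ["4"]]
    state = {"loaded": 0}

    def read_ids():
        return [i for batch in batches[:state["loaded"] + 1] for i in batch]

    def load_more():
        state["loaded"] = min(state["loaded"] + 1, len(batches) - 1)

    ids = collect_until_stable(read_ids, load_more, delay=0, max_idle_rounds=2)
    print(sorted(ids))
```

In the simulated run, tolerating only one idle round would stop before the late batch arrives, which is the same kind of undercount a short page load time could cause.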

ryanbateman avatar Jul 07 '17 13:07 ryanbateman