Parameterize the dateinc field and change the default date increment
Instead of pulling data one day at a time, this extends the search filter parameter to cover a larger date range per query. This should make the scraping process faster.
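To illustrate the idea, here is a minimal sketch of splitting a date range into larger windows. The function name `date_windows` and the parameter `days_per_window` are hypothetical and not taken from this PR; the sketch also assumes each window is treated as a half-open `[since, until)` range.

```python
from datetime import date, timedelta


def date_windows(start, end, days_per_window=1):
    """Yield (since, until) pairs covering [start, end) in fixed-size steps.

    Hypothetical helper, not the PR's code. The last window is clipped so it
    never runs past `end`.
    """
    if days_per_window < 1:
        raise ValueError("days_per_window must be at least 1")
    step = timedelta(days=days_per_window)
    since = start
    while since < end:
        until = min(since + step, end)
        yield since, until
        since = until


if __name__ == "__main__":
    # One window per 50 days instead of one per day.
    for since, until in date_windows(date(2017, 1, 1), date(2017, 6, 30), 50):
        print(since.isoformat(), until.isoformat())
```

With a 50-day window, the range 2010-01-01 to 2017-03-01 is covered by 53 windows instead of 2616 single-day windows, which is where the speedup would come from.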
In my experience, you actually get fewer results when you use an interval than when you go day by day. Have you tried running this on a user with a lot of tweets (20K+) and comparing the day-by-day total with the interval total?
I was trying to get tweets from realdonaldtrump, who has more than 30K tweets. I refactored the code and ran it with a date interval of 50 days. It was much faster than fetching one day at a time. I did not record metrics to prove this, but it definitely sped up the scraping process for me.
By fewer results I mean you only collect 28K total tweets (with the interval method) instead of 30K total tweets (with the day-by-day method).
I did not test the number of tweets scraped. Let me test it and see whether both methods give the same results.
I tested for dates from 2010-01-01 to 2017-03-01.
With the previous code, I got: total tweet count: 26141
With my modifications, I got: total tweet count: 27357
I then took a smaller sample:
dates from 2017-01-01 to 2017-02-01.
Both runs gave me: total tweet count: 204
I am not sure why my code gave me more results in the larger run. Is it a timeout issue with the Twitter page, or something else?
In both cases, though, I got a significant speed improvement.
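Since the totals alone do not show which tweets differ, comparing the collected IDs directly may help narrow this down. The following is only a sketch: `compare_runs` is a hypothetical helper, and it assumes each run's tweet IDs are available as an iterable of strings.

```python
def compare_runs(day_by_day_ids, interval_ids):
    """Report which tweet IDs appear in only one of two scraping runs.

    Hypothetical helper, not part of the PR. Duplicates within a run are
    collapsed, so the counts here can be lower than the raw totals.
    """
    a, b = set(day_by_day_ids), set(interval_ids)
    return {
        "only_day_by_day": sorted(a - b),
        "only_interval": sorted(b - a),
        "shared": len(a & b),
    }


if __name__ == "__main__":
    # Made-up IDs, purely to show the shape of the report.
    report = compare_runs(["101", "102", "103"], ["102", "103", "104"])
    print(report)
```

If the IDs missing from one run cluster around particular dates, that would point to page loading or timeouts rather than to the interval size itself.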
I also ran this PR branch and got 14479 for 2010-01-01 to 2017-06-30. This may well be a result of how pages load on my machine, which would be an issue regardless of the interval, but it is definitely worth considering.
Increasing the page load time to 2 netted 15687 IDs for that same time period.