twitter_scraping
Can't get the IDs
When I run scrape.py, the final JSON it creates is blank, without any IDs in it. It was working last week. Does anyone know how to solve it?
I also have this problem; it used to work and now doesn't. My suspicion is that the problem is with the CSS selector. Maybe Twitter recently changed the way tweet IDs appear in the page's markup? I also don't really know what I'm talking about because I'm pretty new to Python. If you figure it out, please let me know!
@shenyizy I think I've fixed it, but I'm not entirely confident it's free of logic errors. It's a bit messy, but the trick is to use a new, less specific CSS selector. I've noticed three problems so far, but I've been able to work around them:
- The new selector will also pick up hyperlinks on the names of users being replied to, so to work around that I remove all list items that aren't fully numeric. But if your user was replying to someone with a fully numeric handle, that data point would slip through. There might be a better way to fix this.
- It also tends to duplicate a lot of tweet IDs, but this really doesn't matter because duplicates are removed at the end of the script.
- The JSON file doesn't get wiped at any point, so if you run the script for two users in a row, the second user will inherit all of the first user's tweets. My solution is to manually delete all_ids.json between runs, which is clunky but functional (see the sketch below for a tidier option).
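A tidier option for that last one (an untested sketch; the per-user filename pattern is my own invention, not part of the original script) would be to key the file by username:

```python
# Untested sketch: give each user their own ID file so consecutive runs
# for different users can't bleed into each other.
twitter_ids_filename = 'all_ids_{}.json'.format(user.lower())
```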
New selector:
```python
twitter_ids_filename = 'all_ids.json'
days = (end - start).days + 1
tweet_selector = 'article > div > div > div > div > div > div > a'
user = user.lower()
ids = []
```
New loop:
```python
for day in range(days):
    d1 = format_day(increment_day(start, 0))
    d2 = format_day(increment_day(start, 1))
    url = form_url(d1, d2)
    print(url)
    print(d1)
    driver.get(url)
    sleep(delay)

    try:
        found_tweets = driver.find_elements_by_css_selector(tweet_selector)
        all_tweets = found_tweets[:]
        increment = 0

        while len(found_tweets) >= increment:
            print('scrolling down to load more tweets')
            driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
            sleep(delay)
            found_tweets = driver.find_elements_by_css_selector(tweet_selector)
            all_tweets += found_tweets[:]
            print('{} tweets found, {} total'.format(len(found_tweets), len(ids)))
            increment += 10

        for tweet in all_tweets:
            try:
                id = tweet.get_attribute('href').split('/')[-1]
                ids.append(id)
            except StaleElementReferenceException as e:
                print('lost element reference', tweet)

        print(ids)
    except NoSuchElementException:
        print('no tweets on this day')

    start = increment_day(start, 1)

finalids = [tweetid for tweetid in ids if tweetid.isdigit()]
```
New writetofile:
```python
try:
    with open(twitter_ids_filename) as f:
        all_ids = finalids + json.load(f)
        data_to_write = list(set(all_ids))
        print('tweets found on this scrape: ', len(finalids))
        print('total tweet count: ', len(data_to_write))
except FileNotFoundError:
    with open(twitter_ids_filename, 'w') as f:
        all_ids = finalids[:]
        data_to_write = list(set(all_ids))
        print('tweets found on this scrape: ', len(finalids))
        print('total tweet count: ', len(data_to_write))

with open(twitter_ids_filename, 'w') as outfile:
    json.dump(data_to_write, outfile)
```
Hope that works for you too!
@jaackland Thanks for sharing a solution. However, when I run this code, it doesn't get out of the "while len(found_tweets) >= increment:" loop. The problem is coming from the "all_tweets" variable. There is nothing added to that variable and so it never gets out of that loop. Any alternative solution?
@AhsanCode Sorry, I should have made it clearer that that isn't a full script. Are you substituting it into the original scrape.py?
@jaackland No worries, I'm aware it isn't the full script. The original version was also working for me until now, and I made the same adjustments you did. It's the tweet_selector that's giving me problems at the moment.
@jaackland My mistake, I had an indentation problem. The code is running fine now. There are still a couple of issues: 1. I tried to retrieve all tweets since 1 Jan 2020 and noticed that after 20 days the code doesn't retrieve any new IDs anymore, so I had to rerun it multiple times. 2. Once all IDs are retrieved, there is still a significant number of tweets unaccounted for.
@AhsanCode Yes, unfortunately I think Twitter have managed to rate-limit Selenium now (the original post implies this wasn't always the case). If you increase the delay variable it will scrape more tweets (but take longer, obviously). I went up to 5 and got all the tweets I needed, but you might be able to get away with less than that.
Glad it was just an indentation problem, because as far as I can tell that tweet_selector is universal (if a bit sloppy).
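If you'd rather not hard-code a big delay, a rough sketch of an alternative (my own guess that an empty day means throttling; this isn't in the original script) is to retry the same day with a growing delay:

```python
from time import sleep

# Rough sketch: reload a day with a growing delay when nothing is found,
# on the guess that an empty result means rate limiting, not zero tweets.
def load_day(driver, url, tweet_selector, delay, max_delay=10):
    while True:
        driver.get(url)
        sleep(delay)
        found = driver.find_elements_by_css_selector(tweet_selector)
        if found or delay >= max_delay:
            return found, delay
        delay += 1  # back off and retry the same day
```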
@jaackland Thanks so much for sharing the code. However, when I tried to substitute your code into the original, I got a syntax error in the New writetofile part, as shown below:

```
    all_ids = finalids[-]
                        ^
SyntaxError: invalid syntax
```

I'm also pretty new to Python, so sorry for the stupid question.
I found a selector which seems to select only the time-posted link, which links to the full tweet's page:

```python
tweet_selector = 'article > div > div > div:nth-child(2) > div > div:nth-child(1) > div > div > div > a'
```
> I found a selector which seems to select only the time-posted link, which links to the full tweet's page:
> `tweet_selector = 'article > div > div > div:nth-child(2) > div > div:nth-child(1) > div > div > div > a'`
Hi, could you please tell me how you got this CSS selector for tweet_selector? Thanks
Twitter changed its CSS styling, so in the current code you need to change id_selector and tweet_selector to:

```python
id_selector = "div > div > :nth-child(2) > :nth-child(2) > :nth-child(1) > div > div > :nth-child(1) > a"
tweet_selector = 'article'
```
> `id_selector = "div > div > :nth-child(2) > :nth-child(2) > :nth-child(1) > div > div > :nth-child(1) > a"`
> `tweet_selector = 'article'`
I would combine the article part with the rest of the selector, for code neatness. Your selector seems to grab a few more tweets for whatever reason.
> Hi, could you please tell me how you got this CSS selector for tweet_selector? Thanks
I used Chrome DevTools to generate a selector and stripped out class names, etc.
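For illustration, the difference looks roughly like this (both chains here are shortened, hypothetical examples rather than selectors from the thread):

```python
# Hypothetical illustration: DevTools' "Copy selector" emits auto-generated
# class names that change between Twitter deploys...
devtools_selector = 'article > div > div.css-1dbjc4n.r-18u37iz > div:nth-child(2) > a'

# ...stripping the generated classes leaves only the structural chain:
stripped_selector = 'article > div > div > div:nth-child(2) > a'
```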
I've tried the suggested changes to id_selector and tweet_selector; however, I'm not getting the IDs with them.
I've changed the line collecting the ID (line 65) to this:

```python
id = tweet.find_element_by_css_selector(id_selector).get_attribute('href').split('/')[-3]
```

This gives me some IDs, but not even close to the number of tweets I'm finding. Any suggestions on what the problem might be?
I've started writing some scripts for a project I'm working on: https://github.com/namuan/twitter-utils
Currently tweets_between.py generates a text file, but I'll see if I can generate JSON so that the get_metadata.py script can be used without any changes.
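In the meantime, a minimal sketch of the conversion (assuming tweets_between.py writes one tweet ID per line; the tweet_ids.txt filename is a placeholder):

```python
import json

# Untested sketch: turn a one-ID-per-line text file into the JSON list
# format that all_ids.json uses, so get_metadata.py can read it.
with open('tweet_ids.txt') as f:
    ids = [line.strip() for line in f if line.strip()]

with open('all_ids.json', 'w') as out:
    json.dump(list(set(ids)), out)
```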
I'm back with a new selector! `article > div > div > div > div > div > div > div.r-1d09ksm > a`. The class name on the last div may change depending on platform; I've only tested it on Chrome 81. Removing it will collect some extra links to profiles, but those should be easy to filter out (see the sketch below).
I have also run into a new Twitter search page which will not work with this selector. Simple fix: just restart the script; I think they're A/B testing it.
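A sketch of that filter (the same numeric-tail idea used earlier in the thread):

```python
def keep_tweet_ids(elements):
    """Sketch: profile links end in a username, while tweet permalinks end
    in a numeric status ID, so keep only the numeric tails."""
    ids = []
    for el in elements:
        tail = el.get_attribute('href').split('/')[-1]
        if tail.isdigit():
            ids.append(tail)
    return ids
```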
@rougetimelord that selector worked... kinda. It found the tweets, but I don't know why it didn't save them.
I was able to get a list of IDs... the thing is that now I can't transform it into a JSON file. I always get the error "unhashable type: 'list'".
I tried transforming it into a dict or a tuple... but that didn't work.
My try block looks like this now:
```python
try:
    found_tweets = driver.find_elements_by_css_selector(tweet_selector)
    increment = 10
    while len(found_tweets) >= increment:
        print('scrolling down to load more tweets')
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        sleep(delay)
        found_tweets = driver.find_elements_by_css_selector(tweet_selector)
        increment += 10
        print('{} tweets found, {} total'.format(len(found_tweets), len(ids)))
    for tweet in found_tweets:
        try:
            id_user = tweet.find_element_by_css_selector(id_selector).get_attribute('href')
            id = [id_user.split('/')[-1] for x in id_user if id_user.split('/')[-3] == user]
            ids.append(id)
            print(id_user)
        except StaleElementReferenceException as e:
            print('lost element reference', tweet)
```
These are my selectors:

```python
id_selector = 'article > div > div.css-1dbjc4n.r-18u37iz > div.css-1dbjc4n.r-1iusvr4.r-16y2uox.r-1777fci.r-1mi0q7o ' \
              '> div:nth-child(1) > div > div > div.css-1dbjc4n.r-1d09ksm.r-18u37iz.r-1wbh5a2 > a'
tweet_selector = 'div > div > div > main > div > div > div > div > div > div > div > div > section > div > div > div > div > div > div > div > div > article'
```
I finally got it to work! But I hit Twitter's rate limit every time... I started at 2 seconds of sleep and did about 4 months, then changed it to 4 and did a couple more months... it's going to be a loooooong journey.
I will try to upload my version.
> I was able to get a list of IDs... I always get the error "unhashable type: 'list'"
Hi! Were you able to fix this error? I was trying what you recommended and it also shows the "unhashable type: 'list'" error.
I'd appreciate any help you could give me with this. I'm trying to retrieve tweets of specific accounts from November and December 2019.
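For what it's worth, "unhashable type: 'list'" is what `list(set(all_ids))` in the write step raises when `ids` contains lists rather than strings, and in the code above `id` is built with a list comprehension, so every append puts a whole list into `ids`. A sketch of the kind of change that avoids it (my own suggestion, assuming each link should yield a single ID string):

```python
def extract_id(href, user):
    """Sketch: return the numeric status ID if this permalink belongs to
    `user`, else None. Appending plain strings keeps ids hashable, so
    list(set(all_ids)) in the write step no longer raises."""
    parts = href.split('/')
    if len(parts) >= 3 and parts[-3] == user and parts[-1].isdigit():
        return parts[-1]
    return None
```

In the loop, that would replace the list comprehension with something like `tweet_id = extract_id(id_user, user)`, appending only when it isn't None.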
I submitted a Pull Request with my working version
Hi, I used the updated .py file and it still doesn't work. Is there anything else I have to change?
@rougetimelord Just wanted to let you know that your CSS selector can be condensed into something way smaller: `article div.r-1d09ksm > a`
I have fixed this in my own version. As I see PRs are not being sorted out, I'm not sure whether I should make a PR here to fix it; if there are others interested in me opening one, just let me know. There are quite a few improvements, from making sure the CSV writer uses UTF-8 encoding to simply using Selenium's ability to execute JavaScript to fetch the ID information from a tweet.
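Roughly, the execute_script idea looks like this (an untested sketch, not the exact code from my version; the selector is an assumption):

```python
# Untested sketch: pull permalink hrefs out of the DOM in one JavaScript
# call instead of iterating WebElements, which also sidesteps
# StaleElementReferenceException.
js = """
return Array.from(document.querySelectorAll('article a[href*="/status/"]'))
            .map(a => a.href);
"""
hrefs = driver.execute_script(js)
ids = [h.split('/')[-1] for h in hrefs if h.split('/')[-1].isdigit()]
```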
> I submitted a Pull Request with my working version

Bro, I tried your version and I'm getting results like this:

```
0 tweets found, 0 total
https://twitter.com/search?f=tweets&vertical=default&q=from%3Aelonmusk%20since%3A2014-04-07%20until%3A2014-04-08include%3Aretweets&src=typd
2014-04-07
```
Why don't I get all tweets from a user? For example, for @lulaoficial I only scrape 8k out of 22k.
The Twitter webpage has been modified in the years since this code was written. The modifications needed are as follows: id_selector and tweet_selector are not required now; you only need to change the try/except code inside the for loop:
```python
try:
    found_tweets = driver.find_elements_by_css_selector('a[href*="/' + user + '/status/"]')
    for i in found_tweets:
        print('Found: ', i.get_attribute('href'))
    increment = 10
    while len(found_tweets) >= increment:
        print('scrolling down to load more tweets')
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        sleep(delay)
        found_tweets = driver.find_elements_by_css_selector('a[href*="/' + user + '/status/"]')
        increment += 10
        print('{} tweets found, {} total'.format(len(found_tweets), len(ids)))
    for tweet in found_tweets:
        try:
            id = tweet.get_attribute('href').split('/')[-1]
            ids.append(id)
        except StaleElementReferenceException as e:
            print('lost element reference', tweet)
except NoSuchElementException:
    print('no tweets on this day')

start = increment_day(start, 1)
```
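For example, with `user = 'elonmusk'` the selector string evaluates to `a[href*="/elonmusk/status/"]`, an attribute selector that matches tweet permalinks by URL shape, independent of Twitter's div nesting:

```python
user = 'elonmusk'
# Evaluates to: a[href*="/elonmusk/status/"]
tweet_link_selector = 'a[href*="/' + user + '/status/"]'
```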