twitter-archive-parser
Downloads
1st, a little fix:
corrects the folder name for images: `tweet_data` should be `tweets_data`.
2nd:
downloads the full-size version of each image and uses that in the markdown URL if successful; otherwise it falls back to the copy that came in the archive. A small percentage of images seem to violate the naming convention, or something similar, and aren't downloaded as a result.
note: I've added a 0.75-second delay after each image download just to, hopefully, appease any algorithm that's trying to prevent DoSing or similar abuse. Seems to work.
3rd:
updates README to make the features the first things people see, and adds some notes for non-python geeks.
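For reference, the full-size lookup in the 2nd point can be sketched as follows. The `:orig` suffix on `pbs.twimg.com` media URLs is an assumption about Twitter's convention for requesting the unreduced file, which would also explain why a few nonconforming URLs fail:

```python
def full_size_url(media_url):
    """Best-guess URL for the original, full-resolution image.

    Assumption: appending ':orig' to a pbs.twimg.com media URL
    requests the unreduced file. Callers should fall back to
    media_url itself if fetching this candidate fails.
    """
    return media_url + ':orig'

print(full_size_url('https://pbs.twimg.com/media/ABC123.jpg'))
# https://pbs.twimg.com/media/ABC123.jpg:orig
```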
Hi @masukomi. Thanks for making a PR. Some comments:
One change per PR please.
Before making a PR, please make an Issue so that we can get consensus on the problem and the possible fixes. Exception: the change is very small and the fix is uncontroversial.
First issue you mention was captured as #11 and is fixed now. (But please check on your archive.)
addressing conflicts now...
conflicts addressed.
The logic got a little more complicated; I'm wondering if the main method shouldn't be refactored now. I'll leave that to "real" Python programmers though. ;)
With that note, please sanity-check that I haven't done anything stupid. It seems to work, and handles the error states correctly, but... 🤷♀️ I rarely ever touch Python.
@masukomi I've improved the readme, so you can remove that bit from the PR.
It's a great contribution to be able to download the full-size images - thanks! I hadn't even noticed that the images they supply are reduced.
One hurdle though. We have a lot of users struggling to run the code at all. Asking them to pip install a package will exclude more of them. Is there a way to do it without that?
Also, the downloading could take a long time. It might be better to run the tool first and then print a message with instructions on how to re-run it with downloading turned on.
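That opt-in split could look something like this (the `--download-images` flag name is hypothetical, just for illustration):

```python
import argparse

parser = argparse.ArgumentParser(description='Parse a Twitter archive.')
# Hypothetical opt-in flag; the actual option name is up for discussion.
parser.add_argument('--download-images', action='store_true',
                    help='also fetch full-size images from Twitter (slow)')

args = parser.parse_args([])  # e.g. the user ran it with no flags
if not args.download_images:
    # Default run: convert the archive only, then tell the user
    # how to opt in to the slow download pass.
    print('Tip: re-run with --download-images to fetch full-size images.')
```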
It looks like one can use `urllib.request` instead of `requests`. ~~It will likely involve more code though... 🤔~~
https://dev.to/bowmanjd/http-calls-in-python-without-requests-or-other-external-dependencies-5aj1
Update: it does not, really:
```diff
diff --git i/parser.py w/parser.py
index f6d6d52..691c6e7 100644
--- i/parser.py
+++ w/parser.py
@@ -22,7 +22,7 @@ import glob
 import json
 import os
 import shutil
-import requests
+import urllib.request
 import time

 def read_json_from_js_file(filename):
@@ -44,10 +44,10 @@ def extract_username(account_js_filename):
 def download_image(url, file_name):
     if not os.path.isfile(file_name):
-        res = requests.get(url, stream = True)
-        if res.status_code == 200:
+        res = urllib.request.urlopen(url)
+        if res.code == 200:
             with open(file_name,'wb') as f:
-                shutil.copyfileobj(res.raw, f)
+                shutil.copyfileobj(res, f)
             print('Image sucessfully Downloaded: ',file_name)
             True
         else:
```
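For completeness, a self-contained sketch of the `urllib`-only `download_image` (it also returns `True`/`False` properly rather than the bare `True` expression visible in the diff; treat it as a sketch, not the final patch):

```python
import os
import shutil
import time
import urllib.request


def download_image(url, file_name, delay=0.75):
    """Fetch url into file_name; return True on success, False otherwise."""
    if os.path.isfile(file_name):
        return True  # already downloaded on a previous run
    try:
        # urlopen raises HTTPError (an OSError subclass) for non-2xx
        # responses, so no explicit status check is needed here.
        with urllib.request.urlopen(url) as res:
            with open(file_name, 'wb') as f:
                shutil.copyfileobj(res, f)
    except OSError:
        return False
    time.sleep(delay)  # small pause to avoid hammering the server
    print('Image successfully downloaded:', file_name)
    return True
```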
@masukomi I've made a proposal in a comment in #16 for how to implement this without breaking the user's expectations of what `parser.py` is doing. Let me know what you think and whether you are still able/motivated to work on this.
Have reworked this PR into #28