
Downloads

Open · masukomi opened this pull request 1 year ago · 5 comments

First, a little fix:

Corrects the folder name for images: tweet_data should be tweets_data.

Second:

Downloads the full-size version of each image and uses that in the markdown URL if successful; otherwise it just uses the copy that came in the archive. A small percentage of images seem to violate the naming convention, or something, and aren't downloaded as a result.

Note: I've added a 0.75 second delay after each image download just to, hopefully, appease any algorithm that's trying to prevent DoSing or similar abuse. Seems to work.
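
For illustration, a minimal sketch of that approach (the names download_full_size and media_items are hypothetical, not the actual PR code, and it assumes the requests library plus Twitter's ':orig' URL suffix for full-resolution media):

import shutil
import time
import requests

def download_full_size(url, file_name):
    # ':orig' asks Twitter's media server for the original resolution.
    try:
        res = requests.get(url + ':orig', stream=True, timeout=30)
        if res.status_code == 200:
            with open(file_name, 'wb') as f:
                shutil.copyfileobj(res.raw, f)
            return True
    except requests.RequestException:
        pass
    return False

# Fall back to the archive's reduced copy on failure, sleeping 0.75s
# between downloads to stay polite. media_items is a made-up example.
media_items = [('https://pbs.twimg.com/media/EXAMPLE.jpg',
                'archive/EXAMPLE.jpg', 'output/EXAMPLE.jpg')]
for url, archive_copy, dest in media_items:
    if not download_full_size(url, dest):
        shutil.copyfile(archive_copy, dest)
    time.sleep(0.75)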

Third:

Updates the README to make the features the first thing people see, and adds some notes for non-Python geeks.

masukomi avatar Nov 11 '22 13:11 masukomi

Hi @masukomi. Thanks for making a PR. Some comments:

One change per PR please.

Before making a PR, please make an Issue so that we can get consensus on the problem and the possible fixes. Exception: the change is very small and the fix is uncontroversial.

The first issue you mention was captured as #11 and is fixed now. (But please verify against your archive.)

timhutton avatar Nov 11 '22 14:11 timhutton

Addressing conflicts now...

masukomi avatar Nov 11 '22 14:11 masukomi

Conflicts addressed.

The logic got a little more complicated; I'm wondering if the main method shouldn't be refactored now. I'll leave that to "real" Python programmers though. ;)

With that in mind, please sanity-check that I haven't done anything stupid. It seems to work and handles the error states correctly, but... 🤷‍♀️ I rarely ever touch Python.

masukomi avatar Nov 11 '22 15:11 masukomi

@masukomi I've improved the readme, so you can remove that bit from the PR.

It's a great contribution to be able to download the full-size images - thanks! I hadn't even noticed that the images they supply are reduced.

One hurdle, though: we have a lot of users struggling to run the code at all, and asking them to pip install a package will exclude more of them. Is there a way to do it without that?

Also, the downloading could take a long time. It might be better to run the tool first and then print a message with instructions on how to re-run it with downloading turned on.
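
A sketch of how that opt-in might look (the --download-images flag is hypothetical, just to show the shape of it):

import sys

# Hypothetical opt-in flag: skip the slow downloads by default.
if '--download-images' in sys.argv:
    download_images = True
else:
    download_images = False
    print('Note: full-size image download is off (it can take a while).')
    print('Re-run with --download-images to fetch the originals.')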

timhutton avatar Nov 11 '22 20:11 timhutton

It looks like one can use urllib.request instead of requests. ~~It will likely involve more code though... 🤔~~ https://dev.to/bowmanjd/http-calls-in-python-without-requests-or-other-external-dependencies-5aj1

Update: it doesn't, really:

diff --git i/parser.py w/parser.py
index f6d6d52..691c6e7 100644
--- i/parser.py
+++ w/parser.py
@@ -22,7 +22,7 @@ import glob
 import json
 import os
 import shutil
-import requests
+import urllib.request
 import time
 
 def read_json_from_js_file(filename):
@@ -44,10 +44,10 @@ def extract_username(account_js_filename):
 
 def download_image(url, file_name):
     if not os.path.isfile(file_name):
-        res = requests.get(url, stream = True)
-        if res.status_code == 200:
+        res = urllib.request.urlopen(url)
+        if res.code == 200:
             with open(file_name,'wb') as f:
-                shutil.copyfileobj(res.raw, f)
+                shutil.copyfileobj(res, f)
             print('Image sucessfully Downloaded: ',file_name)
             True
         else:
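
One caveat worth flagging with this swap: urllib.request.urlopen raises urllib.error.HTTPError for non-2xx responses rather than returning them, so the res.code == 200 check would rarely see a failure. A minimal exception-based sketch of the same function:

import shutil
import urllib.error
import urllib.request

def download_image(url, file_name):
    # urlopen raises HTTPError (a URLError subclass) on 4xx/5xx, so
    # failures surface as exceptions rather than status codes.
    try:
        with urllib.request.urlopen(url) as res:
            with open(file_name, 'wb') as f:
                shutil.copyfileobj(res, f)
        return True
    except urllib.error.URLError as err:
        print('Failed to download', url, '-', err)
        return False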

davidstosik avatar Nov 12 '22 12:11 davidstosik

@masukomi I've made a proposal in a comment in #16 for how to implement this without breaking the user's expectations of what parser.py is doing. Let me know what you think and whether you are still able/motivated to work on this.

timhutton avatar Nov 13 '22 00:11 timhutton

I've reworked this PR into #28.

timhutton avatar Nov 13 '22 02:11 timhutton