TumblThree
[Question] Discussion and Questions
For non-issue related questions, please ask here instead of creating new issues.
Thank you for the thread! I'm a bit confused by the several connection settings; I would like to understand exactly what they do.

The value of parallel blogs and parallel connections is the number of connections to Tumblr, right? If I set 20 parallel connections and 2 parallel blogs, will there be 10 download streams to the Tumblr servers per blog?

Further below there are settings for scan connections and the number of connections to the Tumblr api; their values are 4 scan connections and 60 connections to the Tumblr api. I guess the parallel connections setting determines the number of streams downloading files from the links, while the api connections setting is about how these links are fetched from the Tumblr api. If all that is true, what is the scan connections setting, whose value is 4? Is it related to the connections to the Tumblr api? I tried setting this value to 1000 and started downloading blogs (specifically the image metadata) and did not see the "Limit Exceeded" error, but I also did not notice any apparent increase in download speed. Is it better to leave this value at 1000 or to return it to 4?

There are also two settings here, "Timeout" and "Time interval". Do I understand correctly that the upper one is the maximum duration of a file-download connection, and the lower one the maximum duration of a connection to the Tumblr api, after which these connections are forcibly terminated by the program? Would it improve speed if I increased the time in the time interval setting for the Tumblr api? Sometimes I notice that the program does not download some part of the metadata, without showing any Limit Exceeded error afterwards, perhaps because of a timeout of the connection to the api.
Hello.
I was wondering where can I get the .exe file of the latest release. Unfortunately, I don't have VS2015 or higher, but I wanted to test out the app.
@AnryCryman: Under releases, download the latest release. Currently that is v1.0.8.4, so the right file is TumblThree-v1.0.8.4.zip
@johanneszab Yeah, I downloaded it. But there are no executables there, only source code. Can you possibly email me the .exe file of the latest release to [email protected]?
I've uploaded that file myself and I'm pretty sure that there is a file called TumblThree/TumblThree.exe in that particular zip file. I cannot send you the .exe by itself since it needs some more .dlls, which are included in the zip file. Thus, I'd have to send you the exact same file I've linked above.
Why do you download the file called source code if you don't want the source code? Since I've already received five similar emails, there must be a reason. Should I rename the link to binary? Did you download the source code .zip file from the main page by pressing the green download-or-clone button?
@johanneszab Sorry, my mistake. Must have hit the wrong link. I downloaded TumblThree-v1.0.8.4.zip and found TumblThree.exe in there. Is there a way to explicitly specify the language of the app?
@Taranchuk:
The value of parallel blogs and parallel connections is the number of connections to Tumblr. If I set 20 parallel connections and 2 parallel blogs, will there be 10 download streams to the Tumblr servers?
No, there will be 20 streams opened to the Tumblr servers. It was more hard-coded at the beginning; right now it checks the current number of active blogs and gives each active blog its slice of the downloads. Thus, if you have the parallel connections setting set to 20 but only one blog active in the queue, that blog will consume all 20 connections. If you have 2 active blogs, they both get 10 streams. It's probably a bit wonky, but it should work most of the time.
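The slicing described above can be sketched like this (a minimal illustration in Python, not TumblThree's actual C# code):

```python
def connections_per_blog(parallel_connections, active_blogs):
    """Give each active blog an equal slice of the global connection budget."""
    if active_blogs <= 0:
        return 0
    # at least one connection per blog, otherwise an even split
    return max(1, parallel_connections // active_blogs)

# 20 parallel connections, 1 active blog -> that blog gets all 20 streams
print(connections_per_blog(20, 1))  # 20
# 20 parallel connections, 2 active blogs -> 10 streams each
print(connections_per_blog(20, 2))  # 10
```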
Further below there are settings for scan connections and the number of connections to the Tumblr api. Their values are 4 scan connections and 60 connections to the Tumblr api.
I decoupled the scan/crawler connections from the settings above at some point. The Tumblr api / the svc service / the parsing of the website is usually quite quick since it's only a few KB of text. In the beginning of TumblThree, the crawler was started first and the downloader started after it finished. Thus it made sense to allow more connections for parsing the website and grabbing the urls than for downloading the heavy binary data.
Right now these values are mostly superfluous, for a few reasons:
- The downloader starts immediately after the crawler has dropped the first image/video/metadata url into the queue, so the waiting time until the first actual download starts is mostly negligible now.
- The Tumblr api is rate limited now. This means they only allow a specified number of connections to the api per time period. Thus, even if you increase the scan connections but have the "Limit the scan connections to the Tumblr api" checkbox ticked, the connections are queued until a free slot is available. So it basically makes no difference, since the rate limiter is the limiting factor.
- If you use the SVC release or the parsing release, however, you can increase or turn off the "Limit the scan connections to the SVC Service". I discovered the svc service during my implementation of the private blog downloader. It outputs even more data about the posts of a blog than the Tumblr api, but seems not to be rate limited. They possibly cannot even limit it, since their webpage depends on it. I've implemented most features in that branch already. You'll have to try; I don't know if they'll eventually limit it (if abused).
Also there are two settings here, "Timeout" and "Time interval". I understand that the upper one is the maximum duration of a file-download connection, and the lower one the maximum duration of a connection to the Tumblr api, after which these connections are forcibly terminated by the program?
Exactly.
- The Timeout (s) value is the maximum time a stream to the server stays open if there is no activity on it. Thus, if no data comes back for 120 seconds, the stream is closed.
- The Time Interval (s) belongs to the "Limit connections to the Tumblr Api" setting. If you enable the checkbox, TumblThree allows x connections per y seconds to the api. E.g., the default allows 90 connections per 60 seconds. For me this value finally works without any forcefully closed connections (e.g. Limit Exceeded -- 403 error messages). Keep in mind that this is a global value: if you browse the api manually or open TumblThree twice from the same connection, your connections might still be dropped. Thus, if you open TumblThree twice, you'd have to halve the value in each instance (45 connections per 60 seconds).
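What "x connections per y seconds" means can be illustrated with a sliding-window rate limiter; this is a hypothetical Python sketch, not TumblThree's actual implementation:

```python
import collections
import time

class RateLimiter:
    """Allow at most `limit` requests per `interval` seconds (sliding window)."""

    def __init__(self, limit, interval):
        self.limit = limit
        self.interval = interval
        self.timestamps = collections.deque()

    def acquire(self):
        now = time.monotonic()
        # drop timestamps that have fallen out of the window
        while self.timestamps and now - self.timestamps[0] >= self.interval:
            self.timestamps.popleft()
        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True   # a slot is free, the request may go out
        return False      # the caller must wait until a slot frees up

# the default setting: 90 connections per 60 seconds
limiter = RateLimiter(limit=90, interval=60)
```

With the checkbox ticked, a crawler would simply wait and retry whenever `acquire()` returns False, which is why raising the scan connections alone changes nothing.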
Also take a look into #107 for some more program details.
Please redesign the whole UI to a sane level where it conforms to common expectations:
- Selecting blogs is slow and does not do the obvious things; in particular, the automatic color changes in the lines are very confusing. Perhaps add a checkbox for selecting.
- How do I delete / select / rearrange things in the download list?
- The whole UI is very sluggish (on an i6770K 3.4 GHz ...).
- Replace the color codes in the listing with some descriptive text, or add an explanation of what they mean.
- Add a (scrolling) log tab showing what the program is currently busy with.
- There is no "update (all) blogs" button (with a "force check" checkbox) on the button bar.
- Why does it not download the images after extracting the URLs?
- The progress bar in the main window is very confusing; in particular, it can be "green" and not filled. Why?
I wonder: would it be possible to auto-upload files by queueing them? As in: point the app to a folder (or folders), provide a text file with a tag or make it configurable, and have the app process the folder, then upload and queue the images as per the setting (one at a time, two, three, four), add the tag, etc. This would make TumblThree even more than an amazing backup tool.
@Kvothe1970: Nice idea. Should be possible, yes. There already is a filesystem monitor api in C#, so implementing this should be more or less straightforward.
Maybe it would also be good to implement a GUI-less TumblThree at the same time and let it be started from the command line. That might reduce resource usage.
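The folder-monitoring idea could look roughly like this (a simple polling sketch in Python for illustration; the C# api mentioned above, System.IO.FileSystemWatcher, pushes events instead of polling):

```python
import os

def new_files(folder, seen):
    """Return files that appeared in `folder` since the last call.

    `seen` is a set of already-known filenames, updated in place;
    start with an empty set on the first call.
    """
    current = set(os.listdir(folder))
    added = sorted(current - seen)
    seen |= current
    return added

# Usage idea: call new_files(folder, seen) periodically; each returned
# file could then be tagged and queued for upload.
```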
@johanneszab Considering I am a big fan of GUIs I would support this being optional ;)
Can I download my own liked photos and videos? I tried "liked/by/myaccount" but it just shows: "Request denied. You do not have permission to access this page."
- What is this file *_files.tumblrtagsearch that lies in the folders from the tag downloader? I looked inside and it turned out that filenames are stored there. But here's the thing: there are a few hundred more filenames in it than there really are files in the folder. Why can this be so, and what does it affect? Could it be that there are skipped files that were not downloaded the first time, and the program cannot download them again because they are already on the list, just like with index files?
- Are the search and tag downloaders associated with the tumblr api? Is it possible to disable the api limit in the settings and run several instances for downloading by tag and search keywords, without the risk of some files being skipped during the download? I have already tried this and have not yet encountered a limit error, but I worry that the limit detection is simply not built into these functions and I just don't notice the missing files.
- Is it possible to get some metadata from the search and tag pages? It seems to me that it's impossible to get all of it, but is it possible to get a full list of the blogs on these pages from which the images were downloaded? I would very much like to one day see a function for downloading a list of blogs from search and tag pages, so that I can select the most frequent blogs and add them to the program; they probably contain good content if they contribute a lot of content to the search and tag pages I'm interested in.
- Only the regular Tumblr blog downloader in the 1.0.8.X releases uses the Tumblr api. The search downloader parses the regular website, so you should be able to run multiple instances without any problems. The SVC release (1.0.7.X) and the downloader for private blogs in the normal release (1.0.8.X) use a web service that the browser itself requires to display the website. So it might eventually be rate limited, but I don't think so.
Thank you for the great tool! One question: when I run the application for the first time, I can see the textboxes for the download time span (from ~ to) in the Details panel, but when I choose a blog, these textboxes disappear. Is this a function in progress, or am I using it the wrong way?
douww2000, this is for tag pages only. If you need to download blogs partially, use the page download function. For example, if you only need the last 1000 posts and have the default 50 posts per page in the details view, set the interval to 1-20 in the "Download pages:" field. Or 1-1000, if you set 1 post per page.
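The arithmetic behind those page ranges, as a tiny sketch (Python, for illustration only):

```python
import math

def pages_needed(posts_wanted, posts_per_page):
    """How many pages must be fetched to cover the newest N posts."""
    return math.ceil(posts_wanted / posts_per_page)

# 1000 posts at the default 50 posts per page -> "Download pages: 1-20"
print(pages_needed(1000, 50))  # 20
# 1000 posts at 1 post per page -> "Download pages: 1-1000"
print(pages_needed(1000, 1))   # 1000
```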
Well, not entirely right. I've included it in the release notes (v1.0.8.18) because downloading posts within a defined time span is possible for (private) blog downloads too.
So, I guess you'll have to update to the latest version.
I can't get into any private blogs. I went to settings and successfully authenticated with my Tumblr login credentials, but none of the private blogs I want to back up will download. I've attempted it both on a friend's private blog and on my own private blog, and neither works. Am I missing a step?
Am I missing a step?
Yes: describing exactly what happens when you try to download a private blog. What do you see in the queue progress? What happens with the blog: does it just finish, or hang? What did you select in the Details window for the private blog? Any tags? And maybe post the url here, so that someone can check whether it actually works.
Since you aren't the first person reporting this (see #118 for more), there might be something missing, but the blog posted there actually worked for me. Thus, I cannot do anything, since I cannot reproduce the error.
Of course, you could also debug the code/error yourself, if there is one after all ..
Ok, that won't work right now since you need the password to view your blog.
What I meant with a private blog is a blog like this: https://privtumbl.tumblr.com/ where you need to be logged in in order to see them.
It's probably possible to implement something so that it works with password-protected blogs too, but it's not possible right now. What are these things actually called? It's weird, though, since they are named differently all over the place. At least the last time I looked.
Hmm, it's way easier than I thought. You just have to do an additional POST request with the password in the body before browsing the blog, that's it. All the other code can be reused, I guess.
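That extra POST could be built roughly like this (a hedged sketch in Python with the stdlib; the form field name "password" and the example URL are assumptions for illustration, not verified against Tumblr):

```python
import urllib.parse
import urllib.request

def build_password_request(blog_url, password):
    """Build the extra POST that would unlock a password-protected blog.

    NOTE: the field name 'password' is a guess for illustration only.
    """
    body = urllib.parse.urlencode({"password": password}).encode("utf-8")
    req = urllib.request.Request(blog_url, data=body, method="POST")
    req.add_header("Content-Type", "application/x-www-form-urlencoded")
    return req

# The session cookie returned after sending this request would then
# be reused for the normal crawl of the blog.
req = build_password_request("https://example-blog.tumblr.com/", "secret")
```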
Looks easy to do, but I don't have any time for this right now.
Okay, one thing that works in the meantime is updating the cookie with Internet Explorer. TumblThree uses Internet Explorer to log in to tumblr.com; Internet Explorer is just opened in a different window, so they share the same cookie.
To download your password-protected blog, try the following:
- Start Internet Explorer, enter your blog's URL, enter the password, and load your blog once.
- Now you can re-add your blog to TumblThree and it should work. At least it did in my short test with my test blog here. You have to be authenticated though, even if you have a "public" blog, until I update the code properly.
@PonyGirl6763: I think this will work for you (it downloads a password-protected blog). You'll have to supply the blog's password in the Details tab:
Doesn't something like a hidden (login required) and password-protected blog exist? I can set both options on my second tumblr blog, but then it's impossible to access it from another account?
After I log in with a second account, I always get a 404 page (this tumblr does not exist) without seeing any password request page at all.
Hi @johanneszab, thanks for this questions thread (and for a really amazingly good app)! :)
I have a question about the "Downloaded Files" vs "Number of Downloads". Most of the time these numbers don't match up; why is that?
What do the numbers represent? I first assumed that "Number of Downloads" was the total number of downloads available after filtering (settings) and that "Downloaded Files" was a way to confirm that you had scraped 100% of the available gallery, but seeing the inconsistency makes me realize they do not work like that at all. Can you please explain? :)
Screenshot: Both blog crawls are complete.
It was implemented differently before, but the Number of Downloads is the number of downloads (posts, videos, images, external images/videos) TumblThree detected during the current crawl with your given settings, yep. Thus, it's neither the total number of possible downloads nor the total number of posts of the blog. Previously I tried to calculate the total number, but it was never really consistent.
As I've just mentioned, the number of posts can be lower than the number of downloads if the blog contains a picture set, as TumblThree will download all pictures from that set, or if there is an embedded picture within a post. If someone deletes things from the blog, then the Downloaded Files count will be higher than the Number of Downloads. It just was never really right, and people kept complaining, so I changed it to the current behavior.
It should be (almost) complete in your case if you download the whole blog at once, yes. But some urls TumblThree grabs aren't accessible on the Tumblr servers anymore. I've seen a few cases (pictures), and I'm sure those images are the reason for the lower count; they just return a 403 error code. I cannot give you an example right now, though.
So, it's more or less a rough estimate.
I see, great response @johanneszab thanks! :)
I'll keep that in mind and, I assume, I can safely ignore the numbers and use them only as an estimate of amounts/size. :)
Thanks again!
A short question: How does TumblThree determine its "duplicate found" elements?
I have tried the program with a single blog (with a rather big post count), and the number of found duplicates seems a bit high to me, though I admit that's just a guess. But given that each post on a blog has its own unique post ID, it can't be the posts themselves, or am I mistaken?