Some hidden Tumblr blog posts cannot be parsed

Open thornate opened this issue 6 years ago • 14 comments

I have two hidden tumblrs added to TumblThree. One of them loads fine, so authentication isn't an issue. The other sits in the queue with a status message 'Evaluated n tumblr blog sites' where 'n' counts from 1-4. I can see that another message appears very quickly then disappears after the 4th count. Is that logged anywhere?

I saw on a different issue that the 'Download reblogged posts' checkbox must be checked for hidden posts, so that's not an issue.

I have upgraded to version 1.0.8.45. I'm running Windows 7.

Can you please give suggestions on how to fix this, or debug it further?

thornate avatar Mar 22 '18 07:03 thornate

Sounds like you did everything right.

Could you clear your cookies once, authenticate again, and then retry the non-working blog? It is a workaround for people for whom the authentication seems to have no effect the first time (#180, #210).

Could you also post the blog's URL or send it to me by email? Then I'll check it later today and see if I can reproduce the error. Maybe there is a special character or something that cannot be parsed correctly.

I saw on a different issue that the 'Download reblogged posts' checkbox must be checked for hidden posts, so that's not an issue.

That should already be fixed since v1.0.8.41.

johanneszab avatar Mar 22 '18 10:03 johanneszab

I tried clearing the cookies and re-authenticating but it didn't work. I'll email you the blog url.

thornate avatar Mar 22 '18 15:03 thornate

Thanks for the blog.

At least one of the posts (I think there are two in total among the 44 posts) cannot be parsed, and the error message in Visual Studio is simply:

"Expecting state 'Element'.. Encountered 'Text'  with name '', namespace ''. "

which is quite useless. I think I'll eventually fix the problem, but I don't have time to investigate this further right now. It might be a weird emoji, as I've noticed there are some, or just some text that messes up the JSON so that it cannot be parsed correctly anymore.

As a workaround, in the Details tab you can set Posts per page to 1 instead of 50 for this specific blog. TumblThree will then crawl only one post per page (request), parse the posts one at a time, and discard only the two posts that cannot be processed. This way the majority of the blog is still downloadable. The only downside is slower crawling, but at least it will work.
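A minimal Python sketch of why the smaller page size helps. It is not TumblThree's code: it assumes each page is parsed as one JSON document, and `parse_page`, `crawl`, and the sample posts are hypothetical names and data made up for illustration.

```python
import json

def parse_page(raw_posts):
    """Parse one page; a single malformed post makes the whole page fail."""
    page_json = "[" + ",".join(raw_posts) + "]"
    return json.loads(page_json)  # raises json.JSONDecodeError on bad input

def crawl(raw_posts, posts_per_page):
    """Return (parsed_posts, discarded_count) for the given page size."""
    parsed, discarded = [], 0
    for start in range(0, len(raw_posts), posts_per_page):
        page = raw_posts[start:start + posts_per_page]
        try:
            parsed.extend(parse_page(page))
        except json.JSONDecodeError:
            discarded += len(page)  # every post on the failing page is lost
    return parsed, discarded

# Hypothetical blog: 44 posts, two of which are malformed.
posts = [json.dumps({"id": i}) for i in range(44)]
posts[7] = '{"id": 7, "body": "unterminated'
posts[30] = '{"id": 30, "body": }'

ok_50, lost_50 = crawl(posts, 50)  # one page holding all 44 posts
ok_1, lost_1 = crawl(posts, 1)     # 44 pages with one post each
```

With 50 posts per page the single page fails as a whole, so nothing is recovered. With 1 post per page only the two malformed posts are dropped, at the cost of one request per post.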

johanneszab avatar Mar 22 '18 18:03 johanneszab

Looks like it worked. Thanks!

thornate avatar Mar 23 '18 05:03 thornate

I'll add a notification for when this happens so that one can react and decrease the number of posts per page.

johanneszab avatar Mar 23 '18 09:03 johanneszab

The workaround works for now. Excellent. I have to check something, because I have now noticed that some blogs get crawled in their entirety every single time even though "Force Rescan" is not enabled in the settings, which does not make sense. This is noticeable now because of the crazy amount of time it takes at 1/50 of the speed for these blogs.

Kvothe1970 avatar Mar 28 '18 06:03 Kvothe1970

I have to check something, because I have now noticed that some blogs get crawled in their entirety every single time even though "Force Rescan" is not enabled in the settings, which does not make sense.

It's not implemented in the hidden Tumblr crawler. It's probably possible though.

If the API limit is reached in the normal, non-hidden Tumblr blogs, it doesn't save the last crawled post ID either. It might need an update now: if a download fails because of a timeout (which I'm still not 100% satisfied with), it maybe also doesn't save the last crawled post ID and hence recrawls everything.
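A rough Python sketch of the idea behind the last crawled post ID, assuming posts arrive newest first with increasing IDs. This is not how TumblThree implements it; `crawl_new_posts` and `fetch_posts` are hypothetical names used only for illustration.

```python
def crawl_new_posts(fetch_posts, last_id):
    """Collect posts newer than last_id.

    fetch_posts is a hypothetical callable yielding (post_id, ok) pairs,
    newest first; ok is False when the download of that post failed.
    Returns (new_ids, last_id_to_save).
    """
    new_ids = []
    complete = True
    for post_id, ok in fetch_posts():
        if last_id is not None and post_id <= last_id:
            break  # reached posts already seen on a previous run
        if not ok:
            complete = False  # remember that this run was not clean
            continue
        new_ids.append(post_id)
    # Advance the marker only after a clean run; otherwise keep the old
    # one so that the failed posts are visited again next time.
    if complete and new_ids:
        return new_ids, max(new_ids)
    return new_ids, last_id

clean_run = lambda: iter([(5, True), (4, True), (3, True), (2, True)])
failed_run = lambda: iter([(5, True), (4, False), (3, True)])

new_clean, saved_clean = crawl_new_posts(clean_run, 3)
new_failed, saved_failed = crawl_new_posts(failed_run, 3)
new_first, saved_first = crawl_new_posts(clean_run, None)
```

When no marker is stored at all (`last_id` is `None`), every run walks the whole blog, which would match the symptom of blogs being crawled in their entirety every time.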

johanneszab avatar Mar 29 '18 06:03 johanneszab

Is there anything I can do to help confirm this? It might also be that some of them were changed from normal to hidden or vice versa and were added again. Please let me know if you need me to do specific tests, extract information from the files, etc. I would love to help.

Kvothe1970 avatar Mar 29 '18 06:03 Kvothe1970

The parsing error occurs not only with protected blogs.

EC-O-DE avatar Apr 18 '18 16:04 EC-O-DE

Hey, by the way, https://github.com/ScriptSmith/reaper was updated a few hours ago and they fixed some Tumblr stuff...

EC-O-DE avatar Apr 18 '18 16:04 EC-O-DE

I think I might have fixed this now for good. Since I don't download that much personally, let me know if it still happens; in that case I'd like to get another (small!) example blog.

Thanks!

johanneszab avatar Apr 18 '18 19:04 johanneszab

Looking better now. I have an issue still with multiple blogs reporting that I need to be signed in, whilst being signed in. Will determine factors as soon as I have time to run a few tests. But now the blogs that used to stop after a few posts work, thank you!

Kvothe1970 avatar Apr 19 '18 19:04 Kvothe1970

It looks like it downloads some of the files, but not all. I'm getting 120 downloaded files out of 202 for the Tumblr I emailed to you, and it drops from the queue with the progress bar only partway through.

thornate avatar Apr 20 '18 05:04 thornate

Thanks @thornate. Then I think it's back to the state before Tumblr changed its APIs/website, which is okay for me.

To fix all the parsing errors and to be prepared for future changes, it might be an idea to use an external JSON library like Json.NET. It's way easier to code against and can handle unknown data by just ignoring it. The DataContractJsonSerializer we currently use, which is part of the .NET Framework, cannot do that and simply fails if it detects unknown structures. And I think that's what happens here. Some posts probably have some JSON-like string embedded, and then the parsing doesn't work.
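The difference between the two approaches can be sketched in Python. This only illustrates strict versus tolerant parsing in general and does not reproduce the actual behaviour of DataContractJsonSerializer or Json.NET; the `EXPECTED` schema, the function names, and the sample post are all hypothetical.

```python
# Hypothetical schema: field name -> expected Python type.
EXPECTED = {"id": int, "summary": str, "photos": list}

def parse_strict(post):
    """Fail on the first field that does not match the expected type."""
    result = {}
    for name, kind in EXPECTED.items():
        value = post.get(name)
        if not isinstance(value, kind):
            raise TypeError(f"field {name!r}: expected {kind.__name__}")
        result[name] = value
    return result

def parse_tolerant(post):
    """Keep the fields that match and silently skip everything else."""
    return {
        name: post[name]
        for name, kind in EXPECTED.items()
        if isinstance(post.get(name), kind)
    }

# A post whose "photos" field holds a JSON-like string instead of a list.
odd_post = {"id": 1, "summary": "hi", "photos": '{"url": "x"}', "extra": 1}

tolerant_result = parse_tolerant(odd_post)
try:
    parse_strict(odd_post)
    strict_failed = False
except TypeError:
    strict_failed = True
```

The strict parser rejects the whole post because of one unexpected value, while the tolerant one still returns the fields it understands.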

Since I want TumblThree to be as small as possible, I might have another look at some point later before making the switch. Thanks for reminding me of the email with the blog name.

johanneszab avatar Apr 20 '18 06:04 johanneszab