
[2chen] Add 2chen.moe extractor

Open enduser420 opened this issue 2 years ago • 11 comments

https://2chen.moe/

enduser420 avatar Jun 27 '22 12:06 enduser420

This only has a limited set of metadata keys, and they're not enough to avoid collisions. In this thread there are multiple images named image.png, and there's no way to download all of them using this implementation. The last part of the URL does contain a unique ID; you could extract that and put it in a metadata key, I guess. I'm not sure whether you can do that in the output template using gallery-dl's string formatting; I couldn't find a way.

https://2chen.moe/tech/1726349

lx30011 avatar Aug 26 '22 18:08 lx30011

@Hrxn Sorry for pinging you. Is it possible to extract the ID from the following URL using string formatting in the output template? If yes, how?

https://2chen.moe/assets/images/src/e4ae6d2bb0ae01e8cb1c853cf279ac07ea6ee1d4.png
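
For reference, outside of the output template the ID is simply the last path segment of the URL without its extension. A minimal Python sketch of that idea follows; `file_id_from_url` is a hypothetical helper written for illustration and is not part of gallery-dl.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def file_id_from_url(url: str) -> str:
    """Return the last path segment of `url` without its file extension."""
    path = urlsplit(url).path            # drops scheme, host, and query string
    return PurePosixPath(path).stem      # "abc.png" -> "abc"


url = "https://2chen.moe/assets/images/src/e4ae6d2bb0ae01e8cb1c853cf279ac07ea6ee1d4.png"
print(file_id_from_url(url))  # e4ae6d2bb0ae01e8cb1c853cf279ac07ea6ee1d4
```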

lx30011 avatar Aug 26 '22 19:08 lx30011

I can make it so that it uses {reply_no} {filename} for the filename (this was my first ever extractor, so I missed a lot of things).

enduser420 avatar Aug 26 '22 19:08 enduser420

Huh? You mean the ID/Hash value of the file?

From taking a cursory look at the source of a thread page (without account etc.).. probably by extracting the data-hash attribute from

<span class="fileinfo"><span>1.1 MB</span><span>1908x3392</span></span>
<a href="/assets/images/src/e86f89842c9a9ba18fbadd1505c7f93c6e4a2889.jpg" download="fansly-gdxe962yu7s.jpg" data-hash="e86f89842c9a9ba18fbadd1505c7f93c6e4a2889">fansly-gdxe962yu7s.jpg</a>

Which is inside <figcaption class="spaced"> ... </figcaption>

Or from the image link <a target="_blank" [..] > where the hash is in the file name. Both should always be the same..

<div class="post-container">
    <figure><a target="_blank" href="/assets/images/src/e86f89842c9a9ba18fbadd1505c7f93c6e4a2889.jpg"><img src="/assets/images/thumb/e86f89842c9a9ba18fbadd1505c7f93c6e4a2889.jpg" width="84" height="150" loading="lazy"></a></figure>
    <blockquote> ... </blockquote>
</div>
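
A rough sketch of the first approach, using only Python's standard `html.parser` on the `<a>` element quoted above. The `HashCollector` class and its dictionary keys are made up for this illustration; this is not how the extractor itself parses pages.

```python
from html.parser import HTMLParser


class HashCollector(HTMLParser):
    """Collect hash, original filename, and URL from <a data-hash=...> tags."""

    def __init__(self):
        super().__init__()
        self.files = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        if "data-hash" in attributes:
            self.files.append({
                "hash": attributes["data-hash"],
                "filename": attributes.get("download"),
                "url": attributes.get("href"),
            })


# The anchor element from the thread page source quoted above.
SAMPLE = (
    '<a href="/assets/images/src/e86f89842c9a9ba18fbadd1505c7f93c6e4a2889.jpg" '
    'download="fansly-gdxe962yu7s.jpg" '
    'data-hash="e86f89842c9a9ba18fbadd1505c7f93c6e4a2889">fansly-gdxe962yu7s.jpg</a>'
)

collector = HashCollector()
collector.feed(SAMPLE)
print(collector.files[0]["hash"])
```

In this sample the hash also appears in the `href`, which matches the remark that both sources should agree.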

Keeping the "original" filename (uploaded filename) has its benefits too, although you're right that this could cause filename collisions on the filesystem when downloading.

But there is a workaround: you can use the "skip" option with "enumerate". This should work right now, without making any changes to the extractor.
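
As a sketch of what that workaround could look like in a configuration file, the snippet below builds the relevant fragment in Python and prints it as JSON. Placing `skip` directly under the top-level `extractor` section is an assumption here; check the gallery-dl configuration documentation for where it fits in an existing config.

```python
import json

# Assumed layout: apply "skip": "enumerate" to all extractors.
config = {
    "extractor": {
        "skip": "enumerate",
    }
}

rendered = json.dumps(config, indent=4)
print(rendered)
```

The same setting should also be usable for a single run with `-o skip=enumerate` on the command line.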

Hrxn avatar Aug 26 '22 19:08 Hrxn

extracting the data-hash attribute

I did exactly that, and also used it for archive_fmt

enduser420 avatar Aug 26 '22 19:08 enduser420

Thank you for implementing this. Would there be a way to also get the date of the files (for use with the mtime postprocessor)? I checked the headers of the direct URLs to the files but they don't have the date, the post creation date would work though.

Edit: mtime postprocessor uses the "date" metadata key: https://github.com/mikf/gallery-dl/blob/117eeefda0d1beced1e3fe448d34bb80d8584353/gallery_dl/postprocessor/mtime.py
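
To illustrate the general idea (this is not the code from `mtime.py`), here is a small sketch that sets a file's modification time from a naive UTC `datetime` stored under a `date` key. The metadata dictionary and the `set_mtime_from_date` helper are hypothetical.

```python
import os
import tempfile
from datetime import datetime, timezone


def set_mtime_from_date(path: str, date: datetime) -> None:
    """Set access and modification time of `path` from a naive UTC datetime."""
    timestamp = date.replace(tzinfo=timezone.utc).timestamp()
    os.utime(path, (timestamp, timestamp))


# Hypothetical metadata; the value is a placeholder, not a real post date.
metadata = {"date": datetime(2022, 8, 29, 10, 8, 0)}

# Create a throwaway file so the example does not touch real downloads.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    path = handle.name

set_mtime_from_date(path, metadata["date"])
mtime = os.path.getmtime(path)
os.remove(path)

print(datetime.fromtimestamp(mtime, tz=timezone.utc).isoformat())
```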

lx30011 avatar Aug 29 '22 10:08 lx30011

something like this?

enduser420 avatar Aug 29 '22 12:08 enduser420

I just tried your patch but I think utcoffset=-5.5 is wrong. I loaded the page with JS disabled to get the UTC date, which is 19:19:30 (07:19:30 PM), but the tool got 22:49:30 (10:49:30 PM).

Everything works well though, I tried the mtime postprocessor and the keys you added. Just the offset is wrong.

lx30011 avatar Sep 16 '22 10:09 lx30011

I edited a local copy to utcoffset=-2 which produced the correct dates but I'm in UTC+2. I find that weird because when loading the page without JS (that's what gallery-dl must be doing) 2chen sends the dates as UTC so no conversion should be necessary, yet it is.
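
The arithmetic can be checked with a short sketch. It assumes that a manual offset is subtracted from the parsed, timezone-naive time, and that the page text seen by the tool read 17:19:30. That raw value is an inference that makes both reported results consistent; it is not stated anywhere in this thread. `apply_manual_offset` is a hypothetical helper, not a gallery-dl function, and the calendar date is a placeholder.

```python
from datetime import datetime, timedelta


def apply_manual_offset(naive: datetime, utcoffset: float) -> datetime:
    """Treat `naive` as local time at UTC+utcoffset and convert it to UTC."""
    return naive - timedelta(hours=utcoffset)


# Assumed raw value; only the time of day matters for this example.
raw = datetime(2022, 1, 1, 17, 19, 30)

print(apply_manual_offset(raw, -5.5).time())  # 22:49:30
print(apply_manual_offset(raw, -2).time())    # 19:19:30
print(apply_manual_offset(raw, 0).time())     # 17:19:30
```

Under these assumptions, the value displayed in a UTC+2 browser would be two hours ahead of what the server sends, which would explain why an offset of -2 appears to give the "correct" date.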

lx30011 avatar Sep 16 '22 10:09 lx30011

Yes, removing the offset works. The date key is now correctly in UTC, but the time key is in the local timezone. Do you know a way to fix that?

lx30011 avatar Sep 16 '22 10:09 lx30011

but the time key is in local timezone

timezones don't affect epoch, so it should be correct
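
A quick way to see this in Python: the same instant viewed in UTC and in UTC+2 shows different wall-clock times but has the same epoch timestamp. The date used is an arbitrary placeholder.

```python
from datetime import datetime, timedelta, timezone

# One instant, expressed in two different timezones.
utc_view = datetime(2022, 9, 16, 10, 0, 0, tzinfo=timezone.utc)
local_view = utc_view.astimezone(timezone(timedelta(hours=2)))

print(utc_view.isoformat())    # 2022-09-16T10:00:00+00:00
print(local_view.isoformat())  # 2022-09-16T12:00:00+02:00

# The epoch value counts seconds since 1970-01-01T00:00:00 UTC,
# so it is identical for both views.
print(utc_view.timestamp() == local_view.timestamp())  # True
```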

enduser420 avatar Sep 16 '22 11:09 enduser420

Thanks for the help. As I said, this was my very first extractor and I was new to Python, so I tried to copy the other extractors (and clearly missed a lot of things).

enduser420 avatar Oct 04 '22 19:10 enduser420

OK, let's leave it like that.

Thank you for the PR, all your work, and putting up with me.

Don't worry about copying an existing module or any potential mistakes. For your first PR, you did a lot better than most.

mikf avatar Oct 04 '22 20:10 mikf