gallery-dl
[2chen] Add 2chen.moe extractor
https://2chen.moe/
This only has a limited set of metadata keys, and they're not enough to avoid collisions. In this thread there are multiple images named image.png, and there's no way to download all of them using this implementation. The last part of the file URL does contain a unique ID; you could extract that and put it in a metadata key, I guess. I'm not sure whether that can be done in the output template using gallery-dl's string formatting; I couldn't find a way.
https://2chen.moe/tech/1726349
@Hrxn Sorry for pinging you. Is it possible to extract the ID from the following URL using string formatting in the output template? If yes, how?
https://2chen.moe/assets/images/src/e4ae6d2bb0ae01e8cb1c853cf279ac07ea6ee1d4.png
I can make it so that it uses {reply_no} {filename} for the filename.
(This was my first ever extractor, so I missed a lot of things.)
Huh? You mean the ID/Hash value of the file?
From a cursory look at the source of a thread page (without an account etc.), probably by extracting the data-hash attribute from
<span class="fileinfo"><span>1.1 MB</span><span>1908x3392</span></span>
<a href="/assets/images/src/e86f89842c9a9ba18fbadd1505c7f93c6e4a2889.jpg" download="fansly-gdxe962yu7s.jpg" data-hash="e86f89842c9a9ba18fbadd1505c7f93c6e4a2889">fansly-gdxe962yu7s.jpg</a>
Which is inside <figcaption class="spaced"> ... </figcaption>
Or from the image link <a target="_blank" [..] >, where the hash is in the file name.
Both should always be the same.
<div class="post-container">
<figure><a target="_blank" href="/assets/images/src/e86f89842c9a9ba18fbadd1505c7f93c6e4a2889.jpg"><img src="/assets/images/thumb/e86f89842c9a9ba18fbadd1505c7f93c6e4a2889.jpg" width="84" height="150" loading="lazy"></a></figure>
<blockquote> ... </blockquote>
</div>
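A minimal sketch of pulling the data-hash value out of markup like the above, using only Python's standard library. This is not the extractor's actual code; the class and function names (HashCollector, extract_hashes) are made up for illustration, and the sample markup is the anchor tag quoted above.

```python
from html.parser import HTMLParser


class HashCollector(HTMLParser):
    """Collect (hash, uploaded filename) pairs from <a data-hash="..."> tags."""

    def __init__(self):
        super().__init__()
        self.files = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        # Only the download link in the figcaption carries data-hash.
        if "data-hash" in attrs:
            self.files.append((attrs["data-hash"], attrs.get("download")))


def extract_hashes(page):
    """Return every (hash, uploaded filename) pair found in the page source."""
    parser = HashCollector()
    parser.feed(page)
    return parser.files


SAMPLE = (
    '<a href="/assets/images/src/e86f89842c9a9ba18fbadd1505c7f93c6e4a2889.jpg" '
    'download="fansly-gdxe962yu7s.jpg" '
    'data-hash="e86f89842c9a9ba18fbadd1505c7f93c6e4a2889">'
    'fansly-gdxe962yu7s.jpg</a>'
)

for file_hash, name in extract_hashes(SAMPLE):
    print(file_hash, name)
```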
Keeping the "original" filename (the uploaded filename) has its benefits too, although you're right that this could cause filename collisions on the filesystem when downloading.
But there is a workaround: you can use the "skip" option with "enumerate". This should work right now, without making any changes to the extractor.
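As a sketch of that workaround, the snippet below builds the relevant config fragment and prints it as JSON. It assumes the extractor's category name is "2chen" (taken from the PR title); as I understand the documented behavior, "skip": "enumerate" adds an index to the filename extension when a file with the same name already exists.

```python
import json

# Assumption: "2chen" is the extractor category, as in the PR title.
config = {
    "extractor": {
        "2chen": {
            # Enumerate colliding filenames instead of skipping them.
            "skip": "enumerate",
        }
    }
}

# Print the fragment so it can be merged into an existing config file.
print(json.dumps(config, indent=4))
```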
extracting the data-hash attribute
I did exactly that, and also used it for archive_fmt
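To illustrate why a hash key helps, here is a small sketch with plain str.format_map rather than gallery-dl's own formatter. The key name "hash", the shortened hash values, and the two templates are assumptions for the example, not necessarily what the extractor ships with.

```python
# Hypothetical metadata for two files in one thread that share an upload name.
files = [
    {"hash": "aaa111", "filename": "image", "extension": "png"},
    {"hash": "bbb222", "filename": "image", "extension": "png"},
]

by_name = "{filename}.{extension}"  # collides for identical upload names
by_hash = "{hash}.{extension}"      # unique as long as the hashes differ


def render(template, entries):
    """Format one output name per metadata dict."""
    return [template.format_map(entry) for entry in entries]


names = render(by_name, files)
hashed = render(by_hash, files)

print(names, len(set(names)))
print(hashed, len(set(hashed)))
```

The same reasoning applies to an archive key: if it is built from the hash, two different files with the same upload name are tracked as separate entries.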
Thank you for implementing this. Would there be a way to also get the date of the files (for use with the mtime postprocessor)? I checked the headers of the direct URLs to the files but they don't have the date, the post creation date would work though.
Edit: mtime postprocessor uses the "date" metadata key: https://github.com/mikf/gallery-dl/blob/117eeefda0d1beced1e3fe448d34bb80d8584353/gallery_dl/postprocessor/mtime.py
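Roughly what setting a file's modification time from a "date" value involves, as a standalone sketch. It assumes "date" is a naive datetime in UTC; the function name set_mtime_from_date and the sample date are hypothetical, and this is not the postprocessor's actual code.

```python
import os
import tempfile
from datetime import datetime, timezone


def set_mtime_from_date(path, date):
    """Set atime and mtime of path from a naive datetime assumed to be UTC."""
    timestamp = date.replace(tzinfo=timezone.utc).timestamp()
    os.utime(path, (timestamp, timestamp))
    return timestamp


# Usage with a throwaway file and a made-up post date.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    demo_path = handle.name

demo_ts = set_mtime_from_date(demo_path, datetime(2022, 1, 1, 19, 19, 30))
print(demo_ts, os.path.getmtime(demo_path))
os.remove(demo_path)
```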
something like this?
I just tried your patch but I think utcoffset=-5.5 is wrong.
I loaded the page with JS disabled to get the UTC date, which is 19:19:30 (07:19:30 PM), but the tool got 22:49:30 (10:49:30 PM).
Everything works well though, I tried the mtime postprocessor and the keys you added. Just the offset is wrong.
I edited a local copy to utcoffset=-2, which produced the correct dates, but I'm in UTC+2. I find that weird: when loading the page without JS (which is what gallery-dl must be doing), 2chen sends the dates as UTC, so no conversion should be necessary, yet it is.
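The two results are at least consistent with each other if the offset is subtracted from the parsed time. The sketch below works through that arithmetic; the raw value 17:19:30 is inferred from the numbers in this thread and not confirmed, the calendar date is a placeholder, and apply_utcoffset is a hypothetical helper.

```python
from datetime import datetime, timedelta


def apply_utcoffset(local, utcoffset):
    """Convert a naive local time to UTC by subtracting its offset in hours."""
    return local - timedelta(hours=utcoffset)


# Placeholder date; the time of day is inferred, not taken from the site.
raw = datetime(2022, 1, 1, 17, 19, 30)

with_patch = apply_utcoffset(raw, -5.5)  # offset used in the patch
with_edit = apply_utcoffset(raw, -2)     # offset from the local edit

print(with_patch.time())  # 22:49:30
print(with_edit.time())   # 19:19:30
```

Under that assumption, a parsed value of 17:19:30 would explain both the 22:49:30 result and the correct 19:19:30 one.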
Yes, removing offset works. The date key is now correctly in UTC but the time key is in local timezone, do you know a way to fix that?
but the time key is in local timezone
timezones don't affect epoch, so it should be correct
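A quick way to check that claim with the standard library: the same epoch value rendered in two timezones shows different wall-clock times but still refers to one instant. The timestamp below is an arbitrary example, not a value from the site.

```python
from datetime import datetime, timedelta, timezone

epoch = 1_600_000_000  # arbitrary example Unix timestamp

as_utc = datetime.fromtimestamp(epoch, tz=timezone.utc)
as_plus2 = datetime.fromtimestamp(epoch, tz=timezone(timedelta(hours=2)))

# The wall-clock fields differ by the offset ...
print(as_utc.isoformat())
print(as_plus2.isoformat())

# ... but both objects describe the same instant and the same epoch.
print(as_utc == as_plus2)
print(as_utc.timestamp() == as_plus2.timestamp() == epoch)
```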
Thanks for the help. As I said, this was my very first extractor and I was new to Python, so I tried to copy the other extractors (and clearly missed a lot of things).
OK, let's leave it like that.
Thank you for the PR, all your work, and putting up with me.
Don't worry about copying an existing module or any potential mistakes. For your first PR, you did a lot better than most.