gallery-dl
Support for BlueSky?
The site is still invite-only for now, but I’m willing to provide an invite code as soon as I get a new one (should be in ~5 days).
Heya, I have a few invite codes left over if @mikf wants one to implement this :)
I've implemented some basics for this in an unrelated project. Chitose makes it relatively easy, but I'm not sure what the contribution guidelines are for new library dependencies. @mikf ?
Logic is basically
import json
import logging
import netrc
import posixpath
import typing
import urllib.error

import chitose

# Methods excerpted from a class; PostReference, bskyTupleToUri and
# NOUN_POST are defined elsewhere in that project.

def login(self, instance="bsky.social"):
    # Read the credentials for this instance from ~/.netrc
    rc = netrc.netrc()
    (BSKY_USER, _, BSKY_PASSWD) = rc.authenticators(instance)
    self.api = chitose.BskyAgent(service=f'https://{instance}')
    self.api.login(BSKY_USER, BSKY_PASSWD)
    logging.info(f"Logged into {instance} as {BSKY_USER}")

def getPostMedia(self, json_obj) -> typing.Iterable[typing.Tuple[str, str]]:
    # Yield (filename, url) for every image embedded in the post
    for image_def in json_obj.get('embed', {}).get('images', []):
        src_url = image_def['fullsize']
        name = posixpath.split(src_url)[-1].replace('@', '.')
        yield (name, src_url)

def bskyGetThread(self, post_reference: PostReference) -> dict:
    thread_response = self.api.get_post_thread(uri=self.bskyTupleToUri(post_reference))
    return json.loads(thread_response)

def getSkeetJsonApi(self, post_reference: PostReference, reason=""):
    try:
        thread_response = self.bskyGetThread(post_reference)
        thread_response['thread']['post']['id'] = post_reference.post_id
        logging.info(f"Downloaded new {self.NOUN_POST} for {post_reference} ({reason})")
        return thread_response['thread']['post']
    except urllib.error.HTTPError as e:  # type: ignore[attr-defined]
        # Log the response headers and body before re-raising
        logging.error(e.headers)
        logging.error(e.fp.read())
        raise
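For anyone who wants to try the media-extraction step without chitose or a login, here is a minimal standalone sketch of the same idea. The `embed` / `images` / `fullsize` keys are taken from the snippet above; the sample post and its URL are made up for illustration, and `get_post_media` is just a renamed copy of the method, not anything from gallery-dl.

```python
import posixpath
from typing import Iterable, Tuple


def get_post_media(post: dict) -> Iterable[Tuple[str, str]]:
    """Yield (filename, url) pairs for each image embedded in a post dict."""
    for image_def in post.get("embed", {}).get("images", []):
        src_url = image_def["fullsize"]
        # The last path segment looks like "<cid>@jpeg"; turn "@" into "."
        name = posixpath.split(src_url)[-1].replace("@", ".")
        yield name, src_url


# Hypothetical post, trimmed to the fields used above (not real data).
sample_post = {
    "embed": {
        "images": [
            {"fullsize": "https://cdn.bsky.app/img/feed_fullsize/plain/did:plc:example/examplecid@jpeg"},
        ]
    }
}

media = list(get_post_media(sample_post))
print(media)
```

Posts without an `embed` key simply yield nothing, so text-only posts do not need special handling.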
Bluesky is now open to the public, FYI:
https://www.pcmag.com/news/twitter-alternative-bluesky-makes-posts-publicly-viewable
Not for every account tho, you can manually set if you want your posts to be publicly viewable or only for people who are logged in.
BlueSky posts are always public. You can request for your profile to be hidden from the unauthenticated human-friendly web interface, but that doesn't make it private. It will always be readable via public API.
I've added a bunch of bluesky code.
Could someone test it and let me know what else should be added/improved/etc?
Starting experimentation with the version 1.26.7 on bluesky with username, password and cookies.
"bluesky":
{
    "username": "EMAIL",
    "password": "PASS",
    "retweets": false,
    "original": true,
    "cookies": "C:\\Users\\USER\\cookies.txt",
    "cookies-update": true
},
Using this post as a test: https://bsky.app/profile/toomanyboners.bsky.social/post/3khucm2ygso2z
Picture downloaded with gallery-dl results in a 1470 x 1260 JPG.
Opening picture in browser with "Open in new tab" gives a picture of 1000 x 857 JPG: https://cdn.bsky.app/img/feed_thumbnail/plain/did:plc:zyctzyihzisjnrdoiw75xvhm/bafkreie534wdizaua3psm66jflig66hqc5itocukqk4ajsu7pk6ic23aii@jpeg
Clicking on picture to open it up in browser and "Open in new tab" gives a picture of 2000 x 1714 JPG: https://cdn.bsky.app/img/feed_fullsize/plain/did:plc:zyctzyihzisjnrdoiw75xvhm/bafkreie534wdizaua3psm66jflig66hqc5itocukqk4ajsu7pk6ic23aii@jpeg
Removing the "original": true and cookies lines results in the same resolution being downloaded, somewhere between the "preview" and "fullsize" resolutions. Not sure if "fullsize" is actually that, or if the one ripped is the true size, but figured this should be known for clarity's sake.
I just tested the same link, same parameters except for not providing cookies.
Downloading the link the first time gave me the 2000 × 1714 JPG file.
All subsequent downloads of the same link using the same exact settings gave me the 1470 × 1260 JPG file.
Each image uploaded to bluesky has 3 different versions (at least, haven't found more at this point).
- thumbnail: scaled to 1000x? or ?x1000 px, regardless of original size (https://cdn.bsky.app/img/feed_thumbnail/plain/did:plc:cslxjqkeexku6elp5xowxkq7/bafkreifhy4gmtrfp3ax7wx2l7ojabjabhcnxvieumend3iu3ghlpp4fuiq@jpeg)
- fullsize: scaled to 2000x? or ?x2000 px, regardless of original size (https://cdn.bsky.app/img/feed_fullsize/plain/did:plc:cslxjqkeexku6elp5xowxkq7/bafkreifhy4gmtrfp3ax7wx2l7ojabjabhcnxvieumend3iu3ghlpp4fuiq@jpeg)
- original: original upload size, but downscaled to 2000x? or ?x2000 px if larger than that (https://bsky.social/xrpc/com.atproto.sync.getBlob?did=did:plc:cslxjqkeexku6elp5xowxkq7&cid=bafkreifhy4gmtrfp3ax7wx2l7ojabjabhcnxvieumend3iu3ghlpp4fuiq)
Even though I call this original, it is still a modified version of the uploaded file, as in every file gets converted to JPEG and even uploaded JPEGs get re-compressed.
(from https://bsky.app/profile/mikf.bsky.social/post/3kkn2rkvdls2v)
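The three URL shapes above only differ in a fixed prefix, so they can be built from a DID and CID. A rough sketch, assuming the patterns stay exactly as in the example URLs; `image_urls` is a made-up helper and the `@jpeg` suffix is copied from the examples, not from any documented guarantee:

```python
CDN_ROOT = "https://cdn.bsky.app/img"
BLOB_ENDPOINT = "https://bsky.social/xrpc/com.atproto.sync.getBlob"


def image_urls(did: str, cid: str, fmt: str = "jpeg") -> dict:
    """Return the thumbnail, fullsize and 'original' URLs for one image."""
    return {
        "thumbnail": f"{CDN_ROOT}/feed_thumbnail/plain/{did}/{cid}@{fmt}",
        "fullsize": f"{CDN_ROOT}/feed_fullsize/plain/{did}/{cid}@{fmt}",
        # The DID is left unescaped, matching the URLs quoted in this thread.
        "original": f"{BLOB_ENDPOINT}?did={did}&cid={cid}",
    }


urls = image_urls(
    "did:plc:cslxjqkeexku6elp5xowxkq7",
    "bafkreifhy4gmtrfp3ax7wx2l7ojabjabhcnxvieumend3iu3ghlpp4fuiq",
)
for version, url in urls.items():
    print(version, url)
```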
gallery-dl is currently downloading everything in original size (https://github.com/mikf/gallery-dl/commit/55bbd49a0eccbf207c6833983d3a2a0ff6f73287).
Guess I'll add an option for this.
edit: cookies don't work on bluesky. The site itself doesn't use cookies. You need to provide username and password to login, but you can remove password after the first login.
It's confusing and a bit annoying how bluesky handles this image resizing. You know it's bad when Twitter is more consistent with file sizes than this new alternative. So, if a 3000 x 3000 pic is uploaded, it'll always be downsized to 2000 x 2000 with no way to get the true original size, all after having to put up with image conversion and severe recompression.
> Even though I call this original, it is still a modified version of the uploaded file, as in every file gets converted to JPEG and even uploaded JPEGs get re-compressed.
But I think this is something Bluesky does when uploading the images, no? Ie., I don’t think Bluesky stores the original image anywhere, only their (potentially downscaled) JPEG image.
> edit: cookies don't work on bluesky. The site itself doesn't use cookies. You need to provide username and password to login, but you can remove password after the first login.
Is there a reason for asking for/using login information at all? Better rate limits? (As @qub1750ul mentioned earlier, all Bluesky posts (incl. images) are always public, so there’s no need for logging in to access them.)
> But I think this is something Bluesky does when uploading the images, no? Ie., I don’t think Bluesky stores the original image anywhere, only their (potentially downscaled) JPEG image.
Seems like it. Bluesky does not store the originally uploaded image.
> Is there a reason for asking for/using login information at all?
Certain (private) feeds, like /likes or /lists/<LIST-ID>, only return posts when logged in.
You don't need to login if all you want to do is download a user's media.
I just updated to gain access to the Bluesky functionality, but I have a question. The Python script I run for a bot uses a separate downloader function, so I call gallery-dl with my usual works-with-everything command (using the example post above):
gallery-dl --get-urls --no-download --option search-endpoint=graphq1 https://bsky.app/profile/toomanyboners.bsky.social/post/3khucm2ygso2z
Rather than outputting the actual image url as https://cdn.bsky.app/img/feed_fullsize/plain/did:plc:zyctzyihzisjnrdoiw75xvhm/bafkreie534wdizaua3psm66jflig66hqc5itocukqk4ajsu7pk6ic23aii@jpeg which is what shows up in a browser, it spits out a blob
https://bsky.social/xrpc/com.atproto.sync.getBlob?did=did:plc:zyctzyihzisjnrdoiw75xvhm&cid=bafkreie534wdizaua3psm66jflig66hqc5itocukqk4ajsu7pk6ic23aii
Granted, I am seeing similarities in the URLs, so a bit of URL rewriting could give me what I need to pass over to my separate downloading function. I'm just wondering if there are any command-line flags that, when using --get-urls and --no-download, output the correct https://cdn.bsky.app/img/feed_fullsize/plain/did: URL instead of https://bsky.social/xrpc/com.atproto.sync.getBlob?did=did: ?
@quentinwolf See https://github.com/mikf/gallery-dl/issues/4438#issuecomment-1937738238
There is currently no such option, but I'd think original resolution is better than the upscaled-to-2000px version.
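Until such an option exists, the rewrite described in the question can be done on the caller's side. A minimal sketch, assuming the cdn.bsky.app path layout seen in the URLs in this thread keeps working and that `@jpeg` is the right suffix; `blob_to_cdn_url` is a hypothetical helper, not part of gallery-dl:

```python
from urllib.parse import parse_qs, urlsplit


def blob_to_cdn_url(blob_url: str, variant: str = "feed_fullsize", fmt: str = "jpeg") -> str:
    """Rewrite a com.atproto.sync.getBlob URL into a cdn.bsky.app image URL."""
    parts = urlsplit(blob_url)
    if not parts.path.endswith("/com.atproto.sync.getBlob"):
        raise ValueError(f"not a getBlob URL: {blob_url}")
    query = parse_qs(parts.query)
    try:
        did = query["did"][0]
        cid = query["cid"][0]
    except KeyError as exc:
        raise ValueError(f"missing query parameter: {exc}") from None
    return f"https://cdn.bsky.app/img/{variant}/plain/{did}/{cid}@{fmt}"


blob = ("https://bsky.social/xrpc/com.atproto.sync.getBlob"
        "?did=did:plc:zyctzyihzisjnrdoiw75xvhm"
        "&cid=bafkreie534wdizaua3psm66jflig66hqc5itocukqk4ajsu7pk6ic23aii")
cdn = blob_to_cdn_url(blob)
print(cdn)
```

Passing `variant="feed_thumbnail"` would give the smaller version instead. Note that the result is the scaled-to-2000px file, not the one gallery-dl downloads by default.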
> instead output the correct
The https://bsky.social/xrpc/com.atproto.sync.getBlob?did=$DID&cid=$CID URL is the more “correct” URL since that one won’t break when(/if) other AT protocol nodes start getting added to the federation network. (DID is a unique user/account identifier across all AT protocol instances, CID is a unique content identifier.) The cdn.bsky.app URLs are implementation details specific to how Bluesky is handling the AT protocol and probably shouldn’t be considered stable (emphasis mine):
> Blobs for a specific account can be listed and downloaded using endpoints in the com.atproto.sync.* NSID space. These endpoints give access to the complete original blob, as uploaded. A common pattern is for applications to mirror both the original blob and any downsized thumbnail or preview versions via separate URLs (eg, on a CDN), instead of deep-linking to the getBlob endpoint on the original PDS.
> These endpoints give access to the complete original blob, as uploaded
Now that's not true, at least for the bsky.app instance it isn't.
https://bsky.social/xrpc/com.atproto.sync.getBlob?did=did:plc:cslxjqkeexku6elp5xowxkq7&cid=bafkreifhy4gmtrfp3ax7wx2l7ojabjabhcnxvieumend3iu3ghlpp4fuiq is not the same file I uploaded. Or is this URL somehow wrong, e.g. wrong CID?
I think it’s true in that it’s the “complete original blob, as uploaded” by bsky.app to their storage backend, even if not by the user to bsky.app, hence also my earlier comment about Bluesky’s handling of uploaded images.
I haven’t looked at what’s going on in the browser, but the JPEGifying and (potential) downscaling could even be happening browserside (I know that there are JavaScript libraries that do this anyway), so the original‐original might never touch any bsky.app infrastructure at all.
> JPEGifying and (potential) downscaling
Even JPEG files that don't get downscaled are modified: https://bsky.app/profile/mikf.bsky.social/post/3kkzcewddop2o
Requesting for unique ID to be added, a string with numbers/letters unique to the account
Bluesky's equivalent to Twitter's user IDs as unchanging, unique IDs are DIDs.
Each user has a handle and a DID, and both can be used with gallery-dl.
https://bsky.app/profile/bsky.app
https://bsky.app/profile/did:plc:z72i7hdynmk6r22z27h6tvur
A user's DID can be found at author['did'] (or user['did'] when enabled).
gallery-dl --filter "print(author['did']) or abort()" https://bsky.app/profile/bsky.app
gallery-dl -o metadata=user --filter "print(user['did']) or abort()" https://bsky.app/profile/bsky.app
It is also included in -K and -j outputs.
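Since both URL forms above are accepted, a script that feeds profile links to gallery-dl may want to know which kind of identifier it is holding. A small sketch, assuming the /profile/<identifier> layout of the two example URLs; `profile_identifier` is a made-up helper, and the "starts with did:" check is just a simple heuristic:

```python
from urllib.parse import urlsplit


def profile_identifier(url: str) -> tuple:
    """Return ("did" | "handle", identifier) for a bsky.app profile URL."""
    segments = urlsplit(url).path.strip("/").split("/")
    if len(segments) < 2 or segments[0] != "profile":
        raise ValueError(f"not a profile URL: {url}")
    identifier = segments[1]
    # DIDs start with "did:"; anything else is treated as a handle.
    kind = "did" if identifier.startswith("did:") else "handle"
    return kind, identifier


print(profile_identifier("https://bsky.app/profile/bsky.app"))
print(profile_identifier("https://bsky.app/profile/did:plc:z72i7hdynmk6r22z27h6tvur"))
```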
Doesn't work for archives because of the colon
[bluesky][warning] Failed to open download archive at 'D:\test/gallery-dl/archives/bluesky/did:plc:z72i7hdynmk6r22z27h6tvur.sqlite' (OperationalError: unable to open database file)
Then replace : (:R:/_/) or remove the first 8 characters ([8:]) in your format string.
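To show what those two format-string tweaks do to the DID, here is the plain-Python equivalent. This only mimics the effect on the string; it is not how gallery-dl evaluates format strings, and the helper names are made up:

```python
def did_replace_colons(did: str) -> str:
    """Same effect as the :R:/_/ replacement: every ':' becomes '_'."""
    return did.replace(":", "_")


def did_strip_prefix(did: str) -> str:
    """Same effect as [8:]: drop the first 8 characters ("did:plc:")."""
    return did[8:]


did = "did:plc:z72i7hdynmk6r22z27h6tvur"
print(did_replace_colons(did))  # did_plc_z72i7hdynmk6r22z27h6tvur
print(did_strip_prefix(did))    # z72i7hdynmk6r22z27h6tvur
```

Either result contains no colon, so it is safe to use in a Windows file name such as the archive path from the warning above. The slice relies on the prefix being exactly 8 characters long, while the replacement works for any number of colons.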
That worked. I want to request the equivalent of Mastodon extractor's {instance}. It's similar to {category} except it includes the domain as well.
I don't think dots are allowed in the username. Can there be a version of author['handle'] without the domain (any text that comes after the first dot)? Such a keyword exists for Mastodon and Misskey.
Dots are allowed if you use a custom domain name. I know this for a fact cause I've done so with an alt account of mine (NSFW so can't post the name here).
EDIT: By this I mean sub-domains; for example, "@sub.epiclper.com" would be a valid Bluesky username.
> I don't think dots are allowed in the username.
In Bluesky/the AT protocol, usernames are domain names (or as the documentation says: Handles are DNS names), so not only are dots allowed, they are required to have at least one in them. :) Most languages will have libraries for handling domain names (or you can just split on . and grab the first part of the resulting array), so you can use that if you’re only interested in the sub‐most part of the domain name. Do keep in mind if you do that, that you shouldn’t expect those to be unique – e.g., @freso.dk and @freso.bsky.social would both resolve to freso.
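The "split on the first dot" approach, and why the result is not unique, in a few lines of Python. `handle_label` is a made-up helper name; the handles are the ones mentioned in this thread:

```python
def handle_label(handle: str) -> str:
    """Return the part of a handle before the first dot, without a leading '@'."""
    return handle.lstrip("@").split(".", 1)[0]


labels = [handle_label(h) for h in ("@freso.dk", "@freso.bsky.social", "@sub.epiclper.com")]
print(labels)
```

The first two handles collapse to the same label, so anything keyed on it (directories, archives) can mix up different accounts. The DID is the safer key for that.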
Using this config:
    "bluesky": {
        "username": "[email protected]",
        "password": "bl;ahblah",
        "filename": "{createdAt[:19]}{post_id}{num}.{extension}",
        "directory": ["{category}", "{author[handle]}"],
        "include": "avatar,media",
        "reposts": false
        "retweets": false,
        "original": true,
        "cookies-update": true
    },
It scans the URLs I have in the txt file, but it doesn't download the files it finds.