Support for BlueSky?

Open Erkhyan opened this issue 1 year ago • 27 comments

The site is still invite-only for now, but I’m willing to provide an invite code as soon as I get a new one (should be in ~5 days).

Erkhyan avatar Aug 18 '23 12:08 Erkhyan

Heya, I have a few invite codes left over if @mikf wants one to implement this :)

EpicLPer avatar Sep 20 '23 05:09 EpicLPer

I've implemented some basics for this in an unrelated project. Chitose makes it relatively easy, but I'm not sure what the contribution guidelines are for new library dependencies. @mikf ?

Logic is basically

    import json
    import logging
    import netrc
    import posixpath
    import typing
    import urllib.error

    import chitose

    # Methods excerpted from a larger class; PostReference, NOUN_POST and
    # bskyTupleToUri are defined elsewhere in that project.

    def login(self, instance="bsky.social"):
        # Read the credentials for this instance from the user's netrc file
        rc = netrc.netrc()
        (BSKY_USER, _, BSKY_PASSWD) = rc.authenticators(instance)

        self.api = chitose.BskyAgent(service=f'https://{instance}')
        self.api.login(BSKY_USER, BSKY_PASSWD)
        logging.info(f"Logged into {instance} as {BSKY_USER}")

    def getPostMedia(self, json_obj) -> typing.Iterable[typing.Tuple[str, str]]:
        # Yield a (filename, url) pair for every image embedded in the post
        for image_def in json_obj.get('embed', {}).get('images', []):
            src_url = image_def['fullsize']
            # The last path segment looks like "<cid>@jpeg"; turn it into "<cid>.jpeg"
            name = posixpath.split(src_url)[-1].replace('@', '.')
            yield (name, src_url)

    def bskyGetThread(self, post_reference: PostReference) -> dict:
        # The API response is a JSON string; decode it into a dict
        thread_response = self.api.get_post_thread(uri=self.bskyTupleToUri(post_reference))
        return json.loads(thread_response)

    def getSkeetJsonApi(self, post_reference: PostReference, reason=""):
        try:
            thread_response = self.bskyGetThread(post_reference)
            thread_response['thread']['post']['id'] = post_reference.post_id

            logging.info(f"Downloaded new {self.NOUN_POST} for {post_reference} ({reason})")
            return thread_response['thread']['post']

        except urllib.error.HTTPError as e:  # type: ignore[attr-defined]
            # Log the response headers and body before re-raising
            logging.error(e.headers)
            logging.error(e.fp.read())
            raise

GiovanH avatar Oct 26 '23 01:10 GiovanH

Bluesky is now open to the public, FYI:

https://www.pcmag.com/news/twitter-alternative-bluesky-makes-posts-publicly-viewable

Iron-Squid avatar Jan 18 '24 03:01 Iron-Squid

Bluesky is now open to the public, FYI:

https://www.pcmag.com/news/twitter-alternative-bluesky-makes-posts-publicly-viewable

Not for every account tho, you can manually set if you want your posts to be publicly viewable or only for people who are logged in.

EpicLPer avatar Jan 18 '24 12:01 EpicLPer

BlueSky posts are always public. You can request for your profile to be hidden from the unauthenticated human-friendly web interface, but that doesn't make it private. It will always be readable via public API.

qub1750ul avatar Jan 19 '24 19:01 qub1750ul

I've added a bunch of bluesky code. Could someone test it and let me know what else should be added/improved/etc?

mikf avatar Feb 10 '24 21:02 mikf

I've started experimenting with version 1.26.7 on Bluesky, using username, password and cookies.

    "bluesky":
    {
        "username": "EMAIL",
        "password": "PASS",
        "retweets": false,
        "original": true,
        "cookies": "C:\\Users\\USER\\cookies.txt",
        "cookies-update": true
    },

Using this post as a test: https://bsky.app/profile/toomanyboners.bsky.social/post/3khucm2ygso2z

The picture downloaded with gallery-dl is a 1470 x 1260 JPG (2023-12-31T18_07_37_3khucm2ygso2z_1).

Opening picture in browser with "Open in new tab" gives a picture of 1000 x 857 JPG: https://cdn.bsky.app/img/feed_thumbnail/plain/did:plc:zyctzyihzisjnrdoiw75xvhm/bafkreie534wdizaua3psm66jflig66hqc5itocukqk4ajsu7pk6ic23aii@jpeg

Clicking on picture to open it up in browser and "Open in new tab" gives a picture of 2000 x 1714 JPG: https://cdn.bsky.app/img/feed_fullsize/plain/did:plc:zyctzyihzisjnrdoiw75xvhm/bafkreie534wdizaua3psm66jflig66hqc5itocukqk4ajsu7pk6ic23aii@jpeg

Removing the "original": true and cookies lines results in the same resolution being downloaded, one that falls between the "preview" and "fullsize" resolutions. I'm not sure whether "fullsize" is actually that, or whether the downloaded one is the true size, but I figured this should be known for clarity's sake.

biznizz avatar Feb 11 '24 09:02 biznizz

I just tested the same link, same parameters except for not providing cookies.

Downloading the link the first time gave me the 2000 × 1714 JPG file.

All subsequent downloads of the same link using the same exact settings gave me the 1470 × 1260 JPG file.

Erkhyan avatar Feb 11 '24 11:02 Erkhyan

Each image uploaded to bluesky has 3 different versions (at least, haven't found more at this point).

  • thumbnail: scaled to 1000x? or ?x1000 px, regardless of original size (https://cdn.bsky.app/img/feed_thumbnail/plain/did:plc:cslxjqkeexku6elp5xowxkq7/bafkreifhy4gmtrfp3ax7wx2l7ojabjabhcnxvieumend3iu3ghlpp4fuiq@jpeg)

  • fullsize: scaled to 2000x? or ?x2000 px, regardless of original size (https://cdn.bsky.app/img/feed_fullsize/plain/did:plc:cslxjqkeexku6elp5xowxkq7/bafkreifhy4gmtrfp3ax7wx2l7ojabjabhcnxvieumend3iu3ghlpp4fuiq@jpeg)

  • original: original upload size, but downscaled to 2000x? or ?x2000 px if larger than that (https://bsky.social/xrpc/com.atproto.sync.getBlob?did=did:plc:cslxjqkeexku6elp5xowxkq7&cid=bafkreifhy4gmtrfp3ax7wx2l7ojabjabhcnxvieumend3iu3ghlpp4fuiq)

    Even though I call this original, it is still a modified version of the uploaded file, as in every file gets converted to JPEG and even uploaded JPEGs get re-compressed.
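
As a rough illustration of the three variants listed above, the URLs can be assembled from a user's DID and an image's CID. This is only a sketch based on the URL shapes shown in this comment; the function name `image_urls` is made up for the example, and the CDN paths are not a documented, stable interface.

```python
def image_urls(did: str, cid: str) -> dict:
    """Build the three known URL variants for one uploaded image.

    The URL layouts are copied from the examples in this thread and are
    an assumption, not an official API contract.
    """
    cdn = "https://cdn.bsky.app/img"
    return {
        # scaled to 1000 px on the longer side
        "thumbnail": f"{cdn}/feed_thumbnail/plain/{did}/{cid}@jpeg",
        # scaled to 2000 px on the longer side
        "fullsize": f"{cdn}/feed_fullsize/plain/{did}/{cid}@jpeg",
        # stored blob: upload size, capped at 2000 px
        "original": (
            "https://bsky.social/xrpc/com.atproto.sync.getBlob"
            f"?did={did}&cid={cid}"
        ),
    }


urls = image_urls(
    "did:plc:cslxjqkeexku6elp5xowxkq7",
    "bafkreifhy4gmtrfp3ax7wx2l7ojabjabhcnxvieumend3iu3ghlpp4fuiq",
)
for name, url in urls.items():
    print(name, url)
```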

(from https://bsky.app/profile/mikf.bsky.social/post/3kkn2rkvdls2v)

gallery-dl is currently downloading everything in original size (https://github.com/mikf/gallery-dl/commit/55bbd49a0eccbf207c6833983d3a2a0ff6f73287). Guess I'll add an option for this.

edit: cookies don't work on bluesky. The site itself doesn't use cookies. You need to provide username and password to login, but you can remove password after the first login.

mikf avatar Feb 11 '24 13:02 mikf

It's confusing and a bit annoying how Bluesky does this image resizing. You know it's bad when Twitter is more consistent with file sizes than this new alternative. So, if a 3000 x 3000 pic is uploaded, it will always be downsized to 2000 x 2000 with no way to get the true original size, all after having to put up with image conversion and severe compression.

biznizz avatar Feb 11 '24 19:02 biznizz

Even though I call this original, it is still a modified version of the uploaded file, as in every file gets converted to JPEG and even uploaded JPEGs get re-compressed.

But I think this is something Bluesky does when uploading the images, no? I.e., I don’t think Bluesky stores the original image anywhere, only their (potentially downscaled) JPEG image.

edit: cookies don't work on bluesky. The site itself doesn't use cookies. You need to provide username and password to login, but you can remove password after the first login.

Is there a reason for asking for/using login information at all? Better rate limits? (As @qub1750ul mentioned earlier, all Bluesky posts (incl. images) are always public, so there’s no need for logging in to access them.)

Freso avatar Feb 25 '24 09:02 Freso

But I think this is something Bluesky does when uploading the images, no? I.e., I don’t think Bluesky stores the original image anywhere, only their (potentially downscaled) JPEG image.

Seems like it. Bluesky does not store the originally uploaded image.

Is there a reason for asking for/using login information at all?

Certain (private) feeds, like /likes or /lists/<LIST-ID>, only return posts when logged in.

You don't need to log in if all you want to do is download a user's media.

mikf avatar Feb 25 '24 14:02 mikf

I just updated to gain access to the Bluesky functionality, but I have a question. The Python script I run for a bot uses a separate downloader function, so I run gallery-dl with my usual works-with-everything command (using the example post above):

gallery-dl --get-urls --no-download --option search-endpoint=graphq1 https://bsky.app/profile/toomanyboners.bsky.social/post/3khucm2ygso2z

Rather than outputting the actual image url as https://cdn.bsky.app/img/feed_fullsize/plain/did:plc:zyctzyihzisjnrdoiw75xvhm/bafkreie534wdizaua3psm66jflig66hqc5itocukqk4ajsu7pk6ic23aii@jpeg which is what shows up in a browser, it spits out a blob

https://bsky.social/xrpc/com.atproto.sync.getBlob?did=did:plc:zyctzyihzisjnrdoiw75xvhm&cid=bafkreie534wdizaua3psm66jflig66hqc5itocukqk4ajsu7pk6ic23aii

Granted, I am seeing similarities between the URLs, so a bit of URL rewriting could give me what I need to pass to my separate download function. I'm just wondering whether there are any command-line flags that, when using --get-urls and --no-download, instead output the correct https://cdn.bsky.app/img/feed_fullsize/plain/did: URL instead of https://bsky.social/xrpc/com.atproto.sync.getBlob?did=did: ?
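
The URL rewriting mentioned above could be sketched roughly as follows. The helper name `blob_to_cdn` is hypothetical, and the CDN layout is inferred only from the two example URLs in this comment, so treat it as an assumption rather than a supported interface.

```python
from urllib.parse import parse_qs, urlsplit


def blob_to_cdn(blob_url: str, size: str = "feed_fullsize") -> str:
    """Turn a com.atproto.sync.getBlob URL into a cdn.bsky.app image URL.

    Assumes the blob URL carries `did` and `cid` query parameters and
    that the CDN serves the image as JPEG, as in the examples above.
    """
    query = parse_qs(urlsplit(blob_url).query)
    did = query["did"][0]  # account identifier, e.g. "did:plc:..."
    cid = query["cid"][0]  # content identifier of the blob
    return f"https://cdn.bsky.app/img/{size}/plain/{did}/{cid}@jpeg"


blob = (
    "https://bsky.social/xrpc/com.atproto.sync.getBlob"
    "?did=did:plc:zyctzyihzisjnrdoiw75xvhm"
    "&cid=bafkreie534wdizaua3psm66jflig66hqc5itocukqk4ajsu7pk6ic23aii"
)
print(blob_to_cdn(blob))
```

Passing `size="feed_thumbnail"` would give the smaller variant under the same assumption.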

quentinwolf avatar Feb 29 '24 01:02 quentinwolf

@quentinwolf See https://github.com/mikf/gallery-dl/issues/4438#issuecomment-1937738238

There is currently no such option, but I'd think original resolution is better than the upscaled-to-2000px version.

mikf avatar Feb 29 '24 01:02 mikf

instead output the correct

The https://bsky.social/xrpc/com.atproto.sync.getBlob?did=$DID&cid=$CID URL is the more “correct” URL since that one won’t break when(/if) other AT protocol nodes start getting added to the federation network. (DID is a unique user/account identifier across all AT protocol instances, CID is a unique content identifier.) The cdn.bsky.app URLs are implementation details specific to how Bluesky is handling the AT protocol and probably shouldn’t be considered stable (emphasis mine):

Blobs for a specific account can be listed and downloaded using endpoints in the com.atproto.sync.* NSID space. These endpoints give access to the complete original blob, as uploaded. A common pattern is for applications to mirror both the original blob and any downsized thumbnail or preview versions via separate URLs (eg, on a CDN), instead of deep-linking to the getBlob endpoint on the original PDS.

Freso avatar Feb 29 '24 15:02 Freso

These endpoints give access to the complete original blob, as uploaded

Now that's not true, at least for the bsky.app instance it isn't

https://bsky.social/xrpc/com.atproto.sync.getBlob?did=did:plc:cslxjqkeexku6elp5xowxkq7&cid=bafkreifhy4gmtrfp3ax7wx2l7ojabjabhcnxvieumend3iu3ghlpp4fuiq is not the same file I uploaded. Or is this URL somehow wrong, e.g. wrong CID?

mikf avatar Feb 29 '24 15:02 mikf

I think it’s true in that it’s the “complete original blob, as uploaded” by bsky.app to their storage backend, even if not by the user to bsky.app, hence also my earlier comment about Bluesky’s handling of uploaded images.

I haven’t looked at what’s going on in the browser, but the JPEGifying and (potential) downscaling could even be happening browser-side (I know that there are JavaScript libraries that do this anyway), so the original‐original might never touch any bsky.app infrastructure at all.

Freso avatar Mar 02 '24 21:03 Freso

JPEGifying and (potential) downscaling

Even JPEG files that don't get downscaled are modified: https://bsky.app/profile/mikf.bsky.social/post/3kkzcewddop2o

mikf avatar Mar 02 '24 21:03 mikf

Requesting that a unique ID be added: a string of numbers/letters unique to the account.

a84r7a3rga76fg avatar Mar 09 '24 02:03 a84r7a3rga76fg

Bluesky's equivalent to Twitter's user IDs as unchanging, unique IDs are DIDs.

Each user has a handle and a DID, and both can be used with gallery-dl.

https://bsky.app/profile/bsky.app
https://bsky.app/profile/did:plc:z72i7hdynmk6r22z27h6tvur

A user's DID can be found at author['did'] (or user['did'] when enabled).

gallery-dl --filter "print(author['did']) or abort()" https://bsky.app/profile/bsky.app
gallery-dl -o metadata=user --filter "print(user['did']) or abort()" https://bsky.app/profile/bsky.app

It is also included in -K and -j outputs.

mikf avatar Mar 09 '24 12:03 mikf

Doesn't work for archives because of the colon

[bluesky][warning] Failed to open download archive at 'D:\test/gallery-dl/archives/bluesky/did:plc:z72i7hdynmk6r22z27h6tvur.sqlite' (OperationalError: unable to open database file)

a84r7a3rga76fg avatar Mar 09 '24 21:03 a84r7a3rga76fg

Then replace : (:R:/_/) or remove the first 8 characters ([8:]) in your format string.
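
In plain Python terms, the two suggestions above amount to the transformations below. This only illustrates the effect on the DID string; how the expressions are spelled inside a gallery-dl format string (presumably along the lines of `{author[did]:R:/_/}` or `{author[did][8:]}`) is an assumption and should be checked against the gallery-dl documentation.

```python
did = "did:plc:z72i7hdynmk6r22z27h6tvur"

# Option 1: replace every colon so the value is safe in a Windows filename
replaced = did.replace(":", "_")

# Option 2: drop the first 8 characters, i.e. the "did:plc:" prefix
stripped = did[8:]

print(replaced)
print(stripped)
```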

mikf avatar Mar 09 '24 22:03 mikf

That worked. I want to request the equivalent of Mastodon extractor's {instance}. It's similar to {category} except it includes the domain as well.

a84r7a3rga76fg avatar Mar 09 '24 23:03 a84r7a3rga76fg

I don't think dots are allowed in the username. Can there be a version of author['handle'] without the domain (any text that comes after the first dot)? Such a keyword exists for Mastodon and Misskey.

a84r7a3rga76fg avatar Mar 20 '24 05:03 a84r7a3rga76fg

I don't think dots are allowed in the username. Can there be a version of author['handle'] without the domain (any text that comes after the first dot)? Such a keyword exists for Mastodon and Misskey.

Dots are allowed if you use a custom domain name. I know this for a fact cause I've done so with an alt account of mine (NSFW so can't post the name here).

EDIT: By this I mean sub-domains; for example, "@sub.epiclper.com" would be a valid Bluesky username.

EpicLPer avatar Mar 20 '24 06:03 EpicLPer

I don't think dots are allowed in the username.

In Bluesky/the AT protocol, usernames are domain names (or as the documentation says: Handles are DNS names), so not only are dots allowed, they are required to have at least one in them. :) Most languages will have libraries for handling domain names (or you can just split on . and grab the first part of the resulting array), so you can use that if you’re only interested in the sub‐most part of the domain name. Do keep in mind if you do that, that you shouldn’t expect those to be unique – e.g., @freso.dk and @freso.bsky.social would both resolve to freso.
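
A minimal sketch of the "split on . and grab the first part" approach described above; the function name `short_name` is made up for this example. It also shows why the result is not unique.

```python
def short_name(handle: str) -> str:
    """Return the left-most label of a handle, e.g. 'freso' for 'freso.dk'."""
    # Drop an optional leading "@", then keep everything before the first dot
    return handle.lstrip("@").split(".", 1)[0]


for handle in ("@freso.dk", "@freso.bsky.social", "bsky.app"):
    print(handle, short_name(handle))
```

Both `@freso.dk` and `@freso.bsky.social` collapse to the same short name, so it should not be used as a unique key; the DID is the safer choice for that.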

Freso avatar Mar 20 '24 15:03 Freso

I'm using this configuration:

    "bluesky": {
        "username": "[email protected]",
        "password": "bl;ahblah",
        "filename": "{createdAt[:19]}{post_id}{num}.{extension}",
        "directory": ["{category}", "{author[handle]}"],
        "include": "avatar,media",
        "reposts": false
        "retweets": false,
        "original": true,
        "cookies-update": true
    },

It scans the URLs I have in my text file, but it doesn't download the files it finds.

Kuroo2021 avatar Mar 21 '24 02:03 Kuroo2021