musicbrainz-userscripts icon indicating copy to clipboard operation
musicbrainz-userscripts copied to clipboard

bandcamp importer runs on every page

Open mtfurlan opened this issue 7 months ago • 13 comments

The regex for the bandcamp importer is too broad cause it can match any https url

example

"https://literally.anything".match(/^https:\/\/[^/]+(?:\/(?:album|track)\/[^/]+\/?|\/music\/?|\/?)$/) // [ "https://literally.anything/" ]

I think something like

^https:\/\/[^/]+(?:\/(?:album|track)\/[^/]+\/?|\/music\/?)$

is what is intended, that will only match stuff ending in /music, or /(album|track)/something?

I can write regexes if you can tell me what it should be matching/not matching

mtfurlan avatar May 15 '25 01:05 mtfurlan

Unfortunately, we cannot fully prevent it, because it needs to run on the discography/landing pages as well, so not just artist.bandcamp.com/music, but also artist.bandcamp.com. Take for example this label: https://moatun7.bandcamp.com, yes there's a /music page, but most of the people are using the landing page to navigate the collection.

Similarly, potentially we could run it only on *.bandcamp.com, but then there comes the problem with white-labeling (which bandcamp allows you to do if you pay for it), where the word bandcamp will not be present in the URL at all. Can't find a working example now, but in the past this artist had such a setup: https://3six.net/album/the-box. In these cases the script should also work.

Bad news, it needs to run pretty much anywhere. Good news, the fingerprint of this is tiny, because if the script doesn't find the TralbumData object in the window.document, it exits.

arsinclair avatar Jun 02 '25 03:06 arsinclair

If it needs to run anywhere, then it should say that, and not have a complicated regex that works out to anything in what looks like an accident

mtfurlan avatar Jun 02 '25 04:06 mtfurlan

Yes, perhaps you're right, it will make it more clear. @zas @murdos what do you think? If you agree, I can make the change.

arsinclair avatar Jun 02 '25 05:06 arsinclair

I think that's hard to avoid because of white-labeling (which was always an issue anyway), the match was made wider to handle those in https://github.com/murdos/musicbrainz-userscripts/commit/e2f318c128406341ecd323bd3a3e3d0a8a75f305 but it still required album or track in the URL.

The recent change to add links on discography (/music) including / because of landing page thing, actually make it matching anything.

The real match is https://github.com/murdos/musicbrainz-userscripts/commit/e2f318c128406341ecd323bd3a3e3d0a8a75f305#diff-06c163b15569b6a04b2da6f89363853aa2a2fa250208fdae59bbf45533382615R216 (finding TralbumData) anyway.

So, I think it's hard to avoid running anywhere if we actually want the script features to work in all cases. Not that I'm totally happy with that, but I think we don't have much choices.

If someone has a better solution to this... ;)

zas avatar Jun 02 '25 06:06 zas

Thank you for the additional context! I suggest then we wait for a week in case anyone has better ideas and after that just change the regex to match all pages explicitly.

arsinclair avatar Jun 02 '25 07:06 arsinclair

I would be fine with simply not supporting the edge case of custom domains without /music, /album or /track in the URL. Even more so since there is no actual example for this. IMO it is also not too much to ask of users to click the "Music" link if they want to use that feature.

kellnerd avatar Jun 02 '25 13:06 kellnerd

It seems, at least according to this https://get.bandcamp.help/hc/en-us/articles/23020691318551-How-do-I-set-up-a-custom-domain-on-Bandcamp, that due to recent changes in browser handling of third-party cookies, Bandcamp has implemented automatic redirects from custom domains to the original Bandcamp subdomains. This means that while someone can set up a custom domain, visitors will still see the Bandcamp URL in their browser's address bar after redirection. It's not clear what the nature of the problem is and whether they will be able to find a solution, but it might happen that white-labelled domains are not coming back.

I guess we can drop that case indeed, and only have two matchers, one for *.bandcamp.com and another one for *.(/music|/album|/track) (pseudo-regex).

arsinclair avatar Jun 02 '25 13:06 arsinclair

maybe

^https:\/\/[^/]+(?:\/(?:album|track)\/[^/]+\/?|\/music\/?|\.bandcamp.com.*)$

?

mtfurlan avatar Jun 02 '25 13:06 mtfurlan

https://customdomain.com/album
https://customdomain.com/track

didn't work, otherwise looks good to me.

Image

arsinclair avatar Jun 03 '25 04:06 arsinclair

We should separate the 2 cases:

  • Bandcamp:

^https:\/\/[^.]+\.bandcamp\.com(?:\/(?:(?:album|track))\/[^/]+|\/|\/music)$

It will match:

https://artist.bandcamp.com/
https://artist.bandcamp.com/music
https://artist.bandcamp.com/album/xxx
https://artist.bandcamp.com/track/xxx
  • Custom domain:

^https:\/\/[^/]+\/(?:(?:(?:album|track))\/[^/]+|music)$

It will match:

https://artist.domain/music
https://artist.domain/album/xxx
https://artist.domain/track/xxx

This one is wider, so it has to be first, but having both make it easier to disable the wider one if needed. Having one expression is possible but it will be longer/complex and would not allow one to restrict to Bandcamp domain as easily.

The second one is only matched for landing pages on Bandcamp itself (https://artist.bandcamp.com/), I'm not sure if we need to match https://artist.bandcamp.com or not (without any path).

// @include      /^https:\/\/[^/]+\/(?:(?:(?:album|track))\/[^/]+|music)$/
// @include      /^https:\/\/[^.]+\.bandcamp\.com(?:\/(?:(?:album|track))\/[^/]+|\/|\/music)$/

What do you think?

zas avatar Jun 03 '25 08:06 zas

I like this approach, but I think matching https://artist.bandcamp.com (without the slash at the end) is important enough, at least very important to me :)

Otherwise it might look like the script doesn't work at all when we're on the artist landing page. Currently none of the regexes match it.

arsinclair avatar Jun 03 '25 08:06 arsinclair

I like this approach, but I think matching https://artist.bandcamp.com (without the slash at the end) is important enough, at least very important to me :)

Otherwise it might look like the script doesn't work at all when we're on the artist landing page. Currently none of the regexes match it.

Would the following work for you? It matches x.bandcamp.com with or without trailing /

// @include      /^https:\/\/[^/]+\/(?:(?:(?:album|track))\/[^/]+|music)$/
// @include      /^https:\/\/([^.]+)\.bandcamp\.com((?:\/(?:(?:album|track))\/[^/]+|\/|\/music)?)$/

zas avatar Jun 03 '25 10:06 zas

@zas this seems to cover all the cases, awesome!

arsinclair avatar Jun 05 '25 05:06 arsinclair

This removed the ability to import from archive.org: https://github.com/murdos/musicbrainz-userscripts/pull/673/files#r2312609075

cybersphinx avatar Aug 31 '25 20:08 cybersphinx

Addressed in https://github.com/murdos/musicbrainz-userscripts/pull/684.

arsinclair avatar Sep 04 '25 22:09 arsinclair