cloudpathlib
cloudpathlib copied to clipboard
az:// URL scheme isn't really a URL because it is not unique
The existing 'az://' pseudo-URL scheme violates the principal of URLs in that it is not universal and unique.
The 's3://' bucket syntax is valid because S3 buckets are universally unique. You may need account credentials to access an S3 bucket but you don't need additional information to locate it.
With Azure blob storage you need the storage account as well. Without that you have no way to locate the blobs even if they are publically accessible.
A better scheme would be something like:
az://<storage-account>/<container>/...
which would turn into:
https://<storage-account>.blob.core.windows.net/<container>/...
You could allow the optional addition of .blob or .dfs, so one could write:
az://<storage-account>.dfs/<container>/...
This is a great point, thanks @analog-cbarber.
Do you know if there is a list of all the different kinds of storage URLs we might need to generate to work with Azure storage services somewhere? It would be ideal to make the design of this comprehensive.
Also, curious if as an ugly workaround passing a BlobServiceClient with the right storage account to AzureBlobClient then using that to generate paths works to at least be able to access the blobs properly.
I can indeed construct an AzureBlobClient to use with a AzureBlobPath that works, but it would be a lot nicer, if you could just directly construct an AzureBlobPath object with a string and perhaps an optional SAS token:
AzureBlobPath('az://myaccount/mycontainer?token')
or
AzureBlobPath('az://myaccount/mycontainer', token=token)
I don't know of a list of cloud storage container URLs, nor do I think that URL schemes necessarily even exist. Microsoft has not defined an azure:// or az:// scheme. I think it is fine to make up you own version, as long as it can be used to locate the resource without any extra information (apart from authentication credentials).
Since Azure cloud uses a REST api with https urls, you probably should also support the raw http url as well:
AzureBlobPath('https://myaccount.blob.core.widows.net/mycontainer/')
FYI, here is quick hack that extends the existing implementation to support the azure://<account>/<container>/... scheme:
@register_client_class('azure2')
class AzureClient(AzureBlobClient):
pass
@register_path_class('azure2')
class AzurePath(AzureBlobPath):
client: 'AzureClient'
cloud_prefix: str = "azure://"
def __init__(self, cloud_path: Union[str, CloudPath],
client: Optional[AzureBlobClient] = None,
token: Optional[str] = None
):
if isinstance(cloud_path, str):
parsed = urlparse(cloud_path)
m = re.match(r'(?P<account>[a-z0-9]+)(\.(?P<type>blob|dfs)(\.core\.windows\.net)?)?',
parsed.netloc,
flags=re.IGNORECASE)
if m is None:
raise ValueError(f"Bad azure path '{cloud_path}'")
account = m.group('account')
fstype = m.group('type') or 'blob'
account_url = f'https://{account}.{fstype}.core.windows.net/'
optional_type = '' if fstype == 'blob' else '.' + fstype
cloud_path = f"azure://{account}{optional_type}/{parsed.path.lstrip('/')}"
if client is None or parsed.query or token or client.service_client.account_name != account:
if token is not None:
token = '?' + token.lstrip('?')
elif parsed.query:
token = '?' + parsed.query
client = AzureClient(account_url, token)
super().__init__(cloud_path, client = client)
@classmethod
def is_valid_cloudpath(cls, path: Union[str, CloudPath], raise_on_error=False) -> bool:
valid = bool(re.match(r'(azure://|https://[a-z0-9]+\.(blob|dfs)\.core\.windows\.net)', str(path).lower()))
if raise_on_error and not valid:
raise InvalidPrefixError(
f"'{path}' is not a valid path since it does not start with '{cls.cloud_prefix}' "
"or valid Azure https blob or dfs location."
)
return valid
@property
def container(self) -> str:
return self._no_prefix.split('/',2)[1]
@property
def blob(self) -> str:
return super().blob.split('/',1)[1]