adlfs
                        Existing file marked as non-existing
What happened:
fs.isfile(existing_file_path) incorrectly returns False and gives a warning
EDIT: Output is
False
RuntimeWarning: coroutine 'AzureBlobFileSystem._details' was never awaited
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
What you expected to happen:
Return True without a warning
Minimal Complete Verifiable Example:
import adlfs, fsspec, os

storage_options = {
    'account_name': os.environ['AZ_STORAGE_ACCOUNT_NAME'],
    'account_key': os.environ['AZ_STORAGE_ACCOUNT_KEY']
}
az_storage_container_name = os.environ['AZ_STORAGE_CONTAINER_NAME']
dataset_id = 'mydataset'  # placeholder: name of a blob known to exist under data/datasets

fs = fsspec.filesystem('abfs', **storage_options)
base_path = f'abfs://{az_storage_container_name}/data/datasets'
existing_file_path = f'{base_path}/{dataset_id}'
fs.isfile(existing_file_path)
Anything else we need to know?:
- Continuation of https://github.com/dask/adlfs/issues/261, now with patched adlfs and on existing files (vs. previously on non-existing ones)
 
Environment:
- fsspec 2021.07.0 (conda)
- adlfs 2021.08.1 (pip, not yet on conda)
- Docker / Ubuntu 18.04 / Python 3.7
@hayesgb Digging a bit more: switching to asynchronous=True and calling await fs._isfile(existing_file_path) does not work around the issue; the warning still triggers and the wrong result is still returned.
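For reference, a minimal sketch of that async attempt, reusing storage_options and existing_file_path from the MCVE above:

import asyncio
import fsspec

async def check():
    # asynchronous=True exposes the coroutine methods (e.g. _isfile) directly
    fs = fsspec.filesystem('abfs', asynchronous=True, **storage_options)
    return await fs._isfile(existing_file_path)

print(asyncio.run(check()))  # still False, and the RuntimeWarning still fires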
@hayesgb (Continuing from https://github.com/dask/adlfs/issues/261)
Just tried from HEAD:
- [ ] existing files: isfile() is quickly & incorrectly returning False; no async warning anymore
- [ ] existing dirs: isdir() is slowly but correctly returning True; I suspect it is downloading the folders
- [x] non-existing paths: isfile() is quickly & correctly returning False
- [x] non-existing paths: isdir() is quickly & correctly returning False

(Timing sketch for the four cases below.)
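To pin down those four cases, a minimal timing harness (reusing the fs from the MCVE; paths follow the anonymized shape shown next, so they are placeholders):

import time

def timed(label, fn, path):
    t0 = time.time()
    result = fn(path)
    print(f'{label}: {result} in {time.time() - t0:.2f}s')

timed('existing file, isfile', fs.isfile, 'somecontainer/mydata/mydata2/myfile')
timed('existing dir,  isdir',  fs.isdir,  'somecontainer/mydata/mydata2')
timed('missing path,  isfile', fs.isfile, 'somecontainer/mydata/missing')
timed('missing path,  isdir',  fs.isdir,  'somecontainer/mydata/missing')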
Also if it helps, my paths look like:
abfs://somecontainer/mydata/mydata2/myfile
Would you mind posting the result of: fs.details("somecontainer/mydata/mydata2/abc")
fs.details("somecontainer/mydata/mydata2/abc") raises:
AttributeError: 'AzureBlobFileSystem' object has no attribute 'details'
FYI, having more luck with variants of:
# needs the aio variant of the client: pip install azure-storage-blob
from azure.storage.blob.aio import BlobServiceClient

async def aexists_dir(path):
    # conn_str and az_storage_container_name come from the surrounding config/env
    blob_service_client = BlobServiceClient.from_connection_string(conn_str)
    async with blob_service_client:
        container_client = blob_service_client.get_container_client(az_storage_container_name)
        async for myblob in container_client.list_blobs(name_starts_with=path):
            # a first listed blob whose name differs from the path itself means
            # the path is a prefix ("directory") rather than a lone blob
            return myblob['name'] != path
    return False
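For completeness, a sketch of driving it from synchronous code (the prefix is a placeholder):

import asyncio

print(asyncio.run(aexists_dir('mydata/mydata2')))  # True when blobs exist under the prefix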
Thanks. I may end up updating to this. I asked about details earlier, but could you post the result of fs.info(path)? I'm trying to create a test case for this.
{
  "metadata": None,
  "creation_time": datetime.datetime(2020, 9, 29, 0, 16, 6, tzinfo=datetime.timezone.utc),
  "deleted": None,
  "deleted_time": None,
  "last_modified": datetime.datetime(2021, 8, 13, 15, 35, 35, tzinfo=datetime.timezone.utc),
  "content_settings": {
    "content_type": "application/x-gzip",
    "content_encoding": None,
    "content_language": None,
    "content_md5": bytearray(b"*****"),
    "content_disposition": None,
    "cache_control": None
  },
  "remaining_retention_days": None,
  "archive_status": None,
  "last_accessed_on": None,
  "etag": "*****",
  "tags": None,
  "tag_count": None,
  "name": "mycontainer/myfolder/myfile",
  "size": 4332,
  "type": "file"
}
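If it helps with the test case, a hedged sketch of the regression check this info output suggests (the path and the fs fixture are placeholders):

def test_isfile_on_existing_blob(fs):
    path = 'mycontainer/myfolder/myfile'
    info = fs.info(path)
    assert info['type'] == 'file'
    # regression: isfile() previously returned False with a never-awaited coroutine warning
    assert fs.isfile(path)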
Thanks for the help here @lmeyerov. Release 2021.08.2 should fix the errors with isfile.
Can you share an example of the slow isdir? It does call cc.list_blobs. Are there a very large number of blobs in the location you're scanning?
Yes - it's a potentially big folder (named parquet dumps); in this case I wouldn't be surprised by 1K-10K files. I think the async list_blobs paginates, though I'm unsure how to keep the pages reasonably small. That's part of the reason we're trying to do only asyncio with adlfs: ensuring even occasional blips will not starve out other tasks.
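A sketch of bounding that listing cost with the SDK's results_per_page keyword (worth verifying against the installed azure-storage-blob version), reusing the aio container_client from the earlier sketch:

# returns on the first blob under the prefix, so at most one page
# (capped at results_per_page entries) is ever fetched
async def ahas_prefix(container_client, prefix):
    async for _ in container_client.list_blobs(name_starts_with=prefix, results_per_page=1):
        return True
    return False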
@lmeyerov -- I just refactored _isdir on the accel_isdir branch. It passes all the tests, and completely eliminates the list_blobs call. Would appreciate your feedback if you have a chance to check it out.
Sure -- will check on Th/F (am traveling)
At the same time, if anything is happening around async multi-connection downloads of individual + folder blobs, happy to help check there. Currently investigating how to do that via Azure's SDK, but we'd rather have it unified under fsspec!
Cool. Just curious -- on the multi-connection downloads -- are you looking to use Dask or is the use case async multithreading?
- Currently single-node / multicore. Our Azure GPU VMs have something like 2-8 NICs at 8-32 Gbps, and I think AWS/GCP end up similar, so we're focusing on saturating abfs => SSD writes with that. Multi-node may be interesting early next year, but we're not there yet :)
 
RE: async multithreading, the az sdk has parallel connection support with a configurable number of streams, which seems like a fine first step (sketch at the end of this comment).
- Our other common use case is reading directly via dask_cudf.read_parquet, and it may have some funny NUMA behavior to consider for remote reads, but not sure yet. Local reads are via GPU Direct Storage, and I believe there may be network extensions for GPU Direct as well....
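As a sketch of that parallel-stream knob (connection string, container, and paths are placeholders), azure-storage-blob's download path takes a max_concurrency argument:

from azure.storage.blob import BlobClient

blob = BlobClient.from_connection_string(
    conn_str,                          # placeholder connection string
    container_name='somecontainer',
    blob_name='mydata/mydata2/myfile',
)
with open('/mnt/ssd/myfile', 'wb') as f:
    # max_concurrency splits a large blob download across parallel connections
    blob.download_blob(max_concurrency=8).readinto(f)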