
Azure : RuntimeError: Decompression 'SNAPPY' not available. Options: ['GZIP', 'LZ4', 'UNCOMPRESSED']

Open arnabbiswas1 opened this issue 4 years ago • 3 comments

Steps to reproduce:

I created a Dask cluster inside an AzureML environment using the following code:

amlcluster = AzureMLCluster(ws,
                            vm_size="STANDARD_D1",
                            environment_definition=ws.environments['AzureML-Dask-CPU'], 
                            initial_node_count=0, 
                            scheduler_idle_timeout=10800,
                            vnet='vnet',
                            subnet='subnet',
                            vnet_resource_group='resourcegroup',
                            ct_name="biswasdask",
)

Next, open JupyterLab using the link returned by amlcluster.jupyter_link.

As I understand it, I am now on the scheduler node of the cluster.

In the Jupyter notebook, run the following code (from the azureml-examples repository):

import dask.dataframe as dd
from adlfs import AzureBlobFileSystem

container_name = "isdweatherdatacontainer"
storage_options = {"account_name": "azureopendatastorage"}

fs = AzureBlobFileSystem(**storage_options)
files = fs.glob(f"{container_name}/ISDWeather/year=2020/month=2/part-00003-tid-695161346761253622-368439cf-81e6-43f1-be5d-49ba29e282c0-2567-2.c000.snappy.parquet")
ddf = dd.read_parquet(files, storage_options=storage_options, chunksize="20MB")

ddf.head()

It returns the following error:

RuntimeError: Decompression 'SNAPPY' not available. Options: ['GZIP', 'LZ4', 'UNCOMPRESSED']

This seems to be an old issue. But since I did not create this environment manually, I don't know what the problem is.
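One way to narrow this down is to check whether the optional Snappy bindings can be imported in the environment at all. The snippet below is only a minimal sketch: it assumes the error comes from a parquet engine that relies on a separately installed Python package for Snappy, and the module name `snappy` (the one installed by python-snappy) and the helper `missing_modules` are illustrative assumptions, not something confirmed for the AzureML-Dask-CPU environment.

```python
import importlib.util


def missing_modules(module_names):
    """Return the names from module_names that cannot be imported in this process."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


# Assumption: "snappy" is the top-level module provided by python-snappy.
CODEC_MODULES = ["snappy"]

# An empty list means every listed module is importable here.
print("Missing in this process:", missing_modules(CODEC_MODULES))
```

Because the check only tells you about the process it runs in, it would need to be repeated on the workers. With a `dask.distributed.Client` connected to the cluster, something like `client.run(missing_modules, CODEC_MODULES)` should return the result for each worker, and `client.run_on_scheduler(missing_modules, CODEC_MODULES)` the result for the scheduler.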

arnabbiswas1, Nov 19 '20 12:11

This is an open source project, so I really can't complain. But while trying to work with dask-cloudprovider (for Azure), I am encountering issue after issue at different steps. That makes me quite concerned about the basic sanity and stability of the product.

Further to that, I see this commit in the azureml-examples repository:

"remove dask-cloudprovider givne instability and lack of support"

Given this, I am not sure whether I should continue trying to use dask_cloudprovider within an Azure ML pipeline (as part of my day job).

I would appreciate it if anyone from the dask-cloudprovider team could briefly describe the current status of the project.

arnabbiswas1, Nov 19 '20 12:11

Thanks for taking the time to raise these issues @arnabbiswas1.

Dask Cloudprovider contains cluster managers for a variety of different cloud platforms. Currently, the AzureMLCluster is maintained by the AzureML team.

We are working to add a new cluster manager for Azure in #175, which will use Azure VMs directly instead of the AzureML API. The AzureML folks have indicated that they want to remove the AzureMLCluster in favour of the new, more generic AzureVMCluster.

jacobtomlinson, Nov 19 '20 13:11

Thanks for your quick and detailed reply. That helps me prioritize my work.

I will wait for the new cluster manager for Azure and then pick this back up. I am eagerly looking forward to it.

Thanks for all the great work you are doing. :love_you_gesture:

arnabbiswas1, Nov 19 '20 13:11