clearml
Artefact/Dataset upload statistics look wrong
When uploading a dataset to the clearml-server, the upload progress statistics report an uploaded size that is far larger than the total file size. The total file size itself appears to be correct: around 550MB in this example.
Example of the output:
2021-06-04 13:06:20,897 - clearml.storage - INFO - Uploading: 33024.00MB / 531.72MB @ 16346.09MBs from /tmp/dataset.96785ab93024437d81f6fa27cf7152df.zip
2021-06-04 13:06:21,037 - clearml.storage - INFO - Uploading: 33540.00MB / 531.72MB @ 3677.48MBs from /tmp/dataset.96785ab93024437d81f6fa27cf7152df.zip
2021-06-04 13:06:21,052 - clearml.storage - INFO - Uploading: 34060.00MB / 531.72MB @ 34766.44MBs from /tmp/dataset.96785ab93024437d81f6fa27cf7152df.zip
2021-06-04 13:06:21,216 - clearml.storage - INFO - Uploading: 34584.00MB / 531.72MB @ 3189.15MBs from /tmp/dataset.96785ab93024437d81f6fa27cf7152df.zip
2021-06-04 13:06:21,226 - clearml.storage - INFO - Uploading: 35112.00MB / 531.72MB @ 56624.71MBs from /tmp/dataset.96785ab93024437d81f6fa27cf7152df.zip
2021-06-04 13:06:21,352 - clearml.storage - INFO - Uploading: 35643.72MB / 531.72MB @ 4216.48MBs from /tmp/dataset.96785ab93024437d81f6fa27cf7152df.zip
Upload completed (557.55 MB)
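Looking at the numbers above, the jump between consecutive lines grows by 4MB each time (516, 520, 524, 528MB), and the last jump is exactly the 531.72MB total. That is the pattern you would get if a progress callback were handed the cumulative amount sent so far and the reporter added it to a running total as if it were a per-chunk delta. This is only an inference from the log, not something confirmed in the clearml source. The 4MB chunk size and the function names below are assumptions for illustration. A minimal sketch that reproduces the logged values:

```python
def cumulative_progress(total_mb, chunk_mb):
    """Yield the cumulative MB sent after each chunk (hypothetical callback input)."""
    sent = 0.0
    while sent < total_mb:
        sent = min(sent + chunk_mb, total_mb)
        yield sent


def misreported_totals(total_mb, chunk_mb):
    """Sum cumulative values as if they were per-chunk deltas (the suspected bug)."""
    reported = 0.0
    totals = []
    for current in cumulative_progress(total_mb, chunk_mb):
        reported += current  # assumed bug: 'current' is already a running total
        totals.append(round(reported, 2))
    return totals


# Assumption: 4MB chunks and the 531.72MB total taken from the log above.
totals = misreported_totals(total_mb=531.72, chunk_mb=4.0)
print(totals[-6:])
# -> [33024.0, 33540.0, 34060.0, 34584.0, 35112.0, 35643.72]
```

The last six values match the six "Uploading" lines in the log, which would be consistent with a cumulative value being double counted somewhere in the upload progress path.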
In this particular example, the following script was used to upload a directory of images which represents a training dataset:
import os
import argparse
# ClearML modules
from clearml import Dataset
parser = argparse.ArgumentParser(description='CUB200 2011 ClearML data uploader - Ed Morris (c) 2021')
parser.add_argument(
    '--dataset-basedir',
    dest='dataset_basedir',
    type=str,
    help='The directory to the root of the dataset',
    default='/home/edmorris/projects/image_classification/caltech_birds/data/images')
parser.add_argument(
    '--clearml-project',
    dest='clearml_project',
    type=str,
    help='The name of the clearml project that the dataset will be stored and published to.',
    default='Caltech Birds/Datasets')
parser.add_argument(
    '--clearml-dataset-url',
    dest='clearml_dataset_url',
    type=str,
    help='Location of where the dataset files should be stored. Default is Azure Blob Storage. Format is azure://storage_account/container',
    default='azure://clearmllibrary/datasets')
args = parser.parse_args()
for task_type in ['train','test']:
    print('[INFO] Versioning and uploading {0} dataset for CUB200 2011'.format(task_type))
    dataset = Dataset.create('cub200_2011_{0}_dataset'.format(task_type), dataset_project=args.clearml_project)
    dataset.add_files(path=os.path.join(args.dataset_basedir,task_type), verbose=False)
    dataset.upload(output_url=args.clearml_dataset_url)
    print('[INFO] {0} Dataset finalized....'.format(task_type), end='')
    dataset.finalize()
    print('[INFO] {0} Dataset published....'.format(task_type), end='')
    dataset.publish()
I just compared two uploads of the same dataset, one to Azure Blob and the other to local storage on clearml-server. The local storage upload didn't report any statistics, so it might be confined to the cloud storage method, and specifically Azure.
I also just ran a model which pulled the dataset from Azure Blob Storage, and the download seemed to produce correct statistics:
2021-06-04 13:34:21,708 - clearml.storage - INFO - Downloading: 13.00MB / 550.10MB @ 32.59MBs from azure://clearmllibrary/datasets/Caltech Birds%2FDatasets/cub200_2011_train_dataset.37a8f00931b04952a1500e3ada831022/artifacts/data/dataset.37a8f00931b04952a1500e3ada831022.zip
2021-06-04 13:34:21,754 - clearml.storage - INFO - Downloading: 21.00MB / 550.10MB @ 175.54MBs from azure://clearmllibrary/datasets/Caltech Birds%2FDatasets/cub200_2011_train_dataset.37a8f00931b04952a1500e3ada831022/artifacts/data/dataset.37a8f00931b04952a1500e3ada831022.zip
2021-06-04 13:34:21,791 - clearml.storage - INFO - Downloading: 29.00MB / 550.10MB @ 218.32MBs from azure://clearmllibrary/datasets/Caltech Birds%2FDatasets/cub200_2011_train_dataset.37a8f00931b04952a1500e3ada831022/artifacts/data/dataset.37a8f00931b04952a1500e3ada831022.zip
2021-06-04 13:34:21,819 - clearml.storage - INFO - Downloading: 37.00MB / 550.10MB @ 282.70MBs from azure://clearmllibrary/datasets/Caltech Birds%2FDatasets/cub200_2011_train_dataset.37a8f00931b04952a1500e3ada831022/artifacts/data/dataset.37a8f00931b04952a1500e3ada831022.zip
2021-06-04 13:34:21,843 - clearml.storage - INFO - Downloading: 45.00MB / 550.10MB @ 334.24MBs from azure://clearmllibrary/datasets/Caltech Birds%2FDatasets/cub200_2011_train_dataset.37a8f00931b04952a1500e3ada831022/artifacts/data/dataset.37a8f00931b04952a1500e3ada831022.zip
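To compare upload and download output more systematically, one option is to parse the progress lines and flag any where the transferred amount exceeds the reported total. The sketch below assumes the `Uploading: X MB / Y MB` and `Downloading: X MB / Y MB` format seen in the logs in this issue. The regular expression and the `overshoots` helper are hypothetical and not part of clearml.

```python
import re

# Matches the "<direction>: <done>MB / <total>MB" part of a clearml.storage progress line.
PROGRESS_RE = re.compile(
    r"(Uploading|Downloading): (\d+(?:\.\d+)?)MB / (\d+(?:\.\d+)?)MB"
)


def overshoots(log_line):
    """Return True if the line reports more MB transferred than the total.

    Returns None when the line is not a progress line.
    """
    match = PROGRESS_RE.search(log_line)
    if match is None:
        return None
    done_mb = float(match.group(2))
    total_mb = float(match.group(3))
    return done_mb > total_mb


# Excerpts of the progress lines quoted in this issue.
upload_excerpt = "clearml.storage - INFO - Uploading: 33024.00MB / 531.72MB @ 16346.09MBs"
download_excerpt = "clearml.storage - INFO - Downloading: 13.00MB / 550.10MB @ 32.59MBs"

print(overshoots(upload_excerpt))    # upload line exceeds its total
print(overshoots(download_excerpt))  # download line stays within its total
```

Run against a full log, this would make it easy to confirm whether only the upload lines are affected.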
Hi @ecm200
... it might be confined to the cloud storage method, and specifically Azure.
Yes, I have the same suspicion. It seems to report correctly with S3/GS, so it might be an Azure-specific issue. Let me see if we can reproduce it.
Hi @ecm200
Sorry this took so long. Does the issue still persist? Please let us know.