
AssertionError(assert len(self._object_file_mapper) == (len(self.merged_indexer) + len(self.merged_filetype)))

Open jinserk opened this issue 5 years ago • 13 comments

Can I ask you what this error stands for?

Traceback (most recent call last):
  File "/home/jinserk/.pyenv/versions/3.8.5/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/jinserk/kyu/kyumlm/mlmanager/torch/workers.py", line 202, in run
    self.setup()
  File "/home/jinserk/kyu/kyumlm/mlmanager/torch/workers.py", line 190, in setup
    self.set_dataloaders()
  File "/home/jinserk/kyu/kyumlm/mlmanager/torch/workers.py", line 134, in set_dataloaders
    trainset, valset = self.set_datasets()
  File "/home/jinserk/kyu/kyumlm/tddft/ann/workers.py", line 65, in set_datasets
    dataset = MatorageAnnDataset(trainset_config, clear=True)
  File "/home/jinserk/.pyenv/versions/kyumlm/lib/python3.8/site-packages/matorage/data/torch/dataset.py", line 73, in __init__
    super(Dataset, self).__init__(config, **kwargs)
  File "/home/jinserk/.pyenv/versions/kyumlm/lib/python3.8/site-packages/matorage/data/data.py", line 80, in __init__
    self._init_download()
  File "/home/jinserk/.pyenv/versions/kyumlm/lib/python3.8/site-packages/matorage/data/data.py", line 189, in _init_download
    assert len(self._object_file_mapper) == (len(self.merged_indexer) + len(self.merged_filetype))
AssertionError

jinserk avatar Aug 28 '20 19:08 jinserk

Of course. Could you share all the files related to the metadata (that is, the files inside the metadata directory)?

graykode avatar Aug 28 '20 19:08 graykode

Here is the only file in the metadata dir. The dataset name and host/port info have been censored for security. Thank you! 6bd037556e8842d6.zip

jinserk avatar Aug 28 '20 21:08 jinserk

If I comment out the assertion, retrieving data from the minio server works anyway. However, I get lots of annoying log messages like these:

2020/08/28 17:16:43 EDT [INFO] mlmanager.torch.workers (workers.py:56) set device cpu as rank 0                                                                                                                                                                                 
08/28/2020 17:16:43 - INFO - matorage.utils - PID: 1074302 -  PyTorch version 1.6.0 available.                                                                                                                                                                                  
08/28/2020 17:16:43 - INFO - matorage.utils - PID: 1074302 -  PyTorch Vision version 0.7.0 available.                                                                                                                                                                           
08/28/2020 17:16:46 - INFO - matorage.utils - PID: 1074424 -  PyTorch version 1.6.0 available.                                                                                                                                                                                  
08/28/2020 17:16:46 - INFO - matorage.utils - PID: 1074424 -  PyTorch Vision version 0.7.0 available.                                                                                                                                                                           
08/28/2020 17:16:46 - INFO - matorage.utils - PID: 1074441 -  PyTorch version 1.6.0 available.                                                                                                                                                                                  
08/28/2020 17:16:46 - INFO - matorage.utils - PID: 1074441 -  PyTorch Vision version 0.7.0 available.                                                                                                                                                                           
08/28/2020 17:16:47 - INFO - matorage.utils - PID: 1074487 -  PyTorch version 1.6.0 available.                                                                                                                                                                                  
08/28/2020 17:16:47 - INFO - matorage.utils - PID: 1074487 -  PyTorch Vision version 0.7.0 available.                                                                                                                                                                           
08/28/2020 17:16:47 - INFO - matorage.utils - PID: 1074506 -  PyTorch version 1.6.0 available.                                                                                                                                                                                  
08/28/2020 17:16:47 - INFO - matorage.utils - PID: 1074506 -  PyTorch Vision version 0.7.0 available.                                                                                                                                                                           
2020/08/28 17:17:06 EDT [INFO] mlmanager.torch.workers (workers.py:316) train:  epoch 0001  lr 5.0000e-04  loss 0.191976                                                                                                                                                        
08/28/2020 17:17:07 - INFO - matorage.utils - PID: 1078057 -  PyTorch version 1.6.0 available.                                                                                                                                                                                  
08/28/2020 17:17:07 - INFO - matorage.utils - PID: 1078057 -  PyTorch Vision version 0.7.0 available.                                                                                                                                                                           
08/28/2020 17:17:07 - INFO - matorage.utils - PID: 1078055 -  PyTorch version 1.6.0 available.                                                                                                                                                                                  
08/28/2020 17:17:07 - INFO - matorage.utils - PID: 1078055 -  PyTorch Vision version 0.7.0 available.                                                                                                                                                                           
08/28/2020 17:17:07 - INFO - matorage.utils - PID: 1078056 -  PyTorch version 1.6.0 available.                                                                                                                                                                                  
08/28/2020 17:17:07 - INFO - matorage.utils - PID: 1078056 -  PyTorch Vision version 0.7.0 available.                                                                                                                                                                           
08/28/2020 17:17:07 - INFO - matorage.utils - PID: 1078054 -  PyTorch version 1.6.0 available.                                                                                                                                                                                  
08/28/2020 17:17:07 - INFO - matorage.utils - PID: 1078054 -  PyTorch Vision version 0.7.0 available.                                                                                                                                                                           
2020/08/28 17:17:08 EDT [INFO] mlmanager.torch.workers (workers.py:350) validate:  epoch 0001  loss 0.099826                                                                                                                                                                    
2020/08/28 17:17:08 EDT [INFO] mlmanager.torch.workers (workers.py:237) epoch 0001  ave_train_loss 0.191976  ave_val_loss 0.099826             

Can I turn them off? Sorry for lots of questions and bug reports.

jinserk avatar Aug 28 '20 21:08 jinserk
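
One possible way to quiet these messages is sketched below. It assumes matorage emits them through the standard logging module under the logger name matorage.utils, as the log lines suggest, and that the library does not reset the level itself; neither assumption is confirmed in this thread.

```python
import logging

# Assumption: the INFO lines come from a standard `logging` logger named
# "matorage.utils" (the name shown in the log output above).
# Raising the level on the parent "matorage" logger hides INFO records from
# every child logger that has not been given its own explicit level.
logging.getLogger("matorage").setLevel(logging.WARNING)
```

Because the messages come from several PIDs, the same call may also need to run inside each worker process.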

@jinserk

Thank you for the detailed bug report!

While analyzing the bug you reported, I found a few more bugs related to the NAS. The first is that the sub-JSON files of the metadata could not be read properly, which I solved by modifying the list_objects function of the NAS:

    def list_objects(self, bucket_name, prefix="", recursive=False):
        _foldername = os.path.join(self.path, bucket_name, prefix)
        if not recursive:
            objects = [
                os.path.join(prefix, f) for f in os.listdir(_foldername)
            ]
        else:
            objects = [
                os.path.join(dp, f) for dp, dn, fn in os.walk(_foldername) for f in fn
            ]
        return [Obj(o) for o in objects if o.startswith(prefix)]

The second one is related to assert len(self._object_file_mapper) == (len(self.merged_indexer) + len(self.merged_filetype)). I confirmed that this error is caused by a mismatch between the metadata on the remote server and the locally cached metadata.

This cache maps each minio key to the location of the downloaded file when the dataset is loaded. If you use the NAS setting, you don't actually need this caching.

Solution

  • First, delete the matorage cache (rm ~/.matorage/*.json).
  • Second, apply the hotfix code.
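
For anyone who prefers to clear the cache from Python rather than the shell, a minimal sketch follows. The function name clear_matorage_cache is made up for this example; the only thing taken from the thread is the location of the cache, ~/.matorage/*.json.

```python
from pathlib import Path


def clear_matorage_cache(cache_dir=Path.home() / ".matorage"):
    """Delete cached *.json files in cache_dir and return the removed paths.

    Hypothetical helper, equivalent to `rm ~/.matorage/*.json`.
    """
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        # Nothing cached yet, so there is nothing to remove.
        return []
    removed = []
    for json_file in sorted(cache_dir.glob("*.json")):
        json_file.unlink()
        removed.append(json_file)
    return removed
```

Calling clear_matorage_cache() with no arguments targets the default cache folder; passing another directory makes it easy to try out safely first.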

One thing I'd like to ask: did you use an IPv4 address as the endpoint when using the NAS settings?

graykode avatar Aug 29 '20 17:08 graykode

Yes, I used an IPv4 address. I'll check your solution ASAP. Thank you so much for the prompt solution!

jinserk avatar Aug 30 '20 04:08 jinserk

@jinserk When using a NAS, you must use a local path rather than an IPv4 address.

For example:

from matorage import DataConfig

# NAS example
data_config = DataConfig(
    endpoint='/tmp/shared',
    dataset_name='mnist',
    additional={
        "framework" : "pytorch",
        "mode" : "training"
    },
    compressor={
        "complevel" : 0,
        "complib" : "zlib"
    },
    attributes=[
        ('image', 'float32', (28, 28)),
        ('target', 'int64', (1, ))
    ]
)

If you use an IPv4 address for the endpoint, the connection is established over HTTP. If you use a local path for the endpoint instead, it is much faster because it does not go through HTTP (it is just a file copy from folder to folder). Also, if you use an HTTP endpoint in the dataloader, the data is downloaded to all nodes unconditionally. This code might be helpful: https://github.com/graykode/matorage/blob/master/matorage/data/data.py#L178
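
As a rough illustration of the two addressing styles, the sketch below classifies an endpoint string. The helper name endpoint_kind and its rule (a leading slash means a local NAS path, anything else is treated as host:port) are assumptions made for this example, not matorage's actual check.

```python
def endpoint_kind(endpoint: str) -> str:
    """Classify an endpoint string as 'nas' (local path) or 'http' (host:port).

    Hypothetical rule for illustration only: an absolute POSIX path is
    treated as a NAS folder, everything else as an HTTP endpoint.
    """
    if endpoint.startswith("/"):
        return "nas"
    return "http"


# Endpoints that appear in this thread.
for example in ("/tmp/shared", "/mnt/hdd1/kyu/matorage", "127.0.0.1:9000"):
    print(example, "->", endpoint_kind(example))
```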

graykode avatar Aug 30 '20 08:08 graykode

Hi @graykode, thanks for the suggestion. I didn't know that such a 'local path' addressing method existed. I have changed the addressing, and it currently seems to work well with DataSaver. I'll check it with Dataset once all the data has finished uploading.

By the way, I have two questions related to this:

  • When using local path addressing, does it go through the minio docker server or access the local path directly? I found that the old files and dirs in the path were owned by root, since the minio server runs with root permission. However, when I use local path addressing, newly created files and dirs are owned by my own user, which could be problematic when I share the newly uploaded dataset with other users on the same server. Am I correct?
  • If local path addressing accesses the files and dirs directly, can they also be explored or updated through an IPv4 addressing connection? I mean, if I have a multiple-node configuration for training a huge model, and I want to use a directory on only one root node (namely the rank0 node here) as the matorage storage, can I set the rank0 node to local_path addressing while the other rank nodes use ipv4 addressing at the same time?

jinserk avatar Aug 31 '20 01:08 jinserk

It looks like I cannot use local_path addressing and ipv4 addressing at the same time:

Process TrainProcess-1:
Traceback (most recent call last):
  File "/home/jinserk/.pyenv/versions/3.8.5/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/jinserk/kyu/kyumlm/mlmanager/torch/workers.py", line 202, in run
    self.setup()
  File "/home/jinserk/kyu/kyumlm/mlmanager/torch/workers.py", line 190, in setup
    self.set_dataloaders()
  File "/home/jinserk/kyu/kyumlm/mlmanager/torch/workers.py", line 134, in set_dataloaders
    trainset, valset = self.set_datasets()
  File "/home/jinserk/kyu/kyumlm/tddft/ann/workers.py", line 64, in set_datasets
    trainset_config = nas.DataConfig.from_json_file("train.json")
  File "/home/jinserk/.pyenv/versions/kyumlm/lib/python3.8/site-packages/matorage/data/config.py", line 312, in from_json_file
    return cls(**config_dict)
  File "/home/jinserk/.pyenv/versions/kyumlm/lib/python3.8/site-packages/matorage/data/config.py", line 131, in __init__
    self._check_all()
  File "/home/jinserk/.pyenv/versions/kyumlm/lib/python3.8/site-packages/matorage/data/config.py", line 140, in _check_all
    self._check_bucket()
  File "/home/jinserk/.pyenv/versions/kyumlm/lib/python3.8/site-packages/matorage/data/config.py", line 242, in _check_bucket
    raise ValueError(
ValueError: Already created endpoint(/mnt/hdd1/kyu/matorage) doesn't current endpoint str(127.0.0.1:9000) It may occurs permission denied error

jinserk avatar Aug 31 '20 02:08 jinserk

@jinserk

  • First question: On macOS I did not encounter this permission error, but on Linux a permission error does occur. Thank you for the troubleshooting. This seems to be a minio-related error, so I'll look for a solution.
  • Second question: As far as I know, this is possible. In other words, nodes that physically use the NAS can access it through the local path, and other nodes can access it over HTTP.

graykode avatar Aug 31 '20 11:08 graykode

@jinserk

I found a solution for the first question: run the minio binary directly instead of using minio docker: https://github.com/minio/minio#gnulinux

wget https://dl.min.io/server/minio/release/linux-amd64/minio
chmod +x minio
# run minio in the background
nohup ./minio gateway nas /home/nlkey2022/shared &

I don't know why we get a permission error with the minio docker NAS setup. I will open an issue on the minio repository.

graykode avatar Aug 31 '20 11:08 graykode

@graykode I guess this is because minio runs as root when using docker, but runs with user permission when using the local binary. I guess if you run minio with root permission, the result will be the same:

sudo -H nohup ./minio gateway nas /home/nlkey2022/shared &
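
To see which files under the shared folder ended up with a different owner (for example root, if the Docker-based server created them), a small sketch like the following could help. The helper files_not_owned_by is hypothetical and relies only on the standard library; it assumes a POSIX system.

```python
import os


def files_not_owned_by(root, uid=None):
    """Return paths under root whose owner differs from uid.

    Hypothetical helper; uid defaults to the effective user of this process.
    """
    uid = os.geteuid() if uid is None else uid
    mismatched = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat inspects the entry itself and does not follow symlinks.
            if os.lstat(path).st_uid != uid:
                mismatched.append(path)
    return mismatched
```

For example, files_not_owned_by("/home/nlkey2022/shared") would list entries that the current user does not own.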

In my quick and humble opinion, we need to look at minio's set_bucket_policy to make the files or dirs public. Please check here, even though it's for minio-java, not minio-py. Of course, I could be wrong, and I don't want to mislead you.

jinserk avatar Aug 31 '20 15:08 jinserk

@jinserk

I don't actually know minio's configuration in detail, so I will look into it. Thank you.

I'll leave a comment in this thread when I find more options!! :)

graykode avatar Sep 01 '20 12:09 graykode

Here is a step-by-step look at why this error occurs.

  1. The dataset on minio was updated using the same dataset_name and dataset_additional.
  2. However, the JSON cached locally, that is, the files in the ~/.matorage folder, was not updated.
  3. Currently, the files in the ~/.matorage folder must be deleted manually; logic to handle this automatically still needs to be implemented.
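
The missing logic could look something like the sketch below. It is only an illustration: cache_is_consistent and drop_stale_cache are hypothetical helpers, not part of matorage, and they assume the three collections from the assertion are available as sized objects.

```python
import os


def cache_is_consistent(object_file_mapper, merged_indexer, merged_filetype):
    """Same condition as the failing assertion, returned as a boolean."""
    return len(object_file_mapper) == len(merged_indexer) + len(merged_filetype)


def drop_stale_cache(cache_path, object_file_mapper, merged_indexer, merged_filetype):
    """Remove the cached JSON file if it disagrees with the remote metadata.

    Returns True when the cache was stale, False when it was consistent.
    Hypothetical helper for illustration only.
    """
    if cache_is_consistent(object_file_mapper, merged_indexer, merged_filetype):
        return False
    if os.path.exists(cache_path):
        os.remove(cache_path)
    return True
```

With something like this in place, a stale cache would be discarded and rebuilt instead of raising an AssertionError.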

graykode avatar Nov 17 '20 09:11 graykode