AssertionError(assert len(self._object_file_mapper) == (len(self.merged_indexer) + len(self.merged_filetype)))
Can I ask what this error means?
Traceback (most recent call last):
File "/home/jinserk/.pyenv/versions/3.8.5/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/home/jinserk/kyu/kyumlm/mlmanager/torch/workers.py", line 202, in run
self.setup()
File "/home/jinserk/kyu/kyumlm/mlmanager/torch/workers.py", line 190, in setup
self.set_dataloaders()
File "/home/jinserk/kyu/kyumlm/mlmanager/torch/workers.py", line 134, in set_dataloaders
trainset, valset = self.set_datasets()
File "/home/jinserk/kyu/kyumlm/tddft/ann/workers.py", line 65, in set_datasets
dataset = MatorageAnnDataset(trainset_config, clear=True)
File "/home/jinserk/.pyenv/versions/kyumlm/lib/python3.8/site-packages/matorage/data/torch/dataset.py", line 73, in __init__
super(Dataset, self).__init__(config, **kwargs)
File "/home/jinserk/.pyenv/versions/kyumlm/lib/python3.8/site-packages/matorage/data/data.py", line 80, in __init__
self._init_download()
File "/home/jinserk/.pyenv/versions/kyumlm/lib/python3.8/site-packages/matorage/data/data.py", line 189, in _init_download
assert len(self._object_file_mapper) == (len(self.merged_indexer) + len(self.merged_filetype))
AssertionError
Of course. Could you show all the files related to the metadata? (That is, the files inside the metadata directory.)
Here is the only file in the metadata dir. The dataset name and host/port info have been redacted for security. Thank you!
6bd037556e8842d6.zip
If I comment out the assertion, it still works and retrieves the data from the minio server. However, I see a lot of noisy log messages like:
2020/08/28 17:16:43 EDT [INFO] mlmanager.torch.workers (workers.py:56) set device cpu as rank 0
08/28/2020 17:16:43 - INFO - matorage.utils - PID: 1074302 - PyTorch version 1.6.0 available.
08/28/2020 17:16:43 - INFO - matorage.utils - PID: 1074302 - PyTorch Vision version 0.7.0 available.
08/28/2020 17:16:46 - INFO - matorage.utils - PID: 1074424 - PyTorch version 1.6.0 available.
08/28/2020 17:16:46 - INFO - matorage.utils - PID: 1074424 - PyTorch Vision version 0.7.0 available.
08/28/2020 17:16:46 - INFO - matorage.utils - PID: 1074441 - PyTorch version 1.6.0 available.
08/28/2020 17:16:46 - INFO - matorage.utils - PID: 1074441 - PyTorch Vision version 0.7.0 available.
08/28/2020 17:16:47 - INFO - matorage.utils - PID: 1074487 - PyTorch version 1.6.0 available.
08/28/2020 17:16:47 - INFO - matorage.utils - PID: 1074487 - PyTorch Vision version 0.7.0 available.
08/28/2020 17:16:47 - INFO - matorage.utils - PID: 1074506 - PyTorch version 1.6.0 available.
08/28/2020 17:16:47 - INFO - matorage.utils - PID: 1074506 - PyTorch Vision version 0.7.0 available.
2020/08/28 17:17:06 EDT [INFO] mlmanager.torch.workers (workers.py:316) train: epoch 0001 lr 5.0000e-04 loss 0.191976
08/28/2020 17:17:07 - INFO - matorage.utils - PID: 1078057 - PyTorch version 1.6.0 available.
08/28/2020 17:17:07 - INFO - matorage.utils - PID: 1078057 - PyTorch Vision version 0.7.0 available.
08/28/2020 17:17:07 - INFO - matorage.utils - PID: 1078055 - PyTorch version 1.6.0 available.
08/28/2020 17:17:07 - INFO - matorage.utils - PID: 1078055 - PyTorch Vision version 0.7.0 available.
08/28/2020 17:17:07 - INFO - matorage.utils - PID: 1078056 - PyTorch version 1.6.0 available.
08/28/2020 17:17:07 - INFO - matorage.utils - PID: 1078056 - PyTorch Vision version 0.7.0 available.
08/28/2020 17:17:07 - INFO - matorage.utils - PID: 1078054 - PyTorch version 1.6.0 available.
08/28/2020 17:17:07 - INFO - matorage.utils - PID: 1078054 - PyTorch Vision version 0.7.0 available.
2020/08/28 17:17:08 EDT [INFO] mlmanager.torch.workers (workers.py:350) validate: epoch 0001 loss 0.099826
2020/08/28 17:17:08 EDT [INFO] mlmanager.torch.workers (workers.py:237) epoch 0001 ave_train_loss 0.191976 ave_val_loss 0.099826
Can I turn them off? Sorry for all the questions and bug reports.
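One way to silence these, assuming matorage emits them through Python's standard logging module under logger names starting with "matorage" (which the format of the lines above suggests), is a sketch like:

import logging

# Raise the level of the parent "matorage" logger so INFO messages from
# matorage submodules (e.g. matorage.utils) are suppressed.
logging.getLogger("matorage").setLevel(logging.WARNING)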
@jinserk
Thank you for the detailed bug report!
While analyzing the bug you reported, I found a few more bugs related to the NAS.
First, the sub-JSON files of the metadata were not being read correctly; this is solved by modifying the list_objects function of the NAS storage:
def list_objects(self, bucket_name, prefix="", recursive=False):
    _foldername = os.path.join(self.path, bucket_name, prefix)
    if not recursive:
        # Only the entries directly under the prefix folder.
        objects = [
            os.path.join(prefix, f) for f in os.listdir(_foldername)
        ]
    else:
        # Walk into sub-folders so nested metadata JSON files are found too.
        objects = [
            os.path.join(dp, f) for dp, dn, fn in os.walk(_foldername) for f in fn
        ]
    return [Obj(o) for o in objects if o.startswith(prefix)]
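As a standalone illustration of the difference (not matorage code; the directory layout below is made up), os.listdir only sees the entries directly under the metadata folder, while os.walk also finds the sub-JSON files:

import os

root = "/tmp/shared/dataset/metadata"  # hypothetical metadata folder on the NAS

# Non-recursive: only the names directly inside `root`, so JSON files in
# sub-folders are missed.
flat = os.listdir(root)

# Recursive: os.walk descends into sub-folders, so nested JSON files are
# picked up as well.
nested = [
    os.path.join(dirpath, _dirs_unused, name)[0:0] or os.path.join(dirpath, name)
    for dirpath, _dirs_unused, filenames in os.walk(root)
    for name in filenames
]
print(flat)
print(nested)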
The second one is related to assert len(self._object_file_mapper) == (len(self.merged_indexer) + len(self.merged_filetype)).
This error is caused by a mismatch between the metadata on the remote server and the locally cached metadata.
The cache maps the location of each downloaded file to its minio key when the dataset is loaded. If you use the NAS setting, you don't actually need this caching.
Solution
- First, delete the matorage cache (rm ~/.matorage/*.json).
- Second, apply the hotfix code.
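A minimal sketch of the first step in Python, equivalent to the rm command above and assuming the cache is stored as plain *.json files under ~/.matorage:

import glob
import os

# Delete every cached metadata-mapping JSON under ~/.matorage.
cache_dir = os.path.expanduser("~/.matorage")
for cached in glob.glob(os.path.join(cache_dir, "*.json")):
    os.remove(cached)
    print("removed", cached)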
One thing I'd like to ask: did you use an IPv4 address when using the NAS settings?
Yes, I used an IPv4 address. I'll check your solution ASAP. Thank you so much for the prompt fix!
@jinserk When using a NAS, you must use a local path rather than an IPv4 address.
For example:
from matorage import DataConfig

# NAS example
data_config = DataConfig(
    endpoint='/tmp/shared',
    dataset_name='mnist',
    additional={
        "framework" : "pytorch",
        "mode" : "training"
    },
    compressor={
        "complevel" : 0,
        "complib" : "zlib"
    },
    attributes=[
        ('image', 'float32', (28, 28)),
        ('target', 'int64', (1, ))
    ]
)
If you use an IPv4 address for the endpoint, the connection goes through the HTTP protocol. If you use a local path for the endpoint instead, it is much faster because it skips HTTP entirely (it is just a file copy from folder to folder). Also, if you use an HTTP endpoint in the dataloader, the data is downloaded to all nodes unconditionally. This code might be helpful to check: https://github.com/graykode/matorage/blob/master/matorage/data/data.py#L178
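For comparison, a minio (HTTP) endpoint would look roughly like the sketch below; the credentials are placeholders, and the access_key/secret_key parameter names are an assumption here:

from matorage import DataConfig

# HTTP example (assumed parameter names; credentials are placeholders)
data_config = DataConfig(
    endpoint='127.0.0.1:9000',
    access_key='<ACCESS_KEY>',
    secret_key='<SECRET_KEY>',
    dataset_name='mnist',
    additional={
        "framework" : "pytorch",
        "mode" : "training"
    },
    attributes=[
        ('image', 'float32', (28, 28)),
        ('target', 'int64', (1, ))
    ]
)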
Hi @graykode,
Thanks for the suggestion. I didn't know such a 'local path' addressing method existed. I have changed the addressing, and currently it seems to work well with DataSaver. I'll check it with Dataset after all the data uploading is completed.
By the way, I have two questions related to this:
- When using the local path addressing, does it go through the minio docker server or access the local path directly? I found that the old files and dirs in the path were owned by root, since the minio server runs with root permission. However, when I use the local path addressing, newly created files and dirs have my own user permission, which means it could be problematic when I share the newly uploaded dataset with other users on the same server. Am I correct?
- If the local path addressing accesses the files and dirs directly, can they also be explored or updated through another IPv4 addressing connection? I mean, if I have a multiple-node configuration for training a huge model, but I want to use a directory as the matorage storage on only one root node (namely the rank0 node here), can I set the rank0 node to local path addressing while the other rank nodes use IPv4 addressing at the same time?
It looks like I cannot use the local path addressing and the IPv4 addressing at the same time:
Process TrainProcess-1:
Traceback (most recent call last):
File "/home/jinserk/.pyenv/versions/3.8.5/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/home/jinserk/kyu/kyumlm/mlmanager/torch/workers.py", line 202, in run
self.setup()
File "/home/jinserk/kyu/kyumlm/mlmanager/torch/workers.py", line 190, in setup
self.set_dataloaders()
File "/home/jinserk/kyu/kyumlm/mlmanager/torch/workers.py", line 134, in set_dataloaders
trainset, valset = self.set_datasets()
File "/home/jinserk/kyu/kyumlm/tddft/ann/workers.py", line 64, in set_datasets
trainset_config = nas.DataConfig.from_json_file("train.json")
File "/home/jinserk/.pyenv/versions/kyumlm/lib/python3.8/site-packages/matorage/data/config.py", line 312, in from_json_file
return cls(**config_dict)
File "/home/jinserk/.pyenv/versions/kyumlm/lib/python3.8/site-packages/matorage/data/config.py", line 131, in __init__
self._check_all()
File "/home/jinserk/.pyenv/versions/kyumlm/lib/python3.8/site-packages/matorage/data/config.py", line 140, in _check_all
self._check_bucket()
File "/home/jinserk/.pyenv/versions/kyumlm/lib/python3.8/site-packages/matorage/data/config.py", line 242, in _check_bucket
raise ValueError(
ValueError: Already created endpoint(/mnt/hdd1/kyu/matorage) doesn't current endpoint str(127.0.0.1:9000) It may occurs permission denied error
@jinserk
- First question: On macOS I did not encounter such a permission error, but on Linux a permission error does occur. Thank you for the troubleshooting. This seems to be a minio-related issue, so I'll look for a solution.
- Second question: As far as I know, this is possible. In other words, nodes that physically mount the NAS can access it through the local path, and the other nodes can access it through the HTTP protocol; a rough sketch is below.
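A rough sketch of what that per-node choice could look like, assuming the endpoint can simply be switched on the node's rank; RANK, the NAS path, and the HTTP endpoint below are hypothetical, and the ValueError above suggests the cached bucket check may still need to be cleared first:

import os

from matorage import DataConfig

# Hypothetical per-node endpoint selection: the rank-0 node mounts the NAS
# and uses the local path, the other ranks go through the minio HTTP
# endpoint (which would also need credentials, omitted here).
rank = int(os.environ.get("RANK", "0"))
endpoint = "/mnt/hdd1/kyu/matorage" if rank == 0 else "127.0.0.1:9000"

data_config = DataConfig(
    endpoint=endpoint,
    dataset_name='mnist',
    additional={
        "framework" : "pytorch",
        "mode" : "training"
    },
    attributes=[
        ('image', 'float32', (28, 28)),
        ('target', 'int64', (1, ))
    ]
)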
@jinserk
I found a solution for the first one. This is how to run the minio binary without using minio docker: https://github.com/minio/minio#gnulinux
wget https://dl.min.io/server/minio/release/linux-amd64/minio
chmod +x minio
# run minio in the background
nohup ./minio gateway nas /home/nlkey2022/shared &
I don't know why we get a permission error with the minio docker NAS gateway. I will open an issue on minio about it.
@graykode
I guess this is because when using docker, minio runs as root, while the local binary runs with the user's permission. If you run minio with root permission, I guess the result will be the same:
sudo -H nohup ./minio gateway nas /home/nlkey2022/shared &
In my quick and humble opinion, we need to look at minio's set_bucket_policy to make the files or dirs public. Please check here, even though it's minio-java rather than minio-py. Of course I could be wrong, and I'm afraid of misleading you.
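For reference, a minimal sketch with minio-py of what setting a public-read policy could look like; the bucket name and credentials are placeholders, and whether this actually resolves the ownership issue is untested here:

import json

from minio import Minio

# Placeholder endpoint, credentials, and bucket name.
client = Minio("127.0.0.1:9000",
               access_key="<ACCESS_KEY>",
               secret_key="<SECRET_KEY>",
               secure=False)

# Anonymous read-only policy for every object in the bucket.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": ["*"]},
            "Action": ["s3:GetBucketLocation", "s3:ListBucket"],
            "Resource": ["arn:aws:s3:::<BUCKET>"],
        },
        {
            "Effect": "Allow",
            "Principal": {"AWS": ["*"]},
            "Action": ["s3:GetObject"],
            "Resource": ["arn:aws:s3:::<BUCKET>/*"],
        },
    ],
}
client.set_bucket_policy("<BUCKET>", json.dumps(policy))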
@jinserk
I don't actually know the detailed configuration of minio, so I will look into it. Thank you.
I'll follow up in this thread when I find more options!! :)
A step-by-step look at why this error occurs:
- On the minio side, the dataset was updated with the same dataset_name and dataset_additional.
- However, the JSON cached locally, that is, the files in the ~/.matorage folder, was not updated.
- Currently, the files in the ~/.matorage folder must be deleted manually; the logic to handle this automatically should be implemented later (a rough sketch follows).
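A rough sketch of what that invalidation logic might look like, purely as an illustration; the function and its arguments are hypothetical and not part of matorage:

import json
import os


def invalidate_stale_cache(cache_path, remote_indexer, remote_filetype):
    # Hypothetical helper: drop the local mapping cache when it no longer
    # matches the remote metadata, i.e. when the condition behind the
    # failing assertion does not hold.
    if not os.path.exists(cache_path):
        return
    with open(cache_path) as f:
        mapper = json.load(f)
    if len(mapper) != len(remote_indexer) + len(remote_filetype):
        os.remove(cache_path)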