no metadata dir in a compressed bucket
Hi again,
Sorry for bothering you with several questions and bug reports, but this one looks critical. I made a compressed data bucket and it appears to store correctly, but when I retrieve the dataset it has length 0, and indexing fails as follows:
Traceback (most recent call last):
  File "/home/jinserk/.pyenv/versions/3.8.5/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/jinserk/kyu/kyumlm/mlmanager/torch/workers.py", line 202, in run
    self.setup()
  File "/home/jinserk/kyu/kyumlm/mlmanager/torch/workers.py", line 190, in setup
    self.set_dataloaders()
  File "/home/jinserk/kyu/kyumlm/mlmanager/torch/workers.py", line 134, in set_dataloaders
    trainset, valset = self.set_datasets()
  File "/home/jinserk/kyu/kyumlm/tddft/ann/workers.py", line 88, in set_datasets
    print(dataset[0])
  File "/home/jinserk/kyu/kyumlm/tddft/ann/dataset.py", line 35, in __getitem__
    x = super().__getitem__(index)
  File "/home/jinserk/.pyenv/versions/kyumlm/lib/python3.8/site-packages/matorage/data/torch/dataset.py", line 81, in __getitem__
    return self._get_item_with_download(idx)
  File "/home/jinserk/.pyenv/versions/kyumlm/lib/python3.8/site-packages/matorage/data/torch/dataset.py", line 89, in _get_item_with_download
    _objectname, _relative_index = self._find_object(idx)
  File "/home/jinserk/.pyenv/versions/kyumlm/lib/python3.8/site-packages/matorage/data/data.py", line 128, in _find_object
    _key = self.end_indices[_key_idx]
IndexError: list index out of range
I checked briefly and found that the bucket has no metadata object from which to read the dataset's meta information.
Can you fix this error? I have installed the latest master branch code.
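For reference, here is roughly how I create and read the compressed bucket. This is a simplified sketch rather than my real code: the endpoint, credentials, dataset name, attribute shapes, and the exact compressor argument are placeholders written from memory of the documented API.

import torch
from matorage import DataConfig, DataAttribute, DataSaver
from matorage.torch import Dataset

data_config = DataConfig(
    endpoint="127.0.0.1:9000",          # placeholder MinIO endpoint
    access_key="minio",                 # placeholder credentials
    secret_key="miniosecretkey",
    dataset_name="my_dataset",          # placeholder dataset name
    attributes=[
        DataAttribute("x", "float64", (128,)),
        DataAttribute("y", "float64", (1,)),
    ],
    compressor={"complevel": 4, "complib": "zlib"},  # compression settings (argument name from memory)
)

# Write side: save batches into the bucket.
batches = [(torch.rand(8, 128), torch.rand(8, 1))]  # placeholder for my real data loop
datasaver = DataSaver(config=data_config)
for x, y in batches:
    datasaver({"x": x, "y": y})
# (the save loop ends here; nothing else is called on datasaver)

# Read side: open the same config as a torch dataset.
dataset = Dataset(config=data_config)
print(len(dataset))                     # prints 0
print(dataset[0])                       # raises the IndexError above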
One more minor issue I found: when I export the DataConfig to JSON, the itemsize info of a DataAttribute is not exported. Of course, I can add it manually.
@jinserk
No, questions about this project don't bother me at all. On the contrary, I'm happy that they help improve the project.
First question: if there is no metadata in the bucket, it means the save was interrupted partway through.
Therefore, it seems necessary to create a metadata recovery function for this case. Or maybe you forgot to call datasaver.disconnect.
Second question: yes, itemsize is missing. I'll add it as soon as possible.
To fix:
- create a metadata recover function
- also save the itemsize option to the JSON file
So, for the first question, please double-check that your code was written correctly before I modify this part.
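Until a recover function exists, a defensive pattern on your side is to make sure disconnect always runs, even if saving fails partway through. A minimal sketch, assuming the DataConfig from your snippet (`data_config` and `batches` are placeholders):

from matorage import DataSaver

datasaver = DataSaver(config=data_config)   # data_config as in your snippet above
try:
    for x, y in batches:                    # placeholder for the real data loop
        datasaver({"x": x, "y": y})
finally:
    # disconnect writes the metadata to the bucket; skipping it leaves the
    # bucket without metadata, which produces exactly the IndexError above.
    datasaver.disconnect()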
@graykode You're right! I forgot datasaver.disconnect. Thank you so much! By the way, couldn't disconnect be called automatically from DataSaver.__del__()?
@jinserk Thanks for the great suggestion.
As you suggested, adding it to DataSaver.__del__() doesn't seem to cause any problem in terms of concurrency (multiprocessing). I will incorporate this. Thanks!
@jinserk
Actually, the Python destructor is not guaranteed to be called when the object's lifetime ends. Therefore, it seems better to manage this with Python's context manager protocol (__enter__, __exit__):
with DataSaver(...) as datasaver:
    datasaver(...)
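For illustration, a minimal runnable sketch using a trimmed-down stand-in for DataSaver; only the context-manager hook-up is the point here, and the stub disconnect just mimics the role of the real method:

class DataSaver:
    """Trimmed-down stand-in for matorage's DataSaver, showing only the hook-up."""

    def disconnect(self):
        # In the real class, this is where the metadata gets written to the bucket.
        print("disconnect called")

    def __enter__(self):
        # Entering the `with` block hands back the saver itself.
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Always finalize when leaving the `with` block, even on exceptions.
        self.disconnect()
        return False  # do not suppress exceptions

with DataSaver() as datasaver:
    pass  # saving calls would go here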
Looks great! I thought __del__ was called when the instance is destroyed, but it isn't... sorry for confusing you!