matorage

no metadata dir in a compressed bucket

Open jinserk opened this issue 5 years ago • 6 comments

Hi again,

Sorry for bothering you with several questions and bug reports, but this one looks critical. I made a compressed data bucket and it appears to store fine, but when I retrieve the dataset, it has length 0, and indexing it fails as follows:

Traceback (most recent call last):
  File "/home/jinserk/.pyenv/versions/3.8.5/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/jinserk/kyu/kyumlm/mlmanager/torch/workers.py", line 202, in run
    self.setup()
  File "/home/jinserk/kyu/kyumlm/mlmanager/torch/workers.py", line 190, in setup
    self.set_dataloaders()
  File "/home/jinserk/kyu/kyumlm/mlmanager/torch/workers.py", line 134, in set_dataloaders
    trainset, valset = self.set_datasets()
  File "/home/jinserk/kyu/kyumlm/tddft/ann/workers.py", line 88, in set_datasets
    print(dataset[0])
  File "/home/jinserk/kyu/kyumlm/tddft/ann/dataset.py", line 35, in __getitem__
    x = super().__getitem__(index)
  File "/home/jinserk/.pyenv/versions/kyumlm/lib/python3.8/site-packages/matorage/data/torch/dataset.py", line 81, in __getitem__
    return self._get_item_with_download(idx)
  File "/home/jinserk/.pyenv/versions/kyumlm/lib/python3.8/site-packages/matorage/data/torch/dataset.py", line 89, in _get_item_with_download
    _objectname, _relative_index = self._find_object(idx)
  File "/home/jinserk/.pyenv/versions/kyumlm/lib/python3.8/site-packages/matorage/data/data.py", line 128, in _find_object
    _key = self.end_indices[_key_idx]
IndexError: list index out of range

I've checked briefly and found that the bucket has no metadata from which to read the dataset's meta info. Can you fix this error? I have installed the latest master branch code.
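For readers following the traceback: the failing line indexes `self.end_indices`, which suggests a lookup over cumulative sample counts read from the metadata. The sketch below is not matorage's code; `find_object` and the layout of `end_indices` are assumptions made only to illustrate why an empty metadata list would raise this `IndexError` even for index 0.

```python
from bisect import bisect_right


def find_object(end_indices, idx):
    """Map a global sample index to (object position, index inside that object).

    Assumption: end_indices[i] is the cumulative number of samples stored
    up to and including object i, as it would be read from the metadata.
    """
    key_idx = bisect_right(end_indices, idx)
    # Raises IndexError when end_indices is empty or idx is past the last sample.
    end_indices[key_idx]
    start = end_indices[key_idx - 1] if key_idx > 0 else 0
    return key_idx, idx - start


# With metadata: three objects holding 100 samples each.
print(find_object([100, 200, 300], 150))  # (1, 50)

# Without metadata the list is empty, so even index 0 fails.
try:
    find_object([], 0)
except IndexError as exc:
    print("IndexError:", exc)
```

Under these assumptions, the error points at missing metadata rather than at the stored data itself.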

jinserk • Aug 28 '20 02:08

One more minor error I found: when I export the DataConfig to JSON, the itemsize info of a DataAttribute is not exported. Of course, I can add it manually.
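A round-trip check is one way to catch a dropped field like this. The sketch below uses a hypothetical `Attribute` dataclass as a stand-in (it is not matorage's `DataAttribute`, and `attributes_to_json` / `attributes_from_json` are illustrative names), so it shows only the idea of exporting every field and comparing after reload.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Attribute:
    """Hypothetical stand-in for an attribute description; not matorage's class."""
    name: str
    type: str
    shape: tuple
    itemsize: int = 0


def attributes_to_json(attributes):
    # asdict() walks every dataclass field, so itemsize cannot be skipped by accident.
    return json.dumps([asdict(a) for a in attributes])


def attributes_from_json(text):
    # JSON has no tuple type, so shape comes back as a list and is converted here.
    return [Attribute(**{**d, "shape": tuple(d["shape"])}) for d in json.loads(text)]


original = [Attribute("smiles", "string", (1,), itemsize=64)]
restored = attributes_from_json(attributes_to_json(original))
print(restored == original)  # True
```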

jinserk • Aug 28 '20 02:08

@jinserk

No, getting a lot of questions about this project doesn't bother me. Rather, I'm happy because it means the project can be improved.

First question: If there is no metadata at all, it means that the save was accidentally interrupted partway through. Therefore, it seems necessary to create a metadata recovery function for this case. Or maybe you forgot to call datasaver.disconnect.

Second question: Yes, itemsize is missing. I'll add this part as soon as possible.

To fix:

  • create a metadata.recover function
  • also save the itemsize option to the JSON file

So, for the first question, please double-check that your code is written correctly before this part is modified.
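One way to make a forgotten disconnect less likely is a try/finally around the save loop. The sketch below uses a hypothetical `FakeSaver` (not matorage's DataSaver) and assumes, for illustration only, that metadata is written when `disconnect()` runs.

```python
from itertools import accumulate


class FakeSaver:
    """Hypothetical stand-in: metadata is assumed to be written only on disconnect()."""

    def __init__(self):
        self.chunks = []
        self.metadata = None  # nothing is recorded until disconnect() runs

    def __call__(self, chunk):
        self.chunks.append(chunk)

    def disconnect(self):
        # Record the cumulative sample counts of everything saved so far.
        self.metadata = {"end_indices": list(accumulate(len(c) for c in self.chunks))}


saver = FakeSaver()
try:
    for chunk in ([1, 2, 3], [4, 5]):
        saver(chunk)
finally:
    # Runs even if saving raises, so the metadata step is never skipped.
    saver.disconnect()

print(saver.metadata)  # {'end_indices': [3, 5]}
```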

graykode • Aug 28 '20 08:08

@graykode You're right! I forgot datasaver.disconnect. Thank you so much! By the way, couldn't disconnect be called automatically from DataSaver.__del__()?

jinserk • Aug 28 '20 12:08

@jinserk Thanks for the great suggestion.

As you suggested, adding it to DataSaver.__del__() doesn't seem to cause any problems in terms of concurrency (multiprocessing). I will incorporate this. Thanks!

graykode • Aug 28 '20 12:08

@jinserk

Python's destructor (__del__) is not guaranteed to be triggered at the moment an object stops being used. Therefore, it seems better to manage this with Python's context manager protocol (__enter__, __exit__):

with DataSaver(...) as datasaver:
    datasaver(...)
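A minimal sketch of what such context manager support could look like follows. `ManagedSaver` is a hypothetical stand-in written for this example, not the DataSaver implementation; it only shows `__exit__` calling `disconnect()` whether or not the block raises.

```python
class ManagedSaver:
    """Hypothetical saver illustrating the context manager protocol."""

    def __init__(self):
        self.saved = []
        self.disconnected = False

    def __call__(self, item):
        self.saved.append(item)

    def disconnect(self):
        self.disconnected = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.disconnect()
        return False  # do not swallow exceptions raised inside the with-block


with ManagedSaver() as saver:
    saver({"x": 1})
print(saver.disconnected)  # True

# disconnect() also runs when the block fails partway through.
try:
    with ManagedSaver() as failing:
        raise RuntimeError("save interrupted")
except RuntimeError:
    pass
print(failing.disconnected)  # True
```

Returning False from `__exit__` keeps the original exception visible to the caller, so an interrupted save is still reported.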

graykode • Aug 29 '20 15:08

Looks great! I thought that __del__ was called as soon as the instance is destroyed, but it isn't. Sorry for the confusion!
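For anyone curious, here is a small demonstration of one case where `__del__` is delayed. It assumes CPython, where objects in a reference cycle are finalized only when the cyclic garbage collector runs; `Node` is an illustrative class unrelated to matorage.

```python
import gc

finalized = []


class Node:
    def __init__(self, name):
        self.name = name
        self.other = None

    def __del__(self):
        finalized.append(self.name)


gc.disable()  # keep the cyclic collector from running on its own during the demo
try:
    a, b = Node("a"), Node("b")
    a.other, b.other = b, a  # reference cycle: each object keeps the other alive
    del a, b
    before_collect = list(finalized)  # still empty: no __del__ has run yet
    gc.collect()
    after_collect = sorted(finalized)
finally:
    gc.enable()

print(before_collect)  # []
print(after_collect)   # ['a', 'b']
```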

jinserk • Aug 29 '20 16:08