
[BUG] sample_by on view does not work

Open daniel-falk opened this issue 3 years ago • 3 comments

🐛🐛 Bug Report

Hi, when I run a sample by query on the result of another query (a view), it fails with an exception.

ds = deeplake.load("hub://activeloop/mnist-train")
ds2 = ds.query("select * limit 1000")
ds2.query("select * sample by max_weight(contains(labels, '1'): 2, true: 1) limit 10").labels.numpy()

This fails with exception:

IndexError                                Traceback (most recent call last)
<ipython-input-147-431c4577b4bb> in <cell line: 1>()
----> 1 ds2.query("select * sample by max_weight(contains(labels, '1'): 2, true: 1) limit 10").labels.numpy()

~/src/Hub/deeplake/core/dataset/dataset.py in query(self, query_string)
   1709         from deeplake.enterprise import query
   1710 
-> 1711         return query(self, query_string)
   1712 
   1713     def sample_by(

~/.pyenv/versions/3.10.0/lib/python3.10/site-packages/humbug/report.py in wrapped_callable(*args, **kwargs)
    443             self.feature_report(callable.__name__, parameters)
    444 
--> 445             return callable(*args, **kwargs)
    446 
    447         return wrapped_callable

~/src/Hub/deeplake/enterprise/libdeeplake_query.py in query(dataset, query_string)
     39     dsv = ds.query(query_string)
     40     indexes = dsv.indexes
---> 41     return dataset[indexes]
     42 
     43 

~/src/Hub/deeplake/core/dataset/dataset.py in __getitem__(self, item, is_iteration)
    456                 ret = self.__class__(
    457                     storage=self.storage,
--> 458                     index=self.index[item],
    459                     group_index=self.group_index,
    460                     read_only=self._read_only,

~/src/Hub/deeplake/core/index/index.py in __getitem__(self, item)
    374             return new_index
    375         elif isinstance(item, list):
--> 376             return self[(tuple(item),)]  # type: ignore
    377         elif isinstance(item, Index):
    378             return self[tuple(v.value for v in item.values)]  # type: ignore

~/src/Hub/deeplake/core/index/index.py in __getitem__(self, item)
    371             for idx, sub_item in enumerate(item):
    372                 ax = new_index.find_axis(offset=idx)
--> 373                 new_index = new_index.compose_at(sub_item, ax)
    374             return new_index
    375         elif isinstance(item, list):

~/src/Hub/deeplake/core/index/index.py in compose_at(self, item, i)
    330             return Index(self.values + [IndexEntry(item)])
    331         else:
--> 332             new_values = self.values[:i] + [self.values[i][item]] + self.values[i + 1 :]
    333             return Index(new_values)
    334 

~/src/Hub/deeplake/core/index/index.py in __getitem__(self, item)
    189                 return IndexEntry(self.value[item])
    190             elif isinstance(item, (tuple, list)):
--> 191                 new_value = tuple(self.value[idx] for idx in item)
    192                 return IndexEntry(new_value)
    193 

~/src/Hub/deeplake/core/index/index.py in <genexpr>(.0)
    189                 return IndexEntry(self.value[item])
    190             elif isinstance(item, (tuple, list)):
--> 191                 new_value = tuple(self.value[idx] for idx in item)
    192                 return IndexEntry(new_value)
    193 

IndexError: tuple index out of range
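
The last frame of the traceback indexes a tuple with every position in the requested list. The sketch below reproduces only that step in plain Python, without deeplake. `compose` and `view_value` are hypothetical stand-ins, and the out-of-range position 1500 is made up for illustration; why the query result would contain positions beyond the view's length is not established here.

```python
def compose(view_value, item):
    """Mimic the tuple/list branch shown at index.py line 191 in the traceback."""
    return tuple(view_value[idx] for idx in item)


# Stand-in for the index held by ds2, a view with 1000 rows.
view_value = tuple(range(1000))

# Positions that fall inside the view compose without error.
print(compose(view_value, [0, 10, 999]))

# Any position >= len(view_value) raises the error seen in the report.
try:
    compose(view_value, [0, 10, 1500])  # 1500 is a hypothetical out-of-range position
except IndexError as exc:
    print(type(exc).__name__, exc)
```

If the positions returned by the inner query refer to the full dataset rather than to the view, this is the failure one would expect, but that is an assumption, not a confirmed diagnosis.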

Using versions:

deeplake==3.1.0
libdeeplake==0.0.29

It would also be nice if I could automatically sample to get a uniform distribution instead of using weights, because now I need to do the query in multiple steps:

  • Filter on any metadata that I am interested in
  • Calculate the class imbalance
  • Sample by the inverse of the class imbalance
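
The manual steps above can be sketched roughly as follows. This is only an illustration: `inverse_frequency_weights` and `build_sample_by_query` are hypothetical helpers, not part of deeplake, and the query string simply extends the `max_weight(contains(labels, ...): w, ...)` shape used earlier in this report. Whether the query engine accepts more than two branches or non-integer weights is an assumption.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return {class: total / count}, so rarer classes get larger weights."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: total / n for cls, n in sorted(counts.items())}


def build_sample_by_query(weights, tensor="labels", limit=10):
    """Format a sample-by query string modeled on the one in this report."""
    branches = ", ".join(
        f"contains({tensor}, '{cls}'): {w:g}" for cls, w in weights.items()
    )
    return f"select * sample by max_weight({branches}) limit {limit}"


# Toy labels; in practice these would come from the filtered view,
# e.g. flattened from ds2.labels.numpy().
labels = [0, 0, 0, 1, 2, 2]
weights = inverse_frequency_weights(labels)
query = build_sample_by_query(weights)
print(query)
```

The resulting string would then be passed to `query` on the filtered view, which is exactly the call that currently fails.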

daniel-falk, Nov 23 '22 15:11

Hey, @daniel-falk. Can you try this on main? This PR should have fixed the issue: https://github.com/activeloopai/deeplake/pull/2018

AbhinavTuli, Nov 23 '22 15:11

It does not seem to work for me on main either:

In [6]: deeplake.__version__
Out[6]: '3.1.1'

In [8]: deeplake.__file__
Out[8]: '/home/daniel/src/Hub/deeplake/__init__.py'

In [10]: !cd /home/daniel/src/Hub/deeplake/ && git log -n1
commit c2c64607a42e3135923bb529e729118c4d4cdf2a (HEAD -> main, origin/main, origin/HEAD)
Author: Abhinav Tuli <[email protected]>
Date:   Wed Nov 23 18:45:55 2022 +0530

    Handle repeated samples in shuffle (#2018)
    
    * fix
    
    * Fix.
    
    * fix for fix
    
    Co-authored-by: Sasun Hambardzumyan <[email protected]>                                   
                                                                      
In [11]: ds = deeplake.load("hub://activeloop/mnist-train")
hub://activeloop/mnist-train loaded successfully.                                                                                            
This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/activeloop/mnist-train
                                                                      
In [12]: ds2 = ds.query("select * limit 1000")

In [13]: ds2.query("select * sample by max_weight(contains(labels, '1'): 2, true: 1) limit 10").labels.numpy()
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-13-431c4577b4bb> in <cell line: 1>()
----> 1 ds2.query("select * sample by max_weight(contains(labels, '1'): 2, true: 1) limit 10").labels.numpy()

~/src/Hub/deeplake/core/dataset/dataset.py in query(self, query_string)
   1702         from deeplake.enterprise import query
   1703 
-> 1704         return query(self, query_string)
   1705 
   1706     def sample_by(

~/.pyenv/versions/3.10.0/lib/python3.10/site-packages/humbug/report.py in wrapped_callable(*args, **kwargs)
    443             self.feature_report(callable.__name__, parameters)
    444 
--> 445             return callable(*args, **kwargs)
    446 
    447         return wrapped_callable

~/src/Hub/deeplake/enterprise/libdeeplake_query.py in query(dataset, query_string)
     39     dsv = ds.query(query_string)
     40     indexes = dsv.indexes
---> 41     return dataset[indexes]
     42 
     43 

~/src/Hub/deeplake/core/dataset/dataset.py in __getitem__(self, item, is_iteration)
    456                 ret = self.__class__(
    457                     storage=self.storage,
--> 458                     index=self.index[item],
    459                     group_index=self.group_index,
    460                     read_only=self._read_only,

~/src/Hub/deeplake/core/index/index.py in __getitem__(self, item)
    374             return new_index
    375         elif isinstance(item, list):
--> 376             return self[(tuple(item),)]  # type: ignore
    377         elif isinstance(item, Index):
    378             return self[tuple(v.value for v in item.values)]  # type: ignore

~/src/Hub/deeplake/core/index/index.py in __getitem__(self, item)
    371             for idx, sub_item in enumerate(item):
    372                 ax = new_index.find_axis(offset=idx)
--> 373                 new_index = new_index.compose_at(sub_item, ax)
    374             return new_index
    375         elif isinstance(item, list):

~/src/Hub/deeplake/core/index/index.py in compose_at(self, item, i)
    330             return Index(self.values + [IndexEntry(item)])
    331         else:
--> 332             new_values = self.values[:i] + [self.values[i][item]] + self.values[i + 1 :]
    333             return Index(new_values)
    334 

~/src/Hub/deeplake/core/index/index.py in __getitem__(self, item)
    189                 return IndexEntry(self.value[item])
    190             elif isinstance(item, (tuple, list)):
--> 191                 new_value = tuple(self.value[idx] for idx in item)
    192                 return IndexEntry(new_value)
    193 

~/src/Hub/deeplake/core/index/index.py in <genexpr>(.0)
    189                 return IndexEntry(self.value[item])
    190             elif isinstance(item, (tuple, list)):
--> 191                 new_value = tuple(self.value[idx] for idx in item)
    192                 return IndexEntry(new_value)
    193 

IndexError: tuple index out of range

daniel-falk, Nov 23 '22 15:11

Thanks for pointing it out, Daniel. This seems to be a different problem from the one addressed in the PR mentioned above. @khustup is working on it, and we should have a fix soon.

AbhinavTuli, Nov 24 '22 12:11