
Add support for python `in` operator in filtering functions

Open johnygomez opened this issue 7 years ago • 14 comments

I'd like to filter rows according to functions like

lambda x: x[0] in my_list

which use pythonic syntax (syntactic sugar). Currently I need to rewrite this as a primitive formula that tests each element of the list separately.
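A minimal sketch of the "primitive formula" workaround described above, assuming the column expression supports `==` and `|` (as datatable's `f`-expressions do). The helper name `isin_expr` is hypothetical and not part of datatable:

```python
import functools
import operator


def isin_expr(column, values):
    """Build (column == v1) | (column == v2) | ... for every v in values."""
    values = list(values)
    if not values:
        # An empty OR-chain has no sensible expression form, so reject it.
        raise ValueError("values must not be empty")
    return functools.reduce(operator.or_, (column == v for v in values))


# Hypothetical usage with datatable (not executed here):
#   from datatable import f
#   DT[isin_expr(f.A, my_list), :]
```

The helper only relies on `==` and `|`, so it also works on plain Python values, which makes it easy to check without datatable installed.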

johnygomez avatar Jan 24 '18 20:01 johnygomez

Update https://stackoverflow.com/questions/61494957 when this is implemented

st-pasha avatar Apr 29 '20 18:04 st-pasha

Any updates when this might be implemented?

Peter-Pasta avatar Apr 02 '21 10:04 Peter-Pasta

I guess the core maintainers are currently focused on building up the time series functionality in datatable; however, since it is open source, contributions are very much welcome.

samukweku avatar Apr 03 '21 09:04 samukweku

I doubt I have the skills and deep level understanding to contribute such a feature. The fact that this feature is still missing implies to me that it takes some time and sophistication to develop it, hence the maintainers weren't able to include it so far. Regardless of that, what are the necessary educational resources to begin to understand how datatable works under the hood?

Peter-Pasta avatar Apr 03 '21 09:04 Peter-Pasta

@Peter-Pasta I am still finding my way around the source code. The core maintainers can explain better

samukweku avatar Apr 04 '21 11:04 samukweku

We have a tutorial on creating a new datatable function: https://datatable.readthedocs.io/en/latest/develop/create-fexpr.html

Now, since `in` is an operator and not a regular function, the process will be slightly more complicated: you'd need to fill the `tp_as_sequence` slot and implement the `sq_contains` method.
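For readers unfamiliar with the C API, the Python-level counterpart of `sq_contains` is `__contains__`. The sketch below uses a hypothetical stand-in class (`ColumnExpr` is not a datatable type) to show one detail worth keeping in mind: the `in` operator converts whatever `__contains__` returns to a plain `bool`.

```python
class ColumnExpr:
    """Hypothetical stand-in for a lazy column expression."""

    def __init__(self, name):
        self.name = name

    def __contains__(self, item):
        # Return a description of the test instead of a plain bool.
        return f"{item!r} in {self.name}"


expr = ColumnExpr("A")

# Calling the method directly keeps the returned object...
direct = expr.__contains__(3)

# ...but the operator form coerces the result to bool.
via_operator = 3 in expr
```

Because of this coercion, an operator-based design has to produce its answer eagerly, whereas a regular function is free to return a lazy expression.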

As for the "core" of the function, there are two quite similar examples: the replace() function, which compares each value with a list (or map) of values, and the join() function, which compares each value with a sorted column via binary search.
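The two strategies mentioned here can be sketched in plain Python. This only illustrates the idea (hash lookup versus binary search against sorted values); it is not datatable's actual implementation, and both function names are hypothetical:

```python
from bisect import bisect_left


def isin_hashed(column, candidates):
    """Membership test via a hash set, similar in spirit to a value map."""
    lookup = set(candidates)
    return [value in lookup for value in column]


def isin_sorted(column, sorted_candidates):
    """Membership test via binary search against an already sorted list."""
    result = []
    for value in column:
        i = bisect_left(sorted_candidates, value)
        found = i < len(sorted_candidates) and sorted_candidates[i] == value
        result.append(found)
    return result
```

Both functions return one boolean per row, which is the shape a row filter needs.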

Overall, on a difficulty scale from 1 (easy) to 5 (hard), I would rate this task as 2 or 3.

st-pasha avatar Apr 05 '21 18:04 st-pasha

I think it might be easier to write a function instead of an operator for in, maybe dt.in. I would like to give it a shot.

samukweku avatar Apr 11 '21 12:04 samukweku

I also need guidance, @st-pasha @oleksiyskononenko: when building datatable in editable mode, I don't have an easy-install.pth in my site-packages folder, only an easy-install.py file. As such, I can't run this command: echo "`pwd`/src" >> ${VIRTUAL_ENV}/lib/python*/site-packages/easy-install.pth

samukweku avatar Apr 11 '21 14:04 samukweku

@oleksiyskononenko @st-pasha Any ideas on how I can fix the issue above?

samukweku avatar Apr 20 '21 00:04 samukweku

@samukweku Sorry, I was on vacation last week and didn't see your message.

So the main challenge with "editable mode" installations in python is that there is no official PEP standard for this, which makes it hard to provide reliable instructions here. You can try one of the following approaches:

  1. Create the easy-install.pth file using the command above. It should work as-is, or if you have an older version of shell, try echo "`pwd`/src" >> `ls ${VIRTUAL_ENV}/lib/python*/site-packages/easy-install.pth` .
  2. Create a virtual environment specifically for datatable development, using the virtualenv command.
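As an alternative to the shell one-liner, the same effect can be sketched in Python. This relies on the standard behaviour of `.pth` files (each line of a `.pth` file in site-packages is added to `sys.path` at startup). The helper name `add_to_pth` is hypothetical:

```python
from pathlib import Path


def add_to_pth(src_dir, site_packages, pth_name="easy-install.pth"):
    """Append src_dir to a .pth file in site_packages, creating it if needed."""
    pth_file = Path(site_packages) / pth_name
    entry = str(Path(src_dir).resolve())

    text = pth_file.read_text() if pth_file.exists() else ""
    if entry not in text.splitlines():
        with pth_file.open("a") as fh:
            # Avoid gluing the new entry onto an unterminated last line.
            if text and not text.endswith("\n"):
                fh.write("\n")
            fh.write(entry + "\n")
    return pth_file


# Hypothetical usage from the repository root, inside the virtual environment:
#   import sysconfig
#   add_to_pth("src", sysconfig.get_paths()["purelib"])
```

Unlike a repeated `echo ... >>`, calling the helper twice does not add a duplicate line.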

st-pasha avatar Apr 21 '21 19:04 st-pasha

@st-pasha, I am still having issues with the installation. I successfully installed it in editable mode; however, the datatable version is 0.11.1. I uninstalled it (pip uninstall datatable), thinking that would take care of the problem (as suggested here), but I get the error message below when I try to run make test:

make test    (make_mistakes)
python -m pytest -ra --maxfail=10 -Werror tests
ImportError while loading conftest '/home/sam/github/datatable/tests/conftest.py'.
tests/__init__.py:14: in <module>
    from datatable.lib import core
E   ModuleNotFoundError: No module named 'datatable'
make: *** [Makefile:59: test] Error 4

Could you kindly suggest how I can fix this?

samukweku avatar Apr 22 '21 12:04 samukweku

On my computer I have the following configuration: the repository is checked out into

$ pwd
/Users/pasha/github/datatable

The content of the "easy-install.pth" is

$ ls ${VIRTUAL_ENV}/lib/python*/site-packages/easy-install.pth
/Users/pasha/py36/lib/python3.6/site-packages/easy-install.pth
$ cat `ls ${VIRTUAL_ENV}/lib/python*/site-packages/easy-install.pth`
/Users/pasha/github/datatable/src

And I can verify that this works by checking

$ python
Python 3.6.6 (v3.6.6:4cf1f54eb7, Jun 26 2018, 19:50:54) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import datatable
>>> datatable.__file__
'/Users/pasha/github/datatable/src/datatable/__init__.py'

The import command may fail like this if the core wasn't compiled yet with either make debug or make build:

>>> import datatable
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/pasha/github/datatable/src/datatable/__init__.py", line 23, in <module>
    from .frame import Frame
  File "/Users/pasha/github/datatable/src/datatable/frame.py", line 23, in <module>
    from datatable.lib._datatable import Frame
  File "/Users/pasha/github/datatable/src/datatable/lib/__init__.py", line 31, in <module>
    from . import _datatable as core
ImportError: cannot import name '_datatable'

However, if the import says that datatable was not found, that indicates the installation in editable mode failed somehow.
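The two failure modes can be told apart without actually importing the package. A minimal sketch using the standard library's `importlib.util.find_spec` (the helper name `locate_module` is hypothetical):

```python
import importlib.util


def locate_module(name):
    """Return the file a top-level module would be loaded from, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# Hypothetical check for this thread's setup:
#   locate_module("datatable")
# None would mean the package is not on sys.path (editable install failed);
# a path under .../datatable/src/ would mean the .pth entry is working, so any
# remaining ImportError points at the uncompiled core instead.
```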

st-pasha avatar Apr 22 '21 17:04 st-pasha

@st-pasha thanks; found the error on my end and fixed; the echo part wasn't copying the right thing to my easy-install.pth file. All good now.

Another question: if changes are made to the C++ code, make build is required. How do I test code changes in the Python section? Say, for instance, I want f.string_column.len() to return 2. It's a silly example, but I hope you get my point. This does not involve any C++, so how do I do that?

samukweku avatar Apr 22 '21 23:04 samukweku

If you make changes to C++, you need to run make build (or make debug) and then restart the python console (or reload the kernel in jupyter). If you make changes to python only, then you just need to restart the python console.

st-pasha avatar Apr 22 '21 23:04 st-pasha