FEAT-#4605: Adding small query compiler

Open arunjose696 opened this issue 9 months ago • 4 comments

What do these changes do?

  • [x] first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • [ ] passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • [ ] passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • [ ] signed commit with git commit -s
  • [ ] Resolves #4605
  • [ ] tests added and passing
  • [ ] module layout described at docs/development/architecture.rst is up-to-date

arunjose696 avatar May 13 '24 18:05 arunjose696

Great start on solving this problem! Is it possible to avoid so many of the test changes?

Most of the test changes disable a few checks, since those checks won't be supported without partitions and the current changes don't yet support IO such as pd.read_csv(). Is there something specific that should be avoided?

arunjose696 avatar May 22 '24 16:05 arunjose696

is there something specific that should be avoided?

Nothing specific, I was just trying to understand context. Thanks!

devin-petersohn avatar May 22 '24 16:05 devin-petersohn

@arunjose696 please rebase on main

anmyachev avatar Jun 05 '24 10:06 anmyachev

With the introduction of the small query compiler, we need to test the interoperability between DataFrames using different query compilers. One example is performing a binary operation between a DataFrame backed by the small query compiler and another backed by the Pandas query compiler. (Note: This feature is not yet included in this PR.)

This will require modifying or adding new tests. In the current tests in the modin/modin/tests/pandas/dataframe folder, we have the following scenarios where two DataFrames interact:

1) Derived DataFrames: In tests where the second DataFrame is created or derived from the first, e.g. test_join_empty, we need to refactor these tests so that the second DataFrame is created separately from the first and with MODIN_NATIVE_DATAFRAME_MODE set.

2) Lambda Functions: In tests where the other DataFrame is created within a lambda function, e.g. test___divmod__, we need to refactor these tests to either create the second DataFrame in the test definition itself or provide an additional wrapper for the lambda functions to ensure the DataFrame is created with a different query compiler.

3) Separate DataFrames: In tests where two separate DataFrames are used, e.g. test_where, we need to refactor these tests to flip MODIN_NATIVE_DATAFRAME_MODE between None and Native_pandas when creating the first and second DataFrames. This ensures that both the left and right operands are tested with different query compilers for interoperability. The same flipping would also be required in cases 1 and 2 once the DataFrames are separated.
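To make the flipping in the three cases above concrete, here is a minimal sketch using only the standard library. It assumes the mode is controlled through the MODIN_NATIVE_DATAFRAME_MODE environment variable and that it is read at DataFrame creation time; the value strings are copied from the description above and may not match what Modin actually accepts. The helpers `dataframe_mode` and `create_pair` are hypothetical names for illustration, not existing Modin test utilities.

```python
import itertools
import os
from contextlib import contextmanager

# Assumption: these are the two mode values, as named in the comment above.
MODES = ("None", "Native_pandas")
ENV_VAR = "MODIN_NATIVE_DATAFRAME_MODE"


@contextmanager
def dataframe_mode(mode):
    """Temporarily set ENV_VAR to `mode`, restoring the old value on exit."""
    previous = os.environ.get(ENV_VAR)
    os.environ[ENV_VAR] = mode
    try:
        yield
    finally:
        if previous is None:
            os.environ.pop(ENV_VAR, None)
        else:
            os.environ[ENV_VAR] = previous


def create_pair(make_left, make_right, left_mode, right_mode):
    """Build each operand separately, each under its own mode (hypothetical helper)."""
    with dataframe_mode(left_mode):
        left = make_left()
    with dataframe_mode(right_mode):
        right = make_right()
    return left, right


# Every left/right combination, including the two mixed-mode pairs
# that exercise interoperability.
MODE_PAIRS = list(itertools.product(MODES, repeat=2))
```

A test could then be parametrized over MODE_PAIRS (for example with pytest.mark.parametrize) and pass two DataFrame factories to create_pair, so that the second operand is never derived from the first or built inside a lambda under the wrong mode.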

Upon reviewing the modin/modin/tests/pandas/dataframe folder, I found approximately 100 tests involving scenarios where two DataFrames interact. These tests may need to be refactored, or copied to a different directory and updated to test interoperability specifically.

@YarShev @anmyachev @devin-petersohn, could you please provide suggestions on how to approach testing the interoperability?

arunjose696 avatar Jun 10 '24 16:06 arunjose696