modin
modin copied to clipboard
PERF: avoid reorder_labels
- [x] closes #4882
import numpy as np
import modin.pandas as pd
arr = np.random.randn(1_000_000, 20)
df = pd.DataFrame(arr)
res = df[[4, 3]]
%timeit res = df[[4, 3]]
711 µs ± 4.12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) # <- PR
43.6 ms ± 1.81 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) # <- master
TODO
- [ ] further optimizations to blklocs/blknos
- [ ] avoid reorder_labels in cudf_on_ray implemenation
- [ ] fix is_full_axis_mask
- [ ] deal with potential many-partitions issues (discussed briefly on call 8/23)
This pull request introduces 3 alerts when merging 0ab5fbcb2429bddc0c93880f0957834d7edfeddd into b0465fa4425247e7949881e93866c6b85bb10a1d - view on LGTM.com
new alerts:
- 2 for Unused local variable
- 1 for Unused import
This pull request introduces 3 alerts when merging b5aede162bc26e4be3516301289e56de41de00bd into dff84b9cdf373096f88356fe17202599c192c2fc - view on LGTM.com
new alerts:
- 2 for Unused local variable
- 1 for Unused import
Looks like this runs into a many-partitions issue. @devin-petersohn on the call the other day it sounded like you had an idea to deal with this?
Looks like this runs into a many-partitions issue. @devin-petersohn on the call the other day it sounded like you had an idea to deal with this?
Hi @jbrockmendel! By many-partitions issue, do you mean that too many partitions are created? If that's the issue, you could call the rebalance_partitions method of the partition manager on the partitions, which should balance them!
This pull request introduces 3 alerts when merging 0143dc6cda908112e1d5306c169f2af507280e04 into 336fad7c26ad8d3f44fe290d232c7cb7fd7dbe17 - view on LGTM.com
new alerts:
- 2 for Unused local variable
- 1 for Unused import
This pull request introduces 3 alerts when merging a563f21ad03263f5b38247c37ccb51e68004880e into 636fc59e6820a36937b72cfb96ba9fa60d871fe4 - view on LGTM.com
new alerts:
- 2 for Unused local variable
- 1 for Unused import
Codecov Report
Merging #4880 (514f062) into master (c9fc326) will decrease coverage by
12.64%. The diff coverage is100.00%.
:exclamation: Current head 514f062 differs from pull request most recent head ab42a25. Consider uploading reports for the commit ab42a25 to get more accurate results
@@ Coverage Diff @@
## master #4880 +/- ##
===========================================
- Coverage 84.87% 72.22% -12.65%
===========================================
Files 268 268
Lines 19744 19739 -5
===========================================
- Hits 16757 14256 -2501
- Misses 2987 5483 +2496
| Impacted Files | Coverage Δ | |
|---|---|---|
| ...in/core/dataframe/pandas/partitioning/partition.py | 100.00% <ø> (ø) |
|
| modin/core/dataframe/pandas/dataframe/dataframe.py | 80.80% <100.00%> (-14.30%) |
:arrow_down: |
| modin/experimental/sklearn/__init__.py | 0.00% <0.00%> (-100.00%) |
:arrow_down: |
| modin/experimental/xgboost/test/test_dmatrix.py | 0.00% <0.00%> (-100.00%) |
:arrow_down: |
| modin/experimental/xgboost/test/test_xgboost.py | 0.00% <0.00%> (-100.00%) |
:arrow_down: |
| modin/experimental/core/execution/ray/__init__.py | 0.00% <0.00%> (-100.00%) |
:arrow_down: |
| ...n/experimental/sklearn/model_selection/__init__.py | 0.00% <0.00%> (-100.00%) |
:arrow_down: |
| ...mental/sklearn/model_selection/train_test_split.py | 0.00% <0.00%> (-100.00%) |
:arrow_down: |
| ...tal/core/execution/ray/implementations/__init__.py | 0.00% <0.00%> (-100.00%) |
:arrow_down: |
| ...pandas/interchange/dataframe_protocol/exception.py | 0.00% <0.00%> (-100.00%) |
:arrow_down: |
| ... and 78 more |
:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more
This pull request introduces 1 alert when merging 2809891e95f8a83f291bc1b54fa09c68cb9c028a into 636fc59e6820a36937b72cfb96ba9fa60d871fe4 - view on LGTM.com
new alerts:
- 1 for Unused local variable
@RehanSD i think we need to re-find new_row_lengths after rebalance_partitions. suggestions on efficient way to do this?
i think we need to re-find new_row_lengths after rebalance_partitions. suggestions on efficient way to do this?
@jbrockmendel it looks like draft PR #4893 is going to make rebalance_partitions return the new row lengths without doing any extra remote work. Will that work?
sounds like it, yes
@anmyachev tried to incorporate the rebalance_partition changes but it broke a bunch of new stuff. suggestions?
@anmyachev tried to incorporate the rebalance_partition changes but it broke a bunch of new stuff. suggestions?
Perhaps https://github.com/modin-project/modin/issues/4914 will result in an issue with the insert operation.
@jbrockmendel please rebase on master
This pull request introduces 2 alerts when merging 68210152efa56ab50d80116c2e837c66e40562f4 into 84f638cc703beb550edecce715532c8e86fb7b25 - view on LGTM.com
new alerts:
- 2 for Unused local variable
I don't see a path here, closing.