modin icon indicating copy to clipboard operation
modin copied to clipboard

DOCS: Document that sort_values is not fully distributed

Open lemerchand opened this issue 3 years ago • 9 comments

Isn't is a supported method? It seems so from the docs, and I can't find any errors on google or github similar to mine. I have a dataframe I've been playing with, and I go to sort the values like this:

df.sort_values(by='temp_f', ascending=False)

only to receive the following warning:

UserWarning: sort_values is not currently supported by PandasOnRay, defaulting to pandas implementation.

Is it something I am doing wrong?

lemerchand avatar Oct 25 '22 02:10 lemerchand

Hi @lemerchand! sort_values currently defaults to pandas; however, we are working on a fix for this #4601 - after which you shouldn't receive that warning anymore! In general, Modin defaults to pandas whenever it doesn't have a distributed implementation of the specific API; although we're constantly working on getting 100% of pandas API implemented.

RehanSD avatar Oct 25 '22 18:10 RehanSD

@RehanSD we do claim that sort_values is implemented here. Should we fix that?

mvashishtha avatar Oct 25 '22 18:10 mvashishtha

@RehanSD we do claim that sort_values is implemented here. Should we fix that?

Yea this is what had me confused, but good to know it's being worked on! Thanks for the replies, @mvashishtha and @RehanSD

lemerchand avatar Oct 25 '22 18:10 lemerchand

Ah I see - I think this is because the sort_values is currently implemented by gathering the full columns we are sorting by on the head node, sorting them, and then using the index of the sorted columns to do a mask on the dataframe - so we aren't technically defaulting to pandas, but its not a true distributed sort. I think we should either remove the UserWarning or change the implementation status - although I'm not certain what the correct answer is here. Perhaps a good compromise would be to change the UserWarning so that it says that its doing a sort but that it is not distributed, and change the implementation status to Partial?

RehanSD avatar Oct 25 '22 18:10 RehanSD

@RehanSD I think "partial" is a good label for the current implementation. I've turned this issue into the task of updating the sort_values implementation status.

@lemerchand if you're interested in making the documentation fix yourself, we'd be happy to help you merge a PR. Relevant docs are here for dataframes and here for series. There's a quick guide to getting started with contributions here, and there's a more complete guide here.

mvashishtha avatar Oct 26 '22 02:10 mvashishtha

Hey, yea that is something I'd be happy to do, though I won't be on my computer until tomorrow, so if it isn't fixed by then l see about looking through the contribution guidelines!

lemerchand avatar Oct 26 '22 04:10 lemerchand

This appears to no longer print any warning on 0.18, and I believe it was fixed by #3535

jkew avatar Jan 08 '23 10:01 jkew

We now have a distributed sort_values for axis=0, but not for axis=1 so I think the issue should still be kept open.

cc @RehanSD

YarShev avatar Jan 08 '23 10:01 YarShev

Quick question here...

it IS distributed right now right? So it should be possible to sort a df that is larger as the Nodes RAM (single node setup)..? Or doesnt it mean that?

It seams regardless of NPartitions this just fills up my RAM until a raylet dies. (Even though the dataset is just 40gigs (maybe a bit more in memory) and the node has 64gigs of RAM. Am I misunderstanding distribution in modin, or modin in general?

I think I just realized that axis i swapped axis 0 and 1. So it is NOT distributed to sort rows? just columns?

to sad :p

Liquidmasl avatar Aug 08 '24 14:08 Liquidmasl