modin
modin copied to clipboard
DOCS: Document that sort_values is not fully distributed
Isn't is a supported method? It seems so from the docs, and I can't find any errors on google or github similar to mine. I have a dataframe I've been playing with, and I go to sort the values like this:
df.sort_values(by='temp_f', ascending=False)
only to receive the following warning:
UserWarning: sort_values is not currently supported by PandasOnRay, defaulting to pandas implementation.
Is it something I am doing wrong?
Hi @lemerchand! sort_values currently defaults to pandas; however, we are working on a fix for this #4601 - after which you shouldn't receive that warning anymore! In general, Modin defaults to pandas whenever it doesn't have a distributed implementation of the specific API; although we're constantly working on getting 100% of pandas API implemented.
@RehanSD we do claim that sort_values is implemented here. Should we fix that?
@RehanSD we do claim that
sort_valuesis implemented here. Should we fix that?
Yea this is what had me confused, but good to know it's being worked on! Thanks for the replies, @mvashishtha and @RehanSD
Ah I see - I think this is because the sort_values is currently implemented by gathering the full columns we are sorting by on the head node, sorting them, and then using the index of the sorted columns to do a mask on the dataframe - so we aren't technically defaulting to pandas, but its not a true distributed sort. I think we should either remove the UserWarning or change the implementation status - although I'm not certain what the correct answer is here. Perhaps a good compromise would be to change the UserWarning so that it says that its doing a sort but that it is not distributed, and change the implementation status to Partial?
@RehanSD I think "partial" is a good label for the current implementation. I've turned this issue into the task of updating the sort_values implementation status.
@lemerchand if you're interested in making the documentation fix yourself, we'd be happy to help you merge a PR. Relevant docs are here for dataframes and here for series. There's a quick guide to getting started with contributions here, and there's a more complete guide here.
Hey, yea that is something I'd be happy to do, though I won't be on my computer until tomorrow, so if it isn't fixed by then l see about looking through the contribution guidelines!
This appears to no longer print any warning on 0.18, and I believe it was fixed by #3535
We now have a distributed sort_values for axis=0, but not for axis=1 so I think the issue should still be kept open.
cc @RehanSD
Quick question here...
it IS distributed right now right? So it should be possible to sort a df that is larger as the Nodes RAM (single node setup)..? Or doesnt it mean that?
It seams regardless of NPartitions this just fills up my RAM until a raylet dies. (Even though the dataset is just 40gigs (maybe a bit more in memory) and the node has 64gigs of RAM. Am I misunderstanding distribution in modin, or modin in general?
I think I just realized that axis i swapped axis 0 and 1. So it is NOT distributed to sort rows? just columns?
to sad :p