
'sort' method sorts one column only

Open shachardon opened this issue 3 years ago • 3 comments

The 'sort' method changes the order of one column only (the one defined by the argument 'column'), thus creating a mismatch between a sample's fields. I would expect it to change the order of the samples as a whole, based on the 'column' order.

shachardon avatar Jul 05 '22 11:07 shachardon

Hi ! ds.sort() does sort the full dataset, not just one column:

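from datasets import Dataset

# Two columns whose values are paired row by row
ds = Dataset.from_dict({"foo": [3, 2, 1], "bar": ["c", "b", "a"]})
# Sorting by "foo" reorders whole rows, so "bar" follows along
print(ds.sort("foo").to_pandas())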
#    foo bar
# 0    1   a
# 1    2   b
# 2    3   c

What made you think it was not the case ? Did you experience a situation where it was only sorting one column ?

lhoestq avatar Jul 05 '22 14:07 lhoestq

Hi! Thank you for your quick reply! I wanted to sort the cnn_dailymail dataset by the length of the labels (number of characters). I added a new column with the lengths to the dataset (ds.add_column) and then sorted by this new column. Only the new length column was sorted; the rest were left in their original order.

shachardon avatar Jul 07 '22 10:07 shachardon

That's unexpected, can you share the code you used to get this ?

lhoestq avatar Jul 07 '22 12:07 lhoestq