map() function removes columns when input_columns is not None

Open • pramodith opened this issue 3 years ago • 3 comments

Describe the bug

When input_columns is set, the map function removes every feature that is not listed in input_columns, even though those columns are not named in the remove_columns argument.

Steps to reproduce the bug

from datasets import Dataset

ds = Dataset.from_dict({"a": [1, 2, 3], "b": [0, 1, 0], "c": [2, 4, 5]})

def double(x, y):
    # x and y receive the values of columns "a" and "c" respectively
    x = x * 2
    y = y * 2
    return {"d": x, "e": y}

ds.map(double, input_columns=["a", "c"])

Expected results

Dataset({
    features: ['a', 'b', 'c', 'd', 'e'],
    num_rows: 3
})

Actual results

Dataset({
    features: ['a', 'c', 'd', 'e'],
    num_rows: 3
})

In this specific example, feature b should not be removed, because it is not listed in remove_columns.
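To make the expected behavior concrete, here is a small pure-Python sketch of the semantics I would expect. It is only a toy model and not the datasets implementation: map_rows is a hypothetical helper, and rows are plain dicts rather than a Dataset. The selected columns are passed to the function as positional arguments, but the output is merged into the complete row, so unselected columns survive.

```python
# Toy model of the expected semantics; this is NOT how datasets implements map().
# `map_rows` is a hypothetical helper introduced only for illustration.
def map_rows(rows, function, input_columns=None):
    """Apply `function` to each row and merge its output into the full row."""
    mapped = []
    for row in rows:
        if input_columns is None:
            # Without input_columns, the function receives the whole row.
            update = function(row)
        else:
            # With input_columns, only the selected values are passed positionally.
            update = function(*(row[col] for col in input_columns))
        # Merge into the complete row so that unselected columns are kept.
        mapped.append({**row, **update})
    return mapped


def double(x, y):
    return {"d": x * 2, "e": y * 2}


rows = [
    {"a": 1, "b": 0, "c": 2},
    {"a": 2, "b": 1, "c": 4},
    {"a": 3, "b": 0, "c": 5},
]
result = map_rows(rows, double, input_columns=["a", "c"])
print(list(result[0]))  # ['a', 'b', 'c', 'd', 'e']
```

Until this is fixed, one possible workaround (as far as I can tell) is to omit input_columns and index the example dict inside the mapped function, for example ds.map(lambda ex: double(ex["a"], ex["c"])).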

Environment info

  • datasets version: 2.4.0
  • Platform: linux (colab)
  • Python version: 3.7.13
  • PyArrow version: 6.0.1

pramodith, Aug 16 '22 20:08

Hi! Thanks for reporting! This looks like a bug. I've just opened a PR with the fix.

mariosasko, Sep 12 '22 18:09

Awesome! Thank you. I'll close the issue once the PR gets merged. :-)

pramodith, Sep 12 '22 20:09

I guess we should reopen after the revert by:

  • #5006

albertvillanova, Sep 21 '22 15:09