unitxt icon indicating copy to clipboard operation
unitxt copied to clipboard

Fix missing deep copy in MapInstanceValues

Open yoavkatz opened this issue 1 year ago • 3 comments

If the value we map to is a list, we must deepcopy the value, otherwise external changes to the list effect the value of the mapper .

yoavkatz avatar Oct 08 '24 13:10 yoavkatz

I think all the copies in unitxt should be one of these copy functions in utils.py Where every function add more power with additional price:

  • shallow_copy
  • recursive_copy
  • recursive_shallow_copy
  • recursive_deep_copy
  • deep_copy

In this case I think that recursive shallow copy is enough. what it does is to recursively recreate every list, dict, tuple, with shallow copy of the object inside.

In this case of list of images for example it will recreate a new list of images where every image is with new reference, yet the pixels are shared between images.

elronbandel avatar Oct 08 '24 14:10 elronbandel

Hi, First, just a reminder, that the docstring of MapInstanceValue states:

Attributes: mappers (Dict[str, Dict[str, str]]): The mappers to use for mapping instance values. Keys are the names of the fields to be mapped, and values are dictionaries that define the mapping from old values to new values.

The code does verify that the keys in the mapping from old values to new values are strings:

    def verify(self):
        # make sure the mappers are valid
        for key, mapper in self.mappers.items():
            assert isinstance(
                mapper, dict
            ), f"Mapper for given field {key} should be a dict, got {type(mapper)}"
            for k in mapper.keys():
                assert isinstance(
                    k, str
                ), f'Key "{k}" in mapper for field "{key}" should be a string, got {type(k)}'

and it is not clear why not verified that also the value in each mapper is a str. Not any obvious reason. Now, if the existing (coming in the processed instance) old value to be mapped is not a str, the code casts it to a str:

    def get_mapped_value(self, instance, key, mapper, val):
        val_as_str = str(val)  # make sure the value is a string
        if val_as_str in mapper:
            return mapper[val_as_str]

I think that we should thus do one of the following: [1] Verify that also the target in each mapping is a string, (not only verify the key of each mapping) [2] Return a value that is casted to a str (the same as we do when we look up with a given value in the instance, that is expected to be a str - we cast that value to a str), and let the user, who planted a list in the mappers - to worry about this. [3] Take the worry from [2] onto our shoulders: In prepare(), make each val in the mappers (the val in each of the dicts, the val that we do not verify its being a str, although the docstring says it is a string) a tuple: the casting to string of that val, and the original type thereof. When returning: return a cast to the saved type of the saved str. [4] Do the careful copy that @elronbandel created: a careful copy that treats non-str as they deserve - copy their individual ingredients, not just their shallow incarnation. Specifically, use here: unitxt.utils.recursive_copy

Good old copy.deepcopy costs a lot of runtime, and I would not recommend do it for each and every innocent str (which expectedly is the most commonly used type of val, as the docstring states)

dafnapension avatar Oct 08 '24 16:10 dafnapension

Agree, hence my recommendation for recursive_shallow_copy [or even recursive_copy (in the case of string will return the same string over and over. ]

elronbandel avatar Oct 09 '24 09:10 elronbandel

@dafnapension Thanks for making the changes and adding the test. Looks good.

I can not approve it (only you or @elronbandel ).

yoavkatz avatar Oct 10 '24 07:10 yoavkatz

Thank you, @yoavkatz , I will approve now.

dafnapension avatar Oct 10 '24 09:10 dafnapension