thinc More general remap

More general remap_ids

Open kadarakos opened this issue 1 year ago • 16 comments

The remap_ids layer in spaCy is never used as far as I can tell, but I am implementing a MultiEmbed version of MultiHashEmbed. This uses the more conventional Embed layer instead of the HashEmbed.

Before running Embed it runs remap_ids to convert token.i into positions in the embedding table: chain(remap_ids(), Embed().

To be able to use it as a replacement for MultiHashEmbed a couple modifications had to be made. HashEmbed takes a column attribute, because in spaCy the FeatureExtractor returns an Ints2d which is number of tokens x number of features. To be able to run the same idea with Embed the remap_ids now also implements column: chain(remap_ids(column=column), Embed(column=0)).

Furthermore in the common case where the mapping_table is a Dict[int, int] there was a bit of a strange error: whereas with numpy this works {3: 'q'}[np.array([3])[0]] with cupy we get:

----> 1 {3: 'q'}[cp.array([3])[0]]

TypeError: unhashable type: 'cupy._core.core.ndarray'

In the new version when we detect a cupy integer array we cast the elements to int.

Finally, the backprop used to return a [], but when the input is an xp array then this can lead to confusing errors e.g.: if the remap_ids is within a with_array block as is the case with Embed. Now it returns an empty array instead.

Jul 13 '22 15:07 kadarakos

Why isn't column implemented in the model in the same way as in HashEmbed?

Jul 27 '22 08:07 adrianeboyd

Why isn't column implemented in the model in the same way as in HashEmbed?

How is it different?

Jul 27 '22 09:07 kadarakos

Both HashEmbed and the PR description look more like this:

https://github.com/explosion/thinc/blob/40c129f6ee51d4060f1af6b29f903ce51fd89736/thinc/layers/hashembed.py#L48-L53

It's not that I think the output here is incorrect, but I'm trying to understand why it's implemented differently.

Jul 27 '22 09:07 adrianeboyd

Both HashEmbed and the PR description look more like this:

https://github.com/explosion/thinc/blob/40c129f6ee51d4060f1af6b29f903ce51fd89736/thinc/layers/hashembed.py#L48-L53

It's not that I think the output here is incorrect, but I'm trying to understand why it's implemented differently.

In the HashEmbed the init creates a new model that's as the comment says

 This is equivalent to array[:, column].

and I think this is convenient, because the backprop now works properly due to the chain combinator combining column selection with the original embedding model. In remap_ids we don't need backprop so we can just do the simpler array[:, column].

Jul 27 '22 12:07 kadarakos

Ah, I see what you mean. I was mainly thinking that implementing this the same way would avoid a lot of the casting and numpy/cupy problems.

In general I'm not sure this is going to work (easily) for Hashable due to the types? It looks like util.to_numpy converts everything to strings for your test case, and I think there may be cases where it would convert ints to floats for numeric input?

I wonder if this would be easier if it becomes remap_ids.v2 and is restricted to IntsXd instead?

Jul 28 '22 08:07 adrianeboyd

Ah, I see what you mean. I was mainly thinking that implementing this the same way would avoid a lot of the casting and numpy/cupy problems.

In general I'm not sure this is going to work (easily) for Hashable due to the types? It looks like util.to_numpy converts everything to strings for your test case, and I think there may be cases where it would convert ints to floats for numeric input?

I wonder if this would be easier if it becomes remap_ids.v2 and is restricted to IntsXd instead?

Hmm I guess you are right. But I'm thinking that having Ints1d and Ints2d is basically what we would need from the point of view of spaCy, because we use the integer attributes there and not the strings themselves. But in general I think people would like to use remap_ids with str2int maps more likely (provided anyone would use this outside of spaCy).

What do you think about implementing remap_ids in such a way that it only supports Union[Sequence[str], Sequence[int], Ints1d, Ints2d] and that's it?

Jul 28 '22 12:07 kadarakos

I think that sounds reasonable.

Jul 28 '22 12:07 adrianeboyd

I think that sounds reasonable.

Added the new version.

Aug 02 '22 15:08 kadarakos

This version looks good! It does change the parameters and types enough that I think it should be remap_ids.v2, though. I think we can just have both in parallel here?

Aug 05 '22 08:08 adrianeboyd

This version looks good! It does change the parameters and types enough that I think it should be remap_ids.v2, though. I think we can just have both in parallel here?

I'm not sure what to have for .v1. I think that changing

    def backprop(dY: OutT) -> InT:
        return []

    def backprop(dY: OutT) -> InT:
        if xp_input:
            return model.ops.xp.empty(dY.shape)
        else:
            return []

is a bugfix that prevents remap_ids to break when its in the with_array context.

Same with idx = to_numpy(inputs), because values = [table.get(x, default) for x in idx] doesn't actually work with cupy integer arrays.

So to me it felt more like that .v1 was not tested within a pipeline and the with_array bug didn't surface and was also not tested on the GPU so the cupy bug didn't come up either.

Since I broke .v1 on the first try of using it with chain(remap_ids, Embed), I was kinda imagining that this new version is actually .v1. You know what I mean? I kinda feel like this is the first working version.

Aug 05 '22 10:08 kadarakos

I agree that the existing version is wonky, but the main issue is the parameters, since this removes dtype and adds column. If you hadn't removed the existing test it would break on dtype, I think?

Aug 05 '22 10:08 adrianeboyd

I agree that the existing version is wonky, but the main issue is the parameters, since this removes dtype and adds column. If you hadn't removed the existing test it would break on dtype, I think?

That's true, but I'm also not sure why the dtype was added. The original typing was:

InT = Sequence[Any]
OutT = Ints2d

so like we could give dtype float, but then this would return something not use able:

    arr = model.ops.asarray2i(values, dtype=dtype)
    output = model.ops.reshape2i(arr, -1, 1)

Not sure why we need to pass a dtype to asarray2i

   def asarray2i(
        self, data: Union[Ints2d, Sequence[int]], *, dtype: Optional[DTypes] = "int32"
    ) -> Ints2d:
        return cast(Ints2d, self.asarray(data, dtype=dtype))

Aug 05 '22 10:08 kadarakos

So yeah I guess I find it a bit hard to understand what should I include in .v1

Aug 05 '22 10:08 kadarakos

I don't think you need to spend much effort trying to clean up .v1. It can be buggy/broken and we wouldn't touch it again.

For now I would propose keeping a copy of .v1 unmodified and just adding the new one as .v2.

Aug 05 '22 10:08 adrianeboyd

Ok I've added a legacy version in remap_ids.py is this what you've meant?

Aug 05 '22 11:08 kadarakos

Also please update the corresponding docs: https://thinc.ai/docs/api-layers#remap_ids

Aug 08 '22 09:08 adrianeboyd

As a final step can you format the docs?

Aug 23 '22 09:08 adrianeboyd

As a final step can you format the docs?

Sorry, I've forgot, what are we using to format the docs?

Aug 30 '22 15:08 kadarakos

prettier: https://github.com/explosion/thinc/tree/master/website#setup-and-installation

Aug 30 '22 15:08 adrianeboyd

And reformat again!

Sep 01 '22 08:09 adrianeboyd

Let me do a last-minute renaming to have new _v2 instead of renamed _v1.

Sep 02 '22 09:09 adrianeboyd

thinc thinc copied to clipboard

More general remap_ids

thinc
thinc copied to clipboard