thinc
thinc copied to clipboard
More general remap_ids
The remap_ids layer in spaCy is never used as far as I can tell, but I am implementing a MultiEmbed version of MultiHashEmbed. This uses the more conventional Embed layer instead of the HashEmbed.
Before running Embed it runs remap_ids to convert token.i into positions in the embedding table: chain(remap_ids(), Embed().
To be able to use it as a replacement for MultiHashEmbed a couple modifications had to be made. HashEmbed takes a column attribute, because in spaCy the FeatureExtractor returns an Ints2d which is number of tokens x number of features.
To be able to run the same idea with Embed the remap_ids now also implements column: chain(remap_ids(column=column), Embed(column=0)).
Furthermore in the common case where the mapping_table is a Dict[int, int] there was a bit of a strange error: whereas with numpy this works {3: 'q'}[np.array([3])[0]] with cupy we get:
----> 1 {3: 'q'}[cp.array([3])[0]]
TypeError: unhashable type: 'cupy._core.core.ndarray'
In the new version when we detect a cupy integer array we cast the elements to int.
Finally, the backprop used to return a [], but when the input is an xp array then this can lead to confusing errors e.g.: if the remap_ids is within a with_array block as is the case with Embed. Now it returns an empty array instead.
Why isn't column implemented in the model in the same way as in HashEmbed?
Why isn't
columnimplemented in the model in the same way as inHashEmbed?
How is it different?
Both HashEmbed and the PR description look more like this:
https://github.com/explosion/thinc/blob/40c129f6ee51d4060f1af6b29f903ce51fd89736/thinc/layers/hashembed.py#L48-L53
It's not that I think the output here is incorrect, but I'm trying to understand why it's implemented differently.
Both
HashEmbedand the PR description look more like this:https://github.com/explosion/thinc/blob/40c129f6ee51d4060f1af6b29f903ce51fd89736/thinc/layers/hashembed.py#L48-L53
It's not that I think the output here is incorrect, but I'm trying to understand why it's implemented differently.
In the HashEmbed the init creates a new model that's as the comment says
This is equivalent to array[:, column].
and I think this is convenient, because the backprop now works properly due to the chain combinator combining column selection with the original embedding model. In remap_ids we don't need backprop so we can just do the simpler array[:, column].
Ah, I see what you mean. I was mainly thinking that implementing this the same way would avoid a lot of the casting and numpy/cupy problems.
In general I'm not sure this is going to work (easily) for Hashable due to the types? It looks like util.to_numpy converts everything to strings for your test case, and I think there may be cases where it would convert ints to floats for numeric input?
I wonder if this would be easier if it becomes remap_ids.v2 and is restricted to IntsXd instead?
Ah, I see what you mean. I was mainly thinking that implementing this the same way would avoid a lot of the casting and numpy/cupy problems.
In general I'm not sure this is going to work (easily) for
Hashabledue to the types? It looks likeutil.to_numpyconverts everything to strings for your test case, and I think there may be cases where it would convert ints to floats for numeric input?I wonder if this would be easier if it becomes
remap_ids.v2and is restricted toIntsXdinstead?
Hmm I guess you are right. But I'm thinking that having Ints1d and Ints2d is basically what we would need from the point of view of spaCy, because we use the integer attributes there and not the strings themselves. But in general I think people would like to use remap_ids with str2int maps more likely (provided anyone would use this outside of spaCy).
What do you think about implementing remap_ids in such a way that it only supports Union[Sequence[str], Sequence[int], Ints1d, Ints2d] and that's it?
I think that sounds reasonable.
I think that sounds reasonable.
Added the new version.
This version looks good! It does change the parameters and types enough that I think it should be remap_ids.v2, though. I think we can just have both in parallel here?
This version looks good! It does change the parameters and types enough that I think it should be
remap_ids.v2, though. I think we can just have both in parallel here?
I'm not sure what to have for .v1. I think that changing
def backprop(dY: OutT) -> InT:
return []
to
def backprop(dY: OutT) -> InT:
if xp_input:
return model.ops.xp.empty(dY.shape)
else:
return []
is a bugfix that prevents remap_ids to break when its in the with_array context.
Same with idx = to_numpy(inputs), because values = [table.get(x, default) for x in idx] doesn't actually work with cupy integer arrays.
So to me it felt more like that .v1 was not tested within a pipeline and the with_array bug didn't surface and was also not tested on the GPU so the cupy bug didn't come up either.
Since I broke .v1 on the first try of using it with chain(remap_ids, Embed), I was kinda imagining that this new version is actually .v1. You know what I mean? I kinda feel like this is the first working version.
I agree that the existing version is wonky, but the main issue is the parameters, since this removes dtype and adds column. If you hadn't removed the existing test it would break on dtype, I think?
I agree that the existing version is wonky, but the main issue is the parameters, since this removes
dtypeand addscolumn. If you hadn't removed the existing test it would break ondtype, I think?
That's true, but I'm also not sure why the dtype was added. The original typing was:
InT = Sequence[Any]
OutT = Ints2d
so like we could give dtype float, but then this would return something not use able:
arr = model.ops.asarray2i(values, dtype=dtype)
output = model.ops.reshape2i(arr, -1, 1)
Not sure why we need to pass a dtype to asarray2i
def asarray2i(
self, data: Union[Ints2d, Sequence[int]], *, dtype: Optional[DTypes] = "int32"
) -> Ints2d:
return cast(Ints2d, self.asarray(data, dtype=dtype))
So yeah I guess I find it a bit hard to understand what should I include in .v1
I don't think you need to spend much effort trying to clean up .v1. It can be buggy/broken and we wouldn't touch it again.
For now I would propose keeping a copy of .v1 unmodified and just adding the new one as .v2.
Ok I've added a legacy version in remap_ids.py is this what you've meant?
Also please update the corresponding docs: https://thinc.ai/docs/api-layers#remap_ids
As a final step can you format the docs?
As a final step can you format the docs?
Sorry, I've forgot, what are we using to format the docs?
prettier: https://github.com/explosion/thinc/tree/master/website#setup-and-installation
And reformat again!
Let me do a last-minute renaming to have new _v2 instead of renamed _v1.