thinc
thinc copied to clipboard
More general remap_ids
The remap_ids
layer in spaCy
is never used as far as I can tell, but I am implementing a MultiEmbed
version of MultiHashEmbed
. This uses the more conventional Embed
layer instead of the HashEmbed
.
Before running Embed
it runs remap_ids
to convert token.i
into positions in the embedding table: chain(remap_ids(), Embed()
.
To be able to use it as a replacement for MultiHashEmbed
a couple modifications had to be made. HashEmbed
takes a column
attribute, because in spaCy
the FeatureExtractor
returns an Ints2d
which is number of tokens x number of features.
To be able to run the same idea with Embed
the remap_ids
now also implements column
: chain(remap_ids(column=column), Embed(column=0))
.
Furthermore in the common case where the mapping_table
is a Dict[int, int]
there was a bit of a strange error: whereas with numpy
this works {3: 'q'}[np.array([3])[0]]
with cupy we get:
----> 1 {3: 'q'}[cp.array([3])[0]]
TypeError: unhashable type: 'cupy._core.core.ndarray'
In the new version when we detect a cupy
integer array we cast the elements to int
.
Finally, the backprop
used to return a []
, but when the input is an xp array
then this can lead to confusing errors e.g.: if the remap_ids
is within a with_array
block as is the case with Embed
. Now it returns an empty array instead.
Why isn't column
implemented in the model in the same way as in HashEmbed
?
Why isn't
column
implemented in the model in the same way as inHashEmbed
?
How is it different?
Both HashEmbed
and the PR description look more like this:
https://github.com/explosion/thinc/blob/40c129f6ee51d4060f1af6b29f903ce51fd89736/thinc/layers/hashembed.py#L48-L53
It's not that I think the output here is incorrect, but I'm trying to understand why it's implemented differently.
Both
HashEmbed
and the PR description look more like this:https://github.com/explosion/thinc/blob/40c129f6ee51d4060f1af6b29f903ce51fd89736/thinc/layers/hashembed.py#L48-L53
It's not that I think the output here is incorrect, but I'm trying to understand why it's implemented differently.
In the HashEmbed
the init
creates a new model that's as the comment says
This is equivalent to array[:, column].
and I think this is convenient, because the backprop now works properly due to the chain
combinator combining column selection with the original embedding model. In remap_ids
we don't need backprop
so we can just do the simpler array[:, column].
Ah, I see what you mean. I was mainly thinking that implementing this the same way would avoid a lot of the casting and numpy/cupy problems.
In general I'm not sure this is going to work (easily) for Hashable
due to the types? It looks like util.to_numpy
converts everything to strings for your test case, and I think there may be cases where it would convert ints to floats for numeric input?
I wonder if this would be easier if it becomes remap_ids.v2
and is restricted to IntsXd
instead?
Ah, I see what you mean. I was mainly thinking that implementing this the same way would avoid a lot of the casting and numpy/cupy problems.
In general I'm not sure this is going to work (easily) for
Hashable
due to the types? It looks likeutil.to_numpy
converts everything to strings for your test case, and I think there may be cases where it would convert ints to floats for numeric input?I wonder if this would be easier if it becomes
remap_ids.v2
and is restricted toIntsXd
instead?
Hmm I guess you are right. But I'm thinking that having Ints1d
and Ints2d
is basically what we would need from the point of view of spaCy
, because we use the integer attributes there and not the strings themselves. But in general I think people would like to use remap_ids
with str2int
maps more likely (provided anyone would use this outside of spaCy
).
What do you think about implementing remap_ids
in such a way that it only supports Union[Sequence[str], Sequence[int], Ints1d, Ints2d]
and that's it?
I think that sounds reasonable.
I think that sounds reasonable.
Added the new version.
This version looks good! It does change the parameters and types enough that I think it should be remap_ids.v2
, though. I think we can just have both in parallel here?
This version looks good! It does change the parameters and types enough that I think it should be
remap_ids.v2
, though. I think we can just have both in parallel here?
I'm not sure what to have for .v1
. I think that changing
def backprop(dY: OutT) -> InT:
return []
to
def backprop(dY: OutT) -> InT:
if xp_input:
return model.ops.xp.empty(dY.shape)
else:
return []
is a bugfix that prevents remap_ids
to break when its in the with_array
context.
Same with idx = to_numpy(inputs)
, because values = [table.get(x, default) for x in idx]
doesn't actually work with cupy
integer arrays.
So to me it felt more like that .v1
was not tested within a pipeline and the with_array
bug didn't surface and was also not tested on the GPU so the cupy
bug didn't come up either.
Since I broke .v1
on the first try of using it with chain(remap_ids, Embed)
, I was kinda imagining that this new version is actually .v1
. You know what I mean? I kinda feel like this is the first working version.
I agree that the existing version is wonky, but the main issue is the parameters, since this removes dtype
and adds column
. If you hadn't removed the existing test it would break on dtype
, I think?
I agree that the existing version is wonky, but the main issue is the parameters, since this removes
dtype
and addscolumn
. If you hadn't removed the existing test it would break ondtype
, I think?
That's true, but I'm also not sure why the dtype
was added. The original typing was:
InT = Sequence[Any]
OutT = Ints2d
so like we could give dtype
float, but then this would return something not use able:
arr = model.ops.asarray2i(values, dtype=dtype)
output = model.ops.reshape2i(arr, -1, 1)
Not sure why we need to pass a dtype
to asarray2i
def asarray2i(
self, data: Union[Ints2d, Sequence[int]], *, dtype: Optional[DTypes] = "int32"
) -> Ints2d:
return cast(Ints2d, self.asarray(data, dtype=dtype))
So yeah I guess I find it a bit hard to understand what should I include in .v1
I don't think you need to spend much effort trying to clean up .v1
. It can be buggy/broken and we wouldn't touch it again.
For now I would propose keeping a copy of .v1
unmodified and just adding the new one as .v2
.
Ok I've added a legacy version in remap_ids.py
is this what you've meant?
Also please update the corresponding docs: https://thinc.ai/docs/api-layers#remap_ids
As a final step can you format the docs?
As a final step can you format the docs?
Sorry, I've forgot, what are we using to format the docs?
prettier: https://github.com/explosion/thinc/tree/master/website#setup-and-installation
And reformat again!
Let me do a last-minute renaming to have new _v2
instead of renamed _v1
.