OpenNMT-py
Feature Request - Support Continuous Features
I have a text with multiple features, some of them discrete, some continuous, on multiple simultaneous tracks:
For example, look at this image:
There are features for tilt or yes-no, and these features are not aligned to the lexical items.
Also, notice how the words are spread out at non-regular intervals.
I would like the ability to add continuous features like "start" and "end" times (in frames) such that my input looks like:
Words:
JOHN|WORD|5|15
MAYBE|WORD|20|25
MOVE|WORD|45|53
....
And additional tokens:
FRONT|TILT|0|50
WIDE|APERT|0|100
...
I want these continuous features to have a "positional"-like embedding so the words could interact with the additional tokens using self-attention.
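One common way to give continuous scalars such as start/end frames a positional-style representation is a sinusoidal encoding, in the spirit of Transformer positional encodings. This is a minimal sketch of the idea, not OpenNMT-py code; the function name and dimensions are illustrative:

```python
import numpy as np

def continuous_embedding(value, dim=8, max_period=10000.0):
    """Sinusoidal embedding of a continuous scalar (e.g. a frame index),
    analogous to Transformer positional encodings."""
    half = dim // 2
    freqs = max_period ** (-np.arange(half) / half)
    angles = value * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

# Embed the start/end frames of a token like JOHN|WORD|5|15
token, feat, start, end = "JOHN|WORD|5|15".split("|")
vec = np.concatenate([continuous_embedding(float(start)),
                      continuous_embedding(float(end))])
```

Such vectors could then be summed with (or concatenated to) the word embedding, so self-attention can relate tokens by their temporal overlap rather than their position in the sequence.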
Hi there,
- source features have been disabled in 2.0 for now (should be put back in but no ETA yet);
- once source features are put back, it should not be that hard to make some of those continuous. Let us know if you'd be willing to contribute, we can give you some pointers to help you get started.
(Edit: some people on the forum seem interested in putting back the source features as well -- https://forum.opennmt.net/t/opennmt-py-2-0-release/3962/5?u=francoishernandez)
I could definitely try to contribute once you have source features back in. Thanks!
Hi,
I ran into the same problem. I use POS tags and some other discrete labels as features of the input. Is there any substitute for source features right now?
I'm doing some work that requires source features, so I guess if I'm going to do it I might as well do it the "right" way and contribute here. Could you provide some pointers to get started on this? Right now I'm just looking at the diffs between version 1.2 (when source features were last supported) and version 2.0 to figure out what needs to be worked on.
Hey @ongzexuan
That would be awesome!
The source features are historically handled via a `_feature_tokenize` function, which is still present in the 2.0 code, but a few parts are missing.
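For reference, the legacy behaviour is roughly the following: a single pipe-delimited stream is split per feature layer, with layer 0 being the words themselves. This is a simplified sketch of the idea, not the exact 1.2 implementation:

```python
def feature_tokenize(string, layer=0, tok_delim=" ", feat_delim="|"):
    """Split a pipe-annotated sentence and keep only one feature layer:
    layer 0 is the word itself, layer 1 the first feature, etc."""
    tokens = string.split(tok_delim)
    if feat_delim is not None:
        tokens = [t.split(feat_delim)[layer] for t in tokens]
    return tokens

line = "JOHN|WORD|5|15 MAYBE|WORD|20|25"
feature_tokenize(line, layer=0)  # ['JOHN', 'MAYBE']
feature_tokenize(line, layer=2)  # ['5', '20']
```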
The main missing thing in the new implementation is the vocab creation for such features. Indeed, in the prior paradigm, all the vocabs were created during the preprocessing step, when parsing the dataset(s). But now, vocab(s) must be prepared beforehand. Hence, feature vocab(s) must be prepared at this point as well.
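Building a feature vocab ahead of time could look something like the sketch below. The function name, specials, and arguments are illustrative, not OpenNMT-py API:

```python
from collections import Counter

def build_feature_vocab(corpus_lines, layer, feat_delim="|",
                        min_freq=1, specials=("<unk>", "<pad>")):
    """Count values of one feature layer across a corpus and build
    a token -> id mapping before training starts."""
    counter = Counter()
    for line in corpus_lines:
        for tok in line.split():
            fields = tok.split(feat_delim)
            if layer < len(fields):
                counter[fields[layer]] += 1
    itos = list(specials) + [t for t, c in counter.most_common() if c >= min_freq]
    return {tok: i for i, tok in enumerate(itos)}

corpus = ["JOHN|WORD MAYBE|WORD", "FRONT|TILT WIDE|APERT"]
vocab = build_feature_vocab(corpus, layer=1)
# vocab maps '<unk>', '<pad>', 'WORD', 'TILT', 'APERT' to ids
```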
For reference, you can check the legacy preprocess, where all the vocabs were created (by looping on fields):
https://github.com/OpenNMT/OpenNMT-py/blob/legacy/onmt/bin/preprocess.py
cc @Zenglinxiao who might have some inputs on this as well.
Hello @ongzexuan, if you want to contribute to this, you can start by looking at how the vocabulary fields are built and what a field is (especially onmt's text_fields).
Currently, we disable the source features by setting the n_feat of each side to 0: https://github.com/OpenNMT/OpenNMT-py/blob/36748a5f280fbe781a86a82b7df85166796d49d7/onmt/inputters/fields.py#L11-L13
To enable them, we need to determine the feature count of each side as in 1.2, and provide the corresponding number of vocabulary files to build the fields. Once you get through this, while taking care of the tokenization, it should be functional.
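Determining the feature count from the data, as the 1.2 preprocessing had to do, amounts to counting delimiters on a sample line and checking that every token agrees. A hypothetical helper, not existing OpenNMT-py code:

```python
def count_features(sample_line, feat_delim="|"):
    """Infer the number of extra feature layers from a sample line,
    raising if the tokens disagree on how many features they carry."""
    counts = {tok.count(feat_delim) for tok in sample_line.split()}
    if len(counts) != 1:
        raise ValueError("inconsistent feature counts across tokens")
    return counts.pop()

count_features("eat|verb|0 the|det|1 cake|noun|2")  # 2
```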
Thanks @francoishernandez @Zenglinxiao for the tips! One more thing, do we intend to have transformations defined on a per feature basis? Or will all the (source) features share the same transformation pipeline?
I'm guessing it shouldn't be too hard to do the former once I've got the rest figured out, but just wondering if there's use cases for this.
That's a very good question.
Technically, each Transform will receive a full example, with all its fields:
https://github.com/OpenNMT/OpenNMT-py/blob/36748a5f280fbe781a86a82b7df85166796d49d7/onmt/transforms/transform.py#L44-L51
But, for transforms that may split tokens for instance, you will have to explicitly handle the additional features to remain coherent with the tokens. E.g. when tokenizing into subwords, you will have to keep track of things like eating|verb -(tokenize)-> eat|verb -ing|verb.
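The simplest policy for keeping features coherent after subword splitting is to replicate each word's features onto every piece it produces, as in the eating|verb example above. A sketch under that assumption (names are illustrative):

```python
def replicate_features(word_pieces, word_feats, feat_delim="|"):
    """Copy each word's features onto every subword produced from it,
    e.g. eating|verb -> eat|verb + ing|verb."""
    out = []
    for pieces, feats in zip(word_pieces, word_feats):
        for piece in pieces:
            out.append(feat_delim.join([piece] + feats))
    return out

pieces = [["eat", "ing"], ["an"], ["apple"]]
feats = [["verb"], ["det"], ["noun"]]
replicate_features(pieces, feats)
# ['eat|verb', 'ing|verb', 'an|det', 'apple|noun']
```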
I see what you mean. Extending your example, this probably gets more complicated when you consider that each feature might have a different tokenizer, which may generate outputs of different lengths that then need to be made coherent. I can see why the source features were taken out now, haha.
I'll take a closer look at this this weekend, thanks for all your input!
Hi, have you figured out how to handle tokenization for source features? What is the current status of this feature request?
Maybe we could use tokenizer.detokenize_with_ranges, which returns a mapping between the tokenized output and the original sentence, and use that info to project source features onto the corresponding subwords in an independent transform after tokenization. Does this seem doable?
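The projection step itself could work on character ranges alone. The sketch below assumes you already have per-subword (start, end) character offsets of the kind such a mapping would provide, plus the character spans of the original words; it is illustrative code, not tied to any tokenizer API:

```python
def project_features(token_ranges, word_spans, word_feats):
    """Assign each subword the feature of the source word whose
    character span contains the subword's (start, end) range
    (offsets are inclusive on both ends)."""
    out = []
    for start, end in token_ranges:
        for (w_start, w_end), feat in zip(word_spans, word_feats):
            if start >= w_start and end <= w_end:
                out.append(feat)
                break
        else:
            out.append("<unk>")  # no covering word found
    return out

# "eating" spans chars 0-5; its subwords "eat" (0-2) and "ing" (3-5)
project_features([(0, 2), (3, 5)], [(0, 5)], ["verb"])  # ['verb', 'verb']
```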