OpenNMT-py

Feature Request - Support Continuous Features

Open AmitMY opened this issue 4 years ago • 10 comments

I have a text with multiple features, some of them discrete, some continuous, on multiple simultaneous tracks:

For example, look at this image: [image]

There are features for tilt or yes-no, and these features are not aligned to the lexical items. Also, notice how the words are spread out at irregular intervals.

I would like the ability to add continuous features like "start" and "end" times (in frames) such that my input looks like:

Words: JOHN|WORD|5|15 MAYBE|WORD|20|25 MOVE|WORD|45|53 ....

And additional tokens: FRONT|TILT|0|50 WIDE|APERT|0|100...

I want these continuous features to have a "positional"-like embedding so the words could interact with the additional tokens using self-attention.
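To make this more concrete, here is a rough sketch of what I mean, in plain PyTorch (all names here are illustrative, none of this is an existing OpenNMT-py API): continuous start/end frames get a sinusoidal, positional-style encoding that is added to the token embeddings.

```python
# Illustrative sketch only: give continuous "start"/"end" frame features a
# sinusoidal, positional-style encoding and add it to the token embeddings,
# so lexical tokens and the additional tokens can interact through time
# via self-attention.
import math
import torch
import torch.nn as nn


class ContinuousFeatureEncoding(nn.Module):
    def __init__(self, dim: int, max_value: float = 10000.0):
        super().__init__()
        self.dim = dim
        self.max_value = max_value

    def forward(self, values: torch.Tensor) -> torch.Tensor:
        # values: (batch, seq_len) continuous values, e.g. start frames
        half = self.dim // 2
        freqs = torch.exp(
            -math.log(self.max_value)
            * torch.arange(half, dtype=torch.float32, device=values.device)
            / half
        )
        angles = values.unsqueeze(-1).float() * freqs  # (batch, seq_len, half)
        return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)


# Toy usage: word embeddings plus encodings of the start and end frames.
embed = nn.Embedding(1000, 512)
enc = ContinuousFeatureEncoding(512)

token_ids = torch.tensor([[1, 2, 3]])        # JOHN MAYBE MOVE (illustrative ids)
start = torch.tensor([[5.0, 20.0, 45.0]])    # start frames
end = torch.tensor([[15.0, 25.0, 53.0]])     # end frames

x = embed(token_ids) + enc(start) + enc(end)  # (1, 3, 512) encoder input
```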

AmitMY avatar Nov 02 '20 10:11 AmitMY

Hi there,

  1. source features have been disabled in 2.0 for now (should be put back in but no ETA yet);
  2. once source features are put back, it should not be that hard to make some of those continuous. Let us know if you'd be willing to contribute, we can give you some pointers to help you get started.

(Edit: some people on the forum seem interested in putting back the source features as well -- https://forum.opennmt.net/t/opennmt-py-2-0-release/3962/5?u=francoishernandez)

francoishernandez avatar Nov 02 '20 11:11 francoishernandez

I could definitely try to contribute once you have source features back in. Thanks!

AmitMY avatar Nov 02 '20 11:11 AmitMY

Hi,

I've run into the same problem. I use POS tags and some other discrete labels as features of the input. Is there any workaround for source features right now?

GrangerLue avatar Jan 17 '21 15:01 GrangerLue


I'm doing some work that requires source features, so I guess if I'm going to do it I might as well do it the "right" way and contribute here. Could you provide some pointers to get started on this? Right now I'm just looking at the diffs between version 1.2 (when source features were last supported) and version 2.0 to figure out what needs to be worked on.

ongzexuan avatar Jan 29 '21 23:01 ongzexuan

Hey @ongzexuan, that would be awesome! Source features are historically handled via a _feature_tokenize function, which is still present in the code in 2.0, but a few parts are missing. The main missing piece in the new implementation is the vocab creation for such features. Indeed, in the prior paradigm, all the vocabs were created during the preprocessing step, while parsing the dataset(s). But now, vocab(s) must be prepared beforehand, and hence feature vocab(s) must be prepared at that point as well. For reference, you can check the legacy preprocess, where all the vocabs were created (by looping on fields): https://github.com/OpenNMT/OpenNMT-py/blob/legacy/onmt/bin/preprocess.py
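To illustrate the missing step, here is a minimal standalone sketch of that kind of per-feature vocab counting (not the actual onmt code; the "￨" separator is the one used to append word features, and the file names are just placeholders):

```python
# Sketch only: build one vocabulary per feature track from a corpus whose
# tokens carry "￨"-separated features, e.g. "eating￨verb". This is the kind
# of counting the legacy preprocess did while looping on fields, and which
# now has to happen before training.
from collections import Counter

FEAT_SEP = "￨"

def build_feature_vocabs(corpus_path):
    counters = []  # one Counter per feature track
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            for token in line.split():
                _, *feats = token.split(FEAT_SEP)
                while len(counters) < len(feats):
                    counters.append(Counter())
                for i, feat in enumerate(feats):
                    counters[i][feat] += 1
    return counters

if __name__ == "__main__":
    # placeholder file names
    for i, counter in enumerate(build_feature_vocabs("train.src")):
        with open(f"src_feat_{i}.vocab", "w", encoding="utf-8") as out:
            for feat, count in counter.most_common():
                out.write(f"{feat}\t{count}\n")
```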

cc @Zenglinxiao who might have some inputs on this as well.

francoishernandez avatar Jan 30 '21 11:01 francoishernandez

Hello @ongzexuan, if you want to contribute to this, you can start by looking at how the vocabulary fields are built and what a field is (especially onmt's text_fields). Currently, source features are disabled by setting the n_feat of each side to 0: https://github.com/OpenNMT/OpenNMT-py/blob/36748a5f280fbe781a86a82b7df85166796d49d7/onmt/inputters/fields.py#L11-L13 To enable them, we need to get the number of features on each side, as was done in 1.2, and provide the corresponding number of vocabulary files to build the fields. Once you get through this, while taking care of the tokenization, it should be functional.
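As a rough illustration of the first part (getting the feature number instead of hardcoding 0), something like this could be done by peeking at the corpus; names here are only illustrative, not the current onmt API:

```python
# Sketch only: infer n_src_feats from the corpus instead of hardcoding 0,
# so the corresponding number of feature fields / vocab files can be built.
FEAT_SEP = "￨"

def infer_n_feats(corpus_path: str) -> int:
    with open(corpus_path, encoding="utf-8") as f:
        tokens = f.readline().split()
    # "word￨pos￨ner" -> 2 features; an empty corpus has none
    return tokens[0].count(FEAT_SEP) if tokens else 0

n_src_feats = infer_n_feats("train.src")  # would replace the hardcoded 0
```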

Zenglinxiao avatar Feb 01 '21 12:02 Zenglinxiao

Thanks @francoishernandez @Zenglinxiao for the tips! One more thing: do we intend to have transformations defined on a per-feature basis? Or will all the (source) features share the same transformation pipeline?

I'm guessing it shouldn't be too hard to do the former once I've got the rest figured out, but I'm just wondering if there are use cases for this.

ongzexuan avatar Feb 01 '21 14:02 ongzexuan

That's a very good question. Technically, each Transform will receive a full example, with all its fields: https://github.com/OpenNMT/OpenNMT-py/blob/36748a5f280fbe781a86a82b7df85166796d49d7/onmt/transforms/transform.py#L44-L51 But for transforms that may split tokens, for instance, you will have to explicitly handle the additional features so they remain coherent with the tokens. E.g. when tokenizing into subwords, you will have to keep track of things like eating|verb -(tokenize)-> eat|verb -ing|verb.
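For instance, a naive way to keep them coherent would be to repeat each word's features on every subword piece it produces (standalone sketch, not a real Transform; the "￭" joiner marker depends on your tokenizer settings):

```python
# Sketch only: copy word-level features onto every subword piece of the
# corresponding word, e.g. "eating￨verb" -> "eat￨verb" "￭ing￨verb".
FEAT_SEP = "￨"

def propagate_features(tokens_with_feats, subword_fn):
    """tokens_with_feats: e.g. ["eating￨verb", "apples￨noun"]
    subword_fn: maps a plain word to its list of subword pieces."""
    out = []
    for token in tokens_with_feats:
        word, *feats = token.split(FEAT_SEP)
        for piece in subword_fn(word):
            out.append(FEAT_SEP.join([piece, *feats]))
    return out

# Toy usage with a dummy subword function:
print(propagate_features(["eating￨verb"], lambda w: ["eat", "￭ing"]))
# -> ['eat￨verb', '￭ing￨verb']
```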

francoishernandez avatar Feb 01 '21 14:02 francoishernandez

I see what you mean - extending your example, this probably gets more complicated when you consider that each feature might have a different tokenizer, which may generate different-length outputs for the various features that then need to be made coherent. I can see why the source features have been taken out now haha.

I'll take a closer look at this this weekend, thanks for all your input!

ongzexuan avatar Feb 02 '21 19:02 ongzexuan

Hi, Have you figured out how to handle tokenization for source features? What is the current status of this feature request?

Maybe we could use tokenizer.detokenize_with_ranges, which returns a mapping between the tokenized output and the original sentence, and use that info to project source features onto the corresponding subwords in an independent transform after tokenization. Do you find this doable?
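Roughly, such a transform could look like this (sketch only, assuming detokenize_with_ranges returns the detokenized text plus a token-index to character-span mapping, and that detokenization reproduces the original spacing):

```python
# Sketch only: project word-level features onto subword tokens by mapping
# each subword back to a character span in the detokenized sentence and
# finding which original word that span falls into.
import pyonmttok

tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)

def project_features(words, feats, subwords):
    """words: original whitespace tokens; feats: one feature per word;
    subwords: tokenizer output for ' '.join(words)."""
    _text, ranges = tokenizer.detokenize_with_ranges(subwords)
    # character offset at which each original word starts
    starts, pos = [], 0
    for w in words:
        starts.append(pos)
        pos += len(w) + 1  # +1 for the separating space
    projected = []
    for i in range(len(subwords)):
        rng = ranges.get(i)
        if rng is None:
            # no range for this token: fall back to the previous feature
            projected.append(projected[-1] if projected else feats[0])
            continue
        begin = rng[0]
        # last word starting at or before this subword's span
        word_idx = max(j for j, s in enumerate(starts) if s <= begin)
        projected.append(feats[word_idx])
    return projected

# Example usage: tokenize the sentence, then project the word features.
words = ["eating", "apples"]
feats = ["verb", "noun"]
subwords, _ = tokenizer.tokenize(" ".join(words))
projected = project_features(words, feats, subwords)  # one feature per subword
```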

anderleich avatar Jul 30 '21 09:07 anderleich