
Vectors/Images/Video to Text Translation

AmitMY opened this issue 2 years ago · 2 comments

I would like to have sockeye for X-to-text translation. (X ∈ {text, vectors, image, video...})

What would the steps to do that be?


I understand prepare_data must get the source, but when the source is not text, there is nothing to prepare.

Also, unlike text, storage might be an issue. It is probably not advisable to store all the data again in the output directory, so I'm thinking it would be best for the source to be the file path, e.g. img.png or video.mp4.

In that case, one would also have to supply a loading/augmentation function to the training/inference code. I think there could be standard modules with load and process functions:

from typing import List, Tuple

import torch


def load(file_path: str):
  # Loads the file, returns an arbitrary object
  pass


def process(obj) -> List[Tuple[torch.Tensor, List[int]]]:
  # Takes the output of `load`, returns a sequence of
  # (vector, positions) pairs with an arbitrary number of positions
  pass

Then a training argument --source-processor my_module.py could be added, with the module imported using importlib. The load function can optionally be cached, and the process function can perform data augmentation or general processing.
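As a rough sketch of what importing such a module could look like (the helper name `load_source_processor` and the required function names are my assumptions, not an existing Sockeye API):

```python
# Hypothetical sketch: import a user-supplied --source-processor module
# from a file path using the standard library's importlib machinery.
import importlib.util


def load_source_processor(module_path: str):
    """Import a processor module and check it defines `load` and `process`."""
    spec = importlib.util.spec_from_file_location("source_processor", module_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    for name in ("load", "process"):
        if not hasattr(module, name):
            raise AttributeError(f"processor module must define `{name}`")
    return module
```

The training code would then call `module.load(path)` once per example (possibly cached) and `module.process(obj)` on every epoch, so augmentation can differ between epochs.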

Then in the model: if the source sequence has shape [BATCH, LENGTH] and dtype torch.int, it passes through an embedding layer; if it has shape [BATCH, LENGTH, DIM] and dtype torch.float, it is fed to the encoder as-is.
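This dispatch rule could look something like the following (an illustrative sketch, not actual Sockeye code; the function name is made up):

```python
# Sketch of the proposed dispatch: integer token ids are embedded,
# float feature vectors are passed through to the encoder unchanged.
import torch
import torch.nn as nn


def embed_source(source: torch.Tensor, embedding: nn.Embedding) -> torch.Tensor:
    if source.dim() == 2 and not source.is_floating_point():
        # [BATCH, LENGTH] integer ids -> [BATCH, LENGTH, DIM]
        return embedding(source)
    if source.dim() == 3 and source.is_floating_point():
        # [BATCH, LENGTH, DIM] float features are used as-is
        return source
    raise ValueError(
        f"unsupported source: shape={tuple(source.shape)}, dtype={source.dtype}"
    )
```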


Alternatively

Another option could be just supplying a torch DataLoader directly to training.

python train.py --data-loader my_loader.py --output OUTPUT

And the data loader would have two mandatory properties: SRC_VOCAB_SIZE and TGT_VOCAB_SIZE

Then in the training code, this data loader is used to sample training examples and to perform augmentation. In this case, Sockeye is used only for the model and training loop, not for its data loading.
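A minimal sketch of what such a user-supplied loader module could look like (all names, the toy data, and the `get_loader` entry point are assumptions for illustration, not an existing interface):

```python
# Hypothetical --data-loader module exposing the two mandatory
# properties proposed above plus a factory for a torch DataLoader.
import torch
from torch.utils.data import DataLoader, Dataset

SRC_VOCAB_SIZE = 0      # 0: continuous source vectors, no source vocabulary
TGT_VOCAB_SIZE = 32000  # size of the target (text) vocabulary


class PoseToTextDataset(Dataset):
    """Pairs of pose-vector sequences and target token-id sequences."""

    def __init__(self, sources, targets):
        self.sources = sources  # list of [LENGTH, DIM] float tensors
        self.targets = targets  # list of lists of target token ids

    def __len__(self):
        return len(self.sources)

    def __getitem__(self, idx):
        return self.sources[idx], torch.tensor(self.targets[idx])


def get_loader(batch_size: int = 1) -> DataLoader:
    # Toy in-memory data; a real module would read (and augment) files here.
    sources = [torch.zeros(5, 16), torch.zeros(7, 16)]
    targets = [[1, 2, 3], [4, 5]]
    return DataLoader(PoseToTextDataset(sources, targets), batch_size=batch_size)
```

With variable-length sequences, a real loader would also need a collate function that pads each batch; that detail is omitted here.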

AmitMY · Feb 26, 2022

Hi @AmitMY thank you for your interest in extending Sockeye.

Extending Sockeye to support other modalities as input could be a major change. In fact, an earlier version of Sockeye supported image-to-text models (see https://arxiv.org/pdf/1810.04101.pdf and https://github.com/awslabs/sockeye/tree/sockeye_1/sockeye/image_captioning), but it was difficult to maintain this additional complexity in the long run.

In order to give some advice, it would be good to understand a bit better what the scope of this feature/modality support is. I think the idea of supporting custom Python modules that do not necessarily have to be part of the main Sockeye repository is a good one, but it is not yet clear to me what the requirements/impact would be on the main codebase.

fhieber · Mar 3, 2022

For my use case, I would like to perform sign language translation using a sequence of pose vectors (sequence-of-vectors to text).

AmitMY · Mar 4, 2022

Closing for inactivity. Please feel free to reopen if there are any updates.

mjdenkowski · Dec 18, 2022

For others: https://github.com/bricksdont/sign-sockeye-baselines

AmitMY · Dec 19, 2022