Vectors/Images/Video to Text Translation
I would like to use sockeye for X-to-text translation, where X ∈ {text, vectors, image, video, ...}. What would the steps to do that be?
I understand `prepare_data` must get the source, but when the input is not text, there is nothing to prepare.

Also, unlike text, storage might be an issue. It is probably not advisable to store all the data again in the output directory, so I'm thinking it would be best for the source to be the file path, e.g. `img.png` or `video.mp4`.
In that case, one would also have to supply a loading / augmentation function to the training / inference code. I think there could be standard modules with `load` and `process` functions:

```python
from typing import List, Tuple

import torch


def load(file_path: str):
    # Loads the file, returns whatever
    pass


def process(obj) -> List[Tuple[torch.Tensor, List[int]]]:
    # Takes the output of `load`, returns a sequence of vectors,
    # each with an arbitrary number of positions
    pass
```
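As a concrete sketch of this interface, here is a hypothetical `pose_processor.py` for a sequence-of-pose-vectors source stored as `.npy` files (the file format, module name, and output convention are all assumptions, not part of any existing API):

```python
# pose_processor.py -- hypothetical custom source-processor module
# following the proposed load/process interface.
from typing import List, Tuple

import numpy as np
import torch


def load(file_path: str) -> np.ndarray:
    # Load a pose sequence stored as a .npy array of shape [FRAMES, DIM].
    # Caching could be added here if repeated loading is a bottleneck.
    return np.load(file_path)


def process(obj: np.ndarray) -> List[Tuple[torch.Tensor, List[int]]]:
    # Convert the loaded array into the proposed output format:
    # one (vector, positions) pair per frame. Augmentation such as
    # jitter or frame dropout would also go here.
    return [(torch.from_numpy(frame).float(), [i])
            for i, frame in enumerate(obj)]
```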
Then add a training argument `--source-processor my_module.py`, which can be imported using `importlib`. The `load` function can be cached or not, and the `process` function can perform data augmentation or general processing.
Then in the model: if the sequence is of shape `[BATCH, LENGTH]` and of type `torch.int`, it passes through an embedding layer; otherwise it is of shape `[BATCH, LENGTH, DIM]` and of type `torch.float`, and is used as-is by the encoder.
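This dispatch could be sketched as a small adapter module (a minimal illustration of the shape/dtype rule above, not Sockeye code; the class name is made up):

```python
# Minimal sketch: token ids go through an embedding layer,
# float vector sequences are passed to the encoder unchanged.
import torch
import torch.nn as nn


class SourceAdapter(nn.Module):
    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, dim)

    def forward(self, source: torch.Tensor) -> torch.Tensor:
        if source.dim() == 2 and not source.is_floating_point():
            # [BATCH, LENGTH] of token ids -> [BATCH, LENGTH, DIM]
            return self.embedding(source)
        # [BATCH, LENGTH, DIM] of floats: used as-is
        return source
```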
Alternatively, another option could be to supply a torch `DataLoader` directly to training:

```
python train.py --data-loader my_loader.py --output OUTPUT
```
The data loader would have two mandatory properties: `SRC_VOCAB_SIZE` and `TGT_VOCAB_SIZE`. Then in the training code, this data loader is used to sample training batches and to perform augmentation. In this case, sockeye is used only for the model and the training loop, not for its data loading.
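A hypothetical `my_loader.py` under this option might look like the following (the class name, attribute semantics, and batch shapes are all assumptions for illustration):

```python
# Hypothetical user-supplied data loader: an iterable yielding
# (source, target) batches, plus the two mandatory vocabulary-size
# properties proposed above.
from typing import Iterator, Tuple

import torch


class MyLoader:
    SRC_VOCAB_SIZE = 0        # 0: source is continuous vectors, not tokens
    TGT_VOCAB_SIZE = 32000    # target side is still text

    def __iter__(self) -> Iterator[Tuple[torch.Tensor, torch.Tensor]]:
        # Yield (source, target) batches; here a single dummy batch of
        # pose vectors [BATCH, LENGTH, DIM] and token ids [BATCH, LENGTH].
        yield torch.zeros(2, 10, 137), torch.zeros(2, 12, dtype=torch.long)
```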
Hi @AmitMY, thank you for your interest in extending Sockeye.
Extending Sockeye to support other modalities as input could be a major change. In fact, an earlier version of Sockeye supported image-to-text models (see https://arxiv.org/pdf/1810.04101.pdf and https://github.com/awslabs/sockeye/tree/sockeye_1/sockeye/image_captioning), but it was difficult to maintain this additional complexity in the long run.
In order to give some advice, it would be good to understand a bit better what the scope of this feature/modality support is. I think the idea of supporting custom Python modules that do not necessarily have to be part of the main Sockeye repository is a good one, but it is not yet clear to me what the requirements/impact would be on the main codebase.
For my use case, I would like to perform sign language translation using a sequence of pose vectors (sequence-of-vectors to text).
Closing for inactivity. Please feel free to reopen if there are any updates.
For others: https://github.com/bricksdont/sign-sockeye-baselines