optimize tokenization/embedding
Let's look at tokenize_input_dicts, which spans https://github.com/ManifoldRG/gato-control/blob/2deb510246ebd6b13dd53199f8de7df4e0b96f34/gato/policy/gato_policy.py#L193-L211 to https://github.com/ManifoldRG/gato-control/blob/2deb510246ebd6b13dd53199f8de7df4e0b96f34/gato/policy/gato_policy.py#L408
and is used in GatoPolicy.forward to embed inputs before they are passed to the transformer.
This function takes a list of dictionaries as input, where each dictionary corresponds to one sequence in the batch. A batch of size 256 would have 256 dictionaries in the list, and each dictionary can use different keys to specify which modalities are present. As written, the function loops over these dictionaries and tokenizes/embeds each input separately. After each one is embedded, they are joined together:
https://github.com/ManifoldRG/gato-control/blob/2deb510246ebd6b13dd53199f8de7df4e0b96f34/gato/policy/gato_policy.py#L383-L391
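To make the current pattern concrete, here is a heavily simplified sketch of per-sequence tokenization/embedding followed by padding and stacking. It is not the actual GatoPolicy code; the names `tokenize_continuous`, `token_embedding`, `embed_one_sequence`, and `embed_batch_looped`, the bin count, and the single "continuous_actions" modality are all illustrative assumptions.

```python
import torch
import torch.nn as nn

embed_dim = 8
continuous_bins = 1024
token_embedding = nn.Embedding(continuous_bins, embed_dim)

def tokenize_continuous(x: torch.Tensor) -> torch.Tensor:
    # Hypothetical discretization of continuous values into [0, continuous_bins) bins.
    x = x.clamp(-1.0, 1.0)
    return ((x + 1.0) / 2.0 * (continuous_bins - 1)).long()

def embed_one_sequence(input_dict: dict) -> torch.Tensor:
    # Tokenize then embed each modality present in this sequence's dict.
    parts = []
    if "continuous_actions" in input_dict:
        tokens = tokenize_continuous(input_dict["continuous_actions"])
        parts.append(token_embedding(tokens))
    # ... other modalities (images, text, discrete actions) would be handled similarly
    return torch.cat(parts, dim=0)  # (num_tokens, embed_dim)

def embed_batch_looped(input_dicts: list[dict]) -> torch.Tensor:
    # Current pattern: loop over the batch, embed each sequence separately,
    # then pad to a common length and stack.
    embedded = [embed_one_sequence(d) for d in input_dicts]
    max_len = max(e.shape[0] for e in embedded)
    padded = [torch.nn.functional.pad(e, (0, 0, 0, max_len - e.shape[0])) for e in embedded]
    return torch.stack(padded, dim=0)  # (batch, max_len, embed_dim)
```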
-
It may be worthwhile to implement and benchmark an alternative form of embedding where all inputs are instead reshaped and concatenated so that every trajectory can be embedded at once, i.e. switching to a fully "vectorized" function. Implement such a function and benchmark it against the current one.
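Continuing the toy setup above, a vectorized variant and a minimal timing comparison might look like the following. This assumes every sequence in the batch shares the same modalities and shapes, which is the main simplification the real implementation would need to handle (e.g. via padding or bucketing); `embed_batch_vectorized` and the benchmark harness are hypothetical.

```python
import time
import torch

def embed_batch_vectorized(input_dicts: list[dict]) -> torch.Tensor:
    # Vectorized sketch: stack all sequences first, then tokenize and embed
    # the whole batch in single calls.
    actions = torch.stack([d["continuous_actions"] for d in input_dicts], dim=0)
    tokens = tokenize_continuous(actions)   # (batch, seq_len)
    return token_embedding(tokens)          # (batch, seq_len, embed_dim)

# Minimal benchmark on synthetic data (256 sequences of 128 continuous actions).
batch = [{"continuous_actions": torch.rand(128) * 2 - 1} for _ in range(256)]

def bench(fn, n_iters: int = 50) -> float:
    fn(batch)  # warm-up
    start = time.perf_counter()
    for _ in range(n_iters):
        fn(batch)
    return (time.perf_counter() - start) / n_iters

print(f"looped:     {bench(embed_batch_looped) * 1e3:.2f} ms/iter")
print(f"vectorized: {bench(embed_batch_vectorized) * 1e3:.2f} ms/iter")
```

A real benchmark should also cover mixed-modality batches and GPU execution, where the gap between many small kernel launches and one large one is usually most pronounced.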
-
It is assumed that all inputs are neither tokenized nor embedded prior to this function. The function takes, say, continuous actions and tokenizes each dimension into discrete values; these tokens are then embedded. However, in other settings, say training on text data, we commonly pre-tokenize the data before training rather than repeatedly inside the training loop. Modify the function to allow mixing tokenized and non-tokenized data, or consider this problem in other ways.
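One possible convention, again reusing the toy helpers from the first sketch, is to let a sequence dict carry either raw values or already-tokenized values and skip the tokenization step for the latter. The key name "continuous_actions_tokens" and the `embed_actions` helper are assumptions for illustration, not the repo's actual API.

```python
import torch

def embed_actions(input_dict: dict) -> torch.Tensor:
    # If pre-tokenized actions are supplied, embed them directly;
    # otherwise tokenize the raw continuous actions first.
    if "continuous_actions_tokens" in input_dict:
        tokens = input_dict["continuous_actions_tokens"]
    elif "continuous_actions" in input_dict:
        tokens = tokenize_continuous(input_dict["continuous_actions"])
    else:
        raise KeyError("no action modality present in input dict")
    return token_embedding(tokens)

# Both paths should produce identical embeddings for the same underlying data:
raw = {"continuous_actions": torch.rand(16) * 2 - 1}
pre = {"continuous_actions_tokens": tokenize_continuous(raw["continuous_actions"])}
assert torch.allclose(embed_actions(raw), embed_actions(pre))
```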