
optimize tokenization/embedding

daniellawson9999 opened this issue on Aug 13, 2023 · 0 comments

Let's look at tokenize_input_dicts, which spans https://github.com/ManifoldRG/gato-control/blob/2deb510246ebd6b13dd53199f8de7df4e0b96f34/gato/policy/gato_policy.py#L193-L211 through https://github.com/ManifoldRG/gato-control/blob/2deb510246ebd6b13dd53199f8de7df4e0b96f34/gato/policy/gato_policy.py#L408,

which is used in GatoPolicy.forward to embed inputs before passing them to the transformer.

This function takes a list of dictionaries as input, where each dictionary corresponds to one sequence in the batch: a batch of size 256 has 256 dictionaries in the list, and each dictionary can use different keys to specify which modalities are present. As written, the function loops over these dictionaries and tokenizes/embeds each one separately. After each sequence is embedded, the results are joined together:

https://github.com/ManifoldRG/gato-control/blob/2deb510246ebd6b13dd53199f8de7df4e0b96f34/gato/policy/gato_policy.py#L383-L391
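
For reference, the loop-per-dictionary pattern described above looks roughly like the following. This is only a schematic sketch, not the actual gato_policy.py code; `embed_fn` is a hypothetical stand-in for the per-modality tokenize-and-embed step, and the padding at the end is just one possible way the per-sequence embeddings could be joined into a batch.

```python
import torch
import torch.nn as nn

# Illustrative sketch only -- not the gato-control implementation, just the
# pattern described above. `embed_fn` is a hypothetical stand-in for the
# per-modality tokenize + embed step.
def embed_input_dicts_looped(input_dicts, embed_fn):
    embedded_sequences = []
    for seq_dict in input_dicts:  # one dictionary per sequence in the batch
        modality_embeds = []
        for modality, data in seq_dict.items():
            # each modality is tokenized/embedded separately -> (timesteps, tokens, d)
            modality_embeds.append(embed_fn(modality, data))
        # join the modalities of this sequence along the token dimension
        embedded_sequences.append(torch.cat(modality_embeds, dim=1))
    # flatten (timesteps, tokens) per sequence and pad into a single batch tensor
    return nn.utils.rnn.pad_sequence(
        [e.flatten(0, 1) for e in embedded_sequences], batch_first=True
    )
```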

  1. It may be worthwhile to implement and benchmark an alternative form of embedding in which all input is reshaped and concatenated so that every trajectory can be embedded at once, i.e. a fully "vectorized" function. Implement this alternative and benchmark it against the current function (see the first sketch after this list).

  2. The function currently assumes that none of its inputs have been tokenized or embedded beforehand. It takes, say, continuous actions, tokenizes each dimension into discrete values, and then embeds those tokens. In other settings, however (e.g. training on text data), we commonly pre-tokenize the data once before training rather than repeatedly inside the training loop. Modify the function to allow a mix of tokenized and non-tokenized data, or consider the problem in other ways (see the second sketch below).
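
One possible shape for the vectorized alternative in point 1, sketched under the simplifying assumption that every dictionary in the batch shares the same keys and tensor shapes (the real function would need padding/masking for ragged inputs); `embed_fn` is again a hypothetical stand-in:

```python
import torch

# Hedged sketch of a "vectorized" embedding path: stack each modality across
# the whole batch and call the embedder once per modality instead of once per
# sequence. Assumes every dict shares the same keys and shapes; ragged batches
# would additionally need padding and attention masks.
def embed_input_dicts_vectorized(input_dicts, embed_fn):
    per_modality_embeds = []
    for modality in input_dicts[0].keys():
        # (batch, timesteps, ...) after stacking the same modality from every sequence
        stacked = torch.stack([d[modality] for d in input_dicts], dim=0)
        per_modality_embeds.append(embed_fn(modality, stacked))  # (batch, timesteps, tokens, d)
    # join modalities along the token dimension
    return torch.cat(per_modality_embeds, dim=2)
```

For the benchmark, timing both functions on identical synthetic batches (wrapping the timed region with torch.cuda.synchronize() when running on GPU) should be enough to tell whether the vectorized path pays off.

And a minimal sketch of how mixed tokenized/non-tokenized inputs could be accepted for point 2, assuming pre-tokenized data arrives as integer tensors; `tokenize_fn` and `embedding_table` are hypothetical stand-ins, not existing names in the repo:

```python
import torch

# Minimal sketch for point 2: skip tokenization when the caller already did it
# offline. Here "already tokenized" is inferred from an integer dtype; an
# explicit flag in the input dict would work just as well.
def embed_maybe_pretokenized(value, tokenize_fn, embedding_table):
    if value.dtype in (torch.int32, torch.int64):
        tokens = value.long()          # pre-tokenized ids, use as-is
    else:
        tokens = tokenize_fn(value)    # raw continuous input -> discrete tokens
    return embedding_table(tokens)     # e.g. an nn.Embedding lookup
```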
