Andy Lo
Suggestion for better docs: The order in which things are concatenated together in the residual stack isn't always clear. Example: https://github.com/neelnanda-io/TransformerLens/blob/829084a53836c5b8b388aa37a5ffce73b6371712/transformer_lens/ActivationCache.py#L1026-L1039 Specifically "... decomposition of the residual stream into **embed,...
Doesn't the compiler optimise it away anyway? (I haven't actually checked, though.) Besides, pointwise operations are bandwidth-limited, so adding or removing a few FLOPs shouldn't make a difference.
A temporary workaround is to save to a temp directory and copy the saved content to the remote file system, though this wouldn't work so easily with the checkpoint manager...
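A rough sketch of that workaround, in case it helps. Everything here is an assumption for illustration: `save_via_temp_dir` and `save_fn` are hypothetical names, not part of any library, and the "remote" destination is treated as a path that `shutil` can write to (e.g. a mounted file system). For a real object store you would swap the copy step for whatever upload call your storage client provides.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def save_via_temp_dir(save_fn: Callable[[Path], None], remote_dir: Path) -> None:
    """Run save_fn against a local temp directory, then copy the result to remote_dir.

    save_fn is a hypothetical stand-in for whatever save call only works
    on the local file system.
    """
    with tempfile.TemporaryDirectory() as tmp:
        local_dir = Path(tmp) / "checkpoint"
        local_dir.mkdir()

        # The save call only ever sees a local path.
        save_fn(local_dir)

        # Copy the finished files to the destination once saving is done.
        # Assumption: remote_dir is reachable through the normal file API.
        shutil.copytree(local_dir, remote_dir, dirs_exist_ok=True)
```

Usage would look something like `save_via_temp_dir(lambda p: (p / "weights.bin").write_bytes(data), Path("/mnt/remote/run1"))`. This only covers a one-off save; a checkpoint manager that tracks or deletes older checkpoints itself would still need its bookkeeping handled on the remote side.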