
optimize embedding layer

Open kparichay opened this issue 2 years ago • 2 comments

The embedding layer is essentially a hash table where certain elements are selected based on the input. Current design: load all the elements into memory, select the required elements, and do all the operations in memory. The forward pass involves a copy operation from the hash table, and the backward pass updates the corresponding entries.
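A minimal sketch of this current design (class and member names are illustrative, not nntrainer's actual API) could look like the following: the whole table stays resident in memory, forward gathers the selected rows, and backward updates only those rows.

```cpp
// Sketch of the current in-memory design (illustrative names only).
#include <algorithm>
#include <cstddef>
#include <vector>

struct EmbeddingInMemory {
  size_t vocab_size, dim;
  std::vector<float> table; // vocab_size * dim floats, fully in memory

  EmbeddingInMemory(size_t vocab, size_t d)
    : vocab_size(vocab), dim(d), table(vocab * d, 0.0f) {}

  // forward: copy the selected rows into the output buffer
  void forward(const std::vector<size_t> &ids, std::vector<float> &out) const {
    out.resize(ids.size() * dim);
    for (size_t i = 0; i < ids.size(); ++i)
      std::copy_n(&table[ids[i] * dim], dim, &out[i * dim]);
  }

  // backward: apply gradients only to the rows selected in forward
  void backward(const std::vector<size_t> &ids, const std::vector<float> &grad,
                float lr) {
    for (size_t i = 0; i < ids.size(); ++i)
      for (size_t j = 0; j < dim; ++j)
        table[ids[i] * dim + j] -= lr * grad[i * dim + j];
  }
};
```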

An issue with the above design is that the embedding layer can be quite big, and loading all of it into memory is not feasible. For example, vocab sizes range from 8k entries for small datasets to over 32k entries for larger datasets; some of the common BERT variants have around 30k entries.

Keeping the full hash table in memory therefore requires a significant amount of memory (remember that each entry in the embedding can range from tens to hundreds of bytes or more). Further, the existing design becomes even less effective when we consider the limited batch size of the on-device training scenario: only a handful of entries are actually needed per batch.

Proposed design 1: lazy loading - the embedding layer only loads the entries required by each element of the current batch. This significantly reduces the amount of memory required and still supports full functionality (see the sketch after this list). This has two issues:

  • common elements in the same batch would require loading the same entries from disk multiple times
  • the embedding layer is going to be slow due to the many disk accesses
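A rough sketch of design 1, assuming the table is stored row-major as raw floats in a flat binary file (the file layout and all names are illustrative):

```cpp
// Sketch of proposed design 1: lazy loading, one disk read per input id.
#include <cstddef>
#include <fstream>
#include <vector>

struct EmbeddingLazy {
  size_t dim;
  mutable std::ifstream file; // embedding weights stay on disk

  EmbeddingLazy(const char *path, size_t d)
    : dim(d), file(path, std::ios::binary) {}

  // forward: seek and read one row per input id; duplicate ids in the same
  // batch are re-read from disk, which is the main drawback of this design
  void forward(const std::vector<size_t> &ids, std::vector<float> &out) const {
    out.resize(ids.size() * dim);
    for (size_t i = 0; i < ids.size(); ++i) {
      file.seekg(static_cast<std::streamoff>(ids[i] * dim * sizeof(float)));
      file.read(reinterpret_cast<char *>(&out[i * dim]),
                static_cast<std::streamsize>(dim * sizeof(float)));
    }
  }
};
```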

Proposed design 2: lazy loading + cache - maintain a cache inside the embedding layer, evicted with a Least Recently Used (LRU) or Least Frequently Used (LFU) policy (see the sketch after this list). This has two benefits:

  • common elements in the same batch can now be served from the cache itself
  • frequently occurring elements will stay in the cache, eliminating any disk overhead for them
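A sketch of the row cache that design 2 would add in front of the disk reads, here with LRU eviction; all names are illustrative and the capacity is assumed to be at least one row.

```cpp
// Sketch of an LRU row cache for the lazy-loading embedding layer.
#include <cstddef>
#include <list>
#include <unordered_map>
#include <utility>
#include <vector>

class EmbeddingRowCache {
  size_t capacity;           // maximum number of cached rows (>= 1)
  std::list<size_t> lru;     // front = most recently used id
  std::unordered_map<
      size_t, std::pair<std::list<size_t>::iterator, std::vector<float>>>
      rows;

public:
  explicit EmbeddingRowCache(size_t cap) : capacity(cap) {}

  // return the cached row, loading it via load_row(id) on a miss
  template <typename LoadFn>
  const std::vector<float> &get(size_t id, LoadFn load_row) {
    auto it = rows.find(id);
    if (it != rows.end()) { // hit: move the id to the front of the LRU list
      lru.splice(lru.begin(), lru, it->second.first);
      return it->second.second;
    }
    if (rows.size() >= capacity) { // miss on a full cache: evict the LRU id
      rows.erase(lru.back());
      lru.pop_back();
    }
    lru.push_front(id);
    auto &entry = rows[id];
    entry.first = lru.begin();
    entry.second = load_row(id); // the disk read happens only here
    return entry.second;
  }
};
```

The lazy forward pass would then call get() once per input id with a small lambda that performs the disk read, so duplicates within a batch and frequently occurring ids are served from memory; swapping LRU for LFU only changes the eviction bookkeeping.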

The cache size decides how slow or fast the embedding layer behaves. When cache size == vocab size, this behaves the same as the current design: fastest, but with maximum memory consumption. When cache size == 0, this behaves as proposed design 1: least memory overhead, but the slowest.

Choosing an appropriate cache size is going to be key here, and it will depend on the dataset as well as the model:

  • cache size shouldn't be very small (especially with LRU) to avoid thrashing
  • cache size should be big enough to cache frequently occurring words
  • cache size should not exceed the maximum memory limit

kparichay avatar Sep 30 '21 00:09 kparichay

:octocat: cibot: Thank you for posting issue #1594. The person in charge will reply soon.

taos-ci avatar Sep 30 '21 00:09 taos-ci

One can even consider running the embedding layer very close to the data feeding pipeline because of the advantages below (see the sketch after this list):

  • just like data loading, the embedding layer is also disk-access heavy
  • multiple threads from the data loader will help run the embedding layer at full throttle
  • having a common cache for all the data samples being loaded at once will keep the cache healthier as well
  • it will even save further memory, as the original data will no longer need to be stored; the embeddings can be stored in memory directly
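A sketch of moving the embedding lookup into the data-loading threads, with one cache shared by all loaders (names are illustrative; eviction is omitted for brevity). Each loader converts token ids to embedding rows while reading the sample, so only the embedded vectors need to be handed to the training queue.

```cpp
// Sketch of a cache shared across data-loader threads.
#include <cstddef>
#include <mutex>
#include <unordered_map>
#include <vector>

struct SharedRowCache {
  std::mutex mtx;
  std::unordered_map<size_t, std::vector<float>> rows;

  // thread-safe lookup; load_row(id) performs the disk read on a miss
  template <typename LoadFn>
  std::vector<float> lookup(size_t id, LoadFn load_row) {
    std::lock_guard<std::mutex> lock(mtx);
    auto it = rows.find(id);
    if (it == rows.end())
      it = rows.emplace(id, load_row(id)).first;
    return it->second; // copy out while holding the lock
  }
};

// Inside each data-loader thread (outline only):
//   for every token id in the sample
//     append shared_cache.lookup(id, read_row_from_disk) to the sample
//   push the embedded sample to the training queue instead of the raw ids
```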

kparichay avatar Sep 30 '21 00:09 kparichay