vidur Can this simulator monitor the actual memory usage instead of memory reserved?

I am trying to study the scheduling algorithms used in the simulator. Some scheduling algorithms, like Orca, reserve the maximum memory for each request, but that memory is not all actually used. If the simulator could monitor the actual usage of HBM memory, like the metrics recorded in vLLM, it would help users study actual memory usage patterns.

Aug 15 '24 10:08 JasonZhang517

Hi @JasonZhang517, it is certainly possible to export the actual memory used by the requests instead of the reserved memory. For scheduling algorithms that do not use dynamic memory allocation, such as Orca and FasterTransformer, there is a considerable difference between the two.

There is a self._allocation_map in base_replica_scheduler.py which contains the request_ids of the requests that are currently allocated.
For each of these requests, request.total_tokens gives the number of tokens whose memory is actually in use.
Divide this by block_size to get the number of blocks in use. We also have the total number of blocks; dividing the blocks in use by that total gives the fraction of memory actually used by the replica.
Caveat: we are ignoring model weights and activation memory here.
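
The steps above can be sketched roughly as follows. This is a standalone illustration, not the simulator's code: the Request class is a stand-in, and num_total_blocks and both function names are hypothetical. Only total_tokens and block_size come from the description above. Rounding up to whole blocks is also an assumption (a partially filled block still occupies a full block).

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Request:
    # Stand-in for the simulator's request object (hypothetical).
    # Only total_tokens is taken from the description above.
    id: int
    total_tokens: int


def used_blocks(requests: Iterable[Request], block_size: int) -> int:
    """Blocks actually occupied by the tokens of the given requests."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    # Integer ceiling division: a partially filled block counts as one block
    # (assumption, see the note above).
    return sum(-(-r.total_tokens // block_size) for r in requests)


def kv_cache_utilization(
    requests: Iterable[Request], block_size: int, num_total_blocks: int
) -> float:
    """Fraction of KV cache blocks actually in use on a replica.

    Model weights and activation memory are ignored, as in the caveat above.
    """
    if num_total_blocks <= 0:
        raise ValueError("num_total_blocks must be positive")
    return used_blocks(requests, block_size) / num_total_blocks


if __name__ == "__main__":
    # Requests that would be looked up from the ids in the allocation map.
    allocated = [Request(id=0, total_tokens=10), Request(id=1, total_tokens=33)]
    print(kv_cache_utilization(allocated, block_size=16, num_total_blocks=8))
```

In the simulator, the requests passed in would be the ones whose ids are currently in self._allocation_map; how those ids are resolved to request objects depends on the scheduler and is not shown here.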

Please feel free to implement this metric and raise a PR. I'll be happy to review!

Aug 16 '24 09:08 nitinkedia7

@nitinkedia7 @JasonZhang517 We already track active memory usage (excluding activation memory). Currently we store only the average memory usage over time, but this can easily be adapted to store a time series.
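
To illustrate the difference between an average over time and a time series, here is a minimal sketch. The class name and its methods are hypothetical and do not reflect how the simulator's metrics store is written. It keeps every (time, usage) sample and derives the time-weighted average from them, assuming usage stays constant between samples.

```python
from typing import List, Tuple


class MemoryUsageSeries:
    """Hypothetical recorder: keeps the full time series of memory usage."""

    def __init__(self) -> None:
        self.samples: List[Tuple[float, float]] = []  # (time, usage)

    def record(self, time: float, usage: float) -> None:
        # Samples are expected in non-decreasing time order.
        if self.samples and time < self.samples[-1][0]:
            raise ValueError("samples must be recorded in time order")
        self.samples.append((time, usage))

    def time_weighted_average(self, end_time: float) -> float:
        """Average usage from the first sample up to end_time.

        Assumes usage is piecewise constant between consecutive samples.
        """
        if not self.samples:
            raise ValueError("no samples recorded")
        start_time = self.samples[0][0]
        if end_time <= start_time:
            raise ValueError("end_time must be after the first sample")
        total = 0.0
        # Pair each sample with the time of the next one (or end_time).
        next_times = [t for t, _ in self.samples[1:]] + [end_time]
        for (t, usage), t_next in zip(self.samples, next_times):
            total += usage * (t_next - t)
        return total / (end_time - start_time)
```

Storing the samples keeps the average recoverable while also exposing how usage changes over the run.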

Aug 16 '24 09:08 AgrawalAmey

@AgrawalAmey I believe the question is about the amount of KV cache memory that is actually being used by tokens, not just reserved.

Aug 24 '24 15:08 nitinkedia7