Potential Issue Regarding Profiling
Dear maintainers,
I recently tried the profiling workflow in the vidur project to collect profiling data on AWS EC2 instances. I used a P5 48xlarge instance (8x H100 GPUs in a DGX-style configuration) to profile CodeLlama-34b-Instruct-hf. The code I used was the main branch of vidur and the vidur branch of sarathi-serve. However, the profiling results I got differ significantly from those in the provided data folder.
I have attached the data I collected. I noticed the following differences and potential issues:
- The new profiling data uses flashinfer, while the reference data uses flash_attention.
- The new profiling data has additional columns for kv_cache_save.
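The column differences listed above can be checked with a short script. This is a minimal sketch, assuming both profiling files are plain CSVs with a header row. The helper names `read_header` and `diff_columns` and the sample column names are hypothetical placeholders, not the actual vidur profiling schema.

```python
import csv
import io


def read_header(csv_text):
    """Return the column names from the first row of CSV text."""
    reader = csv.reader(io.StringIO(csv_text))
    return next(reader, [])


def diff_columns(new_header, ref_header):
    """Return (only_in_new, only_in_ref) as two sorted lists."""
    new_cols, ref_cols = set(new_header), set(ref_header)
    return sorted(new_cols - ref_cols), sorted(ref_cols - new_cols)


# Placeholder CSV contents for illustration only; in practice, read the
# text of the new and reference attention CSVs from disk instead.
new_csv = "col_a,col_b,example_kv_cache_save\n1,2,3\n"
ref_csv = "col_a,col_b\n1,2\n"

only_new, only_ref = diff_columns(read_header(new_csv), read_header(ref_csv))
print("Columns only in new data:", only_new)
print("Columns only in reference data:", only_ref)
```

Running the same comparison on the real files would show exactly which columns the new profiling run adds or drops relative to the reference data.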
When I use my new profiling data, vidur's predictions differ significantly from those produced with the reference data. Could you please help me understand the correct profiling workflow?
New_H100_codellama_CodeLlama-34b-Instruct-hf_attention.csv
Thank you for your help.