
Add max-autotune for CPU, update profile and fix next token calculation

Open yanbing-j opened this issue 1 year ago • 5 comments

This PR adds max-autotune for CPU in torch.compile. It also splits first-token and next-token timings in the log output.
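As a rough illustration of the compile change (a configuration sketch, not the PR's exact code; the `Linear` model here is a stand-in for the transformer):

```python
import torch

# Stand-in model; torchchat compiles its decode path instead.
model = torch.nn.Linear(16, 16)

# mode="max-autotune" makes Inductor benchmark multiple kernel
# choices and pick the fastest, on CPU as well as GPU.
compiled = torch.compile(model, mode="max-autotune")
```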

yanbing-j avatar Aug 23 '24 08:08 yanbing-j

:link: Helpful Links

:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchchat/1055

Note: Links to docs will display an error until the docs builds have been completed.

:white_check_mark: No Failures

As of commit 74f921c8bfc79347804a26f2f1b8e51413a98c38 with merge base 8cb8a35d3f311f4889e872e3525bbdfe88947e94 (image): :green_heart: Looks good so far! There are no failures yet. :green_heart:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot[bot] avatar Aug 23 '24 08:08 pytorch-bot[bot]

Thanks for plumbing max_autotune through, @yanbing-j. That part looks great to me.

@vmpuri Can you give the profiling/token calculation a quick pass, though?

Jack-Khuu avatar Aug 25 '24 21:08 Jack-Khuu

@Jack-Khuu @vmpuri Thanks for the review! Let me clarify the profiling update and the next-token calculation fix.

For profiling, I added logic to print the profiling table for both CPU and GPU. For the next-token calculation: `t` includes both the first token (prefill) and the next tokens (decode_n_tokens), while `num_tokens_generated` counts only the next tokens, so `t` and `num_tokens_generated` do not match. I suspect this was a typo introduced when first-token time was added. I also print first-token latency and next-token latency separately in the log.
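The mismatch described above can be sketched in a few lines (hypothetical names mirroring the discussion, not torchchat's actual code):

```python
def report(prefill_time, decode_time, num_tokens_generated):
    # Buggy: divides the decode-only token count by the TOTAL time
    # (prefill + decode), understating the decode throughput.
    buggy_tok_per_sec = num_tokens_generated / (prefill_time + decode_time)
    # Fixed: report first-token latency and next-token throughput separately.
    first_token_latency = prefill_time
    next_token_throughput = num_tokens_generated / decode_time
    return buggy_tok_per_sec, first_token_latency, next_token_throughput

buggy, first, nxt = report(prefill_time=0.5, decode_time=2.0,
                           num_tokens_generated=100)
# buggy = 40.0 tok/s understates the decode rate; nxt = 50.0 tok/s is accurate.
```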

yanbing-j avatar Aug 26 '24 01:08 yanbing-j

@Jack-Khuu @vmpuri Could you please help review and merge this PR?

yanbing-j avatar Aug 28 '24 05:08 yanbing-j

@Jack-Khuu Thanks for the comments! Please review again!

yanbing-j avatar Aug 30 '24 08:08 yanbing-j

@Jack-Khuu Thanks for the review!

I have rebased on the main branch. I also exclude the JIT compilation time from tokens_sec, so the average throughput is more accurate. The log now also prints the averages of total throughput, first-token throughput, and next-token throughput.
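A minimal sketch of the averaging idea, assuming the first sample is the one that pays the one-time compile cost (hypothetical helper, not torchchat's code):

```python
def average_throughput(tokens_sec_per_sample):
    # Skip sample 0: it includes the one-time torch.compile warmup,
    # which would drag the average down.
    measured = tokens_sec_per_sample[1:]
    return sum(measured) / len(measured)

# First sample is dominated by compilation time.
samples = [2.1, 48.7, 50.2, 49.5, 51.0]
avg = average_throughput(samples)  # averages only the steady-state samples
```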

yanbing-j avatar Sep 03 '24 13:09 yanbing-j

Hi @Jack-Khuu , please help merge this PR. Thanks!

yanbing-j avatar Sep 04 '24 09:09 yanbing-j

Hi @Jack-Khuu , please help review and merge this PR. Thanks!

yanbing-j avatar Sep 05 '24 04:09 yanbing-j

Thanks for following up. I'm debugging some weird behavior with the output messages at the moment (on main)

Will merge this in once that's resolved

Jack-Khuu avatar Sep 05 '24 07:09 Jack-Khuu

@Jack-Khuu Thanks! All the CI passes. Please help me update the branch, because if I rebase, all the CI needs to run again.

yanbing-j avatar Sep 05 '24 08:09 yanbing-j

Thanks again for the changes @yanbing-j

Merging in (I'll tweak some nits in a separate PR)

Jack-Khuu avatar Sep 05 '24 20:09 Jack-Khuu

@yanbing-j, with these changes, I observed different behavior than before while running generate.py. I'm not sure whether it's because of this PR or because of other changes introduced in torchchat.

With this PR's commits merged onto torchchat's main branch, I see a lot of auto-tuning benchmarking results, even for the same shapes, after I run `python3 torchchat.py generate llama3.1 --prompt 'Hello my name is' --quantize '{"linear:int8": {"bitwidth": 8, "groupsize": 0}}' --compile --num-samples 5 --device cpu --tokenizer-path /localdisk/sanchitj/llama_3.1/original/tokenizer.model --max-autotune`

Is it expected behavior? Thanks!

sanchitintel avatar Sep 05 '24 20:09 sanchitintel

@yanbing-j, it turns out `torch._inductor.config.trace.log_autotuning_results = True` simply displays more auto-tuning results. That's fine, since auto-tuning is not re-run for duplicate input shapes; enabling this logging just prints duplicate data.

sanchitintel avatar Sep 05 '24 20:09 sanchitintel

@sanchitintel The logs you observed from autotuning is printed by setting torch._inductor.config.trace.log_autotuning_results = True.
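For context, the flag in question is an Inductor configuration toggle (config fragment only; shown as discussed above):

```python
import torch._inductor.config

# When True, Inductor prints the benchmark table for every
# autotuning decision, including entries for shapes it has
# already tuned, which makes the output very verbose.
torch._inductor.config.trace.log_autotuning_results = True
```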

yanbing-j avatar Sep 06 '24 01:09 yanbing-j

Thanks, @yanbing-j! That's what I meant.

Should we disable it, as it's too verbose? Even without `torch._inductor.config.trace.log_autotuning_results = True`, we still get benchmarking logs for all unique input shapes. Thanks!

sanchitintel avatar Sep 06 '24 01:09 sanchitintel

@sanchitintel Remove this config in https://github.com/pytorch/torchchat/pull/1112.

yanbing-j avatar Sep 06 '24 02:09 yanbing-j