
Performance on Apple M1 Max

certik opened this issue · 3 comments

I am using the latest main (409c640916734053f48beb3f41cd535e91303d71) plus the following patch, which makes both PyTorch and fast_gpt2 run exactly the same model and the same prompt, generating 20 new tokens, with no CUDA in either:

diff --git a/src/lib.rs b/src/lib.rs
index 367e2ca..9eb9347 100644
--- a/src/lib.rs
+++ b/src/lib.rs
@@ -87,7 +87,7 @@ pub async fn run() -> Result<(), Gpt2Error> {
     #[cfg(not(feature = "dfdx"))]
     let gpt2 = Gpt2::from_tensors(&tensors, num_heads);
 
-    let string = "My name is";
+    let string = "Alan Turing theorized that computers would one day become very powerful, but even he could not imagine";
 
     let encoded = tokenizer.encode(string, false).unwrap();
     println!("Loaded & encoded {:?}", start.elapsed());
@@ -101,7 +101,7 @@ pub async fn run() -> Result<(), Gpt2Error> {
     let mut current_ids = ids.clone();
     #[cfg(feature = "cuda")]
     profiler_start()?;
-    for _i in 0..10 {
+    for _i in 0..20 {
         // println!("-------------");
         let start = std::time::Instant::now();
         let new_id = gpt2.forward(&current_ids, &mut past_key_values);
diff --git a/test.py b/test.py
index 608b4cf..5405733 100644
--- a/test.py
+++ b/test.py
@@ -4,7 +4,7 @@ start = datetime.datetime.now()
 import torch
 
 print(f"Loaded torch {datetime.datetime.now() - start}")
-torch.zeros((2, 2)).cuda()
+torch.zeros((2, 2))
 print(f"Loaded torch (cuda) {datetime.datetime.now() - start}")
 
 
@@ -13,12 +13,12 @@ from transformers import pipeline
 print(f"Loaded transformers {datetime.datetime.now() - start}")
 
 
-pipe = pipeline(task="text-generation", model="gpt2-large", do_sample=False, device=0)
-pipe.model.config.max_length = None
+pipe = pipeline(task="text-generation", model="gpt2", do_sample=False)
+#pipe.model.config.max_length = None
 print(f"Loaded in {datetime.datetime.now() - start}")
 inf_start = datetime.datetime.now()
-new_tokens = 10
-out = pipe("My name is", max_length=3 + new_tokens)
+new_tokens = 20
+out = pipe("Alan Turing theorized that computers would one day become very powerful, but even he could not imagine", max_new_tokens=new_tokens)
 print(f"Tokens: {(datetime.datetime.now() - inf_start)/new_tokens}/tokens")
 print(f"Inference took: {(datetime.datetime.now() - inf_start)}")
 print(out)

Here is what I got for fast_gpt2:

$ cargo run --example run --release    
    Finished release [optimized] target(s) in 0.11s
     Running `target/release/examples/run`
Safetensors 1.86ms
Tokenizer 31.226958ms
Loaded & encoded 461.879041ms
Loop in 156.600333ms
Loop in 80.137333ms
Loop in 80.596916ms
Loop in 81.4075ms
Loop in 79.844708ms
Loop in 81.373583ms
Loop in 82.741458ms
Loop in 107.9175ms
Loop in 83.611083ms
Loop in 80.898125ms
Loop in 84.577875ms
Loop in 84.253166ms
Loop in 84.087083ms
Loop in 85.110708ms
Loop in 85.1405ms
Loop in 84.291708ms
Loop in 84.722125ms
Loop in 84.515916ms
Loop in 84.030916ms
Loop in 84.704333ms
Result Ok("Alan Turing theorized that computers would one day become very powerful, but even he could not imagine how they would be able to do so.\n\n\"I think that the most important thing is")
Total Inference 2.222943541s

And PyTorch (installed from conda-forge):

$ TRANSFORMERS_OFFLINE=1 python test.py
Loaded torch 0:00:00.359938
Loaded torch (cuda) 0:00:00.360043
Loaded transformers 0:00:02.340165
Loaded in 0:00:04.140099
/Users/ondrej/mambaforge/envs/pico/lib/python3.9/site-packages/transformers/generation/utils.py:1186: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)
  warnings.warn(
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Tokens: 0:00:00.040217/tokens
Inference took: 0:00:00.804370
[{'generated_text': 'Alan Turing theorized that computers would one day become very powerful, but even he could not imagine how they would be able to do so.\n\n"I think that the most important thing is'}]
Ran in 0:00:04.944507

So fast_gpt2 runs inference in 2.2 s (about 84 ms per token after the first iteration), while PyTorch takes 0.8 s (about 40 ms per token).

To speed up fast_gpt2, we can use the fast matrix-matrix multiply from the Accelerate framework, as shown in https://github.com/Narsil/fast_gpt2/issues/10#issuecomment-1454851500.
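
For illustration, here is a minimal sketch of what binding Accelerate's CBLAS SGEMM from Rust could look like. This is an assumption on my part, not necessarily how the linked comment in #10 wires it up: the cblas_sgemm declaration and the matmul_accelerate helper below are hypothetical names, and the crate would need to link against the Accelerate framework (e.g. a build.rs emitting cargo:rustc-link-lib=framework=Accelerate).

// Sketch (assumption, not the approach from issue #10): call Accelerate's
// CBLAS single-precision GEMM directly from Rust.
// Requires linking the Accelerate framework, e.g. in build.rs:
//     println!("cargo:rustc-link-lib=framework=Accelerate");

extern "C" {
    // Standard CBLAS SGEMM; Accelerate provides this symbol on macOS.
    fn cblas_sgemm(
        layout: i32,   // 101 = CblasRowMajor
        trans_a: i32,  // 111 = CblasNoTrans
        trans_b: i32,
        m: i32, n: i32, k: i32,
        alpha: f32,
        a: *const f32, lda: i32,
        b: *const f32, ldb: i32,
        beta: f32,
        c: *mut f32, ldc: i32,
    );
}

/// C (m x n) = A (m x k) * B (k x n), all row-major; C is overwritten.
fn matmul_accelerate(m: usize, n: usize, k: usize, a: &[f32], b: &[f32], c: &mut [f32]) {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    assert_eq!(c.len(), m * n);
    unsafe {
        cblas_sgemm(
            101, 111, 111,                      // row-major, no transpose
            m as i32, n as i32, k as i32,
            1.0, a.as_ptr(), k as i32,          // alpha, A, lda
            b.as_ptr(), n as i32,               // B, ldb
            0.0, c.as_mut_ptr(), n as i32,      // beta, C, ldc
        );
    }
}

Routing the attention and MLP matmuls through a call like this would let Accelerate's optimized BLAS path on the M1 do the heavy lifting, which is the same idea the linked comment in #10 demonstrates.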

certik · Mar 04 '23 21:03