deno compile produces slower ARM binaries than x86 binaries on Apple M1

Open taoeffect opened this issue 2 years ago • 1 comments

$ deno --version
deno 1.22.0 (release, x86_64-apple-darwin)
v8 10.0.139.17
typescript 4.6.2

Installed via Homebrew.

On an M1, the binary produced with --target aarch64-apple-darwin is consistently slower than the one produced by --target x86_64-apple-darwin.

On first run, --target x86_64-apple-darwin took 2.726 seconds as measured by the time command. On subsequent runs it took 0.097 seconds.

I missed the first run of --target aarch64-apple-darwin, but subsequent commands run consistently at about 0.31 seconds. This is more than 3x slower.

This is very surprising as one would expect the x86 binary to consistently be slower than the ARM binary.

taoeffect avatar Jun 22 '22 03:06 taoeffect

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Sep 08 '22 21:09 stale[bot]

Request: could issues be closed as wontfix instead of being marked stale?

taoeffect avatar Sep 16 '22 01:09 taoeffect

Is this still an issue? I cannot reproduce this on M1:

time deno compile a.js --target x86_64-apple-darwin
Compile file:///Users/divy/a.js
Emit a

________________________________________________________
Executed in  134.79 millis    fish           external
   usr time   27.88 millis    8.27 millis   19.61 millis
   sys time   65.96 millis    1.83 millis   64.13 millis
time deno compile a.js --target aarch64-apple-darwin
Compile file:///Users/divy/a.js
Emit a

________________________________________________________
Executed in  125.10 millis    fish           external
   usr time   29.78 millis    9.53 millis   20.25 millis
   sys time   67.12 millis    1.95 millis   65.18 millis

littledivy avatar Sep 16 '22 03:09 littledivy

@littledivy not the amount of time it takes to compile a file, but the amount of time it takes to run the resulting executable.

taoeffect avatar Sep 16 '22 03:09 taoeffect

Ok, it's impossible to tell without looking at the code being compiled. Just a console.log doesn't seem to reproduce it:

# x86_64
Hello

________________________________________________________
Executed in   50.00 millis    fish           external
   usr time   32.47 millis    0.24 millis   32.23 millis
   sys time   29.68 millis    1.99 millis   27.69 millis
# arm64
Hello

________________________________________________________
Executed in   38.38 millis    fish           external
   usr time   40.74 millis    0.29 millis   40.45 millis
   sys time   19.14 millis    2.22 millis   16.92 millis

littledivy avatar Sep 16 '22 03:09 littledivy

I downloaded the latest version of Deno, 1.25.3, and was able to reproduce the issue.

If you'd like to reproduce it yourself, here are the steps:

  1. Obtain an M1 or M2 computer.
  2. Clone https://github.com/okTurtles/chel
  3. Run: deno task build && deno task compile
  4. Unzip the two *-apple-darwin.tar.gz files in dist/
  5. Run time on the two binaries. Note that as stated in the issue, the first time you run time the x86 version will be slower, but on subsequent runs it will remain faster.
$ time ./dist/aarch64-apple-darwin/chel version
1.1.2
./dist/aarch64-apple-darwin/chel version  0.03s user 0.02s system 7% cpu 0.695 total
$ time ./dist/x86_64-apple-darwin/chel version 
1.1.2
./dist/x86_64-apple-darwin/chel version  0.03s user 0.02s system 1% cpu 2.771 total
$ time ./dist/aarch64-apple-darwin/chel version
1.1.2
./dist/aarch64-apple-darwin/chel version  0.03s user 0.01s system 23% cpu 0.178 total
$ time ./dist/x86_64-apple-darwin/chel version 
1.1.2
./dist/x86_64-apple-darwin/chel version  0.03s user 0.01s system 83% cpu 0.054 total
$ time ./dist/aarch64-apple-darwin/chel version
1.1.2
./dist/aarch64-apple-darwin/chel version  0.03s user 0.01s system 22% cpu 0.180 total
$ time ./dist/x86_64-apple-darwin/chel version 
1.1.2
./dist/x86_64-apple-darwin/chel version  0.04s user 0.01s system 77% cpu 0.063 total

The results are similar for a more complicated subcommand of chel too, btw.
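One way to read the `time` lines above is to compare CPU time with wall-clock time. The sketch below parses the zsh summary line and subtracts user plus system time from the total. `parseZshTime` and `offCpuSeconds` are hypothetical helper names, and the subtraction is only a rough indicator: it says how much of the run was spent off the CPU, not why.

```typescript
// Parsed fields of the summary line that zsh's `time` prints, e.g.
// "./chel version  0.03s user 0.01s system 23% cpu 0.178 total"
interface TimeLine {
  user: number;   // seconds of user CPU time
  system: number; // seconds of kernel CPU time
  total: number;  // wall-clock seconds
}

// Hypothetical helper: returns null when the line is not in the expected shape.
function parseZshTime(line: string): TimeLine | null {
  const m = line.match(/([\d.]+)s user ([\d.]+)s system \d+% cpu ([\d.]+) total/);
  if (m === null) return null;
  return { user: Number(m[1]), system: Number(m[2]), total: Number(m[3]) };
}

// Hypothetical helper: wall-clock time not accounted for by CPU time.
function offCpuSeconds(t: TimeLine): number {
  return t.total - (t.user + t.system);
}

// Two of the warm runs quoted above.
const arm = parseZshTime(
  "./dist/aarch64-apple-darwin/chel version  0.03s user 0.01s system 23% cpu 0.178 total",
);
const x86 = parseZshTime(
  "./dist/x86_64-apple-darwin/chel version  0.03s user 0.01s system 83% cpu 0.054 total",
);
if (arm !== null && x86 !== null) {
  console.log("arm64 off-CPU seconds:", offCpuSeconds(arm).toFixed(3));
  console.log("x86_64 off-CPU seconds:", offCpuSeconds(x86).toFixed(3));
}
```

For these two lines the CPU time is the same (0.04 s), so the whole difference in the totals sits in time spent off the CPU.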

taoeffect avatar Sep 19 '22 01:09 taoeffect

Ok, but I cannot reproduce the claim here: "This is more than 3x slower." The slow first run is expected because of Rosetta doing its thing / cold start.

Also, since we are measuring in <100ms, time can be very noisy depending on your system. Here are the results using hyperfine:

[image: hyperfine benchmark results, not preserved]
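To illustrate why a single `time` reading is weak evidence at this scale, here is a minimal sketch of summarizing repeated measurements as mean, standard deviation, and range. `summarize` is a hypothetical helper, the sample values are made up for illustration, and the use of the sample standard deviation (dividing by n - 1) is an assumption of this sketch rather than a statement about hyperfine's internals.

```typescript
interface Summary {
  mean: number;   // arithmetic mean, in ms
  stddev: number; // sample standard deviation (n - 1), in ms
  min: number;
  max: number;
  runs: number;
}

// Hypothetical helper: summarize a list of wall-clock samples in milliseconds.
function summarize(samplesMs: number[]): Summary {
  if (samplesMs.length < 2) {
    throw new Error("need at least two samples");
  }
  const runs = samplesMs.length;
  const mean = samplesMs.reduce((acc, x) => acc + x, 0) / runs;
  const sumSq = samplesMs.reduce((acc, x) => acc + (x - mean) * (x - mean), 0);
  return {
    mean,
    stddev: Math.sqrt(sumSq / (runs - 1)),
    min: Math.min(...samplesMs),
    max: Math.max(...samplesMs),
    runs,
  };
}

// Made-up samples, for illustration only.
const s = summarize([50, 52, 54]);
console.log(`${s.mean.toFixed(1)} ms ± ${s.stddev.toFixed(1)} ms over ${s.runs} runs`);
```

Two binaries are only worth comparing when the gap between their means is clearly larger than the spread of each.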

littledivy avatar Sep 19 '22 03:09 littledivy

> The slow first run is expected because rosetta doing its thing / cold start.

I mentioned this above and in the original post: yes, Rosetta will cause the x86 binary to be slower on the first run. But on subsequent runs the x86 binary is faster than the arm binary, when it should be the reverse.

> Ok but I cannot reproduce the claims here: This is more than 3x slower..

~~I just tried with hyperfine the exact same command, and it shows the x86 binary being over 15 times faster than the arm binary!~~

EDIT: After fixing my zsh shell (it was an x86 binary - now it's arm64), I rebuilt everything and re-ran the benchmark.

Now the arm binary is only 2.36x slower than the x86 binary. Still slower though. Updated benchmarks below:

$ hyperfine --warmup 2 './dist/x86_64-apple-darwin/chel version' './dist/aarch64-apple-darwin/chel version'
Benchmark 1: ./dist/x86_64-apple-darwin/chel version
  Time (mean ± σ):      53.2 ms ±   1.5 ms    [User: 30.7 ms, System: 6.0 ms]
  Range (min … max):    50.9 ms …  56.6 ms    44 runs
 
Benchmark 2: ./dist/aarch64-apple-darwin/chel version
  Time (mean ± σ):     125.7 ms ±  14.0 ms    [User: 28.9 ms, System: 3.9 ms]
  Range (min … max):   108.6 ms … 150.6 ms    20 runs
 
Summary
  './dist/x86_64-apple-darwin/chel version' ran
    2.36 ± 0.27 times faster than './dist/aarch64-apple-darwin/chel version'
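As a sanity check on the summary line, the ratio and its uncertainty can be recomputed from the two means and standard deviations with standard first-order error propagation for a quotient. `speedRatio` is a hypothetical helper. The formula is the textbook one and is only assumed to correspond to what hyperfine does, although it does reproduce the printed figure here.

```typescript
// Ratio of two means with first-order propagated uncertainty:
//   ratio = slow / fast
//   sigma = ratio * sqrt((slowSd / slow)^2 + (fastSd / fast)^2)
// Hypothetical helper, not part of hyperfine.
function speedRatio(
  slowMean: number,
  slowSd: number,
  fastMean: number,
  fastSd: number,
): { ratio: number; sigma: number } {
  const ratio = slowMean / fastMean;
  const relSlow = slowSd / slowMean;
  const relFast = fastSd / fastMean;
  const sigma = ratio * Math.sqrt(relSlow * relSlow + relFast * relFast);
  return { ratio, sigma };
}

// Values from the benchmark above: arm64 125.7 ± 14.0 ms, x86_64 53.2 ± 1.5 ms.
const result = speedRatio(125.7, 14.0, 53.2, 1.5);
console.log(`${result.ratio.toFixed(2)} ± ${result.sigma.toFixed(2)}`);
```

Most of the ± 0.27 comes from the arm64 run's 14.0 ms spread, which is itself much larger than the spread of the x86_64 run.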

taoeffect avatar Sep 19 '22 04:09 taoeffect

Updated my comment above with:

EDIT: After fixing my zsh shell (it was an x86 binary - now it's arm64), I rebuilt everything and re-ran the benchmark.

Now the arm binary is only 2.36x slower than the x86 binary. Still slower though. Updated benchmarks below:

taoeffect avatar Sep 21 '22 20:09 taoeffect

Update: I tried again using deno 1.32.4 to see if anything had changed regarding this, and here are the results:

-> % hyperfine --warmup 2 './dist/x86_64-apple-darwin/chel version' './dist/aarch64-apple-darwin/chel version'
Benchmark 1: ./dist/x86_64-apple-darwin/chel version
  Time (mean ± σ):      50.0 ms ±   4.8 ms    [User: 32.2 ms, System: 6.8 ms]
  Range (min … max):    42.5 ms …  69.8 ms    45 runs
 
Benchmark 2: ./dist/aarch64-apple-darwin/chel version
  Time (mean ± σ):     117.6 ms ±  10.2 ms    [User: 33.4 ms, System: 3.4 ms]
  Range (min … max):   101.7 ms … 144.1 ms    26 runs
 
Summary
  './dist/x86_64-apple-darwin/chel version' ran
    2.35 ± 0.30 times faster than './dist/aarch64-apple-darwin/chel version'
-> % lipo -info dist/aarch64-apple-darwin/chel 
Non-fat file: dist/aarch64-apple-darwin/chel is architecture: arm64
-> % lipo -info dist/x86_64-apple-darwin/chel 
Non-fat file: dist/x86_64-apple-darwin/chel is architecture: x86_64

You can see that I'm not confusing the binaries, as lipo outputs that the aarch64-apple-darwin binary is indeed arm64.
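For a scripted version of the same check, the architecture of a thin 64-bit Mach-O file can be read from the first 8 bytes of its header. This is a simplified sketch: `machoArch` is a hypothetical helper, it recognizes only the two CPU types relevant here, and it does not look inside fat (universal) files. `lipo` remains the more complete tool.

```typescript
// Mach-O constants, as defined in Apple's <mach-o/loader.h>,
// <mach-o/fat.h> and <mach/machine.h>.
const MH_MAGIC_64 = 0xfeedfacf;     // thin 64-bit Mach-O, read little-endian
const FAT_MAGIC = 0xcafebabe;       // universal binary, stored big-endian
const CPU_TYPE_X86_64 = 0x01000007;
const CPU_TYPE_ARM64 = 0x0100000c;

// Hypothetical helper: classify a file from its first 8 bytes.
function machoArch(header: Uint8Array): string {
  if (header.length < 8) return "unknown";
  const view = new DataView(header.buffer, header.byteOffset, header.byteLength);
  if (view.getUint32(0, false) === FAT_MAGIC) return "fat";
  if (view.getUint32(0, true) !== MH_MAGIC_64) return "unknown";
  const cpuType = view.getUint32(4, true); // cputype follows the magic
  if (cpuType === CPU_TYPE_ARM64) return "arm64";
  if (cpuType === CPU_TYPE_X86_64) return "x86_64";
  return "unknown";
}

// Synthetic header bytes, for illustration: 64-bit magic followed by the arm64 CPU type.
const sample = new Uint8Array([0xcf, 0xfa, 0xed, 0xfe, 0x0c, 0x00, 0x00, 0x01]);
console.log(machoArch(sample));
```

In a Deno script the header bytes could come from reading the start of the compiled binary. Note that this check covers only the binary under test, not the shell that launches it.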

Here's how we generate these binaries:

#!/usr/bin/env -S deno run --allow-run --allow-read=. --allow-write=./dist

import { sh } from '../src/deps.ts'

function $ (command: string) {
  return sh(command, { printOutput: true })
}

const { default: { version } } = await import('../package.json', { assert: { type: "json" } })

export async function compile () {
  // NOTE: Apple ARM is slower than x86 on M1!
  // https://github.com/denoland/deno/issues/14935
  const archs = ['x86_64-unknown-linux-gnu', 'x86_64-pc-windows-msvc', 'x86_64-apple-darwin', 'aarch64-apple-darwin']
  for (const arch of archs) {
    const dir = `./dist/tmp/${arch}`
    const bin = arch.includes('windows') ? 'chel.exe' : 'chel'
    // note: could also use https://examples.deno.land/temporary-files
    await $(`mkdir -vp ${dir}`)
    await $(`deno compile --allow-read=./ --allow-write=./  --allow-net --no-remote --import-map=vendor/import_map.json -o ${dir}/${bin} --target ${arch} ./build/main.js`)
    await $(`tar -C ./dist/tmp -czvf ./dist/chel-v${version}-${arch}.tar.gz ${arch}`)
  }
  await $(`sha256sum dist/chel-v${version}-*`)
  // TODO: sign the sha256sum! pipe this to gpg and include a link to your GPG key in the release notes!
}

try {
  await compile()
} catch (e) {
  console.error('caught:', e.message)
} finally {
  await sh(`rm -rf ./dist/tmp`)
}

The relevant line is:

await $(`deno compile --allow-read=./ --allow-write=./  --allow-net --no-remote --import-map=vendor/import_map.json -o ${dir}/${bin} --target ${arch} ./build/main.js`)
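To make it easier to see which part of the command varies per target, the same invocation can be expressed as an argument list. This is only a sketch: `compileArgs` is a hypothetical helper that is not part of the chel repository, and the flags are copied unchanged from the line above.

```typescript
// Hypothetical helper: build the `deno compile` argument list for one target.
// Only the output path and the --target value change between targets.
function compileArgs(arch: string, dir: string): string[] {
  const bin = /windows/.test(arch) ? "chel.exe" : "chel";
  return [
    "compile",
    "--allow-read=./",
    "--allow-write=./",
    "--allow-net",
    "--no-remote",
    "--import-map=vendor/import_map.json",
    "-o",
    `${dir}/${bin}`,
    "--target",
    arch,
    "./build/main.js",
  ];
}

// Print the commands for the two macOS targets being compared.
for (const arch of ["x86_64-apple-darwin", "aarch64-apple-darwin"]) {
  console.log(["deno", ...compileArgs(arch, `./dist/tmp/${arch}`)].join(" "));
}
```

Both macOS binaries are built from the same `./build/main.js` with identical flags, so `--target` is the only difference between them.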

taoeffect avatar Apr 17 '23 01:04 taoeffect

I tried again just now with Deno 1.39.2 and all of a sudden hyperfine results are looking correct:

hyperfine --warmup 2 './dist/x86_64-apple-darwin/chel version' './dist/aarch64-apple-darwin/chel version'
Benchmark 1: ./dist/x86_64-apple-darwin/chel version
  Time (mean ± σ):      93.1 ms ±   2.3 ms    [User: 75.2 ms, System: 22.6 ms]
  Range (min … max):    87.6 ms …  97.8 ms    31 runs
 
Benchmark 2: ./dist/aarch64-apple-darwin/chel version
  Time (mean ± σ):      53.2 ms ±   1.4 ms    [User: 46.7 ms, System: 12.0 ms]
  Range (min … max):    49.9 ms …  56.3 ms    54 runs
 
Summary
  ./dist/aarch64-apple-darwin/chel version ran
    1.75 ± 0.06 times faster than ./dist/x86_64-apple-darwin/chel version

So, closing 🤷‍♂️

Glad it seems to be working!

taoeffect avatar Jan 12 '24 20:01 taoeffect