Should we default to OpenBLAS?

Open notmgsk opened this issue 4 years ago • 6 comments

The OpenBLAS implementation seems to be all-round better (at least on macOS), owing to concurrency and hardware-specific optimizations.

A (perhaps too) simple comparison:

With the system (macOS) default BLAS

(time (progn (reduce #'magicl:multiply-complex-matrices (mapcar (lambda (_) (quil::random-special-unitary 1000)) (list 1 2 3 4))) 1))

Evaluation took:
  53.799 seconds of real time
  53.642728 seconds of total run time (53.109868 user, 0.532860 system)
  [ Run times consist of 0.618 seconds GC time, and 53.025 seconds non-GC time. ]
  99.71% CPU
  145,903,290,806 processor cycles
  28 page faults
  5,055,004,624 bytes consed
  
1

With OpenBLAS

(time (progn (reduce #'magicl:multiply-complex-matrices (mapcar (lambda (_) (quil::random-special-unitary 1000)) (list 1 2 3 4))) 1))

Evaluation took:
  6.386 seconds of real time
  8.379127 seconds of total run time (7.713601 user, 0.665526 system)
  [ Run times consist of 0.482 seconds GC time, and 7.898 seconds non-GC time. ]
  131.21% CPU
  17,310,196,679 processor cycles
  5,078,847,024 bytes consed
  
1

notmgsk avatar Aug 06 '19 15:08 notmgsk
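
To put numbers on the comparison above, here is a minimal sketch in plain Common Lisp that derives the wall-clock speedup and the CPU utilization from the two TIME reports. The function and variable names are made up for illustration; the only inputs are the timings quoted above. The speedup comes out at roughly 8.4x, and a utilization above 100% indicates that more than one core was busy.

```lisp
;; Timings copied from the two TIME reports above, in seconds.
(defparameter *system-real*   53.799d0)     ; system BLAS, real time
(defparameter *system-run*    53.642728d0)  ; system BLAS, total run time
(defparameter *openblas-real* 6.386d0)      ; OpenBLAS, real time
(defparameter *openblas-run*  8.379127d0)   ; OpenBLAS, total run time

(defun speedup (baseline candidate)
  "Wall-clock speedup of CANDIDATE over BASELINE."
  (/ baseline candidate))

(defun cpu-utilization (run-time real-time)
  "Total run time divided by real time, as a percentage.
Values above 100 mean that more than one core was busy."
  (* 100 (/ run-time real-time)))

(format t "speedup:           ~,2Fx~%"
        (speedup *system-real* *openblas-real*))
(format t "CPU (system BLAS): ~,2F%~%"
        (cpu-utilization *system-run* *system-real*))
(format t "CPU (OpenBLAS):    ~,2F%~%"
        (cpu-utilization *openblas-run* *openblas-real*))
```

The two utilization figures reproduce the 99.71% and 131.21% that SBCL printed, which is consistent with the OpenBLAS run spreading work across cores.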

I don’t find this simple example too simple at all.

ecpeterson avatar Aug 06 '19 16:08 ecpeterson

OpenBLAS is the direct descendant of GotoBLAS. Intel hired Kazushige Goto, the original author of GotoBLAS, and shortly thereafter the Intel MKL BLAS implementation became as performant as GotoBLAS. I think OpenBLAS and Intel MKL's BLAS are the fastest in town at the moment.

jmbr avatar Aug 06 '19 16:08 jmbr

> I think OpenBLAS and Intel MKL's BLAS are the fastest in town at the moment.

Neat. The former is FOSS, the latter is FAIB (free as in beer).

notmgsk avatar Aug 06 '19 16:08 notmgsk

Also, this reminds me that I have a PR to submit that would allow us to reduce the consing significantly. Example (perhaps modulo row-major vs column-major order):

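CL-USER> (defparameter *matrix* (magicl:random-matrix 2048 2048)) ; assumed: not shown in the original post; 2048x2048 matches the ~67 MB consed below
CL-USER> (defparameter *vector* (magicl:random-matrix 2048 1))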
CL-USER> (time (let* ((matrix-data (magicl::matrix-data *matrix*))
                      (m (magicl:matrix-rows *matrix*))
                      (n (magicl:matrix-cols *matrix*))
                      (vector-data (magicl::matrix-data *vector*))
                      (result (make-array n :element-type '(complex double-float) :initial-element #c(0.0d0 0.0d0))))
                 (magicl.blas-cffi:%zgemv "N" m n #c(1.0d0 0.0d0) matrix-data m vector-data 1 #c(0.0d0 0.0d0) result 1)))
Evaluation took:
  0.007 seconds of real time
  0.048209 seconds of total run time (0.047635 user, 0.000574 system)
  685.71% CPU
  19,543,927 processor cycles
  8,576 bytes consed
CL-USER> (progn
           (sb-ext:gc :full t)
           (time (magicl:multiply-complex-matrices *matrix* *vector*))
           (values))
Evaluation took:
  0.041 seconds of real time
  0.063840 seconds of total run time (0.047902 user, 0.015938 system)
  [ Run times consist of 0.003 seconds GC time, and 0.061 seconds non-GC time. ]
  156.10% CPU
  111,235,118 processor cycles
  67,141,680 bytes consed

jmbr avatar Aug 06 '19 17:08 jmbr

Yowza! That CPU usage is impressive (and the consing). Is that MKL?

notmgsk avatar Aug 06 '19 17:08 notmgsk

According to dtruss, SBCL is picking up /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib, which should be the library bundled with macOS.

jmbr avatar Aug 06 '19 17:08 jmbr
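
As a complement to dtruss, the foreign libraries that CFFI has loaded can also be listed from the REPL. This is a minimal sketch, assuming that magicl loads BLAS through CFFI (the `magicl.blas-cffi` package name suggests so) and that CFFI is already present in the image. `loaded-foreign-library-paths` is a hypothetical helper name, not part of magicl or CFFI.

```lisp
;; Assumes CFFI is already loaded in the image (for example because
;; magicl has been loaded); otherwise the CFFI: symbols will not read.
(defun loaded-foreign-library-paths ()
  "Return a list of (NAME . PATHNAME) pairs, one for each foreign
library that CFFI reports as currently loaded."
  (mapcar (lambda (lib)
            (cons (cffi:foreign-library-name lib)
                  (cffi:foreign-library-pathname lib)))
          (cffi:list-foreign-libraries :loaded-only t)))

;; Print one line per loaded library.
(dolist (entry (loaded-foreign-library-paths))
  (format t "~&~A -> ~A~%" (car entry) (cdr entry)))
```

The pathname CFFI records may be a bare library name rather than a fully resolved path, in which case a system-level tool such as dtruss is still the more reliable way to see which file the dynamic loader actually opened.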