
memset is terrible and likely not performant (for the non-zeroing path)

cartazio opened this issue 5 years ago • 20 comments

It falls back to a scalar C loop when not zeroing, which seems disappointing. Also not sure if it correctly handles -0 in the floating-point code. (Note that the bytes of -0.0 are not all equal — only the sign byte is nonzero — so a repeated-single-byte memset can't produce it.)

cartazio avatar Mar 09 '20 19:03 cartazio

this is gonna be some fun hackery to do a sane portable performant thingy :)

cartazio avatar Mar 09 '20 19:03 cartazio

@bgamari under what circumstances does the GHC code gen do something smart with memset/memmove? As best I can recall, the only instances are internally generated or via the LLVM codegen?

cartazio avatar Mar 09 '20 19:03 cartazio

GHC has a primop for memset (setByteArray#) that is not currently used by primitive. There is a PR that switches to using this. In GHC 8.10, GHC is going to become smarter about unrolling this when the size is known and small.
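
For reference, a minimal sketch (not primitive's actual code; the wrapper name is made up) of what using that primop looks like:

{-# LANGUAGE MagicHash, UnboxedTuples #-}
import GHC.Exts (Int (I#), MutableByteArray#, RealWorld, setByteArray#)
import GHC.IO (IO (IO))

-- Fill len bytes starting at byte offset off with the byte c (which
-- must fit in an unsigned char). GHC lowers this to MO_Memset, which
-- the backend can unroll when the arguments are known constants.
fillBytes :: MutableByteArray# RealWorld -> Int -> Int -> Int -> IO ()
fillBytes mba (I# off) (I# len) (I# c) =
  IO (\s -> (# setByteArray# mba off len c s, () #))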

andrewthad avatar Mar 09 '20 20:03 andrewthad

Original issue on GHC is here and Artem's work that resolved the issue is here.

andrewthad avatar Mar 09 '20 20:03 andrewthad

I’ll look at how ghc does that primop and figure out what the different choices are. Thanks for helping out :)

cartazio avatar Mar 09 '20 22:03 cartazio

ok, so https://gitlab.haskell.org/ghc/ghc/blob/4898df1cc25132dc9e2599d4fa4e1bbc9423cda5/compiler/GHC/StgToCmm/Prim.hs#L2480-2496

-- ----------------------------------------------------------------------------
-- Setting byte arrays

-- | Takes a 'MutableByteArray#', an offset into the array, a length,
-- and a byte, and sets each of the selected bytes in the array to the
-- character.
doSetByteArrayOp :: CmmExpr -> CmmExpr -> CmmExpr -> CmmExpr
                 -> FCode ()
doSetByteArrayOp ba off len c = do
    dflags <- getDynFlags

    let byteArrayAlignment = wordAlignment dflags -- known since BA is allocated on heap
        offsetAlignment = cmmExprAlignment off
        align = min byteArrayAlignment offsetAlignment

    p <- assignTempE $ cmmOffsetExpr dflags (cmmOffsetB dflags ba (arrWordsHdrSize dflags)) off
    emitMemsetCall p c len align

and emitMemsetCall is defined as

-- | Emit a call to @memset@.  The second argument must fit inside an
-- unsigned char.
emitMemsetCall :: CmmExpr -> CmmExpr -> CmmExpr -> Alignment -> FCode ()
emitMemsetCall dst c n align = do
    emitPrimCall
        [ {- no results -} ]
        (MO_Memset (alignmentBytes align))
        [ dst, c, n ]

then for known constant arguments we get

https://gitlab.haskell.org/ghc/ghc/blob/4898df1cc25132dc9e2599d4fa4e1bbc9423cda5/compiler/nativeGen/X86/CodeGen.hs#L2222-2293

genCCall' dflags _ (PrimTarget (MO_Memset align)) _
         [dst,
          CmmLit (CmmInt c _),
          CmmLit (CmmInt n _)]
         _
    | fromInteger insns <= maxInlineMemsetInsns dflags = do
        code_dst <- getAnyReg dst
        dst_r <- getNewRegNat format
        if format == II64 && n >= 8 then do
          code_imm8byte <- getAnyReg (CmmLit (CmmInt c8 W64))
          imm8byte_r <- getNewRegNat II64
          return $ code_dst dst_r `appOL`
                   code_imm8byte imm8byte_r `appOL`
                   go8 dst_r imm8byte_r (fromInteger n)
        else
          return $ code_dst dst_r `appOL`
                   go4 dst_r (fromInteger n)
  where
    maxAlignment = wordAlignment dflags -- only machine word wide MOVs are supported
    effectiveAlignment = min (alignmentOf align) maxAlignment
    format = intFormat . widthFromBytes $ alignmentBytes effectiveAlignment
    c2 = c `shiftL` 8 .|. c
    c4 = c2 `shiftL` 16 .|. c2
    c8 = c4 `shiftL` 32 .|. c4

    -- The number of instructions we will generate (approx). We need
    -- one instruction per move.
    insns = (n + sizeBytes - 1) `div` sizeBytes

    -- The size of each move, in bytes.
    sizeBytes :: Integer
    sizeBytes = fromIntegral (formatInBytes format)

    -- Depending on size returns the widest MOV instruction and its
    -- width.
    gen4 :: AddrMode -> Integer -> (InstrBlock, Integer)
    gen4 addr size
        | size >= 4 =
            (unitOL (MOV II32 (OpImm (ImmInteger c4)) (OpAddr addr)), 4)
        | size >= 2 =
            (unitOL (MOV II16 (OpImm (ImmInteger c2)) (OpAddr addr)), 2)
        | size >= 1 =
            (unitOL (MOV II8 (OpImm (ImmInteger c)) (OpAddr addr)), 1)
        | otherwise = (nilOL, 0)

    -- Generates a 64-bit wide MOV instruction from REG to MEM.
    gen8 :: AddrMode -> Reg -> InstrBlock
    gen8 addr reg8byte =
      unitOL (MOV format (OpReg reg8byte) (OpAddr addr))

    -- Unrolls memset when the widest MOV is <= 4 bytes.
    go4 :: Reg -> Integer -> InstrBlock
    go4 dst left =
      if left <= 0 then nilOL
      else curMov `appOL` go4 dst (left - curWidth)
      where
        possibleWidth = minimum [left, sizeBytes]
        dst_addr = AddrBaseIndex (EABaseReg dst) EAIndexNone (ImmInteger (n - left))
        (curMov, curWidth) = gen4 dst_addr possibleWidth

    -- Unrolls memset when the widest MOV is 8 bytes (thus another Reg
    -- argument). Falls back to go4 when all 8 byte moves are
    -- exhausted.
    go8 :: Reg -> Reg -> Integer -> InstrBlock
    go8 dst reg8byte left =
      if possibleWidth >= 8 then
        let curMov = gen8 dst_addr reg8byte
        in  curMov `appOL` go8 dst reg8byte (left - 8)
      else go4 dst left
      where
        possibleWidth = minimum [left, sizeBytes]
        dst_addr = AddrBaseIndex (EABaseReg dst) EAIndexNone (ImmInteger (n - left))


otherwise it falls back to the C function call on a given platform.

cartazio avatar Mar 09 '20 23:03 cartazio

now to understand which implementations are used on which platforms!

https://github.com/freebsd/freebsd/blob/master/lib/libc/amd64/string/memset.S is the AMD64 implementation for FreeBSD, alongside the other related operations.

glibc, at least on AMD64, has a whole mess of fallback cases for different microarchitectures, as can be seen at https://github.com/bminor/glibc/tree/master/sysdeps/x86_64/multiarch or here https://sourceware.org/git/gitweb.cgi?p=glibc.git;a=tree;f=sysdeps/x86_64/multiarch;h=942a61db8fc4615bf1998c484d883720b82c1e70;hb=HEAD

cartazio avatar Mar 09 '20 23:03 cartazio

the default cutoffs for ghc-level unrolling in the native code gen are https://gitlab.haskell.org/ghc/ghc/blob/4898df1cc25132dc9e2599d4fa4e1bbc9423cda5/compiler/main/DynFlags.hs#L2121-2124


        maxInlineAllocSize = 128,
        maxInlineMemcpyInsns = 32,
        maxInlineMemsetInsns = 32,

so roughly, ghc native code gen can / will unroll a call to memset when it has constant arguments such that
(# of bytes set + native word size in bytes - 1) `div` (native word size in bytes) <= 32

on 64-bit amd64 that is (n + 8 - 1) `div` 8 <= 32. Since that is a ceiling division of n by 8, it rewrites to n <= 256, so fills of at most 256 bytes will get unrolled with this current scheme (using at most 32 GPR-flavored asm moves); otherwise it falls back to the libc call.
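
A quick sanity check of that cutoff, mirroring the insns formula quoted above (assuming 8-byte words and the default maxInlineMemsetInsns = 32):

-- Approximate number of MOVs the NCG emits for an n-byte fill
-- using word-sized (8-byte) stores.
insns :: Integer -> Integer
insns n = (n + 8 - 1) `div` 8

-- ghci> insns 256
-- 32      -- within the limit of 32, so the fill is unrolled
-- ghci> insns 257
-- 33      -- over the limit, so it becomes a memset call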

How was this threshold chosen? Fun aside: we could probably improve this now that we assume >= SSE2 on all AMD64/x86 platforms.

Also there's a question about whether the OS / libc flavor is acceptable on each platform.

cartazio avatar Mar 09 '20 23:03 cartazio

Interestingly, according to godbolt, clang optimizes to "call libc" for large constant args

https://godbolt.org/z/JBhCJz

cartazio avatar Mar 09 '20 23:03 cartazio

https://godbolt.org/z/dxwawW here's some smaller thresholds and how clang handles them

cartazio avatar Mar 10 '20 00:03 cartazio

https://godbolt.org/z/6Fs3pA there's some interesting variation in code selection depending on which cpu architecture is chosen

cartazio avatar Mar 10 '20 00:03 cartazio

ok, so I think for small enough arguments (at most 256 bytes on amd64 and 128 bytes on x86), ghc does a reasonable thing for this stuff. THOUGH, I think we'd get a genuine win from using SSE registers instead of GPRs, with respect to halving the code size. I think we have roughly the right unrolling threshold, at least if clang and gcc tuning are actually decent.

cc @andrewthad @bgamari @m37 (andreask?)

It does look like on platforms that aren't Linux, we might not be able to assume a GOOD optimized fallback C symbol? (though some benchmarking will be needed, I guess)

cartazio avatar Mar 10 '20 00:03 cartazio

my current thinking is: write the primitive stuff such that for known args, there's a branch that goes through the unrolling case in ghc, and we have a fallback for large stuff that's fancy, fast, and simple (see the sketch below)
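
Concretely, a hypothetical sketch of that shape (all names here are made up; hs_fast_memset is an imagined optimized C helper, not an existing symbol):

{-# LANGUAGE MagicHash, UnboxedTuples, UnliftedFFITypes #-}
import GHC.Exts (Int (I#), MutableByteArray#, RealWorld, setByteArray#)
import GHC.IO (IO (IO))
import Foreign.C.Types (CInt (..), CPtrdiff (..), CSize (..))

-- Hypothetical optimized C routine; an unsafe foreign call may take a
-- MutableByteArray#, which the C side sees as a pointer to the payload.
foreign import ccall unsafe "hs_fast_memset"
  c_fast_memset :: MutableByteArray# RealWorld -> CPtrdiff -> CInt -> CSize -> IO ()

-- Small fills take the primop path, which the NCG can unroll when the
-- length is a compile-time constant; large fills call the C routine.
setBytesSmart :: MutableByteArray# RealWorld -> Int -> Int -> Int -> IO ()
setBytesSmart mba off c len
  | len <= 256 = case (off, len, c) of
      (I# off#, I# len#, I# c#) ->
        IO (\s -> (# setByteArray# mba off# len# c# s, () #))
  | otherwise =
      c_fast_memset mba (fromIntegral off) (fromIntegral c) (fromIntegral len)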

cartazio avatar Mar 10 '20 00:03 cartazio

as best I can tell, the only OSS libc implementation which has any SSE/AVX/SIMD implementation is Linux/glibc; everything else just does hand-written assembly, (probably?) to prevent the C compiler from being too clever and replacing the mem* implementation with a call to itself!

edit: some implementations do a tiny c code thing

cartazio avatar Mar 10 '20 01:03 cartazio

zooming out: the real answer is benchmarks! :)

cartazio avatar Mar 10 '20 02:03 cartazio

I think the only reasonable thing to do in primitive is to use the primop that GHC provides. Optimizing the implementation further is a GHC issue, not a primitive issue. Having GHC use wider registers when possible might be a win, but then you won't be doing aligned writes (which I think doesn't really matter on more recent Intel processors), and you have to load the 128- or 256-bit word from memory rather than including it in the instruction, so for writing 16 or 24 bytes, this may not actually be better.

andrewthad avatar Mar 10 '20 10:03 andrewthad

The portable SIMD C is easy to write. The concern, which I need to do some measurements to determine, is how big the performance difference is for large non-constant-argument calls between the Agner Fog or glibc implementations versus, say, the FreeBSD ones, especially on really large memory ranges. If there's even a 20 percent difference on large inputs across multiple CPU microarchitectures, that's a pretty important thing.

GHC ABI and native support for SIMD are orthogonal to understanding the quality of the underlying libc fallback for non-constant arguments. Any improvements should of course be added to GHC too. But that's a very different release cadence.

cartazio avatar Mar 10 '20 15:03 cartazio

andrew: to clarify, I'm not concerned with the ghc unrolling case; I'm concerned with the performance characteristics for input sizes that fall back to the libc / system-provided memset and friends. (The conclusion may very well be that the simple libc code is fine on all the CPUs we care about, or not.) There's also a subtle mismatch between how we use the mem* operations in ghc / haskell today and how they're specified.

Also there are some nontrivial ways a more haskell-specific flavor would actually make a 2x difference for storable vector in some cases (at least asymptotically; not sure if the constant factors work out in practice)

cartazio avatar Mar 10 '20 17:03 cartazio

as a bit of context for why this actually should be done as performantly as possible for unknown-size / large buffers, here is how it comes up in storable vector: https://github.com/haskell/vector/blob/eeb42ad42aa345ce192086baed80c805bcfc3e72/Data/Vector/Storable/Mutable.hs#L202-L256 (a sketch of the trick used there follows)
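
For flavor, here is a sketch of the doubling-memcpy idea as I read it in that code, for element types where a repeated-byte memset doesn't apply (names here are mine, not vector's):

import Foreign (Ptr, Storable (poke, sizeOf), copyBytes, plusPtr)

-- Fill n elements at ptr with x: write element 0, then repeatedly
-- copy the initialized prefix onto the uninitialized suffix, doubling
-- each time -- O(log n) copy calls instead of n individual pokes.
fillByDoubling :: Storable a => Ptr a -> Int -> a -> IO ()
fillByDoubling ptr n x
  | n <= 0    = pure ()
  | otherwise = poke ptr x >> go 1
  where
    sz = sizeOf x
    go done
      | done >= n = pure ()
      | otherwise = do
          let chunk = min done (n - done)
          copyBytes (ptr `plusPtr` (done * sz)) ptr (chunk * sz)
          go (done + chunk)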

cartazio avatar Mar 11 '20 04:03 cartazio

I think the only reasonable thing to do in primitive is to use the primop that GHC provides.

~~Yes! And currently it doesn't. It uses memset directly, so doesn't benefit from any unrolling or from -fcheck-prim-bounds debug support.~~

I take it back, shimmedSetWord8Array# does use GHC.Exts.setByteArray#.

dcoutts avatar Mar 26 '24 14:03 dcoutts