Chris Elrod

Results 832 comments of Chris Elrod

Thanks for the report. The [vmap](https://github.com/chriselrod/LoopVectorization.jl/blob/c21c174f0f6676ea9098e632d9bd79e5fb51e885/src/map.jl#L30) code is actually fairly simple and isolated from the rest of LoopVectorization. Hopefully it's rather accessible for anyone who wants to take a look....

Hmm, I get

```julia
julia> @btime test_sum_turbo($D)
  94.021 ms (0 allocations: 0 bytes)
9.999797263890678e7

julia> @btime test_sum_turbo($D_bit)
  14.198 ms (0 allocations: 0 bytes)
99995026

julia> @btime test_sum_turbo($D_bool)
  15.336 ms (0...
```

> edit: Yup, Ryzen 9 5000 series doesn't have AVX512 support. That'll account for the entire discrepancy? Is this going to keep slowing me down by factors of 5-10x? I'm...

Sure, a PR would be welcome. The assembly isn't great for the other methods either. You could leave this PR open to track that. I have an idea of something...

> If you want to explain it, I'd be happy to give it a shot.

Basically, I think this copying is related to the duplications we get in the phi...

> maybe `@turbo` doesn't like the CartesianIndex being dynamically created?

Yes. `@turbo` actually doesn't like `CartesianIndex` at all. It has special handling for `CartesianIndices`, where it turns `CartesianIndices{N}` into `N`...
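For anyone following along, here is a rough plain-Julia sketch of the equivalence this kind of handling relies on. It assumes the truncated sentence refers to `N` separate loops, which the excerpt does not confirm. It does not use `@turbo` or show what LoopVectorization generates, and `sum_cartesian` and `sum_nested` are made-up names for illustration only.

```julia
# Loop over a 2-dimensional CartesianIndices object with a single `for`.
function sum_cartesian(A::AbstractMatrix)
    s = zero(eltype(A))
    for I in CartesianIndices(A)
        s += A[I]
    end
    return s
end

# The same traversal written as two explicit loops (N = 2).
# The first index varies fastest, matching CartesianIndices iteration order.
function sum_nested(A::AbstractMatrix)
    s = zero(eltype(A))
    for j in axes(A, 2), i in axes(A, 1)
        s += A[i, j]
    end
    return s
end

A = reshape(1:12, 3, 4)
result_cartesian = sum_cartesian(A)
result_nested = sum_nested(A)
```

Both functions visit the same elements in the same order. The nested form has no `CartesianIndex` object at all, only plain integer indices.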

> The difference in total sums is probably due to fastmath, but can be surprisingly different sometimes.

Differences that big are definitely bugs. The bug in the first three was...

This is hard to solve, because LV works in parallel. Thus if it tries to read, increment, and write to the same index of `h` in parallel, the answer will...
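To make the hazard concrete, here is a minimal plain-Julia model of it. It is only a sketch, not what LV actually emits: `hist_lanes!` is a hypothetical helper that imitates a width-`W` vectorized loop by loading all `W` values first, then writing them back, and the index vector is made up for the example.

```julia
# Serial reference: each iteration sees the write made by the previous one.
function hist_serial!(h, idx)
    for i in eachindex(idx)
        h[idx[i]] += 1
    end
    return h
end

# Hypothetical model of a width-W vectorized loop (an assumption for illustration):
# all W loads happen before any of the W stores.
function hist_lanes!(h, idx, W)
    for start in firstindex(idx):W:lastindex(idx)
        lanes = start:min(start + W - 1, lastindex(idx))
        loaded = [h[idx[i]] for i in lanes]   # "gather" the old counts
        for (k, i) in enumerate(lanes)        # "scatter" old count + 1
            h[idx[i]] = loaded[k] + 1
        end
    end
    return h
end

idx = [1, 1, 2, 1, 3, 3, 3, 3]
serial = hist_serial!(zeros(Int, 3), idx)
lanes = hist_lanes!(zeros(Int, 3), idx, 4)
```

Here `serial` is `[3, 1, 4]`, while `lanes` is `[1, 1, 1]`: whenever the same index shows up more than once within a group of `W` iterations, all but one of the increments are lost.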

Thanks for the issue. I think correctly handling this will wait for the rewrite, which will track dependencies between loop iterations. Not sure about a timeline for it, though.

Yes. Worse than that, it currently tries to hoist the loads out of the loop. Although that doesn't explain the `######RHS###4######5### not defined` error.