luajit2 icon indicating copy to clipboard operation
luajit2 copied to clipboard

feature: runtime detection of CRC32 and ARM64 support.

Open thibaultcha opened this issue 5 years ago • 16 comments

This PR refines the implementation of the CRC32 runtime detection implemented by @isage in https://github.com/openresty/luajit2/pull/20, and previously refined by @siddhesh in https://github.com/openresty/luajit2/pull/75. It also brings in refined ARM64 support originally implemented by @debayang in https://github.com/openresty/luajit2/pull/21.

The refined implementation compares itself with the original ones as such:

  • It does not rely on new global variables.
  • It does not rely on the __attribute__((constructor)) attribute to detect/initialize CRC32 support.
  • It has a smaller footprint on the upstream codebase and less potential for future merge conflicts.
  • It can be detected and enabled even if LuaJIT is compiled without JIT support (-DLUAJIT_DISABLE_JIT).
  • It implements new jit.* APIs allowing users some introspection into this optimization.
  • It adds a minimum amount of new macros, and allows for granularly disabling this optimization via a new LJ_OR_* macros (as proposed in https://github.com/openresty/luajit2/issues/63).
  • It reorganizes the core to encapsulate string hashing implementations in lj_str_hash.c and leaves the room open to new string hashing implementations in the future.

In addition, this PR offers to greatly improve the CI tests matrix by running all 3 test suites against many different builds, on both x64 and arm64 architectures (thanks to the recent Travis-CI updates on arm64 support).

CC for reviews: @yangshuxin, @agentzh, @siddhesh, @dndx.

thibaultcha avatar Oct 10 '19 22:10 thibaultcha

I skimmed through the patches and overall they seem to be in the direction I wanted to take, thanks! I can do a more thorough review in some time if you like. One thing I noticed for example is the _OR_ in the name for the feature check macro. Please keep it generic luajit since I am working on merging it into my fork of the default luajit as well.

siddhesh avatar Oct 10 '19 22:10 siddhesh

Ping @javierguerragiraldez.

dndx avatar Oct 10 '19 22:10 dndx

Also as a side note, I am beginning to wonder how much we need to bother with merge compatibility with the original LuaJIT project. It has been almost a year and it is about time that we move on. Only thing I need from Fedora's perspective is regular release tarballs and I'll work on merging all of my patches here :)

siddhesh avatar Oct 10 '19 23:10 siddhesh

Hi all, I have been away from the computer for some time.

@siddhesh

I had a closer look and it looks like you've excluded the actual dynamic bits from my patch because of which this is not really runtime detection at all. You still have to build the whole binary with crc32 support and that kinda defeats the purpose, at least from the distribution point of view.

That is one thing that I forgot to mention in this PR: this patch voluntarily leaves out the bits forcibly asking the binary to be built with SSE4.2/CRC32 support:

  1. @agentzh's comment questioning the compatibility with older ARM architectures is a concern that I share, and I was not able to get my hands on said architectures to properly test this change (I only was able to test this PR on armv8 for now).
  2. In absolute, such a change should be introduced in a subsequent commit (in this PR or a subsequent one) to guarantee the atomicity of the git history; thus ensuring we separate: a) runtime CRC32 detection, b) ARM64 support, and c) auto-enablement of the string hashing optimization at runtime depending on a).

For the time being, this PR is safer as-is by being fully backwards-compatible (i.e. OpenResty already specifies -msse4.2 in its LuaJIT build system on x86). Besides, it still needs to be decided how exactly this repository should balance OpenResty's tailored use-cases vs. providing a more un-opinionated LuaJIT fork; this is tangential to other on-going discussions, some of which you have been a part of: preserving merge-compatibility, naming of the build macros, release scheduling, and of course, whether or not to enable OpenResty optimizations by default (including this one).

Also as a side note, I am beginning to wonder how much we need to bother with merge compatibility with the original LuaJIT project.

This is a great segway now that the topic has been mentioned in my above paragraph, and my take on it is that for the time being (and probably for the foreseeable future), we should, when possible, ensure that merge-compatibility be preserved as the upstream v2.1 branch may still receive bugfixes for as long as the project is not officially declared unmaintained.

thibaultcha avatar Oct 22 '19 11:10 thibaultcha

Hi all, I have been away from the computer for some time.

My turn to apologize; I'm still kinda away but stole some time for this :)

That is one thing that I forgot to mention in this PR: this patch voluntarily leaves out the bits forcibly asking the binary to be built with SSE4.2/CRC32 support:

Got it. Would it be clearer if the PR title is reworded to remove the 'runtime' bit?

1. [@agentzh's comment](https://github.com/openresty/luajit2/pull/21#discussion_r186247541) questioning the compatibility with older ARM architectures is a concern that I share, and I was not able to get my hands on said architectures to properly test this change (I only was able to test this PR on armv8 for now).

I can help answer that question. Other than the very old xgene1 processor (the APM Mustang processors), all other cores that gcc knows about has CRC support. So while from a theoretical point of view it is unsafe, from a practical standpoint I think the assumption is safe because I don't think anybody actually would think of using xgene to run openresty on production. It is definitely safer than assuming SSE4.2 since there are far more machines in active operation that don't have SSE4.2.

2. In absolute, such a change should be introduced in a subsequent commit (in this PR or a subsequent one) to guarantee the atomicity of the git history; thus ensuring we separate: a) runtime CRC32 detection, b) ARM64 support, and c) auto-enablement of the string hashing optimization at runtime depending on a).

Fair point, if that is the intention I think the change is fine with an updated PR title.

Also as a side note, I am beginning to wonder how much we need to bother with merge compatibility with the original LuaJIT project.

This is a great segway now that the topic has been mentioned in my above paragraph, and my take on it is that for the time being (and probably for the foreseeable future), we should, when possible, ensure that merge-compatibility be preserved as the upstream v2.1 branch may still receive bugfixes for as long as the project is not officially declared unmaintained.

I'm not holding my breath anymore on an official announcement :)

siddhesh avatar Oct 27 '19 02:10 siddhesh

This PR needs to rebase to the latest master.

agentzh avatar Jan 13 '20 03:01 agentzh

@siddhesh Thanks for looking at this one. You definitely know about the actual arm hardware usage better than me. We can always enable it for the aarch64 world then.

We still need to be careful about merging conflicts. I just spent a lot of time a few days ago to solve the merge conflicts from upstream LuaJIT repo :)

agentzh avatar Jan 13 '20 03:01 agentzh

We'd better hurry on this PR, we're about to cut a new release. Hopefully we can have it in the next OpenResty version :)

agentzh avatar Jan 13 '20 03:01 agentzh

So with this PR the same x64 openresty binary can work on modern intel chips and amd chips at the same time now?

agentzh avatar Jan 13 '20 03:01 agentzh

My understanding is that it's not the goal (hence my request to remove the dynamic word from the PR title) of this PR to have the same binary work on all chips. This PR only takes the lj_str_hash cleanup stuff from my patchset and leaves dynamic detection for a later date.

As for the rebase, the only thing to do here is to drop the _original_hash function since Mike has rewritten those bits.

siddhesh avatar Jan 13 '20 04:01 siddhesh

@siddhesh I see. Thanks for the clarification. I'm really looking forward to binary-level auto-scheduling for different microarchitectures.

agentzh avatar Jan 13 '20 05:01 agentzh

That's what #75 does; you can test it in action in moonjit. I think @thibaultcha has a plan to stage the changes in though (I don't fully understand how things fit into OpenResty, etc) so y'all might want to discuss that.

siddhesh avatar Jan 13 '20 14:01 siddhesh

@siddhesh Please have a look at the latest commit which adds the bits we previously mentioned to unconditionally build the binary with SSE4.2/CRC32.

thibaultcha avatar Jan 14 '20 00:01 thibaultcha

@thibaultcha I'm afraid that still does not help. I just skimmed through the patchset and building the entire binary with -msse4.2 is incorrect because it will not run on CPUs that don't have SSE4.2; the compiler is free to use SSE4.2 instructions wherever it sees fit throughout the code base whereas we want to use it only for the string hash function. That is why we need to build only the SSE4.2 bits with that flag. Likewise for the aarc64 crc32 function, although that's much less of a problem since crc32 is common on aarch64 servers.

siddhesh avatar Jan 14 '20 01:01 siddhesh

@thibaultcha I haven't had a chance to do a thorough review, but a quick skim gives me the impression that this should work correctly. I'll sync this into moonjit this weekend (since the patchset in moonjit is slightly different, only in some minor details and tests) and let you know if I find issues.

siddhesh avatar Jan 15 '20 17:01 siddhesh

As a general question, in order to be "dynamic", we have to replace direct, possibly inlinable, function calls to be indirectly function calls. Unfortunately, call-sites involved are pretty frequent. Do you see any performance penalty in your stress testing?

yangshuxin avatar Jan 21 '20 00:01 yangshuxin