ucx
ucx copied to clipboard
UCX: Adding support for RISC-V architecture; RV64G
What
Adding support for the RISC-V architecture to OpenUCX. This includes support for generic RV64G (IMAFDC) micro architectures. Core development by @ct-clmsn and @jleidel
Why ?
Adds support for RISC-V architecture across all subcomponents. This includes support for basic build infrastructure, tooling, ucg, ucm, ucm/bistro, ucp, ucs and uct. Tested using the in-situ test suite on actual RISC-V RV64G hardware in a cluster configuration (see below for configuration details).
How ?
All development and testing was performed using the following parameters:
- SiFive HiFive Unmatched boards
- Ubuntu 21.04
- GCC 10.3.0
@yosefe - @jleidel 's company has signed CLA.
@shamisp @yosefe quick update: several modifications have been made and pushed into this PR.
@ct-clmsn can you pls fix the commit title according to code style https://github.com/openucx/ucx/wiki/Guidance-for-contributors item 3? (would need to force-push to resolve)
@yosefe change the automated commit message for the merge (commit # a3e2624)?
Not the merge commit; the other ones.
@yosefe apologies for the 2 rebases - found an issue with a single message. I believe the commit titles/messages are now updated to meet the specification.
@shamisp @yosefe did some testing today. It turns out riscv64's implementation of brk is not equivalent to syscall(SYS_brk, ...). This is the source of several issues and work arounds you have seen in the patch. I'm in the process of implementing an architecture specific solution in the function ucm_brk_syscall in src/util/sys.c around line 378. ucm_brk_syscall already has architecture specific code for calling syscall(SYS_brk, ...) for x86; comments in the function state architecture specific code should be used in the implementation. This should resolve all of the issues above, with exception to any outstanding character-spacing issues.
@yosefe @shamisp ucm_log_vsnprintf (in the file src/ucm/util/log.c) accepts an argument va_list ap. A line of code value.s = va_arg(ap, char *); expects va_arg to return NULL or 0 when it currently does not behave like this on RV64. Working on a solution.
At this point, I believe we have exhausted all possible options for removing architecture-specific I$ flushes in the RV64 implementation. Given that this is currently the "most" functional platform we have available for UCX prototyping, we limited in our options with respect to removing architecture specific code from non-architecture specific sections. Our current platform lacks certain support for hardware-driven cache flushes (RV-CMOBASE extension), compat headers (for libfabrics support; kernel 5.11.0-1028) and certain ELF extensions (DT_JMPREL is missing in binutils). We've tried all combinations of architecture-agnostic user and system calls to arrive at a solution. Many of which would potentially induce performance issues in future platform variants. Given the "experimental" nature of RISC-V platforms in HPC, we would like to request that we allow this PR to merge as is. This would allow everyone in the RISC-V community to make forward progress on other HPC runtime/comm layers. However, we will also sign up to amend the code once we have a more functional/stable baseline platform to use for tests and experiments. The RISC-V "platforms" and "profiles" specs will define specific extension requirements that will ease the pain of porting complex runtimes such as OpenUCX to future RISC-V variants.
At this point, I believe we have exhausted all possible options for removing architecture-specific I$ flushes in the RV64 implementation. Given that this is currently the "most" functional platform we have available for UCX prototyping, we limited in our options with respect to removing architecture specific code from non-architecture specific sections. Our current platform lacks certain support for hardware-driven cache flushes (RV-CMOBASE extension), compat headers (for libfabrics support; kernel 5.11.0-1028) and certain ELF extensions (DT_JMPREL is missing in binutils). We've tried all combinations of architecture-agnostic user and system calls to arrive at a solution. Many of which would potentially induce performance issues in future platform variants. Given the "experimental" nature of RISC-V platforms in HPC, we would like to request that we allow this PR to merge as is. This would allow everyone in the RISC-V community to make forward progress on other HPC runtime/comm layers. However, we will also sign up to amend the code once we have a more functional/stable baseline platform to use for tests and experiments. The RISC-V "platforms" and "profiles" specs will define specific extension requirements that will ease the pain of porting complex runtimes such as OpenUCX to future RISC-V variants.
@yosefe @ct-clmsn @jleidel Since the u-arch/arch issues cannot be resolved what about moving parts of the abstraction that got ifdef-ed (UCM_FIRE_EVENT, ucs_arch_clear_cache, _UCM_DEFINE_REPLACE_FUNC) to arch files in bistro. This way we can have one generic implementation and RV64 will have an option to overwrite those without ifdefs in the main code. @yosefe ?
@shamisp @yosefe I've made some updates to the bistro support and placed compile guards around some of the openmp code.
The icache issues have been resolved.
The modifications have been tested against a sifive unmatched and a pine64/star64 device. The code runs successfully on both devices.
Also, I've merged the implementation against master so, it is ready to go.
@ct-clmsn can you pls fix the commit title? would need force-push
@yosefe apologies, to clarify this commit? 06a1c3f
@yosefe apologies, to clarify this commit? 06a1c3f
@ct-clmsn pls see https://dev.azure.com/ucfconsort/ucx/_build/results?buildId=66476&view=logs&j=a3e7afe8-b282-5869-712a-9857fbd4b4a9&t=fd231581-c084-51fb-f10e-187ae5f1c153&l=30
Bad commit title: 'needed omp compile guards'
@yosefe @jleidel @shamisp UCX has changed significantly since this PR was proposed. This PR is OBE (overcome by events) and I'd encourage merging #9168 as it's a more up to date proposal.