ucx icon indicating copy to clipboard operation
ucx copied to clipboard

UCX: Adding support for RISC-V architecture; RV64G

Open jleidel opened this issue 3 years ago • 10 comments

What

Adding support for the RISC-V architecture to OpenUCX. This includes support for generic RV64G (IMAFDC) micro architectures. Core development by @ct-clmsn and @jleidel

Why ?

Adds support for RISC-V architecture across all subcomponents. This includes support for basic build infrastructure, tooling, ucg, ucm, ucm/bistro, ucp, ucs and uct. Tested using the in-situ test suite on actual RISC-V RV64G hardware in a cluster configuration (see below for configuration details).

How ?

All development and testing was performed using the following parameters:

  • SiFive HiFive Unmatched boards
  • Ubuntu 21.04
  • GCC 10.3.0

jleidel avatar May 19 '22 17:05 jleidel

@yosefe - @jleidel 's company has signed CLA.

shamisp avatar May 19 '22 20:05 shamisp

@shamisp @yosefe quick update: several modifications have been made and pushed into this PR.

ct-clmsn avatar Jun 08 '22 15:06 ct-clmsn

@ct-clmsn can you pls fix the commit title according to code style https://github.com/openucx/ucx/wiki/Guidance-for-contributors item 3? (would need to force-push to resolve)

yosefe avatar Jun 08 '22 15:06 yosefe

@yosefe change the automated commit message for the merge (commit # a3e2624)?

ct-clmsn avatar Jun 08 '22 16:06 ct-clmsn

Not the merge commit; the other ones.

yosefe avatar Jun 08 '22 16:06 yosefe

@yosefe apologies for the 2 rebases - found an issue with a single message. I believe the commit titles/messages are now updated to meet the specification.

ct-clmsn avatar Jun 08 '22 20:06 ct-clmsn

@shamisp @yosefe did some testing today. It turns out riscv64's implementation of brk is not equivalent to syscall(SYS_brk, ...). This is the source of several issues and work arounds you have seen in the patch. I'm in the process of implementing an architecture specific solution in the function ucm_brk_syscall in src/util/sys.c around line 378. ucm_brk_syscall already has architecture specific code for calling syscall(SYS_brk, ...) for x86; comments in the function state architecture specific code should be used in the implementation. This should resolve all of the issues above, with exception to any outstanding character-spacing issues.

ct-clmsn avatar Aug 09 '22 21:08 ct-clmsn

@yosefe @shamisp ucm_log_vsnprintf (in the file src/ucm/util/log.c) accepts an argument va_list ap. A line of code value.s = va_arg(ap, char *); expects va_arg to return NULL or 0 when it currently does not behave like this on RV64. Working on a solution.

ct-clmsn avatar Aug 11 '22 18:08 ct-clmsn

At this point, I believe we have exhausted all possible options for removing architecture-specific I$ flushes in the RV64 implementation. Given that this is currently the "most" functional platform we have available for UCX prototyping, we limited in our options with respect to removing architecture specific code from non-architecture specific sections. Our current platform lacks certain support for hardware-driven cache flushes (RV-CMOBASE extension), compat headers (for libfabrics support; kernel 5.11.0-1028) and certain ELF extensions (DT_JMPREL is missing in binutils). We've tried all combinations of architecture-agnostic user and system calls to arrive at a solution. Many of which would potentially induce performance issues in future platform variants. Given the "experimental" nature of RISC-V platforms in HPC, we would like to request that we allow this PR to merge as is. This would allow everyone in the RISC-V community to make forward progress on other HPC runtime/comm layers. However, we will also sign up to amend the code once we have a more functional/stable baseline platform to use for tests and experiments. The RISC-V "platforms" and "profiles" specs will define specific extension requirements that will ease the pain of porting complex runtimes such as OpenUCX to future RISC-V variants.

jleidel avatar Aug 18 '22 18:08 jleidel

At this point, I believe we have exhausted all possible options for removing architecture-specific I$ flushes in the RV64 implementation. Given that this is currently the "most" functional platform we have available for UCX prototyping, we limited in our options with respect to removing architecture specific code from non-architecture specific sections. Our current platform lacks certain support for hardware-driven cache flushes (RV-CMOBASE extension), compat headers (for libfabrics support; kernel 5.11.0-1028) and certain ELF extensions (DT_JMPREL is missing in binutils). We've tried all combinations of architecture-agnostic user and system calls to arrive at a solution. Many of which would potentially induce performance issues in future platform variants. Given the "experimental" nature of RISC-V platforms in HPC, we would like to request that we allow this PR to merge as is. This would allow everyone in the RISC-V community to make forward progress on other HPC runtime/comm layers. However, we will also sign up to amend the code once we have a more functional/stable baseline platform to use for tests and experiments. The RISC-V "platforms" and "profiles" specs will define specific extension requirements that will ease the pain of porting complex runtimes such as OpenUCX to future RISC-V variants.

@yosefe @ct-clmsn @jleidel Since the u-arch/arch issues cannot be resolved what about moving parts of the abstraction that got ifdef-ed (UCM_FIRE_EVENT, ucs_arch_clear_cache, _UCM_DEFINE_REPLACE_FUNC) to arch files in bistro. This way we can have one generic implementation and RV64 will have an option to overwrite those without ifdefs in the main code. @yosefe ?

shamisp avatar Aug 19 '22 13:08 shamisp

@shamisp @yosefe I've made some updates to the bistro support and placed compile guards around some of the openmp code.

The icache issues have been resolved.

The modifications have been tested against a sifive unmatched and a pine64/star64 device. The code runs successfully on both devices.

Also, I've merged the implementation against master so, it is ready to go.

ct-clmsn avatar Jul 13 '23 15:07 ct-clmsn

@ct-clmsn can you pls fix the commit title? would need force-push

yosefe avatar Jul 26 '23 13:07 yosefe

@yosefe apologies, to clarify this commit? 06a1c3f

ct-clmsn avatar Jul 26 '23 13:07 ct-clmsn

@yosefe apologies, to clarify this commit? 06a1c3f

@ct-clmsn pls see https://dev.azure.com/ucfconsort/ucx/_build/results?buildId=66476&view=logs&j=a3e7afe8-b282-5869-712a-9857fbd4b4a9&t=fd231581-c084-51fb-f10e-187ae5f1c153&l=30

Bad commit title: 'needed omp compile guards'

yosefe avatar Jul 26 '23 14:07 yosefe

@yosefe @jleidel @shamisp UCX has changed significantly since this PR was proposed. This PR is OBE (overcome by events) and I'd encourage merging #9168 as it's a more up to date proposal.

ct-clmsn avatar Jul 28 '23 18:07 ct-clmsn