
issues in mips version

Open Aatrox00 opened this issue 4 years ago • 23 comments

I was able to build CRIU, but I could not dump a process. When I ran criu check --all, there were no first-category errors. My environment is listed below: CPU: Loongson 3A4000; kernel: 4.19.0-12-loongson-3; CRIU version: 3.15

Aatrox00 avatar Sep 27 '21 07:09 Aatrox00

Hi,

please provide dump.log and show how you are invoking the criu command (with all arguments).

mihalicyn avatar Sep 27 '21 07:09 mihalicyn

I ran criu like this: "sudo criu/criu/criu dump --shell-job -v4 -o dump.log -t 24987 -D imgs". Attached: dump.log

Aatrox00 avatar Sep 27 '21 07:09 Aatrox00

According to your dump.log this seems to be the problem:

(00.017056) Error (criu/parasite-syscall.c:88): si_code=4 si_pid=24987 si_status=10
(00.017064) Error (criu/parasite-syscall.c:95): 24987 was stopped by 10 unexpectedly

The mips support was done by @sunny868 (if I remember it correctly), maybe @sunny868 knows why it does not work.

adrianreber avatar Sep 27 '21 10:09 adrianreber

Looks like your process (pid=24987) has received a SIGBUS signal:

$ cat mips/include/uapi/asm/signal.h | grep 10
#define SIGBUS		10	/* BUS error (4.2 BSD).	 */

It may mean that CRIU performed an unaligned memory access.

What kind of process have you been trying to dump? Do you have the same problem with any other processes? It's important for us to understand whether CRIU doesn't work at all for you or whether the problem is limited to a particular program.

Would it be possible for you to give us access to your MIPS machine so that we can take a look and try to debug?

mihalicyn avatar Sep 27 '21 10:09 mihalicyn

@Aatrox00 do you see this error with CRIU 3.16 as well?

rst0git avatar Sep 27 '21 10:09 rst0git

It looks like the victim process crashes here:

static int parasite_init_daemon(struct parasite_ctl *ctl)
{
...
	if (prepare_tsock(ctl, pid, args))
		goto err;

	/* after this we can catch parasite errors in chld handler */
	if (setup_child_handler(ctl)) <-- ok, because we have chld handler called
		goto err;

	regs = ctl->orig.regs;
	if (parasite_run(pid, PTRACE_CONT, ctl->parasite_ip, ctl->rstack, &regs, &ctl->orig)) <-- SIGBUS after jumping into parasite blob
		goto err;

	futex_wait_while_eq(&args->daemon_connected, 0);

mihalicyn avatar Sep 27 '21 10:09 mihalicyn

@Aatrox00, could you try reverting the commit ("compel: don't mmap parasite as RWX"), rebuilding CRIU (please use make clean && make to perform a full rebuild, including the parasite blob), and running criu dump ... again?

mihalicyn avatar Sep 27 '21 10:09 mihalicyn

@Aatrox00 do you see this error with CRIU 3.16 as well?

yeah the same

Aatrox00 avatar Sep 27 '21 11:09 Aatrox00

Looks like your process (pid=24987) has received a SIGBUS signal:

$ cat mips/include/uapi/asm/signal.h | grep 10
#define SIGBUS		10	/* BUS error (4.2 BSD).	 */

It may mean that CRIU performed an unaligned memory access.

What kind of process have you been trying to dump? Do you have the same problem with any other processes? It's important for us to understand whether CRIU doesn't work at all for you or whether the problem is limited to a particular program.

Would it be possible for you to give us access to your MIPS machine so that we can take a look and try to debug?

The process I tried to dump is just a simple single-threaded program. The same program works on my x86 machine. Actually, I've tried dumping several different programs, but none of them worked. As for providing access to the machine: it doesn't have a public IP address, so it can't be reached over SSH.

Aatrox00 avatar Sep 27 '21 11:09 Aatrox00

@Aatrox00

As for providing access to the machine: it doesn't have a public IP address, so it can't be reached over SSH.

That's not a problem for us ;) As an option, we can set up a reverse SSH tunnel from your machine to a machine controlled by the CRIU devs. But let's first make an initial guess and narrow down the problem before taking extraordinary measures :)

I repeat my question:

could you try reverting the commit ("compel: don't mmap parasite as RWX"), rebuilding CRIU (please use make clean && make to perform a full rebuild, including the parasite blob), and running criu dump ... again?

mihalicyn avatar Sep 27 '21 12:09 mihalicyn

could you try reverting the commit ("compel: don't mmap parasite as RWX"), rebuilding CRIU (please use make clean && make to perform a full rebuild, including the parasite blob), and running criu dump ... again?

I tried this just now. It didn't work. The dump.log is the same as before.

Aatrox00 avatar Sep 27 '21 12:09 Aatrox00

OK, then I will try to reproduce this in a QEMU VM.

Upd.

@Aatrox00 which GNU/Linux distro are you using on your MIPS machine?

mihalicyn avatar Sep 27 '21 12:09 mihalicyn

OK, then I will try to reproduce this in a QEMU VM.

Upd.

@Aatrox00 which GNU/Linux distro are you using on your MIPS machine?

Thanks for your help. I am using Loongnix-20.mips64el.rc2 (http://ftp.loongnix.cn/os/loongnix/20/mips64el/isos/). I've also tried Debian with kernel version 5.10.64.

Aatrox00 avatar Sep 27 '21 12:09 Aatrox00

Hi @Aatrox00,

I've experimented with MIPS in a VM on amd64. Sigh. :)

First of all, qemu-system-mips64el -cpu Loongson-3A4000 doesn't work for me at all (the kernel never starts booting).

Ok,

qemu-system-mips64el \
-cdrom debian-11.0.0-mipsel-netinst.iso \
-hda disk_malta.qcow2 \
-M malta \
-cpu 5KEc \
-smp 1 \
-kernel vmlinuz-5.10.0-8-5kc-malta \
-boot d \
-initrd initrd.img-5.10.0-8-5kc-malta \
-m 2G \
-nographic \
-device virtio-net-pci,netdev=eth0 -netdev type=user,id=eth0,hostfwd=tcp::2222-:22 \
-virtfs local,path=.,mount_tag=host0,security_model=mapped,id=host0 \
-append "root=/dev/sda1 nokaslr" 

worked for me, but CRIU compilation took about 20 minutes. I also caught:

[19119.370910] do_page_fault(): sending SIGSEGV to compel-host-bin for invalid read access from 0000000000000000
[19119.371543] epc = 000000fff39409e0 in libc-2.31.so[fff388d000+1b5000]
[19119.371846] ra  = 000000fff3920adc in libc-2.31.so[fff388d000+1b5000]
[19144.846684] do_page_fault(): sending SIGSEGV to compel-host-bin for invalid read access from 0000000000000000
[19144.849278] epc = 000000fff36ce9e0 in libc-2.31.so[fff361b000+1b5000]
[19144.850421] ra  = 000000fff36aeadc in libc-2.31.so[fff361b000+1b5000]
[19397.138507] do_page_fault(): sending SIGSEGV to compel-host-bin for invalid read access from 0000000000000000
[19397.139118] epc = 000000fff3ab69e0 in libc-2.31.so[fff3a03000+1b5000]
[19397.139562] ra  = 000000fff3a96adc in libc-2.31.so[fff3a03000+1b5000]

Seems like something is totally wrong with compel.

Perhaps it's better to move to our second plan and use the hardware node to debug the problem, or wait until @sunny868 comes and saves us :) Starting tomorrow I will be on vacation with (possibly) poor internet for about 10 days, so I can try to take a look at your problem today or... after the vacation.

Thanks, Alex

mihalicyn avatar Sep 28 '21 10:09 mihalicyn

@mihalicyn @Aatrox00 Sorry, I have something else to do recently, I will check this problem as soon as possible.

sunny868 avatar Sep 28 '21 11:09 sunny868

[19119.370910] do_page_fault(): sending SIGSEGV to compel-host-bin for invalid read access from 0000000000000000
[19119.371543] epc = 000000fff39409e0 in libc-2.31.so[fff388d000+1b5000]
[19119.371846] ra  = 000000fff3920adc in libc-2.31.so[fff388d000+1b5000]
[19144.846684] do_page_fault(): sending SIGSEGV to compel-host-bin for invalid read access from 0000000000000000
[19144.849278] epc = 000000fff36ce9e0 in libc-2.31.so[fff361b000+1b5000]
[19144.850421] ra  = 000000fff36aeadc in libc-2.31.so[fff361b000+1b5000]
[19397.138507] do_page_fault(): sending SIGSEGV to compel-host-bin for invalid read access from 0000000000000000
[19397.139118] epc = 000000fff3ab69e0 in libc-2.31.so[fff3a03000+1b5000]
[19397.139562] ra  = 000000fff3a96adc in libc-2.31.so[fff3a03000+1b5000]

Thanks again for your help. I got the same error messages on the Loongson 3A4000 machine.

Aatrox00 avatar Sep 28 '21 11:09 Aatrox00

Hi @mihalicyn, how's your vacation? It's been ten days since we exchanged messages last time. I'm wondering when it's a suitable time for you to help me to debug on the hardware node? Thanks.

Aatrox00 avatar Oct 08 '21 02:10 Aatrox00

Hi @Aatrox00, I've returned from vacation :) All fine.

Sure, I'm ready to take a look. We can get in touch on our Gitter https://gitter.im/save-restore/CRIU, or via Google Hangouts, email, and so on. My e-mail is [email protected] (Google Hangouts uses the same address).

mihalicyn avatar Oct 12 '21 15:10 mihalicyn

Thanks to @Aatrox00 for providing a working node.

I've managed to reproduce the issue, and it looks like our MIPS support doesn't work at all on Loongson 3A4000 processors.

1st problem (almost obvious): ./compel/compel cflags crashes with a segmentation fault. The problem is that the cflags field is not initialized here: https://github.com/checkpoint-restore/criu/blob/criu-dev/compel/src/main.c#L57

#elif defined CONFIG_S390
	.arch = "s390",
	.cflags = COMPEL_CFLAGS_PIE,
#elif defined CONFIG_MIPS
	.arch = "mips", <--- we have to have at least cflags
#else
#error "CONFIG_<ARCH> not defined, or unsupported ARCH"
#endif
};

and we are crashing there: https://github.com/checkpoint-restore/criu/blob/criu-dev/compel/src/main.c#L174

printf("%s\n", compat ? flags.cflags_compat : flags.cflags);

Okay, this is fixed and I've moved on to the next step. I tried the "fdspy" compel example:

lx@lx-pc:~/criu-3.16.1/compel/test/fdspy$ make
gcc -O2 -g -Wall -Werror -I/home/lx/criu-3.16.1/include/ -o victim victim.c
gcc -O2 -g -Wall -Werror -I/home/lx/criu-3.16.1/include/ -c  -o parasite.o parasite.c
ld -r -z noexecstack -T ../../../compel/arch/mips/scripts/compel-pack.lds.S -o parasite.po parasite.o ../../../compel/plugins/std.lib.a ../../../compel/plugins/fds.lib.a
ld: ../../../compel/plugins/std.lib.a(parasite-head.o): warning: linking abicalls files with non-abicalls files
ld: ../../../compel/plugins/std.lib.a(infect.o): warning: linking abicalls files with non-abicalls files
ld: ../../../compel/plugins/std.lib.a(syscalls-64.o): warning: linking abicalls files with non-abicalls files
ld: ../../../compel/plugins/std.lib.a(fds.o): warning: linking abicalls files with non-abicalls files
ld: ../../../compel/plugins/std.lib.a(log.o): warning: linking abicalls files with non-abicalls files
ld: ../../../compel/plugins/std.lib.a(string.o): warning: linking abicalls files with non-abicalls files
ld: ../../../compel/plugins/std.lib.a(memcpy.o): warning: linking abicalls files with non-abicalls files
ld: ../../../compel/plugins/fds.lib.a(fds.o): warning: linking abicalls files with non-abicalls files
../../../compel/compel-host hgen -o parasite.h -f parasite.po
Error (compel/arch/mips/src/lib/handle-elf-host.c:20): Unsupported Elf format detected
make: *** [Makefile:26: parasite.h] Error 255

Let's look at the code:

static const unsigned char __maybe_unused elf_ident_64_le[EI_NIDENT] = {
	0x7f, 0x45, 0x4c, 0x46, 0x02, 0x01, 0x01, 0x00, /* clang-format */
	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
};

extern int __handle_elf(void *mem, size_t size);

int handle_binary(void *mem, size_t size)
{
	if (memcmp(mem, elf_ident_64_le, sizeof(elf_ident_64_le)) == 0)
		return __handle_elf(mem, size);

	pr_err("Unsupported Elf format detected\n");
	return -EINVAL;
}
lx@lx-pc:~/criu-3.16.1/compel/test/fdspy$ readelf  --header parasite.po 
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 01 00 00 00 00 00 00 00 
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       1
  Type:                              REL (Relocatable file)
  Machine:                           MIPS R3000
  Version:                           0x1
  Entry point address:               0x100
  Start of program headers:          0 (bytes into file)
  Start of section headers:          52672 (bytes into file)
  Flags:                             0x80000005, noreorder, cpic, mips64r2
  Size of this header:               64 (bytes)
  Size of program headers:           0 (bytes)
  Number of program headers:         0
  Size of section headers:           64 (bytes)
  Number of section headers:         8
  Section header string table index: 7

Section Headers:
  [Nr] Name              Type             Address           Offset
       Size              EntSize          Flags  Link  Info  Align
  [ 0]                   NULL             0000000000000000  00000000
       0000000000000000  0000000000000000           0     0     0
  [ 1] .MIPS.abiflags    MIPS_ABIFLAGS    0000000000000000  00000040
       0000000000000018  0000000000000018   A       0     0     8
  [ 2] .text             PROGBITS         0000000000000100  00000100
       0000000000009350  0000000000000000 WAX       0     0     256
  [ 3] .rela.text        RELA             0000000000000000  0000adc8
       0000000000001fb0  0000000000000018   I       5     2     8
  [ 4] .mdebug.abi64     PROGBITS         0000000000000000  00009450
       0000000000000000  0000000000000000           0     0     1
  [ 5] .symtab           SYMTAB           0000000000000000  00009450
       0000000000001050  0000000000000018           6    33     8
  [ 6] .strtab           STRTAB           0000000000000000  0000a4a0
       0000000000000927  0000000000000000           0     0     1
  [ 7] .shstrtab         STRTAB           0000000000000000  0000cd78
       0000000000000043  0000000000000000           0     0     1
Key to Flags:
  W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
  L (link order), O (extra OS processing required), G (group), T (TLS),
  C (compressed), x (unknown), o (OS specific), E (exclude),
  p (processor specific)

There are no program headers in this file.
lx@lx-pc:~/criu-3.16.1/compel/test/fdspy$ hexdump -C parasite.po 
00000000  7f 45 4c 46 02 01 01 00  01 00 00 00 00 00 00 00  |.ELF............|
00000010  01 00 08 00 01 00 00 00  00 01 00 00 00 00 00 00  |................|
00000020  00 00 00 00 00 00 00 00  c0 cd 00 00 00 00 00 00  |................|
00000030  05 00 00 80 40 00 00 00  00 00 40 00 08 00 07 00  |....@.....@.....|
00000040  00 00 40 02 02 02 00 01  00 00 00 00 00 00 00 00  |..@.............|
00000050  01 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000060  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

We can see that at offset 8 (counting from 0) we have 01, while compel expects 00. The first 16 bytes of the file belong to the e_ident array of the ELF header.

typedef struct elfhdr{
     unsigned char   e_ident[EI_NIDENT]; /* ELF Identification */
     Elf32_Half  e_type;     /* object file type */
     Elf32_Half  e_machine;  /* machine */
     Elf32_Word  e_version;  /* object file version */
     Elf32_Addr  e_entry;    /* virtual entry point */
     Elf32_Off   e_phoff;    /* program header table offset */
     Elf32_Off   e_shoff;    /* section header table offset */
     Elf32_Word  e_flags;    /* processor-specific flags */
     Elf32_Half  e_ehsize;   /* ELF header size */
     Elf32_Half  e_phentsize;    /* program header entry size */
     Elf32_Half  e_phnum;    /* number of program header entries */
     Elf32_Half  e_shentsize;    /* section header entry size */
     Elf32_Half  e_shnum;    /* number of section header entries */
     Elf32_Half  e_shstrndx; /* section header table's "section
                        header string table" entry offset */
 } Elf32_Ehdr;

I tried to find out what this 01 means, but found nothing about it in the Linux kernel code or anywhere else. According to the documentation, all bytes after offset 8 are padding and should be zero. Okay, I patched this byte by hand, ran make a second time, and got:

lx@lx-pc:~/criu-3.16.1/compel/test/fdspy$ ../../../compel/compel-host hgen -o parasite.h -f parasite.po
Error (compel/src/lib/handle-elf-host.c:641): Unsupported relocation of type 7

Relocation type 7 is R_MIPS_GPREL16, and compel indeed does not handle this relocation type.

mihalicyn avatar Oct 21 '21 15:10 mihalicyn

At this point I can't understand how MIPS support worked before. If the cflags field wasn't initialized, ./compel/compel cflags crashes, which means that during parasite compilation we got something like:

gcc -O2 -g -Wall -Werror -c  -o parasite.o parasite.c

instead of

gcc -O2 -g -Wall -Werror -c -Wstrict-prototypes -fno-stack-protector -nostdlib -fomit-frame-pointer -fpie -I ../../../compel/include/uapi -o parasite.o parasite.c

I can't imagine a parasite compiled without the -nostdlib flag working correctly.

mihalicyn avatar Oct 21 '21 15:10 mihalicyn

Hmm, interesting. We currently have a simple test for cross compilation (.github/workflows/cross-compile.yml), perhaps we need to extend it (or create one) to run zdtm tests as well.

rst0git avatar Oct 22 '21 12:10 rst0git

Unfortunately, we don't have a MIPS node of our own. Aatrox00 temporarily provided his own node for debugging.

mihalicyn avatar Oct 22 '21 13:10 mihalicyn

A friendly reminder that this issue had no activity for 30 days.

github-actions[bot] avatar Nov 22 '21 00:11 github-actions[bot]