Floating Point Exception

Open ZaidQureshi opened this issue 5 years ago • 16 comments

I am trying to run the nvm-identify example, but I get the following output:

Resetting controller and setting up admin queues...
Floating point exception

The dmesg output is this:

[May24 16:40] traps: nvm-identify[3179] trap divide error ip:7f6d2f98a434 sp:7ffd9a74e3b0 error:0 in libnvm.so[7f6d2f985000+9000]

I am not sure what is going on. Any help would be appreciated.

ZaidQureshi avatar May 24 '19 23:05 ZaidQureshi

Hi Zaid,

This probably means that there is a division by zero somewhere, but I am not really sure where that would be. Would it be possible for you to add some print statements to the examples/identify/module.c and examples/identify/common.c files to see how far it gets before the bug occurs, and ideally which library call fails? I'm wondering if the error occurs in the reset_ctrl function (in common.c) or in the identify_ctrl function (also in common.c).

Thanks

enfiskutensykkel avatar May 25 '19 00:05 enfiskutensykkel

I did some debugging and it seems the problem is on line 51 in include/nvm_queue.h, which has the following check:

if ((uint16_t) ((sq->tail - sq->head) % sq->max_entries) == sq->max_entries - 1)

For some reason the tail, head, and max_entries of the admin queue are all 0.
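A zero max_entries would also explain the "Floating point exception": the modulo in that check becomes an integer division by zero, which on x86 raises the divide-error trap reported in the dmesg line above and is delivered as SIGFPE. A minimal stand-alone repro of the mechanism (not taken from libnvm; the variables are volatile only so the division is not folded away at compile time):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Queue state mirroring the values observed above. */
    volatile uint16_t tail = 0, head = 0, max_entries = 0;

    /* Same shape as the full-queue check on line 51 of include/nvm_queue.h;
     * the modulo by zero traps before the comparison is ever evaluated. */
    if ((uint16_t) ((tail - head) % max_entries) == max_entries - 1) {
        printf("queue full\n");
    }
    return 0;
}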

ZaidQureshi avatar May 28 '19 16:05 ZaidQureshi

Thank you for your effort.

I suspect that this may be because it is not able to read controller registers properly. I assume you are using the kernel module version (and not the SmartIO version)?

In examples/identify/module.c, could you print out the members of ctrl directly after the call to nvm_ctrl_init, particularly the values of ctrl->page_size and ctrl->max_entries?
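For example, something along these lines right after the nvm_ctrl_init call (a hypothetical snippet; the casts are only there to keep the format specifiers simple):

/* Hypothetical debug prints for examples/identify/module.c, placed directly
 * after nvm_ctrl_init() returns successfully. */
printf("page_size   = %zu\n", (size_t) ctrl->page_size);
printf("max_entries = %zu\n", (size_t) ctrl->max_entries);
printf("timeout     = %zu\n", (size_t) ctrl->timeout);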

enfiskutensykkel avatar May 28 '19 17:05 enfiskutensykkel

These are the values for each of the members of ctrl after the nvm_ctrl_init call:

CTRL->DSTRD: 0
CTRL->timeout: 30000
CTRL->MAX_ENTRIES: 0
CTRL->MM_SIZE: 8192
CTRL->mm_ptr: 0x7f511d31f000

ZaidQureshi avatar May 28 '19 18:05 ZaidQureshi

Thank you again.

This is a bit strange: it appears that it managed to read the time-out value from the registers, but the maximum number of entries has become 0. I may be doing something wrong in src/ctrl.c, or maybe I've interpreted the spec wrong. There are some fields that are defined as 0 meaning 1, so I need to check if this is the case here.

Out of curiosity, what kind of NVMe controller/disk is this?

enfiskutensykkel avatar May 28 '19 18:05 enfiskutensykkel

This is a Corsair NX500 SSD.

ZaidQureshi avatar May 28 '19 18:05 ZaidQureshi

So the value read from the CAP.MQES register in initialize_handle in src/ctrl.c is 65535 for me. When we add 1, it overflows and wraps around to 0. Now the question is why it reports 65535. When I remove the plus 1, I get valid values for the admin queue max entries and the controller information is returned correctly.
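The wrap-around itself is easy to see in isolation (a stand-alone sketch, not libnvm code):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint16_t mqes    = 0xFFFF;               /* value read from CAP.MQES           */
    uint16_t narrow  = mqes + 1;             /* truncates: 65535 + 1 wraps to 0    */
    uint32_t widened = (uint32_t) mqes + 1;  /* keeps the intended value: 65536    */

    printf("16-bit: %u, 32-bit: %u\n", narrow, widened);  /* prints 0 and 65536 */
    return 0;
}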

ZaidQureshi avatar May 28 '19 19:05 ZaidQureshi

The reason I add 1 is that MQES is, according to the spec,

"This is a 0's based value. The minimum value is 1h, indicating two entries."

Meaning that a value of 0 is illegal, and a value of 1 means it should be 2, etc.

If it returns 65535, I suspect a different type of issue, namely that all MMIO reads are returning 0xFFs (this is a common symptom when reading does not work). If you attempt to read any other values directly from the memory-mapped pointer, do they all return 0xFFs? Could you for example try this:

printf("%lx\n", *((uint64_t*) mm_ptr));

enfiskutensykkel avatar May 28 '19 19:05 enfiskutensykkel

This is the value printed by printf("%lx\n", *((uint64_t*) ctrl->mm_ptr)); after calling nvm_ctrl_init:

203c03ffff
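For reference, decoding that raw value against the CAP register layout in the NVMe spec (a sketch; the bit offsets come from the spec, not from libnvm) lines up with the ctrl members printed earlier: MQES is 0xFFFF, the timeout is 60 * 500 ms = 30000 ms, and DSTRD is 0.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Raw CAP value printed above; field offsets per the NVMe spec:
     * MQES 15:0, TO 31:24 (in 500 ms units), DSTRD 35:32. */
    uint64_t cap = 0x203c03ffffULL;

    uint32_t mqes  = (uint32_t) (cap & 0xFFFF);
    uint32_t to    = (uint32_t) ((cap >> 24) & 0xFF);
    uint32_t dstrd = (uint32_t) ((cap >> 32) & 0xF);

    printf("MQES  = %u (%u queue entries)\n", mqes, mqes + 1);  /* 65535 -> 65536 */
    printf("TO    = %u (%u ms)\n", to, to * 500);               /* 60 -> 30000    */
    printf("DSTRD = %u\n", dstrd);                              /* 0              */
    return 0;
}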

ZaidQureshi avatar May 28 '19 19:05 ZaidQureshi

That's a good thing!

The controller probably supports 65536 entries, and the bug is that I'm using a 16-bit variable to read out the value and add one. I'm trying to find a datasheet or documentation in order to verify that it supports that many queue entries, but I suspect that this is the case.

I'll add a quick fix for it and add your name in the commit message (if you want), or you can send me a patch or alternatively a pull request. In the meantime, you can probably comment out the +1.

enfiskutensykkel avatar May 28 '19 19:05 enfiskutensykkel

Thank you for your time and effort, by the way.

enfiskutensykkel avatar May 28 '19 19:05 enfiskutensykkel

Wait, but the specification states that this register is only 16 bits.

ZaidQureshi avatar May 28 '19 19:05 ZaidQureshi

Yes, the register is only 16 bits, but I believe the resulting value is allowed to be 65536. I'll look around and check the spec errata to see if it is mentioned, and I'll also have a look at what the Linux kernel NVMe driver does. Does the identify example program work for you if you don't add one?

enfiskutensykkel avatar May 28 '19 19:05 enfiskutensykkel

Yes, that example works, but I think people typically always use the MQES+1 value, so I think you will need to change the type of the max_entries field in all of your code. I believe this is also leading to another issue when memory is unmapped at the end of the latency benchmark, as I get the following error:

[unmap_memory] Page unmapping kernel request failed: Invalid argument

I am trying to debug it and see what is happening.

ZaidQureshi avatar May 28 '19 19:05 ZaidQureshi

The spec says in the introduction chapter:

Support for 65,535 I/O queues, with each I/O queue supporting up to 64K outstanding commands.

Since the number of I/O queues is 2^16 - 1, does that perhaps mean that 64K in this case means 2^16?

Regardless, the unmapping is a known bug (see #18); I haven't had time to rewrite the kernel module yet. The problem is how I'm currently storing address mappings: I'm storing everything in one global list when I should instead create a list per open descriptor. It happens with the nvm-latency-benchmark because I check against the PID, and for some reason CUDA programs fork into child processes, so the "wrong" PID is attempting to clean up the stored mapping.
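Roughly, the rewrite would move from the global, PID-keyed list to per-descriptor bookkeeping, something along these lines (a hypothetical sketch with made-up names, not the actual module code):

#include <linux/fs.h>
#include <linux/list.h>
#include <linux/mutex.h>
#include <linux/slab.h>

/* Hypothetical per-descriptor state: each open file handle owns its own
 * mapping list, so release() cleans up exactly the mappings created through
 * that descriptor, regardless of which PID (parent or forked CUDA child)
 * ends up closing it. */
struct map_descriptor_state {
    struct list_head mappings;   /* per-descriptor, replaces the global list */
    struct mutex     lock;
};

static int ctrl_fops_open(struct inode* inode, struct file* file)
{
    struct map_descriptor_state* state = kzalloc(sizeof(*state), GFP_KERNEL);
    if (state == NULL)
        return -ENOMEM;

    INIT_LIST_HEAD(&state->mappings);
    mutex_init(&state->lock);
    file->private_data = state;  /* release() walks and frees this list */
    return 0;
}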

It will clean up automatically once you unload the module, so do that every now and then if you run the nvm-latency-bench program multiple times. The warnings and errors look bad, but they are expected.

enfiskutensykkel avatar May 28 '19 19:05 enfiskutensykkel

The permanent fix would be to change the type of max_entries to uint32_t as you suggested, or alternatively always use + 1 when I reference it. I'll have to look into the impact of both solutions.
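As a sketch of the first option (hypothetical; the actual code in src/ctrl.c will differ), the idea is simply to widen the value before adding one:

#include <stdint.h>

/* Hypothetical sketch, not the actual src/ctrl.c code: extract MQES from the
 * 64-bit CAP register into a wider type before adding one, so a reading of
 * 65535 becomes 65536 instead of wrapping to 0. */
static uint32_t cap_max_entries(volatile const void* mm_ptr)
{
    uint64_t cap = *((volatile const uint64_t*) mm_ptr);  /* CAP register at offset 0x00 */
    return (uint32_t) (cap & 0xFFFF) + 1;                 /* MQES is a 0's based value   */
}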

enfiskutensykkel avatar May 28 '19 19:05 enfiskutensykkel