
[BUG] dtype of (ndarray + integer_constant) not as expected

Open arnonsen opened this issue 3 years ago • 13 comments

Two bugs:

  1. numpy casting does not follow the expected casting (compared with my Python 3 or https://www.onlinegdb.com/): adding the value '1' to an 8-bit array changes the type of the array to 16 bits.
  2. np does not have __version__?! (not really why I opened this issue)

from ulab import numpy as np
print(np.__version__)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute '__version__'

import ulab
print(ulab.__version__)
2.4.5-2D

To Reproduce

from ulab import numpy as np
print(np.arange(1, dtype=np.int8) + 1)
array([1], dtype=int16)
print(np.arange(1, dtype=np.uint8) + 1)
array([1], dtype=uint16)

Expected behavior (using onlinegdb)

import numpy as np
a = np.arange(1, dtype=np.int8) + 1
print(type(a[0]))
a = np.arange(1, dtype=np.uint8) + 1
print(type(a[0]))

<class 'numpy.int8'>
<class 'numpy.uint8'>

Additional context

I will fix it and propose my fix.

arnonsen avatar May 25 '21 07:05 arnonsen

@arnonsen

Please, add a descriptive title to your issue!

np does not have __version__?! (not really why I opened this issue)

It doesn't. ulab incorporates three sub-modules at the moment (a fourth one is on the way), and numpy is only one of them. It really doesn't make sense to track the sub-modules, so there is a single version number, and that is for ulab:

import ulab

print(ulab.__version__)

Adding version numbers to sub-modules would also be extremely confusing: we do not necessarily add functions in the same order as does numpy, so if we had np.version = 1.11 in ulab, what exactly should that mean?

numpy casting does not follow expected casting

It cannot: numpy has at least 13 different dtypes, while ulab has 5. We had to make the cut. Also, your example,

a = np.arange(1,dtype=np.int8)+1
type(a[0])

returns np.int8, but the conceptually equivalent

a = np.arange(1,dtype=np.int8)+129
type(a[0])

returns np.int16, and

a = np.arange(1,dtype=np.uint8)+129
type(a[0])

returns np.uint8. The question really is what dtype you assign to the scalar constant, and how that dtype compares to the dtype of your arange. I think upcasting is not a trivial problem, especially if you want to keep the flash footprint small.
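For reference, the value-based promotion rule that NumPy 1.x applies to (array + integer scalar) can be sketched in pure Python. `RANGES` and `promote_scalar` here are hypothetical helper names for illustration only, not ulab or NumPy API:

```python
# Sketch of NumPy 1.x value-based promotion for (array + integer scalar):
# keep the array's dtype if the scalar fits its range, otherwise upcast.
RANGES = {
    "int8": (-128, 127),
    "uint8": (0, 255),
    "int16": (-32768, 32767),
    "uint16": (0, 65535),
}

def promote_scalar(array_dtype, scalar):
    """Return the result dtype of (array + scalar)."""
    lo, hi = RANGES[array_dtype]
    if lo <= scalar <= hi:
        return array_dtype          # scalar fits: dtype is preserved
    # otherwise widen to the next size (signed if the scalar is negative)
    if scalar < 0 or array_dtype.startswith("int"):
        return "int16" if array_dtype.endswith("8") else "int32"
    return "uint16" if array_dtype.endswith("8") else "uint32"

print(promote_scalar("int8", 1))     # -> int8
print(promote_scalar("int8", 129))   # -> int16
print(promote_scalar("uint8", 129))  # -> uint8
```

This reproduces the three cases discussed above: the result dtype depends on whether the scalar's value fits the array's dtype range, not only on the dtype itself.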

I would also like to see where this causes a real issue. ulab is not a one-to-one replacement for numpy (if you want numpy, then you should use numpy), but rather a trimmed-down implementation of a syntactically compatible numerical module. numpy-compatibility at all costs is not feasible. You have to keep in mind that the compiled size of ulab is approx. a thousand times smaller than that of numpy (let alone the scipy functions), and we have absolutely no external OS utilities that we can depend on; therefore, one should not expect that everything works as in numpy. micropython itself has a number of instances where its behaviour diverges from CPython, and people are not too concerned about this.

Having said this, I am open to suggestions, but there are non-trivial boundary conditions that we have to keep in mind.

v923z avatar May 25 '21 09:05 v923z

Following the behavior of full Python within the existing types adds very little code. I'm in the process of doing it right now and I will share the fix. For example, it will give the following results:

[array of int8] + (-128..127) --> [array of int8]
[array of uint8] + (0..255) --> [array of uint8]

Currently the array is up-cast to 16 bits (in this example), so it doubles the RAM taken by the generated array.

arnonsen avatar May 25 '21 10:05 arnonsen

Following the behavior of full python within the existing types adds very little code,

There are several places in the code, where upcasting happens. One is the binary operators, another is compare.c.

Currently the array is up-cased to 16 bits (on this example) so it doubles the RAM taken for the generated array.

This is definitely true.

v923z avatar May 25 '21 11:05 v923z

Update,
The attached script tests all the combinations of ARRAY + ARRAY and ARRAY + SCALAR; here is its partial printout:

Test s/s:
s8 + s8 = array([0], dtype=int8)
s8 + s16 = array([0], dtype=int16)
s8 + s32 = array([0], dtype=int32)
s8 + s64 = array([0], dtype=int64)
s16 + s8 = array([0], dtype=int16)
s16 + s16 = array([0], dtype=int16)
s16 + s32 = array([0], dtype=int32)
....
....

See the full log in the attached upcast_log.txt: upcast_log.txt

See the Python script that runs this test (rename to .py): operator_upcasting_rule.txt. I will share the source code.

arnonsen avatar May 25 '21 13:05 arnonsen

I pushed code to my fork:

  1. Add new types: INT32, UINT32, INT64.
  2. Fix upcasting to match official NumPy.
  3. Support several operators which were not supported before.
  4. Reduce code size by ~20% with the new types! The code reduction is a work in progress.

arnonsen avatar May 27 '21 08:05 arnonsen

I pushed code to my fork:

Currently I have a number of compilation errors, so I can't really comment on the performance.

1. Add new types: INT32  UINT32  INT64

These should be optional, i.e., you should add a pre-processor switch for these data types, and you should guard these cases everywhere in the code.

2. Fix upcasting to match official NumPy.

Thanks!

3. Support several operators which were not supported before

These should also be optional.

4. Reduce code size by ~20% with new types !
   The code reduction is work in progress.

A couple of comments:

  1. You try to change a number of things in a single sweep, and that is not advisable. A single commit should address a single and isolated issue. E.g., your original post about the upcasting is definitely valid, and this we could fix with two lines of code. A PR should be the subject of this single issue. If you want to change how upcasting works, then leave the string comparison function alone! https://github.com/v923z/micropython-ulab/blob/9cc363860495331975157487802f3865fd4f480c/code/ndarray.c#L1542 We can discuss that in a separate issue.

  2. People don't usually need int32, uint32, and int64, so these should be optional, and should be handled in a separate PR.

  3. Adding new operators falls into the same category. Also, new features can only be accepted in the main, if there are tests for them.

  4. I am not quite certain about the statement on code reduction: first, I have the feeling that your code does not handle views, which is a show stopper. Second, size reduction comes at the expense of RAM. If I understand your code correctly, you convert and copy the data into a temporary buffer, and then you operate on that. The downside is that if you want to add an int8 and a uint8 array of size 1000, then you need 3 * 4 * 1000 bytes of extra RAM. I think that is too much, and can only be accepted if this approach is optional (pre-processor switch). You have to keep in mind that there are MCUs that have virtually no RAM. E.g., the ESP8266 has approx. 30 kB of free space. Even if you disregard this comment, allocating memory leads to RAM fragmentation. (By the way, this is why it might still be beneficial to keep the m_new and m_free functions...)

  5. If you want to implement changes, do it in the current structure. E.g., while there is the overhead of a single function call for each binary operator, handling each operator in a single function leads to readable code. If you want to insert your method into the code, then do it in the respective binary functions!

  6. Do you have speed benchmarks? I understand that calling the conversion function only once is faster than calling it for each element in the array (using the function pointer), but I can't imagine that converting and copying the data is speed-wise comparable to the current implementation, which does not do any conversion.

  7. I have tried to adhere to the micropython code convention. You should, too.

What you are proposing in your fork is quite disruptive, and should definitely be broken into smaller chunks. I would suggest that we start with the upcasting problem. Do you want to request a pull for that, or should I implement the changes?
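The RAM estimate in point 4 can be spelled out with a toy calculation. `temp_buffer_bytes` is a hypothetical helper for illustration, not ulab code:

```python
# Extra RAM needed by the temporary-buffer approach from point 4:
# both operands and the intermediate result widened to 4 bytes per element.
def temp_buffer_bytes(n_elements, widened_itemsize=4, n_buffers=3):
    return n_buffers * widened_itemsize * n_elements

# Adding an int8 and a uint8 array of 1000 elements:
print(temp_buffer_bytes(1000))  # -> 12000 bytes of extra RAM
```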

v923z avatar May 27 '21 16:05 v923z

Thanks a lot for your comments, I agree with most of them and I will refer to each.

  1. My recent fork and commit reflect work I've done over a few days (on an older ulab version); once I realized it's preferable to share my work rather than keep it to myself, I spent many hours today cleaning up etc. just for that, which is why it includes multiple changes. Future commits will be incremental. As for the "string comparison", I get your point and I'm curious to know the benefit of using strcmp here.

  2. For my use (audio processing) INT32 is mandatory and the two others came with zero effort. Using "my operators approach", 99% of the code they would take is in the "numpy_utils.c" functions, and it is very easy to mask it there using compilation flags.

  3. I will get help for testing EVERYTHING (operators, methods, functionality etc.) and comparing it with the "formal" NumPy/SciPy, but in the meantime I'm running forward and briefly test each change without keeping and logging the results.

  4. The code reduction issue is simple to explain. s8/u8/s16/u16/float has 12 combinations (I think) for each operator. For 20 operators this means 20*12 = 240 double loops (for max 2D). Using "my method" there are about 30 double loops (in numpy_utils.c) and only two 1D loops per operator; for 20 operators this is 30+20 = 50 double loops (instead of 240). The cost of each additional operator is less than 1/10 this way. The 20% reduction is not an estimate but real numbers from my platform, and that is before completing the migration. Surely I don't know other platforms' limitations and considerations, but I think for most of them the use of the temporary buffer is feasible and preferable (so maybe it should all be under a compilation flag?). BTW, for your example I would use 2x4x1000 (not 3x4x1000). On my system this temp memory does not come from gc_malloc but from the host's OS. I'm not sure what you are referring to by "to keep the m_new, and m_free functions"; is it where I replaced m_new for "shape" and "strides"? If so, I prefer the small temporary payment on the stack to the huge payment of each gc_malloc.

  5. OK

  6. I started running benchmarks but not yet for the NumPy part; I will do it soon. My estimate is that for vectors of fewer than 1k elements, the cost of copy_to_temp + operator + copy_from_temp is well "under" other Python overheads. For example, I've changed the FFT code to one that takes maybe 1/10 of the cycles, and then I found that the five "m_new"-related allocations take more than the FFT itself...

  7. ok

I understand it's immature and can't be taken "as is"; I wanted to show you my direction and get your feedback, then we can plan ahead.

Please go ahead and implement the change. Thanks !!!
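Taking the figures in point 4 above at face value, the loop-count trade-off can be checked quickly (a back-of-the-envelope sketch, not ulab code):

```python
# Specialized approach: one double loop per (operator, dtype-combination).
ops = 20
dtype_combinations = 12            # the author's estimate for s8/u8/s16/u16/float
specialized = ops * dtype_combinations
# Shared-conversion approach: ~30 conversion loops plus one generic loop per operator.
shared = 30 + ops
print(specialized, shared)  # -> 240 50
```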

arnonsen avatar May 27 '21 20:05 arnonsen

Please note the fix consists of two parts: the upcasting function and the array/scalar type matching.

arnonsen avatar May 28 '21 04:05 arnonsen

As for the "string comparison", I get your point and I'm curious to know the benefit of using strcmp here.

I am not against comparing the raw pointers, all I wanted to say is that it should be the subject of a different commit. If you look at the title of this issue, dtype of (ndarray + integer_constant) not as expected, it does not concern the string comparison for the flatten method. If you change a lot of things under the same heading, then later on it will be really hard to find what caused which error.

2. For my use (audio processing) INT32 is mandatory and the two others came with zero effort. 

using "my operators approach", 99% of the code they would take is in the "numpy_utils.c" functions, and it is very easy to mask it there using compilation flags.

I don't know exactly how you process the data, but you can directly convert int32 to float using one of the functions in the utils module. Someone had the same problem: https://github.com/v923z/micropython-ulab/issues/341

3. I will get help for testing EVERYTHING (operators, method, functionalities etc.) and compare it with 

the "formal" Numpy/Scipy but meantime I'm running forward and shortly test each change without keeping and logging them.

I guess all you have to do is update the test scripts with the new types. The only difficulty is that we would have to have two sets, one with the new extra types, and one without.

4. The code reduction issue is simple to explain. s8/u8/s16/u16/float has 12 combinations (I think) for each operator. For 20 operators this means 20*12 = 240 double loops (for max 2D). Using "my method" there are about 30 double loops (in numpy_utils.c) and only two 1D loops per operator; for 20 operators this is 30+20 = 50 double loops (instead of 240). The cost of each additional operator is less than 1/10 this way.

That's clear, but what you are doing is trading firmware size for speed and RAM: you replace the quadratic scaling with a linear one by converting everything into a temporary buffer with a common dtype. We considered this some time ago, but for the reasons I outlined here, it did not seem to be a viable approach. We can bring it in with a pre-processor switch.

The 20% reduction is not an estimate but real numbers from my platform, and that is before completing the migration. Surely I don't know other platforms' limitations and considerations, but I think for most of them the use of the temporary buffer is feasible and preferable (so maybe it should all be under a compilation flag?).

Let me re-phrase my earlier comment: the ESP8266 has about 30 kB of free RAM. Some microcontrollers have 2 MB. That is close to a factor of 100 difference. In computer terms, you are comparing a computer with 8 GB of free RAM with one with only 100 MB. Even the smallest Raspberry Pi has more RAM, so when you are developing code for computers, you don't usually see such huge differences. Taking the ESP8266, if you want to add two 1000-element arrays, you need 3 kB for the operands and the result, plus 8 kB for the computation itself.
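The numbers in that last sentence work out as follows (a sketch assuming two 1000-element int8 operands and 4-byte temporaries):

```python
n = 1000
operands_and_result = 3 * n * 1   # two int8 inputs + one int8 output
temp_buffers = 2 * n * 4          # both operands widened to 4-byte temporaries
print(operands_and_result, temp_buffers)  # -> 3000 8000, i.e. 3 kB + 8 kB
```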

BTW for your example I would use 2x4x1000 (not 3x4x1000).

Don't you convert both operands, hold the result, and then convert the result to the end type? That is three buffers, but I might have missed something.

On my system this temp memory does not come from gc_malloc but from the host's OS. I'm not sure what you are referring to by "to keep the m_new, and m_free functions"; is it where I replaced m_new for "shape" and "strides"? If so, I prefer the small temporary payment on the stack to the huge payment of each gc_malloc.

This happens in quite a few places, not only where you changed it. Usually the problem is that you have to reserve some RAM before you can figure out whether you actually work with the input, so if you have to bail out, then it is probably better to free up the memory beforehand. The garbage collector will eventually clean up a lot of things, but it might be too late.

  6. I started running benchmarks but not yet for the NumPy part; I will do it soon. My estimate is that for vectors of fewer than 1k elements, the cost of copy_to_temp + operator + copy_from_temp is well "under" other Python overheads.

Well, this is absolutely impossible to tell up front. Take this: https://github.com/thiagofe/ulab_samples Run times scatter wildly, even if you divide by the frequency. Also, we can't have different methods for arrays of different lengths.

for example, I've changed the FFT code to one that takes maybe 1/10 of the cycles, and then I found that the five "m_new"-related allocations take more than the FFT itself...

The FFT routine in ulab works with no RAM overhead. This was a conscious decision, and one that I am satisfied with: it is relatively fast, and yet, you can transform 2048 data points on an ESP8266. It could run faster, and has its limitations (e.g., the length must be a power of 2), but given the constraints, I am not sure you can do much better. The only reasonable addition could be to call hardware FFT on platforms that have it. Yours might be such, I don't know.
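The power-of-two length constraint mentioned above can be checked with a standard bit trick (generic Python, not ulab API):

```python
def is_power_of_two(n):
    # a power of two has exactly one bit set, so n & (n - 1) clears it to 0
    return n > 0 and (n & (n - 1)) == 0

print(is_power_of_two(2048))  # -> True: a valid FFT length
print(is_power_of_two(1000))  # -> False: would be rejected
```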

v923z avatar May 28 '21 06:05 v923z

please note the fix consists of two parts, the upcasting function and the array/scalar type matching.

OK, I'll do.

v923z avatar May 28 '21 06:05 v923z

After updating "compare", the code reduction is now 26%. Regarding the FFT, mine has pre-computed tables and is platform-specific. I fully understand your choice of no RAM overhead.

arnonsen avatar May 31 '21 12:05 arnonsen

Regarding the FFT, mine has pre-prepared tables and it's platform specific. I fully understand your choice of no RAM overhead.

If your routine brings significant speed improvements (say, a factor of 2), I am not against optionally adding that.

v923z avatar May 31 '21 17:05 v923z

https://github.com/v923z/micropython-ulab/pull/398 should resolve the original issue.

v923z avatar Jun 01 '21 16:06 v923z

I believe that between #398 and #650 this issue should be closed as well.

s-ol avatar Dec 12 '23 09:12 s-ol