pytest-regressions `num_regression` cannot work with non squared numerical data

I wanted to test the coordinates of a geometry generated from a polygon and make sure it's always the same. the output is a geojson dictionnary and here is a small example if you are not familiar with the format:

{
  "type": "Feature",
  "geometry": {
    "type": "Point",
    "coordinates": [125.6, 10.1]
  },
  "properties": {
    "name": "Dinagat Islands"
  }
}

As the only thing I want to check is the geometry I will look into "coordinates". These coordinates are always set with many digits (8 in my case) which always create issue when dealing with the absolute compariasons of data_regression. So I turn myself to the num_regressionfixture that seems perfect for my use case. It was working fine until I tested more complicated geometries, specifically multipolygons. In this case the list of coordinates cannot be transformed into a nd.array as the different polygons forming the geometry are not of equal size. raising the following error:

ValueError: setting an array element with a sequence

Again to give an example my data can look like the following sequence:

numpy.array([[1, 2], [2, 3, 4]])         # wrong!

Question is why do you need to make transform the data into nd.array in the first place ? Could you instead support the comparison manually and thus support any sequence shape ?

Jun 12 '24 10:06 12rambau

Hi @12rambau,

Question is why do you need to make transform the data into nd.array in the first place ?

That's how it is implemented today, we turn it into an nd.array because in the end we dump the arrays into a dataframe:

https://github.com/ESSS/pytest-regressions/blob/master/src/pytest_regressions/num_regression.py

Could you instead support the comparison manually and thus support any sequence shape ?

Not sure. If you want to dig into the code and see if there is a simple and backward compatible solution, we would love to review a PR in that direction. :+1:

Jun 12 '24 11:06 nicoddemus

It will be necessary for my use case so i'll definitely look into it. I think the easiest way is to make num_regression independant from its dataframe counterpart ence recoding the check mechanism. It looks like a fun challenge. I cannot commit to a speady implementation but I'll try to work something out. For retro compatibility I guess the main challenge will be the the file format. I was thinking on relying on a yaml file (to honor nesting) but that won't work with the existing .csv format used in the current implementation.

Jun 12 '24 12:06 12rambau

Actually I though about another solution that i already applied in my pytest-regressions extention for GEE object: pytest-gee: What if instead i was adding a "prescision" mechanism in data_regression ?

At the end of the day, the only type that is problematic is float, rest is either int or serialized string that will always have the same representation in a yaml dictionary. So I kept the data_regression fixture and simply applied the following little massage on the data:

def round_data(data: Union[list, dict], prescision: int = 6) -> Union[list, dict]:
    """Recusrsively Round the values of a list to the given prescision."""
    # change the generator depending on the collection type
    generator = enumerate(data) if isinstance(data, list) else data.items()
    for k, v in generator:
        if isinstance(v, (list, dict)):
            data[k] = round_data(v, prescision)
        elif isinstance(v, float):
            data[k] = round(v, prescision)
        else:
            data[k] = v
    return data

before checking the correspondance.Would you accept a POC PR ?

By doing so I stop having a problem with num_regression that can then be remain a square check i'm just slightly increasing the capacity of data_regression to manage floating values.

Dec 12 '24 20:12 12rambau

I like the idea of adding a precision: None|int=None parameter to data_regression which rounds the values before saving... that would match what Rust insta does with its "redaction" feature, and would be simple/good enough for data_regression (and there is always num_regression in case someone wants more control over that anyway).

@tadeu thoughts?

Dec 16 '24 10:12 nicoddemus

Yes, that would be welcome!

The ideal solution would be to actually compare the numerical data using relative and absolute tolerances in the same way that num_regression does. Rounding is different, it works most of the time, but sometimes numbers are very close and end up rounded to different numbers (e.g., 0.4999999 and 0.5000001 rounded to 0.0 and 1.0).

But since the current implementation is based on pure textual comparison that ideal solution would be complicated to implement, so rounding is good. Maybe let's name the parameter rounding_digits to be clear that it's using round.

Another thing to take care is if you have very small or very large values. round is absolute (as opposite to relative, i.e. it doesn't consider the magnitude), so if numbers are all like 1e-10, 5e-10, etc., and rounding digits are 5, they'll all be rounded to 0.0.

Dec 16 '24 13:12 tadeu

Another thing to take care is if you have very small or very large values. round is absolute (as opposite to relative, i.e. it doesn't consider the magnitude), so if numbers are all like 1e-10, 5e-10, etc., and rounding digits are 5, they'll all be rounded to 0.0.

"take care" in what sense, can you clarify? I mean, should the implementation do something special (if so, what?), or do you mean users need to be aware when using this (meaning we should document it accordingly)?

Dec 16 '24 16:12 nicoddemus

I'm also a bit confused by this comment becasue to me prescision from numpy is absolute as well.

Dec 16 '24 19:12 12rambau

Another thing to take care is if you have very small or very large values. round is absolute (as opposite to relative, i.e. it doesn't consider the magnitude), so if numbers are all like 1e-10, 5e-10, etc., and rounding digits are 5, they'll all be rounded to 0.0.

"take care" in what sense, can you clarify? I mean, should the implementation do something special (if so, what?), or do you mean users need to be aware when using this (meaning we should document it accordingly)?

Both things.

Users need to be aware because "precision" might not be what is expected. "Precision" is related to significant figures - with a precision of 3 places someone might expect that 1.234e10 and 1.235e10 are acceptable as approximately equal.

We could (1) change it to round using significant figures and keep using the name "precision", but this complicates the implementation a bit as there's no function for this in the stdlib or NumPy (there's a nice implementation in Stack Overflow though).

But, more simply, we could (2) just rename precision to rounding_digits, so the expectations are clear.

(Note that NumPy calls it decimals, implying "number of digits after the decimal separator".)

Dec 17 '24 12:12 tadeu