
Improvements to the CSV compare tool for zero reference signals

Open casella opened this issue 1 year ago • 12 comments

While reviewing regressions from MSL 4.0.0 to MSL 4.1.0 (#4341) with @tobolar, we stumbled upon several reference signals that were very close to zero, but not exactly zero due to numerical approximation. In some cases (e.g. ModelicaTest.MultiBody.Joints.Universal.diff.universal.w_a) the variable, which ideally has an equilibrium trajectory at a constant zero value, is actually unstable, so any small numerical error grows exponentially.

For non-zero variables, numerical errors are normally not a problem, because their relative magnitude is small, so they easily fall within the tolerance tube. Unfortunately, this is not the case for variables that are meant to be zero. In fact, the more precise the solution is, the more absurdly small the high and low limits will be. We had regressions from MSL 4.0.0 to MSL 4.1.0 because of variables that were off by 1.3e-16 😨

IMHO there are two proper solutions to this problem.

The first could be to somehow include the nominal attribute of each variable in the CSV data. This would allow checking the error with a criterion such as

abs(v - v_ref)/max(abs(v), v.nominal) < tol

which is also robust for variables that are ideally zero, since the relative scaling is then performed with their nominal value.
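For illustration, a minimal point-wise sketch of this check in Python (a hypothetical helper, not the actual csv-compare implementation; it assumes numpy and that the nominal value is provided together with the reference signal):

import numpy as np

def passes_relative_check(v, v_ref, nominal, tol=1e-3):
    # Point-wise criterion: abs(v - v_ref) / max(abs(v), nominal) < tol,
    # evaluated on a common time grid.
    v, v_ref = np.asarray(v, float), np.asarray(v_ref, float)
    scale = np.maximum(np.abs(v), nominal)
    return bool(np.all(np.abs(v - v_ref) / scale < tol))

# A signal that is ideally zero but carries ~1e-16 numerical noise passes
# once its nominal value (here 1.0) provides the scaling.
print(passes_relative_check([1.3e-16, -2e-16], [0.0, 0.0], nominal=1.0))  # True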

The second is to take into account the fact that in most (but not all) cases these variables are parts of a vector, e.g. the three Cartesian coordinates of a position, velocity, or acceleration signal, the components of a quaternion, or the Real and Imaginary parts of a Complex number. In this case, the error should of course be evaluated by looking at the entire vector, e.g.

abs(v[j] - v_ref[j])/max(sqrt(sum(v[k]^2 for k in 1:size(v,1))), v.nominal) < tol

for 3D vectors or

abs(v.re - v_ref.re)/max(sqrt(v.re^2 + v.im^2), v.nominal) < tol
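As an illustration, a hedged sketch of the vector variant in Python (again a hypothetical helper, not part of csv-compare), scaling each component error by the Euclidean norm of the whole vector, floored by the nominal value:

import numpy as np

def passes_vector_check(v, v_ref, nominal, tol=1e-3):
    # v, v_ref: arrays of shape (n_time, n_components), e.g. 3 Cartesian
    # coordinates or the (re, im) pair of a Complex number.
    v, v_ref = np.asarray(v, float), np.asarray(v_ref, float)
    norm = np.linalg.norm(v, axis=1)                  # sqrt(sum(v[k]^2)) per time point
    scale = np.maximum(norm, nominal)[:, np.newaxis]  # broadcast over components
    return bool(np.all(np.abs(v - v_ref) / scale < tol))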

Unfortunately, there is no clear Modelica semantics to declare that an array is a 3D vector or a quaternion, or that a Complex operator record is actually a vector in the Gauss plane. Maybe we need some annotations in the language to express that unambiguously? This could be added to the base classes in MSL, without the need to add this information every time in user models.

Until these issues are resolved, I am convinced that we should avoid including reference signals that are expected to be exactly zero in the analytic solution, because they are basically just misleading. Of course you can reproduce them with the same tool, but even different versions of the same tool may produce completely different results, not to mention comparing different tools that use different numerical methods. We should avoid wasting time over "fake regressions" on such variables.

What do you think?

casella avatar Jun 18 '24 17:06 casella

I think @casella is right, but it could be tedious to define nominal values for all reference signals manually. As far as I know, all tools determine some sort of nominal values to calculate integration errors and thus the proper step size of variable-step integration algorithms. It's just a question of how to access these nominal values in order to utilize them ... @HansOlsson any idea?

AHaumer avatar Jun 18 '24 17:06 AHaumer

@AHaumer I meant the nominal attributes. They are often defined in basic SI types (e.g. Pressure has nominal = 1e6)

casella avatar Jun 19 '24 00:06 casella

A simple and pragmatic solution would be to assume that all of the variables have a nominal of at least 1e-3, and update the test-program based on that. We might miss some cases that should have a tighter tolerance - but we can deal with that later - without running the risk of accidentally removing those signals in this step.
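As a rough illustration only (the real csv-compare tool builds a tolerance tube around the reference, which is more elaborate than this point-wise check), the change essentially amounts to flooring the scaling at 1e-3:

import numpy as np

MIN_NOMINAL = 1e-3  # assumed floor for all variables

def within_tolerance(v, v_ref, tol=2e-3):
    # Signals that are ideally zero are then judged on an absolute basis
    # (tol * MIN_NOMINAL) instead of a vanishing relative one.
    v, v_ref = np.asarray(v, float), np.asarray(v_ref, float)
    scale = np.maximum(np.abs(v_ref), MIN_NOMINAL)
    return bool(np.all(np.abs(v - v_ref) <= tol * scale))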

HansOlsson avatar Jun 19 '24 07:06 HansOlsson

I think @casella is right, but it could be tedious to define nominal values for all reference signals manually. As far as I know, all tools determine some sort of nominal values to calculate integration errors and thus the proper step size of variable-step integration algorithms. It's just a question of how to access these nominal values in order to utilize them ... @HansOlsson any idea?

After additional thought I think this will be quite complicated, whereas just updating the tool to use a fixed nominal value (of 1e-3 or just 1) will be straightforward and solve the problem. Note that due to how nominals should work, it will only be relevant for variables whose nominal is lower than this fixed value. Pressures with a nominal of 1e5 will work as well (or badly) as before.

The reasons are due to the internals:

  • The comparison is based on result files converted to csv-files, and there is no obvious place to store the nominal values in the csv-files.
    • Storing them in the header would be the most obvious choice, but would mess up other uses of the csv-files.
    • Storing them as values at a fake time would "work", but would also mess things up in other cases.
  • Storing them in comparisonSignals.txt as a second column has the additional problems that, as far as I understand, the information isn't directly used by csv-compare, and that it would also be problematic to update.
  • The result files (at least from Dymola) don't currently contain this information, so it cannot be copied to the csv-files. It would be possible to add it, but with similar issues.

HansOlsson avatar Jun 19 '24 12:06 HansOlsson

The second is to take into account the fact that in most (but not all) cases these variables are parts of a vector, e.g. the three Cartesian coordinates of a position, velocity, or acceleration signal, the components of a quaternion, or the Real and Imaginary parts of a Complex number. In this case, the error should of course be evaluated by looking at the entire vector, e.g.

For what it's worth, this is a feature we use for our own correctness comparisons. However, we don't have any automatic mechanism for identifying when the feature should be used, so we apply it manually as needed.

A closely related feature is that when the unit quaternion is actually marked as a parameterization of SO(3), we can account for the quotient space effect of q and -q representing the same orientation. In practice, we rarely need this feature, but it addresses a fundamental problem when initializing orientations from complicated nonlinear equation systems where it is hard (or not desirable) to control exactly which of the two equivalent quaternions to expect.
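For illustration, a minimal sketch of how the q / -q equivalence can be handled when comparing unit-quaternion trajectories (a hypothetical helper, not necessarily how our tool does it): compare against both the reference and its negation and keep the smaller deviation.

import numpy as np

def quaternion_deviation(q, q_ref):
    # Deviation between unit quaternions, treating q_ref and -q_ref as the
    # same orientation (quotient-space effect on SO(3)).
    q, q_ref = np.asarray(q, float), np.asarray(q_ref, float)
    return min(np.linalg.norm(q - q_ref), np.linalg.norm(q + q_ref))

# Two sign-flipped representations of the same orientation:
print(quaternion_deviation([0, 0, 0, 1], [0, 0, 0, -1]))  # 0.0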

henrikt-ma avatar Jun 24 '24 09:06 henrikt-ma

A simple and pragmatic solution would be to assume that all of the variables have a nominal of at least 1e-3, and update the test-program based on that. We might miss some cases that should have a tighter tolerance - but we can deal with that later - without running the risk of accidentally removing those signals in this step.

It's not the perfect solution but I see it as a reasonable one.

@HansOlsson the question is: who can take care of this and how long would that take? Can we make it in time for MSL 4.1.0 without delaying it unnecessarily?

casella avatar Jun 28 '24 22:06 casella

What I propose is to add an extra (and optional) row, explicitly marked by the keyword "nominal", for the nominal values of each signal. For example:

"time", "v1", "v2"
0.0, 0.0, 1e-8 
1.0, 1.0, -2e-8
2.0, 1.5, 1.5e-9
3.0, 1.0, 5e-9
4.0, 0.0, -1e-8

becomes

"time", "v1", "v2"
"nominal", 1.0, 1e-8
0.0, 0.0, 1e-8 
1.0, 1.0, -2e-8
2.0, 1.5, 1.5e-9
3.0, 1.0, 5e-9
4.0, 0.0, -1e-8

This is all simple, self-contained within the simulation result file, optional and explicit.
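A sketch of how a reader could consume such a file under this proposal (assuming the optional record carries the literal keyword "nominal" in its first column):

import csv

def read_csv_with_nominals(path):
    # Returns (names, nominals, rows); nominals is None if the optional
    # "nominal" record is absent.
    with open(path, newline="") as f:
        reader = csv.reader(f, skipinitialspace=True)
        names = next(reader)              # e.g. ["time", "v1", "v2"]
        nominals = None
        rows = []
        for record in reader:
            if not record:
                continue
            if record[0] == "nominal":
                nominals = [float(x) for x in record[1:]]  # no nominal for "time"
            else:
                rows.append([float(x) for x in record])
    return names, nominals, rows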

beutlich avatar Sep 10 '24 19:09 beutlich

Would this still qualify as a CSV file?

casella avatar Sep 17 '24 10:09 casella

Would this still qualify as a CSV file?

I would say it would kind of qualify, but most tools using CSV-files for signals expect that after the (optional) header there is just numerical data, not text, so it wouldn't work well.

To me "kind of CSV" would lose the main portability advantage of a normal CSV-file.

HansOlsson avatar Sep 17 '24 10:09 HansOlsson

I see, it's only common to have one optional header record in CSV files (which are non-standardized anyway), not more than one. Still, I consider my proposal convenient enough for the above-mentioned reasons.

beutlich avatar Sep 17 '24 18:09 beutlich

Maybe one could have two CSV output files, one for the simulation results and one for the nominal values. This wouldn't cause problems when reading the CSV file with functions meant for CSV files.

casella avatar Sep 17 '24 23:09 casella

I see, it's only common to have one optional header record in CSV files (which are non-standardized anyway), not more than one. Still, I consider my proposal convenient enough for the above-mentioned reasons.

If you have to go down that route, I'd recommend at least putting such a preamble before the column header line, i.e. before the regular content, and marking it with a comment character.

# "nominal", 1.0, 1e-8
"time", "v1", "v2"
0.0, 0.0, 1e-8 
1.0, 1.0, -2e-8
2.0, 1.5, 1.5e-9
3.0, 1.0, 5e-9
4.0, 0.0, -1e-8

Some csv parsing APIs have a way to specify how many lines to skip at the beginning (e.g. pandas' read_csv() skiprows argument), so that way the "regular" content can at least be parsed correctly without having to manually delete the first line and without confusing potential datatype detection (otherwise the first value of the time column is a string).

Also, some APIs have a way to indicate a comment character like # with lines that should be ignored (again, read_csv(comment='#',...)), so the leading # helps in that case.

Finally, if you need to get at the content of the "nominal" row, just reading in the first line of the file is simpler than having to count lines.
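For example, with a preamble line like the one above, something along these lines would work (a sketch; the file name is hypothetical):

import pandas as pd

# Read the nominal values from the first (commented) line only.
with open("results.csv") as f:
    first_line = f.readline().lstrip("# ").strip()
nominals = [float(x) for x in first_line.split(",")[1:]]  # skip the "nominal" label

# Read the regular content; the comment character makes pandas ignore the preamble.
data = pd.read_csv("results.csv", comment="#", skipinitialspace=True)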

bilderbuchi avatar Sep 18 '24 06:09 bilderbuchi