flownet Add MISFIT_PREPROCESSOR to ERT template

This PR implements scaling of correlated observations using ERT's built-in PCA scaling method.

Contributor checklist

[x] :tada: This PR closes #206.
[x] :scroll: I have broken down my PR into the following tasks:
- [x] ~~Add STD_SCALE_CORRELATED_OBS TRUE~~ (On its way out - should use MISFIT_PREPROCESSOR)
- [x] Bump ERT version to latest master commit
- [x] Bump libres to version v6.0.0.rc0
- [x] Bump ERT and libres to released 2.16 and v6.0.0 (waiting for release)
- [x] Add MISFIT_PREPROCESSOR as PRE_FIRST_UPDATE hooked workflow
- [ ] Test and visualize results
[ ] :robot: I have added tests, or extended existing tests, to cover any new features or bugs fixed in this PR.
[ ] :book: I have considered adding a new entry in CHANGELOG.md.
[ ] :books: I have considered updating the documentation.

Oct 14 '20 09:10 wouterjdb

✔️ kmeans clustering has now been added https://github.com/equinor/semeio/pull/286

🚫 currently still blocked by https://github.com/equinor/ert/pull/1316

Feb 09 '21 10:02 wouterjdb

✔️ kmeans clustering has now been added https://github.com/equinor/semeio/pull/286

✔️ speed improvement for many observations https://github.com/equinor/ert/pull/1316

Feb 18 '21 08:02 wouterjdb

🚫 Currently blocked by the new commits not yet being in pypi.

Feb 18 '21 08:02 wouterjdb

Both packages are now updated on pypi (2.21.b0 and 1.0.b0)

✔️ Ready for testing.

Feb 22 '21 09:02 wouterjdb

I have tested the MISFIT_PREPROCESSOR option in the Norne case by using the code in the branch of this PR. With this workflow job enabled, ERT writes some files to a subfolder inside the FlowNet output folder (<FLOWNET_OUTPUT_FOLDER>/reports/default_0):

1. Inside subfolder CorrelatedObservationsScalingJob, 3 files are created:
   a. scale_factor.json: [34.63, 14.76]
   b. svd.json: a 2D array of size (33, 2) containing what appears to be two lists of 33 singular values in decreasing order.
   c. workflow-log.txt: a text file with some information about the calculation of the scaling factors stored in scale_factor.json - in this case two blocks of information indicating the number of primary components, number of observations and a list of observation keys used to calculate the scaling factor.
2. Inside subfolder MisfitPreprocessorJob, 4 files are created:
   a. clusters.json: a Python dictionary of dictionaries associating the observation keys to their numbering
   b. correlation_matrix.csv: a rather large CSV file (950 MB) which was hard to inspect given its size (but I believe a square matrix Nobs x Nobs).
   c. svd.json: a 2D array of size (33, 1) containing what appears to be a list of 33 singular values in decreasing order (same as one of the lists stored in 1.b)
   d. workflow-log.txt: a text file with some information about the obtained clusters of observations - in this case two clusters as stored in clusters.json, cluster 0 and cluster 1, with their respective list of observation keys and numbering (cluster 1 appears to contain many more observation keys than cluster 0)
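
For anyone who wants to repeat this inspection, here is a minimal sketch that summarizes the report files without opening the large CSV in an editor. The function names are hypothetical, and the layout of clusters.json ({cluster_id: {observation_key: numbering}}) is an assumption based on the description above, not a documented format.

```python
import csv
import json
from pathlib import Path


def csv_shape(path):
    """Count rows and columns of a CSV by streaming it line by line."""
    n_rows = 0
    n_cols = 0
    with open(path, newline="") as handle:
        for row in csv.reader(handle):
            n_rows += 1
            n_cols = max(n_cols, len(row))
    return n_rows, n_cols


def summarize_reports(report_dir):
    """Collect a few summary numbers from the two job subfolders."""
    report_dir = Path(report_dir)
    scaling = report_dir / "CorrelatedObservationsScalingJob"
    misfit = report_dir / "MisfitPreprocessorJob"

    scale_factors = json.loads((scaling / "scale_factor.json").read_text())
    clusters = json.loads((misfit / "clusters.json").read_text())

    return {
        "scale_factors": scale_factors,
        # Assumed layout: {cluster_id: {observation_key: numbering}}
        "keys_per_cluster": {cid: len(keys) for cid, keys in clusters.items()},
        # The counts include any header row or index column the file may have
        "correlation_csv_shape": csv_shape(misfit / "correlation_matrix.csv"),
    }
```

Calling `summarize_reports("<FLOWNET_OUTPUT_FOLDER>/reports/default_0")` would then give the scaling factors, the number of observation keys per cluster, and the row and column counts of the correlation matrix, which is enough to check whether it is square.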

Mar 26 '21 12:03 edubarrosTNO

All in all, the only thing that I could infer from these output files is that 2 clusters of observations seem to be formed, each assigned a scaling factor calculated from some singular value decomposition or PCA (with 33 non-zero singular values). It remains unclear why 2 clusters were formed and how the singular values are used to determine the scaling factors.
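
One possible reading, given here only as a hedged sketch: a common PCA-based rule keeps the leading components needed to explain a chosen fraction of the variance and scales a group of observations by the square root of (number of observations / number of retained components). Whether the job does exactly this is an assumption on my part; the 0.95 threshold and both function names are hypothetical.

```python
import math


def n_components_for_variance(singular_values, threshold=0.95):
    """Smallest number of leading components whose cumulative explained
    variance (proportional to squared singular values) reaches `threshold`."""
    total = sum(s * s for s in singular_values)
    cumulative = 0.0
    for count, s in enumerate(singular_values, start=1):
        cumulative += s * s
        if cumulative / total >= threshold:
            return count
    return len(singular_values)


def scaling_factor(n_observations, n_components):
    """Assumed rule: sqrt(number of observations / retained components)."""
    return math.sqrt(n_observations / n_components)


# Made-up singular values in decreasing order, for illustration only
example_values = [10.0, 3.0, 1.0, 0.5]
n_kept = n_components_for_variance(example_values)
factor = scaling_factor(1200, n_kept)
```

If the rule holds, feeding each list from svd.json together with the number of observations and primary components reported in workflow-log.txt should reproduce the values in scale_factor.json. A mismatch would show that the job uses a different rule.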

Another observation is that, when I ran it a second time, the output of MISFIT_PREPROCESSOR differed from the first attempt. In the second run, 3 clusters seem to have been formed: the scaling factor of cluster 0 remained close to the factor calculated in the first attempt, and the scaling factors of clusters 1 and 2 add up approximately to the scaling factor of cluster 1 in the first attempt (suggesting that, in this second run, the old cluster 1 was split into two clusters). In summary, there seems to be some randomness associated with the MISFIT_PREPROCESSOR process despite the fact that the RANDOM_SEED fixed in the ERT config file is the same in both runs. This should be reported in the ERT repository.
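
To back up such a report, the cluster assignments of the two runs could be compared directly. The sketch below counts how observation keys move between clusters; it assumes clusters.json has the layout {cluster_id: {observation_key: numbering}} and that each key belongs to exactly one cluster. The function names are hypothetical.

```python
from collections import Counter


def key_to_cluster(clusters):
    """Invert {cluster_id: {observation_key: ...}} into {observation_key: cluster_id}."""
    return {key: cid for cid, keys in clusters.items() for key in keys}


def compare_runs(clusters_a, clusters_b):
    """Count observation keys per (cluster in run A, cluster in run B) pair."""
    a = key_to_cluster(clusters_a)
    b = key_to_cluster(clusters_b)
    shared = sorted(set(a) & set(b))
    return Counter((a[key], b[key]) for key in shared)


# Made-up example: cluster "1" of the first run is split in the second run
run_1 = {"0": {"A": 1, "B": 2}, "1": {"C": 3, "D": 4, "E": 5}}
run_2 = {"0": {"A": 1, "B": 2}, "1": {"C": 3}, "2": {"D": 4, "E": 5}}
moves = compare_runs(run_1, run_2)
```

With the real files, each dictionary would come from `json.load` on the clusters.json of one run. A split of the old cluster 1 would show up as two entries that share the same first label, as in the made-up example.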

To conclude: based on my tests on the Norne example, I would not recommend merging this PR branch to master before we better understand what this option is doing exactly and ensure that we can control any randomness associated with this process. If we do proceed with merging, my advice would be to expose this as an optional setting in the FlowNet config file and make sure it is disabled by default. The large number of failing FlowNet simulations when this option was enabled stopped me from determining whether or not it would help mitigate the problem of having a very large number of observations in our FlowNet runs.

Mar 26 '21 12:03 edubarrosTNO