
Feature/scikit fingerprints

Hrovatin opened this pull request 1 year ago • 10 comments

See https://github.com/emdgroup/baybe/issues/359

Hrovatin avatar Sep 04 '24 04:09 Hrovatin

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar Sep 04 '24 04:09 CLAassistant

For the future: I think there's no need for the fork. I added you as a member; working directly on branches here has some advantages, e.g. others can check out your branch seamlessly.

Scienfitz avatar Sep 05 '24 07:09 Scienfitz

@Scienfitz I made a first round of changes covering all of the above. What is still missing is figuring out where NotEnoughPointsLeftError originates from.

EDIT: The NotEnoughPointsLeftError is also resolved now; it was associated with the test parameter naming.

Hrovatin avatar Sep 06 '24 12:09 Hrovatin

@Hrovatin It would be great if, as a final sanity check, you could run the examples/backtesting/full_lookup.py example and post the resulting picture. I've just done so on your fork and there are two observations:

i) results improve, which is fantastic; ii) there's also something super weird with the RDKit curve.

can you confirm?

Scienfitz avatar Sep 10 '24 11:09 Scienfitz

@Hrovatin once the open comments are tended to we can mark this PR as ready for review

As a last step before doing so, please i) make a copy of this branch as a backup and then ii) merge main into it to pick up all the other changes that came into the repo since you started the fork. From the looks of it there don't seem to be many conflicts, so that merge should be fine. Due to the amount of changes I think rebasing is not an option, but if you make backup branches you could also try that.
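The backup-then-merge flow suggested above can be sketched as follows. The branch names are illustrative, and the throwaway repository exists only to make the snippet self-contained:

```shell
set -e
# throwaway repo so the commands can be run safely end to end
tmp=$(mktemp -d); cd "$tmp"
git init -q -b main
git config user.email dev@example.com && git config user.name dev
git commit -q --allow-empty -m "init"
git checkout -q -b feature/scikit-fingerprints
git commit -q --allow-empty -m "feature work"
# i) create a backup copy of the feature branch
git branch backup/scikit-fingerprints
# simulate main moving on while the feature branch was in progress
git checkout -q main
git commit -q --allow-empty -m "main moved on"
# ii) merge main into the feature branch (the backup stays untouched)
git checkout -q feature/scikit-fingerprints
git merge -q --no-edit main
git branch --list 'backup/*'
```

In a real repo you would fetch first and merge `origin/main`; the key point is that the backup branch preserves the pre-merge state in case the merge goes wrong.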

Scienfitz avatar Sep 26 '24 17:09 Scienfitz

@Scienfitz for RDKit there is some ruggedness where it flattens off: (image)

I checked the encoding statistics, and RDKit does not seem to differ from ECFP in their distribution.

Number of features: (image)

Here are the mean and variance for each feature: (image) (image)
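The per-feature comparison above can be sketched roughly like this. The arrays are toy stand-ins for the actual encoding matrices (rows = molecules, columns = features); the real values come from the parameter's computed representation:

```python
import numpy as np

def feature_stats(comp: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Per-feature mean and variance of an encoding matrix (rows = molecules)."""
    return comp.mean(axis=0), comp.var(axis=0)

# toy stand-ins for two encodings of the same four molecules
ecfp_like = np.array([[0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [1.0, 0.0]])
rdkit_like = np.array([[0.5, 0.5], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5]])

m1, v1 = feature_stats(ecfp_like)
m2, v2 = feature_stats(rdkit_like)
```

If two encodings behaved very differently, it would show up here as clearly different per-feature variances (e.g. the constant-valued toy encoding has zero variance everywhere).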

Hrovatin avatar Oct 04 '24 07:10 Hrovatin

@Scienfitz I made the merge. Please check the comments that are still open and let me know if I should change something, or close them if they are OK with you.

Hrovatin avatar Oct 04 '24 07:10 Hrovatin

With tox -e fulltest-py312 I get the following - any ideas?

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <linear_operator.operators.matmul_linear_operator.MatmulLinearOperator object at 0x374c8f770>, eigenvectors = True, return_evals_as_lazy = False

    def _symeig(
        self: Float[LinearOperator, "*batch N N"],
        eigenvectors: bool = False,
        return_evals_as_lazy: Optional[bool] = False,
    ) -> Tuple[Float[Tensor, "*batch M"], Optional[Float[LinearOperator, "*batch N M"]]]:
        r"""
        Method that allows implementing special-cased symeig computation. Should not be called directly
        """
        from linear_operator.operators.dense_linear_operator import DenseLinearOperator
    
        if settings.verbose_linalg.on():
            settings.verbose_linalg.logger.debug(f"Running symeig on a matrix of size {self.shape}.")
    
        # potentially perform decomposition in double precision for numerical stability
        dtype = self.dtype
>       evals, evecs = torch.linalg.eigh(self.to_dense().to(dtype=settings._linalg_dtype_symeig.value()))
E       torch._C._LinAlgError: linalg.eigh: (Batch element 0): The algorithm failed to converge because the input matrix is ill-conditioned or has too many repeated eigenvalues (error code: 2).

.tox/fulltest-py312/lib/python3.12/site-packages/linear_operator/operators/_linear_operator.py:892: _LinAlgError

The above exception was the direct cause of the following exception:

campaign = Campaign(searchspace=SearchSpace(discrete=SubspaceDiscrete(parameters=(CategoricalParameter(name='Categorical_1', _val... A            OK         2.0   22.748903        1    1.0, _cached_recommendation=Empty DataFrame
Columns: []
Index: [])
n_iterations = 3, batch_size = 3

    @pytest.mark.slow
    @pytest.mark.parametrize(
        "kernel", valid_kernels, ids=[c.__class__ for c in valid_kernels]
    )
    @pytest.mark.parametrize("n_iterations", [3], ids=["i3"])
    def test_kernels(campaign, n_iterations, batch_size):
>       run_iterations(campaign, n_iterations, batch_size)

tests/test_iterations.py:243: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

rs = <RetryCallState 13644971472: attempt #5; slept for 0.0; last result: failed (_LinAlgError linalg.eigh: (Batch element ... failed to converge because the input matrix is ill-conditioned or has too many repeated eigenvalues (error code: 2).)>

    def exc_check(rs: "RetryCallState") -> None:
        fut = t.cast(Future, rs.outcome)
        retry_exc = self.retry_error_cls(fut)
        if self.reraise:
            raise retry_exc.reraise()
>       raise retry_exc from fut.exception()
E       tenacity.RetryError: RetryError[<Future at 0x374c696d0 state=finished raised _LinAlgError>]

Hrovatin avatar Oct 04 '24 08:10 Hrovatin

@Hrovatin the example picture is off, the curve for RDKit is wrong; it seems the batch size was influenced... something very strange is going on here. I've noticed this in one of my very earliest tests too. Before looking at the ruggedness etc., the curve needs to be fixed.

You can ignore the _LinAlgError; it is one of the expected exceptions. Some of our parameter combinations sometimes cause numerical instabilities that fail even after several repeats with different random seeds. This happens randomly, and you can ignore it for the purpose of this PR (it is often fixed by re-running the failed job anyway).
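A common mitigation for this kind of ill-conditioned eigendecomposition — an illustration of the general technique, not baybe's or gpytorch's actual code — is to retry with growing diagonal jitter, which pushes the repeated/near-zero eigenvalues apart:

```python
import numpy as np

def stable_eigh(mat: np.ndarray, jitter: float = 1e-6, max_tries: int = 5):
    """Eigendecompose a symmetric matrix, adding diagonal jitter on failure.

    Mirrors the usual Gaussian-process trick for ill-conditioned covariance
    matrices: first try the matrix as-is, then retry with increasing jitter.
    """
    eye = np.eye(mat.shape[0])
    for i in range(max_tries):
        try:
            return np.linalg.eigh(mat + (jitter * 10**i) * eye if i else mat)
        except np.linalg.LinAlgError:
            continue
    raise np.linalg.LinAlgError("eigh failed even with jitter")

# a rank-1 matrix: three repeated zero eigenvalues, like the failing case
cov = np.ones((4, 4))
evals, evecs = stable_eigh(cov)
```

The retry loop in the traceback (tenacity's RetryError after attempt #5) plays a similar role at a higher level: re-running with a different random seed changes the matrix rather than regularizing it.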

A second error you can ignore in the CI is that sometimes serializations seem to fail, without a good indication as to why. This does not happen locally in clean environments and is likely an artifact of packages, the agent, etc.; if it happens in your PR, ignore it.

Scienfitz avatar Oct 04 '24 08:10 Scienfitz

> @Hrovatin the example picture is off, the curve for RDKIT is wrong, seem the batch size was influenced... something very strange here is going on. Ive noticed this in one of my very earliest tests too, before looking at the ruggedness etc the curve needs to be fixed

When just running SubstanceParameter(name=chem_type, data=data, encoding=encoding).comp_df with RDKit and the molecules from the example, the shape of the samples in the output df is correct. Any other ideas where this could be coming from?

Hrovatin avatar Oct 04 '24 11:10 Hrovatin

@Hrovatin I changed the example and the corresponding plots. We will fix the issue encountered for RDKit in a separate PR, as we are still discussing the best way to implement it, but IMO that shouldn't affect or break this PR. I'll mark it ready for review. @AVHopp @AdrianSosic just FYI, I think it's time for reviewing this.

Scienfitz avatar Oct 08 '24 15:10 Scienfitz

Hi @Scienfitz, @AVHopp, @Hrovatin:

As agreed with @Hrovatin, I've just rebuilt the commit history to get rid of all the intermediate merge commits etc. In particular, what I've done:

  • Created a backup of the original state under backup/scikit-fingerprints
  • Squash-rebased everything on top of the latest main
  • Modified the author of that commit so that the work is clearly attributed to @Hrovatin
  • Separated out the commit of the svg files so that we can easily update/rebase them before merge

Next steps: I'm currently working on the review. That is, I will push a few commits of my own where I think changing things myself will be faster than trying to explain them here on GitHub, and will add my open questions/TODOs as comments afterward.

AdrianSosic avatar Oct 16 '24 10:10 AdrianSosic

hey docsmaster @AVHopp, the docs build for some reason suddenly seems to treat some of the SMILES strings as links, crashing the pipeline. Do you have any idea why? As far as I can see we didn't change any configuration or anything relevant in the file where it's failing, so I'm not sure.

Scienfitz avatar Nov 03 '24 17:11 Scienfitz