MMseqs2 E-value changes when using `--split-mode 0` (no way to set a constant db size for E-value computation)

Open apcamargo opened this issue 2 years ago • 1 comment

Expected Behavior

The E-values should be robust to changes in the number of splits.

Current Behavior

Splitting the database during search (`--split-mode 0`) will result in varying E-values, depending on the number of splits.

Steps to Reproduce (for bugs)

Run `mmseqs search` with `--split-mode 0` and varying values for `--split`.
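To quantify the effect, the E-values from two runs can be compared hit by hit. The following is a minimal sketch, assuming both results were exported as BLAST-style tab-separated lines with query in column 1, target in column 2 and E-value in column 11. The helper names and the sequence identifiers in the example are hypothetical and not part of MMseqs2.

```python
import csv
from typing import Dict, Iterable, Tuple

Hit = Tuple[str, str]


def read_evalues(lines: Iterable[str]) -> Dict[Hit, float]:
    """Map (query, target) to the lowest E-value seen in BLAST-tab lines."""
    best: Dict[Hit, float] = {}
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 11:
            continue  # skip blank or malformed lines
        key = (row[0], row[1])
        evalue = float(row[10])  # assumption: E-value is the 11th column
        if key not in best or evalue < best[key]:
            best[key] = evalue
    return best


def evalue_ratios(run_a: Iterable[str], run_b: Iterable[str]) -> Dict[Hit, float]:
    """For hits found in both runs, return E-value(run_b) / E-value(run_a)."""
    a = read_evalues(run_a)
    b = read_evalues(run_b)
    return {k: b[k] / a[k] for k in a.keys() & b.keys() if a[k] > 0}


# Made-up example lines standing in for two result files.
run_a = ["q1\tt1\t0.9\t100\t10\t0\t1\t100\t1\t100\t1e-5\t80"]
run_b = ["q1\tt1\t0.9\t100\t10\t0\t1\t100\t1\t100\t2e-5\t80"]
ratios = evalue_ratios(run_a, run_b)
```

A ratio that differs from 1 for the same query and target pair indicates that the reported E-value depends on the run settings rather than on the alignment alone.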

MMseqs Output (for bugs)

Please make sure to also post the complete output of MMseqs. You can use gist.github.com for large output.

Suggestion

One possible way to fix this would be to set a constant database size internally when using `--split-mode 0`. It could be useful to have this parameter exposed to the user, similar to `-Z` in HMMER:

-Z <x>  Assert that the total number of targets in your searches is
        <x>, for the purposes of per-sequence E-value calculations,
        rather than the actual number of targets seen.

Jan 09 '23 15:01 apcamargo

I have a related issue. If I split a DB into several parts and then combine the results, I get significantly different results (about 20% fewer hits) than if I query the whole DB using the same mmseqs search parameters. I would like the output to contain all the hits that match my search criteria (which are based on sequence identity and coverage, not E-values).

Oct 26 '23 12:10 daron-m-standley
