MMseqs2
E-value changes when using `--split-mode 0` (no way to set a constant db size for E-value computation)
Expected Behavior
The E-values should be robust to changes in the number of splits.
Current Behavior
Splitting the database during search (`--split-mode 0`) will result in varying E-values, depending on the number of splits.
Steps to Reproduce (for bugs)
Run `mmseqs search` with `--split-mode 0` and varying values for `--split`.
MMseqs Output (for bugs)
Please make sure to also post the complete output of MMseqs. You can use gist.github.com for large output.
Suggestion
One possible way to fix this would be to set a constant database size internally when using `--split-mode 0`. It could be useful to have this parameter exposed to the user, similar to `-Z` in HMMER:
-Z <x> Assert that the total number of targets in your searches is
<x>, for the purposes of per-sequence E-value calculations,
rather than the actual number of targets seen.
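To illustrate what a fixed database size would achieve, here is a minimal sketch that rescales an E-value from the size of one split to the size of the full database. It assumes the E-value is linear in the number of targets searched (E = p × N), which matches the `-Z` description quoted above; whether MMseqs2 scales E-values this way internally is an assumption, not something confirmed here. The function name `rescale_evalue` and all numbers are hypothetical.

```python
def rescale_evalue(evalue: float, searched_size: int, reference_size: int) -> float:
    """Rescale an E-value computed against `searched_size` targets
    to a fixed `reference_size`, assuming E is linear in database size."""
    if searched_size <= 0 or reference_size <= 0:
        raise ValueError("database sizes must be positive")
    return evalue * reference_size / searched_size


# Hypothetical example: the same hit reported from one half of a 2-way split.
full_db_size = 1_000_000          # assumed number of targets in the whole database
split_db_size = full_db_size // 2  # each split holds half of the targets
evalue_in_split = 5e-4             # made-up E-value reported for the split

# Under the linearity assumption, halving the database halves the E-value,
# so rescaling to the full size doubles it again.
print(rescale_evalue(evalue_in_split, split_db_size, full_db_size))
```

With a constant reference size, the rescaled value would no longer depend on how many splits were used, which is the behavior requested in this issue.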
I have a related issue. If I split a DB into several parts and then combine the results, they differ significantly (about 20% fewer hits) from what I get when I query the whole DB using the same `mmseqs search` parameters. I would like the output to contain all the hits that match my search criteria (which are based on sequence identity and coverage, not E-values).