modules icon indicating copy to clipboard operation
modules copied to clipboard

gProfiler splits sources twice

Open akaviaLab opened this issue 4 months ago • 5 comments
trafficstars

Have you checked the docs?

Description of the bug

The code in gProfiler module https://github.com/nf-core/differentialabundance/blob/3dd360fed0dca1780db1bdf5dce85e5258fa2253/modules/nf-core/gprofiler2/gost/templates/gprofiler2_gost.R#L222 seems to split the sources list twice

   sources <- opt\$sources
    if (!is.null(sources)) {
        sources <-  strsplit(opt\$sources, split = ",")[[1]]
}
    if (!is.null(sources)) {
        sources <-  strsplit(opt\$sources, split = ",")[[1]]
    }

The default string when using differentialabundance is "GO:MF, GO:BP, GO:CC, KEGG, REAC, WP, TF, MIRNA, HPA, CORUM, HP", which will result in gProfiler only analyzing GO:MF.

Command used and terminal output


Relevant files

No response

System information

No response

akaviaLab avatar Jun 24 '25 13:06 akaviaLab

Are you able to fix this?

SPPearce avatar Jun 29 '25 10:06 SPPearce

I will be able to fix this in a week or so, when I get a better computer. My computer is limping along right now.

Is there an explanation of the logic for using round in this function https://github.com/nf-core/differentialabundance/blob/3dd360fed0dca1780db1bdf5dce85e5258fa2253/modules/nf-core/gprofiler2/gost/templates/gprofiler2_gost.R#L89? I'm thinking about opening another issue about it (it flattens everything below 1E-3), but I'd like to read the logic (if there is one) before hand.

On Sun, Jun 29, 2025 at 11:46 AM Simon Pearce @.***> wrote:

SPPearce left a comment (nf-core/modules#8701) https://github.com/nf-core/modules/issues/8701#issuecomment-3016581001

Are you able to fix this?

— Reply to this email directly, view it on GitHub https://github.com/nf-core/modules/issues/8701#issuecomment-3016581001, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACQYYZVJWAHLDHY6TRDSCL33F67XTAVCNFSM6AAAAACAAJFNLKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZTAMJWGU4DCMBQGE . You are receiving this because you authored the thread.Message ID: @.***>

akaviaLab avatar Jun 30 '25 08:06 akaviaLab

I will be able to fix this in a week or so, when I get a better computer. My computer is limping along right now.

Is there an explanation of the logic for using round in this function https://github.com/nf-core/differentialabundance/blob/3dd360fed0dca1780db1bdf5dce85e5258fa2253/modules/nf-core/gprofiler2/gost/templates/gprofiler2_gost.R#L89? I'm thinking about opening another issue about it (it flattens everything below 1E-3), but I'd like to read the logic (if there is one) before hand.

I would imagine the logic for rounding is to prevent different compute architectures giving different results, but I suspect it should be rounding ~10dp not 3.

SPPearce avatar Jun 30 '25 12:06 SPPearce

On Mon, Jun 30, 2025 at 1:12 PM Simon Pearce @.***> wrote:

SPPearce left a comment (nf-core/modules#8701) https://github.com/nf-core/modules/issues/8701#issuecomment-3018919934

I would imagine the logic for rounding is to prevent different compute architectures giving different results, but I suspect it should be rounding ~10dp not 3.

I get why we'd want to round. I don't get why it is using round, which flattens data smaller than 1E-3. It would make more sense for it to use sprintf('%.3g') or something like that. Can you think of a reason to use round?

Reply to this email directly, view it on GitHub https://github.com/nf-core/modules/issues/8701#issuecomment-3018919934, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACQYYZR3ESKRQ6GPVTTZO6T3GEST3AVCNFSM6AAAAACAAJFNLKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZTAMJYHEYTSOJTGQ . You are receiving this because you authored the thread.Message ID: @.***>

akaviaLab avatar Jun 30 '25 16:06 akaviaLab

On Mon, Jun 30, 2025 at 1:12 PM Simon Pearce @.***> wrote: SPPearce left a comment (nf-core/modules#8701) <#8701 (comment)>

I would imagine the logic for rounding is to prevent different compute architectures giving different results, but I suspect it should be rounding ~10dp not 3. I get why we'd want to round. I don't get why it is using round, which flattens data smaller than 1E-3. It would make more sense for it to use sprintf('%.3g') or something like that. Can you think of a reason to use round?

The round function inside R will round to whatever number of decimal places that you want, while leaving it a number.

However, the code suggests that the default behaviour is not to round at all, controlled by round_digits. The differential abundance pipeline seems to set this to default to 4 via this config, so try --report_round_digits 10 on the command line.

SPPearce avatar Jun 30 '25 20:06 SPPearce

I feel like I'm not communicating well.

round(1E-4, digits=3) => 0 sprintf('%.3g', 1E-4) => 1.0E-4

Limiting the number of digits reported is a good thing, but using the specific round() function is problematic. There are alternatives to round that will not flatten the data. sprintf is one option. The default behavior of write.table is to another option, using a more complicated algorithm (grisu3, spelling?), which will by default use a lot of significant digits.

round is problematic since it will effectively flatten very small numbers. Especially for GO enrichment, which gets p-values that hover around 1E-12 or 1E-16. Even if the code used round(digits=10), it would still flatten most of the values to 0.

I have managed to work around it easily and get the actual numbers, I just think it is a bug in the module as written, since anyone who doesn't read the code will assume the values are actually 0.

On Mon, Jun 30, 2025 at 9:06 PM Simon Pearce @.***> wrote:

SPPearce left a comment (nf-core/modules#8701) https://github.com/nf-core/modules/issues/8701#issuecomment-3020536459

On Mon, Jun 30, 2025 at 1:12 PM Simon Pearce @.***> wrote: SPPearce left a comment (nf-core/modules#8701 https://github.com/nf-core/modules/issues/8701) <#8701 (comment) https://github.com/nf-core/modules/issues/8701#issuecomment-3018919934>

I would imagine the logic for rounding is to prevent different compute architectures giving different results, but I suspect it should be rounding ~10dp not 3. I get why we'd want to round. I don't get why it is using round, which flattens data smaller than 1E-3. It would make more sense for it to use sprintf('%.3g') or something like that. Can you think of a reason to use round?

— … <#m_-8560262360182227038_>

The round function inside R will round to whatever number of decimal places that you want, while leaving it a number.

However, the code suggests that the default behaviour is not to round at all, controlled by round_digits. The differential abundance pipeline seems to set this to default to 4 https://github.com/nf-core/differentialabundance/blob/3dd360fed0dca1780db1bdf5dce85e5258fa2253/nextflow.config#L34C5-L34C32 via this config https://github.com/nf-core/differentialabundance/blob/3dd360fed0dca1780db1bdf5dce85e5258fa2253/conf/modules.config#L355, so try --report_round_digits 10 on the command line.

— Reply to this email directly, view it on GitHub https://github.com/nf-core/modules/issues/8701#issuecomment-3020536459, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACQYYZQWH6D77NOYFKAAWNT3GGKFHAVCNFSM6AAAAACAAJFNLKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZTAMRQGUZTMNBVHE . You are receiving this because you authored the thread.Message ID: @.***>

akaviaLab avatar Jun 30 '25 21:06 akaviaLab

Hi Simon I've been working on fixing the code, now that I have my dev environment set up. After changing the code, I run nf-test, and get a different snapshot. Before overwriting the old one, I wanted to manually compare the files to make sure the change is as I wanted it.

So I checked out master again, and reran nf-test, using the command

$ nf-test test modules/nf-core/gprofiler2/gost/tests/main.nf.test --profile docker

🚀 nf-test 0.9.2
https://www.nf-test.com
(c) 2021 - 2024 Lukas Forer and Sebastian Schoenherr

The actual result is the same as after my changes, where the snapshot doesn't match the saved snapshot (full log attached). However, this is master, and should match up. Can you help me figure out what is going on?

nf-test.log

akaviaLab avatar Jul 14 '25 09:07 akaviaLab

Yes, I can confirm I see the same test failure. This may well mean the current snapshots are unstable in some way (machine precision, timestamps).

SPPearce avatar Jul 14 '25 10:07 SPPearce

I think the gprofiler webserver changed, because one of the changes is a missing MIRNA file, and the website had an announcement that it was removed due to updates.

There is a way to get the versions of the data used, see https://biit.cs.ut.ee/gprofiler/page/apis and also the R command gprofiler2::get_base_url()

~~I could theoretically convert it to YML, but that would require adding some kind of JSON/YML converter to the conda/docker.~~ Rscript can easily output the YML of the parameters, without installing anything else - the R yaml library is already present. I think a different PR for changing the versions.yml file makes sense - @SPPearce what do you think?

The data version can be different per organism. The R template can print out by organism, but I'm not sure how to do that in main.nf. Both the R template and the main.nf have lines to print out the versions.yml - does the main.nf overwrite the output of R template?

akaviaLab avatar Jul 14 '25 10:07 akaviaLab

Please ignore my previous comment, I was still learning about the stub section of Nextflow.

I do have two questions for you @SPPearce - should I output the data version in versions.yml? I think it makes sense. Second question - should I make a separate PR for outputting the data versions in versions.yml or have it in this PR?

akaviaLab avatar Jul 14 '25 13:07 akaviaLab

You can just make all the improvements into one PR, particularly given the current tool output has changed. Yes, if the data version is relevant then definitely output that.

SPPearce avatar Jul 14 '25 15:07 SPPearce