CPU and GPU usage too low

Open xiaoyaojianghuzai opened this issue 10 months ago • 10 comments

Hi Professor @mdbarnesUCSD, I changed my code to use the given parameters, but CPU usage is still very low. When the run reaches "making matrices for INDELs", CPU usage is very low and the step takes far too long to finish. I looked through sigpro.py.
I found that you use the multiprocessing package, but it does not seem to take effect.

I am running the code on Ubuntu Linux 22.04 with the latest version of the SigProfilerExtractor package.

xiaoyaojianghuzai avatar Apr 24 '24 04:04 xiaoyaojianghuzai

The VCF files are stored on a hard disk.

After searching for this problem on Google, I found that Python has the GIL. Does this prevent full use of the CPU? However, looking through past SigProfilerExtractor issues, I saw that some users even reported CPU and GPU usage that was too high. I cannot find the real cause or solve the problem.

So I used multiprocessing myself to run the different cancer types in parallel. But within each cancer type, the problem remains.

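from multiprocessing import Pool

from SigProfilerExtractor import sigpro as sig


def extract_signature_for_folder(folder):
    # Each folder holds the VCF files for one cancer type.
    input_dir = f"/harddisk/sxt/VCFinput/{folder}"
    output_path = f"/harddisk/sxt/output/{folder}"
    print(f"Signature extraction for {folder} started.")
    sig.sigProfilerExtractor("vcf", output_path, input_dir, "GRCh38")
    print(f"Signature for {folder} extracted.")


if __name__ == "__main__":
    # cancer_types (the list of folder names) is defined elsewhere in the script.
    with Pool(128) as p:
        p.map(extract_signature_for_folder, cancer_types)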
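
A pool of 128 workers can only be kept busy if there are at least that many folders and cores. A minimal sketch of sizing the outer pool to the work available, assuming `cancer_types` is a list of folder names (the two names below are taken from paths later in this thread) and using a placeholder function in place of the real extraction call; `pool_size` is a hypothetical helper:

```python
import os
from multiprocessing import Pool


def pool_size(n_tasks, max_workers=None):
    """Return a worker count no larger than the task count or the CPU count."""
    cpus = max_workers or os.cpu_count() or 1
    return max(1, min(n_tasks, cpus))


def process_folder(folder):
    # Placeholder standing in for extract_signature_for_folder(folder).
    return folder


if __name__ == "__main__":
    cancer_types = ["gum", "rectum"]  # assumed folder names, for illustration only
    with Pool(pool_size(len(cancer_types))) as p:
        print(p.map(process_folder, cancer_types))
```

This only limits the number of outer worker processes; it does not change how each extraction uses the CPU internally.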

xiaoyaojianghuzai avatar Apr 24 '24 04:04 xiaoyaojianghuzai

I look forward to your reply.

xiaoyaojianghuzai avatar Apr 24 '24 04:04 xiaoyaojianghuzai

JOB_METADATA.txt

I ran a small example as a test.

xiaoyaojianghuzai avatar Apr 24 '24 04:04 xiaoyaojianghuzai

Hi @xiaoyaojianghuzai,

Your input matrix has 96 rows and 2 columns, but your extraction is from signatures 1 to 25. This does not work and you need a larger input matrix (the max rank is 2 for a 96x2 input).
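
The shape constraint above can be checked before starting a run. A minimal sketch, assuming a catalog with one row per mutation channel and one column per sample; `check_signature_range` is a hypothetical helper and not part of SigProfilerExtractor:

```python
def check_signature_range(n_channels, n_samples, minimum_signatures, maximum_signatures):
    """Return the maximum rank of the matrix, or raise ValueError for an unusable range."""
    # The rank of a matrix cannot exceed either of its dimensions.
    max_rank = min(n_channels, n_samples)
    if minimum_signatures < 1 or minimum_signatures > maximum_signatures:
        raise ValueError("need 1 <= minimum_signatures <= maximum_signatures")
    if maximum_signatures > max_rank:
        raise ValueError(
            f"maximum_signatures={maximum_signatures} exceeds the maximum rank "
            f"{max_rank} of a {n_channels}x{n_samples} matrix"
        )
    return max_rank


if __name__ == "__main__":
    # A 96x2 catalog supports at most 2 signatures, so 1..25 is rejected.
    try:
        check_signature_range(96, 2, 1, 25)
    except ValueError as err:
        print(err)
```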

Please review the README and run the example using the matrix file as input (code below):

from SigProfilerExtractor import sigpro as sig


def main_function():
    # Get the input from table format (mutation catalog matrix).
    path_to_example_table = sig.importdata("matrix")
    # You can put the path to your own tab-delimited mutational catalog matrix/table here.
    data = path_to_example_table
    sig.sigProfilerExtractor("matrix", "example_output", data, opportunity_genome="GRCh38",
                             minimum_signatures=1, maximum_signatures=3)


if __name__ == "__main__":
    main_function()

mdbarnesUCSD avatar Apr 25 '24 17:04 mdbarnesUCSD

Hi Professor @mdbarnesUCSD, I am using VCF files as input. Should I change the maximum number of signatures too?

xiaoyaojianghuzai avatar Apr 26 '24 01:04 xiaoyaojianghuzai

Also, can I extract signatures only for SBS and DINUC and skip INDELs? When I change the context_type parameter, nothing changes: it still generates matrices for INDELs as usual. How can I do this?

xiaoyaojianghuzai avatar Apr 26 '24 02:04 xiaoyaojianghuzai

sig.sigProfilerExtractor("vcf", "/harddisk/sxt/output/gum", "/harddisk/sxt/VCFinput/gum", "GRCh38", minimum_signatures=1,maximum_signatures=3)

xiaoyaojianghuzai avatar Apr 27 '24 09:04 xiaoyaojianghuzai

How should I choose the maximum_signatures parameter when using VCF files as input?

xiaoyaojianghuzai avatar Apr 27 '24 10:04 xiaoyaojianghuzai

Hi @xiaoyaojianghuzai,

The maximum_signatures needs to be a value less than the number of samples that you have.

I would suggest starting with matrix inputs rather than VCFs. You can run SigProfilerMatrixGenerator to generate the matrices, and this may help you identify any issues with your VCFs. You can then use the INDEL matrix you created with SigProfilerMatrixGenerator as the input for SigProfilerExtractor.
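
The two-step workflow suggested above could be sketched as follows. This is an illustration under assumptions rather than a tested recipe: `generate_then_extract` and `indel_matrix_path` are hypothetical helper names, and the `output/ID/<project>.ID83.all` location is assumed from SigProfilerMatrixGenerator's documented output layout, so check it against the files your run actually writes.

```python
import os


def indel_matrix_path(vcf_dir, project, context="ID83"):
    """Path where the indel matrix is assumed to be written (verify on disk)."""
    return os.path.join(vcf_dir, "output", "ID", f"{project}.{context}.all")


def generate_then_extract(vcf_dir, project, output_dir, genome="GRCh38"):
    # Imported here so the path helper above works without the packages installed.
    from SigProfilerMatrixGenerator.scripts import SigProfilerMatrixGeneratorFunc as matGen
    from SigProfilerExtractor import sigpro as sig

    # Step 1: build the matrices from the VCF files in vcf_dir.
    matGen.SigProfilerMatrixGeneratorFunc(project, genome, vcf_dir)

    # Step 2: run the extraction on the indel matrix only.
    sig.sigProfilerExtractor(
        "matrix",
        output_dir,
        indel_matrix_path(vcf_dir, project),
        opportunity_genome=genome,
        minimum_signatures=1,
        maximum_signatures=3,
    )
```

For example, `generate_then_extract("/harddisk/sxt/VCFinput/gum", "gum", "/harddisk/sxt/output/gum")` would reuse the paths from this thread; the project name `gum` is an arbitrary label. Running the two steps separately also shows which one is slow.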

mdbarnesUCSD avatar May 06 '24 20:05 mdbarnesUCSD

Hello @mdbarnesUCSD, I ran the following code:

sig.sigProfilerExtractor("vcf","/home/sxt/HDD/output/rectum", "/home/sxt/HDD/VCFinput/rectum", "GRCh38",minimum_signatures=1,maximum_signatures=3)

The terminal prints

(base) sxt@C233-Primary-Server:~$ /home/sxt/miniconda3/bin/conda run -p /home/sxt/miniconda3 --no-capture-output python /tmp/pycharm_project_825/rectum.py

************** Reported Current Memory Use: 0.5 GB *****************

Starting matrix generation for SNVs and DINUCs...Completed! Elapsed time: 337.43 seconds.
Starting matrix generation for INDELs...    

Matrix generation for SNVs and DINUCs took about 5 minutes, but the INDEL step has been running for 10 days and still has not finished. Does the matrix generation step use anything that could speed it up, such as multiprocessing?

xiaoyaojianghuzai avatar May 10 '24 03:05 xiaoyaojianghuzai

Please generate your matrices separately and provide those as inputs to SigProfilerExtractor. The matrix generation step should not take anywhere near 10 days. How many mutations are you working with? Are you running out of memory?
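
One way to answer the question about mutation counts is to tally the records in the input folder. A rough sketch using only the standard library, assuming plain or gzip-compressed VCF files with tab-separated REF and ALT in columns 4 and 5; `count_variants` is a hypothetical helper, and its classification by allele length is a simplification:

```python
import glob
import gzip
import os


def count_variants(vcf_dir):
    """Count ALT alleles in vcf_dir, grouped as 'snv', 'indel', or 'other'."""
    counts = {"snv": 0, "indel": 0, "other": 0}
    paths = glob.glob(os.path.join(vcf_dir, "*.vcf"))
    paths += glob.glob(os.path.join(vcf_dir, "*.vcf.gz"))
    for path in sorted(paths):
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt") as handle:
            for line in handle:
                if line.startswith("#") or not line.strip():
                    continue  # skip header and blank lines
                fields = line.rstrip("\n").split("\t")
                if len(fields) < 5:
                    continue  # skip malformed records
                ref = fields[3]
                for alt in fields[4].split(","):
                    if alt in (".", "*") or alt.startswith("<") or "[" in alt or "]" in alt:
                        counts["other"] += 1  # missing, spanning, symbolic, or breakend allele
                    elif len(ref) == 1 and len(alt) == 1:
                        counts["snv"] += 1
                    elif len(ref) != len(alt):
                        counts["indel"] += 1
                    else:
                        counts["other"] += 1  # same-length multi-base substitution
    return counts
```

Calling `count_variants("/home/sxt/HDD/VCFinput/rectum")` on the folder from this thread would show whether the indel count is unusually large compared with the SNV count.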

mdbarnesUCSD avatar May 10 '24 17:05 mdbarnesUCSD

There is plenty of memory left. All the VCF files are about 5 GB. I am going to try generating the matrices separately.

xiaoyaojianghuzai avatar May 11 '24 08:05 xiaoyaojianghuzai

Please reach out if you are still encountering issues with your run.

mdbarnesUCSD avatar May 31 '24 18:05 mdbarnesUCSD