[MetagenomeSimulationPipeline] [Errno 24] Too many open files in line 99

Open mgabriell1 opened this issue 3 years ago • 10 comments

Hi, first of all, thanks for developing and maintaining this tool! I'm simulating 5 samples (5 Gbp each) using two communities derived from the supplied genomes, but eventually, after the first sample completes, I get this error:

2022-02-26 19:50:14 INFO: [GenomePreparation 40702338517] Simulating reads from euk_GCA_000260095.1: '/mnt/d/out_communityCreation_5Gbp/source_genomes/GCA_000260095.1_complete.fasta'
2022-02-26 19:50:18 ERROR: [MetagenomeSimulationPipeline] [Errno 24] Too many open files in line 99
2022-02-26 19:50:18 INFO: [MetagenomeSimulationPipeline] Metagenome simulation aborted
Traceback (most recent call last):
  File "metagenomesimulation.py", line 872, in <module>
    pipeline.run_pipeline()
  File "metagenomesimulation.py", line 140, in run_pipeline
    self._project_file_folder_handler.remove_directory_temp()
  File "/mnt/c/CAMISIM-1.3/scripts/projectfilefolderhandle.py", line 142, in remove_directory_temp
    shutil.rmtree(self._tmp_dir)
  File "/home/dottorandi/anaconda3/envs/camisim-env/lib/python3.6/shutil.py", line 486, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/home/dottorandi/anaconda3/envs/camisim-env/lib/python3.6/shutil.py", line 408, in _rmtree_safe_fd
    onerror(os.listdir, path, sys.exc_info())
  File "/home/dottorandi/anaconda3/envs/camisim-env/lib/python3.6/shutil.py", line 405, in _rmtree_safe_fd
    names = os.listdir(topfd)
OSError: [Errno 24] Too many open files: '/mnt/d/tmp/tmpr8xcds56' 

I searched previous issues for Errno 24 but found none. The only other issue referring to line 99 is this one (https://github.com/CAMI-challenge/CAMISIM/issues/97), but in my case the temporary folder does exist and, in fact, I'm able to simulate the first couple of samples.

My config file is this one:

[Main]
seed=6327141179
phase=0
max_processors=7
dataset_id=RL
output_directory=/mnt/d/out_communityCreation_5Gbp
#output_directory=out_communityCreation_5Gbp
temp_directory=/mnt/d/tmp
#temp_directory=/tmp
gsa=True
pooled_gsa=True
anonymous=False
compress=0

[ReadSimulator]
readsim=tools/art_illumina-2.3.6/art_illumina
error_profiles=tools/art_illumina-2.3.6/profiles
samtools=tools/samtools-1.3/samtools
profile=mbarc
size=5
type=art
fragments_size_mean=270
fragment_size_standard_deviation=27

[CommunityDesign]
#distribution_file_paths=out/abundance0.tsv,out/abundance1.tsv,out/abundance2.tsv,out/abundance3.tsv,out/abundance4.tsv,out/abundance5.tsv,out/abundance6.tsv,out/abundance7.tsv,out/abundance8.tsv,out/abundance9.tsv
ncbi_taxdump=tools/ncbi-taxonomy_20170222.tar.gz
strain_simulation_template=scripts/StrainSimulationWrapper/sgEvolver/simulation_dir
number_of_samples=5

[community0]
metadata=communityCreation_details/metadata_communityCreation_euk.tsv
id_to_genome_file=communityCreation_details/genome_to_id_communityCreation_euk.tsv
id_to_gff_file=
genomes_total=33
genomes_real=33
max_strains_per_otu=33
ratio=1
mode=differential
log_mu=1
log_sigma=2
gauss_mu=1
gauss_sigma=1
view=False

[community1]
metadata=communityCreation_details/metadata_communityCreation_prok.tsv
id_to_genome_file=communityCreation_details/genome_to_id_communityCreation_prok.tsv
id_to_gff_file=
genomes_total=216
genomes_real=216
max_strains_per_otu=216
ratio=20
mode=differential
log_mu=1
log_sigma=2
gauss_mu=1
gauss_sigma=1
view=False

mgabriell1 avatar Feb 28 '22 09:02 mgabriell1

Interesting, that has never happened to me, even when creating the rather large CAMI datasets with >500 genomes. Unfortunately, "line 99" only refers to the line in the main script, i.e. an error in line 99 always means "An error occurred during read simulation". The traceback suggests that CAMISIM failed to copy the data from the temporary directory into the final output directory (or to delete the temporary directory, for that matter). I will have a look at that. It might be a problem with the server/machine you are running CAMISIM on, but it also points to some deeper flaws within CAMISIM. We are currently in the process of porting CAMISIM to a Nextflow workflow, which should hopefully eliminate a lot of these file handling errors, but there is no clear ETA for that version.

AlphaSquad avatar Feb 28 '22 11:02 AlphaSquad
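
For readers hitting the same message: `[Errno 24] Too many open files` is the operating system's EMFILE error, raised when a process reaches its limit on open file descriptors. The following is a minimal Python sketch for inspecting that limit. It is not part of CAMISIM, the helper names are hypothetical, and it assumes a Unix-like system (the `resource` module is unavailable on Windows, and `/proc/self/fd` is Linux-specific). Whether raising the limit helps in this case is not established in this thread.

```python
import os
import resource


def nofile_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def count_open_fds():
    """Return the number of descriptors currently open, or None if /proc is unavailable."""
    fd_dir = "/proc/self/fd"  # Linux-specific; assumption of this sketch
    if not os.path.isdir(fd_dir):
        return None
    return len(os.listdir(fd_dir))


def raise_soft_nofile_limit(target):
    """Raise the soft limit toward `target`, never above the hard limit.

    Returns the (soft, hard) pair in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    # Only ever raise the limit; leave it alone if it is already unlimited.
    if soft != resource.RLIM_INFINITY and target > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    print("limits (soft, hard):", nofile_limits())
    print("currently open:", count_open_fds())
```

A changed soft limit only applies to the current process and its children, so a call like `raise_soft_nofile_limit(4096)` would have to run in the process that launches the pipeline. `ulimit -n` in the launching shell reports the same soft limit.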

Could you run CAMISIM using the -debug flag, so I can have a more detailed look where the error comes from?

AlphaSquad avatar Feb 28 '22 12:02 AlphaSquad

Sure! I just started it. I'll post the result when it's done (I guess it could take a couple of days)

mgabriell1 avatar Feb 28 '22 13:02 mgabriell1

This is the debug log file, including only the last ART simulation and the errors that follow: CAMISIM-last-read-error-log.txt. If you need it I can also share the full log, but it's about 130 MB, so I figured I'd start with this. Hopefully it helps!

mgabriell1 avatar Mar 02 '22 08:03 mgabriell1

Thank you, yes, that indeed helped a lot. It seems like this issue and #129 are linked after all. I implemented the solution proposed there, and at least on my end it is working as intended. Could you re-run the latest version (again with the -debug option) to see if the error you described in the other thread recurs? If it does: as the error is connected to the bam files, there is probably already some error during read simulation, so either you will need to send me the complete log (I could provide a nextcloud instance you can upload it to) or skim it yourself for errors occurring during read simulation.

AlphaSquad avatar Mar 02 '22 13:03 AlphaSquad

Yes, by doing what was suggested there I was able to simulate the samples and get the gsa (both pooled and per sample). Still, there were some issues with the bam files (https://github.com/CAMI-challenge/CAMISIM/issues/129#issuecomment-1054083405). I launched it again with the -debug option and will try to skim the log for any errors or warnings.

mgabriell1 avatar Mar 02 '22 14:03 mgabriell1

Ok, this is the error message in debug mode:

2022-03-02 19:07:26 INFO: [MetadataReader 14374009020] Reading file: '/mnt/d/tmp/tmpxrzu2e53/read_start_positionsakm7x2vy'
2022-03-02 19:08:04 ERROR: [MetadataReader 14374009020] Format error. Bad number of values in line 16976382
2022-03-02 19:08:05 DEBUG: [MetagenomeSimulationPipeline] 
Traceback (most recent call last):
  File "metagenomesimulation.py", line 122, in run_pipeline
    self._create_binning_gs(list_of_output_gsa)
  File "metagenomesimulation.py", line 523, in _create_binning_gs
    dict_original_seq_pos = gff.get_dict_sequence_name_to_positions(list_file_paths_read_positions)
  File "/mnt/c/CAMISIM/scripts/GoldStandardFileFormat/goldstandardfileformat.py", line 186, in get_dict_sequence_name_to_positions
    table.read(sam_position_file, separator=self._separator)
  File "/mnt/c/CAMISIM/scripts/MetaDataTable/metadatatable.py", line 212, in read
    raise ValueError(msg)
ValueError: Format error. Bad number of values in line 16976382


2022-03-02 19:08:05 ERROR: [MetagenomeSimulationPipeline] Format error. Bad number of values in line 16976382 in line 122

I inspected the read_start_positionsakm7x2vy file and noticed that at the mentioned line vim shows a very long series of null-byte characters (^@). What's peculiar is that they are not at the beginning or the end of the line: the line shows the first character of the read name, then a sequence of ^@, and ends with the position of that read. Other lines also seem to contain null bytes, but this is the only one that does not preserve the structure of the file:

$ perl -ne '/\000/ and print;' read_start_positionsakm7x2vy
prok_GCA_000075.1_KB913025.1-265204     1662654
e88946
euk_GCF_00040948485.1_NW_007361000.1-4617       151177

This might be unrelated, but I noticed that one of the contig headers in gsa_pooled.fasta ended with a series of null bytes (I found it because a script complained that the header was too long, and the null bytes were the cause).

I could in theory remove the line that leads to the error, but I'm not sure how I could restart the script to perform only these last steps instead of restarting from the beginning (and ending up with the same result).

mgabriell1 avatar Mar 03 '22 07:03 mgabriell1
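
The check described in the comment above (null bytes plus a wrong number of values per line) can also be sketched in Python, which makes it easy to report line numbers alongside the damage. This is an illustration only: `find_damaged_lines` is a hypothetical helper, not a CAMISIM function, and the assumption that the file holds two tab-separated columns (read name and position) is inferred from the output shown above rather than confirmed.

```python
def find_damaged_lines(path, expected_columns=2, separator=b"\t"):
    """Yield (line_number, has_nul, column_count) for every suspicious line.

    A line is reported if it contains a null byte or if its number of
    separator-delimited fields differs from `expected_columns`.
    The file is read in binary mode so null bytes pass through untouched.
    """
    with open(path, "rb") as handle:
        for line_number, raw in enumerate(handle, start=1):
            line = raw.rstrip(b"\r\n")
            has_nul = b"\x00" in line
            column_count = len(line.split(separator))
            if has_nul or column_count != expected_columns:
                yield line_number, has_nul, column_count


if __name__ == "__main__":
    import sys

    # Usage: python find_damaged_lines.py <read_start_positions file>
    for number, has_nul, columns in find_damaged_lines(sys.argv[1]):
        print(f"line {number}: null bytes={has_nul}, columns={columns}")
```

Reading line by line keeps memory use flat even for files with tens of millions of lines, such as the one in the error message.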

Unfortunately, the mode for continuing from a certain point in the simulation has been nonfunctional for a while, so you would need to restart CAMISIM. There have been some issues with CAMISIM when the sequence headers of the reference genomes contain special characters (particularly - or _). It is strange, though, that this error occurs "so late" in the file.

AlphaSquad avatar Mar 03 '22 12:03 AlphaSquad

Ok, the headers in my reference genomes all contain _, so that might be the cause.

mgabriell1 avatar Mar 03 '22 13:03 mgabriell1

Okay, you could try to remove these and see if the problem persists - sorry for the inconvenience

AlphaSquad avatar Mar 03 '22 15:03 AlphaSquad
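
One way to try the suggestion above is to rewrite the reference FASTA files so that sequence IDs no longer contain `-` or `_`. The sketch below is an assumption-laden illustration, not a CAMISIM utility: `clean_id` and `sanitize_fasta` are hypothetical names, the sequence ID is taken to be the first whitespace-delimited token of each header, and the thread does not confirm that this resolves the error.

```python
import re


def clean_id(seq_id):
    """Remove '-' and '_' from a sequence ID."""
    return re.sub(r"[-_]", "", seq_id)


def sanitize_fasta(in_path, out_path):
    """Copy a FASTA file, cleaning the ID in each header line.

    Sequence lines are copied unchanged. Returns a dict mapping each new ID
    to its original ID, and raises ValueError if two headers would end up
    with the same cleaned ID.
    """
    mapping = {}
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if not line.startswith(">"):
                dst.write(line)
                continue
            parts = line[1:].strip().split(None, 1)
            if not parts:
                raise ValueError("empty FASTA header")
            old_id = parts[0]
            new_id = clean_id(old_id)
            if new_id in mapping:
                raise ValueError(f"ID collision after cleaning: {new_id}")
            mapping[new_id] = old_id
            description = " " + parts[1] if len(parts) > 1 else ""
            dst.write(">" + new_id + description + "\n")
    return mapping
```

If other files refer to the original sequence IDs, the returned mapping can be used to translate them consistently.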