
Is it possible to use medaka in offline mode?

Open malvaradol opened this issue 4 months ago • 11 comments

Hi!

I'm currently trying to run medaka on an HPC server with an LSF grid. The compute nodes don't have internet access, so when I run the program it fails because the model has not been downloaded and cannot be downloaded. I downloaded the model file directly from the repository, but I'm not sure which path from the downloaded file I should give the program to get it running. The model I'm trying to use is r941_min_sup_g507, and the file I downloaded was https://github.com/nanoporetech/medaka/blob/master/medaka/data/r941_min_sup_g507_model.tar.gz.

Any help on how to get medaka running in offline mode will be appreciated.

malvaradol avatar Mar 03 '24 19:03 malvaradol

Medaka caches the models it downloads in your home directory. So if your HPC nodes mount the same HOME directory as a computer where you do have internet access, just run medaka there first.

Failing that, it's possible to simply give the tar.gz as the model argument on the command line.

cjw85 avatar Mar 03 '24 19:03 cjw85

The first one didn't work. Just FYI, in case it helps: I installed the program through pip in a conda environment.

Regarding the second one, I did provide the tar.gz as the model argument, including the whole path, yet I still get an error. Here's the command line:

medaka_consensus -i ON_reads -d flye_assembly.fasta -o output_medaka -t 64 -m /model/r941_min_sup_g507_model.tar.gz

/model/r941_min_sup_g507_model.tar.gz is at the same level as the lsf file that contains the command above.

malvaradol avatar Mar 04 '24 21:03 malvaradol

Could you please show the error you get when running the above command?

cjw85 avatar Mar 04 '24 21:03 cjw85

Here's the output for the command:

Cannot import pyabpoa, some features may not be available.
Cannot import pyabpoa, some features may not be available.
Cannot import pyabpoa, some features may not be available.
Failed to interpret '/model/r941_min_sup_g507_model.tar.gz' as a basecaller model.
Traceback (most recent call last):
  File "/hpc/users/hernad36/.conda/envs/triatomine_genomics_v2/lib/python3.9/site-packages/medaka/medaka.py", line 36, in __call__
    model_fp = medaka.models.resolve_model(val)
  File "/hpc/users/hernad36/.conda/envs/triatomine_genomics_v2/lib/python3.9/site-packages/medaka/models.py", line 46, in resolve_model
    raise ValueError(
ValueError: Model /model/r941_min_sup_g507_model.tar.gz is not a known model or existant file.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/hpc/users/hernad36/.conda/envs/triatomine_genomics_v2/bin/medaka", line 8, in <module>
    sys.exit(main())
  File "/hpc/users/hernad36/.conda/envs/triatomine_genomics_v2/lib/python3.9/site-packages/medaka/medaka.py", line 801, in main
    args = parser.parse_args()
  File "/hpc/users/hernad36/.conda/envs/triatomine_genomics_v2/lib/python3.9/argparse.py", line 1825, in parse_args
    args, argv = self.parse_known_args(args, namespace)
  File "/hpc/users/hernad36/.conda/envs/triatomine_genomics_v2/lib/python3.9/argparse.py", line 1858, in parse_known_args
    namespace, args = self._parse_known_args(args, namespace)
  File "/hpc/users/hernad36/.conda/envs/triatomine_genomics_v2/lib/python3.9/argparse.py", line 2049, in _parse_known_args
    positionals_end_index = consume_positionals(start_index)
  File "/hpc/users/hernad36/.conda/envs/triatomine_genomics_v2/lib/python3.9/argparse.py", line 2026, in consume_positionals
    take_action(action, args)
  File "/hpc/users/hernad36/.conda/envs/triatomine_genomics_v2/lib/python3.9/argparse.py", line 1935, in take_action
    action(self, namespace, argument_values, option_string)
  File "/hpc/users/hernad36/.conda/envs/triatomine_genomics_v2/lib/python3.9/argparse.py", line 1214, in __call__
    subnamespace, arg_strings = parser.parse_known_args(arg_strings, None)
  File "/hpc/users/hernad36/.conda/envs/triatomine_genomics_v2/lib/python3.9/argparse.py", line 1858, in parse_known_args
    namespace, args = self._parse_known_args(args, namespace)
  File "/hpc/users/hernad36/.conda/envs/triatomine_genomics_v2/lib/python3.9/argparse.py", line 2049, in _parse_known_args
    positionals_end_index = consume_positionals(start_index)
  File "/hpc/users/hernad36/.conda/envs/triatomine_genomics_v2/lib/python3.9/argparse.py", line 2026, in consume_positionals
    take_action(action, args)
  File "/hpc/users/hernad36/.conda/envs/triatomine_genomics_v2/lib/python3.9/argparse.py", line 1935, in take_action
    action(self, namespace, argument_values, option_string)
  File "/hpc/users/hernad36/.conda/envs/triatomine_genomics_v2/lib/python3.9/argparse.py", line 1214, in __call__
    subnamespace, arg_strings = parser.parse_known_args(arg_strings, None)
  File "/hpc/users/hernad36/.conda/envs/triatomine_genomics_v2/lib/python3.9/argparse.py", line 1858, in parse_known_args
    namespace, args = self._parse_known_args(args, namespace)
  File "/hpc/users/hernad36/.conda/envs/triatomine_genomics_v2/lib/python3.9/argparse.py", line 2067, in _parse_known_args
    start_index = consume_optional(start_index)
  File "/hpc/users/hernad36/.conda/envs/triatomine_genomics_v2/lib/python3.9/argparse.py", line 2007, in consume_optional
    take_action(action, args, option_string)
  File "/hpc/users/hernad36/.conda/envs/triatomine_genomics_v2/lib/python3.9/argparse.py", line 1935, in take_action
    action(self, namespace, argument_values, option_string)
  File "/hpc/users/hernad36/.conda/envs/triatomine_genomics_v2/lib/python3.9/site-packages/medaka/medaka.py", line 39, in __call__
    raise RuntimeError(msg.format(self.dest, str(e)))
RuntimeError: Error validating model from '--model' argument: Model /model/r941_min_sup_g507_model.tar.gz is not a known model or existant file..

malvaradol avatar Mar 05 '24 17:03 malvaradol

Are you entirely sure that /model/r941_min_sup_g507_model.tar.gz is the path where you have saved the model file, that it is readable by your user, and is not a broken symbolic link? The error:

ValueError: Model /model/r941_min_sup_g507_model.tar.gz is not a known model or existant file.

suggests at least one of these is not true.
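
A minimal sketch of how those three conditions could be checked from Python before submitting the job. The function name diagnose_model_path and the example path are hypothetical and not part of medaka; only the standard library is used. Printing the resolved path also shows whether a path is being interpreted relative to the working directory or from the filesystem root.

```python
import os
from pathlib import Path


def diagnose_model_path(path_str):
    """Return a list of problems found with a model path (empty if none).

    Hypothetical helper for illustration; medaka does not provide it.
    """
    path = Path(path_str)
    problems = []
    if path.is_symlink() and not path.exists():
        # The link itself is present but its target is missing
        problems.append("broken symbolic link")
    elif not path.exists():
        problems.append("path does not exist")
    elif not path.is_file():
        problems.append("not a regular file")
    elif not os.access(path, os.R_OK):
        problems.append("not readable by the current user")
    return problems


if __name__ == "__main__":
    # Example path only; substitute the location of your own model file
    candidate = "model/r941_min_sup_g507_model.tar.gz"
    print("resolved to:", Path(candidate).resolve())
    print(diagnose_model_path(candidate) or "looks fine")
```

Running this inside the same LSF job script as medaka checks the path as the compute node sees it, which may differ from the login node.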

cjw85 avatar Mar 07 '24 14:03 cjw85

I was finally able to get it running, and you were correct: my mistake was providing a relative path rather than the absolute one. Fixing that did the trick. Now I'd like to take advantage of this issue to ask for help with a new error I got after the program ran for a couple of hours. Here are the final lines of the error output:

File "/sc/arion/projects/MML/conda/envs/polishing_tools/bin/medaka", line 11, in <module>
    sys.exit(main())
  File "/sc/arion/projects/MML/conda/envs/polishing_tools/lib/python3.10/site-packages/medaka/medaka.py", line 814, in main
    args.func(args)
  File "/sc/arion/projects/MML/conda/envs/polishing_tools/lib/python3.10/site-packages/medaka/prediction.py", line 188, in predict
    model = model_store.load_model(time_steps=None)
  File "/sc/arion/projects/MML/conda/envs/polishing_tools/lib/python3.10/site-packages/medaka/datastore.py", line 199, in load_model
    self.model.load_weights(weights).expect_partial()
  File "/sc/arion/projects/MML/conda/envs/polishing_tools/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/sc/arion/projects/MML/conda/envs/polishing_tools/lib/python3.10/site-packages/tensorflow/python/training/py_checkpoint_reader.py", line 31, in error_translator
    raise errors_impl.NotFoundError(None, None, error_message)
tensorflow.python.framework.errors_impl.NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for /tmp/tmpqde07q4d/model/variables/variables

I spent some time looking on blogs but could not find anything useful.

Thanks for your help with all this stuff :)

malvaradol avatar Mar 08 '24 17:03 malvaradol

It seems the model is not being unpacked correctly at runtime, or is not a valid model tar.gz.

Can you try untarring the model file you have outside of medaka and report the contents?
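
For anyone who prefers to do this check from Python, here is a minimal sketch using the standard tarfile module. The function name list_model_archive is hypothetical. Reading each member's data means a truncated download should fail here rather than later inside medaka.

```python
import tarfile


def list_model_archive(archive_path):
    """Return the member names of a gzipped tar archive.

    Each regular file is read in full, so an archive that was cut short
    during download raises an exception instead of listing cleanly.
    """
    names = []
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive:
            if member.isfile():
                # Force decompression of this member's data
                archive.extractfile(member).read()
            names.append(member.name)
    return names


if __name__ == "__main__":
    # Example path only; point this at your own downloaded file
    for name in list_model_archive("r941_min_sup_g507_model.tar.gz"):
        print(name)
```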

cjw85 avatar Mar 11 '24 09:03 cjw85

So this is what I got when decompressing the file:

tar -xvzf r941_min_sup_g507_model.tar.gz
model/
model/variables/
model/variables/variables.data-00001-of-00002
model/variables/variables.index
model/variables/variables.data-00000-of-00002
model/meta.pkl
model/assets/
model/saved_model.pb

malvaradol avatar Mar 11 '24 21:03 malvaradol

That seems correct. I asked because:

tensorflow.python.framework.errors_impl.NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for /tmp/tmpqde07q4d/model/variables/variables

suggests something was corrupt about the tar.gz. At this point I'm at a loss as to what has happened. The process here is that medaka sees that you have provided a tar.gz file and unpacks it to a temporary location on your system for tensorflow to read. That location is determined by Python, not by code in medaka.

I would talk to your HPC admins and ask if they know why files in /tmp appear to not be readable.
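
In the meantime, a rough way to test the temporary directory on a compute node without involving medaka or tensorflow. This is only a sketch: check_extraction_in_tmp is a hypothetical name, and it assumes the archive layout is model/variables/variables.* as in the listing above. It unpacks the archive into a directory created by Python's tempfile module and then looks for files matching the same prefix that appears in the tensorflow error.

```python
import glob
import os
import tarfile
import tempfile


def check_extraction_in_tmp(archive_path):
    """Unpack the archive into a fresh temp dir and report what is visible.

    Returns the temp root Python chose and the names of the files that
    match the model/variables/variables prefix after extraction.
    """
    with tempfile.TemporaryDirectory() as workdir:
        with tarfile.open(archive_path, mode="r:gz") as archive:
            archive.extractall(workdir)
        prefix = os.path.join(workdir, "model", "variables", "variables")
        found = glob.glob(glob.escape(prefix) + "*")
        matches = sorted(os.path.basename(p) for p in found)
    return tempfile.gettempdir(), matches


if __name__ == "__main__":
    # Example path only; use the absolute path to your model file
    root, files = check_extraction_in_tmp("r941_min_sup_g507_model.tar.gz")
    print("temp root:", root)
    print("variables files:", files or "none found")
```

If the list comes back empty on a compute node but not on the login node, that points at the temporary directory rather than the archive. Python's tempfile module consults the TMPDIR, TEMP and TMP environment variables, so exporting TMPDIR to a writable scratch directory in the job script may be worth trying; whether that resolves the medaka error is untested here.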

cjw85 avatar Mar 12 '24 10:03 cjw85

So far the only thing that has worked is to run medaka on a login node, but of course I can't just run the whole job in that node. My question is, if I run medaka in the login node, cancel the job and then re-send it again on a computing node, is the model stored somewhere so that it will run normally? If so, how long should I run medaka in the login node before reaching the stage where the model is saved on the system?

Just some ideas I guess, HPC admins take forever to reach back...

malvaradol avatar Mar 14 '24 16:03 malvaradol

My question is, if I run medaka in the login node, cancel the job and then re-send it again on a computing node, is the model stored somewhere so that it will run normally?

The model, when downloaded, is always stored as a tar.gz. It isn't cached as the expanded archive; the untarring always happens at runtime. That is to say, even if you cache the model by running the program once (as suggested in this comment), the effect is no different from what you are doing by providing the tar on the command line.

I'd really like to understand why the unpacking of the tar into /tmp is apparently going awry.

cjw85 avatar Mar 14 '24 16:03 cjw85