Bonito Convert

Open addyblanch opened this issue 5 years ago • 16 comments

I've been through the Taiyaki pipeline to create an HDF5 file, which I plan to convert into a Bonito model. I seem to have hit a snag. Any suggestions on what the issue is?

$ bonito convert --chunks 1000000 HQ_mapped.hdf5 model/
Traceback (most recent call last):
  File "/bonenv/bin/bonito", line 33, in <module>
    sys.exit(load_entry_point('ont-bonito==0.3.5', 'console_scripts', 'bonito')())
  File "/bonenv/lib/python3.6/site-packages/bonito/__init__.py", line 39, in main
    args.func(args)
  File "/bonenv/lib/python3.6/site-packages/bonito/cli/convert.py", line 109, in main
    reads = h5py.File(args.chunkify_file, 'r')['Reads']
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/bonenv/lib/python3.6/site-packages/h5py/_hl/group.py", line 288, in __getitem__
    oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5o.pyx", line 190, in h5py.h5o.open
KeyError: "Unable to open object (object 'Reads' doesn't exist)"

addyblanch, Mar 02 '21 11:03

Hey @addyblanch

The converter expects a top-level key Reads. Can you check your file in Python with:

>>> import h5py
>>> f = h5py.File('HQ_mapped.hdf5', 'r')
>>> list(f.keys())
['Reads']

iiSeymour, Mar 04 '21 09:03

Hi @iiSeymour, did you mean print the output from those commands?

This is what I get:

>>> import h5py
>>> f = h5py.File('HQ_mapped.hdf5', 'r')
>>> list(f.keys())
['Batches', 'read_ids']

addyblanch, Mar 04 '21 12:03

Was HQ_mapped.hdf5 output by prepare_mapped_reads.py? If so, you should have ended up with an HDF5 file like this.

iiSeymour, Mar 04 '21 12:03

Yes, it was, but I did the alignment step with minimap2 rather than guppy_aligner. Would that cause this issue?

addyblanch, Mar 04 '21 12:03

Which version of Taiyaki are you using @addyblanch?

iiSeymour, Mar 12 '21 13:03

Based on the changelog in the directory, v5.3.0?

addyblanch, Mar 12 '21 13:03

I'm also having this issue with v5.3.0 and have the same top level keys: ['Batches', 'read_ids']

jackwadden, Mar 24 '21 19:03

> I'm also having this issue with v5.3.0 and have the same top level keys: ['Batches', 'read_ids']

Hopefully they are working on a solution.

addyblanch, Mar 25 '21 10:03

@addyblanch I was able to downgrade to Taiyaki 5.0.0 and it worked. The issue seems to stem from this Taiyaki change in 5.2 linked to by @iiSeymour:

> The batched variant of the HDF5 mapped signal format was introduced in version 5.2. This variant replaces the Reads group with a Batches group. Each group within the Batches group contain the same set of attributes and datasets listed in the table above, but these values for a set of reads are concatenated together into one dataset per batch.
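
For anyone wanting to check which variant a file is before running bonito convert, here is a minimal sketch. It is not part of Taiyaki or Bonito; the function names detect_layout and layout_of_file are made up for illustration, and the only things it relies on are the top-level keys reported in this thread (Reads for the older layout, Batches for the batched one).

```python
def detect_layout(top_level_keys):
    """Classify a mapped-signal file from its top-level HDF5 keys.

    Hypothetical helper: the labels returned here are illustrative only.
    """
    keys = set(top_level_keys)
    if "Reads" in keys:
        return "per-read"  # the layout bonito convert opens via f['Reads']
    if "Batches" in keys:
        return "batched"   # the variant introduced in Taiyaki 5.2
    return "unknown"


def layout_of_file(path):
    """Open an HDF5 file read-only and classify its layout."""
    import h5py  # imported lazily so detect_layout works without h5py

    with h5py.File(path, "r") as f:
        return detect_layout(f.keys())
```

Calling layout_of_file('HQ_mapped.hdf5') on the file from this issue should report the batched layout, given the keys ['Batches', 'read_ids'] shown above.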

I might take a stab at fixing it this weekend and will send a fork along if I manage to get it working before @iiSeymour does.

jackwadden, Mar 25 '21 13:03

Thanks @jackwadden for the heads up, that would be a great short-term workaround.

addyblanch, Mar 25 '21 15:03

This turned out to be a small bug in Taiyaki, and was an easy fix. I've submitted a pull request with the fix here.

jackwadden, Mar 27 '21 15:03

That's amazing, thanks @jackwadden! I've made the edit on my end and set it to rerun. Fingers crossed.

addyblanch, Mar 29 '21 08:03

I'm having issues with bonito now that were resolved by downgrading back to Taiyaki 5.0.0. The specific error was thrown by parasail. Let me know if you get a similar error. I'm back in the territory where it's most likely a problem with my code, but would be nice to know if you run into something similar.

jackwadden, Mar 29 '21 13:03

Hi @jackwadden, unfortunately no dice. Same error as before minus the last line:

KeyError: "Unable to open object (object 'Reads' doesn't exist)"

Is there a fix in the works, @iiSeymour? If not, I'll downgrade Taiyaki and try again soon.

addyblanch, Apr 01 '21 08:04

@addyblanch the fact that the 'Reads' group doesn't exist means that Taiyaki (probably) still isn't emitting the non-batched version. Are you seeing the same output from list(f.keys())? You might have to re-install Taiyaki. Maybe pop a print("changed") in main() to see if your changes are actually being adopted.
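
A complementary check is to ask Python which copy of the package it would import, since an edit made in a source checkout has no effect if a separately installed copy is the one being picked up. This is only a sketch: module_location is a made-up helper name, and it assumes the package imports under the name taiyaki.

```python
import importlib.util


def module_location(name):
    """Return the file path Python would import `name` from, or None if not found."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec is not None else None


if __name__ == "__main__":
    # Assumption: the Taiyaki package is importable as "taiyaki".
    print(module_location("taiyaki"))
```

If the printed path is not inside the directory you edited, the patched code is not the code being run.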

Another option might be to just use bonito end-to-end. I don't know what your use-case is, but you might be able to use this method to prepare reads and train a model. Just omit the --pretrained <model> option when you train.

Good luck.

jackwadden, Apr 01 '21 13:04

Hi @jackwadden, yes, same output from list(f.keys()). I will have a go with version 5.0.0 in the coming weeks.

I have tried the end-to-end bonito model training, but it didn't solve our issues (it made the assemblies slightly worse), so I was following this (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03856-0) as they seemed to have some success. I work on Streptococcus, and any genome we try and sequence seems to end up inflated in size and includes an awful number of pseudogenes (we suspect due to errors causing erroneous start and stop codons).

addyblanch, Apr 01 '21 14:04