pybedtools

make file_type a settable attribute

Open daler opened this issue 9 years ago • 4 comments

Imagine we have two bed files, and we do this:

z = x.intersect(y, wao=True)

The resulting file could look like this, and it is incorrectly guessed to be SAM format:

chr1  0      11447  0  none  0  0  0  chr1  0      11447  0  11447
chr1  11447  11502  1  1     1  0  0  chr1  11447  11502  2  55
chr1  11502  11675  0  none  0  0  0  chr1  11502  11675  0  173
chr1  31291  31431  0  none  0  0  0  .     -1     -1     .  0
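
To make the failure mode concrete, here is a toy guesser that decides on column count and integer-looking columns alone. It is only an illustration: `guess_format`, `looks_like_int` and their rules are hypothetical and are not pybedtools' actual detection code.

```python
def looks_like_int(value):
    """Return True if value is a (possibly negative) integer string."""
    return value.lstrip("-").isdigit()


def guess_format(fields):
    """Hypothetical format guesser; NOT pybedtools' real heuristic.

    SAM has at least 11 columns with integer FLAG (col 2), POS (col 4)
    and MAPQ (col 5), so a cheap check looks only at those columns.
    """
    if (
        len(fields) >= 11
        and looks_like_int(fields[1])
        and looks_like_int(fields[3])
        and looks_like_int(fields[4])
    ):
        return "sam"
    if len(fields) >= 3 and looks_like_int(fields[1]) and looks_like_int(fields[2]):
        return "bed"
    return "unknown"


# Second row of the -wao output above: 13 columns, integer name and score.
row = "chr1 11447 11502 1 1 1 0 0 chr1 11447 11502 2 55".split()
guessed = guess_format(row)

# The same interval before the intersection has only 8 columns.
guessed_parent = guess_format(row[:8])
```

Under these assumed rules, the 8-column parent line is guessed as BED. The 13-column joined line is guessed as SAM because its name and score columns happen to be integers.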

When we try to iterate over it, we get an OverflowError. Currently the fix is to make the name field a non-integer before doing the intersection:

def fix(f):
    f.name = f.name + '.'
    return f

z = x.each(fix).intersect(y, wao=True)

While we could get more fancy with detecting SAM, I don't want to go the route of checking against a regex for every field in every line of a file for pathological cases like this. Instead, it would be useful to set the filetype on the BedTool object and use that to short-circuit the create_interval_from_fields heuristics. So then you could do this:

z = x.intersect(y, wao=True)
z.file_type = 'bed'
print(z)  # no longer raises OverflowError
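
A minimal sketch of the short-circuit idea, using a toy class rather than the real BedTool. The class name `Records`, its `_guess` method and the guessing rule are assumptions made for illustration. Only the `file_type` attribute name comes from the proposal above.

```python
class Records:
    """Toy stand-in for a BedTool-like object; not the pybedtools class."""

    def __init__(self, lines, file_type=None):
        self.lines = list(lines)
        self.file_type = file_type  # None means "guess for each line"

    def _guess(self, fields):
        # Hypothetical heuristic with the same blind spot as in this issue.
        if len(fields) >= 11 and fields[3].isdigit() and fields[4].isdigit():
            return "sam"
        return "bed"

    def __iter__(self):
        for line in self.lines:
            fields = line.split()
            # An explicitly set file_type skips the guess entirely.
            yield (self.file_type or self._guess(fields)), fields


z = Records(["chr1 11447 11502 1 1 1 0 0 chr1 11447 11502 2 55"])
before = [kind for kind, _ in z]  # guessed per line
z.file_type = "bed"
after = [kind for kind, _ in z]  # taken from the attribute
```

The point of the design is that the per-line heuristic stays cheap and unchanged. The attribute is only consulted first, so pathological files cost the user one assignment instead of a regex pass over every field.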

daler, Apr 27 '16 00:04

Why not take a Biopython SeqIO approach (http://biopython.org/wiki/SeqIO), where the BedTool constructor has an optional file_type arg and the type is just inherited down through modifications?

Something like:

pybedtools.BedTool(fn, file_type="bed")

Might take slightly more engineering for a very rare edge case though.

Gabriel Pratt Bioinformatics Graduate Student, Yeo Lab University of California San Diego

gpratt, Apr 27 '16 01:04

I like that idea, and should probably add it. For it to work for this example, the results from each BedTool method would inherit the file type of self. Is that always true though? Actually now that I think about it, it's not -- take for example intersect(bed=True), where the input is BAM but the output is BED. To handle cases like that, you'd have to keep track of the kwargs and which ones result in which kinds of files. I'd be worried about that getting out of date with actual BEDTools commands.

But the case where you have that problematic file already on disk and then want to make a BedTool out of it, that's where the constructor would come in handy.
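
One way to picture the trade-off, again with a toy class: the constructor accepts a type for files already on disk, and a derived result keeps that type only when the caller states that the operation preserves the format. `Records`, `transform` and `inherit_type` are hypothetical names, not part of pybedtools.

```python
class Records:
    """Toy stand-in for BedTool; names and behavior are hypothetical."""

    def __init__(self, lines, file_type=None):
        self.lines = list(lines)
        self.file_type = file_type  # None means "fall back to guessing"

    def transform(self, func, inherit_type=False):
        """Return a new Records built from func(line) for each line.

        The result inherits file_type only when the caller says the
        operation preserves the format; otherwise it stays unset.
        """
        result_type = self.file_type if inherit_type else None
        return Records((func(line) for line in self.lines), file_type=result_type)


# A problematic file already on disk: declare its type up front.
on_disk = Records(["chr1\t0\t10\t7"], file_type="bed")

# Format-preserving operation: the declared type carries over.
same_format = on_disk.transform(str.upper, inherit_type=True)

# Operation that may change the format: the type is left unset.
changed_format = on_disk.transform(lambda line: line + "\textra")
```

Leaving the type unset by default avoids maintaining a table of which kwargs produce which output format, at the cost of the user having to set it on results that need it.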

daler, Apr 27 '16 01:04

Hmm... I was about to agree with you that chasing down all the places where you'd need to manually define changes in type is going to be a huge pain.

But I actually just ran into a bug that would break your proposed solution. I'm running intersectBed with -wo, which, long story short, outputs the following BED line, and pybedtools thinks it is a SAM file:

chr1 876577 876589 0 -0.963772824612645 + chr1 876524 876686 ENSG00000187634.6 0 + 12

Both of the parent BED files are fine, but the child breaks on implicit construction. Can you define the type of the object after creation, but before running through the create_interval_from_list function?

gpratt, Apr 28 '16 01:04

Yep, that's what the original proposal was: you could set the file_type after creation but before create_interval_from_list.

daler, Apr 28 '16 02:04