sequenceTubeMap icon indicating copy to clipboard operation
sequenceTubeMap copied to clipboard

BED files should be interpreted as 0-based end-exclusive

Open WallBPG opened this issue 4 months ago • 4 comments

Bed files are 0-based, half-open intervals [start, end), but are visualized as closed [start, end] when using custom BED regions.

See BED format: here.

WallBPG avatar Aug 14 '25 16:08 WallBPG

I don't think that we do anything different for BED intervals vs. for intervals typed in the box.

I think this might be that the region indicators have a 3 pixel offset for some reason, so the little yellow circles indicating the requested region are hard to relate to the actual bases.

If I make them a bit smaller and center them, I get this for looking at the ref:0-2 region in the included cactus.vg.

Image

Here we have the left edge of the interval before the base at index 0, and the right edge of the interval before the base at index 2. Those are the right places, right?

Or are you referring to something about the visualization other than the indicators that show the location of the requested region? Is there a way I can replicate it?

I always see it rendering some nodes past the region you (or the BED) actually requests, because one haplotype or another we are tracing will be shorter and will pull in a few more nodes within the distance we trace it. Which is why we have the yellow circle indicators, because we don't really cut the graph to exactly the requested reference path region.

adamnovak avatar Aug 14 '25 18:08 adamnovak

This explains it then; yes, regions in BED files are half open by definition, and the text box expects closed intervals. To visualize BED files properly, sequenceTubeMap needs to subtract 1 from the end column when rendering what goes into the text box.

A workaround is simply to modify the BED files to be closed, but that goes against the file standard...

WallBPG avatar Aug 14 '25 19:08 WallBPG

It looks like when we are doing external chunk directory preparation, we take the requested region and pass it straight to vg chunk: https://github.com/vgteam/sequenceTubeMap/blob/ad62f98f4a404609f127f5ec4a8176fe4efe0bc8/scripts/prepare_chunks.sh#L70

And we also put its coordinates into the BED: https://github.com/vgteam/sequenceTubeMap/blob/ad62f98f4a404609f127f5ec4a8176fe4efe0bc8/scripts/prepare_chunks.sh#L130-L132

But that's wrong, because vg chunk apparently takes "0-based inclusive" regions.

When we pull out a region on the server with vg chunk, we pass whatever's in the box to vg chunk, so vg chunk interprets it as a 0-based end-inclusive region. Then the region marker drawing code draws a 0-based end-exclusive pair of markers. So the markers don't actually match what was asked for, at the end position.

And then the Lancet example graphs started with 1-based, end-inclusive coordinates in the Lancet output data, and we didn't adjust them at all: https://github.com/vgteam/sequenceTubeMap/blob/ad62f98f4a404609f127f5ec4a8176fe4efe0bc8/scripts/prepare_lancet_output.sh#L125

So they went into the Lancet "BED" files with the 1-based values. So when rendered, the coordinates are 1-based, not 0-based like they would be if we were interacting with an actual full-genome graph, and also we draw the region indicators end-exclusive when the actual graph displayed ends sharply at the end-inclusive position. So all the Lancet examples look like the last base isn't really in the range.

I'm not 100% sure what we really want to do here. If we want the coordinates in the text box to be 1-based, end-inclusive (like the UCSC genome browser), then they need to be converted to 0-based, end-inclusive before being passed to vg chunk, and then you can't type in the same coordinates as you would use with the vg command line tools. I think the Lancet examples really wanted to be shown using 1-based coordinates in the box, because they correspond to regions in genome assemblies that are usually described with 1-based coordinates.

If we want the coordinates in the box to be 0-based, end-inclusive, then we might want to stop using things like ref:1-100 as example, because that suggests they're 1-based. Then we wouldn't have to convert them for vg chunk, and we'd use the same coordinate system vg uses on the command line, but we'd be using a different coordinate system than the UCSC genome browser UI.

We could also make the coordinates in the box 0-based, end-exclusive. That's what they are now when a BED is in use, if the region indicator we draw and the BED parsing is taken as accurate. Then we'd need to convert them when passing them to vg chunk.

This is related to #449.

adamnovak avatar Aug 14 '25 20:08 adamnovak

I think I have a simple fix here; #489 just subtract 1 from the END column when reading BED files?

But, I'm a novice at js, and I haven't tested it yet...

I think it's important to stay consistent with UCSC, but document clearly.

WallBPG avatar Aug 14 '25 20:08 WallBPG