
BAM/BED to parquet

Open darked89 opened this issue 3 years ago • 3 comments

Hello,

Would it be possible to provide a minimal example, be it in Scala/Python/CLI, of how to convert, say, a BAM to ADAM's Parquet? Same with a canonical 6-column BED.

DK

darked89 avatar Aug 10 '22 14:08 darked89

Command line

$ adam-submit transformAlignments sample.bam sample.alignments.adam
$ adam-submit transformFeatures annotation.bed annotation.features.adam

Scala

import org.bdgenomics.adam.ds.ADAMContext._

val alignments = sc.loadAlignments("sample.bam")
alignments.saveAsParquet("sample.alignments.adam")

val features = sc.loadFeatures("annotation.bed")
features.saveAsParquet("annotation.features.adam")

Python

from bdgenomics.adam.adamContext import ADAMContext
ac = ADAMContext(sc)

alignments = ac.loadAlignments("sample.bam")
alignments.saveAsParquet("sample.alignments.adam")

features = ac.loadFeatures("annotation.bed")
features.saveAsParquet("annotation.features.adam")

Hope this helps!

heuermh avatar Aug 10 '22 21:08 heuermh

Thank you very much for such a quick answer.

Bit of a follow-up: are the resulting .adam files in a Parquet format readable by, say, Arrow?

darked89 avatar Aug 11 '22 05:08 darked89

Yes, I've never had any issues with Parquet in Apache Arrow. There was a mis-specification between the JVM Parquet and C++ Parquet implementations with regard to LZ4 compression at some point; I don't know if that is still a problem. Other compression codecs should be fine.

I did have some issues with incomplete support for Parquet via DuckDB, details here https://github.com/heuermh/bdg-formats-duckdb

As of that effort, DuckDB did not support Parquet enums or nested schemas, both features that we use in bdg-formats/ADAM.
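For readers who run into the nested-schema limitation with a flat-table consumer, one generic workaround is to flatten nested records into dotted column names before export. Here is a minimal pure-Python sketch; the example record layout is invented for illustration and is not the actual bdg-formats schema:

```python
# Generic sketch: flatten a nested record into dotted column names so a
# flat-table-only consumer can ingest it. The example record below is
# invented for illustration, not the actual bdg-formats schema.

def flatten(record: dict, prefix: str = "") -> dict:
    """Recursively flatten nested dicts into 'outer.inner' keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested records, extending the key path.
            flat.update(flatten(value, full_key))
        else:
            flat[full_key] = value
    return flat

nested = {
    "name": "exon1",
    "interval": {"referenceName": "chr1", "start": 100, "end": 200},
}
flat = flatten(nested)
# flat == {"name": "exon1", "interval.referenceName": "chr1",
#          "interval.start": 100, "interval.end": 200}
```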

heuermh avatar Aug 11 '22 22:08 heuermh

Hello,

I can confirm that so far I have no issues reading Parquet files created by ADAM using Python polars. The only slightly confusing thing was with a test RNA-Seq BAM produced by STAR (2x 150 bp reads), where somehow I got a min insert size of -911256.0. Is that a true insert size or the location offset of the second read in the pair?

As for the .bed to adam/parquet conversion, I noticed that the 6-column BED got transformed into a 26-column Parquet with obviously empty columns for values not in the input. Not a problem, just a note that Parquet files created from BED files contain such extra slots.

Well, this should let me start experimenting with ADAM after getting back from vacations.

Many thanks for your help

Darek Kedra

darked89 avatar Aug 12 '22 15:08 darked89

> As for the .bed to adam/parquet conversion, I noticed that the 6-column BED got transformed into a 26-column Parquet with obviously empty columns for values not in the input. Not a problem, just a note that Parquet files created from BED files contain such extra slots.

We use rather rich schemas for all the various genomic data types, defined in Avro at https://github.com/bigdatagenomics/bdg-formats

The Feature schema was designed to support all of the GFF2/GTF, GFF3, BED, GenBank, NarrowPeak, and IntervalList formats. A chart with attribute mappings can be found at https://github.com/heuermh/bdg-formats/blob/docs/docs/source/features.md
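The idea of one wide schema covering many feature formats can be sketched in a few lines of plain Python. The field names below are illustrative, loosely modeled on the bdg-formats Feature schema rather than copied from the Avro definition; columns that BED does not carry simply stay null, which is why a 6-column BED yields a much wider Parquet with empty slots:

```python
# Sketch of mapping a canonical 6-column BED line onto a wider
# Feature-style record. Field names are illustrative only; see the
# bdg-formats Avro definitions for the real schema.

# A handful of the many Feature fields; anything BED lacks stays None.
FEATURE_FIELDS = [
    "referenceName", "start", "end", "name", "score", "strand",
    "featureType", "source", "phase", "geneId", "transcriptId",
]

def bed_line_to_feature(line: str) -> dict:
    """Parse one BED line (0-based start, exclusive end) into a record."""
    cols = line.rstrip("\n").split("\t")
    feature = dict.fromkeys(FEATURE_FIELDS)  # every field defaults to None
    feature["referenceName"] = cols[0]
    feature["start"] = int(cols[1])
    feature["end"] = int(cols[2])
    if len(cols) > 3:
        feature["name"] = cols[3]
    if len(cols) > 4:
        feature["score"] = float(cols[4])
    if len(cols) > 5:
        feature["strand"] = cols[5]
    return feature

record = bed_line_to_feature("chr1\t1000\t2000\tpeak1\t960\t+")
# record["start"] == 1000, record["strand"] == "+",
# while e.g. record["geneId"] is None (not present in BED).
```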

heuermh avatar Aug 12 '22 18:08 heuermh