Reproduce the Enformer's input sequences split

Open • sararb opened this issue 2 years ago • 1 comment

I would like to regenerate the input sequences for Enformer/Basenji2 using basenji_data.py. I am running the following command:

python basenji_data.py -g hg38.gaps.bed -u umap_k36_t10_l32_hg38.bed -b hg38.blacklist.rep.bed -l 131072 -crop_bp 8192 -break_t 786432 -s 65599 -t .1 -v .1 -w 128 -o data/input_mseqs -p 8 targets.txt

However, the output differs from the sequences.bed file stored here.

Can you please confirm whether I am using the right options to generate the same sequence split?
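One way to quantify such differences is sketched below. It is only an illustration, not part of basenji_data.py: the function names `read_bed` and `compare_beds` are hypothetical, the file paths in the usage comment are placeholders, and it assumes each line of sequences.bed holds chrom, start, end, and an optional fourth column with the split label.

```python
from pathlib import Path


def read_bed(path):
    """Parse a BED file into a dict mapping (chrom, start, end) -> split label.

    Assumption: whitespace-separated columns, with the split label
    (for example train/valid/test) in the fourth column when present.
    """
    intervals = {}
    for line in Path(path).read_text().splitlines():
        fields = line.split()
        # Skip blank, truncated, and comment lines.
        if len(fields) < 3 or fields[0].startswith("#"):
            continue
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        label = fields[3] if len(fields) > 3 else None
        intervals[(chrom, start, end)] = label
    return intervals


def compare_beds(mine, reference):
    """Count intervals unique to each file and shared intervals whose label differs."""
    mine_keys, ref_keys = set(mine), set(reference)
    shared = mine_keys & ref_keys
    relabeled = sum(1 for key in shared if mine[key] != reference[key])
    return {
        "only_mine": len(mine_keys - ref_keys),
        "only_reference": len(ref_keys - mine_keys),
        "shared": len(shared),
        "shared_with_different_split": relabeled,
    }


# Usage (placeholder paths, adjust to your own files):
# report = compare_beds(read_bed("data/input_mseqs/sequences.bed"),
#                       read_bed("reference/sequences.bed"))
# print(report)
```

A high `shared` count with many `shared_with_different_split` entries would suggest the intervals match but the train/valid/test assignment differs. Large `only_mine` or `only_reference` counts would suggest the intervals themselves differ.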

Mar 05 '24 18:03 sararb

Hi Sara, can you say a little more about your goal? It'll influence how I can best help. It'd be a little tricky for me to track down the exact parameters, and basenji_data.py has changed over the years. Is it OK if the recipe is equivalent in quality but differs due to minor tweaks and random number seeds?

Mar 09 '24 01:03 davek44