methplotlib
methplotlib copied to clipboard
bgzip, tabix and sorting
Hi @wdecoster
I have 2 nanopolish output tsv files (calls and frequencies). I would like to use them as input for methplotlib. I have tried to sort as you wrote in an example but when i want to use tabix it gives an unsorted error.
cat <(head -n1 sample8_methylation_calls_2.tsv) <(tail -n +2 sample8_methylation_calls_2.tsv | sort -k2,2 -k3,3) | bgzip > sample8_sort_calls.tsv.gz
tabix -S1 -s1 -b3 -e4 sample8_methylation_calls_2.tsv.gz
Looking at the sorted and decompressed tsv file i see that the problem is between the plus and minus strand:

Should it only be sorted by the 3 column and how can this be done?
When i sort like this and use tabix afterwards the file does not work in methplotlib
cat <(head -n1 sample8_methylation_calls_2.tsv) <(tail -n +2 sample8_methylation_calls_2.tsv | sort -k 3) | bgzip > sample8_sort_calls.tsv.gz
(methplotlib_caroline_python3) root@srvodeappgsq02:/longread/G128-2021-Imprinting-disorders/20220315_1228_1G_PAI63326_dd13a3c5# methplotlib -m sample8_sort2_calls.tsv.gz -n calls -w chr15:24,500,000-25,500,000 -f /longread/resources/reference_genomes/GRCh38_Hg38/hg38.fa Traceback (most recent call last): File "/root/miniconda3/envs/methplotlib_caroline_python3/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3621, in get_loc return self._engine.get_loc(casted_key) File "pandas/_libs/index.pyx", line 136, in pandas._libs.index.IndexEngine.get_loc File "pandas/_libs/index.pyx", line 142, in pandas._libs.index.IndexEngine.get_loc TypeError: '85285 False 85303 True 85322 True 85341 True 85358 True ... 13851 True 13873 True 13895 False 13917 False 13939 True Name: log_lik_ratio, Length: 166259, dtype: bool' is an invalid key
During handling of the above exception, another exception occurred:
Traceback (most recent call last): File "/root/miniconda3/envs/methplotlib_caroline_python3/lib/python3.10/site-packages/pandas/core/series.py", line 1086, in setitem self._set_with_engine(key, value) File "/root/miniconda3/envs/methplotlib_caroline_python3/lib/python3.10/site-packages/pandas/core/series.py", line 1147, in _set_with_engine loc = self.index.get_loc(key) File "/root/miniconda3/envs/methplotlib_caroline_python3/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3628, in get_loc self._check_indexing_error(key) File "/root/miniconda3/envs/methplotlib_caroline_python3/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 5637, in _check_indexing_error raise InvalidIndexError(key) pandas.errors.InvalidIndexError: 85285 False 85303 True 85322 True 85341 True 85358 True ... 13851 True 13873 True 13895 False 13917 False 13939 True Name: log_lik_ratio, Length: 166259, dtype: bool
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/miniconda3/envs/methplotlib_caroline_python3/bin/methplotlib", line 8, in
For the calls file you should sort on the first (chromosome), third (start), and fourth (end) columns.
Sorry for my ignorance, but how is this done?
And should it also be done on the frequency file?
I think
cat <(head -n1 sample8_methylation_calls_2.tsv) <(tail -n +2 sample8_methylation_calls_2.tsv | sort -k1,1n -k3,3 -k4,4) | bgzip > sample8_sort_calls.tsv.gz
The frequency file has chromosome, start and end in columns 1,2,3 so that would probably look like
cat <(head -n1 sample8_methylation_freqs.tsv) <(tail -n +2 sample8_methylation_freqs.tsv | sort -k1,1n -k2,2 -k3,3) | bgzip > sample8_sort_freqs.tsv.gz
Please create a separate issue for this and also provide the version of methplotlib that you are using.