chip-atlas
Getting read counts from bigWig files.
Dear ChIP-atlas team,
Thank you for the dev and maintenance of this fantastic resource.
I was wondering if more details could be given regarding the bigWig files provided, e.g. for this entry: https://chip-atlas.org/view?id=SRX018625. It would be relevant for members of our laboratory to get the number of reads per position; I understand the values are currently normalized.
Could there be a way to infer the read counts from the normalized values, using this bigWig file and reversing the normalization step?
Kind regards, Ignacio
Hi Ignacio,
The web page of an SRX (e.g. SRX018625) shows the total number of reads, together with the percentages of mapped reads and removed duplicates, as shown in the attached screenshot. Thus, the number of reads used for normalization (taking the SRX018625 bigWig for hg19 as an example) is calculated as follows: 9,231,367 × 0.899 × (1 − 0.038) ≃ 7,983,637
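In code, that back-of-the-envelope calculation (using the hg19 numbers from the screenshot) is:

```python
# Estimate the normalization denominator for SRX018625 (hg19):
# total reads x fraction aligned x fraction kept after duplicate removal.
n_total = 9_231_367
frac_aligned = 0.899       # "Reads aligned (%)" / 100
frac_kept = 1 - 0.038      # 1 - "Duplicates removed (%)" / 100

n_used = n_total * frac_aligned * frac_kept
print(round(n_used))       # prints 7983637
```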
Best, Shinya
@shinyaoki, question about this. When I try to de-normalize the values like this, I get much more integer-looking values if I don't correct for duplicates. Any idea what's up with that? Here's what I've run:
Code to retrieve sequencing metadata:
import requests
from bs4 import BeautifulSoup
from pathlib import Path
import pyranges
import pandas as pd, numpy as np

def parse_record(element) -> dict:
    """Retrieve record info from element."""
    from html import unescape
    out = {}
    first_pass = dict(zip(
        [x.text for x in element.findAll("dt")],
        [unescape(x.text) for x in element.findAll("dd")],
    ))
    out["Number of total reads"] = int(first_pass["Number of total reads"])
    out["Reads aligned (%)"] = float(first_pass["Reads aligned (%)"])
    out["Duplicates removed (%)"] = float(first_pass["Duplicates removed (%)"])
    out["Number of peaks"] = first_pass["Number of peaks"]
    return out

def get_record_elements(soup):
    return [x.parent.parent for x in soup.findAll(string="Number of total reads")]

def get_sequencing_metadata_from_page(soup):
    metadata_element = soup.findAll(attrs={"class": "mdata"})[-1]
    records = [parse_record(x) for x in get_record_elements(metadata_element)]
    keys = [x.text for x in metadata_element.findAll(attrs={"class": "small"})]
    return dict(zip(keys, records))

def get_sequencing_metadata(chip_atlas_id: str) -> dict:
    url = f'https://chip-atlas.org/view?id={chip_atlas_id}'
    req = requests.get(url)  # requests.get's second argument is params, not a parser
    return get_sequencing_metadata_from_page(BeautifulSoup(req.text, 'html.parser'))

def n_mapped_reads(metadata: dict) -> float:
    n_reads = metadata["Number of total reads"]
    pct_aligned = metadata["Reads aligned (%)"]
    pct_dupes_removed = metadata["Duplicates removed (%)"]
    # TODO looks way more like integers without removing dupes
    return n_reads * (pct_aligned / 100)  # * (1 - pct_dupes_removed / 100)
metadata = get_sequencing_metadata("SRX018625")["hg38"]
metadata
{'Number of total reads': 9231367,
'Reads aligned (%)': 91.4,
'Duplicates removed (%)': 3.1,
'Number of peaks': '6122 (qval < 1E-05)'}
n_reads = metadata["Number of total reads"]
pct_aligned = metadata["Reads aligned (%)"]
pct_dupes_removed = metadata["Duplicates removed (%)"]
bw = pyranges.read_bigwig(str(chip_atlas_pth / "bigwig/hg38/SRX018625.bw"))
(bw.Value * n_reads * (pct_aligned / 100) / 1_000_000).value_counts()
1.000431 7946154
2.000853 1049526
3.001284 201514
4.001714 97020
5.002137 64783
...
69.029470 43
67.028692 40
70.029896 28
71.030329 18
72.030763 7
Name: Value, Length: 72, dtype: int64
(bw.Value * n_reads * (pct_aligned / 100) * (1 - pct_dupes_removed / 100) / 1_000_000).value_counts()
0.969417 7946154
1.938827 1049526
2.908244 201514
3.877661 97020
4.847071 64783
...
66.889557 43
64.950802 40
67.858969 28
68.828389 18
69.797809 7
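For completeness, here is a small self-contained round-trip check of the observation above: if the bigWig values are reads-per-million over mapped reads *without* the duplicate correction, then multiplying by mapped reads / 1e6 recovers exact integers, while also folding in the duplicate factor shifts everything off-integer. All numbers below are synthetic, not taken from the SRX.

```python
import numpy as np

# Synthetic round-trip check (all values hypothetical): simulate integer
# per-position read counts, normalize to reads-per-million using mapped
# reads only, then de-normalize both with and without the duplicate factor.
rng = np.random.default_rng(0)
counts = rng.integers(1, 50, size=10_000)   # true integer coverage
n_mapped = 8_437_469                        # total reads x aligned fraction
pct_dupes_removed = 3.1

rpm = counts * 1e6 / n_mapped               # what the bigWig would store

denorm_no_dupes = rpm * n_mapped / 1e6
denorm_with_dupes = denorm_no_dupes * (1 - pct_dupes_removed / 100)

# De-normalizing without the duplicate factor recovers the integers...
assert np.allclose(denorm_no_dupes, counts)
# ...while applying the duplicate correction lands off-integer everywhere.
assert not np.allclose(denorm_with_dupes, np.round(denorm_with_dupes))
```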