chip-atlas
Getting read counts from bigWig files.
Dear ChIP-atlas team,
Thank you for the dev and maintenance of this fantastic resource.
I was wondering if more details could be given regarding the bigWig files provided, e.g. for this entry: https://chip-atlas.org/view?id=SRX018625. It would be relevant for members of our laboratory to get the number of reads per position; I understand the values are currently normalized.
Could there be a way to infer the read counts from the normalized values, using this bigWig file and reversing the normalization step?
Kind regards, Ignacio
Hi Ignacio,
The web page of an SRX (e.g. SRX018625) shows the total number of reads, together with the percentages of mapped reads and removed duplicates, as shown in the attached screenshot. Thus, the number of reads used for normalization (taking the SRX018625 bigWig for hg19 as an example) is calculated as follows: 9,231,367 × 0.899 × (1 − 0.038) ≃ 7,983,637
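In code, that back-of-the-envelope calculation (using the hg19 numbers from the screenshot) is:

```python
# Estimate the normalization denominator for SRX018625 (hg19):
# total reads x fraction aligned x fraction kept after duplicate removal.
n_total = 9_231_367
frac_aligned = 0.899       # "Reads aligned (%)" / 100
frac_kept = 1 - 0.038      # 1 - "Duplicates removed (%)" / 100

n_used = n_total * frac_aligned * frac_kept
print(round(n_used))       # prints 7983637
```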
Best, Shinya
@shinyaoki, question about this. When I try to de-normalize the values like this, I get much more integer-looking values if I don't correct for duplicates. Any idea what's up with that? Here's what I've run:
Code to retrieve sequencing metadata:
import requests
from bs4 import BeautifulSoup
from pathlib import Path
import pyranges
import pandas as pd, numpy as np

def parse_record(element) -> dict:
    """Retrieve record info from element."""
    from html import unescape
    out = {}
    first_pass = dict(zip(
        [x.text for x in element.findAll("dt")],
        [unescape(x.text) for x in element.findAll("dd")],
    ))
    out["Number of total reads"] = int(first_pass["Number of total reads"])
    out["Reads aligned (%)"] = float(first_pass["Reads aligned (%)"])
    out["Duplicates removed (%)"] = float(first_pass["Duplicates removed (%)"])
    out["Number of peaks"] = first_pass["Number of peaks"]
    return out

def get_record_elements(soup):
    return [x.parent.parent for x in soup.findAll(string="Number of total reads")]

def get_sequencing_metadata_from_page(soup):
    metadata_element = soup.findAll(attrs={"class": "mdata"})[-1]
    records = [parse_record(x) for x in get_record_elements(metadata_element)]
    keys = [x.text for x in metadata_element.findAll(attrs={"class": "small"})]
    return dict(zip(keys, records))

def get_sequencing_metadata(chip_atlas_id: str) -> dict:
    url = f'https://chip-atlas.org/view?id={chip_atlas_id}'
    req = requests.get(url)  # requests.get's second argument is params, not a parser
    return get_sequencing_metadata_from_page(BeautifulSoup(req.text, 'html.parser'))

def n_mapped_reads(metadata: dict) -> float:
    n_reads = metadata["Number of total reads"]
    pct_aligned = metadata["Reads aligned (%)"]
    pct_dupes_removed = metadata["Duplicates removed (%)"]
    # TODO looks way more like integers without removing dupes
    return n_reads * (pct_aligned / 100)  # * (1 - pct_dupes_removed / 100)
metadata = get_sequencing_metadata("SRX018625")["hg38"]
metadata
{'Number of total reads': 9231367,
'Reads aligned (%)': 91.4,
'Duplicates removed (%)': 3.1,
'Number of peaks': '6122 (qval < 1E-05)'}
n_reads = metadata["Number of total reads"]
pct_aligned = metadata["Reads aligned (%)"]
pct_dupes_removed = metadata["Duplicates removed (%)"]
bw = pyranges.read_bigwig(str(chip_atlas_pth / "bigwig/hg38/SRX018625.bw"))
(bw.Value * n_reads * (pct_aligned / 100) / 1_000_000).value_counts()
1.000431 7946154
2.000853 1049526
3.001284 201514
4.001714 97020
5.002137 64783
...
69.029470 43
67.028692 40
70.029896 28
71.030329 18
72.030763 7
Name: Value, Length: 72, dtype: int64
(bw.Value * n_reads * (pct_aligned / 100) * (1 - pct_dupes_removed / 100) / 1_000_000).value_counts()
0.969417 7946154
1.938827 1049526
2.908244 201514
3.877661 97020
4.847071 64783
...
66.889557 43
64.950802 40
67.858969 28
68.828389 18
69.797809 7
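For completeness, here is a small self-contained round-trip check of the observation above: if the bigWig values are reads-per-million over mapped reads *without* the duplicate correction, then multiplying by mapped reads / 1e6 recovers exact integers, while also folding in the duplicate factor shifts everything off-integer. All numbers below are synthetic, not taken from the SRX.

```python
import numpy as np

# Synthetic round-trip check (all values hypothetical): simulate integer
# per-position read counts, normalize to reads-per-million using mapped
# reads only, then de-normalize both with and without the duplicate factor.
rng = np.random.default_rng(0)
counts = rng.integers(1, 50, size=10_000)   # true integer coverage
n_mapped = 8_437_469                        # total reads x aligned fraction
pct_dupes_removed = 3.1

rpm = counts * 1e6 / n_mapped               # what the bigWig would store

denorm_no_dupes = rpm * n_mapped / 1e6
denorm_with_dupes = denorm_no_dupes * (1 - pct_dupes_removed / 100)

# De-normalizing without the duplicate factor recovers the integers...
assert np.allclose(denorm_no_dupes, counts)
# ...while applying the duplicate correction lands off-integer everywhere.
assert not np.allclose(denorm_with_dupes, np.round(denorm_with_dupes))
```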