pairtools Fragment-level analysis

Discussion to address https://github.com/mirnylab/distiller-nf/issues/96. Since there is pairsamtools restrict, we can use that as a starting point for adding fragment-level stats. But after some thinking, it seems more complicated to implement properly than I thought.

So, how about we add another column (pair_frags_type?) to the pairs file with the classification of the pairs according to this. And then maybe the actual counting can be a part of the regular stats? Then in distiller restrict should be called before stats, if we include it there (and I think it should be there since people expect this kind of stats)?

Here is a list of the stats that need to be implemented, basing off of hiclib. What did I forget?

Same-fragment pairs
- Dangling ends: pairs oriented towards each other that map to the same restriction fragment.
- Self-circles: the same, except they face in opposite directions.
- Self-ligations: the same, but oriented in the same direction (afaiu was not implemented in hiclib, but seems logical to include it)
- hiclib also has a category "error" here, what does that mean? @mimakaev
Undigested pairs - pairs that probably came from undigested or simply re-ligated fragments, so they face towards each other, map to different fragments, but the whole molecule is shorter than a certain threshold, which is the upper size of fragment sizes in the library. I'm presuming this is the "extraDandlingEnds" in hiclib?
- Can we estimate the fragment size from the pairs, by the way? E.g. as the longest dangling end? Something to think about.
Nothing from the above - "Distal"?
Did I forget something?

May 18 '18 17:05 Phlya

As an alternative to all this, provide a way to easily plot pair frequency by distance by read direction?

May 18 '18 17:05 Phlya

Here's an example using the dask-adapter for pairix in bioframe.

It certainly doesn't require either dask or pairix, but would need to be accumulated chunkwise. The functions to compute "contact areas" and pairs scalings were recently moved to cooltools, but they are quite short and reusable: https://github.com/mirnylab/cooltools/blob/master/cooltools/expected.py#L53

May 18 '18 18:05 nvictus

I think we just need to tweak the output of stats a bit so that we don't don't need to do anything on the raw pairs - it already saves number of contacts at different distance with different orientations. I can make a plot like this from stats:

I suppose, it lacks normalisation over the areas for each bin... (This is all for mouse - with no centromeres, whether not counting over-centromeres pairs in stats directly is feasible I am not sure)

Also, if there are any ideas what went wrong with the "Bad" sample, I would be happy to hear your thoughts...

May 18 '18 18:05 Phlya

OK, using @nvictus 's suggestion I can quite easily make this using the areas functions from cooltools and counts from stats.

Still not sure how to interpret the difference between samples, though (the bad one was not sequenced deeply, so the curves are not smooth).

Code, just in case it's useful.

def get_areas(d, chroms=pd.read_table('/home/s1529682/eddie/ilia/genomes/mm9/chrfile.txt', names=['chrom', 'length'])):
    bins = np.append(np.unique(d['Start'].values), [np.inf])
    areas = np.zeros(len(bins)-1)
    for i in range(len(chroms)):
        areas = areas + expected.contact_areas(bins, (0, chroms['length'][i]), (0, chroms['length'][i]))
    return areas

def get_scaling_by_orienation(stats):
    d = stats.loc[stats.index.str.startswith('dist_freq')]
    d = d.reset_index()
    d[['Distance', 'Orientation']] = d['index'].str.split('/', expand=True).iloc[:, 1:]
    d = d.drop('index', axis=1)
    d[['Start', 'End']] = d['Distance'].str.split('-', expand=True).apply(lambda x: [int(i.replace('+', '')) if i is not None else np.inf for i in x])
    d['Distance'] = np.exp((np.log(d['Start'])+np.log(d['End']))/2)
    d = d[['Orientation', 'Distance', 'Start', 'End', 'Value']]
    return d
def plot_scaling_by_orienation(stats, ax):
    d = get_scaling_by_orienation(stats)
    areas = get_areas(d)
    for orientation in '+-', '-+', '++', '--':
        sd = d[d['Orientation']==orientation]
        sd['Value']/= areas
        ax.plot(sd['Distance'], sd['Value']/d['Value'].sum(), label=orientation)

May 19 '18 11:05 Phlya

Are they the same cell type etc? If so, there may be many ligation in trans in the "bad" dataset, by looking at how the P(s) bends more flat around 10^6. If this is the case & you plot the average trans level normalized appropriately, you can see the cis reaching this "noise floor", as we show in MicroC-XL supFig3 ( https://media.nature.com/original/nature-assets/nmeth/journal/v13/n12/extref/nmeth.4025-S1.pdf )

On Sat, May 19, 2018, 4:29 AM Ilya Flyamer [email protected] wrote:

OK, using @nvictus https://github.com/nvictus 's suggestion I can quite easily make this using the areas functions from cooltools and counts from stats.

[image: image] https://user-images.githubusercontent.com/2895034/40268105-8082e3d8-5b5f-11e8-97ba-84f90190d944.png

Still not sure how to interpret the difference between samples, though (the bad one was not sequenced deeply, so I suppose the lack of smoothness is expected).

Code, just in case it's useful.

def get_areas(d, chroms=pd.read_table('/home/s1529682/eddie/ilia/genomes/mm9/chrfile.txt', names=['chrom', 'length'])): bins = np.append(np.unique(d['Start'].values), [np.inf]) areas = np.zeros(len(bins)-1) for i in range(len(chroms)): areas = areas + expected.contact_areas(bins, (0, chroms['length'][i]), (0, chroms['length'][i])) return areas

def get_scaling_by_orienation(stats): d = stats.loc[stats.index.str.startswith('dist_freq')] d = d.reset_index() d[['Distance', 'Orientation']] = d['index'].str.split('/', expand=True).iloc[:, 1:] d = d.drop('index', axis=1) d[['Start', 'End']] = d['Distance'].str.split('-', expand=True).apply(lambda x: [int(i.replace('+', '')) if i is not None else np.inf for i in x]) d['Distance'] = np.exp((np.log(d['Start'])+np.log(d['End']))/2) d = d[['Orientation', 'Distance', 'Start', 'End', 'Value']] return d

def plot_scaling_by_orienation(stats, ax): d = get_scaling_by_orienation(stats) areas = get_areas(d) for orientation in '+-', '-+', '++', '--': sd = d[d['Orientation']==orientation] sd['Value']/= areas ax.plot(sd['Distance'], sd['Value'], label=orientation)

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/mirnylab/pairsamtools/issues/68#issuecomment-390398720, or mute the thread https://github.com/notifications/unsubscribe-auth/ASvzZbzd68_JwlOj82Ail5TqhMb4ICxRks5t0AIpgaJpZM4UFGjp .

May 20 '18 16:05 gfudenberg

Thank @gfudenberg ! Yes, same cell type. This is just an example, I recently have problems like that with all libraries... Yeah, there is a lot of trans (60-70%), and yeah, scaling is flat on the right, clearly a lot of random ligations or something like that - what I mean is there are also differences on the left side, and I thought maybe they can hint at what exactly is going wrong? Because I just have no clue what is going wrong now...

May 20 '18 17:05 Phlya

On Sun, May 20, 2018 at 10:17 AM, Ilya Flyamer [email protected] wrote:

Thank @gfudenberg https://github.com/gfudenberg ! Yes, same cell type. Yeah, there is a lot of trans (60-70%), and yeah, scaling is flat on the right, clearly a lot of random ligations or something like that - what I mean is there are also differences on the left side, and I thought maybe they can hint at what exactly is going wrong? Because I just have no clue what is going wrong now...

Hmm, I don't know if anyone has systematically looked into different failure modes & how they relate to short range orientation P(s). In part because I'd guess that a bunch of experiments that didn't work well never make it to GEO...

I've never seen a bump at 30kb before, so if that's real (and not just e.g. noise from not sequencing deeply enough) my first guess would be something went wrong with the restriction step, leaving a bunch of large fragments which then have trouble ligating. Anything funky with the size of 3C products on the gel?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/mirnylab/pairsamtools/issues/68#issuecomment-390497243, or mute the thread https://github.com/notifications/unsubscribe-auth/ASvzZZaSH5A3s83S1bbsZnAvZ8F4nlNeks5t0aVBgaJpZM4UFGjp .

May 20 '18 17:05 gfudenberg

I realize that, I just thought since you guys have been analyzing other people's data maybe you encountered something like that.

Do you mean 3 kb? Digestion works fine though... Size shift after Hi-C with blunt-end ligation is hard to see on the gel sometimes, so last week I just did 3C same way as I do Hi-C to check that the reagents are working fine, and the gel looks great. (Perhaps not the best place for an extended discussion like this...)

May 20 '18 17:05 Phlya

On Sun, May 20, 2018 at 10:53 AM, Ilya Flyamer [email protected] wrote:

I realize that, I just thought since you guys have been analyzing other people's data maybe you encountered something like that.

Do you mean 3 kb?

Ah, yes, meant 3kb

Digestion works fine though... Size shift after Hi-C with blunt-end ligation is hard to see on the gel sometimes, so last week I just did 3C same way as I do Hi-C to check that the reagents are working fine, and the gel looks great.

I see... yes, it would be interesting to systematically look at cis/total ratio vs. various short-distance orientation statistics (e.g. ratio of self-circles to dangling ends, etc.) to see if there's some useful troubleshooting signals.

May 20 '18 18:05 gfudenberg

There's an enrichment in the supposed "self-circle" fraction < 10kb with a strange "double hump". I'm not sure how unusual it is, but I would have guessed some issue with fill-in or ligation. Your gel suggests that ligation is not the problem though. Maybe it's downstream stuff like library prep or PCR.

Have you looked at @gibcus's bootcamp slides? There might be some QC ideas to try there, like digesting PCR products. He might be able to chime in too.

May 20 '18 18:05 nvictus

Hey Nezar et al., We found self-circles a lot when we were still doing dilute Hi-C. If if the ligation is less inefficient, or there are some cross-linking issues, you may find that T4 ligase can only really ligate fractions to themselves. Do you also have high %trans interactions?

Johan

From: Nezar Abdennur [email protected] Sent: Sunday, May 20, 2018 2:33 PM To: mirnylab/pairsamtools [email protected] Cc: Gibcus, Johan [email protected]; Mention [email protected] Subject: Re: [mirnylab/pairsamtools] Fragment-level analysis (#68)

There's an enrichment in the supposed "self-circle" fraction < 10kb with a strange "double hump". I'm not sure how unusual it is, but I would have guessed some issue with fill-in or ligation. Your gel suggests that ligation is not the problem though. Maybe it's downstream stuff like library prep or PCR.

Have you looked at @gibcushttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_gibcus&d=DwMFaQ&c=WJBj9sUF1mbpVIAf3biu3CPHX4MeRjY_w4DerPlOmhQ&r=58olbnbITZR_6zBWImEHoDJH8XsNp5L7t6MHgnxC5TQ&m=yMgaf_dUKYwvYxxXtfFEE6qGLCyHO7ZpMHUWtW3OasQ&s=_Tuyre5TFOw_SJcU3xLeVYyYNnuNeN-SrrigTN9CCXw&e='s bootcamp slideshttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_hms-2Ddbmi_hic-2Ddata-2Danalysis-2Dbootcamp_blob_master_HiC-2DProtocol.pdf&d=DwMFaQ&c=WJBj9sUF1mbpVIAf3biu3CPHX4MeRjY_w4DerPlOmhQ&r=58olbnbITZR_6zBWImEHoDJH8XsNp5L7t6MHgnxC5TQ&m=yMgaf_dUKYwvYxxXtfFEE6qGLCyHO7ZpMHUWtW3OasQ&s=YS-NJOr53PQLzx00HjMJaLC2CaFSy4rfxywjicrA2CE&e=? There might be some QC ideas to try there, like digesting PCR products. He might be able to chime in too.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_mirnylab_pairsamtools_issues_68-23issuecomment-2D390501699&d=DwMFaQ&c=WJBj9sUF1mbpVIAf3biu3CPHX4MeRjY_w4DerPlOmhQ&r=58olbnbITZR_6zBWImEHoDJH8XsNp5L7t6MHgnxC5TQ&m=yMgaf_dUKYwvYxxXtfFEE6qGLCyHO7ZpMHUWtW3OasQ&s=dWYeFxrnViV32ivmhu_GTfXSQhLKA_FDHOZF7qmjFD0&e=, or mute the threadhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_AIsEqHzgimAf-5FSVu4ZlM4jl9mbOrPis0ks5t0bbVgaJpZM4UFGjp&d=DwMFaQ&c=WJBj9sUF1mbpVIAf3biu3CPHX4MeRjY_w4DerPlOmhQ&r=58olbnbITZR_6zBWImEHoDJH8XsNp5L7t6MHgnxC5TQ&m=yMgaf_dUKYwvYxxXtfFEE6qGLCyHO7ZpMHUWtW3OasQ&s=94X1wuvrLhJQCID-uYwqT3YarZ1lrXgWUNR5DnhBrWU&e=.

May 20 '18 19:05 gibcus

Hi @nvictus , @gibcus

Thanks for your ideas! Yeah, it's possible fill-in is inefficient, but I am not sure how to test this...

Trans-fraction is very high. Crosslinking is not the problem because this has happened with cells crosslinked by different people and even in different labs.

May 20 '18 20:05 Phlya

@gfudenberg you have a good eye! The small peak at 3kb is real and present in a few other new bad data, but never in the old good ones. Here is an extremely bad dataset where it's really obvious:

And by looking at a few of them the height of the peak clearly correlates with how flat the scaling is on the right end! WTF is this?!

May 21 '18 13:05 Phlya

wow, this looks so weird!! Btw, what are the mapping stats, i.e. fraction of duplicates, unmapped and multimappers? Does it differ much from the "good" libraries?

On 21 May 2018 at 09:53, Ilya Flyamer [email protected] wrote:

@gfudenberg https://github.com/gfudenberg you have a good eye! The small peak at 3kb is real and present in a few other new bad data, but never in the old good ones. Here is an extremely bad dataset where it's really obvious: [image: image] https://user-images.githubusercontent.com/2895034/40310980-6c4f5c04-5d06-11e8-9724-8ef35e88ee4b.png

And by looking at a few of them the height of the peak clearly correlates with how flat the scaling is on the right end! WTF is this?!

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/mirnylab/pairsamtools/issues/68#issuecomment-390660044, or mute the thread https://github.com/notifications/unsubscribe-auth/AA3uCgeCgVltpuhg_l8iNHj43zaEdJD0ks5t0sbIgaJpZM4UFGjp .

May 21 '18 14:05 golobor

Yeah, there are differences - values as a fraction of total reads for this extremely bad one.

May 21 '18 14:05 Phlya

huh! that's very concerning, actually! This suggests that the issue might be downstream of ligation, in sequencing. Did you check fastqc of the "bad" samples? Specifically, I'd check if you still have adapters

On 21 May 2018 at 10:32, Ilya Flyamer [email protected] wrote:

Yeah, there are differences - values as a fraction of total reads.

[image: image] https://user-images.githubusercontent.com/2895034/40312765-b8848716-5d0b-11e8-84f3-a5c40b26709c.png [image: image] https://user-images.githubusercontent.com/2895034/40312930-2365c96e-5d0c-11e8-9aea-8d093d18b6ca.png

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/mirnylab/pairsamtools/issues/68#issuecomment-390671911, or mute the thread https://github.com/notifications/unsubscribe-auth/AA3uChhip7a_N__ipzbNMVXvopqYsjPHks5t0tARgaJpZM4UFGjp .

May 21 '18 14:05 golobor

No, I haven't checked, actually... Why do you think it might be adaptors? distiller can do that for me, right, if I specify do_fastqc: True?

May 21 '18 14:05 Phlya

it's just a guess, thinking of what could bump multimappers and nulls... yes, do_fastqc does fastqc :)

On 21 May 2018 at 10:40, Ilya Flyamer [email protected] wrote:

No, I haven't checked, actually... Why do you think it might be adaptors? distiller can do that for me, right, if I specify do_fastqc: True?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/mirnylab/pairsamtools/issues/68#issuecomment-390674234, or mute the thread https://github.com/notifications/unsubscribe-auth/AA3uCgKYM02GOkPNYb0lQeYkqjc-zKUqks5t0tHdgaJpZM4UFGjp .

May 21 '18 14:05 golobor

No adaptors, but some weird stuff is going on!!! Unequal level of A and T across the whole read! And other things... If you want to have a look, here is a zip with the output, serum-RI3t is good, RI4t is bad (the first plot here on top is from these).

fastqc.zip

May 21 '18 14:05 Phlya

also, an intense GC bias...

On 21 May 2018 at 10:52, Ilya Flyamer [email protected] wrote:

No adaptors, but some weird stuff is going on!!! Unequal level of A and T across the whole read! And other things... If you want to have a look, here is a zip with the output, serum-RI3t is good, RI4t is bad (the first plot here on top is from these).

fastqc.zip https://github.com/mirnylab/pairsamtools/files/2023009/fastqc.zip

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/mirnylab/pairsamtools/issues/68#issuecomment-390678179, or mute the thread https://github.com/notifications/unsubscribe-auth/AA3uCkSQ2XzZWjLaWC6ytKbzM129aMrtks5t0tTJgaJpZM4UFGjp .

May 21 '18 14:05 golobor

(I have to say, they were sequenced on different machines)

May 21 '18 14:05 Phlya

Also GATC signature in the beginning is way more pronounced.

May 21 '18 14:05 Phlya

So no ideas from this?..

May 21 '18 17:05 Phlya

any chance you can bug the sequencing facility? This strongly seems like an issue of sequencing, not HiC

On 21 May 2018 at 13:13, Ilya Flyamer [email protected] wrote:

So no ideas from this?..

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/mirnylab/pairsamtools/issues/68#issuecomment-390720024, or mute the thread https://github.com/notifications/unsubscribe-auth/AA3uCt4qPihzg6KAvd6uHoAk2CQQeHSoks5t0vWygaJpZM4UFGjp .

May 21 '18 18:05 golobor

If you mean there is something wrong on the technical side with the sequencing, some bad samples were sequenced in one place, and some - in another...

But I might ask them for their opinion on what could cause such issues though.

May 21 '18 19:05 Phlya

hm, i'm confused now : (a) are there "bad" samples that don't have fastqc issues? (b) are there "good" samples that were sequenced in the same facility as the "bad" ones?

On 21 May 2018 at 15:10, Ilya Flyamer [email protected] wrote:

If you mean there is something wrong on the technical side with the sequencing, some samples were sequenced in one place, and some - in another...

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/mirnylab/pairsamtools/issues/68#issuecomment-390752587, or mute the thread https://github.com/notifications/unsubscribe-auth/AA3uCgCnZAx4M1eCTLAuNPRQ1HtO1ef9ks5t0xESgaJpZM4UFGjp .

May 21 '18 19:05 golobor

a) I haven't run fastqc on all of them b) yes

I can try to compare them more comprehensively tomorrow

May 21 '18 19:05 Phlya

my point is that all observations you showed so far are consistent with a single issue behind "bad" samples - faulty sequencing; is there any piece of evidence that contradicts this theory?

On 21 May 2018 at 15:14, Anton Goloborodko [email protected] wrote:

hm, i'm confused now : (a) are there "bad" samples that don't have fastqc issues? (b) are there "good" samples that were sequenced in the same facility as the "bad" ones?

On 21 May 2018 at 15:10, Ilya Flyamer [email protected] wrote:

If you mean there is something wrong on the technical side with the sequencing, some samples were sequenced in one place, and some - in another...

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/mirnylab/pairsamtools/issues/68#issuecomment-390752587, or mute the thread https://github.com/notifications/unsubscribe-auth/AA3uCgCnZAx4M1eCTLAuNPRQ1HtO1ef9ks5t0xESgaJpZM4UFGjp .

May 21 '18 19:05 golobor

Both good and bad data have been produced in two different facilities, and at one point a few months ago all libraries, wherever we sequence them, started having similar issues.

May 21 '18 19:05 Phlya

So, I ran fastqc for many of the good and bad libraries, and the only mostly consistent problem with the bad ones is very high over-representation of a particular group of K-mers in a particular place in the forward reads. This sequence looks like a part of the adaptor sequence, why it is so often found in a particular place, and only in the forward read is completely enigmatic to me, but their actual counts are pretty low and overall adaptor sequences are not very frequent, so I don't think this can be the source of these problems.

May 22 '18 12:05 Phlya

pairtools pairtools copied to clipboard

Fragment-level analysis

pairtools
pairtools copied to clipboard