Working with continuous expression data?

Open stevenagl12 opened this issue 1 year ago • 10 comments

I have a potentially dumb question. As I understand it, continuous biological data, such as gene expression or cytometry data, has to be discretized before it can be used with this package. However, the built-in bn.discretize function takes an already-built graph as input. With our data, we can't infer which nodes and edges to start from, so we have no initial graph. How can we use this package with such continuous data? As I understand it, the R bnlearn library comes with the iamb and Hartemink discretization options, but I don't see those in this package.

stevenagl12 avatar Feb 19 '24 20:02 stevenagl12

When you only have data and want to start without a structure, try structure learning. However, the methods in bnlearn do require the data to be discrete.

Two suggestions on how to approach this:

  1. Discretize your data based on your domain knowledge and/or in combination with other statistics. For example, for your gene expression profiles you could run a t-test against a control group and set a threshold (alpha = 0.05), with or without multiple-testing correction. This would give three states for each gene (up, baseline, down). If you don't have a control group, try fitting the data to a theoretical distribution (check out distfit) and make a cut at the 95% CI or so. Do this on both sides of the distribution and you would again have three states per gene. This comes close to the constraint-based approach: https://erdogant.github.io/bnlearn/pages/html/Structure%20learning.html#constraint-based

  2. Try using the built-in functionality of bnlearn to automatically discretize and create states based on the continuous expression profiles. This is again a starting point towards structure learning. See the documentation for more details.

https://erdogant.github.io/bnlearn/pages/html/Continuous%20Data.html
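The first suggestion (no control group, cut on both tails of a fitted distribution) can be sketched roughly as follows. This is only an illustration and not part of bnlearn: it assumes a normal distribution as a stand-in for whatever distribution a tool like distfit would select, and the function name `three_state` and the sample values are hypothetical.

```python
from statistics import NormalDist


def three_state(values, coverage=0.95):
    """Label each value 'down', 'baseline', or 'up' using a fitted normal.

    Assumption: the values are roughly normally distributed. A different
    theoretical distribution may fit real expression data better.
    """
    dist = NormalDist.from_samples(values)  # fit mean and standard deviation
    tail = (1 - coverage) / 2
    lower = dist.inv_cdf(tail)       # e.g. the 2.5% quantile
    upper = dist.inv_cdf(1 - tail)   # e.g. the 97.5% quantile

    states = []
    for v in values:
        if v < lower:
            states.append("down")
        elif v > upper:
            states.append("up")
        else:
            states.append("baseline")
    return states


# Hypothetical log fold changes for a single gene across 12 samples.
gene_a = [0.1, -0.2, 0.05, 0.0, -0.1, 0.15, -0.05, 0.2, -0.15, 0.1, 3.5, -3.2]
states = three_state(gene_a)
print(states)
```

Applying this per gene gives a fully discrete table without needing a DAG up front, which can then serve as input for structure learning.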

There are no methods like iamb. However, check out what's available in pgmpy. If there is something that could help you, I am open to merging commits.

Asking questions makes you smart btw. Keep it up 👍🏻

erdogant avatar Feb 21 '24 15:02 erdogant

So, while I understand the first part, I was wondering about the second. The discretize function takes a DAG as an argument. In the example, this DAG is created from prior knowledge of the connections between the variables. How do we create one without knowing which variables might be connected?

stevenagl12 avatar Feb 21 '24 15:02 stevenagl12

You are right. The second part does need a DAG at the start. Unfortunately, there is no other implementation yet.

erdogant avatar Feb 26 '24 23:02 erdogant

Hi,

By continuous biological data, did you mean continuous data like various numbers (for example 103.2, 102, 99, 2.5, etc.) or time-series data? If it isn't either of these, could you please explain what the data you mentioned looks like?

Also, if it is different, is this package applicable for continuous data like the kind I mentioned above?

akshatakarjun avatar Jul 16 '24 21:07 akshatakarjun

I was talking about various numbers of RNAseq fold changes.

stevenagl12 avatar Jul 16 '24 22:07 stevenagl12

If you would like to see a comparison with other causal packages, you can read it in my blog over here. The last time I checked, only CausalImpact can model continuous values, but that is for time-series data. So, it is not applicable when you are using RNAseq data.

erdogant avatar Jul 22 '24 16:07 erdogant

I also have a dumb question:

I have a dataset that mixes continuous and discrete data. I noticed the bn.discretize function takes a lot of time (my dataset is roughly 11000 points and 9 columns, 4 of which are continuous).
Is it possible to discretize outside of bnlearn, or is that not compatible?

I tried using the pandas functions to circumvent the issue and generate Interval Indexes in my dataset but with very little success.

Loominarty avatar Jul 25 '24 07:07 Loominarty

I'm unsure what kind of continuous data you have, but if possible, you can manually put the values into discrete ranges. For example, if a feature called BloodPressure has various values, we know which values of BP are considered normal, high, and low. You can loop over the rows with an if statement: if the value falls in a given range, replace it with the categorical value you want.

Just a thought!!
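A minimal sketch of that kind of manual binning, using only the Python standard library. The cut points below are hypothetical and purely illustrative (not clinical guidance), and the names `EDGES`, `LABELS`, and `to_category` are made up for this example.

```python
from bisect import bisect_right

# Hypothetical cut points for a BloodPressure column (illustrative only).
EDGES = [90, 130]                    # below 90 -> low, 90 to below 130 -> normal, else high
LABELS = ["low", "normal", "high"]   # one more label than there are edges


def to_category(value, edges=EDGES, labels=LABELS):
    """Return the label of the range that `value` falls into."""
    # bisect_right counts how many edges are <= value, which is the bin index.
    return labels[bisect_right(edges, value)]


readings = [85, 118, 142, 90, 130]
categories = [to_category(r) for r in readings]
print(categories)
```

Using a sorted list of edges instead of chained if statements keeps the thresholds in one place, so they are easy to adjust per feature.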

akshatakarjun avatar Jul 25 '24 11:07 akshatakarjun

Hi @akshatakarjun ,

I found something that works alright, but is not very convenient in terms of user comfort. I have discretized outside of the library and used bn.df2onehot to encode the indexes into integers. Then I just translate my new incoming data into one of these numbers.
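The workflow described above (discretize outside the library, then map new incoming data onto the same integer codes) might look something like this. It is a sketch under assumptions: quantile-based bins are just one possible choice, the helper names `fit_edges` and `encode` are hypothetical, and none of this is bnlearn's own API.

```python
from bisect import bisect_right
from statistics import quantiles


def fit_edges(values, n_bins=3):
    """Learn n_bins - 1 quantile cut points from the training values."""
    return quantiles(values, n=n_bins)


def encode(values, edges):
    """Map each value to an integer bin code in 0..len(edges)."""
    return [bisect_right(edges, v) for v in values]


# Hypothetical training column with continuous values.
train = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
edges = fit_edges(train)             # learned once and stored
train_codes = encode(train, edges)   # integer states for model fitting

# New incoming data reuses the stored edges, so the codes stay consistent.
new_codes = encode([0.5, 5.5, 100.0], edges)
print(edges, train_codes, new_codes)
```

Storing the edges is the important part: values outside the training range still fall into the first or last bin, so new data never produces an unseen state.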

Loominarty avatar Jul 26 '24 01:07 Loominarty

You can indeed manipulate your data as you wish. The df2onehot was included in bnlearn to provide one of the steps from start-to-results. So you are right, it brings some comfort but at the same time it is generally slow.

erdogant avatar Aug 02 '24 08:08 erdogant