ASTRAL
Inferred tree does not have the highest quartet score
Dr. Mirarab,
I am running ASTRAL on several large datasets that include multi-copy gene families. Overall, results have been very consistent, but I'm getting one result that is surprising. If I infer a tree from one dataset which includes 60 species, 5594 gene trees, and ~388,000 gene copies, the tree is rather different than trees inferred from single-copy datasets from this same group, or using ASTRAL-Pro. Interestingly, if I use FASTRAL, the tree I infer from this dataset looks much more reasonable. I have tried the following to attempt to understand this behavior:
- I added the ASTRAL tree to the FASTRAL search space, and this did not change results (FASTRAL still finds a reasonable tree).
- I ran ASTRAL on random subsets of the data, and this did change results (ASTRAL infers a reasonable tree in this case).
- I calculated the quartet scores for two trees: a reasonable tree from ASTRAL-Pro and the weird tree inferred using ASTRAL. The quartet score was higher for the ASTRAL-Pro tree (12425072354, normalized = 0.5377) than for the ASTRAL tree (12380755710, normalized = 0.5358). Both scores were calculated with the same version of ASTRAL and the same dataset that produced the weird ASTRAL tree.
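For readers who want to repeat the random-subset check, here is a minimal sketch of one way to draw a subset of gene trees. It assumes the gene trees are stored one Newick string per line; the function name `sample_gene_trees`, the toy trees, and the leaf labels are hypothetical, and the exact subsampling procedure used in this report is not stated.

```python
import random


def sample_gene_trees(newick_lines, k, seed=0):
    """Return k gene trees drawn without replacement from newick_lines."""
    # Drop blank lines so that only real tree strings are sampled.
    trees = [line.strip() for line in newick_lines if line.strip()]
    if k > len(trees):
        raise ValueError("k exceeds the number of gene trees")
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    return rng.sample(trees, k)


# Toy stand-in for a gene-tree file (hypothetical labels, one tree per line).
toy_gene_trees = [
    "((A_1,B_1),(C_1,D_1));",
    "((A_1,C_1),(B_1,D_1));",
    "((A_1,B_1),(C_1,(D_1,D_2)));",
    "(((A_1,A_2),B_1),(C_1,D_1));",
    "",
]

subset = sample_gene_trees(toy_gene_trees, k=2, seed=1)
print("\n".join(subset))
```

Writing each subset to its own file and running the species-tree method on it would then show whether the unusual topology persists across subsets.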
I'm messaging to see if you might have any insight into this. I have attached the gene trees here, in case that is helpful to answering this question. I suspect that this could be related to the large search space ASTRAL is encountering in this case, but I am unsure. Thanks for your time.
Sincerely, Megan Smith
Megan, thanks for the message. Seems like an interesting case. I will get to this next week in more detail. In the meantime, one question and one point.
Question: Can you compare the quartet score of FASTRAL and ASTRAL? My guess is that FASTRAL is getting a better quartet score, just as A-Pro does. That would explain some of the oddities.
Comment: Even though ASTRAL-multi can handle multi-copy datasets and has all sorts of nice theoretical guarantees under duploss, we never designed it for multi-copy. In particular, we never tried to make the search space nice given multi-copy. In simulations, we never found ASTRAL-multi to be super accurate on multi-copy data. In contrast, A-Pro was much more accurate. So, I am in general less inclined to suggest anyone should use ASTRAL-multi for multi-copy. If anything, I remain surprised at the fact that it works at all :)
Ok, one more comment. We have a new C++ implementation of ASTRAL (paper to be on arxiv in a couple of days) that uses a completely new search space strategy. It may work better for your case. If you feel you may want to try it, I will share the link here.
Hi Siavash,
Thanks for the quick reply.
The score for the FASTRAL tree is 12425305589, so it is receiving a better quartet score than the ASTRAL (12380755710) tree as well, as you suspected.
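As a quick sanity check, the three raw quartet scores reported in this thread can be compared directly. The sketch below also back-calculates the total number of quartets from the two normalized values, on the assumption that the normalized score is the raw score divided by the total number of gene-tree quartets; the variable names are illustrative only.

```python
# Raw quartet scores as reported in this thread.
scores = {
    "ASTRAL": 12380755710,
    "ASTRAL-Pro": 12425072354,
    "FASTRAL": 12425305589,
}

# Normalized scores reported for two of the trees (rounded to 4 decimals).
normalized = {"ASTRAL": 0.5358, "ASTRAL-Pro": 0.5377}

best = max(scores, key=scores.get)
gaps = {name: scores[best] - value for name, value in scores.items()}

for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name}: {scores[name]} ({gap} below {best})")

# Assumption: normalized = raw / total, so total is roughly raw / normalized.
implied_totals = {name: scores[name] / normalized[name] for name in normalized}
for name, total in implied_totals.items():
    print(f"implied total quartets from {name}: {total:.4e}")
```

Both implied totals come out near 2.31e10, so the raw and normalized values are mutually consistent, and the ASTRAL tree trails the other two by roughly 0.2 percentage points of the total.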
Your comment about A-Pro being more appropriate is well-taken. We are comparing several different inference methods on a variety of datasets (including ASTRAL and A-Pro), so I was hoping to keep the comparisons consistent and run both on this dataset as well.
Running the C++ implementation on this dataset seems like it would potentially be really informative, and I’d love to try it out.
Best, Megan
Megan Smith NSF Postdoctoral Fellow Department of Biology Indiana University, Bloomington, IN
Hi Megan,
So we can confirm that on the multi-copy dataset, ASTRAL-multi has not been able to build a good search space. FASTRAL has created a better search space, and its results supersede those of ASTRAL-multi (since they seek to solve the same problem). I hope our C++ implementation is also good, though we have not tested it on multi-copy input: https://github.com/chaoszhang/ASTER
As a short explanation, the ASTRAL-multi search space is optimized assuming that we have multiple alleles. I don't think its search space strategies are as good for multi-copy input. Perhaps based on the results of your study, we will get some incentive to produce a version of ASTRAL-multi specifically focused on multi-copy input. Please let me know how the testing with the C++ implementation goes.
The C++ implementation is described here (fresh off the press): https://www.biorxiv.org/content/10.1101/2022.02.19.481132v1
Thanks Siavash