Feature/enrich structure options
- [x] Still need to trim the changes to model.cpp down to the actual changes.
- [ ] Handling alternate location selection is tricky, especially when no alternate location (usually given by a letter) is explicitly chosen. In all cases it seems to require looking, at the residue level, at what the possible alternate locations are. It is further complicated by the need to handle possible ties in occupancy and B-factor. Without this constraint, the StructureOpenOptions could likely just return a "static" condition. I would like to discuss any better way to implement this.
- [ ] Preserve the default behavior when no StructureOpenOptions object is provided.
What is the status of this pull request? Are you still working on it? Do you need help?

> What is the status of this pull request? Are you still working on it? Do you need help?

Dear @mhekkel,
The PR would clearly benefit from your help:
- Do you see some interest in the PR's expected outcomes (filtering models)? Is there any other way to achieve the same outcomes?
- Do you have an idea as to how we could handle alternate locations in a clever fashion?
> The PR would clearly benefit from your help:
> - Do you see some interest in the PR's expected outcomes (filtering models)? Is there any other way to achieve the same outcomes?

Filtering models: if I want to do this, I filter the underlying cif datablocks first. A cif::mm::model is only a flyweight, a view on the underlying data. But I can see some use for it.

> - Do you have an idea as to how we could handle alternate locations in a clever fashion?

Yes, I think I do. I will create an alternative implementation for you to review. Give me a couple of days.
> The PR would clearly benefit from your help:
> - Do you see some interest in the PR's expected outcomes (filtering models)? Is there any other way to achieve the same outcomes?
>
> Filtering models: if I want to do this, I filter the underlying cif datablocks first. A cif::mm::model is only a flyweight, a view on the underlying data. But I can see some use for it.

I used to filter the datablocks by erasing the atom entries that I did not want to end up in the model, but the erasure was too costly.

> - Do you have an idea as to how we could handle alternate locations in a clever fashion?
>
> Yes, I think I do. I will create an alternative implementation for you to review. Give me a couple of days.

So nice! Thanks.
One more question though: what is your goal? I mean, when you skip atoms while constructing a model, these atoms will remain in the underlying data store. So if you then write out the data, these skipped atoms will be written as well.
If you only want to inspect the data, this is not a problem, but if you want to write out mmCIF files based on the data you read, it is a problem of course.
I can also give you ideas on how to remove alternates more efficiently, if that's what you actually want.
> One more question though: what is your goal? I mean, when you skip atoms while constructing a model, these atoms will remain in the underlying data store. So if you then write out the data, these skipped atoms will be written as well.
>
> If you only want to inspect the data, this is not a problem, but if you want to write out mmCIF files based on the data you read, it is a problem of course.

For the time being, we convert the cifpp model into our own molecular structure (the mapping being pretty straightforward). But if we were to write our own structure back to mmCIF format (for instance after computing new conformations of the molecule), this could become a bottleneck, I assume (we currently only support writing back to PDB files).

> I can also give you ideas on how to remove alternates more efficiently, if that's what you actually want.

As an example, this is what we currently do to filter the tables:
```cpp
// Filter hydrogen atoms
if (!m_loaded_hydrogens)
{
    m_atm_discarded[s_i] += db["atom_site"].erase(
        cif::key("type_symbol") == "H" or cif::key("type_symbol") == "D");
}

// Filter high "temperature" atoms
m_atm_discarded[s_i] += db["atom_site"].erase(
    cif::key("B_iso_or_equiv") > m_loaded_b_factor_limit);

// Filter atoms based on model id
if (!m_loaded_models.empty())
{
    if (!m_loaded_models[s_i].empty())
    {
        cif::condition condition;
        for (std::size_t m_i : m_loaded_models[s_i])
            condition = std::move(condition) and not (cif::key("pdbx_PDB_model_num") == m_i);
        m_atm_discarded[s_i] += db["atom_site"].erase(std::move(condition));
    }
}
```
And for the record, there is an actual gain with the PR on our end (constructing the model on a filtered representation of the atoms instead of erasing them).
Hello @mhekkel,
Is there anything I can try to further improve the PR?
> Hello @mhekkel,
> Is there anything I can try to further improve the PR?

Ah, yes, sorry... I was busy with another project. But your message is a helpful reminder, I'll start working on this next week.
The most important thing is that creating a condition like you did is not the best approach, it is not really scalable. But I'll try to come up with an alternative.
> Hello @mhekkel, is there anything I can try to further improve the PR?
>
> Ah, yes, sorry... I was busy with another project. But your message is a helpful reminder, I'll start working on this next week.
> The most important thing is that creating a condition like you did is not the best approach, it is not really scalable. But I'll try to come up with an alternative.

Fantastic, looking forward to seeing the alternative approach.
> One more question though: what is your goal? I mean, when you skip atoms while constructing a model, these atoms will remain in the underlying data store. So if you then write out the data, these skipped atoms will be written as well.
>
> If you only want to inspect the data, this is not a problem, but if you want to write out mmCIF files based on the data you read, it is a problem of course.
>
> For the time being, we convert the cifpp model into our own molecular structure (the mapping being pretty straightforward). But if we were to write our own structure back to mmCIF format (for instance after computing new conformations of the molecule), this could become a bottleneck, I assume (we currently only support writing back to PDB files).

OK, so you want to write out some modified atoms. But what about the atoms you filtered out while loading the structure? Filtering in mm::structure only means the atoms are not used in the higher-level API, but they remain in the data file.
If you want to discard atoms from the cif::file, you should do that at the lower level, using category::erase.
After deleting the atoms you could create an mm::structure with this pruned data.
Of course that means that if you write out that data, those deleted atoms are gone. I don't know if that is a problem.
I guess that's almost what you're doing right now; in your example you're also erasing atoms.
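A minimal sketch of that lower-level route (erase first, then build the structure on the pruned data). The exact cif::mm::structure constructor overload used here (a cif::file plus a model number) is an assumption to verify against the libcifpp headers of your version:

```cpp
#include <cif++.hpp>

#include <iostream>

int main(int argc, char *const argv[])
{
    if (argc < 2)
        return 1;

    using namespace cif::literals;

    cif::file f(argv[1]);
    auto &db = f.front();

    // Low-level erase: drop hydrogens before any model/structure is created,
    // so they are gone from the data as well as from the high-level API.
    db["atom_site"].erase("type_symbol"_key == "H" or "type_symbol"_key == "D");

    // Build the structure on the pruned data. The (cif::file &, model number)
    // overload is an assumption; check the headers you build against.
    cif::mm::structure s(f, 1);

    // Writing the file back out now no longer contains the erased atoms.
    std::cout << f;

    return 0;
}
```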
> One more question though: what is your goal? I mean, when you skip atoms while constructing a model, these atoms will remain in the underlying data store. So if you then write out the data, these skipped atoms will be written as well.
>
> If you only want to inspect the data, this is not a problem, but if you want to write out mmCIF files based on the data you read, it is a problem of course.
>
> For the time being, we convert the cifpp model into our own molecular structure (the mapping being pretty straightforward). But if we were to write our own structure back to mmCIF format (for instance after computing new conformations of the molecule), this could become a bottleneck, I assume (we currently only support writing back to PDB files).
>
> OK, so you want to write out some modified atoms. But what about the atoms you filtered out while loading the structure? Filtering in mm::structure only means the atoms are not used in the higher-level API, but they remain in the data file.
> If you want to discard atoms from the cif::file, you should do that at the lower level, using category::erase.
> After deleting the atoms you could create an mm::structure with this pruned data.
> Of course that means that if you write out that data, those deleted atoms are gone. I don't know if that is a problem.

It is currently not a problem.

> I guess that's almost what you're doing right now; in your example you're also erasing atoms.

That is exactly what I was doing in my "client" code: removing the atoms from the db. But it was inefficient due to the "erase" function calls.
You mentioned that using conditions was not the most efficient way to proceed?
> You mentioned that using conditions was not the most efficient way to proceed?

I saw code where you created very complex conditions, and those are inefficient.
What could help when simply removing some atoms is to set the validator for the atom_site category to null before the erase and to restore the validator afterwards. That avoids the cascading code, the code that checks whether the deletion of an atom should have consequences elsewhere.
This is one area where I should optimise my code, btw.
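A hedged sketch of that validator trick. The get_validator/set_validator calls on the datablock are assumptions about the libcifpp API (the advice above targets the atom_site category specifically), so verify the exact member functions and signatures in the headers before relying on this:

```cpp
#include <cif++.hpp>

#include <iostream>

int main(int argc, char *const argv[])
{
    if (argc < 2)
        return 1;

    cif::file f(argv[1]);
    auto &db = f.front();

    // Assumption: detach the validator so the erase below skips the cascading
    // checks for consequences in other categories, then restore it afterwards.
    auto saved = db.get_validator();
    db.set_validator(nullptr);

    // Example erase: drop atoms above an arbitrary B-factor cut-off of 60.
    db["atom_site"].erase(cif::key("B_iso_or_equiv") > 60.0f);

    db.set_validator(saved);

    std::cout << f;

    return 0;
}
```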
> You mentioned that using conditions was not the most efficient way to proceed?
>
> I saw code where you created very complex conditions, and those are inefficient.

Understood. Nonetheless, it seems to me that handling alternate locations in a deterministic manner is complex.

> What could help when simply removing some atoms is to set the validator for the atom_site category to null before the erase and to restore the validator afterwards. That avoids the cascading code, the code that checks whether the deletion of an atom should have consequences elsewhere.

Great, I will try this whenever I can.

> This is one area where I should optimise my code, btw.
This might be an alternative to your code to filter alternates based on occupancy. I would go for code like this.
This example removes the highest occupancy alternate.
```cpp
#include <cif++.hpp>

#include <iostream>
#include <map>
#include <string>
#include <tuple>

int main(int argc, char *const argv[])
{
    if (argc < 2)
        return 1;

    using namespace cif::literals;

    cif::file f(argv[1]);
    auto &db = f.front();
    auto &atom_site = db["atom_site"];

    // Sum the occupancies per residue (asym_id/seq_id) and per alternate location
    std::map<std::tuple<std::string, int>, std::map<std::string, float>> alts;

    for (const auto &[asym_id, seq_id, alt_id, occupancy] :
        atom_site.find<std::string, int, std::string, float>(
            "label_alt_id"_key != cif::null,
            "label_asym_id", "label_seq_id", "label_alt_id", "occupancy"))
    {
        auto key = std::make_tuple(asym_id, seq_id);

        if (auto i = alts.find(key); i != alts.end())
            i->second[alt_id] += occupancy;
        else
            alts[key][alt_id] = occupancy;
    }

    for (const auto &[key, value] : alts)
    {
        const auto &[asym_id, seq_id] = key;

        // select highest occupancy for this residue's alternates
        std::string alt_id;
        float occupancy = 0;

        for (const auto &[alt_key, alt_value] : value)
        {
            if (occupancy < alt_value)
            {
                alt_id = alt_key;
                occupancy = alt_value;
            }
        }

        cif::condition cond = "label_asym_id"_key == asym_id and "label_alt_id"_key == alt_id;

        if (seq_id == 0)
            cond = std::move(cond) and "label_seq_id"_key == cif::null;
        else
            cond = std::move(cond) and "label_seq_id"_key == seq_id;

        atom_site.erase(std::move(cond));
    }

    std::cout << f;

    return 0;
}
```
Of course you need to extend/alter this code to suit your needs.
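As one possible variation (not part of the example above): if what you actually want is to keep only the highest-occupancy alternate, the condition in the second loop could be inverted so the other alternates of the residue are erased instead, along the lines of:

```cpp
// Hypothetical variation: erase every alternate of this residue except the
// highest-occupancy one found above (and leave non-alternate atoms alone).
cif::condition cond = "label_asym_id"_key == asym_id and
                      "label_alt_id"_key != alt_id and
                      "label_alt_id"_key != cif::null;

if (seq_id == 0)
    cond = std::move(cond) and "label_seq_id"_key == cif::null;
else
    cond = std::move(cond) and "label_seq_id"_key == seq_id;

atom_site.erase(std::move(cond));
```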
I merged your code into a new pull request here: https://github.com/PDB-REDO/libcifpp/pull/68
Could you please have a look at that code and check whether it does what you want?
From what I gathered, this is not exactly what you were looking for, since writing out the data will reveal the hidden atoms again. Perhaps I will have to implement some push-back option.