libcifpp icon indicating copy to clipboard operation
libcifpp copied to clipboard

Feature/enrich structure options

Open clebreto opened this issue 10 months ago • 7 comments

clebreto avatar Mar 07 '25 14:03 clebreto

  1. [x] Still need to conceal the changes to model.cpp to the actual changes.
  2. [ ] Handling alternate location selection is tricky, especially in the absence of a chosen alternate location (usually given by a letter). In all cases it seems to require looking at the level of the residue what the possible alternate locations are. It is further complicated by handling possible equality between occupancy and b_factor. Without this constraint, the StructureOpenOptions could likely just return a "static" condition. I would like to discuss any better way to implement this.
  3. [ ] Retrieve the default behavior when not providing a StructureOpenOptions object.

clebreto avatar Mar 07 '25 14:03 clebreto

What is the status of this pull request, are you still working on it? Do you need help?

mhekkel avatar Mar 31 '25 10:03 mhekkel

What is the status of this pull request, are you still working on it? Do you need help?

Dear @mhekkel ,

The PR would clearly benefit from your help :

  • Do you see some interest in the PR expected outcomes (filtering models) ? Is there any other way to achieve the same outcomes ?
  • Do you have an idea as to how we could handle alternate locations in a clever fashion ?

clebreto avatar Apr 02 '25 08:04 clebreto

The PR would clearly benefit from your help :

* Do you see some interest in the PR expected outcomes  (filtering models) ? Is there any other way to achieve the same outcomes ?

Filtering models, if I want to do this I filter the underlying cif datablocks first. A cif::mm::model is only a flyweight, a view on the underlying data. But I can see some use for it.

* Do you have an idea as to how we could handle alternate locations in a clever fashion ?

Yes, I think I do. I will create an alternative implementation for you to review. Give me a couple of days.

mhekkel avatar Apr 08 '25 12:04 mhekkel

The PR would clearly benefit from your help :

* Do you see some interest in the PR expected outcomes  (filtering models) ? Is there any other way to achieve the same outcomes ?

Filtering models, if I want to do this I filter the underlying cif datablocks first. A cif::mm::model is only a flyweight, a view on the underlying data. But I can see some use for it.

I used to filter the datablocks by erasing atom entries that I did not want to end up in the model, but the erasure was too costly.

* Do you have an idea as to how we could handle alternate locations in a clever fashion ?

Yes, I think I do. I will create an alternative implementation for you to review. Give me a couple of days.

So nice ! Thanks.

clebreto avatar Apr 08 '25 12:04 clebreto

One more question though. What is your goal. I mean, when you skip atoms when constructing a model, these atoms will remain in the underlying data store. So if you then write out the data, these skipped atoms will be written as well.

If you only want to inspect the data, this is not a problem, but if you want to write out mmCIF files based on the data you read, this is a problem of course.

I can also give you ideas on how to remove alternates more efficiently, if that's what you actually want.

mhekkel avatar Apr 08 '25 14:04 mhekkel

One more question though. What is your goal. I mean, when you skip atoms when constructing a model, these atoms will remain in the underlying data store. So if you then write out the data, these skipped atoms will be written as well.

If you only want to inspect the data, this is not a problem, but if you want to write out mmCIF files based on the data you read, this is a problem of course.

At the time being, we convert the cifpp model into our own molecular structure (the mapping being pretty straightforward). But if we were to try to write our own structure back to mmcif format (for instance as we compute new conformations of the molecule), this could be a bottleneck I assume (we currently only support writing back to PDB files).

I can also give you ideas on how to remove alternates more efficiently, if that's what you actually want.

As an example, this is what we currently do to filter the tables :

 // Filter hydrogen atoms
          if (!m_loaded_hydrogens) {
            m_atm_discarded[s_i] += db["atom_site"].erase(cif::key("type_symbol") == "H" or cif::key("type_symbol") == "D");
          }

          // Filter high "temperature" atoms
          m_atm_discarded[s_i] += db["atom_site"].erase(cif::key("B_iso_or_equiv") > m_loaded_b_factor_limit);

          // Filter atoms based on model id
          if(!m_loaded_models.empty()) {
            if (!m_loaded_models[s_i].empty()) {
              cif::condition condition;
              for (std::size_t m_i : m_loaded_models[s_i]) {
                cif::condition temp = operator not(cif::key("pdbx_PDB_model_num") == m_i);
                condition = operator and(std::move(condition), std::move(temp));
              }
              m_atm_discarded[s_i] += db["atom_site"].erase(std::move(condition));
            }
          }

And for the record, there is an actual gain with the PR on our en (not erasing the atoms by constructing the model on a filtered representation of the atoms).

clebreto avatar Apr 08 '25 14:04 clebreto

Hello @mhekkel,

Is there anything I can try to further improve the PR ?

clebreto avatar Jun 03 '25 13:06 clebreto

Hello @mhekkel,

Is there anything I can try to further improve the PR ?

Ah, yes, sorry... I was busy with another project. But your message is a helpful reminder, I'll start working on this next week.

The most important is that creating a condition like you did is not the best approach, it is not really scalable. But Ill try to come up with an alternative.

mhekkel avatar Jun 05 '25 05:06 mhekkel

Hello @mhekkel, Is there anything I can try to further improve the PR ?

Ah, yes, sorry... I was busy with another project. But your message is a helpful reminder, I'll start working on this next week.

The most important is that creating a condition like you did is not the best approach, it is not really scalable. But Ill try to come up with an alternative.

Fantastic, looking forward to seeing the alternative approach.

clebreto avatar Jun 05 '25 06:06 clebreto

One more question though. What is your goal. I mean, when you skip atoms when constructing a model, these atoms will remain in the underlying data store. So if you then write out the data, these skipped atoms will be written as well.

If you only want to inspect the data, this is not a problem, but if you want to write out mmCIF files based on the data you read, this is a problem of course.

At the time being, we convert the cifpp model into our own molecular structure (the mapping being pretty straightforward). But if we were to try to write our own structure back to mmcif format (for instance as we compute new conformations of the molecule), this could be a bottleneck I assume (we currently only support writing back to PDB files).

OK, so you want to write out some modified atoms. But what about the atoms you filtered out while loading the structure? Filtering in mm::structure only means the atoms are not used in the higher level API but they remain in the data file.

If you want do discard atoms from the cif::file, you should do that at the lower level, using category::erase.

After deleting the atoms you could create an mm::structure with this pruned data.

Of course that means that if you write out that data, those deleted atoms are gone. Don't know if that is a problem?

I guess that's almost what you're doing right now, in your example you're also erasing atoms.

mhekkel avatar Jun 10 '25 09:06 mhekkel

One more question though. What is your goal. I mean, when you skip atoms when constructing a model, these atoms will remain in the underlying data store. So if you then write out the data, these skipped atoms will be written as well.

If you only want to inspect the data, this is not a problem, but if you want to write out mmCIF files based on the data you read, this is a problem of course.

At the time being, we convert the cifpp model into our own molecular structure (the mapping being pretty straightforward). But if we were to try to write our own structure back to mmcif format (for instance as we compute new conformations of the molecule), this could be a bottleneck I assume (we currently only support writing back to PDB files).

OK, so you want to write out some modified atoms. But what about the atoms you filtered out while loading the structure? Filtering in mm::structure only means the atoms are not used in the higher level API but they remain in the data file.

If you want do discard atoms from the cif::file, you should do that at the lower level, using category::erase.

After deleting the atoms you could create an mm::structure with this pruned data.

Of course that means that if you write out that data, those deleted atoms are gone. Don't know if that is a problem?

It is currently not a problem.

I guess that's almost what you're doing right now, in your example you're also erasing atoms.

It is exactly what I was doing in my "client" code, I was removing the atoms from the db. But it was inefficient due to "erase" function calls.

You mentioned the fact using conditions was not the most efficient way to proceed ?

clebreto avatar Jun 10 '25 09:06 clebreto

You mentioned the fact using conditions was not the most efficient way to proceed ?

I saw code where you created very complex conditions, and those are inefficient.

What could help when simply removing some atoms is to set the validator for the atom_site category to null before the erase and restoring the validator afterwards. That will avoid the cascading code, code that will check if the deletion of an atom should have consequences elsewhere.

This is one area where I should optimise my code, btw.

mhekkel avatar Jun 10 '25 10:06 mhekkel

You mentioned the fact using conditions was not the most efficient way to proceed ?

I saw code where you created very complex conditions, and those are inefficient.

Understood, nonetheless it seems to me that handling alternate locations in a determinate manner is complex.

What could help when simply removing some atoms is to set the validator for the atom_site category to null before the erase and restoring the validator afterwards. That will avoid the cascading code, code that will check if the deletion of an atom should have consequences elsewhere.

Great, I will try this whenever I can.

This is one area where I should optimise my code, btw.

clebreto avatar Jun 10 '25 10:06 clebreto

This might be an alternative to your code to filter alternates based on occupancy. I would go for code like this.

This example removes the highest occupancy alternate.

#include <cif++.hpp>

#include <iostream>

#include <cassert>

int main(int argc, char * const argv[])
{
	using namespace cif::literals;

	cif::file f(argv[1]);
	auto &db = f.front();
	auto &atom_site = db["atom_site"];

	std::map<std::tuple<std::string,int>, std::map<std::string, float>> alts;

	for (const auto &[asym_id, seq_id, alt_id, occupancy] :
		atom_site.find<std::string, int, std::string, float>(
			"label_alt_id"_key != cif::null,
			"label_asym_id", "label_seq_id",
			"label_alt_id", "occupancy"
		))
	{
		auto key = std::make_tuple(asym_id, seq_id);

		if (auto i = alts.find(key); i != alts.end())
			i->second[alt_id] += occupancy;
		else
			alts[key][alt_id] = occupancy;
	}

	for (const auto &[key, value] : alts)
	{
		const auto &[asym_id, seq_id] = key;

		// select highest occupancy for this residue's alternates
		std::string alt_id;
		float occupancy = 0;
		for (const auto &[alt_key, alt_value] : value)
		{
			if (occupancy < alt_value)
			{
				alt_id = alt_key;
				occupancy = alt_value;
			}
		}

		cif::condition cond = "label_asym_id"_key == asym_id and "label_alt_id"_key == alt_id;
		
		if (seq_id == 0)
			cond = std::move(cond) and "label_seq_id"_key == cif::null;
		else
			cond = std::move(cond) and "label_seq_id"_key == seq_id;
		
		auto removed = atom_site.erase(std::move(cond));
	}

	std::cout << f;

	return 0;
}

Of course you need to extend/alter this code to suit your needs.

mhekkel avatar Jun 10 '25 12:06 mhekkel

I merged your code into a new pull request here: https://github.com/PDB-REDO/libcifpp/pull/68

Could you please have a look at that code, if it does what you want?

From what I gathered this is not exactly what you were looking for, since writing out the data will reveal the hidden atoms again. Perhaps I will have to implement some push back option

mhekkel avatar Jun 11 '25 11:06 mhekkel