libsbml
libsbml copied to clipboard
Make `getElementBySId` and `getElementByMetaId` O(1) lookups
Hi all,
Two of the key functions to write algorithms and utilities based on libsbml are the getElementBySId and getElementByMetaId functions. This functions run in O(n) with n being the number of elements in the SBML model.
Many people assume that this is a very fast lookup of O(1), which it should be. Consequently, many code and algorithm is substantially slowed down by this functions.
The correct/fast way is to create a HashMap in your algorithm from getListOfAllElements and then use the HashMap without ever touching getElementBySId.
There should be an internal HashMap which provides O(1) lookups by SId and MetaId. As far as I remember this is how JSBML is implementing this functions. Creating the HashMap costs O(n) which is as much as a single call got the getElementBy* functions
As a consequence a lot of code using libsbml would see major speedups, probably even core algorithms such as comp flattening.
Best Matthias
LibSBML already provides a way to return alternative structures. So if you'd rather have a hashmap, you could get one as follows:
- create a subclass from ElementFilter, with the hashmap of your choice as member variable.
- overwrite the filter method, it should return false (so no space is wasted in the list that will be created later), and will be called once with each SBase object. Add it to your map (if you are interested in the kind of object) (or even multiple maps, if you want one for different kind of things, it might speed things up having one for just species, one for other things).
- now call getAllElements with your elementFilter, ignoring the result, as you are not really interested in the return value of it.
- after the call, your elementFilter object will contain the hash map (or maps) in whatever format you like.
I would be against changing the return type of the existing functions (ids might not be unique across all sbase objects / plugins)
I like Frank's solution better than anything else, because libsbml documents can be edited, which could invalidate any internal hashmap any time any object is added or removed, and when setId or setMetaId is called. You also have to somehow deal with invalid SBML document objects where there are multiple objects with the same SId or MetaId.
But if you do what Frank suggests, you have control of the SBML document object, and know when it's been edited.
@fbergmann and @luciansmith Thanks for the feedback. This sounds like the solution I was looking for. Perhaps the documentation/doc strings of the functions getElementBySId and getElementByMetaId could be updated to clearly indicate that this is a very costly operation. Currently this is not obvious from the documentation and many users will just use the functions without understanding/realizing the performance problems this can generate (especially if using in loops).
Currently the python doc string only reads the following without indicating the O(n) runtime:
def getElementBySId(self, id):
"""
getElementBySId(SBMLDocument self, string id) -> SBase
Returns the first child element found that has the given 'id' in the
model-wide SId namespace, or 'None' if no such object is found.
Parameter 'id' is string representing the id of the object to find.
Returns pointer to the first element found with the given 'id'.
"""
Alternatively a solution could be to provide two new helper functions getElementBySIdMap and getElementByMetaIdMap. These would use the lists generated by getAllElementIdList with the same limitations, i.e., a call to populateAllElementIdList would be necessary to update the hashmap.