
I/O from and to (online) databases

pfebrer opened this issue 2 years ago · 28 comments

Just as there is a very good standardized framework to read and write files (i.e. siles), I think it would be highly beneficial for sisl to have a standardized framework to interact with databases.

There is already a tendency to use databases instead of files, and in the future databases could become the main way people take input and store output for their simulations (who knows).

Not only geometries are stored in databases, but all sorts of data (bands, DOS, etc.).
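As a rough sketch of what I imagine (all names here are hypothetical; they just mirror the sile conventions):

```python
import sisl


# Hypothetical sketch only: no Database class or get_database function exists
# in sisl today; the idea mirrors the get_sile registry for files.
class Database:
    """A database backend exposes the same read_* methods as a sile."""

    def read_geometry(self, **query) -> sisl.Geometry:
        raise NotImplementedError


# Hypothetical usage:
#   db = get_database("materials-project")
#   geom = db.read_geometry(id="mp-22862")
```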

pfebrer avatar Feb 19 '22 13:02 pfebrer

Good idea. However, if these databases provide their own interface API in Python, I think we should not put it in here.

zerothi avatar Feb 19 '22 20:02 zerothi

Yes, I had also thought about this. However, standardized methods like read_geometry to get a Geometry directly still seem like a good idea 🤔

You don't think so?

pfebrer avatar Feb 19 '22 21:02 pfebrer

> Yes, I had also thought about this. However, standardized methods like read_geometry to get a Geometry directly still seem like a good idea 🤔
>
> You don't think so?

I do, I have nothing against this! :) I just think we shouldn't duplicate the databases' work.

zerothi avatar Feb 19 '22 21:02 zerothi

Is there any particular database you had in mind?

zerothi avatar Feb 20 '22 14:02 zerothi

https://pubchem.ncbi.nlm.nih.gov/compound/241

Seems like a good starting point! Perhaps we should think about async I/O in this case? Or storing files in the sisl_temp folder for faster subsequent access?
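For reference, a minimal sketch of pulling that compound through PubChem's PUG REST interface (the JSON keys assumed below follow the PC_Compounds record layout):

```python
import json
from urllib.request import urlopen

import numpy as np

# Fetch the 3D record for CID 241 (benzene) and extract atomic numbers and
# coordinates.  The keys used here assume PubChem's PC_Compounds record format.
url = ("https://pubchem.ncbi.nlm.nih.gov/rest/pug/"
       "compound/cid/241/record/JSON?record_type=3d")
with urlopen(url) as response:
    record = json.load(response)["PC_Compounds"][0]

Z = record["atoms"]["element"]                # atomic numbers
conf = record["coords"][0]["conformers"][0]   # first conformer
xyz = np.stack([conf["x"], conf["y"], conf["z"]], axis=1)
print(len(Z), xyz.shape)  # 12 atoms for benzene
```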

zerothi avatar Feb 20 '22 14:02 zerothi

I had in mind the typical ones: Materials Project, PDB... But my point is that I don't have much knowledge about them, and some may be trendy now but not relevant in the future. That's why, to me, it is more important to have the framework rather than just support for a particular database.

The one you proposed seems good!

I think by default it should not be async, since most users will just want to get a structure and do something with it, so they'll have to wait anyway. However, it is definitely nice to have async requests as an option. One potential use of async I/O would be to request multiple structures from potentially different databases and wait until you have them all.
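Something along these lines (a toy sketch; fetch_structure is a hypothetical stand-in for whatever each database backend would implement):

```python
import asyncio


# Toy sketch of the "request many, wait for all" pattern.
async def fetch_structure(database: str, entry: str) -> str:
    # A real backend would perform an HTTP request here.
    await asyncio.sleep(0.1)  # placeholder for network latency
    return f"{database}:{entry}"


async def main():
    structures = await asyncio.gather(
        fetch_structure("pubchem", "241"),
        fetch_structure("materials-project", "mp-22862"),
    )
    print(structures)


asyncio.run(main())
```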

I'm not sure about the sisl_temp folder. Maybe the database APIs already have some kind of caching, so there's no need for it. Also, it might not be straightforward to implement caching for all databases. I don't know, I guess it depends on what the APIs do; I'm not familiar with any of them 😅

pfebrer avatar Feb 20 '22 14:02 pfebrer

I definitely think we should have a place to put downloaded database files; I hardly think the APIs would do this on their own... could be wrong though. Probably not in a temp folder, but in a datadir or something.

zerothi avatar Feb 20 '22 17:02 zerothi

Hmm, ok, but how would you do it? Would you first convert to sisl objects and then store? Or implement storage (if there is none) for each database and each particular query to that database?

I think the second option is not feasible since it is very hard to implement and maintain.

pfebrer avatar Feb 20 '22 17:02 pfebrer

> Hmm, ok, but how would you do it? Would you first convert to sisl objects and then store? Or implement storage (if there is none) for each database and each particular query to that database?
>
> I think the second option is not feasible since it is very hard to implement and maintain.

I don't know, it depends on the format the database can ship the data in. For instance, the molecule database has a JSON output file; that entire file could easily be stored per molecule. I think keeping the data as provided is safest, as that makes it easy to compare against a new version provided by the database.
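A simple store-as-provided pattern could look like this (paths and names are purely illustrative):

```python
import json
from pathlib import Path
from urllib.request import urlopen

# Illustrative only: keep the database's raw response on disk under a
# per-database data directory, and hit the network only on a cache miss.
DATADIR = Path.home() / ".sisl" / "databases"


def fetch_record(database: str, entry: str, url: str) -> dict:
    cached = DATADIR / database / f"{entry}.json"
    if not cached.exists():
        cached.parent.mkdir(parents=True, exist_ok=True)
        with urlopen(url) as response:
            cached.write_bytes(response.read())  # store exactly as provided
    return json.loads(cached.read_text())
```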

Secondly, the reason for providing an on-disk mirror of certain database items is that some supercomputers may not allow downloading from the internet, in which case it is necessary to retrieve the data by some other means. This is why I think it is a somewhat major task, and unless there is high demand, I am probably not going to pursue it directly; it is quite a task to do this, generally, especially since the local mirror is required. Also, consider the lag of fetching a geometry from a remote source. Well, interacting with web pages/online databases is quite the task! :)

zerothi avatar Feb 20 '22 19:02 zerothi

I think the lag of fetching a structure might not be relevant in most cases; the convenience of having a molecule's coordinates will by far outweigh it. And if anything, this lag will get smaller in the future.

Regarding supercomputers, I also don't see it as a very big problem. You can fetch whatever you want outside the supercomputer and then compute inside it. Anyway, if people start using online databases, supercomputers will probably at some point allow connections to those particular addresses.

Also, not all supported databases need to be online; we could also have offline databases (similar to ASE's g2 database).
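For example, ASE's g2 collection ships with the package, so it works offline:

```python
from ase.build import molecule

# The g2 set is bundled with ASE; no network access needed.
benzene = molecule("C6H6")
print(benzene.get_chemical_formula(), benzene.positions.shape)  # C6H6 (12, 3)
```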

I really think there should be a way in sisl to easily build structures using coordinates that are already available in databases instead of being limited by which ones are implemented in the code.

pfebrer avatar Feb 20 '22 20:02 pfebrer

I still agree; I am not against it!

But I think you are underestimating the problems I foresee ;) I could, on the other hand, also be wrong ;)

Feel free to come up with some code; perhaps a sileremote should be created.

zerothi avatar Feb 20 '22 20:02 zerothi

On the other hand, this seems very similar to pymatgen, and perhaps sisl should not move into this territory, but rather do something a little different?

zerothi avatar Feb 20 '22 21:02 zerothi

> On the other hand, this seems very similar to pymatgen, and perhaps sisl should not move into this territory, but rather do something a little different?

Yeah, it is indeed similar in the sense of fetching data from databases.

But well, I/O from files is also similar to any other scientific Python package that implements reading and writing files. Actually, I think it would be very beneficial for the scientific community if we had a Python package dedicated solely to input/output that everybody would use and contribute to, instead of implementing the same things in different packages (each of which then, of course, misses some implementations). Do you think people (including you) would be open to this?

pfebrer avatar Feb 20 '22 21:02 pfebrer

> > On the other hand, this seems very similar to pymatgen, and perhaps sisl should not move into this territory, but rather do something a little different?
>
> Yeah, it is indeed similar in the sense of fetching data from databases.
>
> But well, I/O from files is also similar to any other scientific Python package that implements reading and writing files. Actually, I think it would be very beneficial for the scientific community if we had a Python package dedicated solely to input/output that everybody would use and contribute to, instead of implementing the same things in different packages (each of which then, of course, misses some implementations). Do you think people (including you) would be open to this?

I agree on this last point; I think pymatgen is as close to this as one can get. So I think efforts should be put there... no need to reinvent the wheel?

zerothi avatar Feb 21 '22 07:02 zerothi

No, sorry, openbabel is the one I had in mind, http://openbabel.org/wiki/Main_Page

zerothi avatar Feb 21 '22 07:02 zerothi

Yes, I also thought about openbabel. But it is just yet another tool that does lots of things, with its own input/output implementations. Plus, its support for Python doesn't seem to be very clear.

I would love to see a library specific to parsing and writing files (that would be its sole purpose) that everyone would contribute to and use. It would return/accept very raw datatypes (e.g. Python built-ins and NumPy arrays), and each package would use them as they wish. Then if you have, e.g., a DFT code, you could just implement the parsers in that centralized library and make the files parsable by all packages automatically.
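A sketch of the kind of interface I mean (everything here is hypothetical):

```python
import numpy as np


# Hypothetical interface: parsers return only raw datatypes (dicts, lists,
# NumPy arrays), and each downstream package builds its own objects from them.
def read_xyz(path: str) -> dict:
    with open(path) as fh:
        natoms = int(fh.readline())
        comment = fh.readline().strip()
        species, xyz = [], []
        for _ in range(natoms):
            symbol, x, y, z = fh.readline().split()[:4]
            species.append(symbol)
            xyz.append([float(x), float(y), float(z)])
    return {"comment": comment, "species": species, "xyz": np.array(xyz)}


# sisl could then do: sisl.Geometry(data["xyz"], atoms=data["species"])
```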

The parsers are already written; it would just be a matter of reorganizing them. Of course, this is only meaningful if people agree to have it as a common source. Do you think sisl could use something like this? If so, it's just a matter of talking to people and seeing if they are open to it.

pfebrer avatar Feb 21 '22 09:02 pfebrer

> Yes, I also thought about openbabel. But it is just yet another tool that does lots of things, with its own input/output implementations. Plus, its support for Python doesn't seem to be very clear.

Agreed...

> I would love to see a library specific to parsing and writing files (that would be its sole purpose) that everyone would contribute to and use. It would return/accept very raw datatypes (e.g. Python built-ins and NumPy arrays), and each package would use them as they wish. Then if you have, e.g., a DFT code, you could just implement the parsers in that centralized library and make the files parsable by all packages automatically.
>
> The parsers are already written; it would just be a matter of reorganizing them. Of course, this is only meaningful if people agree to have it as a common source. Do you think sisl could use something like this? If so, it's just a matter of talking to people and seeing if they are open to it.

I think this would be very good, but I have a feeling it will be very hard to push: it requires several years of full commitment, keeping the IO code up to date, and ensuring that various versions of files are handled. If the API were simple, it is definitely something sisl should use. But I have to say that this is a major undertaking, since it requires a thorough knowledge of all DFT codes' input/output mechanisms.

Lastly, I can see on sisl's side that not that many people are using it to convert between different file formats. While it can read/write POSCAR files, people tend to use vasp4py or some designated packages for the DFT codes they use. This will make adoption even more difficult. If anything, it requires onboarding the maintainers of the Python DFT packages to accept and join forces: abi4py, vasp4py, SIESTA (sisl), BigDFT, CP2K, etc. If they can be convinced of this, then we have something ;)

zerothi avatar Feb 21 '22 09:02 zerothi

Nice! Well, to me the hardest part is convincing the maintainers of Python packages that already have an input/output framework to move to a common solution.

Regarding the DFT codes (or any other scientific computing projects), it looks to me like there would not be many arguments against it. If you know that this is the only library that all packages use, you just have to maintain your parsers in this library. E.g. for SIESTA you would basically do the maintenance that you already do in sisl, but in this common package. If you don't keep it up to date, it is your problem (SIESTA's problem): basically, SIESTA as a code would be giving up on providing Python support.

> Lastly, I can see on sisl's side that not that many people are using it to convert between different file formats. While it can read/write POSCAR files, people tend to use vasp4py or some designated packages for the DFT codes they use. This will make adoption even more difficult.

I don't see this as an argument against the library. People who use vasp4py do it because it has much better support for VASP, not necessarily because vasp4py is a better library. Same for SIESTA and sisl. No one will implement input/output for SIESTA better than you, and your implementations for VASP will never be as good/broad as vasp4py's. If there is a common library, then the choice of Python package is no longer driven by how it plays with your particular DFT code. That means you have more options, which is good :).

pfebrer avatar Feb 21 '22 18:02 pfebrer

> Nice! Well, to me the hardest part is convincing the maintainers of Python packages that already have an input/output framework to move to a common solution.
>
> Regarding the DFT codes (or any other scientific computing projects), it looks to me like there would not be many arguments against it. If you know that this is the only library that all packages use, you just have to maintain your parsers in this library. E.g. for SIESTA you would basically do the maintenance that you already do in sisl, but in this common package. If you don't keep it up to date, it is your problem (SIESTA's problem): basically, SIESTA as a code would be giving up on providing Python support.
>
> > Lastly, I can see on sisl's side that not that many people are using it to convert between different file formats. While it can read/write POSCAR files, people tend to use vasp4py or some designated packages for the DFT codes they use. This will make adoption even more difficult.
>
> I don't see this as an argument against the library. People who use vasp4py do it because it has much better support for VASP, not necessarily because vasp4py is a better library. Same for SIESTA and sisl. No one will implement input/output for SIESTA better than you, and your implementations for VASP will never be as good/broad as vasp4py's. If there is a common library, then the choice of Python package is no longer driven by how it plays with your particular DFT code. That means you have more options, which is good :).

The argument against it is that those maintainers will most likely just maintain their own parsers, not worrying too much about a separate library. Having a dependency on a library whose releases you must wait for is very often not so productive. Just take sisl and your sisl-gui and the correlation of their releases: at the very least, the IO library would have to be released whenever a dependent package requests it. If not, they will not contribute. Since they have nothing to lose by maintaining and keeping their own versions up to date, it will be very hard to get such things adopted. I think, even for sisl. :)

So it requires some planning and initial commitment. Otherwise, I have to say, it will be doomed to fail... :( Sorry about my pessimism! ;)

zerothi avatar Feb 21 '22 19:02 zerothi

Hmm, you are right about releases.

But that could be solved simply by packaging the IO library with your release, right? 🤔

pfebrer avatar Feb 21 '22 19:02 pfebrer

Then you can do it whenever you want.

pfebrer avatar Feb 21 '22 19:02 pfebrer

Releasing frequently (say every week or two) and reliably (always on schedule) could also be a solution for this.

pfebrer avatar Feb 21 '22 19:02 pfebrer

> Hmm, you are right about releases.
>
> But that could be solved simply by packaging the IO library with your release, right? 🤔

But fixes to the code should still be accepted! It puts a high load on maintenance, and those who are maintaining will burn out. There need to be many more than a single person doing this... Just a single mess-up that does not satisfy a dependent package will make them leave, simply because it is extra work on their end (they would have to provide fixes, etc.).

zerothi avatar Feb 21 '22 19:02 zerothi

I guess each code could have a representative as a maintainer; then they could push to their modules whenever they want (unless it is some drastic change). Or would that mean there are too many maintainers?

pfebrer avatar Feb 21 '22 19:02 pfebrer

And accepting code is not as easy in a standardized package because it requires ABI and API consistency, and only the maintainers will truly care about this (just take my code reviews of your PRs ;)). And then you still need to really convince people that this package is better than doing packagea -> ase -> packageb. ASE is already standard in DFT codes for its geometry class.

zerothi avatar Feb 21 '22 19:02 zerothi

> And then you still need to really convince people that this package is better than doing packagea -> ase -> packageb. ASE is already standard in DFT codes for its geometry class.

Basically, you need to convince ASE that it is better to have a shared IO library. Then, if people want to use ASE, they will just use ASE with no extra packages.

> And accepting code is not as easy in a standardized package because it requires ABI and API consistency, and only the maintainers will truly care about this (just take my code reviews of your PRs ;))

You are right. My idea to make this smoother was that the code-specific parsers would implement as little Python code as possible. The core API would be based on some parser generator (e.g. https://github.com/erikrose/parsimonious/) and the particular implementations would just specify the grammar for that particular file format, which is neat. I just don't know how many edge cases this can lead to (i.e., can everything be parsed with this approach?) or whether it is effective in terms of speed.
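As a toy example of the grammar-only approach with parsimonious (an untested sketch, parsing one line of an xyz-style coordinate block):

```python
from parsimonious.grammar import Grammar
from parsimonious.nodes import NodeVisitor

# PEG grammar for a single "symbol x y z" line.
xyz_line = Grammar(
    """
    line   = symbol ws num ws num ws num
    symbol = ~"[A-Za-z]+"
    num    = ~"-?[0-9]+([.][0-9]+)?"
    ws     = ~"[ \t]+"
    """
)


class LineVisitor(NodeVisitor):
    def visit_line(self, node, children):
        # The sequence has seven children: symbol, ws, x, ws, y, ws, z.
        symbol, _, x, _, y, _, z = children
        return symbol, (x, y, z)

    def visit_symbol(self, node, children):
        return node.text

    def visit_num(self, node, children):
        return float(node.text)

    def generic_visit(self, node, children):
        return children or node


print(LineVisitor().visit(xyz_line.parse("C 0.000 1.396 0.000")))
# -> ('C', (0.0, 1.396, 0.0))
```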

pfebrer avatar Feb 21 '22 19:02 pfebrer

> > And accepting code is not as easy in a standardized package because it requires ABI and API consistency, and only the maintainers will truly care about this (just take my code reviews of your PRs ;))
>
> You are right. My idea to make this smoother was that the code-specific parsers would implement as little Python code as possible. The core API would be based on some parser generator (e.g. https://github.com/erikrose/parsimonious/) and the particular implementations would just specify the grammar for that particular file format, which is neat. I just don't know how many edge cases this can lead to (i.e., can everything be parsed with this approach?) or whether it is effective in terms of speed.

So you want the people responsible for each code to implement the grammar using a PEG? Sorry, but I don't think you can convince people to write this ;)

zerothi avatar Feb 22 '22 07:02 zerothi

Another database: https://materialsproject.org/materials/mp-22862/

zerothi avatar Feb 28 '22 09:02 zerothi