datamol

Add dm.read_records in io module

Open MichelML opened this issue 3 years ago • 7 comments

This would simplify things when querying a REST API whose JSON response contains a list of molecules. Quick REST API examples:

  • Molport
  • Chemspace
  • Mcule
  • CDD Vault
  • Dotmatics
  • PubChem
  • Etc...

MichelML avatar Sep 19 '22 12:09 MichelML

Similarly, add dm.read_yaml for convenience.

zhu0619 avatar Sep 19 '22 13:09 zhu0619

@zhu0619 can you log this in another issue, with context explaining the use cases of read_yaml?

MichelML avatar Sep 19 '22 13:09 MichelML

Having to maintain those API layers could be quite time-consuming as they tend to change over time.

Also, from my experience working with some of them, it can be tricky to get a unified datamol API given how much the returned data differs between the providers above (while that would be nice, it is not necessarily an important point here).

hadim avatar Sep 26 '22 12:09 hadim

Sorry, maybe the examples were overwhelming here. In any case, I think I can make a PR myself for this.

The idea is simply to replicate this case:

>>> import pandas as pd
>>> data = [{'col_1': 3, 'col_2': 'a'},
...         {'col_1': 2, 'col_2': 'b'},
...         {'col_1': 1, 'col_2': 'c'},
...         {'col_1': 0, 'col_2': 'd'}]
>>> pd.DataFrame.from_records(data)
   col_1 col_2
0      3     a
1      2     b
2      1     c
3      0     d

Since this case matches what most REST APIs usually return (a list of dicts), all that remains would be to add the logic for a smiles_column param, like the other io methods.

There is definitely a way to have a simple v1 while making sure to specify its limitations imho.

In any case, this is not urgent or a priority, and it is something I can add myself when I need it for a production case.

MichelML avatar Sep 26 '22 13:09 MichelML

Ok and sorry I think I misunderstood here xD

I use lists of dicts to create DataFrames all the time, and you can simply do pd.DataFrame(list_of_dict). That being said, I am not sure what logic you want to add to datamol for this. Usually a simple workflow is:

import datamol as dm
import pandas as pd

df = pd.DataFrame(list_of_dict)
df["mol"] = df[smiles_column].apply(dm.to_mol)

But feel free to post more examples or open a PR if you have something else in mind.

hadim avatar Sep 26 '22 13:09 hadim

That's exactly what I want, as a one-liner @hadim :smile:, which is why I say it's not a priority, just convenient and another case datamol can handle!

df = dm.from_records(list_of_dict, smiles_column="smiles")

It is no more or less complex than dm.read_csv, which is already implemented here: https://github.com/datamol-org/datamol/blob/main/datamol/io.py#L27
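For illustration only, here is a minimal sketch of what such a helper could look like. Nothing like it exists in datamol as far as this thread shows: the name `from_records` comes from the one-liner above, the `mol_column` parameter is assumed by analogy with the other io readers, and the `to_mol` parameter is a hypothetical hook (defaulting to `dm.to_mol`) added here so the converter can be swapped out.

```python
from typing import Any, Callable, Mapping, Optional, Sequence

import pandas as pd


def from_records(
    records: Sequence[Mapping[str, Any]],
    smiles_column: Optional[str] = "smiles",
    mol_column: str = "mol",
    to_mol: Optional[Callable[[str], Any]] = None,
) -> pd.DataFrame:
    """Hypothetical helper: build a DataFrame from a list of dicts and
    add a molecule column computed from the SMILES column."""
    df = pd.DataFrame.from_records(records)

    # With no SMILES column requested, this is just DataFrame.from_records.
    if smiles_column is None:
        return df

    if smiles_column not in df.columns:
        raise KeyError(f"Column {smiles_column!r} not found in the records.")

    # Assumption: fall back to datamol's converter when none is supplied.
    if to_mol is None:
        import datamol as dm

        to_mol = dm.to_mol

    df[mol_column] = df[smiles_column].apply(to_mol)
    return df
```

With datamol installed, `from_records(list_of_dict, smiles_column="smiles")` would return the records as a DataFrame with an extra `mol` column, which is the two-line workflow above folded into one call.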

MichelML avatar Sep 26 '22 16:09 MichelML

Note that in a lot of cases you might want to use pd.json_normalize instead.
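As a rough illustration of why: when the records are nested, `pd.json_normalize` flattens the inner dicts into dotted column names, which a plain `pd.DataFrame(list_of_dict)` would leave as dict-valued cells. The response shape below is invented for the example and does not correspond to any of the providers listed above.

```python
import pandas as pd

# Hypothetical API response; real provider payloads will differ.
response = {
    "results": [
        {"id": 1, "structure": {"smiles": "CCO"}, "price": {"amount": 10.0}},
        {"id": 2, "structure": {"smiles": "c1ccccc1"}, "price": {"amount": 25.5}},
    ]
}

# Nested keys become dotted column names such as "structure.smiles".
df = pd.json_normalize(response["results"])

# The flattened name is what would then be passed as the SMILES column.
smiles_column = "structure.smiles"
smiles = df[smiles_column].tolist()
```

The molecule column can then be added the same way as before, for instance with `df[smiles_column].apply(dm.to_mol)`.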

maclandrol avatar Sep 26 '22 17:09 maclandrol

Closing here. It's not clear to me whether we need this in datamol.

Please re-open if needed.

hadim avatar Apr 17 '23 11:04 hadim