datamol Add dm.read_records in io module

This would simplify things when querying a REST API whose JSON response contains a list of molecules. A few examples of such REST APIs:

Molport
Chemspace
Mcule
CDD Vault
Dotmatics
PubChem
Etc...

Sep 19 '22 12:09 MichelML

Similarly, add dm.read_yaml for convenience.

Sep 19 '22 13:09 zhu0619

@zhu0619 can you log this in another issue, providing context and explaining the use cases of read_yaml?

Sep 19 '22 13:09 MichelML

Having to maintain those API layers could be quite time-consuming as they tend to change over time.

Also, from experience working with some of them, it can be tricky to get a unified datamol API given the differences in the data returned by the providers above (while it would be nice, this is not necessarily an important point here).

Sep 26 '22 12:09 hadim

Sorry, maybe the examples were overwhelming here. In any case, I think I can make a PR myself for this.

The idea is simply to replicate this case:

>>> import pandas as pd
>>> data = [{'col_1': 3, 'col_2': 'a'},
...         {'col_1': 2, 'col_2': 'b'},
...         {'col_1': 1, 'col_2': 'c'},
...         {'col_1': 0, 'col_2': 'd'}]
>>> pd.DataFrame.from_records(data)
   col_1 col_2
0      3     a
1      2     b
2      1     c
3      0     d

Since this case matches what most REST APIs usually return (a list of dicts), all that remains is to add the logic for a smiles_column param, like the other io methods.

There is definitely a way to have a simple v1 while making sure to specify its limitations imho.

In any case, this is not urgent or a priority, and it is something I can add myself when I need it for a production case.

Sep 26 '22 13:09 MichelML

Ok, and sorry, I think I misunderstood here xD

I use lists of dicts to create DataFrames all the time, and you can simply do pd.DataFrame(list_of_dict). That being said, I am not sure what logic you want to add to datamol related to this. Usually a simple workflow is:

df = pd.DataFrame(list_of_dict)
df["mol"] = df[smiles_column].apply(dm.to_mol)

But feel free to post more examples or open a PR if you have something else in mind.

Sep 26 '22 13:09 hadim

That's exactly what I want, as a one-liner @hadim :smile:, which is why I say it's not a priority, just convenient and another case datamol can handle!

df = dm.from_records(list_of_dict, smiles_column="smiles")

It is no more or less complex than dm.read_csv, already implemented here: https://github.com/datamol-org/datamol/blob/main/datamol/io.py#L27
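
A minimal sketch of what such a helper could look like, purely as an illustration. The function name `from_records`, the `mol_column` parameter, and the injectable `to_mol` argument are all assumptions for this example; none of this exists in datamol. It only wraps the two-line workflow shown above and falls back to `dm.to_mol` when no converter is passed.

```python
from typing import Any, Callable, Optional, Sequence

import pandas as pd


def from_records(
    records: Sequence[dict],
    smiles_column: Optional[str] = None,
    mol_column: str = "mol",
    to_mol: Optional[Callable[[str], Any]] = None,
) -> pd.DataFrame:
    """Hypothetical helper (not part of datamol): list of dicts -> DataFrame.

    If `smiles_column` is given, a molecule column is added by applying
    `to_mol` to each SMILES string.
    """
    # Build the DataFrame from the list of dicts, as in the pandas example.
    df = pd.DataFrame.from_records(records)

    if smiles_column is not None:
        if to_mol is None:
            # Imported lazily so the sketch can be tried without datamol
            # installed when a custom converter is supplied.
            import datamol as dm

            to_mol = dm.to_mol
        df[mol_column] = df[smiles_column].apply(to_mol)

    return df
```

With datamol installed, the call would then read `df = from_records(list_of_dict, smiles_column="smiles")`. Known limitation of this v1 sketch: it assumes every record has the SMILES key at the top level and does nothing about nested JSON.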

Sep 26 '22 16:09 MichelML

Note that in a lot of cases, you might want to use pd.json_normalize instead.
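
As a rough illustration of why: when the records are nested, pd.json_normalize flattens the inner dicts into dotted column names, which a plain pd.DataFrame call does not do. The payload below is made up for the example and does not reflect the response format of any real provider.

```python
import pandas as pd

# Made-up nested records, standing in for a JSON response body.
records = [
    {"id": 1, "structure": {"smiles": "CCO"}, "supplier": {"name": "A"}},
    {"id": 2, "structure": {"smiles": "CCN"}, "supplier": {"name": "B"}},
]

# Nested dicts become columns named "<outer>.<inner>" (the default sep is ".").
flat = pd.json_normalize(records)

# With plain pd.DataFrame, the nested dicts stay as dict objects in their cells.
nested = pd.DataFrame(records)

smiles_column = "structure.smiles"
print(flat[smiles_column].tolist())
```

The flattened `structure.smiles` column can then be passed to `dm.to_mol` in the same way as in the workflow above.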

Sep 26 '22 17:09 maclandrol

Closing here. It's not clear to me whether we need this in datamol.

Please re-open if needed.

Apr 17 '23 11:04 hadim
