biopandas

Column 'line_idx' gets 'object' dtype for empty frames.

Open ZivBA opened this issue 5 years ago • 0 comments

If a PDB file has no records of a certain type (for instance, no HETATM or no ATOM), the resulting empty dataframe is created with the 'line_idx' column as type 'object' (the pandas default?). I've noticed that there's no 'line_idx' record in 'pdb_atomdict' (engines.py). One suggested fix would be to add it to that dict, removing the need for this 'hack' (starred) in 'pandas_pdb.py', line 363:

            df = pd.DataFrame(r[1], columns=[c['id'] for c in
                                             pdb_records[r[0]]] ** + ['line_idx']** )

Unfortunately, I have no idea whether this would have a cascading effect, as I'm certain this was done on purpose.
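For reference, the dtype behaviour itself can be reproduced with plain pandas. This is only a minimal sketch: the column names are made up for illustration and the frames are built by hand, not by the actual biopandas parsing code.

```python
import pandas as pd

# Hypothetical column names standing in for a PDB record layout.
columns = ["record_name", "atom_number", "line_idx"]

# No rows: pandas has nothing to infer from, so 'line_idx' ends up as 'object'.
empty = pd.DataFrame([], columns=columns)

# One row: 'line_idx' holds a Python int, so pandas infers an integer dtype.
filled = pd.DataFrame([["ATOM", "1", 0]], columns=columns)

print(empty["line_idx"].dtype)
print(filled["line_idx"].dtype)
```

The two frames share the same columns but disagree on the dtype of 'line_idx', which is the mismatch described in this issue.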

Another quick and dirty workaround would be to add the (starred) line:

            for c in pdb_records[r[0]]:
                try:
                    df[c['id']] = df[c['id']].astype(c['type'])
                except ValueError:
                # expect ValueError if float/int columns are empty strings
                    df[c['id']] = pd.Series(np.nan, index=df.index)
            **df['line_idx'] = df['line_idx'].astype(int)**

after the dtype correction loop that directly follows the code above at line 363.
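The effect of that extra cast on an empty frame can be checked in isolation. The snippet below is a sketch under the assumption that the empty frame looks like a plain `pd.DataFrame` with no rows; the column names are placeholders, not the real biopandas schema.

```python
import pandas as pd

# Hypothetical stand-in for the empty frame built for a record type with no rows.
df = pd.DataFrame([], columns=["atom_name", "line_idx"])

# The proposed extra line: cast 'line_idx' explicitly once the dtype loop is done.
df["line_idx"] = df["line_idx"].astype(int)

print(df.dtypes)
```

Casting an empty 'object' column to int does not raise, so the cast should be safe for both empty and populated frames, provided 'line_idx' only ever holds integers.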

This is an incredibly minor issue, but it has caused some unexpected glitches for me when selecting the columns with type 'object' and converting them to string in both the ATOM and HETATM frames: one frame would have the wrong dtype and the conversion would crash.
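A rough illustration of how the mismatch surfaces downstream, again with hand-built frames and placeholder column names (the helper `object_columns` is hypothetical and not part of biopandas):

```python
import pandas as pd

def object_columns(frame):
    """Return the names of all columns whose dtype is 'object'."""
    return [name for name in frame.columns if frame[name].dtype == object]

columns = ["atom_name", "line_idx"]
atom = pd.DataFrame([["CA", 0]], columns=columns)   # populated frame
hetatm = pd.DataFrame([], columns=columns)          # empty frame

# 'line_idx' is selected in the empty frame only, so code that assumes
# both frames yield the same set of 'object' columns will diverge.
print(object_columns(atom))
print(object_columns(hetatm))
```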

ZivBA, Jun 04 '20 13:06