django-pandas Safely saving pandas DataFrame to SQL table via Django

I recently wrote the following for a project, but django-pandas actually seems like a better home for it. The idea is basically to implement to_sql via the Django ORM to get all the validation goodness of the ORM. (A related feature would be another set of tools to validate that a DataFrame is consistent with the Django model schema. That is the 2nd example below)

Is anyone here interested in a PR for this?

def to_django(df, DjangoModel, if_exists="fail"):
    """Uses bulk_create to insert data to Django table
    if_exists: see pd.DataFrame.to_sql API

    Ref: https://www.webforefront.com/django/multiplemodelrecords.html
    """
    import numpy as np

    if if_exists not in ["fail", "replace", "append"]:
        raise Exception("if_exists must be fail, replace or append")

    if if_exists == "replace":
        DjangoModel.objects.all().delete()
    elif if_exists == "fail":
        if DjangoModel.objects.all().count() > 0:
            raise ValueError("Data already exists in this table")
    else:
        pass

    dct = df.replace({np.nan: None}).to_dict(
        "records"
    )  # replace NaN with None since Django doesn't understand NaN

    bulk_list = []
    for x in dct:
        bulk_list.append(DjangoModel(**x))
    DjangoModel.objects.bulk_create(bulk_list)
    print("Successfully saved DataFrame to Django table.")

def check_dataframe_columns(
    df: pd.DataFrame, DjangoModel: Type[models.Model], strict: bool = False
):
    """Raises KeyError if DataFrame doesn't match a Django model

    Parameters
    ----------
    df: A pandas DataFrame object
    DjangoModel: A Django model
    strict: If strict is True, then the DataFrame must contain the exact set of columns
        as the Django mode
    """
    # todo: add column type checking
    error_list = []
    dataframe_columns = set(df.columns)
    django_columns = set(map(lambda x: x.name, DjangoModel._meta.fields))
    if not dataframe_columns.issubset(django_columns):
        unknown_columns = list(dataframe_columns - django_columns)
        unknown_columns.sort()  # to be repeatable
        error_list.append(
            f"DataFrame contains unknown column(s): {', '.join(unknown_columns)}"
        )
    if strict and not dataframe_columns == django_columns:
        missing_columns = list(django_columns - dataframe_columns)
        missing_columns.sort()  # to be repeatable
        error_list.append(
            f"With strict=True, DataFrame is missing column(s): {', '.join(missing_columns)}"
        )

    if error_list:
        raise KeyError(" | ".join(error_list))

Aug 13 '20 07:08 pandichef

@pandichef this is cool! But how do we cover the case when we create the the pandas dataframe based on a join (related field)

Aug 17 '20 16:08 chrisdev

In my use case, I have a single foreign key called pool. So I set the value manually i.e.,

df['pool_id'] = 9

If you want to handle choices fields or foreign key fields more formally, I think it can be done. For model choices, you just need to check that the value is valid or throw an exception. For foreign keys, I'd add a boolean parameter to to_django named something like autogen_fk_object. If True, to_django will create a new foreign key object if the key doesn't already exist; if False, it throws an error. Something like Factory Boy's SubFactory class comes to mind.

Aug 17 '20 18:08 pandichef

This is an amazing code. Thanks a lot as it helped me with a project. Great job mate!

Dec 03 '21 06:12 selfcontrol7

How can we add some logic to avoid duplicate entries for the incoming data. For example I am able to save the excel file in my database using to_sql. I have to validate that no two duplicate entries can be created for a column in this case I have title column and validate if two products have same title or name, the entry does not get added to the database.

Jan 31 '22 12:01 venqics

This works wonders, thank you for posting!

Oct 06 '23 12:10 PJaramilloV

django-pandas django-pandas copied to clipboard

Safely saving pandas DataFrame to SQL table via Django

django-pandas
django-pandas copied to clipboard