sqlmodel icon indicating copy to clipboard operation
sqlmodel copied to clipboard

Upsert of 2 table with a one to many relation onto them: fastest implementation (single vs batch vs one commit vs ?)

Open tepelbaum opened this issue 3 years ago • 2 comments

First Check

  • [X] I added a very descriptive title to this issue.
  • [X] I used the GitHub search to find a similar issue and didn't find it.
  • [X] I searched the SQLModel documentation, with the integrated search.
  • [X] I already searched in Google "How to X in SQLModel" and didn't find any information.
  • [X] I already read and followed all the tutorial in the docs and didn't find an answer.
  • [X] I already checked if it is not related to SQLModel but to Pydantic.
  • [X] I already checked if it is not related to SQLModel but to SQLAlchemy.

Commit to Help

  • [X] I commit to help with one of those options 👆

Example Code

class Lei(LeiBase, table=True):
    """
    lei db. Start from csv taken here
    https://www.gleif.org/fr/lei-data/lei-mapping/download-isin-to-lei-relationship-files/
    Could be completed by direct gleif api calls
    https://documenter.getpostman.com/view/7679680/SVYrrxuU?version=latest#3e013a79-b5f6-46a7-b9e9-299fde0b3a03
    """
    id: Optional[int] = Field(default=None, primary_key=True)
    created_at: datetime = Field(default_factory=datetime.utcnow)
    isins: List['Isin'] = Relationship(back_populates='lei')

class Isin(SQLModel, table=True):
    """
    isin db. Start from csv taken here
    https://www.gleif.org/fr/lei-data/lei-mapping/download-isin-to-lei-relationship-files/
    Could be completed by direct gleif api calls
    https://documenter.getpostman.com/view/7679680/SVYrrxuU?version=latest#3e013a79-b5f6-46a7-b9e9-299fde0b3a03
    """
    id: Optional[int] = Field(default=None, primary_key=True)
    created_at: datetime = Field(default_factory=datetime.utcnow)
    isin: str = Field(index=True)
    lei_id: Optional[int] = Field(default=None, foreign_key='lei.id')
    lei: Optional['Lei'] = Relationship(back_populates='isins')


def create_or_update_isins_leis(grouped_isins: Dict[str, List[str]], session: Session):
    """
    Code suggestion, to be improved. isin_lei_models is not provided but 
    """
    for lei, isins in grouped_isins.items():
        isin_lei_models(lei, isins, session)
    session.commit()

def isin_lei_models(lei: str, isins: List[str], session: Session):
    """
    Possible way to upsert lei and isins
    """
    in_db_lei = session.exec(select(Lei).where(col(Lei.lei) == lei)).first() or Lei(lei=lei)
    isins_to_insert = list(set(isins) - {isin.isin for isin in in_db_lei.isins})
    for isin in isins_to_insert:
        if in_db_isin := session.exec(select(Isin).where(col(Isin.isin) == isin)).first():
            in_db_isin.lei = lei
            session.add(in_db_isin)
            return
        session.add(Isin(isin=isin, lei=lei))

Description

I simplified the code as much as to focus on my interrogation. I have a simple one to many relation between Lei and Isin. Imagine that I am given new {lei: isins} values and I have to either insert or update the lei. My question is simple: should I commit once at the end or in the for loop just above ? or in the lei loop? Should I batch if I have 10M leis? Is the answer the same in Create and Upsert mode?

I found this related stackoverflow thread https://stackoverflow.com/questions/24377193/best-way-to-update-millions-of-records-in-sql-table-with-foreign-keys, which describes well what I am trying to accomplish, but the solution seems far fetched (buy maybe it is not?)

I can experiment all suggested ideas if needs be :)

Operating System

Linux

Operating System Details

python:3.10-slim docker image

Requirements.in (then converted to txt via pip-compile)

bcrypt==3.2.0
certifi==2021.5.30
cryptography==3.4.8
dash==2.6.0
dash-auth==1.4.1
dash_core_components==2.0.0
dash_bootstrap_components==1.2.0
fastapi==0.80.0
jupyter==1.0.0
jupyter-dash==0.4.2
nltk==3.7
numpy==1.23.1
openpyxl==3.0.10
pandas==1.4.3
passlib==1.7.4
pillow==9.2.0
plotly==5.9.0
psycopg2==2.9.1
pydeps==1.10.22
python-dateutil==2.8.2
python-dotenv==0.19.0
python-editor==1.0.4
python-jose==3.3.0
python-multipart==0.0.5
requests==2.26.0
sqladmin==0.3.0
sqlalchemy==1.4.35 # Needed, otherwise relationship are not working. See https://github.com/tiangolo/sqlmodel/issues/315
sqlmodel==0.0.6
strsimpy==0.2.1
tqdm==4.64.0
uvicorn==0.18.2

SQLModel Version

0.0.6

Python Version

3.10

Additional Context

In creation mode, a csv with 8 millions LEI/ISINS line takes less thant 30 minutes to insert. In upsert I gave up, it was too long with my naive implentation :)

tepelbaum avatar Sep 05 '22 11:09 tepelbaum

With all due respect, what does this question have to do with the SQLModel project? The docs make it very clear that this is a thin layer around the SQLAlchemy ORM. SQLModel changes nothing whatsoever about the underlying database query logic of SQLAlchemy. So if anything, this question might be related to SQLAlchemy.

However, I would argue that this question is maybe not even ORM-related at all. You seem to have a very database-centered question and you even found a related question+answer completely independent of any underlying ORM.

If I misunderstood your question, please do clarify, but I honestly don't think this is an appropriate place to ask this.

My general advice: Abstraction. Boil down your example even further (stripping it of all specifics, unrelated fields etc., call the models/fields Foo.bar or X.y), set up a fresh testing script with that reduced example, get rid of the SQLModel dependency (i.e. implement it with the SQLAlchemy ORM), and test a few approaches with some dummy data fixtures. If that does not give you (satisfactory) results, post your question with that reduced example to Stackoverflow. The reason I would advise to get rid of SQLModel for your tests is that it is much more likely that you find people who can help/advise you with that directly. The SQLModel "layer" around it may just distract/confuse people because it has (as I said) nothing to do with your question.

daniil-berg avatar Sep 06 '22 10:09 daniil-berg

I am no python ORM expert (faaaar from it), and therefore was wondering if anyone had some feedback to share on the topic, and since session is a SQLModel object the question seemed natural to ask here. I started to email @tiangolo on the topic, who suggested me to post here :). If the matter seems completely irrelevant to SQLModel (but good practices to do single/batch/all insertion with it seems like a natural topic to me), feel free to close the issue

tepelbaum avatar Sep 06 '22 13:09 tepelbaum