
Streamlit app cache is invalid & load info crashes

willi-mueller opened this issue on Apr 23, 2024 · 8 comments

dlt version

0.4.8

Describe the problem

When the schema changes, either by a) dropping a table manually or b) unselecting a resource while write_disposition='replace' is set, the Streamlit app does not show the changed schema.

[Screenshot 2024-04-23 at 13:06: stale schema shown in the Streamlit app]

Also, when I delete a row, it is still shown when I click "Show Data".

Furthermore, the "Number of loaded rows" in the load info tab shows only a runtime error and a stack trace instead of a row count.

[Screenshot 2024-04-23 at 12:57: stack trace in the load info tab]

Expected behavior

  1. When I drop a table the streamlit app should not show it
  2. When I deselect a resource and dlt removes the table from the destination the streamlit app should not show it
  3. The streamlit app should show the row count in the load info tab
  4. Different topic: the tables created by the rest_api source should not be filed by dlt under a non-existent schema.

Steps to reproduce

See screencast: https://www.loom.com/share/700e9f4a1cbe48a5988f55f27c022588?sid=b937409a-329a-4770-85b6-b65afe05aa51

Operating system

macOS

Runtime environment

Local

Python version

3.11

dlt data source

rest_api

dlt destination

DuckDB

Other deployment details

No response

Additional information

My test code:

import dlt

from rest_api import rest_api_source

pokemon_config = {
    "client": {
        "base_url": "https://pokeapi.co/api/v2/",
    },
    "resource_defaults": {
        "write_disposition": "replace",
        "endpoint": {
            "params": {
                "limit": 1000,
            },
        },
    },
    "resources": [
        {
          "name": "berries",
          "endpoint": {
            "path": "berry"
          },
          # "selected": False
        },
        "pokemon",
    ],
}

pokemon_source = rest_api_source(pokemon_config)

pipeline = dlt.pipeline(
    pipeline_name="pokemon_pipeline",
    destination="duckdb",
    dataset_name="pokemon",
    progress="log",
)

load_info = pipeline.run(pokemon_source)
print(load_info)

willi-mueller avatar Apr 23 '24 11:04 willi-mueller

I tried to reproduce this by running the pipeline with the example code multiple times and then trying out

a) dropping a table manually, b) unselecting a resource while write_disposition='replace' is set.

https://github.com/dlt-hub/dlt/assets/354868/fa16a861-72aa-453c-8702-7a10683dd308

I was unable to reproduce it. Can you please try again, perhaps with the latest version?

sultaniman avatar Apr 25 '24 07:04 sultaniman

Thanks for testing! Indeed, with v0.4.9 it's better. But when I click "Show data" on the second run right after dropping the table, Streamlit still shows data from the cache; it does not query the DB and therefore does not notice that the table no longer exists.

How to reproduce:

  1. python pokemon_pipeline.py # with berry selected
  2. dlt pipeline pokemon_pipeline show
  3. in streamlit: "show data" on berries resource
  4. in DuckDB: use pokemon; drop table berries;
  5. Refresh streamlit
  6. in streamlit: "show data" on berries resource. It still shows the deleted data
  7. python pokemon_pipeline.py # with berry unselected
  8. in streamlit: "show data" on berries resource. It still shows the deleted data
  9. dlt pipeline pokemon_pipeline drop berries
  10. in Streamlit: now the berries are no longer visible
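The staleness in steps 4-8 is consistent with a time-based cache sitting between the app and the database. The sketch below is illustrative only (a generic TTL cache, not dlt's or Streamlit's actual internals): within the TTL window the cached rows keep being served even after the underlying table has been dropped.

```python
import time

class TTLCache:
    """Minimal TTL cache: entries are reused until they are older than ttl."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (timestamp, value)

    def get_or_compute(self, key, compute):
        now = time.monotonic()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]  # cached result served without re-querying
        value = compute()
        self._store[key] = (now, value)
        return value

cache = TTLCache(ttl_seconds=600)
table = {"berries": [1, 2, 3]}

# First "Show data" click: queries the table and caches the rows.
rows = cache.get_or_compute("berries", lambda: table.get("berries"))

# "drop table berries" happens outside the app.
del table["berries"]

# Refresh within the TTL window: the cache still answers with the old rows.
stale = cache.get_or_compute("berries", lambda: table.get("berries"))
assert stale == [1, 2, 3]  # dropped table's data is still shown
```

This is why step 9 (`dlt pipeline pokemon_pipeline drop berries`) makes the data disappear while a plain refresh does not: only operations that go through the pipeline's own state end up invalidating what the app displays.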

willi-mueller avatar Apr 26 '24 09:04 willi-mueller

Also, even after executing the pipeline multiple times, I get the blue message in Streamlit: "pokemon is missing resource state". I wonder when this would disappear. Same for berries.

willi-mueller avatar Apr 26 '24 09:04 willi-mueller

@willi-mueller I think this is suboptimal UX; we should probably just hide the message when no resource state is found.

sultaniman avatar Apr 26 '24 09:04 sultaniman

@willi-mueller I was able to reproduce the caching issue. It is due to the TTL we set on the cache when users query the data. I am thinking about a proper invalidation mechanism.

sultaniman avatar Apr 29 '24 14:04 sultaniman

Awesome, I feel understood :)


willi-mueller avatar Apr 29 '24 16:04 willi-mueller

@willi-mueller I adjusted the caching lifetime and disabled caching the schema in the Streamlit session state store; please see the attached screen recording. Historically, it was not anticipated that users might keep Streamlit open while altering the pipeline's state and data from a separate console or other tools.

The PR with the fix is linked right above this comment.

https://github.com/dlt-hub/dlt/assets/354868/9dd4f640-1976-4abb-9da6-2f630bbc288b
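The shape of the fix described above, not caching the schema in session state so every refresh re-reads it, can be sketched generically. All names here (`SchemaView`, `load_schema`) are hypothetical, not dlt's actual code:

```python
class SchemaView:
    """Hypothetical sketch: instead of storing the schema in session state,
    re-read it from the pipeline on every access, so drops and deselects
    made outside the app are picked up on the next refresh."""

    def __init__(self, load_schema):
        self._load_schema = load_schema  # callable returning the current schema

    @property
    def schema(self):
        # No session-state caching: each access reflects the live state.
        return self._load_schema()

tables = ["berries", "pokemon"]
view = SchemaView(lambda: list(tables))

assert view.schema == ["berries", "pokemon"]
tables.remove("berries")  # table dropped from a separate console
assert view.schema == ["pokemon"]  # the next refresh sees the change
```

Query results can still be cached briefly for responsiveness; the key design choice is that the schema, which governs what tables are shown at all, is never held longer than a single render.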

sultaniman avatar May 07 '24 09:05 sultaniman

@willi-mueller can you please check this with the latest version, the issue should be gone now?

sultaniman avatar May 14 '24 12:05 sultaniman

Thank you for the fix!

On 27 May 2024, rudolfix closed #1264 as completed.

willi-mueller avatar May 27 '24 13:05 willi-mueller