dlt
Streamlit app cache is invalid & load info crashes
dlt version
0.4.8
Describe the problem
When the schema changes by either
a) dropping a table manually
b) unselecting a resource while having the write_disposition='replace'
the streamlit app does not show the changed schema.
Also, when I delete a row it is still shown when I click "Show Data".
Further, the "Number of loaded rows" in the load info tab only shows a run-time error and stack trace but no row count.
Expected behavior
- When I drop a table the streamlit app should not show it
- When I deselect a resource and dlt removes the table from the destination the streamlit app should not show it
- The streamlit app should show the row count in the load info tab
- Different topic: The tables created by the rest_api should not be filed in dlt under a non-existing schema.
Steps to reproduce
See screencast: https://www.loom.com/share/700e9f4a1cbe48a5988f55f27c022588?sid=b937409a-329a-4770-85b6-b65afe05aa51
Operating system
macOS
Runtime environment
Local
Python version
3.11
dlt data source
rest_api
dlt destination
DuckDB
Other deployment details
No response
Additional information
My test code:
import dlt
from rest_api import rest_api_source

pokemon_config = {
    "client": {
        "base_url": "https://pokeapi.co/api/v2/",
    },
    "resource_defaults": {
        "write_disposition": "replace",
        "endpoint": {
            "params": {
                "limit": 1000,
            },
        },
    },
    "resources": [
        {
            "name": "berries",
            "endpoint": {
                "path": "berry"
            },
            # "selected": False
        },
        "pokemon",
    ],
}

pokemon_source = rest_api_source(pokemon_config)

pipeline = dlt.pipeline(
    pipeline_name="pokemon_pipeline",
    destination="duckdb",
    dataset_name="pokemon",
    progress="log",
)

load_info = pipeline.run(pokemon_source)
print(load_info)
I tried to reproduce this by running the pipeline with the example code multiple times and then
a) dropping a table manually and b) unselecting a resource while having write_disposition='replace'.
https://github.com/dlt-hub/dlt/assets/354868/fa16a861-72aa-453c-8702-7a10683dd308
I was unable to reproduce it. Can you please try again with the latest version?
Thanks for testing! Indeed, with v0.4.9 it's better. But when I click on "Show data" after the second run, right after dropping the table, streamlit still shows data from the cache. It does not query the DB and therefore does not notice that the table no longer exists.
How to reproduce:
- python pokemon_pipeline.py # with berry selected
- dlt pipeline pokemon_pipeline show
- in streamlit: "show data" on berries resource
- in DuckDB:
use pokemon; drop table berries;
- Refresh streamlit
- in streamlit: "show data" on berries resource. It still shows the deleted data
- python pokemon_pipeline.py # with berry unselected
- in streamlit: "show data" on berries resource. It still shows the deleted data
- dlt pipeline pokemon_pipeline drop berries
- streamlit: Now, berries are no longer visible
Also, even after executing the pipeline multiple times, I get the blue message in streamlit: "pokemon is missing resource state". I wonder when this would disappear. The same happens for berries.
@willi-mueller I think this is suboptimal UX, and we probably just need to hide the message if no resource state is found.
@willi-mueller I was able to reproduce the issue with caching. It is due to the TTL we have set for when users query the data. I am thinking about a proper invalidation mechanism.
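To illustrate the TTL behaviour described above, here is a minimal sketch in plain Python. It is not dlt's or Streamlit's actual code: `TTLCache`, `load_berries`, the in-memory `tables` dict, and the 60-second TTL are all hypothetical stand-ins. It only shows why a dropped table can still appear while a cached entry is younger than the TTL, and how an explicit invalidation forces a fresh query.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Tiny time-based cache; a stand-in for illustration only."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str, loader: Callable[[], Any]) -> Any:
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self.ttl_seconds:
            return entry[1]  # served from cache; the loader is not called
        value = loader()
        self._entries[key] = (now, value)
        return value

    def invalidate(self, key: str) -> None:
        # Drop the cached entry so the next get() calls the loader again.
        self._entries.pop(key, None)


# Hypothetical in-memory "destination" with a single table.
tables = {"berries": [{"name": "cheri"}, {"name": "chesto"}]}


def load_berries() -> list:
    # Raises KeyError if the table no longer exists.
    return list(tables["berries"])


fake_now = [0.0]  # controllable clock so the example needs no sleeping
cache = TTLCache(ttl_seconds=60, clock=lambda: fake_now[0])

first = cache.get("berries", load_berries)

del tables["berries"]  # plays the role of "drop table berries" in another tool
fake_now[0] = 10.0     # 10 seconds later, still inside the TTL
stale = cache.get("berries", load_berries)  # still returns the old rows

cache.invalidate("berries")
try:
    cache.get("berries", load_berries)
    table_missing = False
except KeyError:
    table_missing = True  # the fresh query notices the table is gone
```

In this sketch the stale read happens because the loader is never called within the TTL window. Either a shorter TTL or an explicit invalidation makes the missing table visible.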
Awesome, I feel understood :)
On 29 Apr 2024, at 16:18, Sultan Iman @.***> wrote:
@willi-mueller I was able to reproduce the issue with caching and it is due to the TTL we have set when users query the data. I am thinking about proper invalidation mechanism.
@willi-mueller I adjusted the caching lifetime and disabled caching of the schema in the streamlit session state store; please see the attached screen recording. Historically, it was not anticipated that users might keep streamlit open and alter the state of the pipeline and data from a separate console or other tools.
The PR with the fix is right above this comment.
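The session-state part of this change can be sketched as follows. This is only an illustration under assumptions, not the code from the PR: `session_state` is a plain dict playing the role of a per-session store, and `read_schema_tables`, `tables_cached_in_session`, and `tables_read_fresh` are hypothetical helpers.

```python
from typing import Dict, List

# Hypothetical stand-ins: a dict acting as a per-session store and a list
# acting as the set of tables currently present in the destination.
session_state: Dict[str, List[str]] = {}
destination_tables = ["berries", "pokemon"]


def read_schema_tables() -> List[str]:
    # Plays the role of reading the current schema from the pipeline.
    return sorted(destination_tables)


def tables_cached_in_session() -> List[str]:
    # Behaviour before the change (as described): read once, reuse for the session.
    if "tables" not in session_state:
        session_state["tables"] = read_schema_tables()
    return session_state["tables"]


def tables_read_fresh() -> List[str]:
    # Behaviour after the change (as described): read the schema on every rerun.
    return read_schema_tables()


before_cached = tables_cached_in_session()

destination_tables.remove("berries")  # table dropped from a separate console

after_cached = tables_cached_in_session()  # still lists "berries"
after_fresh = tables_read_fresh()          # reflects the drop
```

The trade-off in this sketch is one extra schema read per rerun in exchange for never showing a table that was removed outside the app.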
https://github.com/dlt-hub/dlt/assets/354868/9dd4f640-1976-4abb-9da6-2f630bbc288b
@willi-mueller can you please check this with the latest version? The issue should be gone now.
Thank you for the fix!
On 27 May 2024, at 18:02, rudolfix @.***> wrote:
Closed #1264 as completed.