papyri use on insert/remove trigger to track changes.

use on insert/remove trigger to track changes.

Open Carreau opened this issue 2 years ago • 4 comments

So I've been poking at SQL:

[ ] first I added indexes to speed up queries. I don't know what I'm doing so @steff456 can you look at what I did and if it make sens or can be made better.
[ ] I don't know if we can make categorical columns to save us space / speed up queries, from what I can tell we can't in sqlite. If not that's ok.
[ ] Am I also correct that there is single type of index in sqlite ?

And finally: A long term project would be fast/efficient re-rendering of modified pages, for that I need a ways to track all rows that have been updated/created/removed since a specific time. For our purpose "time" is not wallclock time but a sort of "version" integer that is explicitly bumped.

[ ] Can we use trigger updates to have more tables that tracks which rows have been touched ? I'm thinking something like a 2 tables "documents_changes(Foreign key:int, version:int)" and "destinations_changes(Foreign key:int, version:int)" @steff456 is that something you could help me create ?

Mar 17 '22 08:03 Carreau

first I added indexes to speed up queries. I don't know what I'm doing so @steff456 can you look at what I did and if it make sens or can be made better.

I see that you created the indexes, but I don't understand where are you using them. I don't find the queries in the code 😅

I don't know if we can make categorical columns to save us space / speed up queries, from what I can tell we can't in sqlite. If not that's ok.

I'm not following this idea, which columns do you want to categorize?

Am I also correct that there is single type of index in sqlite ?

Yes! You can create an index in a single column, multi column or an unique index (like an id) but there's no more flexibility with that.

And finally: A long term project would be fast/efficient re-rendering of modified pages, for that I need a ways to track all rows that have been updated/created/removed since a specific time. For our purpose "time" is not wallclock time but a sort of "version" integer that is explicitly bumped.

I'm not sure how are we going to deal with certain changes. I understand the idea, but I feel that it is really difficult to implement and maintain and the gain in performance will not be that much. So for instance if we have a clash in the information from one version to another, how are we going to handle the merge? Also storing all the differences will make the table just grow a lot over time, so maybe we will get performance issues if we want to search or go back to a previous version.

Apr 13 '22 10:04 steff456

I see that you created the indexes, but I don't understand where are you using them. I don't find the queries in the code 😅

https://github.com/jupyter/papyri/blob/7c1bcdac9cdf22ac872afc81585d9004ca5c9654/papyri/graphstore.py#L244-L276

and

https://github.com/jupyter/papyri/blob/7c1bcdac9cdf22ac872afc81585d9004ca5c9654/papyri/graphstore.py#L290-L346

Yes I don't use upper case SELECT...

I'm not following this idea, which columns do you want to categorize?

https://github.com/jupyter/papyri/blob/7c1bcdac9cdf22ac872afc81585d9004ca5c9654/papyri/graphstore.py#L129-L137

    """
    CREATE TABLE documents(
    id INTEGER PRIMARY KEY,
    package TEXT NOT NULL,
    version TEXT NOT NULL,
    category TEXT NOT NULL,
    identifier TEXT NOT NULL, unique(package, version, category, identifier))
    """

the package, version, and tuple (package, version) only have a few values, and sometime we are querying only those two, so I don't know if we can make them smaller.

I'm not sure how are we going to deal with certain changes. I understand the idea, but I feel that it is really difficult to implement and maintain and the gain in performance will not be that much. So for instance if we have a clash in the information from one version to another, how are we going to handle the merge? Also storing all the differences will make the table just grow a lot over time, so maybe we will get performance issues if we want to search or go back to a previous version.

Ok, so I have to think about it, it would be fine to assume the tables or way to track if something is changed is regularly reseted. But as I said we can see that later.

Thanks !

Apr 16 '22 16:04 Carreau

Do you have a special package or way to test this that has a LOT of information to process? In that way I can do benchmarks to see if the index is actually helping us.

Apr 22 '22 09:04 steff456

You can clone that https://github.com/Carreau/papyri-gen, and do papyri ingest <path/to/papyri-gen>/data/*, and the papyri render.

without index 46s/51s, with index 42s/52s, so it's indeed not helping a lot anymore. I've also profiled the relevant section and the SQL is only taking ~5% total time so it's not worth optimising for now. To be fair when I though about optimising it was doing a dozen queries per page, so that might have made a difference.

Apr 23 '22 08:04 Carreau

papyri papyri copied to clipboard

use on insert/remove trigger to track changes.

papyri
papyri copied to clipboard