lute-v3 Add multi-key sorting of statuses

From discord link

Sorting by status works great when the book has unknown words but if books have none or the same amount of unknown words then the sort doesn't use next lowest status tier. For example I can't sort this table since the books have no unknowns:

I guess it should sort by unknowns then status 1 through 5

Notes

The book listing is in lute/templates/book/tablelisting.html, with the graph column first getting the UnknownPercent from the datatables query in lute/book/datatables.py, so the unknown percent is used for sorting. The graph is then rendered "on top" of that, with ajax_in_book_stats in the html.

The UnknownPercent is pulled from the table

CREATE TABLE IF NOT EXISTS "bookstats" (
	"BkID" INTEGER NOT NULL  ,
	"distinctterms" INTEGER NULL  ,
	"distinctunknowns" INTEGER NULL  ,
	"unknownpercent" INTEGER NULL  ,
        status_distribution VARCHAR(100) NULL,
	PRIMARY KEY ("BkID"),
	FOREIGN KEY("BkID") REFERENCES "books" ("BkID") ON UPDATE NO ACTION ON DELETE CASCADE
);

with values like this:

sqlite> select * from bookstats;
4|378|220|58|{"0": 220, "1": 0, "2": 0, "3": 0, "4": 0, "5": 0, "98": 0, "99": 158}

"58" here is the percent of unknown terms.

To add this feature correctly, the code needs to change the single value "58" to a composite key (in this case something like {"0": "048", "1": "023", "2": "011", "3": ... etc }, with three digit places for each element to allow for 100 in one of the slots).

~~Doing all of this in the course of a DB migration would maybe require a python script to run, to load the status_distribution json string and then do calculations. The migration tool currently doesn't handle such a thing. Instead there could be a data clean-up job that runs at startup, that checks the content of the sort string, and if it doesn't match the required pattern it could do some fast processing.~~

Updated spec after discord discussion.

Probably the best way to do this is to have a method calc_sort_keys added to lute/book/stats.py which does the calc for a new status_distribution_percent field if it's null and if the status_distribution is not null. Call this on every call to the datatables function so the key is always updated.

todos I can think of

[x] ~~consider if can be linked somehow to https://github.com/LuteOrg/lute-v3/issues/453~~ -- no don't bother, stick with the current state
[ ] table migration, add new status_distribution_percent, varchar(100). No need to drop the "unknownpercent" column, it's ok to leave that for later.
[ ] lute/book/stats.py, add calc_sort_keys method to load status_distribution_percent. Unit tests: nulls, bad json in the distribution field, empty string, valid distribution, distribution with 100% and 0%
[ ] add unit test for book with no words (e.g. english book, text = "123"
[ ] call calc_sort_keys in the book datatables python
[ ] change the book stats calculation to call calc_sort_keys
[ ] change the datatables to use the new field, template to use sort_key
[ ] graph uses field, old complicated JS code can be removed
[ ] once launched, maybe create a separate task to delete the unknownpercent column as it's no longer used

Jan 17 '25 00:01 jzohrab

The above is my feeling about what needs to happen for this to work; but I could be wrong.

I don't believe it would be possible to calculate the sort index dynamically when calling the datatables method. Reason: datatables needs to get the sort index to do its server-side sorting, so it would really need all of those values present. Having that data cached in the table is the only way to do it, afaict.

Jan 17 '25 01:01 jzohrab

While this change is not simple, it's pretty easy, so I think it could be tackled by any motivated dev who wants to give it a shot.

Jan 17 '25 01:01 jzohrab

@jzohrab I'm definitely motivated and willing to give this one a go if nobody objects - worst case I learn more about the codebase, right?

If so, I'd appreciate your input on the best way to test the migrations - or do you just test the end result?

Thanks!

Jan 20 '25 21:01 parradam

Hey @parradam -- In this case, the migration itself doesn't touch existing data (phew!), it just adds a new field, and then some new code will populate that. That new code needs the tests, not the migration itself. The migration itself will be a single-line text file, something like alter table bookstats add column ..., in lute/db/schema/migrations/. I use inv db.newscript <script_name> to make the stub file b/c I'm lazy. :-)

Jan 20 '25 23:01 jzohrab

You can push things as you go to a Draft PR and I can take a look if you'd like. I don't have a pile of time but can def look. :-)

Jan 20 '25 23:01 jzohrab

Happy to give it a go!

Edit: just to say that I haven't forgotten this, I've just been quite ill (still am!)

Jan 27 '25 21:01 parradam

Hey @jzohrab, I think I'm starting to follow this better after reading through the code in a bit of detail.

Thanks in advance for your patience :-)

My understanding

Updated spec after discord discussion.

Probably the best way to do this is to have a method calc_sort_keys added to lute/book/stats.py which does the calc for a new status_distribution_percent field if it's null and if the status_distribution is not null. Call this on every call to the datatables function so the key is always updated.

If I understand:

the user accesses the start_reading route
this ultimately uses a mark_stale method to remove the book stats (forcing an update)
when the table_stats route is called, get_stats is called, which won't find any stats (as they've been removed) and then _calculate_stats returns a dict with the required info followed by _update_stats to commit to the DB

Questions on steps

[x] table migration, add new status_distribution_percent, varchar(100). No need to drop the "unknownpercent" column, it's ok to leave that for later.
[ ] lute/book/stats.py, add calc_sort_keys method to load status_distribution_percent. Unit tests: nulls, bad json in the distribution field, empty string, valid distribution, distribution with 100% and 0%

I think you are saying that:

the calc_sort_keys method should take the status_distribution values and normalise them to percentages
there needs to be some error handling, or does it just need to degrade gracefully? i.e. return a dict with zero keys for now, or an error status that can be displayed in the bar and a tooltip to show that keys couldn't be retrieved)?

[ ] add unit test for book with no words (e.g. english book, text = "123"
[ ] call calc_sort_keys in the book datatables python

Sorry, not sure I understand this step - reading between the lines, is this perhaps captured in the next two steps (book stats calc and datatables?

[ ] change the book stats calculation to call calc_sort_keys

So _calculate_stats should return something like "distribution_percent": json.dumps(status_distribution_percent)" along with the existing keys, and that should be followed through to _update_stats as well

[ ] change the datatables to use the new field, template to use sort_key

I understand the first part of this to mean that the SQL query should be updated to pull out the status_distribution_percent field

[ ] graph uses field, old complicated JS code can be removed
[ ] once launched, maybe create a separate task to delete the unknownpercent column as it's no longer used

Feb 17 '25 20:02 parradam

Hi @parradam -- no sweat and no rush. I should have mentioned: I think an important part of working on something is to address your own personal wants and needs! :-) There are so many requests for stuff, it's easy to get overwhelmed. Picking things that you yourself want is good motivation. Speaking from experience 😛

Some responses:

"...should take the status_distribution values and normalise them to percentages" - yep, I'd leave the existing field for now
re error handling: sure, sounds good, just a null dict, just in case. This handles things like missing values.
re "call calc_sort_keys in the book datatables python" - there's a lute/books/datatables.py that sets up the query. I figured that a safe fallback would be a call to calc_sort_keys (or calc_percentages or calc_status_distribution_percent or whatever) in there, as a safeguard to ensure that the percentages are set where needed. Might be overkill!
for "change the datatables to use the new field, template to use sort_key" -- the datatables query should just need to return the status_distribution_percent, which should be good enough to use for the sorting in datatables.

Sheesh this is a big hunk of work for something that really doesn't add much real value, even though it's nice. Feel free to hold off on it if it's still confusing!

Feb 17 '25 21:02 jzohrab

lute-v3 lute-v3 copied to clipboard

Add multi-key sorting of statuses

Notes

todos I can think of

My understanding

Questions on steps

lute-v3
lute-v3 copied to clipboard