lute-v3
Add multi-key sorting of statuses
From a discord link:
Sorting by status works great when the book has unknown words, but if books have none, or the same number of unknown words, then the sort doesn't fall back to the next-lowest status tier. For example, I can't sort this table since the books have no unknowns:
I guess it should sort by unknowns, then status 1 through 5.
Notes
The book listing is in `lute/templates/book/tablelisting.html`, with the graph column first getting the `UnknownPercent` from the datatables query in `lute/book/datatables.py`, so the unknown percent is used for sorting. The graph is then rendered "on top" of that, with `ajax_in_book_stats` in the html.
The `UnknownPercent` is pulled from the `bookstats` table:
```sql
CREATE TABLE IF NOT EXISTS "bookstats" (
    "BkID" INTEGER NOT NULL,
    "distinctterms" INTEGER NULL,
    "distinctunknowns" INTEGER NULL,
    "unknownpercent" INTEGER NULL,
    status_distribution VARCHAR(100) NULL,
    PRIMARY KEY ("BkID"),
    FOREIGN KEY("BkID") REFERENCES "books" ("BkID") ON UPDATE NO ACTION ON DELETE CASCADE
);
```
with values like this:
```
sqlite> select * from bookstats;
4|378|220|58|{"0": 220, "1": 0, "2": 0, "3": 0, "4": 0, "5": 0, "98": 0, "99": 158}
```
"58" here is the percent of unknown terms (220 unknowns out of 378 distinct terms, roughly 58%).
To add this feature correctly, the code needs to change the single value "58" to a composite key (in this case something like {"0": "048", "1": "023", "2": "011", "3": ... etc }, with three digit places for each element to allow for 100 in one of the slots).
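For illustration, here is a minimal sketch of how such a composite key could be computed from the `status_distribution` counts in the `bookstats` row above; the function name, padding, and output format are assumptions made for this issue, not existing Lute code.

```python
import json

# Status buckets in the desired sort priority: unknowns first, then 1..5, then ignored/well-known.
STATUSES = ["0", "1", "2", "3", "4", "5", "98", "99"]

def build_composite_key(status_counts, distinct_terms):
    """Convert raw status counts into zero-padded percentage strings.

    With a fixed key order and three digits per value, the serialized JSON
    string itself sorts lexicographically in multi-key order: first by the
    unknown percentage, then by status 1, 2, and so on.
    """
    total = distinct_terms or 1  # guard against books with no terms
    return json.dumps(
        {s: f"{round(100 * status_counts.get(s, 0) / total):03d}" for s in STATUSES}
    )

# The bookstats row above (378 distinct terms, 220 unknowns) would give
# something like '{"0": "058", "1": "000", ..., "99": "042"}'.
```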
~~Doing all of this in the course of a DB migration would maybe require a python script to run, to load the status_distribution json string and then do calculations. The migration tool currently doesn't handle such a thing. Instead there could be a data clean-up job that runs at startup, that checks the content of the sort string, and if it doesn't match the required pattern it could do some fast processing.~~
Updated spec after discord discussion.
Probably the best way to do this is to have a method `calc_sort_keys` added to `lute/book/stats.py` which does the calc for a new `status_distribution_percent` field if it's null and if the `status_distribution` is not null. Call this on every call to the datatables function so the key is always updated.
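A rough sketch of what that method might look like, reusing the hypothetical `build_composite_key` helper above; the attribute names come from the schema and spec in this issue, but the overall shape (and the graceful fallback on bad data) is just an assumption.

```python
import json

def calc_sort_keys(book_stats):
    """Populate status_distribution_percent on a bookstats row if it is missing.

    Assumes book_stats exposes status_distribution, distinctterms and
    status_distribution_percent; returns the stored or newly computed key.
    """
    if book_stats.status_distribution_percent:
        return book_stats.status_distribution_percent  # already calculated
    if not book_stats.status_distribution:
        return None  # nothing to calculate from (null or empty string)
    try:
        counts = json.loads(book_stats.status_distribution)
        if not isinstance(counts, dict):
            counts = {}
    except ValueError:
        counts = {}  # bad json: degrade gracefully to an all-zero key
    key = build_composite_key(counts, book_stats.distinctterms)
    book_stats.status_distribution_percent = key
    return key
```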
todos I can think of
- [x] ~~consider if can be linked somehow to https://github.com/LuteOrg/lute-v3/issues/453~~ -- no don't bother, stick with the current state
- [ ] table migration, add new `status_distribution_percent`, varchar(100). No need to drop the "unknownpercent" column, it's ok to leave that for later.
- [ ] `lute/book/stats.py`, add `calc_sort_keys` method to load `status_distribution_percent`. Unit tests: nulls, bad json in the distribution field, empty string, valid distribution, distribution with 100% and 0% (see the test sketch after this list)
- [ ] add unit test for book with no words (e.g. english book, text = "123")
- [ ] call `calc_sort_keys` in the book datatables python
- [ ] change the book stats calculation to call `calc_sort_keys`
- [ ] change the datatables to use the new field, template to use sort_key
- [ ] graph uses field, old complicated JS code can be removed
- [ ] once launched, maybe create a separate task to delete the `unknownpercent` column as it's no longer used
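For the unit tests listed against the `stats.py` item, something along these lines could work with the `calc_sort_keys` sketch above (the plain namespace is only a stand-in for however the real bookstats row is passed in):

```python
import json
from types import SimpleNamespace

def make_stats(distribution, percent=None, terms=100):
    """Hypothetical stand-in for a bookstats row."""
    return SimpleNamespace(
        status_distribution=distribution,
        status_distribution_percent=percent,
        distinctterms=terms,
    )

def test_null_distribution_gives_no_key():
    assert calc_sort_keys(make_stats(None)) is None

def test_empty_string_gives_no_key():
    assert calc_sort_keys(make_stats("")) is None

def test_bad_json_degrades_to_all_zero_key():
    key = calc_sort_keys(make_stats("not json"))
    assert json.loads(key)["0"] == "000"

def test_valid_distribution_with_100_and_0_percent():
    dist = json.dumps({"0": 100, "1": 0, "2": 0, "3": 0, "4": 0, "5": 0, "98": 0, "99": 0})
    key = calc_sort_keys(make_stats(dist, terms=100))
    assert json.loads(key)["0"] == "100"
    assert json.loads(key)["1"] == "000"
```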
The above is my feeling about what needs to happen for this to work; but I could be wrong.
I don't believe it would be possible to calculate the sort index dynamically when calling the datatables method. Reason: datatables needs to get the sort index to do its server-side sorting, so it would really need all of those values present. Having that data cached in the table is the only way to do it, afaict.
While this change is not trivial, it's reasonably contained, so I think it could be tackled by any motivated dev who wants to give it a shot.
@jzohrab I'm definitely motivated and willing to give this one a go if nobody objects - worst case I learn more about the codebase, right?
If so, I'd appreciate your input on the best way to test the migrations - or do you just test the end result?
Thanks!
Hey @parradam -- In this case, the migration itself doesn't touch existing data (phew!), it just adds a new field, and then some new code will populate that. That new code needs the tests, not the migration itself. The migration itself will be a single-line text file, something like `alter table bookstats add column ...`, in `lute/db/schema/migrations/`. I use `inv db.newscript <script_name>` to make the stub file b/c I'm lazy. :-)
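For what it's worth, the whole migration script would probably amount to a single statement like the one below (the column name and size come from the todo list above; the actual file name would be whatever `inv db.newscript` generates):

```sql
-- lute/db/schema/migrations/<generated name>.sql
ALTER TABLE bookstats ADD COLUMN status_distribution_percent VARCHAR(100) NULL;
```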
You can push things as you go to a Draft PR and I can take a look if you'd like. I don't have a pile of time but can def look. :-)
Happy to give it a go!
Edit: just to say that I haven't forgotten this, I've just been quite ill (still am!)
Hey @jzohrab, I think I'm starting to follow this better after reading through the code in a bit of detail.
Thanks in advance for your patience :-)
My understanding
> Updated spec after discord discussion.
>
> Probably the best way to do this is to have a method `calc_sort_keys` added to `lute/book/stats.py` which does the calc for a new `status_distribution_percent` field if it's null and if the `status_distribution` is not null. Call this on every call to the datatables function so the key is always updated.
If I understand:
- the user accesses the `start_reading` route
- this ultimately uses a `mark_stale` method to remove the book stats (forcing an update)
- when the `table_stats` route is called, `get_stats` is called, which won't find any stats (as they've been removed), and then `_calculate_stats` returns a `dict` with the required info, followed by `_update_stats` to commit to the DB
Questions on steps
> - [x] table migration, add new `status_distribution_percent`, varchar(100). No need to drop the "unknownpercent" column, it's ok to leave that for later.
> - [ ] `lute/book/stats.py`, add `calc_sort_keys` method to load `status_distribution_percent`. Unit tests: nulls, bad json in the distribution field, empty string, valid distribution, distribution with 100% and 0%
I think you are saying that:
- the `calc_sort_keys` method should take the `status_distribution` values and normalise them to percentages
- there needs to be some error handling, or does it just need to degrade gracefully? (i.e. return a dict with zero keys for now, or an error status that can be displayed in the bar and a tooltip to show that keys couldn't be retrieved)?
> - [ ] add unit test for book with no words (e.g. english book, text = "123")
> - [ ] call `calc_sort_keys` in the book datatables python
Sorry, not sure I understand this step - reading between the lines, is this perhaps captured in the next two steps (book stats calc and datatables)?
> - [ ] change the book stats calculation to call `calc_sort_keys`
So `_calculate_stats` should return something like `"distribution_percent": json.dumps(status_distribution_percent)` along with the existing keys, and that should be followed through to `_update_stats` as well.
> - [ ] change the datatables to use the new field, template to use sort_key
I understand the first part of this to mean that the SQL query should be updated to pull out the `status_distribution_percent` field.
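If the stored value keeps the fixed-order, zero-padded format from the original spec, then the server-side sort really is just an ordinary string comparison on that column. A quick plain-Python illustration (not Lute code, and the keys are truncated for brevity):

```python
# Three-digit padding plus a fixed key order means lexicographic comparison of the
# stored strings is equivalent to sorting by unknowns first, then each later status.
rows = [
    ("Book A", '{"0": "000", "1": "012", "2": "030"}'),
    ("Book B", '{"0": "000", "1": "012", "2": "025"}'),
    ("Book C", '{"0": "058", "1": "000", "2": "000"}'),
]
for title, key in sorted(rows, key=lambda r: r[1], reverse=True):
    print(title, key)
# Prints Book C, then Book A, then Book B.
```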
> - [ ] graph uses field, old complicated JS code can be removed
> - [ ] once launched, maybe create a separate task to delete the `unknownpercent` column as it's no longer used
Hi @parradam -- no sweat and no rush. I should have mentioned: I think an important part of working on something is to address your own personal wants and needs! :-) There are so many requests for stuff, it's easy to get overwhelmed. Picking things that you yourself want is good motivation. Speaking from experience 😛
Some responses:
- "...should take the status_distribution values and normalise them to percentages" - yep, I'd leave the existing field for now
- re error handling: sure, sounds good, just a null dict, just in case. This handles things like missing values.
- re "call calc_sort_keys in the book datatables python" - there's a `lute/book/datatables.py` that sets up the query. I figured that a safe fallback would be a call to `calc_sort_keys` (or `calc_percentages` or `calc_status_distribution_percent` or whatever) in there, as a safeguard to ensure that the percentages are set where needed. Might be overkill!
- for "change the datatables to use the new field, template to use sort_key" -- the datatables query should just need to return the `status_distribution_percent`, which should be good enough to use for the sorting in datatables.
Sheesh this is a big hunk of work for something that really doesn't add much real value, even though it's nice. Feel free to hold off on it if it's still confusing!