wpm
Rewrite entire quote database from scratch
I would like to go back to the original sources and write down every single quote again, from scratch.
As it is now, nearly every quote contains some kind of error/rewording/canary. Also, some of the sources could be more explicit. For example, for very old texts, it would be very nice to note which translation has been used.
To do this, I will need help. Please use a JSON format, and put the additional information in new fields that WPM does not recognize (they should all be optional, so that WPM ignores them).
So you have (the title should be only the canonical title, short and nothing else):
{
"author": "...",
"title": "...",
"text": "...",
"translation": "translation info",
"copyright": "if applicable",
"edition": "...",
"url": "if applicable"
}
If you don't have anything to put into the optional fields, then leave them out. It would be very helpful to include a URL, e.g. to Google Books (see below), so things can be double-checked.
The required fields are author, title and text.
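As a rough illustration of the rules above, an entry could be checked with a few lines of Python. This is only a sketch: the function name `check_entry` is made up for this example and is not part of WPM, and unknown extra fields are deliberately not flagged, since WPM is supposed to ignore them.

```python
REQUIRED_FIELDS = {"author", "title", "text"}
OPTIONAL_FIELDS = {"translation", "copyright", "edition", "url"}


def check_entry(entry):
    """Return a list of problems found in one quote entry (empty if it looks fine).

    Hypothetical helper for illustration; not part of WPM itself.
    """
    problems = []
    for field in sorted(REQUIRED_FIELDS):
        value = entry.get(field)
        # Required fields must be present and be non-blank strings.
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty required field: {field}")
    for field in sorted(OPTIONAL_FIELDS):
        # Optional fields should be left out entirely rather than left blank.
        if field in entry and not str(entry[field]).strip():
            problems.append(f"empty optional field (leave it out instead): {field}")
    return problems
```

Calling `check_entry({"author": "A", "title": "T"})` would report the missing text field, while a complete entry gives an empty list.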
How to actually find the quotes?
- Go to Google Books: https://books.google.com
- Find a quote you want to transcribe and search for it
- Try to find the correct one. Sometimes you get several hits; choose the most canonical of them.
- Write up the above JSON. Double check that you got everything correct.
Details on submitting new quotes
- Gunzip the existing quotes at wpm/wpm/data/examples.json.gz to find the quotes and their text IDs.
- Add new quotes with the same text ID and the URL you found it at in wpm/wpm/data/rewritten.json (note: no gzip), on a new branch rewritten-quotes.
- I will make tools available for comparing the two JSON files when needed.
- Remember that it's not enough to just copy the existing quotes as-is. They contain many errors, and the rewritten text should match the source exactly. It is best to double-check with physical books if you have them. If so, please add which edition (and translation etc.) you are using. If you're using the web, add the URL as stated above.
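Instead of gunzipping by hand, the existing database can also be read directly from Python. The sketch below assumes each entry is an array of author, title, text and text ID (the layout described later in this thread). The function names are invented for this example.

```python
import gzip
import json


def load_existing_quotes(path="wpm/wpm/data/examples.json.gz"):
    """Read the gzipped JSON quote database and return the parsed data."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)


def index_by_text_id(quotes):
    """Map text ID -> (author, title, text).

    Assumes rows shaped like [author, title, text, text_id].
    """
    return {row[3]: tuple(row[:3]) for row in quotes}
```

With that, `index_by_text_id(load_existing_quotes())` gives a lookup table for finding the text ID of the quote you are about to rewrite.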
Hey, I'd like to contribute and help resolve this issue. Where can I find the quotes currently being used which need to be rewritten? Are they the same ones as stored in this file?
Yep, those are the ones. Just gunzip it and you can edit the JSON file. It should be under a new name, though (+ probably under a separate branch as well). This is a lot of work, but any help would be greatly appreciated.
If needed, I can code up some tools to check which quotes are missing from the new one (or something like that).
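Such a check could be quite small. The following is only a sketch of the idea, not the actual tool: it assumes the original rows look like [author, title, text, text_id] and that rewritten entries carry the ID under a "text_id" key, which is an assumption about the new file's layout.

```python
def missing_text_ids(original_rows, rewritten_entries):
    """Return sorted text IDs present in the original database but not yet rewritten.

    Assumes original rows are [author, title, text, text_id] and that
    rewritten entries are dicts with a "text_id" key (hypothetical layout).
    """
    original_ids = {row[3] for row in original_rows}
    rewritten_ids = {
        entry["text_id"] for entry in rewritten_entries if "text_id" in entry
    }
    return sorted(original_ids - rewritten_ids)
```

The length of the returned list is then a simple measure of how much work is left.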
Hey, as mentioned above, I have created the rewritten.json in the specified fashion.
I wrote up a small script to convert the previous database to the format requested, with the exact same data for now.
So essentially, every sub-array in examples.json.gz of the form ["author-name", "title-name", "text-content", text-id] has been converted to:
{
"author": "author-name",
"title": "title-name",
"text": "text-content",
"text_id": text-id
}
This should serve as a good base to start the rewriting process. If you wish, I can make a PR for this commit to the main repository so that it is easy for anyone contributing to track the amount of work left and contribute in bits and pieces to the database of almost 5k entries.
Just wanted to run this by you, to make sure that this is the right way to go ahead.
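For reference, a conversion like the one described above can be sketched in a few lines. This is not the actual script from the commit, just an illustration of the mapping. The function names are invented here, and the "text_id" key follows the example above.

```python
import json


def rows_to_entries(rows):
    """Convert [author, title, text, text_id] rows to dicts with named fields."""
    return [
        {"author": author, "title": title, "text": text, "text_id": text_id}
        for author, title, text, text_id in rows
    ]


def dump_entries(entries):
    """Serialize entries as readable, diff-friendly JSON (non-ASCII kept as-is)."""
    return json.dumps(entries, ensure_ascii=False, indent=2)
```

Writing one field per line with a fixed indent keeps later diffs between versions of the file small.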
Thanks for the effort, but I would prefer to have rewritten.json only contain the rewritten, clean quotes. This is because of merging, diffing and so on.