[Feature Request] Change language of a document
Hi there, After uploading a document, it's currently impossible to change the language that was assigned to it during upload. Be it because of a simple mistake, the source not supporting it (you wouldn't actually go and set up a source for every language, would you?) or because that language didn't exist in Docspell when you uploaded it (wink, wink), I think it's a necessary feature to save the hassle of deleting and re-uploading a bunch of documents :)
Thanks!
I actually have a source per language. But it's only for two languages :)
You can reprocess the files, which will reuse the language from the previous run. If you want to change all files to Japanese, you could use an SQL query to update the language and then trigger reprocessing. (This is a workaround, because I don't know when I'm going to work on this.)
Hey, Could you elaborate on your first statement? What do you mean by "a source per language but only for two languages"?
And actually I have documents written in many languages. Some of them were Japanese and were uploaded before the support was added. I guess I'll have to look into querying the db directly, if this feature is not yet on your roadmap.
Do you have any recommendations on how to do that quick and dirty with docker? :smiling_imp:
> Could you elaborate on your first statement? What do you mean by "a source per language but only for two languages"?
Sure. I have several source urls and I have one for English and one for German documents. So it's not a lot of different languages and so it is nice to use different urls for upload. But I can see of course, that this doesn't scale well to many different languages.
> I guess I'll have to look into querying the db directly, if this feature is not yet on your roadmap.
> Do you have any recommendations on how to do that quick and dirty with docker?
Yes, the thing is that I may not find time to work on this. So I thought a quick way to solve this one case would already get you to a point where the feature isn't so urgent anymore. In the database there is a table attachmentmeta which contains a language column. This is the language that was used to process the corresponding document. The id is the id of the attachment.
So, if you want to change all documents from one language (say English) to Japanese, you can use an SQL statement like this:
update attachmentmeta set language = 'jpn' where language = 'eng';
If this is not possible, you can now use the dsc tool to search exactly for the ones you want to change and use the result to create these statements for specific attachments. An example:
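# List the attachment ids matching the query and emit one update per id
# (the id column is named attachid, as in the psql commands further down)
dsc -f json search 'date<2021-05' |
jq -r '.groups[].items[].attachments[].id' |
while read -r id; do
echo "update attachmentmeta set language = 'jpn' where attachid = '$id';"
done > /tmp/updates.sql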
Then run this sql script (after looking at it and backing up the db ;-)). With docker it should be the same as without: you need to connect to the postgres db and run the script. You could run the postgres container interactively or simply use any other machine with postgres client (psql) installed. I think it is then psql -h host -U user -W -f /tmp/updates.sql - but better check it again.
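The statement generation can also be sketched in Python, which makes it easier to sanity-check the ids before they end up in SQL. This is only a sketch under assumptions: `build_updates` is a hypothetical helper, the id pattern (letters, digits, dashes) and the three-letter lowercase language code are guesses based on the values seen in this thread, and the column name `attachid` is taken from the psql commands used later in this thread.

```python
import re

# Assumption: attachment ids consist of letters, digits and dashes only.
_ID_PATTERN = re.compile(r"[A-Za-z0-9-]+")
# Assumption: language codes are three lowercase letters, like 'eng' or 'jpn'.
_LANG_PATTERN = re.compile(r"[a-z]{3}")


def build_updates(attachment_ids, language):
    """Return one UPDATE statement per attachment id.

    Anything that does not look like a plain id or language code is
    rejected, so nothing unexpected is pasted into the SQL text.
    """
    if not _LANG_PATTERN.fullmatch(language):
        raise ValueError(f"unexpected language code: {language!r}")
    statements = []
    for attach_id in attachment_ids:
        if not _ID_PATTERN.fullmatch(attach_id):
            raise ValueError(f"suspicious attachment id: {attach_id!r}")
        statements.append(
            f"update attachmentmeta set language = '{language}' "
            f"where attachid = '{attach_id}';"
        )
    return statements
```

The ids could come from the same dsc and jq pipeline, one per line, and the returned statements could be written to a file such as /tmp/updates.sql for review before running them.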
After all that, you can trigger the route to reprocess the items or attachments. It might make sense to write a quick script for this.
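A quick script for this step might look roughly like the following. It only builds the HTTP request and does not send it. Everything specific here is an assumption that should be checked against the Docspell REST API documentation before use: the route path, the `X-Docspell-Auth` header name, and the JSON body with an `ids` list. `build_reprocess_request` is a hypothetical helper name.

```python
import json
import urllib.request


def build_reprocess_request(base_url, item_id, attachment_ids, auth_token):
    """Build (but do not send) a POST request asking for reprocessing.

    Assumptions: the route path, the auth header name and the body shape
    are guesses and need to be verified against the REST API docs.
    """
    url = f"{base_url.rstrip('/')}/api/v1/sec/item/{item_id}/reprocess"
    body = json.dumps({"ids": list(attachment_ids)}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Docspell-Auth": auth_token,
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. Keeping the build step separate makes it possible to print and inspect the URL and body first.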
Thanks for the thorough and awesome explanation. What I ended up doing was:
docker-compose exec <db-container> bash
psql -U <username> -d docspell -c "select attachid from attachment where name = '<filename>'"
psql -U <username> -d docspell -c "update attachmentmeta set language = 'jpn' where attachid = '<attachid-from-last-command>'"
Technically it's possible to do that in one go using a sub-query, but it's better to be extra careful when doing these things. Then I restarted the docspell, joex and solr containers.
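For reference, the "one go" variant with a sub-query could look like the sketch below. It runs against an in-memory SQLite database as a stand-in for Postgres, with two toy tables that contain only the columns used in the commands above (`attachid`, `name`, `language`); the real tables have more columns. `set_language_by_filename` is a hypothetical helper, and the sample rows are made up.

```python
import sqlite3


def set_language_by_filename(conn, filename, language):
    """Update the language of every attachment with the given file name.

    Uses a sub-query instead of looking up the attachid first.
    Returns the number of rows that were changed.
    """
    cur = conn.execute(
        "update attachmentmeta set language = ? "
        "where attachid in (select attachid from attachment where name = ?)",
        (language, filename),
    )
    return cur.rowcount


# Tiny in-memory stand-in for the two tables, just to show the effect.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table attachment (attachid text primary key, name text);
    create table attachmentmeta (attachid text primary key, language text);
    insert into attachment values ('a1', 'letter.pdf'), ('a2', 'invoice.pdf');
    insert into attachmentmeta values ('a1', 'eng'), ('a2', 'eng');
    """
)
changed = set_language_by_filename(conn, "letter.pdf", "jpn")
rows = conn.execute(
    "select attachid, language from attachmentmeta order by attachid"
).fetchall()
```

Checking the returned row count before committing is a cheap safeguard: if a file name matches more attachments than expected, the number shows it. Note that a Postgres driver may use a different placeholder style than the `?` used by sqlite3.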
However....
After confirming that it changed in the db and that the file IS the file attached to the item I'm looking at, I did a reprocess and... this is what I got: ocrmypdf -l eng .... Very strange, no? Maybe there's something else I need to do? Or maybe there's some cache that needs to be cleared before this "trick" will work?
oh strange yes! I need to look at the code. it's probably a bug. what is your collective's default language?
Edit: oh, sorry - it's using the language from the collective :( Sorry for all the noise then. But this is a bug and I'll try to fix it asap. It should use the language that was used when processing the file last time. Currently, you need to change your collective's language and then reprocess the item or attachment.
You mean the document language, right? Yes, it's English.
Sorry, I didn't notice the edit in your last message. I changed the collective language and initiated a reprocess and it worked 🎊 It's a somewhat reasonable compromise for this missing feature, I have to say... Much better than going around fiddling with the db.
I don't know the first thing about Scala (I work with Python and JS) but I think I'm gonna give it a go (provided I'm able to set up the dev environment 😅 ) - it's just adding an option to change the attachment language. Would you like me to open a separate bug report for taking the attachment language for OCR?
> I don't know the first thing about Scala (I work with Python and JS) but I think I'm gonna give it a go (provided I'm able to set up the dev environment 😅 ) - it's just adding an option to change the attachment language.
Yes, it is not much. But it's also not super easy :-) I think we need an endpoint that accepts an attachment id and a language and then changes the language in the attachmentmeta table. For new documents you can already add the language to the request (or to a source url), so it would only affect existing documents that one wants to reprocess. For this task, I'm not sure whether a GUI change in the webapp is necessary - maybe only the dsc tool could support this? Not sure. It might also be good to be able to change the language for a search query, not only for single attachments.
Another idea: instead of adding a new endpoint, simply add the language to the reprocess request. I think this might be even better, because I can't think of any other case where it would make sense to change the language the document was processed with. It's changing the history, and to something that's wrong :-). So I would favor supplying the language to the reprocessing request. Then no additional endpoint is needed and it is a little more straightforward to add GUI elements. wdyt?
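One way to read this proposal, combined with the intended behaviour mentioned earlier in the thread (reuse the language from the last processing run), is a simple priority order. The sketch below is purely illustrative and is not Docspell code; `resolve_language` and its parameter names are hypothetical.

```python
from typing import Optional


def resolve_language(requested: Optional[str],
                     previous: Optional[str],
                     collective_default: str) -> str:
    """Pick the language for a reprocessing run.

    Priority: the language supplied with the reprocess request, then the
    language stored from the previous run, then the collective's default.
    """
    for candidate in (requested, previous):
        if candidate:
            return candidate
    return collective_default
```

With this order, a request without a language behaves like the fixed version of today's reprocessing, and a request with a language overrides it for that run only.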
> Would you like me to open a separate bug report for taking the attachment language for OCR?
I think this issue here is a good fit actually. It's about changing the language when reprocessing files, right?