docspell
[dsc] Ability to provide metadata
When using dsc upload, please allow me to provide correspondents, concerning entities, date, document title, etc., as I might have this data available to me at the time.
Of course, I can also write a script to use the API, but it seems dsc should really be able to do this out of the box.
In addition, a flag like --confirm-metadata would be really good. I have written a script to process a couple thousand documents for import to Paperless-NGX. The script extracts as much metadata as possible from the filesystem location and the filename of each file, so the result is "good enough" for a historic import.
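To illustrate the kind of extraction I mean (this is not the actual script, just a minimal sketch): it assumes a hypothetical layout of `<correspondent>/<YYYY-MM-DD> <title>.<ext>`, and the function name `guess_metadata` is made up for this example.

```python
import re
from datetime import date
from pathlib import PurePosixPath

# Assumed (hypothetical) naming scheme: "<YYYY-MM-DD> <title>"
_NAME = re.compile(r"^(?P<date>\d{4}-\d{2}-\d{2})[ _-]+(?P<title>.+)$")


def guess_metadata(path: str) -> dict:
    """Derive best-effort metadata from a file's location and name."""
    p = PurePosixPath(path)
    # Fall back to the bare file stem as title when nothing else matches.
    meta = {"correspondent": None, "date": None, "title": p.stem}

    # Assumption: the immediate parent directory names the correspondent.
    if p.parent.name:
        meta["correspondent"] = p.parent.name

    m = _NAME.match(p.stem)
    if m:
        try:
            meta["date"] = date.fromisoformat(m.group("date"))
            meta["title"] = m.group("title").strip()
        except ValueError:
            # Looked like a date but is not a valid one; keep the fallback.
            pass
    return meta
```

Anything the pattern does not recognise simply stays unset, which is what makes the result "best effort" rather than confirmed.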
It would completely overload the UI / make the confirmation process impossible for new documents if all of a sudden, thousands of new items appeared. Rather, I'd like to chuck them into a folder ("Historic import"), and/or tag them accordingly, so that they're all there on a best effort basis, even if they haven't had their metadata confirmed explicitly.
Over time, users will remove the tag once they have massaged the metadata, but the expectation is not that this will be done for all historic files, only for the new ones.
And yes, I know I could just edit the database. ;)
Giving this metadata per upload is a bit difficult. One needs to look up a person or organization first, to identify it. Or perhaps even create it first. It may be a bit unfriendly for a cli. Possible still, of course :-)
Regarding the --confirm-metadata option: maybe for this one-shot import, some basic post-processing using the cli would also do. For example: set a tag or folder for each upload and let the files process. Then search for them using the tag or folder and do your thing: confirm + add more tags etc.
Another option is to live with the "new" state (that's its purpose after all :)) and use a bookmark that excludes the historic data (which could be marked with a tag) for processing new documents.
Excellent idea in your last paragraph. I guess I didn't think this way because to me, those documents are not "new" nor "incoming", but they are of course from the point of view of Docspell.
I also didn't really consider the dashboard and search tiles to be my friend at this stage, because I have not found a way to sensibly manage them for the team (#2274), and I am not quite at the stage yet where I'll be massaging the data in PostgreSQL directly.
However, you are absolutely right: those documents are new to Docspell, they should have their metadata vetted still, and thus you're right, they should not be --confirmed at CLI stage.
With respect to the other difficulties you highlight: yes, absolutely, I see those. And based on experience with similar tools, I see two viable approaches:
- The CLI grows e.g. a correspondent interface, allowing me to define correspondents (I would only accept well-defined JSON, or maybe vCard), allowing me to search correspondents, to get their ID, and then using that in the next CLI call;
- The CLI could attempt a match for whatever I provide, and if there's more than one result in the returned set, then fail.
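The second approach could look roughly like the sketch below. It is only an illustration under stated assumptions: `Correspondent` and `resolve_unique` are hypothetical names, not part of dsc or the Docspell API, the candidates are assumed to have been fetched already, and the case-insensitive substring match is just one possible matching rule. Treating zero matches as a failure as well is also my assumption.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Correspondent:
    # Hypothetical minimal record: an opaque ID plus a display name.
    id: str
    name: str


def resolve_unique(query: str, candidates: Iterable[Correspondent]) -> Correspondent:
    """Return the single correspondent matching `query`, or fail loudly."""
    needle = query.strip().casefold()
    matches = [c for c in candidates if needle in c.name.casefold()]

    if not matches:
        raise LookupError(f"no correspondent matches {query!r}")
    if len(matches) > 1:
        names = ", ".join(c.name for c in matches)
        # Refuse to guess: the caller has to narrow the query.
        raise LookupError(f"{query!r} is ambiguous, matches: {names}")
    return matches[0]
```

Failing on ambiguity instead of picking the first hit keeps a bulk import from silently attaching documents to the wrong organization; the caller can then fall back to the explicit search-then-ID flow of the first approach.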
Please consider this low-prio from my end, because our bulk-import will need to go via the API anyway, and I might even go as far as to write a Python API wrapper for Docspell to do it ;)
Yes, those two points are perfect. The idea was to extend the cli anyway (exactly like that, as a wrapper for the rest api); the only problem is finding the time 😄