Custom Metadata

Open APraxx opened this issue 3 years ago • 29 comments

Hi,

thanks for your work on paperless-ng! I really appreciate it.

Do you plan on adding some kind of additional metadata, like keywords, labels, or a custom metadata type (e.g. an invoice number text field)?

It would help when searching for documents where tags aren't a good fit for this data. I don't know if it fits your vision for the project, but I hope it does.

Greetings

APraxx avatar Jan 04 '21 15:01 APraxx

I thought about this as well and would probably do it in the following way:

  • Allow document types to define additional metadata fields, each with a name and a type (number, string, date, etc)
  • Allow the user to edit metadata fields on documents depending on which document type is set.
  • In addition to that, some way to define metadata fields that are available on all documents, not just documents with a particular type.

This way, you can have 'invoice nr', 'price', 'product name', 'paid on' for invoices, 'keywords' on all documents, and 'balance before', 'balance after' on bank statements.
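To make the proposal above concrete, here is a minimal sketch of field definitions that are either bound to a document type or available everywhere. The names `FieldDefinition`, `FIELDS` and `fields_for` are hypothetical and not part of paperless-ng; the field names come from the example above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldDefinition:
    """A hypothetical custom field: a name, a data type, an optional document type."""
    name: str
    data_type: str                       # e.g. "number", "string", "date"
    document_type: Optional[str] = None  # None means "available on all documents"


FIELDS = [
    FieldDefinition("invoice nr", "string", "invoice"),
    FieldDefinition("price", "number", "invoice"),
    FieldDefinition("product name", "string", "invoice"),
    FieldDefinition("paid on", "date", "invoice"),
    FieldDefinition("balance before", "number", "bank statement"),
    FieldDefinition("balance after", "number", "bank statement"),
    FieldDefinition("keywords", "string"),
]


def fields_for(document_type: Optional[str]) -> list:
    """Return the names of the fields editable on a document of the given type."""
    return [
        f.name
        for f in FIELDS
        if f.document_type is None or f.document_type == document_type
    ]


print(fields_for("invoice"))
print(fields_for(None))
```

A document without a type only sees the global 'keywords' field, while an invoice sees the four invoice fields plus 'keywords'.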

However, I'm not entirely sure about the benefit of that. This requires a lot more manual editing of your documents, and I'm not sure it will really help you find documents more easily.

Need some additional thoughts on this.

jonaswinkler avatar Jan 04 '21 16:01 jonaswinkler

Maybe an additional multi line text field called "notes" is all we need. I'd add that to the search index, so that you'll find documents when searching for words appearing in their notes.

jonaswinkler avatar Jan 07 '21 13:01 jonaswinkler

I have a similar use case that might be fulfilled with a general type of custom metadata, but even better by the proposed defined metadata fields (price). In my country we have to do rather specific tax declarations. For any given store I need to write:

  1. The tax number of the store
  2. How many invoices I have received from this store (in a specific period)
  3. The total amount of money spent in this store (in a specific period)

I already get close to that when I search for my "deductible-food" tag and "Total" or any value that might be used by that vendor. But the search doesn't allow me to restrict results to a specific vendor (except by extending the search query, of course, but matching the exact correspondent that way is somewhat cumbersome; selecting from the dropdown sounds way more fun).

So for me something like AWS tags would be perfect: having a tag "deductible-food" and (in our case optionally) assigning it a "content".

Today I can already use the document view for all the invoices with deductible-food, a single correspondent (which I have assigned to the stores) and thus get the number of invoices. But if there was a way of easily getting one specific field (might be document metadata or tag-content) in that view I could just scroll down my list and take the values out, or even better use the rest api to sum them all up.

But I wouldn't mind adding a custom metadata field for that. As long as there is a specific place where I can later find the total value of the invoice (or possibly two values if the invoice includes two different types of deductibles) I will be happy.
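As a rough illustration of the "sum them all up" idea, the sketch below counts invoices and adds up totals per store. The record layout (`correspondent`, `tags`, `total`) is an assumption made for this example and does not describe the actual paperless-ng REST API.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical records, shaped like what a document list might return once
# a per-document "total" value exists. None of these keys are a real API.
documents = [
    {"correspondent": "Store A", "tags": ["deductible-food"], "total": "12.50"},
    {"correspondent": "Store A", "tags": ["deductible-food"], "total": "7.25"},
    {"correspondent": "Store B", "tags": ["deductible-food"], "total": "30.00"},
    {"correspondent": "Store B", "tags": ["other"], "total": "99.99"},
]


def summarize(docs, tag):
    """Count invoices and sum their totals per correspondent for one tag."""
    summary = defaultdict(lambda: {"count": 0, "total": Decimal("0")})
    for doc in docs:
        if tag in doc["tags"]:
            entry = summary[doc["correspondent"]]
            entry["count"] += 1
            entry["total"] += Decimal(doc["total"])
    return dict(summary)


report = summarize(documents, "deductible-food")
for store, entry in sorted(report.items()):
    print(store, entry["count"], entry["total"])
```

`Decimal` is used instead of `float` so that money amounts add up without rounding surprises.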

So for my specific use case:

  • general "notes" on Documents would work ok
  • Document-type specific metadata fields would be great (I put price(int) to invoice, length(time duration) to contract, etc.)
  • document specific tag-content would be awesome (although I cannot yet make up the UI for that)

WhiteHatTux avatar Jan 08 '21 23:01 WhiteHatTux

Maybe an additional multi line text field called "notes" is all we need.

This is something that I considered requesting but read this first.

2nd, having the option to add document specific custom titled fields would be useful:

Doc-type = Invoice
Field Name = Value
Invoice # = INV1234
Tax Rate = 7.0%
Total = $1.99

But I also see the added complexity this might add, both to the front end and back end. Whether a document shows the custom fields based on the type, assigning labels to the fields, how many allowable custom fields, etc. And how much of a use case is there to add all of this. You basically can already list out searchable values by just adding text to the Content section.

I think just adding a multi-line text field called "notes" under the Details tab would be the simplest and provide a well added benefit. My use case for "notes" would be as follows:

  • Review documents tagged "Needs Reviewed"
  • Edit the document
  • Uncertain about certain tags that might need to be applied or uncertain about a date created.
  • Write a note to self in the notes field "document was created between 01/02/2021 to 01/07/2021 refer to an email sent from Sam"
  • Leave the "Needs Reviewed" tag to follow up at a later time.

Adding notes could also be useful in a case where multiple people might review documents, where one person makes edits but doesn't finish, and another person finishes reviewing based on notes left by the 1st reviewer.

d8sychain avatar Jan 11 '21 08:01 d8sychain

Maybe an additional multi line text field called "notes" is all we need. I'd add that to the search index, so that you'll find documents when searching for words appearing in their notes.

I think this would be sufficient and wouldn't add too much complexity.

More fine-grained fields could be used to order, sort or filter, but they add complexity to the project and the database.

APraxx avatar Jan 11 '21 08:01 APraxx

I'm not too concerned about complexity of the database and back end (paperless is pretty simple). I'm much more concerned about these new additions getting in the way of users who don't need them / don't want to use them.

I think just adding a multi-line text field called "notes" under the Details tab would be the simplest and provide a well added benefit.

With custom metadata fields, you could also tell paperless to add a "notes" field to all documents. That way, many different groups of users would be happy:

  • Users who don't need metadata fields at all
  • Users who just want the notes field
  • Users who want custom fields on certain document types

I'll consider adding that, but I'll have to think about how to integrate that with certain parts of paperless first (especially the full text search and filtering tools). That's also a pretty big change, so don't expect that to appear soon.

jonaswinkler avatar Jan 11 '21 11:01 jonaswinkler

According to the current voting results, this is the most requested feature. After some head scratching and reading through the comments here again, this is what I'd like to do:

  • Ability to define custom fields. Custom fields have a type (long text for notes, short text for invoice numbers, number for prices, date, yes/no, etc). By assigning custom fields to document types, they'll only show up on documents with that specific type. Custom fields with no document type will show up on all documents.
  • There will be an additional menu entry in the "Manage" section called "Custom fields".
  • On the document edit page, these fields will appear depending on which document type is selected.
  • Full text searching will include custom text fields. I'm not sure whether I'll be able to provide custom searching for specific fields (queries such as notes:"my note" or paid:yes)
  • Document lists will have filters for certain custom fields: text filtering for text fields, and possibly range filters for numbers.

And further down the road:

  • Ability to add custom fields as columns to the document table list.
  • Ability to save the current column setup with saved views. With this, you could add saved views such as "Invoices 2020" and have that display the columns "Title", "Correspondent", "Created", "Total (number)", "Paid (yes/no)".

Impact:

  • If you don't wish to use that, the only thing you'll see is the new "custom fields" menu entry and everything else will remain the same.
  • It's possible to use this feature for notes on all documents.
  • This will also address #223.
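The list filters mentioned in the plan above (text filtering for text fields, range filters for numbers) could behave roughly like this. It is only an in-memory sketch with made-up documents; `text_filter` and `range_filter` are hypothetical names, and a real implementation would filter in the database.

```python
from typing import Optional

# Hypothetical documents with their custom field values already typed.
documents = [
    {"title": "Invoice 1", "fields": {"notes": "paid by card", "total": 19.99}},
    {"title": "Invoice 2", "fields": {"notes": "Paid in cash", "total": 250.0}},
    {"title": "Letter", "fields": {"notes": "no reply needed"}},
]


def text_filter(docs, field, needle):
    """Keep documents whose text field contains the needle (case-insensitive)."""
    needle = needle.lower()
    return [d for d in docs if needle in str(d["fields"].get(field, "")).lower()]


def range_filter(docs, field, minimum: Optional[float] = None,
                 maximum: Optional[float] = None):
    """Keep documents whose numeric field lies within [minimum, maximum]."""
    result = []
    for d in docs:
        value = d["fields"].get(field)
        if value is None:
            continue  # documents without the field never match a range filter
        if minimum is not None and value < minimum:
            continue
        if maximum is not None and value > maximum:
            continue
        result.append(d)
    return result


print([d["title"] for d in text_filter(documents, "notes", "paid")])
print([d["title"] for d in range_filter(documents, "total", minimum=100)])
```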

jonaswinkler avatar Jan 23 '21 13:01 jonaswinkler

Sounds like a big job, let me know if I can help anywhere.

shamoon avatar Jan 23 '21 16:01 shamoon

Hey @jonaswinkler,

as promised in the discussion, here is my initial attempt at metadata.

diff --git a/src/documents/models.py b/src/documents/models.py
index 86878dd7..5f317b57 100755
--- a/src/documents/models.py
+++ b/src/documents/models.py
@@ -487,3 +487,43 @@ class FileInfo:
                 cls._mangle_property(properties, "created")
                 cls._mangle_property(properties, "title")
                 return cls(**properties)
+
+
+class MetaDataType(models.Model):
+    DATA_TYPES = [
+        (0, _("date")),
+        (1, _("date-time")),
+        (2, _("number")),
+        (3, _("check")),
+        (4, _("text")),
+    ]
+
+    name = models.CharField(_("type name"), max_length=128, primary_key=True)
+    type = models.PositiveSmallIntegerField(_("data type"), choices=DATA_TYPES)
+    precision = models.SmallIntegerField(_("precision"), blank=True, null=True)
+
+    class Meta:
+        verbose_name = _("meta-data type definition")
+        verbose_name_plural = _("meta-data type definitions")
+
+
+class DocumentMetaData(models.Model):
+    name = models.CharField(_("name"), max_length=128)
+    value = models.CharField(_("value"), max_length=128, blank=True, null=True)
+    type = models.ForeignKey(
+        MetaDataType,
+        on_delete=models.CASCADE,
+        related_name="values",
+        verbose_name=_("meta data type")
+    )
+    parent = models.ForeignKey(
+        Document,
+        on_delete=models.CASCADE,
+        related_name="meta_data",
+        verbose_name=_("parent document")
+    )
+
+    class Meta:
+        verbose_name = _("meta-data")
+        verbose_name_plural = _("meta-data")
+        ordering = ("name",)

Not tested yet and no integration, but the idea is to store the meta-data as an EAV. The alternative is to use a JSONField, but my concerns are dates that need to be converted by the frontend anyway. The third alternative I looked into is a custom field-type but then we need to maintain this mess and I'm not sure it's worth the effort.
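For readers unfamiliar with EAV (entity-attribute-value): every value is stored as a string next to its type, and has to be converted when it is read. The sketch below shows that conversion step in plain Python. The type names mirror `DATA_TYPES` from the diff, while the converters, the `"true"` encoding for checks and the row layout are assumptions made for this example.

```python
from datetime import date, datetime

# Type names mirror DATA_TYPES in the diff; the converters are assumptions.
CONVERTERS = {
    "date": date.fromisoformat,
    "date-time": datetime.fromisoformat,
    "number": float,
    "check": lambda raw: raw == "true",
    "text": str,
}

# EAV rows: (document id, attribute name, data type, raw string value)
rows = [
    (1, "invoice nr", "text", "INV1234"),
    (1, "total", "number", "1.99"),
    (1, "paid", "check", "true"),
    (1, "paid on", "date", "2021-01-28"),
    (2, "total", "number", "42"),
]


def load_metadata(document_id, eav_rows):
    """Collect one document's rows into a dict of typed Python values."""
    return {
        name: CONVERTERS[data_type](raw)
        for doc_id, name, data_type, raw in eav_rows
        if doc_id == document_id
    }


metadata = load_metadata(1, rows)
print(metadata)
```

The trade-off is visible here: the storage is very flexible, but the database itself only ever sees strings, so sorting or range filtering on numbers and dates needs extra work.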

As I'm neither very experienced with django nor Angular I'm happy to get your opinion on which approach you like best!?

DocLambda avatar Jan 28 '21 10:01 DocLambda

I'm considering using model inheritance and defining multiple DocumentMetaDataX classes for different types. Not sure yet if that works.

jonaswinkler avatar Jan 28 '21 10:01 jonaswinkler

Interesting. Are you working on this already or does it make sense to look into it from my side? My intention is to let the user define metadata and then allow the manual editing first. Later I want to fill those fields by processors e.g. the one parsing the filename or some text-parsing or text coming from zonal information (i.e. you define a template for a document group telling you where to find which information) like e.g. here.

DocLambda avatar Jan 28 '21 11:01 DocLambda

No code written by me yet, since I'm busy on #415.

My intention is to let the user define metadata and then allow the manual editing first.

Sounds good!

Later I want to fill those fields by processors e.g. the one parsing the filename or some text-parsing or text coming from zonal information

This is an entirely different feature and we'd need to think about how this templating system should work, and how we select the proper template for any given document. However, it's good to keep that in mind when building this metadata thing.

Some notes and directions on implementations regarding this:

  • Please make your progress available somewhere. You can also make (draft) PRs into the feature-custom-metadata branch.
  • I'd really like to have typing on the API (booleans returned as true / false, numbers returned as numbers, dates formatted in the way "created" is formatted as well, etc)
  • I'd also like to have typing on the corresponding data interfaces in the Angular application. (e.g. no parsing of strings in the front end)
  • I'm not sure yet about the best way to implement this. We might have to go through more than one iteration with this.
    • We also might end up using a json field. Honestly, that doesn't sound too bad. I'll have to look into it.
  • Type probably needs to be read only after creating a custom field.
  • Unless absolutely required, no hacks of any sort in the back end.
  • You'll eventually need to unit test the back end code as well.
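The typing requirement for the API can be illustrated with the standard `json` module: booleans, numbers and nulls already map to JSON types, and only dates need a string format. Using ISO 8601 for dates is an assumption here; the thread only says they should be formatted like "created". `to_api_value` is a hypothetical helper, not an existing serializer.

```python
import json
from datetime import date, datetime


def to_api_value(value):
    """Convert a typed Python value into something json.dumps emits with the right JSON type."""
    if isinstance(value, (date, datetime)):
        return value.isoformat()  # assumption: ISO 8601 strings for dates
    return value  # bool, int, float, str and None already map to JSON types


fields = {"paid": True, "total": 19.99, "paid on": date(2021, 1, 28), "notes": None}
payload = json.dumps({name: to_api_value(v) for name, v in fields.items()})
print(payload)
```

The front end then receives `true` rather than `"true"` and `19.99` rather than `"19.99"`, so no string parsing is needed for booleans and numbers.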

jonaswinkler avatar Jan 28 '21 12:01 jonaswinkler

Ok playing around a bit with inheritance I found that the problem is to handle the relation to the document. Each child type will get an own related_name (see "Be careful with related_name and related_query_name section" here). So to get all meta-data of a document you will need to collect all those sets (one for each type) and we will have quite some types (text, multi-line text, number, dates, time, dates with times, currency?, booleans).

Alternatively, we can make each meta-data entry one model with X columns (8 currently), one for each supported type (and maybe some for prefix, suffix, formatting hints, etc). If you add a meta-data entry you select the column depending on the type, all others are null. Would this also work for you?
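A plain-Python sketch of that "one row, one column per type" layout might look as follows. Only four of the value columns are shown, and the class and column names are hypothetical; in Django this would be a model with nullable fields rather than a dataclass.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Union

Value = Union[str, float, bool, date]


@dataclass
class MetaDataEntry:
    """One row per metadata entry; exactly one value_* column is non-null."""
    name: str
    value_text: Optional[str] = None
    value_number: Optional[float] = None
    value_check: Optional[bool] = None
    value_date: Optional[date] = None

    @classmethod
    def create(cls, name: str, value: Value) -> "MetaDataEntry":
        # bool is tested before the numeric types because bool is a subclass of int
        if isinstance(value, bool):
            return cls(name, value_check=value)
        if isinstance(value, (int, float)):
            return cls(name, value_number=float(value))
        if isinstance(value, date):
            return cls(name, value_date=value)
        return cls(name, value_text=str(value))

    @property
    def value(self) -> Optional[Value]:
        """Return whichever column is populated."""
        for candidate in (self.value_text, self.value_number,
                          self.value_check, self.value_date):
            if candidate is not None:
                return candidate
        return None


entry = MetaDataEntry.create("total", 5)
print(entry.value_number, entry.value_text)
```

Because each value lives in a column of its native type, the database can compare and sort it directly, at the cost of several empty columns per row.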

The alternative is to use the JSONField package. There we have to write a custom encoder/decoder for handling dates, but that shouldn't be too hard. The ugly thing here is that you always get a dict (or list in our case) as the field value and you have to take care to manage the content in the code. If you can live with that, I would rather go with this one as it seems the most extensible approach.
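The custom encoder/decoder for dates can be sketched with the standard library alone. This is not the JSONField package's API; `MetaDataEncoder`, `decode_metadata` and the `__type__` tag are assumptions chosen for the example, and only plain `date` values are handled.

```python
import json
from datetime import date


class MetaDataEncoder(json.JSONEncoder):
    """Encode dates as tagged objects so the decoder can restore them."""

    def default(self, o):
        # Only plain dates are handled; datetimes fall through to the error below.
        if type(o) is date:
            return {"__type__": "date", "value": o.isoformat()}
        return super().default(o)


def decode_metadata(obj):
    """object_hook: turn tagged objects back into date instances."""
    if obj.get("__type__") == "date":
        return date.fromisoformat(obj["value"])
    return obj


original = [
    {"name": "paid on", "value": date(2021, 1, 28)},
    {"name": "total", "value": 1.99},
]
stored = json.dumps(original, cls=MetaDataEncoder)
restored = json.loads(stored, object_hook=decode_metadata)
print(stored)
print(restored == original)
```

The type tag stays inside the stored JSON, so code that reads the field gets a `date` back without knowing how it was encoded.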

Last but not least, I could create the custom field (sounds like an interesting problem) that is explicitly typed. I would probably inherit from the JSONField above and convert the value to its original type, hiding the explicit type information in the JSON so that it is not exposed to the user.

Please let me know what sounds acceptable for you, so I can prototype something (which I will share of course). I don't mind starting over, I just want to know your opinion regarding the starting point. :-)

DocLambda avatar Jan 28 '21 13:01 DocLambda

Ok playing around a bit with inheritance I found that the problem is to handle the relation to the document. Each child type will get an own related_name (see "Be careful with related_name and related_query_name section" here). So to get all meta-data of a document you will need to collect all those sets (one for each type) and we will have quite some types (text, multi-line text, number, dates, time, dates with times, currency?, booleans).

If at all, you'd want to use inheritance with a non-abstract base class (read further down in the linked article; there are two modes for mapping inheritance to databases). That way, the base class receives identifiers in the database as well, which we then link to documents.

See https://github.com/jonaswinkler/paperless-ng/commit/5e79429dfb634b6469f26e32f49f82f258b28107 for some initial experiments with working examples.

The thing is:

  • We need to have database-level support for filtering and sorting. Eventually people want to filter the document list for documents containing certain metadata, and the API needs to support that.
    • This means that it must be possible to do Document.objects.filter(....=...) and Document.objects.order_by(...). See the linked commit, that works over there.
    • Reason: The filtering and sorting capabilities of the Django REST framework are very flexible and configurable, but they ultimately operate on Django querysets.
  • Disregard the class names, this is an experiment.

=> I'm not sure whether this is possible with a JSON field. If so, great. If not, we need to use this.

Also, doing this isn't all that much different from just storing the values of different types in different columns in a single database table (your second suggestion). And yes, that would work out for me. Honestly, this inheritance code seems pretty bad, and the only benefit we gain is that we don't store lots of empty database fields for each metadata field. On the other hand, lots of added relations, and I believe that this will impact performance quite a bit (just a gut feeling).
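To show what "database-level support for filtering and sorting" means with the single-table layout, here is a small sketch using `sqlite3` from the standard library instead of the Django ORM. The table and column names are hypothetical and do not match the linked experiment.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE document (id INTEGER PRIMARY KEY, title TEXT);
    -- Hypothetical single-table layout: one nullable column per value type.
    CREATE TABLE metadata (
        document_id INTEGER REFERENCES document(id),
        name TEXT,
        value_text TEXT,
        value_number REAL
    );
    INSERT INTO document VALUES (1, 'Invoice A'), (2, 'Invoice B'), (3, 'Invoice C');
    INSERT INTO metadata VALUES
        (1, 'total', NULL, 250.0),
        (2, 'total', NULL, 19.99),
        (3, 'total', NULL, 75.5),
        (3, 'notes', 'paid late', NULL);
    """
)

# Filter and sort inside the database rather than in Python.
rows = conn.execute(
    """
    SELECT d.title, m.value_number
    FROM document AS d
    JOIN metadata AS m ON m.document_id = d.id
    WHERE m.name = ? AND m.value_number >= ?
    ORDER BY m.value_number
    """,
    ("total", 50),
).fetchall()
print(rows)
```

Because `value_number` is a real numeric column, the comparison and the ordering are numeric. With values stored as strings, the same query would compare text instead.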

We need to have some custom serializers for the API anyway, and these deal with presenting the data from the database in our desired format (still needs to be defined).

jonaswinkler avatar Jan 28 '21 15:01 jonaswinkler

Hey Doc, did you make any progress on adding additional metadata fields? Greetz

tvhdev avatar Apr 30 '21 20:04 tvhdev

Just read through the thread, this is very much what I would love to see in paperless-ng. Don't want to be greedy, but there is another custom meta field type I'd like to ask for: "due date" or "follow-up date".

I understand that I could very well add a custom meta data field of type "date" with label "due on" once #274 is completed, but this is only part of the idea.

The magic would need to be in the search feature, being able to find all documents that have a due date in the next couple of days, or are due today, or are "overdue", ...

Having this feature would allow me to create a view with all invoices that are not yet paid and are already overdue or will be due in the next five days, for example. Then I could get rid of the follow-up file on my desk :)
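The view described above boils down to a date comparison. A minimal sketch, assuming each document carries a hypothetical `due_on` date field and a `paid` yes/no field:

```python
from datetime import date, timedelta

# Hypothetical invoices with a "due on" date field and a "paid" yes/no field.
invoices = [
    {"title": "Electricity", "due_on": date(2021, 5, 20), "paid": False},
    {"title": "Internet", "due_on": date(2021, 6, 2), "paid": False},
    {"title": "Rent", "due_on": date(2021, 5, 25), "paid": True},
    {"title": "Insurance", "due_on": date(2021, 7, 1), "paid": False},
]


def needs_attention(docs, today, days_ahead=5):
    """Return unpaid documents that are overdue or due within days_ahead days."""
    horizon = today + timedelta(days=days_ahead)
    return [d for d in docs if not d["paid"] and d["due_on"] <= horizon]


for doc in needs_attention(invoices, today=date(2021, 5, 30)):
    print(doc["title"], doc["due_on"])
```

Overdue and "due soon" share one condition here, since both mean the due date is on or before the horizon.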

e-patrick avatar May 30 '21 07:05 e-patrick

Not sure where this is at, but it's exactly what I need to make paperless-ng work for what I'm trying to do. Looking forward to seeing it.

smseidl avatar Jun 06 '21 03:06 smseidl

I don't know if this is the right place, but it would be awesome if those fields could be filled automatically, for example by providing a regex with capture groups.
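One way to read this suggestion: each named capture group in a rule becomes a field of the same name. The rule and the sample text below are invented for illustration and are not an existing paperless-ng feature.

```python
import re

# Hypothetical rule: each named group becomes a custom field of the same name.
INVOICE_RULE = re.compile(
    r"Invoice\s*(?:No\.?|#)\s*(?P<invoice_number>[A-Z0-9-]+)"
    r".*?Total:?\s*\$?(?P<total>\d+\.\d{2})",
    re.DOTALL,
)


def extract_fields(content, rule):
    """Return the named groups of the first match, or an empty dict."""
    match = rule.search(content)
    return match.groupdict() if match else {}


text = "ACME Corp\nInvoice No. INV-1234\nWidgets x3\nTotal: $41.97\n"
print(extract_fields(text, INVOICE_RULE))
```

Extracted values arrive as strings, so they would still need converting to the type of the target field.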

dasbaumwolltier avatar Jul 24 '21 09:07 dasbaumwolltier

I'm just voicing my wish for this as well! It is the thing I miss the most from when I briefly tested out Papermerge. I haven't managed to get it to work very well for me, and I'd much prefer to have it in Paperless instead, in an automated fashion based on rules/ML.

iwconfig avatar Jul 25 '21 16:07 iwconfig

A follow-up date with the option to show all upcoming documents on the start page would be awesome! So I'm a fan of this request too!

NoirPi avatar Aug 17 '21 19:08 NoirPi

@jonaswinkler - not trying to be pushy (ok maybe I am - lol). Do you have any idea when this might be delivered? I'm creating lots of tags that I could remove if this feature is implemented. I don't have time for coding, but I would be more than willing to be a beta tester.

smseidl avatar Aug 21 '21 04:08 smseidl

I don't think there has been any progress on this at all, yet.

jonaswinkler avatar Aug 21 '21 10:08 jonaswinkler

Just one use case where custom fields could be important: When you buy things you'll usually have a warranty coming along with it. This info sometimes is only printed on the packaging. Instead of using tags such as "5 year warranty", "10 year warranty" one could have a custom field and filter for it in the future.

ghost avatar Sep 26 '21 10:09 ghost

Looking forward to this. Our digital auditing needs a review date and initials.

ynpmoose avatar Oct 17 '21 21:10 ynpmoose

Hi, I hope this feature gets implemented soon. How is the development going?

enerschnutz avatar Dec 29 '21 21:12 enerschnutz

As I am knee-deep in taxes, it would be great to have an amount field. Even if I need to manually populate it, it will make my life much easier.

4lowki4 avatar Jan 27 '22 23:01 4lowki4

Here is my use case: I have about 40 to 50 invoices from my physiotherapist. I can send them to my health insurance to get some money back. The problem is that I have not marked which of them I have already claimed. So it would be very useful to have a list of the invoices with the invoice number as a column / metadata field. Then I could check in my insurance app which of them have already been claimed and which I forgot.

Would be amazing to use the implemented recognition algorithm to extract it automagically.

I think papermerge is implementing this feature atm.

hurr1k4ne avatar Feb 03 '22 18:02 hurr1k4ne

Hey @jonaswinkler!

First of all, thank you so much for all your work on paperless-ng. I haven't gotten deep into using it yet, but I am already thoroughly enjoying it.

What you said here

  • Ability to define custom fields. Custom fields have a type (long text for notes, short text for invoice numbers, number for prices, date, yes/no, etc). By assigning custom fields to document types, they'll only show up on documents with that specific type. Custom fields with no document type will show up on all documents.
  • There will be an additional menu entry in the "Manage" section called "Custom fields".
  • On the document edit page, these fields will appear depending on which document type is selected.
  • Full text searching will include custom text fields. I'm not sure whether I'll be able to provide custom searching for specific fields (queries such as notes:"my note" or paid:yes)
  • Document lists will have filters for certain custom fields: text filtering for text fields, and possibly range filters for numbers.

is exactly what I was looking for before finding this thread, and I'd like to throw in my 2c about how I would use this feature.

I'm combining paperless-ng with the Johnny Decimal System, and as such use a naming scheme like this:

image

So I can get paperless to file it like this:

PAPERLESS_FILENAME_FORMAT={tag_list}/{document_type}/{correspondent}/{title}

On the one hand, this is great as it mostly meets my needs, but at the same time it completely destroys the power of tags, since it means I can only ever assign a single tag.
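The single-tag limitation can be seen by expanding the format string by hand. This sketch only imitates the placeholders with `str.format`; joining the tags with commas and the tag name "12.03 Receipts" are assumptions for illustration, and this is not how paperless-ng itself renders file names.

```python
FILENAME_FORMAT = "{tag_list}/{document_type}/{correspondent}/{title}"


def render_path(tags, document_type, correspondent, title):
    """Expand the placeholders; joining tags with commas is an assumption."""
    return FILENAME_FORMAT.format(
        tag_list=",".join(sorted(tags)),
        document_type=document_type,
        correspondent=correspondent,
        title=title,
    )


print(render_path(["12.03 Receipts"], "Invoice", "ACME", "March order"))
print(render_path(["12.03 Receipts", "unpaid"], "Invoice", "ACME", "March order"))
```

As soon as a second tag is added, the top-level folder name changes, which is why the scheme only works with exactly one tag per document. A dedicated custom field for the category would free tags for their normal use.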

I hope this feature comes soon as I think it will add a crazy amount of power to everyone using paperless-ng :)

Hellac avatar May 02 '22 08:05 Hellac

Sorry for the feature request bump, but this is exactly what I am looking for. I would love to have metadata fields on specific document types. Here is an example:

Document type: Receipt
Additional fields: Total, Tax, Payment method, etc.

Right now I'm tagging the receipts with payment method and putting the total amount in the title, but this would be very helpful.

Another thing to consider is the ability to populate the meta data from the content. It would be a highly advanced feature but if I could regex match something like:

Tax\s?[:]?\s?[\$]?\s?((\d{1,3}(\,\d{3})*|(\d+))(\.\d{2}))?

It would potentially match the tax value on the receipt and help speed up the process.
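For reference, this is how the pattern above behaves when run through Python's `re` module. The pattern is copied as posted; `extract_tax` and the sample strings are hypothetical.

```python
import re
from decimal import Decimal

# The pattern exactly as posted in the comment above.
TAX_PATTERN = re.compile(r"Tax\s?[:]?\s?[\$]?\s?((\d{1,3}(\,\d{3})*|(\d+))(\.\d{2}))?")


def extract_tax(content):
    """Return the captured tax amount as a Decimal, or None if no amount was captured."""
    match = TAX_PATTERN.search(content)
    if match is None or match.group(1) is None:
        return None
    # Thousands separators are removed before converting to a number.
    return Decimal(match.group(1).replace(",", ""))


print(extract_tax("Subtotal 10.00\nTax: $1,234.56\nTotal 1,244.56"))
print(extract_tax("Tax: 12"))
```

Two things to keep in mind when tuning such a rule: the amount group is optional, so the pattern also matches a bare "Tax" (for example inside "Taxable") without capturing anything, and amounts without two decimal places, such as "Tax: 12", are not captured.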

ajquick avatar May 19 '22 23:05 ajquick