comicinfo
New element: Metadata_ID
Where does this come from?
Because when scraping metadata we can find the data on different websites. Currently ComicRack puts it in "tags" and in custom fields that are stored in its database rather than in the XML. Calibre has a field named "ID" for this purpose, which can hold several values from different sources.
What is the rationale for adding support for this element?
If I scrape a comic, I could have metadata from ComicVine, Bedetheque, Amazon, etc. (and of course the same applies to manga), and I would like to record the source for each site.
Is the element already handled by any application or tool?
As I said above, Calibre has it.
Isn't that the purpose of the Web element to store the url of where the metadata came from?
At least for ComicTagger, the Web element refers to the user-facing webpage and may not contain the ID. ComicVine happens to also store the ID in the URL, but that is purely an accident. Mylar, for example, typically parses the ID from the notes field.
I forgot to mention one of the main IDs, the ISBN. This and the others could be useful for finding duplicates easily and with high reliability.
I'm in favor of an addition like this. A simple way to achieve this could be:
<MetadataId>anilist:32346,cv:dko35235,mal:45345</MetadataId>
This keeps to the same scheme used in the existing fields and allows free-form input, so the schema doesn't need to add something new every time someone wants source X. It could even change to shorthand_source(url):id, but I think Web handles that already.
I forgot to mention one of the main IDs, the ISBN. This and the others could be useful for finding duplicates easily and with high reliability.
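To make the comma-separated form concrete, here is a minimal sketch of how a reader might split such a value into source/ID pairs. The `MetadataId` element is only the proposal from this thread, and the function name `parse_metadata_ids` is hypothetical.

```python
def parse_metadata_ids(value: str) -> dict[str, str]:
    """Split 'source:id,source:id' into a {source: id} mapping."""
    ids: dict[str, str] = {}
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        # Split on the first colon only, so an ID may itself contain colons.
        source, sep, identifier = entry.partition(":")
        if not sep or not source or not identifier:
            raise ValueError(f"malformed entry: {entry!r}")
        ids[source.strip().lower()] = identifier.strip()
    return ids


print(parse_metadata_ids("anilist:32346,cv:dko35235,mal:45345"))
```

One caveat with this form: an ID that contains a comma could not be represented without an escaping rule.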
This is already handled by the new GTIN element.
I have read the discussion about GTIN and I have seen some problems:
- Only allows one entry per comic
- Comics that do not have any identification except the one you can get from the websites that classify them are not taken into account.
There are old comics that don't even have an ISBN.
That is why I propose having one or several fields, depending on the site that classifies them, containing the ID or the URL of the page that has information about the comic.
And as I made clear in the name of the "Metadata ID" field, the information from where I obtained the metadata would be stored here.
There are a few concepts here which I think may be useful for this discussion:
Online Metadata Database Identification Numbers:
- ComicVine - widely used, often represented in Notes and Tags as [CVDB12456780] or [Issue ID 12345670]
- ComicVine Metadata Alternatives (Metatron or something).
- Comics.org - no api, no scraping allowed, so not widely used.
Trade Identification Numbers:
- GTIN is a supercategory for many types of trade numbers, such as ISBN-10 (US centric), ISBN-13 and others from other countries.
- ASIN is Amazon's alphanumeric product code. Seen in Notes & Tags as [ASIN123467890]
- Comixology had its own numbers as well, often encountered in Notes & Tags as [CMXDB1234567890]. These all redirect to ASINs now if you look for them with the Comixology URL scheme.
The ComicInfo.xml Web Field
This field is meant for URLs, but the most common URLs are ComicVine URLs, which are derivable from the CVDB number, e.g.: https://comicvine.gamespot.com/arbitrary-slug/4000-1234567 The "Web" field is from ComicRack, and according to the compatibility guidelines there should be only one entry.
The ComicInfo.xml GTIN Field
This field is new from the Anansi Project. I think it is currently limited to one entry by the spec.
Proposed Resolutions
Ideal
In an ideal world there might be an <Identifier type="CVDB">1234567890</Identifier>
field that could have any number of entries and be used for ComicVine numbers, GTIN, ISBN, ASIN, etc. As many as you like. And your client software could easily derive web links for each of them as the respective url formats are all simple.
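As a rough sketch of how client software might read such repeated elements, the following uses Python's standard `xml.etree.ElementTree`. The `Identifier` element and its `type` attribute are only the proposal above, not part of any released schema, and the sample document and `read_identifiers` name are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical ComicInfo.xml fragment using the proposed element.
SAMPLE = """<ComicInfo>
  <Identifier type="CVDB">1234567890</Identifier>
  <Identifier type="ISBN">9783161484100</Identifier>
</ComicInfo>"""


def read_identifiers(xml_text: str) -> dict[str, list[str]]:
    """Collect every <Identifier> child, grouped by its type attribute."""
    root = ET.fromstring(xml_text)
    result: dict[str, list[str]] = {}
    for element in root.findall("Identifier"):
        id_type = element.get("type")
        value = (element.text or "").strip()
        # Skip entries that lack a type or a value.
        if id_type and value:
            result.setdefault(id_type, []).append(value)
    return result


print(read_identifiers(SAMPLE))
```

Grouping into lists keeps the door open for more than one identifier of the same type.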
Simple
We already have the GTIN field. I'd suggest altering the ComicInfo.xml spec to formally allow more than one GTIN entry and overloading it to handle the general concept of an identifier. If you wanted to get fancy you could add a type attribute, but people are already encoding type with alphanumeric prefixes, so <GTIN>CVDB1234567890</GTIN> is fine and easy to decode by both humans and software.
<GTIN>CVDB1234567890</GTIN>
This is not a valid GTIN though.
In Readium WebPub Manifest the identifiers are URIs and use the URN format, like urn:isbn:9783161484100.
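For illustration, a small sketch of building and splitting identifiers in that URN style. Only the `isbn` namespace from the example above is used; any other namespace (for a metadata site, say) would be an assumption rather than a registered URN namespace. The helper names are hypothetical.

```python
def to_urn(namespace: str, value: str) -> str:
    """Build a URN-style identifier such as 'urn:isbn:9783161484100'."""
    return f"urn:{namespace.lower()}:{value}"


def from_urn(urn: str) -> tuple[str, str]:
    """Split a URN-style identifier into (namespace, value)."""
    # Limit the split so the value part may itself contain colons.
    parts = urn.split(":", 2)
    if len(parts) != 3 or parts[0].lower() != "urn" or not parts[1] or not parts[2]:
        raise ValueError(f"not a URN-style identifier: {urn!r}")
    return parts[1].lower(), parts[2]


print(to_urn("ISBN", "9783161484100"))
print(from_urn("urn:isbn:9783161484100"))
```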
That's true. I was suggesting abusing the GTIN field for other means, which is not ideal.
I'm glad you mentioned Readium using the URN format. I was unaware of it. I'll use it for a multi-format metadata reader/writer I'm working on.
Because I see Notes fields that look like:
Tagged by Comictagger 1.3.1 on 1970-01-01T12:12:00 [Issue ID 1234567890] [CMXDB45678] [CVDB1234567890] [ASINBC09876]
I'm fairly convinced there should be:
1. An Identifier field of some sort that supports multiple possible metadata IDs. It feels to me like GTIN is a subset of all possible identifiers, and also a useful superset of most trade identifiers. But having GTIN be for trade identifiers and another field for metadata IDs would also work.
2. A <Tagger /> field that tells you what program and version wrote the metadata. Analogous to ComicBookInfo "appID" JSON. In PDF this field is called <pdf:Producer/>.
3. An <UpdatedAt /> or "lastModified" field like ComicBookInfo JSON has. Not entirely necessary because filesystems also have timestamps in inodes, but this is specific to the tagging action.
Only (1.) is relevant to this discussion. But since people are forcing it into Notes already it seems like it would be used. For Codex's own internal metadata database I'm going to be parsing the Notes field myself for this information.
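As a sketch of what parsing the Notes field for these bracketed tags could look like, the following uses a regular expression over the prefixes mentioned in this thread. This is not Codex's actual implementation; the prefix list and the `extract_note_ids` name are assumptions for illustration.

```python
import re

# Prefixes seen in Notes fields in this thread; the list is an assumption
# and would need extending for other taggers.
NOTE_ID_PATTERN = re.compile(r"\[(Issue ID|CVDB|CMXDB|ASIN)\s*([A-Za-z0-9]+)\]")


def extract_note_ids(notes: str) -> dict[str, str]:
    """Return {prefix: id} for every bracketed identifier found in notes."""
    return {prefix: value for prefix, value in NOTE_ID_PATTERN.findall(notes)}


notes = (
    "Tagged by Comictagger 1.3.1 on 1970-01-01T12:12:00 "
    "[Issue ID 1234567890] [CMXDB45678] [CVDB1234567890] [ASINBC09876]"
)
print(extract_note_ids(notes))
```

A dedicated identifier field would make this kind of pattern matching unnecessary.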
I propose that GTIN be a calculated identifier. For example, you can calculate the "phash" (perceptual image hash) of each image in the comic, add the values of all the images, and use the result as the GTIN. Since calculating the phash converts each image to black and white and reduces it to an 8x8 binary matrix, the calculation is quite fast. Info about phash: https://www.hackerfactor.com/blog/index.php?/archives/432-Looks-Like-It.html Python library: https://pypi.org/project/ImageHash/
I am using this to compare the images of the comics and it gives a reliability of 99.9%.
@killo3967 GTIN is a specific type of identifier with its own spec, https://www.gs1.org/standards/id-keys/gtin, and we are not going to abuse it for that purpose.
As for using a perceptual hash as a Metadata ID, that's not a great idea: the hash will change by a non-trivial amount if you simply choose a different language to calculate it in, and even with the same library you can get different results.
Pillow (which ImageHash uses) relies on libwebp for WebP, and libwebp has at least three different ways to load the image into RGB (Pillow always loads into RGB). The primary one it uses ends up being the "fancy upsampling" method, which is the most complicated and produces decidedly different hashes from any of the other methods in libwebp, let alone another library. To top it all off, WebP stores its color in YUV format, where we only need the Y channel and not the UV channels to get the grayscale image needed for the hash. So if we are only calculating hashes we don't even need to do a conversion, and this hash also differs from the ones produced by loading the image into RGB. Guess what? Resizing the image before the grayscale conversion rather than after also results in a different hash. Most of these hashes are within the same arbitrary Hamming distance that we could pick, but they are not the same hash, and for an ID it is unacceptable for them to differ when dealing with the exact same set of bytes.
As a comparison of cover images it works well, and as a method to search for a comic it also works OK, but as an identifier for a comic it is not great.
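To show what "its own spec" means in practice, here is a minimal sketch of the GS1 mod-10 check-digit test that every GTIN must pass. It assumes the standard GTIN lengths of 8, 12, 13, or 14 digits; the function name `is_valid_gtin` is hypothetical.

```python
def is_valid_gtin(value: str) -> bool:
    """Check length, digits-only content, and the GS1 mod-10 check digit."""
    if not (value.isascii() and value.isdigit()) or len(value) not in (8, 12, 13, 14):
        return False
    digits = [int(char) for char in value]
    body, check = digits[:-1], digits[-1]
    # Weights alternate 3, 1, 3, ... starting from the digit next to the check digit.
    total = sum(
        digit * (3 if index % 2 == 0 else 1)
        for index, digit in enumerate(reversed(body))
    )
    return (10 - total % 10) % 10 == check


print(is_valid_gtin("9783161484100"))   # the ISBN-13 quoted earlier in the thread
print(is_valid_gtin("CVDB1234567890"))  # a prefixed metadata ID
```

A prefixed metadata ID or an arbitrary hash sum fails this test, which is why neither can be stored as a GTIN without breaking validators.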
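The difference between "close enough to match" and "identical enough to be an ID" can be sketched with plain integers. The two 64-bit values below are made up for illustration and are not real image hashes, and the threshold of 4 is an arbitrary assumption.

```python
def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Count the bits that differ between two integer hashes."""
    return bin(hash_a ^ hash_b).count("1")


def looks_like_same_image(hash_a: int, hash_b: int, max_distance: int = 4) -> bool:
    """Fuzzy match: accept hashes within an arbitrary distance threshold."""
    return hamming_distance(hash_a, hash_b) <= max_distance


# Made-up 64-bit values standing in for the same page hashed by two decoders.
decoder_one = 0xF0E1D2C3B4A59687
decoder_two = decoder_one ^ 0b101  # two bits flipped

print(hamming_distance(decoder_one, decoder_two))
print(looks_like_same_image(decoder_one, decoder_two))
print(decoder_one == decoder_two)
```

The fuzzy comparison succeeds while the equality test fails, and an identifier needs the equality test to succeed.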
Proposed Resolutions
Ideal
In an ideal world there might be an
<Identifier type="CVDB">1234567890</Identifier>
field that could have any number of entries and be used for ComicVine numbers, GTIN, ISBN, ASIN, etc. As many as you like. And your client software could easily derive web links for each of them as the respective url formats are all simple.
This seems like the best choice overall.
I'd really like to see this added to Anansi, as I'd like to use the spec as lightweight metadata added to cbz files, with a primary goal being matching to multiple online (and local in my case) sources for more elaborate metadata (i.e. stay lightweight locally).
I'm in support of something like <Identifier type="CVDB">1234567890</Identifier>
as @ajslater brought up.
This sounds like the best solution to me. I agree with your idea.
I have rethought the solution, and I think that the GTIN could be a multi-data field that holds data from different sources.
For example:
<Identifier type="CVDB">url</Identifier> -> From ComicVine
<Identifier type="BDTQ">url</Identifier> -> From Bedetheque
<Identifier type="ISBN">url</Identifier> -> From isbnsearch.org or other
<Identifier type="TBOS">url</Identifier> -> From Tebeosfera
<Identifier type="ASIN">url</Identifier> -> From Amazon/Comixology
etc...
Why this:
1. Because not every site has all the information about every comic.
2. Because not everyone is an English speaker. There are a lot of comics in other languages.
3. Because not everyone uses CVDB as their main scraping source; there are a lot of other websites with different IDs.
4. Because Calibre has done this for years and it works fine for all users.
For example, one book could have these IDs: isbn:9788490623527 barnesnoble:w/2010-arthur-c-clarke/1111814622 google:jtQeBAAAQBAJ
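As a sketch of how Calibre-style `source:value` pairs like these could map onto the element discussed in this thread, the following writes one `Identifier` element per pair using the standard library. The element name and `type` attribute are the thread's proposal, not an existing schema, and `identifiers_to_xml` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def identifiers_to_xml(calibre_ids: str) -> str:
    """Turn whitespace-separated 'source:value' pairs into Identifier elements."""
    root = ET.Element("ComicInfo")
    for token in calibre_ids.split():
        # Split on the first colon only; values such as paths are kept whole.
        source, sep, value = token.partition(":")
        if not sep or not source or not value:
            continue  # ignore tokens that are not source:value pairs
        element = ET.SubElement(root, "Identifier", {"type": source})
        element.text = value
    return ET.tostring(root, encoding="unicode")


print(identifiers_to_xml(
    "isbn:9788490623527 barnesnoble:w/2010-arthur-c-clarke/1111814622 google:jtQeBAAAQBAJ"
))
```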