JSON File Ingestion – Handling Metadata and Chunking

Open troublesprouter opened this issue 1 year ago • 2 comments

Description

When uploading a JSON file, I need Verba to properly ingest structured metadata while still generating chunks automatically. Currently, the behavior is unclear, and it seems that the "chunks" field must be predefined, even though Verba can generate chunks for PDFs automatically.

Expected Behavior:

  • Verba should recognize metadata fields without requiring predefined chunks.
  • The "content" field should be processed as document text.
  • Chunking should be handled automatically based on Verba’s settings.

Actual Behavior:

  • The "chunks" field appears necessary, even though I want Verba to generate them dynamically.
  • Metadata structure is unclear—what should be included for proper indexing?

Example JSON File:

{
  "year": 1995,
  "number": "50",
  "title": "Circular Nº 50, del 13 de Diciembre de 1995 (modificada) (aclarada / complementada)",
  "materia": "Crédito Tributario por inversiones en provincias de Arica y Parinacota",
  "url": "https://www.sii.cl/documentos/circulares/1995/circu50.pdf",
  "sin_efecto": false,
  "downloaded_filename": "circu50.pdf",
  "saved_filename": "circular_1995_50_2.pdf",
  "content": "Modificada por Circular Nº 45, del 3 de septiembre de 2008 \n\nModificadas por Circular Nº 64, del 6 de noviembre de 1996 \n\nComplementada por Circular Nº 64, del 6 de noviembre de 1996 \n\nCIRCULAR Nº 50, DEL 13 DE D ETC ETC etc",
  "modificada": true,
  "aclarada_complementada": true
}

Installation

  • [ ] pip install goldenverba

If you installed via pip, please specify the version:

Weaviate Deployment

  • [ ] Local Deployment

Steps to Reproduce

  1. Go to the dashboard.
  2. Upload a JSON file with structured metadata.
  3. The metadata doesn't load, nor does the title of the document, etc.

Additional Context

  • Do I need to structure metadata differently for proper indexing?
  • Should Verba automatically generate chunks even when metadata is present?
  • If so, how should metadata fields be formatted?
  • Is there a recommended JSON structure for structured documents without manually defining chunks?

@thomashacker Any guidance on this?

troublesprouter, Jan 31 '25 22:01

Thanks for the issue, you're raising good points 🚀 Right now, it's not possible to ingest custom structured JSON files with custom fields that Verba can access later.

Verba initially checks whether the imported JSON can be converted directly into a Verba Document object, which looks like this:

{
    "title": "string", # The title of the document
    "content": "string", # The content of the document
    "extension": "string", # The extension of the document (Optional)
    "fileSize": "number", # The size of the document in bytes (Optional)
    "labels": "array", # The labels of the document (can be empty, used for filtering)
    "source": "string", # The source of the document (can be an URL, optional)
    "meta": "object", # The meta data of the document used internally
    "metadata": "string" # Metadata information of the document, will be used in the embedding process
}

If the imported JSON has custom fields, Verba simply dumps the whole JSON into the content field; you can later use the JSON Chunker to chunk each field individually. The meta field is used internally to store the configurations that were used to create the Document. The metadata field is a string that is used when embedding the chunks and when generating answers to questions.
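
The fallback described above can be pictured roughly as follows. This is an illustrative sketch, not Verba's actual code: the function name `to_document_fields`, the `DOCUMENT_KEYS` set, the choice of `title` and `content` as required keys, and the use of the filename as a fallback title are all assumptions made for the example.

```python
import json

# Keys of the Document shape listed above (assumed to be the full set).
DOCUMENT_KEYS = {
    "title", "content", "extension", "fileSize",
    "labels", "source", "meta", "metadata",
}


def to_document_fields(raw: str, filename: str) -> dict:
    """Hypothetical helper that mimics the described fallback for custom JSON."""
    data = json.loads(raw)
    keys = set(data) if isinstance(data, dict) else set()
    # Assumption: a file is "Document-shaped" when it has title and content
    # and no keys outside the Document layout.
    if {"title", "content"} <= keys <= DOCUMENT_KEYS:
        return data
    # Custom fields present: the whole JSON text becomes the content.
    return {"title": filename, "content": raw}
```

Under these assumptions, a file such as the circular in the original post (with `year`, `materia`, and so on) would take the second branch, which matches the behavior reported in the issue: the custom fields end up inside the content instead of being exposed as metadata.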

I created a new issue to allow Verba to work with custom JSONs in the future! What you could do right now is this:

{
    "title": "Circular Nº 50, del 13 de Diciembre...",
    "content": "Modificada por Circular Nº 45, del 3....",
    "extension": "json",
    "fileSize": 0,
    "labels": ["document"], # You could choose whatever you want
    "source": "", # URL if needed
    "meta": {},
    "metadata": "All metadata information that you'd like to add as a string"
}
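
For a batch of files, the reshaping could be scripted before upload. The following is a minimal sketch under stated assumptions, not part of Verba: `to_verba_document` is a hypothetical helper, the value types follow the Document layout listed earlier (array for `labels`, object for `meta`), mapping `url` to `source` and flattening every other field into a `key: value` string are choices made for this example, and the sample record is abbreviated from the one in the original post.

```python
import json


def to_verba_document(record: dict) -> dict:
    """Hypothetical helper: reshape a custom record into the Document layout."""
    # Everything except title and content is treated as metadata.
    extra = {k: v for k, v in record.items() if k not in ("title", "content")}
    # Flatten the remaining fields into one human-readable string.
    metadata = "; ".join(f"{key}: {value}" for key, value in extra.items())
    return {
        "title": record["title"],
        "content": record["content"],
        "extension": "json",
        # Assumption: report the size of the content text in bytes.
        "fileSize": len(record["content"].encode("utf-8")),
        "labels": ["document"],
        "source": record.get("url", ""),
        "meta": {},
        "metadata": metadata,
    }


# Abbreviated version of the record from the original post.
circular = {
    "year": 1995,
    "number": "50",
    "title": "Circular Nº 50",
    "materia": "Crédito Tributario",
    "url": "https://www.sii.cl/documentos/circulares/1995/circu50.pdf",
    "sin_efecto": False,
    "content": "Modificada por Circular Nº 45",
}
document = to_verba_document(circular)
print(json.dumps(document, ensure_ascii=False, indent=2))
```

Because the metadata ends up as a single string, fields such as `year` or `sin_efecto` are available to the embedding and answer-generation steps as text, but not as separately filterable fields.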

Let me know if this works for you, and if not, what functionalities are missing to accomplish your goal!

thomashacker, Feb 03 '25 13:02

Thank you, Thomas. I ended up re-creating the project in txtai over the weekend in an effort to make things work with metadata, but if I come across a dead end there I will return to your solution.

In principle just using the "metadata" field should work perfectly.

troublesprouter, Feb 03 '25 13:02