No documents appearing in "See disk usage"

Open MatiasConTilde opened this issue 4 years ago • 9 comments

Describe the bug Even though more than 4M documents are indexed, they don't appear in the "See disk usage" modal or in the "Number of documents by creation date" panel in the Insights tab.

To Reproduce Steps to reproduce the behavior:

  1. Go to insights tab
  2. Click on the buttons

Expected behavior A view of the documents and directories to appear.

Screenshots image image

Desktop (please complete the following information):

  • OS: Ubuntu 20.04
  • Browser: Firefox, but irrelevant
  • Version: 7.9.5

Additional context

  • Logs don't say anything relevant.
  • It also didn't appear when we had fewer documents indexed.
  • The path filter in search documents does work:

image

MatiasConTilde avatar Sep 23 '20 14:09 MatiasConTilde

This should be resolved by @pirhoo's latest developments.

Soliine avatar Nov 23 '20 10:11 Soliine

Hi @MatiasConTilde, a fix for this has been published in version 8.2.1 :)

pirhoo avatar Nov 30 '20 16:11 pirhoo

Unfortunately it still doesn't work for us. This is the request that gets sent when clicking the "See disk usage" button, maybe it's useful:

{
  "size": 0,
  "query": {
    "bool": {
      "filter": {
        "term": {
          "dirname.tree": "/home/ocr/originales/out"
        }
      },
      "must": [
        {
          "match": {
            "extractionLevel": 0
          }
        },
        {
          "match": {
            "type": "Document"
          }
        }
      ]
    }
  },
  "aggs": {
    "byDirname": {
      "terms": {
        "field": "dirname.tree",
        "include": "/home/ocr/originales/out/.*",
        "exclude": "/home/ocr/originales/out/.*/.*",
        "size": 50,
        "order": {
          "contentLength": "desc"
        }
      },
      "aggs": {
        "contentLength": {
          "sum": {
            "field": "contentLength"
          }
        },
        "bucket_truncate": {
          "bucket_sort": {
            "size": 50,
            "from": 0
          }
        }
      }
    },
    "totalContentLength": {
      "sum": {
        "field": "contentLength"
      }
    }
  }
}

Response:

{
  "took": 53,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "totalContentLength": {
      "value": 0
    },
    "byDirname": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": []
    }
  }
}

Because of this, it is more likely a backend problem. The directory actually contains around 200,000 directories in the first level and more than 8 million PDFs in total. I also noticed that on the "Number of documents by creation date" panel, if I click on the "Filter by folder" button and directly click on "Select folder" (even though no folders appear), the histogram does appear!
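
For anyone reading the request above, here is a rough sketch of what the `include`/`exclude` pair in the `byDirname` aggregation is meant to select: only the direct children of the requested directory. It uses Python's `re.fullmatch` as a stand-in for Elasticsearch's anchored regular expressions, which is an assumption that holds for simple patterns like these. The helper name and the sample paths are made up for illustration.

```python
import re

ROOT = "/home/ocr/originales/out"

# Same patterns as in the request; terms-aggregation patterns must match the whole term.
INCLUDE = re.compile(ROOT + "/.*")
EXCLUDE = re.compile(ROOT + "/.*/.*")


def kept_by_terms_agg(term: str) -> bool:
    """Return True if a term passes the include pattern and is not excluded."""
    return bool(INCLUDE.fullmatch(term)) and not EXCLUDE.fullmatch(term)


# Hypothetical terms: the root itself, a child, a grandchild, an unrelated path.
candidates = [ROOT, ROOT + "/a", ROOT + "/a/b", "/home/ocr/other"]
direct_children = [t for t in candidates if kept_by_terms_agg(t)]
print(direct_children)
```

Only `/home/ocr/originales/out/a` passes. Note that the response reports `hits.total` of 0, so the `term` filter on `dirname.tree` matched no documents at all and the aggregation had nothing to bucket; the empty `buckets` array does not by itself point at these patterns.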

MatiasConTilde avatar Dec 08 '20 12:12 MatiasConTilde

Hi @MatiasConTilde, thanks for the detailed report. We'll have a look soon to sort this out!

pirhoo avatar Dec 08 '20 12:12 pirhoo

So if after "Extract[ing] text" and asking the program to locate/identify people, entities, and locations, and nothing comes up, is that a result of this problem not actually being resolved, or an error on my part?

JCBerger avatar Oct 04 '21 23:10 JCBerger

Hi @JCBerger, can you elaborate on your configuration and the steps you did so we can create a proper issue? Thanks!

pirhoo avatar Oct 05 '21 07:10 pirhoo

Hello @JCBerger I just tested the NER extraction (locate/extract ppl/org/locations) and it went well with the latest version 9.20.2

bamthomas avatar Oct 05 '21 13:10 bamthomas

@pirhoo @bamthomas Hey, thanks for being so on top of things! I downloaded the "DataShareStandalone.pkg" for Mac, and it installed 9.20.1. Let me see if 9.20.2 works.

JCBerger avatar Oct 05 '21 19:10 JCBerger

I just downloaded the Mac Standalone version (9.20.3), and I'm having the same issue. When I extract the text, I can see some of the files from the "Datashare" folder listed/mentioned in the Terminal, but it never "locates people" and it still says no documents are included under "Disk usage." Some other context: the time it takes to "extract text" on repeated runs leads me to believe that the documents are not being re-indexed.

I'm basing this on minimal knowledge of the application (so please tell me if I'm off track), but that would mean the program knows where to check the index (to determine whether something has already been indexed), while something about my configuration is causing the "Locate people..." process (and potentially wherever the application checks for "disk usage") to look somewhere else for the index.

If I'm not connected to Docker, are there changes I should make to the configuration to run it locally on my Mac?

Edit: And apologies @pirhoo — I didn't mean to ignore your question about my configuration. I've attached screenshots. Datashare_settings_part_1_of_2 Datashare_settings_part_2_of_2

JCBerger avatar Oct 06 '21 00:10 JCBerger

Datashare calculates directory sizes from their child documents and sub-directories. If a directory has no documents, it won't show up. This might explain why your "See disk usage" modal is showing nothing.
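
To illustrate the point, here is a minimal sketch of that kind of roll-up. It is not Datashare's actual code: it assumes each document carries a directory name and a content length (like the `dirname` and `contentLength` fields in the request earlier in this thread) and that a document counts toward every ancestor directory. The function name and sample data are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def directory_sizes(documents):
    """Sum content lengths per directory, counting each document in all its ancestors."""
    sizes = defaultdict(int)
    for dirname, content_length in documents:
        path = PurePosixPath(dirname)
        # The document's own directory plus every parent up to the root.
        for directory in (path, *path.parents):
            sizes[str(directory)] += content_length
    return dict(sizes)


# Hypothetical indexed documents as (dirname, contentLength) pairs.
# "/data/empty" exists on disk but holds no indexed document.
docs = [("/data/a", 100), ("/data/a/sub", 50), ("/data/b", 7)]
sizes = directory_sizes(docs)
print(sizes)
```

Because sizes are built only from indexed documents, `/data/empty` never appears in the result, which matches the behavior described above: a directory with nothing indexed under it has no entry to display.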

pirhoo avatar Sep 02 '22 16:09 pirhoo