datashare
No documents appearing in "See disk usage"
Describe the bug Even though more than 4M documents are indexed, none of them appear when clicking the "See disk usage" button or in the "Number of documents by creation date" panel in the Insights tab.
To Reproduce Steps to reproduce the behavior:
- Go to the Insights tab
- Click on the "See disk usage" button or open the "Number of documents by creation date" panel
Expected behavior A view of the documents and directories to appear.
Screenshots
Desktop (please complete the following information):
- OS: Ubuntu 20.04
- Browser: Firefox, but irrelevant
- Version: 7.9.5
Additional context
- Logs don't say anything relevant.
- It also didn't appear when we had fewer documents indexed.
- The path filter in search documents does work:
This should be resolved by @pirhoo's latest developments.
Hi @MatiasConTilde, a fix for this has been published in version 8.2.1
:)
Unfortunately it still doesn't work for us. This is the request that gets sent when clicking the "See disk usage" button, maybe it's useful:
{
  "size": 0,
  "query": {
    "bool": {
      "filter": {
        "term": {
          "dirname.tree": "/home/ocr/originales/out"
        }
      },
      "must": [
        {
          "match": {
            "extractionLevel": 0
          }
        },
        {
          "match": {
            "type": "Document"
          }
        }
      ]
    }
  },
  "aggs": {
    "byDirname": {
      "terms": {
        "field": "dirname.tree",
        "include": "/home/ocr/originales/out/.*",
        "exclude": "/home/ocr/originales/out/.*/.*",
        "size": 50,
        "order": {
          "contentLength": "desc"
        }
      },
      "aggs": {
        "contentLength": {
          "sum": {
            "field": "contentLength"
          }
        },
        "bucket_truncate": {
          "bucket_sort": {
            "size": 50,
            "from": 0
          }
        }
      }
    },
    "totalContentLength": {
      "sum": {
        "field": "contentLength"
      }
    }
  }
}
Response:
{
  "took": 53,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "totalContentLength": {
      "value": 0
    },
    "byDirname": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": []
    }
  }
}
Because of this, it is more likely a backend problem. The directory actually contains around 200,000 directories in the first level and more than 8 million PDFs in total. I also noticed that on the "Number of documents by creation date" panel, if I click on the "Filter by folder" button and directly click on "Select folder" (even though no folders appear), the histogram does appear!
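For anyone reading the request above, here is a minimal Python sketch of what the "include" and "exclude" patterns in the "byDirname" aggregation are meant to select: terms exactly one level below the root. It assumes that "dirname.tree" is indexed with a path-hierarchy style analyzer (one term per ancestor directory), which is a guess and not confirmed in this thread. The sample terms and the function name first_level_buckets are hypothetical. Elasticsearch regular expressions are anchored to the whole term, which re.fullmatch imitates here.

```python
import re

ROOT = "/home/ocr/originales/out"

# Patterns copied from the request; Elasticsearch matches them against the
# whole term, so fullmatch() is used rather than search().
INCLUDE = re.compile(ROOT + "/.*")
EXCLUDE = re.compile(ROOT + "/.*/.*")


def first_level_buckets(terms):
    """Keep terms that fully match INCLUDE and do not fully match EXCLUDE."""
    return [
        term
        for term in terms
        if INCLUDE.fullmatch(term) and not EXCLUDE.fullmatch(term)
    ]


# Hypothetical terms, assuming one term per ancestor directory of a document.
sample_terms = [
    "/home",
    "/home/ocr",
    "/home/ocr/originales",
    "/home/ocr/originales/out",
    "/home/ocr/originales/out/a",
    "/home/ocr/originales/out/a/b",
]

print(first_level_buckets(sample_terms))
```

Only "/home/ocr/originales/out/a" survives both patterns. Since the response reports a "hits.total" of 0, the query part of the request matched no documents at all, so the aggregation had nothing to bucket; the patterns themselves are probably not where the problem lies.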
Hi @MatiasConTilde, thanks for the detailed report. We'll have a look soon to sort this out!
So if, after "Extract[ing] text" and asking the program to locate/identify people, entities, and locations, nothing comes up, is that a result of this problem not actually being resolved, or an error on my part?
Hi @JCBerger, can you elaborate on your configuration and the steps you did so we can create a proper issue? Thanks!
Hello @JCBerger I just tested the NER extraction (locate/extract ppl/org/locations) and it went well with the latest version 9.20.2
@pirhoo @bamthomas Hey, thanks for being so on top of things! I downloaded the "DataShareStandalone.pkg" for Mac, and it installed 9.20.1. Let me see if 9.20.2 works.
I just downloaded the Mac Standalone version (9.20.3), and I'm having the same issue. When I extract the text, I can see some of the files from the "Datashare" folder listed/mentioned in the Terminal, but it never "locates people" and it still says no documents are included under "Disk usage." Some other context: the time it takes to "extract text" on repeated runs leads me to believe that the documents are not being re-indexed.
I'm basing this on minimal knowledge of the application (so please tell me if I'm off track), but that would mean the program knows where to check the index (to determine whether something has already been indexed), while something about my configuration is causing the "Locate people..." process (and potentially the "disk usage" check) to look somewhere else for the index.
If I'm not connected to Docker, are there changes I should make to the configuration to run it locally on my Mac?
Edit: And apologies @pirhoo — I didn't mean to ignore your question about my configuration. I've attached screenshots.
Datashare calculates directory sizes from their child documents and sub-directories. If a directory has no documents, it won't show up. This might explain why your "See disk usage" modal is showing nothing.
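As a rough illustration of that behavior (not Datashare's actual implementation), the Python sketch below sums document sizes under each first-level child of a root directory. The function name sizes_by_child, the paths, and the sizes are all hypothetical; the point is only that a directory with no documents anywhere beneath it never gets an entry.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def sizes_by_child(root, documents):
    """Sum document sizes under each first-level child directory of root.

    documents is an iterable of (dirname, content_length) pairs.
    """
    root_path = PurePosixPath(root)
    totals = defaultdict(int)
    for dirname, content_length in documents:
        try:
            relative = PurePosixPath(dirname).relative_to(root_path)
        except ValueError:
            continue  # the document lives outside root
        if not relative.parts:
            continue  # the document sits directly in root, not in a child
        child = str(root_path / relative.parts[0])
        totals[child] += content_length
    return dict(totals)


# Hypothetical documents: (directory containing the document, size in bytes).
docs = [
    ("/data/a", 100),
    ("/data/a/deep", 50),
    ("/data/b", 10),
    ("/data", 5),
    ("/other", 7),
]

print(sizes_by_child("/data", docs))
```

A directory such as "/data/empty" that exists on disk but contains no indexed documents would not appear in the result, which matches the empty modal described above when nothing is indexed under the selected path.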