"Import Status: Running file import" stuck.

Open ZeroCool940711 opened this issue 4 years ago • 19 comments

Seems like OpenSemanticSearch is stuck extracting and analyzing some files. It has been more than a few days and it is still showing the same message when searching; even after rebooting it is still stuck on the same files. It doesn't seem to be indexing anything new, as the total document count is still the same as before, and there doesn't seem to be anything else OSS is doing.

ZeroCool940711 avatar Apr 20 '20 11:04 ZeroCool940711

Same here. I waited a long time but it does not seem to be working, and Flower shows no active session.

wAikAp avatar Apr 27 '20 09:04 wAikAp

I think after some time it just stops working. In my case, after 75 billion documents it doesn't index or process anything, even though the CPU and RAM on my server are not being used at all. It seems like there is some internal limit or something is broken. Nothing is logged, so it's hard to tell what's going on.

ZeroCool940711 avatar Apr 27 '20 09:04 ZeroCool940711

But I am only indexing 6 files. It seems the OCR task cannot complete for 1 .ppt file; I have waited 2 days and the import status is still "Running file import (still 1 documents to process)".

wAikAp avatar Apr 28 '20 10:04 wAikAp

I am also experiencing this issue testing out open-semantic-search 20.02.08. Is there a service that needs to be restarted, or how is this issue resolved?

Adding start/stop instructions for services other than "solr", as well as the order of operations, would be helpful: https://www.opensemanticsearch.org/doc/admin/cmd

srich avatar May 01 '20 15:05 srich

Is it slow or is it stuck? - I set up a new instance on a laptop (with really too little RAM, so there will be a lot of swapping), and it seemed stuck on 3 files. After maybe 2 days it was down to 2 files. So not stuck, but slow due to swapping...

Though I will also say that the User Interface is not ideal as it would be nice to know which files are missing...

DetlevCM avatar May 24 '20 09:05 DetlevCM

In my case it's completely stuck. It's not a RAM problem, as the server I'm running it on has a lot of RAM. I think it might have something to do with images being deleted before they can be processed. If I'm right, images are not downloaded to the server but are used directly from the website where they were indexed, so an image could have been deleted or moved before it could be processed. It could also be that it doesn't have access to the image. It could be retrying the same files over and over, and because they are not accessible the process can never complete.

ZeroCool940711 avatar May 24 '20 19:05 ZeroCool940711

I have the same problem: indexing a small folder via "opensemanticsearch-index-dir" leads to the message "Running file import (still 77 documents to process)". The CLI shows "Indexing new file: ...." but index creation seems to be stuck. The folder contains only simple text files without any images.

Any hints on how to find the root cause? Logs?

Edit: indexing a single file within the same folder with opensemanticsearch-index-file works fine.
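Since single files index fine, one way to narrow down which file stalls a folder import is to feed the files to the indexer one at a time with a time limit. The following is only a rough sketch under assumptions that are not confirmed anywhere in this thread: that `opensemanticsearch-index-file` accepts a file path as its only argument, runs in the foreground, and exits with a non-zero code on failure. The function name `index_files_one_by_one` and the 600-second default are made up for illustration.

```python
import subprocess
from pathlib import Path


def index_files_one_by_one(folder, command=("opensemanticsearch-index-file",), timeout_s=600):
    """Run the indexer once per file and sort the files by outcome.

    Returns three lists of paths: (ok, failed, timed_out).
    Assumption: `command` takes the file path as its last argument.
    """
    ok, failed, timed_out = [], [], []
    for path in sorted(Path(folder).rglob("*")):
        if not path.is_file():
            continue
        try:
            # subprocess.run kills the child and raises if the time limit is hit
            result = subprocess.run([*command, str(path)], timeout=timeout_s)
        except subprocess.TimeoutExpired:
            timed_out.append(path)
            continue
        # Assumption: a zero exit code means the file was indexed
        (ok if result.returncode == 0 else failed).append(path)
    return ok, failed, timed_out
```

Calling `index_files_one_by_one("/path/to/folder")` and printing the `timed_out` list would then show the candidates worth inspecting by hand.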

olli0815 avatar Jun 14 '20 19:06 olli0815

Mine is similar. It has looked this way since February, and there have been a bunch of reboots and crashes. I am running the 20.01.17 release. I was thinking of downloading 20.04.17 to see if it made any difference.

It would be nice if there were a timeout, so that it skips the current document and moves on, then comes back to it on the next pass.
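To make the idea concrete, here is a minimal sketch of such a skip-and-retry loop. It is not how Open Semantic Search schedules its tasks; the names `run_with_deferred_retries` and `process` are hypothetical, and enforcing the time limit is left to the `process` callable, which is assumed to raise `TimeoutError` when a document takes too long.

```python
def run_with_deferred_retries(documents, process, max_passes=2):
    """Process documents in passes; timed-out ones are retried on the next pass.

    `process(doc)` is assumed to raise TimeoutError when it gives up on a document.
    Returns (done, still_pending).
    """
    pending = list(documents)
    done = []
    for _ in range(max_passes):
        deferred = []
        for doc in pending:
            try:
                process(doc)
            except TimeoutError:
                # Skip for now and move on; revisit on the next pass
                deferred.append(doc)
            else:
                done.append(doc)
        pending = deferred
        if not pending:
            break
    return done, pending
```

With this shape, one problematic document can no longer block the rest of the queue, and whatever is left in the second return value after the last pass is the list of documents that never completed.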

Import status: Running file import (still 5071601 documents to process)

Because of yet running and open tasks like text extraction and analysis maybe not all results were found yet, since at the moment of this search 5071601 file(s) could be only searched, overviewed and filtered by their file names only, not yet by their content and/or content based facets/filters!

 Previous Newest 10 of 5339085 documents 

mbanks850 avatar Jul 02 '20 11:07 mbanks850

If anybody wants to do some testing, I wonder if the problem does not stem from an interaction of components (it might be too early to tell just now on my end):

I decided to "clean up" and start with a new freshly configured instance of OpenSemantic Search. (Side note: after updating Debian, I immediately had some corrupted files in /var/lib/dpkg/info ... - I wonder why and how.)

In order to reduce the computational cost, and also because I am not sure it adds value in my specific use case, I disabled both the Named Entity Recognition (spaCy) and the Graph DB (Neo4j). So far it seems that the import is running fast without any problems. At present it is OCRing the documents. In addition, significantly fewer files are written to /tmp (I had something daft like 200.000 files or so before...). So far I see about 500, the pages from the document.
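For anyone who wants to compare the number of temporary files before and after changing the configuration, a small sketch like the following could help. It simply counts files under a directory, grouped by top-level subdirectory; the function name `count_files_by_top_level` is made up, and which entries in /tmp belong to Open Semantic Search is not something this sketch can tell.

```python
import os
from collections import Counter


def count_files_by_top_level(root):
    """Count files under `root`, grouped by the first path component below it.

    Files directly in `root` are counted under the key ".".
    """
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        top = "." if rel == "." else rel.split(os.sep)[0]
        counts[top] += len(filenames)
    return counts
```

Running `count_files_by_top_level("/tmp").most_common(10)` at intervals would show whether one subdirectory keeps growing while the import appears stuck.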

I guess I will see in "a while" (whenever...) if this helps. Incidentally, my previous installation of OpenSemantic Search never calmed down and seemed to continue working indefinitely... (I am using it as a local search engine for my document library. I don't need more than a search engine, so all the machine learning and the semantics support are not important to me.)

DetlevCM avatar Aug 09 '20 12:08 DetlevCM

What steps did you use to disable Named Entity Recognition and Graph DB? Do you know what we would lose by disabling those features?

mbanks850 avatar Aug 10 '20 11:08 mbanks850

@mbanks850 I only use the interface that is exposed to the user. -> Open Semantic Search interface -> config in the top menu -> "Named Entity Recognition" and "Graph DB (Neo4j)" options.

Both entries have some descriptions. The graph database deals with relationships between documents and the named entity recognition tries to understand the document based on machine learning principles.

Given that the project (based on the description) was developed to deal with data dumps for journalists, such tools may be very, very useful. When using it as a local document search engine, the relations (graph database) become less interesting. The Named Entity Recognition could be useful, but is possibly not well tuned for, say, technical documents. It may also be that the Named Entity Recognition deals with the semantic aspects of the search, so turning it off may make OpenSemantic Search "dumber". Given that I want to search a database of papers and technical documents that I create, this seems fair enough to me.

Now for some reason, this has led to Open Semantic Search not showing me how many files it still wants to OCR... But Tesseract OCR is the only process hogging the CPU. (I don't think Solr is particularly heavy for straightforward searching. It is the part that tries to be clever which is CPU-intensive.)

DetlevCM avatar Aug 10 '20 11:08 DetlevCM

Thank you, looking at the descriptions, Graph DB is not something I will need. Named Entity maybe, but we are also just using it as a search in technical documents.

mbanks850 avatar Aug 10 '20 12:08 mbanks850

Same issue in OSS 20.04.17 and 20.01.17

RiteshSingh avatar Oct 25 '20 03:10 RiteshSingh

Same issue here.

Ubuntu 20.04, OSS 20.11.01 and 21.01.03

Indexing via opensemanticsearch-index-dir -> ~210.000 files. After about 16 hours 2 documents are extracted but CPU is on 100% with 8 tasks from "etl_tasks".

"NER" and "Neo4j" are disabled.

I tried resetting file monitoring and deleting the index, but the CPU is always at 100% with "etl_tasks" without anything being indexed.

Only when I stop the service "opensemanticetl" does CPU usage return to normal.

Does anyone have news about this?

rusty9283 avatar Jan 13 '21 20:01 rusty9283

After some testing I think my problem may be a different one: #341

rusty9283 avatar Jan 16 '21 16:01 rusty9283

Same issue here. It's been a few months since this post. Did you encounter any other file import issues after this? Also, would it help to turn these options off after the fact (after it got stuck), or do we need to start clean and index again?

@mbanks850 I only use the interface that is exposed to the user. -> Open Semantic Search interface -> config in the top menu -> "Named Entity Recognition" and "Graph DB (Neo4j)" options.

movanet avatar Mar 11 '21 02:03 movanet

I had the same problem today. Around 4,80,000 documents got indexed but they were stuck at file import. I waited for around 6 to 7 hours but it still looked to be stuck. I restarted the server and the import process started automatically. I am not sure, but it looks like the issue is related to the Flower server worker. I am using the virtual machine appliance (21.01.17).

nikhilbhalwankar avatar Jul 03 '21 18:07 nikhilbhalwankar

This issue is still unresolved. I have probably encountered the same problem (Open Semantic Search installation package from 22.10.08). It regularly hangs during the extraction of files (see issue #461 for details). Did you guys ever find any solution to this?

HenryJones23 avatar Mar 17 '23 18:03 HenryJones23