Indexing Does not Complete

Open rickcecil opened this issue 2 years ago • 21 comments

Device Information (please complete the following information):

  • OS: Ubuntu 20.04
  • Deployment: Docker
  • SIST2 Version: v3.2.0
  • Elasticsearch Version (if relevant):

Command with arguments: Sist2-admin indexing

Describe the bug: After indexing a dataset, the indexing process appears to complete, but never actually finishes.

The Tasks page shows a full blue bar and seems to indicate that it has finished.

Screenshot 2023-07-24 at 6 51 54 PM

When I look at the logs, the last few lines are ...

2023-07-24 19:34:27 [WARNING elastic.c] calling _elastic_flush with 0 in queue
2023-07-24 19:34:27 [DEBUG elastic.c] Destroying indexer
2023-07-24 19:34:27 [DEBUG database.c] Closing database /dev/shm/sist2-ipc-19053.sqlite
2023-07-24 19:34:27 [DEBUG tpool.c] Child process terminated with status code 0
2023-07-24 19:34:27 [DEBUG tpool.c] Child process terminated with status code 0
2023-07-24 19:34:27 [DEBUG tpool.c] Child process terminated with status code 0
2023-07-24 19:34:44 [INFO elastic.c] Refresh index <200>
2023-07-24 19:34:46 [INFO elastic.c] Merge index <200>
2023-07-24 19:34:46 [INFO elastic.c] Set refresh interval <200>

However, when I look at the docker logs for the admin container, I see the following, with a new line arriving every second or so and each line showing a different port on the same IP address.

07/24/2023 6:52:30 PM       INFO:     192.XXX.XXX.XX:42798 - "GET /api/task HTTP/1.1" 200 OK
07/24/2023 6:52:30 PM       INFO:     192.XXX.XXX.XX:42806 - "GET /api/task HTTP/1.1" 200 OK
07/24/2023 6:52:31 PM       INFO:     192.XXX.XXX.XX:42810 - "GET /api/task HTTP/1.1" 200 OK
07/24/2023 6:52:31 PM       INFO:     192.XXX.XXX.XX:42812 - "GET /api/task HTTP/1.1" 200 OK
07/24/2023 6:52:32 PM       INFO:     192.XXX.XXX.XX:42816 - "GET /api/task HTTP/1.1" 200 OK
07/24/2023 6:52:32 PM       INFO:     192.XXX.XXX.XX:42818 - "GET /api/task HTTP/1.1" 200 OK

Memory and CPU usage are both low -- certainly far lower than while the indexing is actually active.

Clicking "Kill" on the Tasks page does nothing, or at least it doesn't seem to do anything. At one point I waited about 12 hours and the status never changed.

And here is the last entry in my docker logs for Elasticsearch -- none of the Elasticsearch docker logs seem to indicate any kind of error.

07/24/2023 2:34:46 PM
cluster.name=docker-cluster
cluster.uuid=qyK6Qru4R9ioo2wfahJRTw
component=o.e.c.s.IndexScopedSettings
level=INFO
message=[writing] updating [index.refresh_interval] from [30s] to [1s]
node.id=dwP5FddhTzW0mm6oHptQJA
node.name=9fb1f4952588
timestamp=2023-07-24T14:34:46,645-05:00
type=server

This does not always happen.

A previous scan of this same data set took 20m and the indexing took 6m.

rickcecil avatar Jul 25 '23 00:07 rickcecil
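
Before digging into the logs, it can help to confirm whether a sist2 process is actually still alive once the progress bar fills up. A minimal check from the Docker host (the bracketed grep pattern just keeps grep from matching itself):

ps aux | grep '[s]ist2'

If a sist2 process is still listed, the indexer itself never exited; if nothing shows up, the stuck state is on the sist2-admin side rather than in sist2 itself.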

Hi, are there any errors at all in the logs, and how many threads did you use for the index step?

simon987 avatar Jul 25 '23 01:07 simon987

No errors at all in the log. This line shows up several times -- with the same timestamp:

2023-07-24 19:34:27 [WARNING elastic.c] calling _elastic_flush with 0 in queue

As for threads, I've been using 50 -- though this one says 49. FWIW, I settled on ~50 arbitrarily. I am not sure whether that is high or low, or how I should determine the right number of threads.

rickcecil avatar Jul 25 '23 02:07 rickcecil

Thanks, I'll try to reproduce the bug.

50 is very high; the number should be at most equal to the number of logical cores on the machine where Elasticsearch is hosted. I've found that anything more than ~8 threads will periodically make Elasticsearch crash or run out of memory (on a 16-core machine), and 6 threads was the sweet spot in terms of performance.

simon987 avatar Jul 25 '23 11:07 simon987
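
For reference, the thread count for both steps can also be set when running the scan and index outside of sist2-admin. This is a rough sketch only -- the paths are placeholders and the option names are assumed from sist2's CLI help, so they may differ between versions:

sist2 scan --threads=4 -o /data/docs.sist2 /path/to/files
sist2 index --threads=4 --es-url=http://elasticsearch:9200 /data/docs.sist2

In sist2-admin the equivalent settings live separately in the job's scan options and index options.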

Huh. Good to know. When I get home from work, I will reduce it to a more logical number and run those again. Seems reasonable that my excessively high thread count might be responsible for the issue. I’ll report back tonight.

rickcecil avatar Jul 25 '23 12:07 rickcecil

So, I re-ran the failing job overnight and it still failed in the same way. I am going to delete the job and run it again against a new backend to see if setting the threads to a reasonable number from the outset helps.

rickcecil avatar Jul 26 '23 10:07 rickcecil

After running several indexes, I just experienced the issue again. This time, I had 4 threads (on a 4-core machine). I indexed over 700k documents, mostly PDFs and HTML files, if that matters.

I've knocked it down to 3 threads and am re-running it.

rickcecil avatar Jul 27 '23 12:07 rickcecil

The error is still occurring.

I've reduced the number of threads in both the associated index and the job.

The docker logs are just a continuous stream of

07/27/2023 2:20:19 PM       INFO:     192.XXX.XXX.XXX:37374 - "GET /api/task HTTP/1.1" 200 OK
07/27/2023 2:20:20 PM       INFO:     192.XXX.XXX.XXX:37380 - "GET /api/task HTTP/1.1" 200 OK
07/27/2023 2:20:21 PM       INFO:     192.XXX.XXX.XXX:37388 - "GET /api/task HTTP/1.1" 200 OK
07/27/2023 2:20:22 PM       INFO:     192.XXX.XXX.XXX:37390 - "GET /api/task HTTP/1.1" 200 OK
07/27/2023 2:20:23 PM       INFO:     192.XXX.XXX.XXX:37402 - "GET /api/task HTTP/1.1" 200 OK
07/27/2023 2:20:24 PM       INFO:     192.XXX.XXX.XXX:37414 - "GET /api/task HTTP/1.1" 200 OK
07/27/2023 2:20:25 PM       INFO:     192.XXX.XXX.XXX:44728 - "GET /api/task HTTP/1.1" 200 OK
07/27/2023 2:20:26 PM       INFO:     192.XXX.XXX.XXX:44732 - "GET /api/task HTTP/1.1" 200 OK
07/27/2023 2:20:27 PM       INFO:     192.XXX.XXX.XXX:44746 - "GET /api/task HTTP/1.1" 200 OK

while the logs in the Sist2 admin show:

2023-07-27 12:41:37 [INFO tpool.c] Worker threads finished
2023-07-27 12:41:37 [INFO tpool.c] Destroying thread pool
2023-07-27 12:41:37 [DEBUG database.c] Closing database /dev/shm/sist2-ipc-32.sqlite
2023-07-27 12:41:37 [WARNING elastic.c] calling _elastic_flush with 0 in queue
2023-07-27 12:41:37 [DEBUG elastic.c] Destroying indexer
2023-07-27 12:41:37 [DEBUG database.c] Closing database /dev/shm/sist2-ipc-32.sqlite
2023-07-27 12:41:37 [DEBUG elastic.c] Indexed 35 documents (2kB) <200>
2023-07-27 12:41:37 [DEBUG elastic.c] Destroying indexer
2023-07-27 12:41:37 [DEBUG database.c] Closing database /dev/shm/sist2-ipc-32.sqlite
2023-07-27 12:41:37 [DEBUG tpool.c] Child process terminated with status code 0
2023-07-27 12:41:37 [DEBUG tpool.c] Child process terminated with status code 0
2023-07-27 12:41:37 [DEBUG elastic.c] Indexed 17 documents (1kB) <200>
2023-07-27 12:41:37 [DEBUG elastic.c] Destroying indexer
2023-07-27 12:41:37 [DEBUG database.c] Closing database /dev/shm/sist2-ipc-32.sqlite
2023-07-27 12:41:37 [DEBUG tpool.c] Child process terminated with status code 0
2023-07-27 12:41:41 [INFO elastic.c] Refresh index <200>
2023-07-27 12:41:44 [INFO elastic.c] Merge index <200>
2023-07-27 12:41:44 [INFO elastic.c] Set refresh interval <200>

And the task looks to be completed in the Sist2 dashboard.

There were over 700K files scanned and indexed.

I am going to continue to let it run for the time being, but will likely kill it tonight if there is no change.

I am next going to try filtering some files out of the scan/index to see if that changes anything. I can't really think of any other troubleshooting to do -- so pointers would be appreciated.

rickcecil avatar Jul 27 '23 19:07 rickcecil

I just had a thought. I am also using Filerun, and when it indexes a directory, it creates a dotfile for each file in the directory. I wonder if these dotfiles are not getting counted in the scan but are getting picked up in the index process -- or something like that; I have no idea how the scan and index processes work. The Docker logs do seem to indicate that the indexing process is continuing even though the admin says it is done.

Does this seem plausible to you? If so, I can exclude the dotfiles from the scan and index. Just gotta get the regex for that sorted.

(Or, if you think it would be a better test, I can create a copy of my files and delete the dotfiles from that copy. I have the space to do that.)

rickcecil avatar Jul 27 '23 21:07 rickcecil
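
If excluding the dotfiles is worth testing, the scan step can skip paths by regex instead of copying the data set. A sketch, assuming sist2's exclude option takes a regular expression matched against the file path (the exact flag name may vary by version):

sist2 scan --exclude='/\.' -o /data/docs.sist2 /path/to/files

The pattern /\. matches any path component that starts with a dot, which would cover the per-file dotfiles created by Filerun.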

Hi, thanks for investigating. Usually after [INFO elastic.c] Set refresh interval <200> it should be done and exit, so I think maybe one child process isn't being terminated properly and stays waiting forever, or something like that. Can you count the number of Child process terminated lines vs. the number of threads?

During the Index step it doesn't read the filesystem -- only the .sist2 index file that was created in the scan step. In theory you could delete/move your files after the scan step and everything would still work, including the web frontend (except downloads, of course).

simon987 avatar Jul 27 '23 21:07 simon987
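
Counting those lines against a saved copy of the task log is a one-liner (the log file name here is just a placeholder):

grep -c 'Child process terminated' index-task.log

The count should match the configured thread count; a smaller number would point at a child process that never exited.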

I set three threads, and three child processes were terminated.

Other than the docker log showing a "ping" to random ports, it seems like the index finishes. I wonder if it is a UI bug -- though odd that it wouldn't go away on refresh... I'll keep a browser console open on my next pass.

Another thing I am going to try is indexing individual subdirectories to see if it's file or content related. Seems unlikely...but maybe?

How would I determine if an index was created once it freezes? If it's a file or a database, I am curious about permissions and just want to see if something has actually been created. Not sure that even makes sense.

And, randomly: could it be a permissions issue? Just asking because so many bugs that I encounter come down to permissions issues... haha...

Then, when something finishes indexing, should something show up in the Elasticsearch log to indicate that? What would that look like?

I can probably figure some of these things out, so will report back what I discover. At this point, I imagine this will be an error between the keyboard and the seat. Oof.

Appreciate the continued support. And, seriously, love Sist2. Nothing else quite like it. :D

rickcecil avatar Jul 28 '23 00:07 rickcecil

I don't think it's related to specific files or a user error. If you can see Set refresh interval <200>, it means that everything was indexed fine (but the sist2 executable doesn't exit for some reason).

You should be able to trick sist2-admin into thinking that the index step was successful by running

sqlite3 sist2-admin-data/state.db
> UPDATE task_done SET return_code=0;
> .exit

Then, if you start the web UI, all your files should be there.

simon987 avatar Jul 28 '23 01:07 simon987

Thanks. Again, much appreciated. I will give that a shot the next time it happens.

rickcecil avatar Jul 28 '23 02:07 rickcecil

Unfortunately, I have not been able to make this suggestion work.

I can't run it through docker as the docker image does not contain sqlite3 AFAICT.

And when I run the command through the host, everything seems to run, but it does not work.

Steps I take from the host:

  1. sudo sqlite3 sist2-admin-data/state.db
  2. UPDATE task_done SET return_code=0;
  3. .exit
  4. Refresh browser to check
  5. docker-compose stop
  6. docker-compose start

The task page still only showed a scan for that task.

  1. sudo sqlite3 sist2-admin-data/state.db
  2. UPDATE task_done SET return_code=0;
  3. .exit
  4. Refresh browser to check if fixed (it wasn't)
  5. docker-compose down
  6. docker-compose up -d

The task page still only showed a scan for that task.

Thanks!

rickcecil avatar Jul 28 '23 17:07 rickcecil
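
Since the sist2-admin image ships a Python interpreter (visible in the traceback further down), the same UPDATE can be run inside the container without a sqlite3 binary. The container name and the in-container path to state.db below are assumptions and will need adjusting to your compose setup:

docker exec -it sist2-admin python3 -c "import sqlite3; con = sqlite3.connect('/sist2-admin/state.db'); con.execute('UPDATE task_done SET return_code=0'); con.commit()"

Running it inside the container also rules out the host-side copy of state.db pointing at a different file than the one sist2-admin actually reads.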

So, now, this problem is occurring every time I re-scan an index.

It happens whether I have 1, 2, 3, or 4 threads set -- on both the scanner and the indexer. (My processor has four cores.)

If I run top on the host, I can see the sist2 service running and I can see when the process stops, but the progress bar remains at 100%.

I clicked Kill and got this error message:

2023-07-31 01:38:07 [INFO] Killing task ee7c097a-9c91-45cf-9eea-602f42163942 (pid=None)
INFO:     192.168.192.23:40012 - "POST /api/task/ee7c097a-9c91-45cf-9eea-602f42163942/kill HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/h11_impl.py", line 419, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 289, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 91, in __call__
    await self.simple_response(scope, receive, send, request_headers=headers)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 146, in simple_response
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
    raise e
  File "/usr/local/lib/python3.10/dist-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 66, in app
    response = await func(request)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 273, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 190, in run_endpoint_function
    return await dependant.call(**values)
  File "/root/sist2-admin/sist2_admin/app.py", line 133, in kill_job
    return task_queue.kill_task(task_id)
  File "/root/sist2-admin/sist2_admin/jobs.py", line 305, in kill_task
    os.kill(pid, signal.SIGTERM)
TypeError: 'NoneType' object cannot be interpreted as an integer

Also, I noticed a new error in the elasticsearch logs: "exception during geoip databases update". I added ingest.geoip.downloader.enabled=false as an environment variable in docker and am no longer getting that error. It did not seem to help this situation, though.

rickcecil avatar Jul 31 '23 11:07 rickcecil

Everything seems to run fine when I run the scan and index via command line, though they do not show up in the admin UI.

Does every search backend need a different Elasticsearch Index Name? Or should they all be the same?

rickcecil avatar Jul 31 '23 12:07 rickcecil

I can confirm this problem. Scanning works fine from sist2-admin when I run it manually, but using the scheduled rescan somehow makes it not finish the task. It seems somewhat random as to when it gets stuck -- sometimes on the first run, sometimes on the 7th.

Badlee2020 avatar Jul 31 '23 19:07 Badlee2020

I have the same issue. I can scan and index two jobs, but a third one (reproducible -- only this one) will not complete, no matter how many threads I'm using or how big the batch size is. The last log entries are:

2023-07-31 09:54:48 [DEBUG elastic.c] Indexed 1000 documents (3703kB) <200>
2023-07-31 09:54:50 [DEBUG elastic.c] Indexed 1000 documents (3678kB) <200>
2023-07-31 09:54:50 [DEBUG elastic.c] Indexed 1000 documents (3416kB) <200>

Then nothing happens. On the admin website the task doesn't complete, and pressing Kill does nothing either. Let me know if I can provide any helpful information.

einfachTobi avatar Aug 01 '23 08:08 einfachTobi

Thanks for the additional context, I'll take a look when I have a chance.

[INFO] Killing task ee7c097a-9c91-45cf-9eea-602f42163942 (pid=None)

This might be the root cause. If the pid is null, then I think waitpid will wait until the current process is completed (i.e. never), and also the "Kill" button won't work.

simon987 avatar Aug 01 '23 22:08 simon987

The problem still exists. Is there any work on it or anything I can contribute other than code?

einfachTobi avatar Nov 16 '23 13:11 einfachTobi

I've been running into this issue recently and have come up with two "fixes" for my case; hopefully this helps others.

  1. Terminated scans will pile up in the directory, and subsequent runs will generate new indices; deleting both versions allows for a clean restart of the job. These can also be identified as jobs with the status 'started' instead of 'indexed' or 'created'.

  2. Upon reaching

[INFO elastic.c] Refresh index <200>
[INFO elastic.c] Merge index <200>
[INFO elastic.c] Set refresh interval <200>

exec into the container, find the sist2 web UI processes with top, and send them SIGINT. SIGKILL/SIGTERM will cause them to restart, and SIGKILL/SIGTERM on the scanning processes creates zombies. If done correctly, [ADMIN ] Restart frontend pid=1465 frontend_name='name' should come up; repeat for the number of frontend processes.

mchangrh avatar May 22 '24 21:05 mchangrh
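
For reference, the second workaround above boils down to something like the following. The container name is an assumption, and the PID placeholder comes from whatever top shows for the stuck sist2 web processes:

docker exec -it sist2-admin bash
top                  # note the PIDs of the sist2 web UI processes
kill -INT <pid>      # SIGINT only; SIGKILL/SIGTERM restarts them or leaves zombies

After each SIGINT, the [ADMIN ] Restart frontend line quoted above should appear, once per frontend process.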