
MongoObserver fails - QueuedMongoObserver stalls forever

Open eliorc opened this issue 4 years ago • 18 comments

I'm using Sacred to monitor experiments, which all run inside docker-compose containers; Sacred runs inside an Ubuntu 18.04 container, while Mongo runs on the official Mongo image, tag 4.0.11.

I am using Omniboard to monitor the results. The experiment starts successfully, but after roughly 2 minutes the MongoObserver seems to fail: nothing new shows up in Omniboard and the status becomes PROBABLY_DEAD, which from my understanding means no heartbeat is being received. At the end of the run (after a few hours) I get

MongoObserver failed at some point during the run

Following #434 I tried using QueuedMongoObserver, which fails at about the same point but then stalls forever at the end of the run (I left it running a full night, just to see whether it would recover). In that case I don't even see the error above; the script simply stalls after the logs show

I0804 20:38:49.539427 140619051980608 run.py:238] Completed after 2:33:44

Nothing special happens at the point where it fails. It occurs during some long, ordinary processing (converting text to embedding indices and padding) which takes about 5 minutes. Looking at the logs, it sometimes fails after the text conversion and sometimes after the padding; nothing is consistent, except that it always fails within less than 5 minutes in total.

@JarnoRFB - On another note: from my understanding you are the author of the QueuedMongoObserver, and you mentioned you would be happy for people to try it out, so I did. It may be worth noting the stalling part at the end; it seems like the observer stalls the experiment forever if it fails at some point.

eliorc avatar Aug 05 '19 07:08 eliorc

@eliorc Thanks for trying it out! Those are exactly the problems I expected would come up if using it for real. Just from reading your description it seems like the mongo connection is dropped and then the QueuedMongoObserver retries forever. What is kind of strange is that it seems to succeed in the end. It would be extremely helpful if you could somehow share a minimal example reproducing the problem. Otherwise I will just try to investigate this error condition. Thanks for reporting!

JarnoRFB avatar Aug 05 '19 18:08 JarnoRFB

This will be extremely hard since the code is private, but if you wish I can insert logging anywhere you want in the code and post the outputs...

There are actually two things wrong here: (1) the MongoObserver/QueuedMongoObserver always fails without any error that I can see, and (2) when the QueuedMongoObserver fails, it stalls forever.

If it is of any help, here is how things are connected (the containers and Sacred), with some obvious changes for privacy:

python_script.py

...

# Imports used by the snippet below
import os
from pathlib import Path

import sacred
from sacred.observers import MongoObserver
from sacred.utils import apply_backspaces_and_linefeeds

import mymodel

# Sacred related code
ex = sacred.Experiment(name='train')
mongo_url = 'mongodb://{mongo_username}:{mongo_password}@mongo:27017/?authMechanism=SCRAM-SHA-1'.format(
    mongo_username=os.environ['MONGO_USERNAME'],
    mongo_password=os.environ['MONGO_PASSWORD'])
ex.observers.append(MongoObserver.create(
    url=mongo_url,
    db_name='db'))
ex.captured_out_filter = apply_backspaces_and_linefeeds

# Adding source files (skip __init__.py modules)
for f in filter(lambda fn: fn.stem != '__init__', Path('utils').rglob('*.py')):
    ex.add_source_file(str(f))
ex.add_source_file(mymodel.__file__)
...

Then I'm using _log.info a few times before it fails.
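For context, a minimal sketch of how _log is used here: Sacred injects a logger into captured functions via the special _log argument (the function body below is illustrative only, using the ex defined above):

@ex.automain
def run(_log):
    _log.info('converting text to embedding indices')  # long preprocessing happens around here
    _log.info('padding sequences')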

docker-compose.yml

version: '2.3'

services:
  docs:
    build:
      context: ''
      dockerfile: 'dockerfiles/some_image2'
    image: 'some_image2:0.2'
    hostname: 'docker_hostname2'
    ports:
      - '5000:5000'
    volumes:
      - '.:/opt'
    command: entrypoint.sh

  main_container:
    build:
      context: ''
      dockerfile: 'dockerfiles/some_image'
    image: 'some_image:0.3'
    runtime: 'nvidia'
    hostname: 'docker_hostname1'
    ports:
      - '6006:6006'
      - '8888:8888'
    environment:
      MONGO_USERNAME: 'admin'
      MONGO_PASSWORD: 'password'
      SACRED_MONGO_DATABASE: 'db'
    volumes:
      - '.:/opt'
    command: entrypoint.sh

  mongo:
    image: mongo
    ports:
      - '27017:27017'
    environment:
      MONGO_INITDB_ROOT_USERNAME: 'admin'
      MONGO_INITDB_ROOT_PASSWORD: 'password'
      MONGO_INITDB_DATABASE: 'db'
    expose:
      - '27017'
    volumes:
      - './data/mongo_backup:/data/db'

  omniboard:
    image: vivekratnavel/omniboard:latest
    command: ["--mu", "mongodb://admin:password@mongo:27017/db?authSource=admin"]
    ports:
      - '5590:9000'
    depends_on:
      - 'mongo'

eliorc avatar Aug 06 '19 12:08 eliorc

This behavior might be fixed with the changes in #590. In the current implementation, the queue observer retries forever so that the MongoObserver processes all events in order. So if the MongoObserver fails on an event before a terminal event, i.e. interrupt, fail or complete, the QueueObserver will run forever. With the changes, the QueueObserver should terminate shortly after the run if something went wrong.
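For illustration, here is a minimal, self-contained sketch of this failure mode. It is not Sacred's actual implementation and the names are made up: the worker retries each event until it succeeds, so if the handler keeps raising before a terminal event arrives, join() never returns.

import queue
import threading
import time

def worker(events, handle, retry_interval=1.0):
    while True:
        event = events.get()
        if event is None:            # sentinel: shut down
            events.task_done()
            return
        while True:                  # retry this single event forever
            try:
                handle(event)
                break
            except Exception:
                time.sleep(retry_interval)
        events.task_done()

events = queue.Queue()
threading.Thread(target=worker, args=(events, print), daemon=True).start()
events.put("heartbeat")
events.put(None)
events.join()   # hangs forever if handle() never stops raising for some event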

JarnoRFB avatar Aug 14 '19 16:08 JarnoRFB

We are only discussing the infinite retry of the QueuedMongoObserver, but the underlying problem is that both (QueuedMongoObserver / MongoObserver) fail every time, so I can't use them for monitoring. How can I debug this? How can I tell why my observer fails mid-run? It is reproducible every time, but I can't share the code...

eliorc avatar Aug 15 '19 07:08 eliorc

So there are several things that can go wrong:

  • The database connection drops mid-run, so the mongo observer can no longer send messages. This seems unlikely if the run at least starts and only fails after a certain time.
  • You try to send an unserializable message.
  • The run document's capacity is exceeded due to too much captured output.
  • The metrics document's capacity is exceeded.

To check the last two points I would take a look at the run documents in the database, for example using mongo-express (a size check is sketched below). Check https://github.com/IDSIA/sacred/blob/master/examples/docker/docker-compose.yml for a docker setup.

For the first two points I would set some breakpoints in the mongo observer and see what happens.
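As an aside, here is a minimal sketch of such a size check. It assumes pymongo is installed, the credentials from the compose file above, that the Mongo port is published to the host, and that Sacred stores one document per run in the 'runs' collection of the configured database:

import bson
from pymongo import MongoClient

client = MongoClient('mongodb://admin:password@localhost:27017/?authMechanism=SCRAM-SHA-1')
runs = client['db']['runs']

for run in runs.find({}):
    size_mb = len(bson.encode(run)) / 1e6    # BSON size of the whole run document
    print(run['_id'], run.get('status'), '%.2f MB (limit is 16 MB)' % size_mb)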

JarnoRFB avatar Aug 15 '19 08:08 JarnoRFB

Okay, my script does output a lot, and indeed it fails where a big chunk of messages arrives. So is it possible to turn off output capturing?

eliorc avatar Aug 15 '19 09:08 eliorc

Indeed you can do ex.captured_out_filter = lambda text: 'Output capturing turned off.' We really need to document this behavior of the mongo observer ...

JarnoRFB avatar Aug 15 '19 09:08 JarnoRFB

Awesome I'll try that

eliorc avatar Aug 15 '19 10:08 eliorc

@eliorc did the problem go away after turning off captured output?

JarnoRFB avatar Aug 22 '19 08:08 JarnoRFB

Truth is I haven't tried it yet; I'm OOO. I'll do my best to remember to do so next Sunday and report back.

eliorc avatar Aug 22 '19 08:08 eliorc

No pressure, I just wondered :D

JarnoRFB avatar Aug 22 '19 08:08 JarnoRFB

Ok, I finally tried it again and ex.captured_out_filter = lambda text: 'Output capturing turned off.' did not help. It does prevent the output from being captured, but it still fails after ~3 minutes.

Is there any way to see the observer's error messages? Some logs or something? It just dies somewhere along the way for no apparent reason.

eliorc avatar Aug 30 '19 10:08 eliorc

If possible, you could just clone sacred, modify the code to add some logging and then pip install . into your environment. Using the queue observer you could catch and log the exception here https://github.com/IDSIA/sacred/blob/master/sacred/observers/queue.py#L83
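A less invasive alternative, sketched here under the assumption that Sacred only calls the observer's event methods and reads its attributes (the proxy class and logger name are made up for illustration, not part of Sacred's API): wrap the observer so every exception raised by one of its methods is written to a log file before being re-raised.

import logging

logging.basicConfig(filename='observer_errors.log', level=logging.INFO)
logger = logging.getLogger('observer_debug')

class LoggingObserverProxy:
    """Forwards attribute access to the wrapped observer and logs any
    exception raised by its methods before re-raising it."""

    def __init__(self, wrapped):
        self._wrapped = wrapped

    def __getattr__(self, name):
        attr = getattr(self._wrapped, name)
        if not callable(attr):
            return attr

        def logged(*args, **kwargs):
            try:
                return attr(*args, **kwargs)
            except Exception:
                logger.exception('Observer call %s failed', name)
                raise

        return logged

# Hypothetical usage with the setup from above:
# ex.observers.append(LoggingObserverProxy(MongoObserver.create(url=mongo_url, db_name='db')))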

JarnoRFB avatar Sep 03 '19 11:09 JarnoRFB

😱 Too much overhead for me right now. Maybe it would be a good idea to let the (Queued)MongoObserver constructor accept a file path and support logging into user-defined files.

eliorc avatar Sep 04 '19 06:09 eliorc

Yes, better logging for the observers would be a good thing.

JarnoRFB avatar Sep 04 '19 08:09 JarnoRFB

Update: what caused the error is that I had saved a folder as an artifact, which caused an error (I did not know this is not possible), and therefore the observer failed.

As I said, no error indicated this, neither in the MongoObserver nor in the QueuedMongoObserver.
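For anyone hitting the same root cause: add_artifact expects a single file, so a directory has to be added file by file. A small workaround sketch (the helper name is made up; it assumes add_artifact accepts an optional name argument, so drop that if your Sacred version does not support it):

from pathlib import Path

def add_directory_as_artifacts(experiment, directory):
    """Add every file below `directory` as an individual artifact."""
    directory = Path(directory)
    for path in directory.rglob('*'):
        if path.is_file():
            experiment.add_artifact(str(path), name=str(path.relative_to(directory)))

# Hypothetical usage with the experiment defined above:
# add_directory_as_artifacts(ex, 'outputs/checkpoints')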

eliorc avatar Oct 24 '19 13:10 eliorc

Is this fixed? I recently discovered a very similar problem when using Sacred to track training of a TensorFlow model.

For debugging I ran only short training procedures. There I found that the QueuedMongoObserver always stalls, but the MongoObserver without the queue works as expected, i.e. all metrics, outputs and artifacts are captured as expected.

I am using Python 3.7 with sacred 0.8.1.

AndreGosselink avatar Mar 05 '20 19:03 AndreGosselink

This started happening to me as well when I added a lot more log_scalar calls.

vladfi1 avatar Nov 22 '20 12:11 vladfi1