sacred
MongoObserver fails - QueuedMongoObserver stalls forever
I'm using Sacred to monitor experiments that all run inside docker-compose containers. Sacred runs inside an Ubuntu 18.04 container, while the Mongo container runs the official Mongo image, tag 4.0.11.
I am using Omniboard to monitor the results. The experiment starts successfully, but after about 2 minutes the MongoObserver seems to fail: nothing shows up in Omniboard and the status becomes PROBABLY_DEAD, which from my understanding is the result of no heartbeat. At the end of the run (after a few hours) I get
MongoObserver failed at some point during the run
Following #434 I tried QueuedMongoObserver, which fails at about the same point but then stalls forever at the end of the run (I left it running a full night, just to see if it would recover). In that case I don't even see the error above; the script just stalls after the logs show
I0804 20:38:49.539427 140619051980608 run.py:238] Completed after 2:33:44
Nothing special happens at the point where it fails. It occurs during some long, ordinary processing (converting text to embedding indices and padding) that takes about 5 minutes. Looking at the logs, it sometimes fails after the text conversion and sometimes after the padding; nothing is consistent apart from it always happening within less than 5 minutes total.
@JarnoRFB - On another note, from my understanding you wrote the QueuedMongoObserver and noted you'd be happy for people to try it out, so I did. It might be worth noting the stalling part at the end: the observer seems to stall the experiment forever if it fails at some point.
@eliorc Thanks for trying it out! Those are exactly the problems I expected would come up if using it for real. Just from reading your description it seems like the mongo connection is dropped and then the QueuedMongoObserver retries forever. What is kind of strange is that it seems to succeed in the end. It would be extremely helpful if you could somehow share a minimal example reproducing the problem. Otherwise I will just try to investigate this error condition. Thanks for reporting!
This will be extremely hard since the code is private, but if you wish I can insert logging anywhere you want in the code and post the outputs...
There are actually two things wrong here: (1) the MongoObserver/QueuedMongoObserver always fails without any error that I can see, and (2) when the QueuedMongoObserver fails, it stalls forever.
If it is of any help, here is how things are connected (the containers and sacred), with some obvious changes for privacy:
python_script.py

import os
from pathlib import Path

import sacred
from sacred.observers import MongoObserver
from sacred.utils import apply_backspaces_and_linefeeds

import mymodel  # private module

...
# Sacred related code
ex = sacred.Experiment(name='train')
mongo_url = 'mongodb://{mongo_username}:{mongo_password}@mongo:27017/?authMechanism=SCRAM-SHA-1'.format(
    mongo_username=os.environ['MONGO_USERNAME'],
    mongo_password=os.environ['MONGO_PASSWORD'])
ex.observers.append(MongoObserver.create(
    url=mongo_url,
    db_name='db'))
ex.captured_out_filter = apply_backspaces_and_linefeeds
# Adding source files
for f in filter(lambda fn: fn.stem != '__init__', Path('utils').rglob('*.py')):
    ex.add_source_file(str(f))
ex.add_source_file(mymodel.__file__)
...
Then I'm using _log.info a few times before it fails.
docker-compose.yml
version: '2.3'
services:
  docs:
    build:
      context: ''
      dockerfile: 'dockerfiles/some_image2'
    image: 'some_image2:0.2'
    hostname: 'docker_hostname2'
    ports:
      - '5000:5000'
    volumes:
      - '.:/opt'
    command: entrypoint.sh
  main_container:
    build:
      context: ''
      dockerfile: 'dockerfiles/some_image'
    image: 'some_image:0.3'
    runtime: 'nvidia'
    hostname: 'docker_hostname1'
    ports:
      - '6006:6006'
      - '8888:8888'
    environment:
      MONGO_USERNAME: 'admin'
      MONGO_PASSWORD: 'password'
      SACRED_MONGO_DATABASE: 'db'
    volumes:
      - '.:/opt'
    command: entrypoint.sh
  mongo:
    image: mongo
    ports:
      - '27017:27017'
    environment:
      MONGO_INITDB_ROOT_USERNAME: 'admin'
      MONGO_INITDB_ROOT_PASSWORD: 'password'
      MONGO_INITDB_DATABASE: 'db'
    expose:
      - '27017'
    volumes:
      - './data/mongo_backup:/data/db'
  omniboard:
    image: vivekratnavel/omniboard:latest
    command: ["--mu", "mongodb://admin:password@mongo:27017/db?authSource=admin"]
    ports:
      - '5590:9000'
    depends_on:
      - 'mongo'
This behavior might be fixed with the changes in #590. In the current implementation, the queue observer will retry forever to have the MongoObserver process all events in order. So if the MongoObserver fails before a terminal event, i.e. interrupt, fail or complete, the QueueObserver will run forever. With the changes the QueueObserver should terminate shortly after the run if something went wrong.
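To illustrate the idea behind the fix (bounded retries instead of retrying forever), here is a minimal, hypothetical sketch of a retry helper; it is not Sacred's actual code, and the names `handle_event`/`event` are stand-ins for the queue worker's callback and a queued observer event:

```python
import time

def process_with_retry(handle_event, event, max_retries=3, delay=1.0):
    """Retry a failing observer event a bounded number of times.

    After max_retries failed attempts, give up and raise, so the run
    can terminate with an error instead of stalling forever.
    """
    for attempt in range(1, max_retries + 1):
        try:
            return handle_event(event)
        except Exception as exc:
            if attempt == max_retries:
                raise RuntimeError(
                    "observer failed after {} attempts".format(max_retries)
                ) from exc
            time.sleep(delay)
```

With an unbounded loop in place of `max_retries`, a permanently failing MongoObserver keeps the queue worker alive forever, which matches the stalling behavior described above.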
We are only discussing the infinite retry of the QueuedMongoObserver, but the underlying problem is that both (QueuedMongoObserver / MongoObserver) fail every time, and I can't use them to monitor. How can I debug this? How can I tell why my observer is failing mid-run? It is reproducible every time, but I can't supply the code...
So there are several cases that can go wrong:
- The database connection drops mid-run, so the mongo observer cannot send messages anymore. This seems unlikely if the run at least starts and only fails after a certain time.
- You try to send an unserializable message.
- The run document capacity is exceeded due to too much captured output.
- The metrics document capacity is exceeded.
To check the last two points I would take a look at the run documents in the database, for example using mongo-express. Check https://github.com/IDSIA/sacred/blob/master/examples/docker/docker-compose.yml for a docker setup.
For the first two points I would set some breakpoints in the mongo observer and see what happens.
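For the capacity points, the relevant constraint is MongoDB's 16 MB BSON document limit. A minimal sketch of a rough size check on the captured output (an approximation only, since BSON encoding adds overhead and the run document contains more than the output):

```python
# MongoDB rejects documents larger than 16 MB (the BSON document limit).
MONGO_MAX_DOC_BYTES = 16 * 1024 * 1024

def exceeds_mongo_limit(captured_output: str, margin: float = 0.95) -> bool:
    """Return True if the captured output alone is already close to
    the 16 MB cap, leaving a safety margin for the rest of the run
    document (config, host info, metrics references, ...)."""
    return len(captured_output.encode("utf-8")) > MONGO_MAX_DOC_BYTES * margin
```

If a run's captured output is in this range, the observer's update of the run document will start failing on the server side.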
Okay, my script does output a lot, and indeed it fails where a big chunk of messages arrives. So is it possible to turn off the output capturing?
Indeed you can do ex.captured_out_filter = lambda text: 'Output capturing turned off.'
We really need to document this behavior of the mongo observer ...
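As a middle ground between full capture and discarding everything, a truncating filter can keep the document small while preserving the end of the log. A minimal sketch (the `captured_out_filter` hook is Sacred's; the helper itself is hypothetical):

```python
def tail_filter(max_chars: int = 10_000):
    """Build a captured_out_filter that keeps only the last max_chars
    characters of the output, keeping the run document well under
    MongoDB's 16 MB document limit."""
    def _filter(text: str) -> str:
        return text if len(text) <= max_chars else text[-max_chars:]
    return _filter

# Usage with a Sacred experiment:
# ex.captured_out_filter = tail_filter()
```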
Awesome I'll try that
@eliorc did the problem go away after turning off captured output?
Truth is I haven't tried it yet, I'm OOO, I'll do my best to remember doing so next Sunday and I'll report
No pressure, I just wondered :D
Ok, I finally tried it again, and ex.captured_out_filter = lambda text: 'Output capturing turned off.'
did not help. It does prevent the output from being captured, but it still fails after ~3 minutes.
Is there any way to see error messages from the observer? Some logs or something? It just dies somewhere along the way for no apparent reason.
If possible, you could just clone sacred, modify the code to add some logging, and then pip install .
into your environment. Using the queue observer, you could catch and log the exception here: https://github.com/IDSIA/sacred/blob/master/sacred/observers/queue.py#L83
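If modifying Sacred itself is too much, a generic wrapper can log exceptions from any observer's event methods without touching the library. A minimal sketch; the event-method names (`log_metrics`, `artifact_event`, ...) follow Sacred's observer interface, but the wrapper is plain Python:

```python
import logging

class LoggingObserverWrapper:
    """Wrap an observer and log any exception its methods raise,
    instead of letting the failure disappear silently."""

    def __init__(self, observer, logger=None):
        self._observer = observer
        self._logger = logger or logging.getLogger("sacred.observer")

    def __getattr__(self, name):
        # Delegate attribute access to the wrapped observer; wrap
        # callables so exceptions are logged before propagating.
        attr = getattr(self._observer, name)
        if not callable(attr):
            return attr

        def wrapped(*args, **kwargs):
            try:
                return attr(*args, **kwargs)
            except Exception:
                self._logger.exception("observer call %r failed", name)
                raise
        return wrapped

# Hypothetical usage:
# ex.observers.append(LoggingObserverWrapper(MongoObserver.create(url=mongo_url, db_name='db')))
```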
😱 Too much overhead for me right now. Maybe it's a good idea to let the (Queued)MongoObserver constructor accept a file path and support logging to user-defined files.
Yes, better logging for the observers would be a good thing.
Update: what caused the error is that I had saved a folder as an artifact, which caused an error (I did not know that is not possible) and therefore the observer failed.
As I said, no error indicated this, neither in the MongoObserver nor in the QueuedMongoObserver.
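Since add_artifact only accepts single files, a directory has to be packed first. A minimal sketch of a workaround using only the standard library (the helper name and usage line are hypothetical):

```python
import shutil
import tempfile
from pathlib import Path

def folder_as_artifact(folder):
    """Zip a directory into a temporary archive so it can be attached
    as a single-file artifact. Returns the path to the zip file."""
    folder = Path(folder)
    archive_base = Path(tempfile.mkdtemp()) / folder.name
    return shutil.make_archive(str(archive_base), "zip", root_dir=str(folder))

# Hypothetical usage with a Sacred experiment:
# ex.add_artifact(folder_as_artifact("model_checkpoints"))
```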
Is it fixed? I discovered a very similar problem lately when using Sacred to track the training of a tensorflow model.
For debugging I ran only short training procedures. There I found that the QueuedMongoObserver always stalls, but the MongoObserver without the queue works as expected: all metrics, outputs and artifacts are captured correctly.
I am using Python 3.7 with sacred 0.8.1.
This started happening to me as well when I added a lot more log_scalar calls.
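If a very high volume of log_scalar calls is the trigger, throttling them reduces the number of metric events the observer has to process. A minimal, hypothetical sketch assuming a `(name, value, step)` signature like Sacred's `_run.log_scalar`:

```python
def make_throttled_logger(log_scalar, every=100):
    """Wrap a log_scalar-style callable so only every `every`-th step
    is actually forwarded, cutting metric traffic to the observer."""
    def throttled(name, value, step):
        if step % every == 0:
            log_scalar(name, value, step)
    return throttled

# Hypothetical usage inside a captured function:
# log = make_throttled_logger(_run.log_scalar, every=100)
# log('loss', loss_value, step)
```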