bonobo icon indicating copy to clipboard operation
bonobo copied to clipboard

Testing a bonobo pipeline

Open benrudolph opened this issue 7 years ago • 5 comments

I'm using bonobo in my django project and specifically using the ETLCommand to run my ETL processes. I want to write one or two integration tests for this command using Django's call_command. However, when I use call command I get an error stating:

django.db.utils.OperationalError: database "XYZ" is being accessed by other users
DETAIL:  There is 1 other session using the database.

I believe this is happening due to bonobo's async nature; I tested this hypothesis by adding a sleep after the call_command and that "fixed" the issue.

I'm curious if you've run into this issue and if there's a proper way to fix it? Perhaps an ability to run the graph synchronously?

benrudolph avatar Feb 17 '18 04:02 benrudolph

You can indeed run the graph in a non-parallel way, using the "naive" strategy. As the name suggest, it can bring other problems but should work for the simplest use cases.

There is no real way to change the strategy in an ETLCommand, unless you change the handle method to pass it to bonobo.run(). This should be an easy enhancement, though.

I prefer using "parallel" instead of "async" to describe how it runs, not to confuse it with asyncio and friends.

I will try to write a minimal use case for what you're describing to understand why it is behaving this way. What kind of DB are you using under the scene ? I believe (but maybe I'm wrong) that this should not be a problem for a database, unless it's something like sqlite which does not really support having more than one concurrent user accessing it.

hartym avatar Feb 19 '18 10:02 hartym

This is with Postgresql. I believe the reason that error happens is that Postgres doesn't allow you to drop the database when there is more than 1 active connection to the db. So when the test runner is trying to drop the db after the test is completed, my guess is one of the bonobo workers still maintains a connection because it hasn't quite disconnected.

The base case for me was pretty simple, I just had a graph where the last node inserted data into the database.

Thanks for the tip on parallel vs async, noted.

benrudolph avatar Feb 19 '18 17:02 benrudolph

I don't understand why there is more than one connection open, but that's maybe django-orm specific. Is there a way in django to explicitely close connection?

Let's keep this issue open, I'll try to write a simple case like this to reproduce the issue but can't give an ETA.

  • [ ] Have a way to select execution strategy when using ETLCommand for Django
  • [ ] Find a way to run call_command in a test case or other batch that does not bring concurrency problems.
  • [ ] Write an example test case for django in the documentation

hartym avatar Feb 21 '18 12:02 hartym

FWIW I tested with naive strategy and that fixes the issue. Thanks for the tip.

benrudolph avatar Mar 05 '18 03:03 benrudolph

Naive removes all parallelism. It's stupid for production, but can be great for testing purposes. Sorry still had no time to reproduce the thing and try to understand what happens, let me know if you have more info.

hartym avatar Mar 08 '18 15:03 hartym