platoon
Compilation locks
Running on the cluster I get a lot of fighting over locks. Maybe it would be good to have functionality that allows the first worker to finish compiling (and fill the cache) before the others start?
You basically just use it like this:
worker.start_compilation()                  # first worker proceeds, the others block here
f = theano.function([inputs], [outputs])    # compile the functions (and fill the cache)
# etc.
worker.end_compilation()                    # release: the waiting workers continue
The first worker to get there goes right on; the others wait until the first worker reaches end_compilation, after which the rest can go.
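For context, a minimal sketch of how such a barrier could work on top of a shared lock directory; the path and helper names here are illustrative, not the PR's actual implementation:

```python
import errno
import os
import time

LOCK_DIR = "/shared/path/compile.lock"  # illustrative shared location, e.g. next to the Theano cache

def start_compilation():
    """First caller takes the lock and compiles; the rest wait until it is released."""
    try:
        os.mkdir(LOCK_DIR)          # atomic even on NFS: exactly one worker succeeds
        return True                 # this worker will fill the cache
    except OSError as e:
        if e.errno != errno.EEXIST:
            raise
    while os.path.isdir(LOCK_DIR):  # lock held: wait for the first worker to finish
        time.sleep(1)
    return False                    # cache is warm, proceed without compiling first

def end_compilation():
    os.rmdir(LOCK_DIR)              # release: the waiting workers can now continue
```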
This isn't always good. If the optimizations take time and the cache is already full, then this will slow things down...
The few good fixes I know of:
- The first time, start one worker to fill the cache, then restart.
- Make Theano compile faster when the cache is empty.
- Updates to the lock? (I'm not sure this is possible.)
I'm not against this PR if this is optional, but we should not make it mandatory or tell people to always use it.
The first time, start one worker to fill the cache, then restart.
That's kind of what I was trying to do, but it's true that this approach is not very efficient in the case where the cache is already full. Is there a way to check if the cache is full or not?
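A crude check, assuming the default compiledir layout where each compiled module lives in its own subdirectory, could look like the sketch below; it only tells you the cache is non-empty, not that it holds the modules your graph needs:

```python
import os
import theano

def cache_looks_warm(min_modules=1):
    """Rough heuristic: does the Theano compiledir already contain compiled modules?"""
    compiledir = theano.config.compiledir
    module_dirs = [d for d in os.listdir(compiledir)
                   if os.path.isdir(os.path.join(compiledir, d))]
    return len(module_dirs) >= min_modules
```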
On which cluster did you have problems with this?
Helios, my compilation directory was in $RAP
This is mostly what the Theano cache does, but via the file system... So mostly, the gain is bypassing the filesystem? Do you see something else?
I was just trying to avoid Theano's locking system, because it seemed really inefficient. The waiting period is between 5 and 10 seconds by default, which seems long. The lock itself is two calls, isdir and mkdir, which I guess technically speaking is not atomic and can result in race conditions (I had one process crash with a strange error which I think happened because the cache was corrupt, maybe because of that?).
What is the motivation for Theano not using fcntl.flock or fcntl.lockf? In the lab our system seems to be NFS4 with local_lock=none, so it should support locking files using fcntl. Hades and a bunch of other clusters use GPFS, which also supports fcntl locks. The only problem seems to be Helios, which uses Lustre but with localflock enabled (so locks are node-local apparently?). Do you think they would be willing to enable global locks? If so, Theano could switch to using blocking fcntl calls, leaving things up to the file system, which is likely to be far more efficient.
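For reference, a blocking fcntl.lockf lock is only a few lines. This is a sketch, not Theano's actual code, and it only behaves correctly on file systems where fcntl locks are global:

```python
import fcntl
from contextlib import contextmanager

@contextmanager
def locked(lock_path):
    """Block until an exclusive lock on lock_path is acquired; release it on exit."""
    with open(lock_path, "w") as f:
        fcntl.lockf(f, fcntl.LOCK_EX)   # blocks; the kernel/file system queues the waiters
        try:
            yield
        finally:
            fcntl.lockf(f, fcntl.LOCK_UN)
```

A worker would then wrap cache access in something like `with locked(os.path.join(compiledir, ".lock")):` and let the file system queue the waiters instead of polling.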
I think that folder creation is atomic and that's the reason why it's used there. As for the global lock, I asked them before to enable it and they said that it would have a huge impact on performance. I then did my research and it's totally false, so I have plans to ask again with more data in the coming weeks.
I also agree with Fred; I don't think this PR is the right solution. I think that just launching a dummy job on the test queue to fill the cache first is the best thing to do at the moment.
Folder creation is, yes, but you first need to test whether it exists already. In between this test and the creation another process might have created it already and your folder creation will fail; so the locking operation isn't atomic in any way. It's generally a bad idea to try and reimplement locking when there are system calls that do it for you, but since you asked them about enabling global locking on Lustre I guess you are thinking the same thing.
When I looked into this just now, I reached the same conclusion. One paper actually says explicitly:
While the Lustre documentation states that the locking mechanism can be disabled for higher performance, we have never observed such improvement by doing so.
Yes, I totally agree that reimplementing locking is a terrible idea in general; that is why I asked them to fix that around a year ago :P
Thanks for the paper, I'll add it to the evidence I'll send them.
As for locking with folder creation, I don't know how it's implemented in Theano, but how about just creating the folder and catching the "already exists" exception, as we do here (https://github.com/SMART-Lab/smartdispatch/pull/100/files)?
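Something along these lines (a sketch; on Python 3 you could catch FileExistsError directly, the errno check keeps it Python 2 compatible):

```python
import errno
import os

def try_acquire(lock_dir):
    """Attempt to take the lock with a single mkdir; no isdir pre-check needed."""
    try:
        os.makedirs(lock_dir)
        return True            # we created it, so the lock is ours
    except OSError as e:
        if e.errno == errno.EEXIST:
            return False       # someone else holds the lock
        raise
```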
The directory creation is atomic. In fact, it is the only POSIX operation guaranteed to be atomic there, which is why we use it. We needed that in the past, as we were using NFS3, which didn't have a working global lock.
Maybe the isdir can be removed to lower the load on the FS. I don't think it will really help, but it should be quick to implement. Do you want to do it?
The 5-10s shouldn't be a problem. During this time, one process is compiling, and when a process gets the lock, it will pick up what the others have done. In fact, to lower the overhead on the OS and help with that, we could raise this on Helios to 30s-60s.
@abergeron, and others here, what do you think of raising the wait time? Do you also think it can help? Bart, do you have the time to clear the cache and try it with a higher wait time? There is a Theano flag for this: compile.wait=30 would do it.
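For whoever runs the test, the flag can also be set from the environment, assuming no other flags are already set there; it has to happen before Theano is imported:

```python
import os

# Equivalent to launching with THEANO_FLAGS="compile.wait=30".
os.environ["THEANO_FLAGS"] = "compile.wait=30"

import theano  # the config is read at import time
```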
The current locking code for Theano works very well in all sorts of hostile environments. This is not the case for fcntl() and flock(), which both have silent failure cases in some configurations.
The isdir call is only there as an optimization to avoid doing the mkdir, which is the real "lock" here. As Mathieu said, we use mkdir because it is atomic on NFS (in fact that is the only guaranteed atomic operation on all NFS versions).
It might be a tiny bit faster to use an fcntl() lock in environments that support it, but that time would be dwarfed by the time of the compilation itself.
What might improve performance of the cache is a better index than looping through the directory. This would also reduce the load on the filesystem.
I am not sure if increasing the wait time is actually going to win anything here.
The real problem is this: suppose that, with an empty cache, launching 1 job takes 10m to fill the cache. Then it frequently happens that if you have an empty cache and launch many jobs at the same time, it takes more than 10m for the first job to finish. It can take 30m, 1h, and even more.
I don't understand why this happens. If the problem is fighting over an inefficient lock via the FS, fcntl could help, and raising the waiting time could also help, since we would try to take the lock less often and the same process would keep it for longer.
@abergeron, what tells you that the isdir() is an optimization? Did you time it? Maybe it is as costly as mkdir; if that is the case, it is not efficient to use it.
Fred
I didn't time it. I'm talking about what's there. Maybe the isdir is superfluous and we could go straight for mkdir.
Also, we should try to see why it takes more time with many processes. Is it because they are rescanning the cache? Is it because they are checking the lock too often? Is it because they simply cause too much filesystem access when combined? Having an answer to these questions would be more helpful than just replacing things blindly.
From timings I did I know that we are spending a large amount of time looking for things by listing the directory in the single-process case. Does that extend to the multi-process case? Is it made worse? That I don't know.
We rescan each time we take the lock. This is needed so we don't compile the same module multiple times. I don't think a listdir is slower than putting that information into a file and reading it back.
Making the wait time longer could help by causing less scanning?
In the past, we took the lock at the start and kept it for the whole compilation process. To avoid taking it when we don't compile C code, I postponed it to only when we need it. But I forgot whether we keep it until the end of the C file compilation or not.
So I just created a little benchmark compiling our machine translation model for CPU on my desktop at home. If I start 4 workers in parallel with an empty cache and Theano's directory locking:
2016-02-19 11:00:54,316:Worker 1: Finished, took 282.9769949913025
2016-02-19 11:01:03,866:Worker 0: Finished, took 292.5302128791809
2016-02-19 11:01:04,291:Worker 3: Finished, took 292.93801403045654
2016-02-19 11:01:14,747:Worker 2: Finished, took 303.4081165790558
2016-02-19 11:01:14,819:Completed parallel processing, took 303.4868779182434
If I start 1 first, and then the other 3:
2016-02-19 11:05:00,069:Worker 0: Finished, took 225.24447321891785
---
2016-02-19 11:06:40,361:Worker 3: Finished, took 100.25727009773254
2016-02-19 11:06:42,032:Worker 2: Finished, took 101.92840576171875
2016-02-19 11:06:50,461:Worker 1: Finished, took 110.35797786712646
2016-02-19 11:06:50,518:Completed sequential processing, took 335.69908452033997
If I use fcntl.lockf (see https://github.com/bartvm/Theano/commit/a341aa8f7e7f1fdcd071588b5c804fec1c469e87):
2016-02-19 11:26:02,181:Worker 0: Finished, took 307.9920198917389
2016-02-19 11:26:02,206:Worker 3: Finished, took 308.0243444442749
2016-02-19 11:26:02,271:Worker 1: Finished, took 308.06945419311523
2016-02-19 11:26:02,283:Worker 2: Finished, took 308.09377932548523
2016-02-19 11:26:02,348:Completed parallel processing, took 308.1741032600403
2016-02-19 11:29:40,563:Worker 0: Finished, took 218.2124490737915
---
2016-02-19 11:31:19,632:Worker 1: Finished, took 99.03914141654968
2016-02-19 11:31:20,215:Worker 2: Finished, took 99.62268471717834
2016-02-19 11:31:20,495:Worker 3: Finished, took 99.9016981124878
2016-02-19 11:31:20,552:Completed sequential processing, took 318.2032618522644
In short, the directory locking definitely slows things down (from ~220 to ~300 seconds), but the locking mechanism itself seems to make little difference. I'll try running it on the cluster to see if it gives the same results.
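For reference, the harness is roughly of this shape. This is a sketch: the toy graph below stands in for the actual translation model, and the worker count is hardcoded to 4:

```python
import time
from multiprocessing import Process

def worker(idx):
    # Import inside the process so each worker goes through compilation itself.
    import theano
    import theano.tensor as T
    start = time.time()
    x = T.matrix("x")
    # Toy stand-in for the real model: enough to trigger some C compilation.
    y = T.nnet.sigmoid(T.dot(x, x.T)).sum()
    f = theano.function([x], theano.grad(y, x))
    print("Worker %d: Finished, took %s" % (idx, time.time() - start))

if __name__ == "__main__":
    start = time.time()
    procs = [Process(target=worker, args=(i,)) for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print("Completed parallel processing, took %s" % (time.time() - start))
```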
Thanks for the timing. So we now know that the lock itself, in the best case (locally), doesn't cause slowdowns.
Making the same timing on the cluster is a good idea.
I just checked the code and we take and release the lock for each thunk we compile. In the past I started the implementation, merged in master, that takes it the first time we need it and releases it only after all the thunks have been compiled. I'll see if I can quickly hack a working version to see if this helps.
I think if we can understand what causes the 220s to become 300s and fix this, we would "fix" the problem that it is slower on the cluster.
So here are the cluster timings.
With Theano's locking:
2016-02-19 15:30:30,991:Worker 2: Finished, took 1314.3841021060944
2016-02-19 15:30:40,497:Worker 3: Finished, took 1323.8897619247437
2016-02-19 15:30:44,872:Worker 0: Finished, took 1328.2678787708282
2016-02-19 15:30:57,807:Worker 1: Finished, took 1341.2011613845825
2016-02-19 15:30:57,950:Completed parallel processing, took 1341.3760290145874
2016-02-19 15:52:13,613:Worker 0: Finished, took 1275.6549570560455
---
2016-02-19 15:55:23,860:Worker 2: Finished, took 190.06892371177673
2016-02-19 15:55:23,860:Worker 1: Finished, took 190.07030272483826
2016-02-19 15:55:25,155:Worker 3: Finished, took 191.36408829689026
2016-02-19 15:55:25,287:Completed sequential processing, took 1467.3371329307556
With fcntl:
2016-02-19 16:28:01,387:Worker 1: Finished, took 1549.5820903778076
2016-02-19 16:28:01,668:Worker 2: Finished, took 1549.861941576004
2016-02-19 16:28:01,901:Worker 0: Finished, took 1550.0971040725708
2016-02-19 16:28:02,139:Worker 3: Finished, took 1550.3321452140808
2016-02-19 16:28:02,264:Completed parallel processing, took 1550.492042541504
2016-02-19 16:51:12,075:Worker 0: Finished, took 1389.8036732673645
---
2016-02-19 16:53:59,291:Worker 3: Finished, took 167.07216930389404
2016-02-19 16:54:04,877:Worker 2: Finished, took 172.65851163864136
2016-02-19 16:54:05,945:Worker 1: Finished, took 173.72711896896362
2016-02-19 16:54:06,066:Completed sequential processing, took 1563.802050113678
Lastly, to see the impact of the networked file system, this is using $RAMDISK for the cache:
2016-02-19 16:29:58,312:Worker 3: Finished, took 1412.8973224163055
2016-02-19 16:29:58,337:Worker 1: Finished, took 1412.9239094257355
2016-02-19 16:29:58,366:Worker 0: Finished, took 1412.9522206783295
2016-02-19 16:29:58,398:Worker 2: Finished, took 1412.9835093021393
2016-02-19 16:29:58,484:Completed parallel processing, took 1413.1036114692688
2016-02-19 16:51:58,272:Worker 0: Finished, took 1319.7781176567078
---
2016-02-19 16:54:17,378:Worker 2: Finished, took 139.0113170146942
2016-02-19 16:54:22,000:Worker 1: Finished, took 143.63392329216003
2016-02-19 16:54:26,120:Worker 3: Finished, took 147.7526512145996
2016-02-19 16:54:26,201:Completed sequential processing, took 1467.7146391868591
It doesn't tell me much. There is still a slowdown when compiling in parallel (anywhere between 40-150 seconds), but neither the file system nor the choice of locking mechanism seems to make much of a difference.
The numbers are strange... The RAMDISK numbers are slower than the first numbers with the current Theano lock.
On which cluster did you do those timings? I think you should reserve a full node to prevent interference from other jobs on the same node that would change the execution time.
Also, which software stack do you use on that cluster? If it is Helios, then it doesn't support FS locks and the FS has problems in all cases. The v4 stack on Helios does some workarounds to help with this.
Bart, I made this branch in Theano that makes the process keep the lock after it has compiled one C code cache entry: https://github.com/nouiz/Theano/tree/lock. Can you use it and time it on your local computer? It is to see if we can bring the 300s back down to 220s.
I just pushed an update. Now it will refresh the cache content only when it takes the lock the first time.
@bartvm, did you have the time to redo the timing on your local computer?
Thanks
I meant to ask, which update are you referring to? I couldn't find any commits in Theano that seemed to affect the compilation lock, so I wasn't sure what you meant.
I didn't do a PR; it is in a branch in my fork on GitHub:
https://github.com/nouiz/Theano/tree/lock