gaia icon indicating copy to clipboard operation
gaia copied to clipboard

Do a trace and pprof run on Gaia

Open Skarlso opened this issue 5 years ago • 19 comments

Skarlso avatar Aug 25 '19 20:08 Skarlso

Actually, I know there are more immediate issues out there which need attention, but man I ran into a race condition with the tests again today. So I'm a bit afraid that we should optimise the routines a bit.

Skarlso avatar Aug 25 '19 20:08 Skarlso

Yeah, sounds like a good idea. Especially the scheduler is critical and could potentially have some leaking routines.

michelvocks avatar Sep 03 '19 07:09 michelvocks

Aye. I agree. I'll get into it, once the worker rebuild is finished. I'm really eager to trace things. :)

Skarlso avatar Sep 03 '19 07:09 Skarlso

Alright, found my first data race already. Going to work to fix this one first as it's the easiest to find and reproduce.

==================
WARNING: DATA RACE
Read at 0x00c0003a8768 by goroutine 97:
  bufio.(*Writer).Flush()
      /usr/local/Cellar/go/1.13/libexec/src/bufio/bufio.go:590 +0x79
  github.com/gaia-pipeline/gaia/plugin.(*GoPlugin).FlushLogs()
      /Users/hannibal/goprojects/gaia/gaia/plugin/plugin.go:322 +0x50
  github.com/gaia-pipeline/gaia/workers/scheduler.(*Scheduler).executeScheduler.func1()
      /Users/hannibal/goprojects/gaia/gaia/workers/scheduler/scheduler.go:556 +0x178

Previous write at 0x00c0003a8768 by goroutine 26:
  bufio.(*Writer).Write()
      /usr/local/Cellar/go/1.13/libexec/src/bufio/bufio.go:638 +0x415
  github.com/hashicorp/go-plugin.(*Client).logStderr()
      /Users/hannibal/golang/pkg/mod/github.com/hashicorp/[email protected]/client.go:759 +0x321

Goroutine 97 (running) created at:
  github.com/gaia-pipeline/gaia/workers/scheduler.(*Scheduler).executeScheduler()
      /Users/hannibal/goprojects/gaia/gaia/workers/scheduler/scheduler.go:551 +0x393
  github.com/gaia-pipeline/gaia/workers/scheduler.(*Scheduler).executeScheduledJobs()
      /Users/hannibal/goprojects/gaia/gaia/workers/scheduler/scheduler.go:504 +0xd7
  github.com/gaia-pipeline/gaia/workers/scheduler.(*Scheduler).prepareAndExec()
      /Users/hannibal/goprojects/gaia/gaia/workers/scheduler/scheduler.go:222 +0x11c8
  github.com/gaia-pipeline/gaia/workers/scheduler.(*Scheduler).work()
      /Users/hannibal/goprojects/gaia/gaia/workers/scheduler/scheduler.go:147 +0x1b0

Goroutine 26 (running) created at:
  github.com/hashicorp/go-plugin.(*Client).Start()
      /Users/hannibal/golang/pkg/mod/github.com/hashicorp/[email protected]/client.go:554 +0x1723
  github.com/hashicorp/go-plugin.(*Client).Client()
      /Users/hannibal/golang/pkg/mod/github.com/hashicorp/[email protected]/client.go:296 +0x63
  github.com/gaia-pipeline/gaia/plugin.(*GoPlugin).Init()
      /Users/hannibal/goprojects/gaia/gaia/plugin/plugin.go:169 +0xbac
  github.com/gaia-pipeline/gaia/workers/scheduler.(*Scheduler).prepareAndExec()
      /Users/hannibal/goprojects/gaia/gaia/workers/scheduler/scheduler.go:206 +0xbf2
  github.com/gaia-pipeline/gaia/workers/scheduler.(*Scheduler).work()
      /Users/hannibal/goprojects/gaia/gaia/workers/scheduler/scheduler.go:147 +0x1b0
==================

Skarlso avatar Sep 07 '19 13:09 Skarlso

@michelvocks At the top of your head, do you know the reason for implementing Plugin with pointers? I saw no reason in particular. Plugins aren't shared as far as I can see.

The Flush operation is causing a race condition through the scheduler. Apparently more than one routine at the same time can't flush. This is the most obvious race in Gaia.

Using values would fix it since plugins wouldn't be shared through out gaia. I don't know yet what that would mean in terms of memory usage though. I haven't tested it yet. I plan on doing so. :)

Skarlso avatar Sep 08 '19 08:09 Skarlso

@Skarlso Yes. One plugin instance is always one plugin run. We need to have an instance since we need to run multiple jobs synchronously via this instance. If we don't have a plugin instance, we would have to start the plugin for every pipeline job which is very insufficient.

Additionally, a user potentially wants to share data/information between jobs within a pipeline run. If we create a new plugin instance for every job, you couldn't share in-memory data between jobs and you are literally forced to write information to disk or to a remote cache/database which is more work.

michelvocks avatar Sep 09 '19 09:09 michelvocks

That could be achieved by giving back a modified version of the struct, instead of passing it around as pointers, not?

Skarlso avatar Sep 09 '19 10:09 Skarlso

I don't know how but maybe we are talking past each other. Could you elaborate a bit more on the problem/solution?

In my opinion, we could easily add a mutex to the Plugin struct and add a lock to the FlushLogs method?

michelvocks avatar Sep 09 '19 14:09 michelvocks

Adding a mutex is a patch. :) That's like using a buffered channel instead of a channel because that appears to be solving a problem.

I'm trying to eliminate rather the possibility of race, instead of putting a mutex on it.

Of course a mutex would be easy. But that's like a leaking boat, and putting a patch on the leak, instead of trying to fix the boat, I think.

On 2019. Sep 9., at 16:11, Michel Vocks [email protected] wrote:

I don't know how but maybe we are talking past each other. Could you elaborate a bit more on the problem/solution?

In my opinion, we could easily add a mutex to the Plugin struct and add a lock to the FlushLogs method?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/gaia-pipeline/gaia/issues/205?email_source=notifications&email_token=AABMUQSE3WYESUENXIHMYETQIZKRZA5CNFSM4IPKCT42YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD6HW7ZY#issuecomment-529493991, or mute the thread https://github.com/notifications/unsubscribe-auth/AABMUQUTQ645QQJFPWXAPZTQIZKRZANCNFSM4IPKCT4Q.

Skarlso avatar Sep 09 '19 14:09 Skarlso

That said, it can occur that a mutex IS the solution. But I would like to make sure, I'm not just patching a sinking boat. :)

Skarlso avatar Sep 09 '19 15:09 Skarlso

I'm gonna show you in a draft pr what I meant and like'd your opinion on it. :)

Skarlso avatar Sep 09 '19 19:09 Skarlso

Ah, yes. It's way too much work and ends up being ugly. Mutex it is I guess. :)

Skarlso avatar Sep 09 '19 20:09 Skarlso

Hah no, I succeeded, but get this. Because of other things, like Job and such, I still got a data race at flush! But this is a good thing. Because now I verified that indeed a mutex is a good approach at this point. :) Awesome.

Skarlso avatar Sep 09 '19 21:09 Skarlso

Get this! :) Just so I could ease my mind, I got further and re-wrote a large chunk of passing around the plugin to values instead of pointers. And when there is a mutation, I pass the mutated value back instead of the pointer semantics. I got rid of the race. :) However, I don't like the resulting code that much. It's not intuitive enough and sometime it passes it back sometimes it doesn't. So I'm not going to go with that. Not to mention that I don't have data about the memory allocation increase because of the copy. Aw well. 🤷‍♂

Skarlso avatar Sep 10 '19 04:09 Skarlso

Huh, so guess what... The mutex didn't solve the data race.

Skarlso avatar Sep 10 '19 08:09 Skarlso

Oh I know why.

Skarlso avatar Sep 10 '19 08:09 Skarlso

Well this is problematic. Gaia is racing with go-plugin's logStderr. c.config.Stderr.Write([]byte(line)) since we have stderr replaced with our buffer. We can't just lock the flush. The flush is racing with go-plugin. Eh.

Skarlso avatar Sep 10 '19 08:09 Skarlso

So the flush happens so that the logs are displayed in the log window, if I see this correctly. But the plugin is also trying to write / read to the buffer we are providing it. This is problematic, because even if we lock on our end the plugin will not lock of course, because it's using a simple Write. Which means a data race on the buffer.

Skarlso avatar Sep 10 '19 08:09 Skarlso

Okay, so the solution must be that we implement our own writer using an underlying buffer and io writer. On which the Write is basically a bufio.Write but with a lock in it. And we then pass that to go-plugin.

Skarlso avatar Sep 10 '19 08:09 Skarlso