
spooky failure to start with "neither features.syslogAsset or multicollins.thisInstance were specified"

Status: Open. cburroughs opened this issue 8 years ago • 16 comments

This is on 1.3.0. I had a pair of collins instances per DC running. I don't think they had been restarted in O(months). Today I tried to migrate those instances to new servers (copied the conf, dumped the db, etc.) and collins failed to start with [1]. The spooky, outage-inducing part came when I tried to restart the old servers: they failed with the same error despite no changes to the config. I checked many times over that multicollins was indeed in the config, and the fact that I got it working later leads me to believe that, as far as multicollins itself is concerned, everything is/was fine.

The eventual workaround was to set:

features {
   syslogAsset = "NameOfDC"
}

in production.conf.
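A minimal sketch of the fallback order this workaround relies on, assuming (from the stack trace and the Feature.scala link later in this thread) that features.syslogAsset is read first, multicollins.thisInstance is consulted only when multicollins.enabled is true, and a configuration error is raised if neither yields a tag. The object name, the flat Map-of-dotted-keys config, and the enabled check are hypothetical stand-ins, not the real Feature.scala:

```scala
// Hypothetical model of the syslog-asset fallback order; NOT the real
// collins Feature.scala. Config is a flat Map of dotted keys.
object SyslogAssetResolution {
  final case class ConfigError(msg: String) extends Exception(msg)

  def resolveSyslogTag(conf: Map[String, String]): String = {
    // 1. An explicit features.syslogAsset always wins.
    val explicit = conf.get("features.syslogAsset")
    // 2. Assumption: multicollins.thisInstance is only consulted when
    //    multicollins.enabled is "true".
    val fromMulticollins =
      if (conf.get("multicollins.enabled").contains("true"))
        conf.get("multicollins.thisInstance")
      else None
    // 3. Neither source produced a tag: fail validation with the same
    //    message collins prints.
    explicit.orElse(fromMulticollins).getOrElse {
      throw ConfigError(
        "neither features.syslogAsset or multicollins.thisInstance were specified")
    }
  }
}
```

Under this model an explicit features.syslogAsset short-circuits the multicollins lookup entirely, which would explain why setting it avoids the failure even if something goes wrong on the multicollins path.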

Related discussion in #117. Again, as far as I can tell, neither the code nor the config changed since the last restart.

I looked into the current state of master: https://github.com/tumblr/collins/blob/master/app/collins/util/Tattler.scala#L83

lazy val syslogAsset = Asset.findByTag(Feature.syslogAsset.getOrElse("tumblrtag1")).getOrElse {
    throw new PlayException("", "neither features.syslogAsset or multicollins.thisInstance were specified")
  }

and I'm afraid I don't understand what is going on there either. The only reference to multicollins seems to be in the exception message.
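Reading the quoted snippet in isolation: Feature.syslogAsset falls back to the literal tag "tumblrtag1", so at this point a tag is always available, and the exception is thrown only when Asset.findByTag finds no asset with that tag. The sketch below models that behaviour with an in-memory Map standing in for the asset table; Asset, TattlerLookupModel, and the Map store are hypothetical simplifications:

```scala
// Hypothetical stand-in for the Tattler.scala snippet quoted above.
// Assets are modelled as a Map from tag to Asset.
object TattlerLookupModel {
  final case class Asset(tag: String, assetType: String)

  def syslogAsset(configuredTag: Option[String], assetsByTag: Map[String, Asset]): Asset = {
    // The configured tag defaults to "tumblrtag1", so a tag is ALWAYS
    // available here; "not configured" cannot actually happen.
    val tag = configuredTag.getOrElse("tumblrtag1")
    // The exception fires only when no asset carries that tag, yet its
    // message still blames missing configuration.
    assetsByTag.getOrElse(tag, throw new IllegalStateException(
      "neither features.syslogAsset or multicollins.thisInstance were specified"))
  }
}
```

In this model the message blames missing configuration even when the configuration is present and only the asset lookup failed.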

[1]

PlayException: Confguration error [neither features.syslogAsset or multicollins.thisInstance were specified]
        at util.config.ConfigAccessor$class.globalError(ConfigAccessor.scala:20)
        at util.config.Feature$.globalError(Feature.scala:10)
        at util.config.Feature$$anonfun$syslogAsset$2.apply(Feature.scala:35)
        at util.config.Feature$$anonfun$syslogAsset$2.apply(Feature.scala:35)
        at scala.Option.getOrElse(Option.scala:108)
        at util.config.Feature$.syslogAsset(Feature.scala:34)
        at util.config.Feature$.validateConfig(Feature.scala:61)
        at util.config.Configurable$class.mergeReferenceAndSave(Configurable.scala:106)
        at util.config.Feature$.mergeReferenceAndSave(Feature.scala:10)
        at util.config.Configurable$class.initialize(Configurable.scala:43)
        at util.config.Feature$.initialize(Feature.scala:10)
        at util.config.Registry$$anonfun$validate$1.apply(Registry.scala:52)
        at util.config.Registry$$anonfun$validate$1.apply(Registry.scala:52)
        at scala.collection.Iterator$class.foreach(Iterator.scala:660)
        at scala.collection.JavaConversions$JIteratorWrapper.foreach(JavaConversions.scala:573)
        at scala.collection.IterableLike$class.foreach(IterableLike.scala:73)
        at scala.collection.JavaConversions$JCollectionWrapper.foreach(JavaConversions.scala:592)
        at util.config.Registry$.validate(Registry.scala:52)
        at collins.config.ConfigPlugin.onStart(ConfigPlugin.scala:14)
        at play.api.Play$$anonfun$start$1$$anonfun$apply$mcV$sp$1.apply(Play.scala:84)
        at play.api.Play$$anonfun$start$1$$anonfun$apply$mcV$sp$1.apply(Play.scala:84)
        at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
        at scala.collection.immutable.List.foreach(List.scala:45)
        at play.api.Play$$anonfun$start$1.apply$mcV$sp(Play.scala:84)
        at play.api.Play$$anonfun$start$1.apply(Play.scala:84)
        at play.api.Play$$anonfun$start$1.apply(Play.scala:84)
        at play.utils.Threads$.withContextClassLoader(Threads.scala:17)
        at play.api.Play$.start(Play.scala:83)
        at play.core.StaticApplication.<init>(ApplicationProvider.scala:51)
        at play.core.server.NettyServer$.createServer(NettyServer.scala:136)
        at play.core.server.NettyServer$$anonfun$main$5.apply(NettyServer.scala:165)
        at play.core.server.NettyServer$$anonfun$main$5.apply(NettyServer.scala:164)
        at scala.Option.map(Option.scala:133)
        at play.core.server.NettyServer$.main(NettyServer.scala:164)
        at play.core.server.NettyServer.main(NettyServer.scala)

cburroughs, Jan 20 '16

@cburroughs I have definitely seen this behavior before, but I can't remember exactly what triggered it. Generally I turn multicollins on and use it to identify each instance instead of a syslog asset, which seems to avoid the issue.

multicollins {
  enabled=true
  thisInstance = "IATA01"
}

byxorna, Jan 21 '16

My multicollins config is

multicollins {
  enabled = true
  instanceAssetType = DATA_CENTER
  locationAttribute = LOCATION
  thisInstance = IAD
}

which is what worked until yesterday.

cburroughs, Jan 21 '16

There were definitely initialization-order issues in the 1.3.0 release. I can't say for certain that they have all been fixed in master, but I've addressed a few of them since then. Unfortunately, this isn't easy to reproduce, and therefore not easy to fix.

yl3w, Jan 22 '16

@cburroughs - is the IAD DATA_CENTER asset still in your collins instance? I think I've seen this error before when the asset did not exist. I would imagine it does exist, since you said you were running with that config until recently; I just want to double-check.
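One way to tell these two situations apart would be a check that reports which precondition failed, rather than reusing a single message. A hypothetical sketch, with config again modelled as a Map of dotted keys and the set of existing asset tags passed in directly (it ignores multicollins.enabled for brevity):

```scala
// Hypothetical startup diagnostic: distinguish "no tag configured" from
// "tag configured but no asset with that tag exists".
object SyslogAssetDiagnostic {
  sealed trait Problem
  case object NotConfigured extends Problem
  final case class AssetMissing(tag: String) extends Problem

  def check(conf: Map[String, String], knownTags: Set[String]): Either[Problem, String] =
    conf.get("features.syslogAsset")
      .orElse(conf.get("multicollins.thisInstance")) // fallback source
      .toRight(NotConfigured)                        // nothing configured at all
      .flatMap { tag =>
        // Configured, but the asset lookup would fail.
        if (knownTags.contains(tag)) Right(tag) else Left(AssetMissing(tag))
      }
}
```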

william-richard, Jan 25 '16

Also, in case it helps you debug, this is where syslogAsset gets populated with the value specified in the multicollins config: https://github.com/tumblr/collins/blob/v1.3.0/app/util/config/Feature.scala#L32-L36

william-richard, Jan 25 '16

Yep, there is still an asset of type 'Data Center' with the tag IAD.

cburroughs, Jan 25 '16

@cburroughs were you able to figure this out? Maybe toggling multicollins off and on again would help fix the problem?

william-richard, Feb 17 '16

No, I have not figured this out, and I have not debugged further since the features.syslogAsset workaround put out the fire.

cburroughs, Feb 17 '16

So I got this error:

Play server process ID is 7203
[info] play - database [collins] connected at jdbc:h2:mem:play
[error] application - Failed to create assetlog
play.api.PlayException: [neither features.syslogAsset or multicollins.thisInstance were specified]

I got it after building collins with activator dist, extracting the zip file into a new directory, and running:

java -server -Dhttp.port=9023 -Dconfig.file=$(pwd)/conf/application.conf -DapplyEvolutions.collins=true -cp `pwd`/lib/\* play.core.server.NettyServer /tmp

Turns out, this was caused by a missing test/resources/profiles.yaml. Maybe the issue at hand is similar? Either way, this behavior is problematic: runtime dependencies like this should be included in the dist zip file rather than living under test/, and ultimately the error is super misleading. I'm happy to file some issues but I'm not sure where to start.
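A minimal sketch of the kind of fail-fast check that would make a missing runtime file obvious: resolve the path against a known base directory and report the absolute path that was checked. RequiredFiles and requireFile are hypothetical names; only java.nio.file calls from the JDK are used:

```scala
import java.nio.file.{Files, Path}

// Hypothetical fail-fast check for files the app needs at runtime.
// The error names the exact absolute path that was checked.
object RequiredFiles {
  def requireFile(baseDir: Path, relative: String): Either[String, Path] = {
    val p = baseDir.resolve(relative).toAbsolutePath.normalize
    if (Files.isRegularFile(p) && Files.isReadable(p)) Right(p)
    else Left(s"required runtime file not found or unreadable: $p")
  }
}
```

For example, requireFile(java.nio.file.Paths.get("."), "test/resources/profiles.yaml") would return either the resolved path or an error naming exactly where it looked, instead of a later unrelated failure.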

discordianfish, Aug 10 '16

Oh WTF. Now it's not working again on another system... So, well, dunno. Will update if I find out more.

discordianfish, Aug 10 '16

Okay, this one was related to a different JVM version. But if that also triggers this error, there is something seriously wrong with the error handling.

discordianfish, Aug 10 '16

@discordianfish I think we have seen this before: a syntax error in production.conf leads to very confusing error reporting, due to a spaghetti mess of try/catches around config parsing.

I would check your production.conf very carefully and see if you can find the syntax error.
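The failure mode described here, where a catch-all replaces the real exception with a fixed message, can be avoided by keeping the original exception as the cause. A hypothetical sketch (StartupErrors and runSteps are illustrative names, not collins code) that runs named startup steps in order, stops at the first failure, and reports the step name together with the original exception:

```scala
import scala.util.{Failure, Try}

// Hypothetical startup runner: each named step runs inside Try; the first
// failure is wrapped with the step name and the ORIGINAL exception as cause.
object StartupErrors {
  final class StartupException(step: String, cause: Throwable)
      extends RuntimeException(s"startup step '$step' failed: ${cause.getMessage}", cause)

  def runSteps(steps: Seq[(String, () => Unit)]): Try[Unit] =
    steps.foldLeft(Try(())) { case (acc, (name, step)) =>
      // Once a step has failed, flatMap skips all remaining steps.
      acc.flatMap { _ =>
        Try(step()).recoverWith { case e => Failure(new StartupException(name, e)) }
      }
    }
}
```

With this approach a missing profiles.yaml or a bad JDBC URL would surface as its own exception, chained under the step that failed, rather than as an unrelated syslogAsset message.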

byxorna, Aug 10 '16

@byxorna I used the config from the repo without any changes. My core issue was that the file was missing from my working directory.

discordianfish, Aug 10 '16

Oh, what a PITA. Is it possible that right now it always returns this error no matter what goes wrong during initialization? I got it because of that missing file, then because of something triggered by using Java 1.8 instead of 1.7, and now because of something related to the database after changing h2:mem:play to h2:/var/lib/collins/database.h2.

Is there a more stable version? I can't get 1.3.0 running either, and that's already 2 years old.

discordianfish, Aug 10 '16

...and the same error for a Solr misconfiguration, as well as for a permissions problem.

discordianfish, Aug 10 '16

Okay, I finally got this up and running, but I'm wondering: are there plans to fix the error handling? I would look into it, but Scala is far outside my comfort zone. I'm a bit worried about introducing it here with this issue pending, and I don't know how actively collins is still being developed.

discordianfish, Aug 10 '16