
chore: Update isolation-forest to 3.0.1

Open nightscape opened this pull request 2 years ago • 9 comments

In preparation for the Scala 2.13 update.

Related Issues/PRs

The Scala 2.13 build requires the new version of isolation-forest: https://github.com/microsoft/SynapseML/pull/1772. I've split the dependency update into its own PR so that CI can verify that the update alone doesn't break anything. The new version contains an API change; it doesn't break anything in my local build, but CI covers more.

What changes are proposed in this pull request?

Update the isolation-forest dependency to the most recent version, which provides a Scala 2.13 build.

How is this patch tested?

  • [x] I have run sbt test:compile locally to verify that the API change doesn't break Scala code.

Does this PR change any dependencies?

  • [x] Yes. Make sure the dependencies are resolved correctly, and list changes here: "com.linkedin.isolation-forest" %% "isolation-forest_3.2.0" % "2.0.8" => "com.linkedin.isolation-forest" %% "isolation-forest_3.2.0" % "3.0.1" (see the build sketch below).
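
For reference, a sketch of what the bump looks like in an sbt build definition; the surrounding `libraryDependencies` setting is illustrative, and SynapseML's actual build may declare the dependency differently:

```scala
// In sbt, %% appends the Scala binary version suffix (e.g. _2.12), while
// the "_3.2.0" part of the artifact name encodes the targeted Spark version.
libraryDependencies ++= Seq(
  // before: "com.linkedin.isolation-forest" %% "isolation-forest_3.2.0" % "2.0.8"
  "com.linkedin.isolation-forest" %% "isolation-forest_3.2.0" % "3.0.1"
)
```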

Does this PR add a new feature? If so, have you added samples on website?

  • [x] No. You can skip this section.

nightscape avatar Dec 21 '22 22:12 nightscape

Hey @nightscape :wave:! Thank you so much for contributing to our repository :raised_hands:. Someone from SynapseML Team will be reviewing this pull request soon.

We use semantic commit messages to streamline the release process. Before your pull request can be merged, you should make sure your first commit and PR title start with a semantic prefix. This helps us to create release messages and credit you for your hard work!

Examples of commit messages with semantic prefixes:

  • fix: Fix LightGBM crashes with empty partitions
  • feat: Make HTTP on Spark back-offs configurable
  • docs: Update Spark Serving usage
  • build: Add codecov support
  • perf: improve LightGBM memory usage
  • refactor: make python code generation rely on classes
  • style: Remove nulls from CNTKModel
  • test: Add test coverage for CNTKModel

To test your commit locally, please follow our guide on building from source. Check out the developer guide for additional guidance on testing your change.

github-actions[bot] avatar Dec 21 '22 22:12 github-actions[bot]

/azp run

svotaw avatar Dec 22 '22 07:12 svotaw

Azure Pipelines successfully started running 1 pipeline(s).

azure-pipelines[bot] avatar Dec 22 '22 07:12 azure-pipelines[bot]

Codecov Report

Merging #1776 (2fb1a46) into master (8dc4a58) will decrease coverage by 0.04%. The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #1776      +/-   ##
==========================================
- Coverage   86.02%   85.98%   -0.04%     
==========================================
  Files         278      278              
  Lines       14722    14722              
  Branches      767      767              
==========================================
- Hits        12664    12659       -5     
- Misses       2058     2063       +5     
Impacted Files                                        | Coverage Δ
...crosoft/azure/synapse/ml/io/http/HTTPClients.scala | 67.64% <0.00%> (-7.36%) :arrow_down:


codecov-commenter avatar Dec 22 '22 07:12 codecov-commenter

@nightscape the multivariate anomaly detection notebook (which uses Isolation Forest) failed on Databricks with the following error:

Py4JJavaError: An error occurred while calling o1228.load.
: org.json4s.MappingException: Did not find value which can be converted into int
	at org.json4s.reflect.package$.fail(package.scala:53)
	at org.json4s.Extraction$.$anonfun$convert$2(Extraction.scala:881)
	at scala.Option.getOrElse(Option.scala:189)
	at org.json4s.Extraction$.convert(Extraction.scala:881)
	at org.json4s.Extraction$.$anonfun$extract$10(Extraction.scala:456)
	at org.json4s.Extraction$.$anonfun$customOrElse$1(Extraction.scala:780)
	at scala.PartialFunction.applyOrElse(PartialFunction.scala:127)
	at scala.PartialFunction.applyOrElse$(PartialFunction.scala:126)
	at scala.PartialFunction$$anon$1.applyOrElse(PartialFunction.scala:257)
	at org.json4s.Extraction$.customOrElse(Extraction.scala:780)
	at org.json4s.Extraction$.extract(Extraction.scala:454)
	at org.json4s.Extraction$.extract(Extraction.scala:56)
	at org.json4s.ExtractableJsonAstNode.extract(ExtractableJsonAstNode.scala:22)
	at com.linkedin.relevance.isolationforest.IsolationForestModelReadWrite$IsolationForestModelReader.load(IsolationForestModelReadWrite.scala:52)
	at com.linkedin.relevance.isolationforest.IsolationForestModelReadWrite$IsolationForestModelReader.load(IsolationForestModelReadWrite.scala:38)
	at org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$load$5(Pipeline.scala:277)
	at org.apache.spark.ml.MLEvents.withLoadInstanceEvent(events.scala:161)
	at org.apache.spark.ml.MLEvents.withLoadInstanceEvent$(events.scala:156)
	at org.apache.spark.ml.util.Instrumentation.withLoadInstanceEvent(Instrumentation.scala:43)
	at org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$load$4(Pipeline.scala:277)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
	at scala.collection.TraversableLike.map(TraversableLike.scala:286)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
	at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
	at org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$load$3(Pipeline.scala:274)
	at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:284)
	at scala.util.Try$.apply(Try.scala:213)
	at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:284)
	at org.apache.spark.ml.Pipeline$SharedReadWrite$.load(Pipeline.scala:268)
	at org.apache.spark.ml.Pipeline$PipelineReader.$anonfun$load$2(Pipeline.scala:215)
	at org.apache.spark.ml.MLEvents.withLoadInstanceEvent(events.scala:161)
	at org.apache.spark.ml.MLEvents.withLoadInstanceEvent$(events.scala:156)
	at org.apache.spark.ml.util.Instrumentation.withLoadInstanceEvent(Instrumentation.scala:43)
	at org.apache.spark.ml.Pipeline$PipelineReader.$anonfun$load$1(Pipeline.scala:214)
	at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:284)
	at scala.util.Try$.apply(Try.scala:213)
	at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:284)
	at org.apache.spark.ml.Pipeline$PipelineReader.load(Pipeline.scala:214)
	at org.apache.spark.ml.Pipeline$PipelineReader.load(Pipeline.scala:209)
	at org.apache.spark.ml.util.MLReadable.load(ReadWrite.scala:355)
	at org.apache.spark.ml.util.MLReadable.load$(ReadWrite.scala:355)
	at org.apache.spark.ml.Pipeline$.load(Pipeline.scala:197)
	at org.apache.spark.ml.PipelineSerializer.read(Serializer.scala:129)
	at org.apache.spark.ml.PipelineSerializer.read(Serializer.scala:122)
	at com.microsoft.azure.synapse.ml.core.serialize.ComplexParam.load(ComplexParam.scala:24)
	at org.apache.spark.ml.ComplexParamsReader$.$anonfun$getAndSetComplexParams$2(ComplexParamsSerializer.scala:176)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
	at org.apache.spark.ml.ComplexParamsReader$.getAndSetComplexParams(ComplexParamsSerializer.scala:172)
	at org.apache.spark.ml.ComplexParamsReader.load(ComplexParamsSerializer.scala:155)
	at org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$load$5(Pipeline.scala:277)
	at org.apache.spark.ml.MLEvents.withLoadInstanceEvent(events.scala:161)
	at org.apache.spark.ml.MLEvents.withLoadInstanceEvent$(events.scala:156)
	at org.apache.spark.ml.util.Instrumentation.withLoadInstanceEvent(Instrumentation.scala:43)
	at org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$load$4(Pipeline.scala:277)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
	at scala.collection.TraversableLike.map(TraversableLike.scala:286)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
	at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
	at org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$load$3(Pipeline.scala:274)
	at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:284)
	at scala.util.Try$.apply(Try.scala:213)
	at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:284)
	at org.apache.spark.ml.Pipeline$SharedReadWrite$.load(Pipeline.scala:268)
	at org.apache.spark.ml.PipelineModel$PipelineModelReader.$anonfun$load$7(Pipeline.scala:356)
	at org.apache.spark.ml.MLEvents.withLoadInstanceEvent(events.scala:161)
	at org.apache.spark.ml.MLEvents.withLoadInstanceEvent$(events.scala:156)
	at org.apache.spark.ml.util.Instrumentation.withLoadInstanceEvent(Instrumentation.scala:43)
	at org.apache.spark.ml.PipelineModel$PipelineModelReader.$anonfun$load$6(Pipeline.scala:355)
	at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:284)
	at scala.util.Try$.apply(Try.scala:213)
	at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:284)
	at org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:355)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
	at py4j.Gateway.invoke(Gateway.java:295)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:251)
	at java.lang.Thread.run(Thread.java:748)

This is likely due to a json4s version mismatch between the new isolation-forest and Databricks.
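
For anyone unfamiliar with the failure mode: json4s raises exactly this MappingException when case-class extraction hits a missing primitive field. A minimal, self-contained sketch of the mechanism; the case class and field names below are hypothetical stand-ins, not isolation-forest's actual persistence schema:

```scala
import org.json4s.{DefaultFormats, Formats}
import org.json4s.jackson.JsonMethods.parse

object MappingExceptionRepro {
  // Hypothetical stand-in: suppose isolation-forest 3.x reads a metadata
  // field that models saved by 2.x never wrote.
  case class Metadata(contamination: Double, numSamples: Int)

  implicit val formats: Formats = DefaultFormats

  def main(args: Array[String]): Unit = {
    // JSON as an older version would have written it: numSamples is absent.
    val oldJson = parse("""{"contamination": 0.02}""")

    // Throws org.json4s.MappingException:
    //   "Did not find value which can be converted into int"
    oldJson.extract[Metadata]
  }
}
```

If that is what is happening here, the failure could come either from a json4s binary mismatch on the cluster or from old persisted JSON missing a newly required field; the two look identical in the stack trace.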

mhamilton723 avatar Dec 22 '22 13:12 mhamilton723

@nightscape do you have a databricks account you can use to repro or need help here?

mhamilton723 avatar Dec 22 '22 13:12 mhamilton723

It seems that my personal account is not able to view the logs. I'm getting the following error when I try to open one of the links provided in the CI error list: AADSTS50020: User account '[email protected]' from identity provider 'live.com' does not exist in tenant 'Microsoft' and cannot access the application '2ff814a6-3304-4ab8-85cb-cd0e6f879c1d'(AzureDatabricks) in that tenant. The account needs to be added as an external user in the tenant first. Sign out and sign in again with a different Azure Active Directory user account.

nightscape avatar Dec 22 '22 13:12 nightscape

The error seems to be that the loaded JSON does not contain the field newly introduced in isolation-forest. I haven't checked the notebook yet; does it load JSON created by the old version of isolation-forest? One way to check is sketched below.
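
A hedged sketch of that check, assuming the notebook loads a standard Spark ML pipeline saved at some path (the path here is hypothetical): Spark ML writers persist each stage's params as JSON text under stages/*/metadata/, so printing that JSON would show whether the new field is present.

```scala
import org.apache.spark.sql.SparkSession

object InspectSavedPipeline {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("inspect-model").getOrCreate()

    // Hypothetical location of the pipeline the notebook loads.
    val modelPath = "wasbs://models@account.blob.core.windows.net/if-pipeline"

    // Spark ML stores one JSON line per stage under metadata/.
    // If the printed JSON lacks the field introduced in isolation-forest 3.x,
    // the notebook is loading a model serialized by the old version.
    spark.read
      .text(s"$modelPath/stages/*/metadata/part-*")
      .collect()
      .foreach(row => println(row.getString(0)))
  }
}
```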

nightscape avatar Dec 22 '22 13:12 nightscape

@nightscape not sure where you are trying to log into. The build system is scoped to MSFT users, so that might be what you are encountering. However, if you have an ADB account, it's easy to load the notebook and run it with the updated version. If you don't, we can figure out how to get you a repro. I'm not sure it's loading an old model, though.

mhamilton723 avatar Dec 22 '22 20:12 mhamilton723