
#787 - Move encoder implementation details to external shim library (not dependent on the Spark 4 release)

Open chris-twiner opened this issue 2 years ago • 15 comments

per #787 and #300 - the key files and code that have changed since the 2.4->3 migration are abstracted out and moved into arity-based shims and helper functions (this includes Spark 4 snapshot changes; see below). If the approach and scope of shim usage within Frameless is ok, I'll push out an actual 0.0.1 cut (excluding Spark 4); in the meantime the snapshots are on central.

Frameless' use of the internal APIs for extending or mixing in is thus far (aside from the lit pushdown issue) isolated to the shim interfaces for encoding usage (and for creating AnalysisException). As such, a version compiled from the shimmed 0.16 against 3.5.0 allows encoding to be used on everything from 14.3 LTS (3.5 with the 4.0 StaticInvoke) down to 9.1 LTS (3.1.3) without issue.
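
To illustrate the shape of that isolation, here is a much-simplified sketch of the shim-interface pattern; all names below (`EncoderShim`, `ShimLoader`, the objects) are illustrative stand-ins, not the actual shim library's API:

```scala
// Hypothetical sketch: the main library compiles against a stable trait,
// while version-specific, hot-swappable jars provide the implementation.

// The stable interface frameless-style code would compile against.
trait EncoderShim {
  def sparkVersion: String
  // e.g. hides the constructor-argument differences across Spark versions
  def createAnalysisException(message: String): Exception
}

// One implementation per runtime, packaged in its own shim_runtime jar.
object Spark35Shim extends EncoderShim {
  val sparkVersion = "3.5"
  def createAnalysisException(message: String): Exception =
    new RuntimeException(s"[3.5] $message") // stand-in for AnalysisException
}

object Spark4Shim extends EncoderShim {
  val sparkVersion = "4.0"
  def createAnalysisException(message: String): Exception =
    new RuntimeException(s"[4.0] $message")
}

// At runtime, whichever implementation is on the classpath gets used;
// modelled here as a simple version switch.
object ShimLoader {
  def load(version: String): EncoderShim =
    if (version.startsWith("4")) Spark4Shim else Spark35Shim
}
```

Swapping the jar (rather than recompiling frameless) is then what makes a single 0.16 build usable across the runtimes above.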

Of note: I've run the Quality tests against a Spark 4 SNAPSHOT proper (with the Frameless 3.5 build), and in a local Spark 4 build of Frameless all tests pass, although the cats tests sometimes freeze when run directly (on "inner pairwise monoid"). (I've also run the Quality tests against Frameless built against 4.)

Lit and UDF on base Expression are (as of 29th Feb) stable API-wise against Spark 4, so there is no need for shims there.

NB: I've only pushed a Spark 4 shim_runtime against 2.13 into snapshots; I won't push any full versions until RCs drop.

If the approach / change scope is ok I'll push release versions of shim out and remove the resolvers:

// needed for shim_runtime snapshots
resolvers in Global += MavenRepository(
  "sonatype-s01-snapshots",
  Resolver.SonatypeS01RepositoryRoot + "/snapshots"
)
// needed for 4.0 snapshots
resolvers in Global += MavenRepository(
  "apache_snaps",
  "https://repository.apache.org/content/repositories/snapshots"
)

per #787 - let me know if I should shim something else (I don't see Dataset/SparkSession etc. as being useful, but I'd be happy to move the scala reflection stuff over to shim, not that it's currently needed)

NB: Spark 4 changes:

build.sbt needs the correct shim_runtime_4.0.0.oss_4.0 dependency and 2.13 as the main Scala version (as well as JDK 17/21). Comments have the versions.

Source (changes compatible with 0.16 builds):

  1. Swapped FramelessInternals to use a shim to create an AnalysisException. (different args on Spark 4)
  2. TypedColumn gets actual imports from org.apache.spark.sql.catalyst.expressions (new With Expression in Spark 4)
  3. Pushdown tests needs to use the previous currentTimestamp code (Spark 4 removed it, could shim this if preferred)
  4. SchemaTests.structToNonNullable resets the metadata; Spark 4 sets metadata, so the properties don't hold and you'll get:
Expected StructType(StructField(_1,LongType,false),StructField(_2,LongType,false)) but got StructType(StructField(_1,LongType,false),StructField(_2,LongType,false))
(the two render identically because the differing metadata is not printed)
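
The metadata reset in item 4 can be modelled without Spark classes; the `Field` case class below is a stand-in, not Spark's `StructField` (with real Spark it would be roughly `StructType(st.fields.map(_.copy(metadata = Metadata.empty)))`):

```scala
// Stand-in model of stripping metadata so schemas that differ only in
// metadata compare equal again (Spark 4 attaches metadata that breaks
// structural equality even when the printed schemas look identical).
case class Field(
    name: String,
    dataType: String,
    nullable: Boolean,
    metadata: Map[String, String] = Map.empty)

def stripMetadata(fields: Seq[Field]): Seq[Field] =
  fields.map(_.copy(metadata = Map.empty))
```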

In order to run the tests on JDK 17/21 you'll need to add the following VM args:

--add-opens=java.base/java.lang=ALL-UNNAMED
--add-opens=java.base/java.lang.invoke=ALL-UNNAMED
--add-opens=java.base/java.lang.reflect=ALL-UNNAMED
--add-opens=java.base/java.io=ALL-UNNAMED
--add-opens=java.base/java.net=ALL-UNNAMED
--add-opens=java.base/java.nio=ALL-UNNAMED
--add-opens=java.base/java.util=ALL-UNNAMED
--add-opens=java.base/java.util.concurrent=ALL-UNNAMED
--add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED
--add-opens=java.base/sun.nio.ch=ALL-UNNAMED
--add-opens=java.base/sun.nio.cs=ALL-UNNAMED
--add-opens=java.base/sun.security.action=ALL-UNNAMED
--add-opens=java.base/sun.util.calendar=ALL-UNNAMED

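
One way to wire these flags into forked test JVMs from sbt might be the following sketch (abbreviated; include every flag from the list above):

```scala
// Hedged sbt sketch: pass the --add-opens flags to forked test JVMs.
Test / fork := true
Test / javaOptions ++= Seq(
  "--add-opens=java.base/java.lang=ALL-UNNAMED",
  "--add-opens=java.base/java.lang.invoke=ALL-UNNAMED",
  // ... the remaining --add-opens flags from the list above ...
  "--add-opens=java.base/sun.util.calendar=ALL-UNNAMED"
)
```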

chris-twiner avatar Mar 01 '24 12:03 chris-twiner

Codecov Report

Attention: Patch coverage is 96.81529%, with 10 lines in your changes missing coverage. Please review.

Project coverage is 95.60%. Comparing base (0fb9c58) to head (986891a). Report is 1 commit behind head on master.

:exclamation: Current head 986891a differs from pull request most recent head 25cc5c3. Consider uploading reports for the commit 25cc5c3 to get more accurate results

Files Patch % Lines
...taset/src/main/scala/frameless/RecordEncoder.scala 92.85% 5 Missing :warning:
...taset/src/main/scala/frameless/functions/Udf.scala 93.10% 4 Missing :warning:
.../src/main/scala/frameless/FramelessInternals.scala 96.42% 1 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #800      +/-   ##
==========================================
+ Coverage   95.46%   95.60%   +0.13%     
==========================================
  Files          67       65       -2     
  Lines        1257     1341      +84     
  Branches       42       52      +10     
==========================================
+ Hits         1200     1282      +82     
- Misses         57       59       +2     
Flag Coverage Δ
2.12-root-spark33 95.30% <95.54%> (-0.16%) :arrow_down:
2.12-root-spark34 ?
2.12-root-spark35 95.52% <96.49%> (+0.06%) :arrow_up:
2.13-root-spark35 96.08% <97.10%> (+0.02%) :arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Mar 01 '24 13:03 codecov[bot]

Is there an issue with the existing build? Otherwise I'm not sure we want to introduce such an upstream dependency.

cchantep avatar Mar 02 '24 22:03 cchantep

Is there an issue with the existing build? Otherwise I'm not sure we want to introduce such an upstream dependency.

As described in #787, the 14.2 and 14.3 LTS Databricks Runtimes cannot use Frameless 0.16 due to backported Spark 4 changes. The core of the encoding derivation logic, at least, has been essentially identical since the 2.4 days (2020, when I started making commits to support newer versions/runtimes); what changes is internal Spark API usage.

This PR aims for a hot-swappable, jar-based solution to such changes. It would reduce the burden on the core of frameless (including committers) to support such runtime differences, i.e. focus on OSS major releases only* and push runtime compatibility issues out to another library.

* This is not strictly required for encoding either; per the above, a 0.16+shim Frameless base can encode on all versions from 3.1.3 through to the 4.0 nightlies/14.3 DBR by swapping the runtime.
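
As a sketch, swapping the runtime could be a build-level choice. The artifact names below follow the shim_runtime naming pattern mentioned in this thread, but the group id and versions are placeholders, not published coordinates:

```scala
// Illustrative sbt fragment: pick a shim runtime per target Spark/DBR.
// "com.example" and the versions are placeholders.
val shimRuntime =
  sys.props.getOrElse("sparkProfile", "oss_3.5") match {
    case "oss_4.0" => "com.example" %% "shim_runtime_4.0.0.oss_4.0" % "0.0.1"
    case _         => "com.example" %% "shim_runtime_3.5.0.oss_3.5" % "0.0.1"
  }
libraryDependencies += shimRuntime
```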

chris-twiner avatar Mar 03 '24 10:03 chris-twiner

Yes, feels like we def need some kind of shims to simplify life of the DB folks who use frameless.

pomadchin avatar Mar 03 '24 13:03 pomadchin

With the last push I think I've covered all remaining non-essential expression package* and case class usage, and I've also targeted the 5 deprecated sql.functions. I've moved the sql FramelessInternals to frameless and shimmed the private interface functions, including removing the ml FramelessInternals. No sql packages are left outside of the hadoop testing for streaming (given that's not been accepted yet, I may fork). That could also close #300, given the remaining parts are kind of important (udf, join disambiguation etc.).

I've also copied the rest of the reflection code needed from Spark, so any future changes (assuming ExpressionEncoder stays) should be covered.

Publishing the test artefacts allows creation of a shaded test package for each Databricks runtime; I've started testless for that. (I've made clear in the README that although the tests show what should work, it doesn't imply a supported combination.) Ideally, for a new Databricks runtime, a new shim_runtime and testless push can be made and the appropriate frameless tests run without any need to involve frameless itself.

* Literal, NonSQLExpression, NamedExpression, Attribute, AttributeReference, UnsafeArrayData, GenericInternalRow, codegen, LeafExpression and Expression itself are still present in the main source, and Cast and GenericRowWithSchema in the pushdown tests. I'll be verifying whether any of them have issues against the test jars.

chris-twiner avatar Mar 09 '24 11:03 chris-twiner

The latest version, 1008b85, allows running all frameless test suites using the 3.5.1 build on Databricks 14.3 (only the CreateTests "dataset with different column order" case fails; unsure as to why yet).

The same frameless built against 3.5.1, running against the 9.1 LTS (3.1.2 base) shims, only fails the udf tests (incompatible expression) and the pushdown tests (the classes/interfaces don't exist until 3.2). That's a pretty decent result imo.

Although in both cases the community Databricks driver restarts after running the tests, which is annoying and leaves no trace (forcing you to catch the driver logs before the server dies).

chris-twiner avatar Mar 11 '24 19:03 chris-twiner

re dd10cee: the dataset test failed due to a Databricks-specific analysis check (VariantExpressionsCheck) added in 14.3 which evaluates types "early" (Quality was also hit by this with its struct functions). The test itself relied on the action being the only time the expression types are evaluated, rather than at the "as" call. This isn't always the case for normal OSS encoders, which (if using Tuple2, for example) will fail on UpCast checks at the "as" call.

As such the test failed before getting to create(Dataset), hence the createUnsafe logic is never called to correct the order. With that test correction, 14.3 passes all tests. I'm going through a list of verified version combinations (against this shim-based branch) here: https://github.com/sparkutils/testless?tab=readme-ov-file#tested-combos

chris-twiner avatar Mar 14 '24 11:03 chris-twiner

b1610674: an RC4 shim is needed for mapGroups on 12.2, and #803 / SPARK-41991 is the reason for the udf failures, fixed here.

Using this build and the RC4 shims, all LTS versions now work, which is nice (the predicate test is the only one that doesn't, as functionally and interface-wise it can't on pre-3.3 Spark versions).

chris-twiner avatar Mar 14 '24 20:03 chris-twiner

per https://github.com/typelevel/frameless/commit/b8802616f38eeeeba79c44bad994b0278300ba47, the proper fixes for https://github.com/typelevel/frameless/issues/803 and https://github.com/typelevel/frameless/issues/804 are confirmed as working on all LTS versions of Databricks, Spark 4 and the latest 15.0 runtime; test combinations are documented here

chris-twiner avatar Mar 27 '24 11:03 chris-twiner

NB: with 25cc5c3 the test cases are able to run on a full 14.3 and 15.0 cluster; most failures are now due to randomness rather than actual test code issues.

chris-twiner avatar Apr 12 '24 19:04 chris-twiner

omg that's huge; I'll try to get back to you soonish!

pomadchin avatar Jun 18 '24 11:06 pomadchin

RE: Spark4: If Spark 4 is such a breaking change we can just release frameless-spark4 artifacts; i.e. that's how tapir maintains play2.9 libs

RE: DBR compat layer: This PR is hard to review 🤔 we need to figure smth with fmt.

pomadchin avatar Jul 13 '24 17:07 pomadchin

RE: Spark4: If Spark 4 is such a breaking change we can just release frameless-spark4 artifacts; i.e. that's how tapir maintains play2.9 libs

This just pushes the problem out again and keeps frameless focussed on Spark differences. Along comes 4.1.0 and breaks more, or 5.0, or ...

RE: DBR compat layer: This PR is hard to review 🤔 we need to figure smth with fmt.

yeah, there should have been a reformat-all-files PR immediately after adding it in; ah well, hindsight etc.

chris-twiner avatar Jul 15 '24 14:07 chris-twiner

@chris-twiner

This just pushes the problem out again and keeps frameless focussed on spark differences. Along comes 4.1.0 and breaks more or 5 or ...

So this library is dependent on Spark, so I'd say we do the breaking change and move frameless to Spark 4, and keep maintenance releases if users ask for them.

Maintaining cross major version releases via a shims lib could be a bit too much.

BTW the same concerns are around the DB runtime:

  • how many versions are we gonna support?
  • who supports them?
  • can users support their own DB version runtime that is not yet published?

MB we could have some abstraction that users could override themselves so we shift this responsibility to the user?

pomadchin avatar Jul 15 '24 14:07 pomadchin

So this library is dependent on Spark, so I'd say we do the breaking change and move frameless to Spark 4, and keep maintenance releases if users ask for them.

Maintaining cross major version releases via a shims lib could be a bit too much.

BTW the same concerns are around the DB runtime:

  • how many versions are we gonna support?
  • who supports them?
  • can users support their own DB version runtime that is not yet published?

MB we could have some abstraction that users could override themselves so we shift this responsibility to the user?

I'll answer the last first: if it's typeclass-based they can throw their own in, and each location gets it (not sure this works for all of them off the top of my head, but I could try it out).

Version-wise, frameless should only support its usage interface; if someone (e.g. me) requires a funny version of frameless to work with an obscurity of Databricks, then let them provide the shim PR (or, per above, a custom typeclass): externalise the problem. The only things I'm aware of which aren't easy to take this approach with are changes which stop udfs or lits working (the final gencode springs to mind, as does the foldable pushdown issue), and of course the whole agnostic encoder mess.

If you think the idea of using typeclasses to isolate this sounds reasonable, I'm happy to give it a go.
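
A minimal sketch of what that typeclass isolation could look like; the names (`RuntimeSupport`, `encodeWith`) are hypothetical, not proposed frameless API:

```scala
// Hypothetical typeclass carrying runtime-specific behaviour.
trait RuntimeSupport {
  def describe: String
}

object RuntimeSupport {
  // Default instance in the implicit scope: used when the caller
  // doesn't bring their own.
  implicit val default: RuntimeSupport = new RuntimeSupport {
    val describe = "oss-default"
  }
}

// Library code resolves the behaviour through the implicit parameter,
// so users on an unpublished DBR can shadow it with their own instance.
def encodeWith(implicit rs: RuntimeSupport): String = rs.describe
```

A user would then put their own `implicit val` in lexical scope, which takes precedence over the companion-object default, shifting runtime support to whoever needs the odd runtime.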

chris-twiner avatar Jul 15 '24 14:07 chris-twiner