Bulk Export Issue - Spark incompatible
Hi, I have LangSmith Plus and am trying to export data to AWS Redshift. Because many of the fields (e.g. feedback or metadata) contain nested JSON, I need to preprocess them and extract the relevant fields in order to analyse them in Redshift (a relational DB).
My current AWS setup is:
S3 (Parquet) -> AWS Glue (Spark) -> Redshift
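For context, the preprocessing I want Glue to do looks roughly like this (just a sketch; the column names, JSON paths, and S3 locations are assumptions about the export schema, and it presumes the nested fields arrive as JSON strings):

```python
# Sketch of the flattening step in Glue (PySpark). Column names like
# "feedback"/"extra" and the JSON paths are assumptions, not the real schema.
from pyspark.sql import SparkSession
from pyspark.sql.functions import get_json_object

spark = SparkSession.builder.appName("flatten-langsmith-runs").getOrCreate()

# placeholder bucket/prefix
runs = spark.read.parquet("s3://my-bucket/langsmith-export/")

flat = runs.select(
    "id",
    "name",
    # pull scalar values out of the nested JSON so Redshift sees flat columns
    get_json_object("feedback", "$.score").alias("feedback_score"),
    get_json_object("extra", "$.metadata.user_id").alias("user_id"),
)
flat.write.mode("overwrite").parquet("s3://my-bucket/langsmith-flat/")
```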
The issue is that already at the ingestion stage I hit an error due to the incompatibility between the format LangSmith uses and what AWS Glue expects:
Data preview failure. Py4JJavaError - An error occurred while calling o263.getSampleDynamicFrame. :
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3) (172.35.150.16 executor 1):
org.apache.spark.sql.AnalysisException: Illegal Parquet type: INT64 (TIMESTAMP(NANOS,false)).
  at org.apache.spark.sql.errors.QueryCompilationErrors$.illegalParquetTypeError(QueryCompilationErrors.scala:1830)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.illegalType$1(ParquetSchemaConverter.scala:199)
  at
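Inspecting the Parquet footer locally confirms which columns carry nanosecond precision (a sketch; the file name is a placeholder):

```python
# List the columns written with nanosecond timestamp precision.
import pyarrow as pa
import pyarrow.parquet as pq

schema = pq.read_schema("part-00000.parquet")  # placeholder file name
for field in schema:
    if pa.types.is_timestamp(field.type) and field.type.unit == "ns":
        print(field.name, field.type)
```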
ChatGPT responds with this:
You're hitting a Parquet logical type your Glue/Spark preview can’t parse:
Illegal Parquet type: INT64 (TIMESTAMP(NANOS,false))
Spark (including Glue 4/5) understands Parquet timestamps in millis or micros, not nanos. Your files were written with nanosecond precision, so Glue’s data preview (which uses Spark under the hood) aborts while inferring the schema.
What this means
The parquet footer says some column(s) are TIMESTAMP(NANOS, …).
Spark can’t convert that to a Spark SQL type → schema inference fails → preview fails.
Any Spark read of those objects will fail until you rewrite (or avoid) the nano timestamps.
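The only workaround I see on my side is to rewrite every file with microsecond precision before Glue touches it, something along these lines (a sketch with placeholder paths), but that adds a whole extra hop to the pipeline:

```python
# Downcast nanosecond timestamps to microseconds so Spark/Glue can read them.
import pyarrow.parquet as pq

table = pq.read_table("langsmith-export/part-00000.parquet")   # placeholder
pq.write_table(
    table,
    "langsmith-export-us/part-00000.parquet",                  # placeholder
    coerce_timestamps="us",           # TIMESTAMP(NANOS) -> TIMESTAMP(MICROS)
    allow_truncated_timestamps=True,  # silently drop sub-microsecond digits
)
```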
Can you please assist with this issue? I believe most people who want to extract the data will also want to perform some ETL on it, and Spark is a widely used tool, so compatibility on that front is expected.
@dast-draftwise Please use the LangChain Forum for LangSmith platform issues (vs. client SDKs). You’ll get faster support from our customer support and engineering teams there. I’ve forwarded your bulk-export issue to the responsible team.
@angus-langchain thank you for your comment, but the Forum is super inactive. I have other questions about LangSmith that I asked on the Forum and haven't received any support in over 2 weeks.
Hi @dast-draftwise, thanks for your detailed report here. We've added this to the list of bulk export improvements we plan to make soon. I can't promise an exact timeline, but will follow up once released.
Okay - thank you. If it were possible to specify columns to be dropped while performing the bulk export, that would be amazing. I'm not sure how difficult it would be to implement, but having both the timestamp fixed and the ability to drop columns would make trace processing outside of LangSmith much easier.
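For now I can drop columns locally after download, roughly like this (a sketch; the column names are just examples, not the confirmed export schema), but doing it server-side in the export would save the extra pass:

```python
# Interim sketch: drop heavy columns locally before Glue sees the files.
import pyarrow.parquet as pq

table = pq.read_table("langsmith-export/part-00000.parquet")  # placeholder
slim = table.drop_columns(["inputs", "outputs"])              # example names
pq.write_table(slim, "langsmith-slim/part-00000.parquet")
```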
To confirm, you mean the ability to select which columns are exported? If yes, that is also on our roadmap
Yes, exactly -
@bvs-langchain Hi, any hints on when this could be available?