
Bulk Export Issue - Spark incompatible

Open dast-draftwise opened this issue 2 months ago • 7 comments

Hi, I have LangSmith Plus and am trying to export data to AWS Redshift. Because many of the fields (e.g. feedback or metadata) contain nested JSON, I need to preprocess them and extract the relevant fields in order to analyse them in Redshift (a relational DB).
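For context, the kind of preprocessing I have in mind is roughly the following PySpark sketch; the nested field names and S3 paths are placeholders, not the actual export schema:

```python
# Rough sketch of the flattening step (field names and paths are placeholders,
# not the real LangSmith export schema).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_json

spark = SparkSession.builder.getOrCreate()

runs = spark.read.parquet("s3://my-bucket/langsmith-export/")

flat = runs.select(
    col("id"),
    col("name"),
    col("start_time"),
    # pull individual fields out of nested structs...
    col("feedback.score").alias("feedback_score"),
    # ...or keep whole structs as JSON strings for Redshift SUPER columns
    to_json(col("extra")).alias("extra_json"),
)

flat.write.mode("overwrite").parquet("s3://my-bucket/langsmith-flat/")
```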

My current AWS setup is:

S3 (Parquet) -> AWS Glue (Spark) -> Redshift

The issue is that already at the ingestion stage I hit an incompatibility between the format LangSmith uses and what AWS Glue can read:

Data preview failure. Py4JJavaError - An error occurred while calling o263.getSampleDynamicFrame.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3) (172.35.150.16 executor 1): org.apache.spark.sql.AnalysisException: Illegal Parquet type: INT64 (TIMESTAMP(NANOS,false)).
at org.apache.spark.sql.errors.QueryCompilationErrors$.illegalParquetTypeError(QueryCompilationErrors.scala:1830)
at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.illegalType$1(ParquetSchemaConverter.scala:199)
at
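For anyone else hitting this, a quick way to confirm which columns were written with nanosecond precision is to inspect the Parquet footer with pyarrow (the file name below is a placeholder for one of the exported part files):

```python
# List the columns whose Parquet/Arrow type is a nanosecond timestamp.
import pyarrow.parquet as pq

schema = pq.read_schema("runs-part-0000.parquet")  # placeholder file name
for field in schema:
    if str(field.type).startswith("timestamp[ns"):
        print(field.name, field.type)
```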

ChatGPT responds with this:

You're hitting a Parquet logical type your Glue/Spark preview can’t parse:

Illegal Parquet type: INT64 (TIMESTAMP(NANOS,false))

Spark (including Glue 4/5) understands Parquet timestamps in millis or micros, not nanos. Your files were written with nanosecond precision, so Glue’s data preview (which uses Spark under the hood) aborts while inferring the schema.
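If the Glue job's Spark version includes SPARK-40819, there is a legacy flag that maps NANOS columns to plain longs instead of failing; a rough read-side sketch (paths and column names are placeholders, and the values then arrive as epoch nanoseconds that you have to convert yourself):

```python
# Read-side workaround (untested here): newer Spark 3.x releases can map
# Parquet TIMESTAMP(NANOS) to LongType instead of raising AnalysisException.
# `spark` is the SparkSession provided by the Glue job.
from pyspark.sql.functions import col

spark.conf.set("spark.sql.legacy.parquet.nanosAsLong", "true")

df = spark.read.parquet("s3://my-bucket/langsmith-export/")  # placeholder path
df = df.withColumn(
    "start_time_ts",
    (col("start_time") / 1_000_000_000).cast("timestamp"),  # epoch ns -> seconds -> timestamp
)
```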

What this means

The parquet footer says some column(s) are TIMESTAMP(NANOS, …).

Spark can’t convert that to a Spark SQL type → schema inference fails → preview fails.

Any Spark read of those objects will fail until you rewrite (or avoid) the nano timestamps.
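If rewriting the files is acceptable, one stopgap is to downcast the nanosecond timestamps to microseconds with pyarrow before Glue ever reads them; a rough sketch (file names are placeholders, and sub-microsecond precision is dropped):

```python
# Rewrite a part file with microsecond timestamps so Spark/Glue can read it.
import pyarrow.parquet as pq

table = pq.read_table("runs-part-0000.parquet")  # placeholder file name
pq.write_table(
    table,
    "runs-part-0000-us.parquet",       # placeholder file name
    coerce_timestamps="us",            # write micros instead of nanos
    allow_truncated_timestamps=True,   # silently drop sub-microsecond precision
)
```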

Can you please assist with this issue? I believe most people who want to export the data will also want to run some ETL on it, and Spark is a widely used tool, so compatibility on that front is expected.

dast-draftwise commented Sep 26 '25 12:09

@dast-draftwise Please use the LangChain Forum for LangSmith platform issues (vs. client SDKs). You’ll get faster support from our customer support and engineering teams there. I’ve forwarded your bulk-export issue to the responsible team.

angus-langchain commented Sep 26 '25 17:09

@angus-langchain thank you for your comment, but the Forum is very inactive. I have other questions about LangSmith that I asked on the forum and haven't received any support in over two weeks.

dast-draftwise commented Sep 29 '25 08:09

Hi @dast-draftwise, thanks for your detailed report here. We've added this to the list of bulk export improvements we plan to make soon. I can't promise an exact timeline, but will follow up once released.

bvs-langchain commented Sep 30 '25 14:09

Okay, thank you. If it were possible to specify columns to be dropped while performing the bulk export, that would be amazing. I'm not sure how difficult it would be to implement, but having both the timestamp issue fixed and the ability to drop columns would allow for much easier trace processing outside of LangSmith.
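In the meantime, a pyarrow pass right after download could drop unneeded columns (the column names below are just examples, not the real export schema):

```python
# Keep only the columns needed downstream and rewrite the file.
import pyarrow.parquet as pq

wanted = ["id", "name", "start_time", "end_time", "feedback"]    # example columns
table = pq.read_table("runs-part-0000.parquet", columns=wanted)  # placeholder file
pq.write_table(table, "runs-part-0000-slim.parquet")
```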

dast-draftwise commented Oct 03 '25 07:10

To confirm, you mean the ability to select which columns are exported? If yes, that is also on our roadmap

bvs-langchain commented Oct 03 '25 18:10

> To confirm, you mean the ability to select which columns are exported? If yes, that is also on our roadmap

Yes, exactly.

dast-draftwise commented Oct 06 '25 08:10

@bvs-langchain Hi, any hints on when this might be available?

dast-draftwise commented Oct 14 '25 20:10