
Question: is there a way to write SQL query result (each record in []any) to a parquet file (with a json schema)?

Open ns-gzhang opened this issue 1 year ago • 1 comment

I'd like to export SQL query results to parquet files using this package. The parquet writer generated from a JSON schema takes an object of the struct corresponding to the schema, but constructing such struct objects from `[]any` rows seems to incur unnecessary overhead. Basically, we need to dynamically construct a schema and then write each record, available as `[]any`, to the parquet file. Thanks.

ns-gzhang avatar Mar 27 '25 04:03 ns-gzhang

Just to report back: the best I could do with the existing API is to define a struct with parquet tags, copy each SQL result row from `[]any` into the struct object with reflect, and send it to the parquet writer. Code fragment:

	...
	txnlog := &Txnlog{}
	rec := reflect.ValueOf(txnlog).Elem()

	// Scan targets: v[i] points at r[i], so Scan fills r.
	r := make([]any, len(cols))
	v := make([]any, len(cols))
	for i := range v {
		v[i] = &r[i]
	}
	for rows.Next() {
		if err := rows.Scan(v...); err != nil {
			log.Fatalf("scan fail, err: %s", err)
		}
		// Copy the scanned values into the tagged struct by field index;
		// field order must match the column order.
		for i := range v {
			rec.Field(i).Set(reflect.ValueOf(r[i]))
		}

		if err := pw.Write(txnlog); err != nil {
			log.Println("Write error", err)
		}
	}
	if err := rows.Err(); err != nil {
		log.Fatalf("rows error: %s", err)
	}
	if err := pw.WriteStop(); err != nil {
		log.Println("WriteStop error", err)
		return
	}
	log.Println("Write Finished")

For some reason, the generated parquet file is pretty large, about 2x the size of one produced by DuckDB from exactly the same test data with comparable encodings. It would be great if someone could explain how to tune the writer to make the file smaller... Thanks.
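For what it's worth, with xitongsys/parquet-go (which this snippet appears to use), file size is typically driven by the compression codec, the row-group size, and the encodings declared in the struct tags. A hedged sketch of knobs to experiment with, as a config fragment only (verify the field names and codec constants against your library version before relying on them):

```go
// Larger row groups give the encoder more data to compress at once.
pw.RowGroupSize = 128 * 1024 * 1024

// Try a stronger codec than the default (e.g. ZSTD vs SNAPPY).
pw.CompressionType = parquet.CompressionCodec_ZSTD

// Dictionary encoding in the struct tags often shrinks repetitive
// string columns considerably, e.g.:
//   Name string `parquet:"name=name, type=BYTE_ARRAY, convertedtype=UTF8, encoding=PLAIN_DICTIONARY"`
```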

ns-gzhang avatar Mar 28 '25 18:03 ns-gzhang