
Files written by gohdfs library are unreadable by ClickHouse

knl opened this issue on Mar 13, 2023 · 0 comments

I have already submitted this issue to the gohdfs library: https://github.com/colinmarc/hdfs/issues/313

When I write a simple CSV file (I initially tried Parquet, with the same result), the file written by this library ends up not being readable by ClickHouse. This might also be an issue with ClickHouse, as all the other data lakes I use can read these files without problems.

What I tried so far:

  • the attached program writing directly to HDFS: cannot be read by ClickHouse
  • the attached program writing to a local file, then uploading with gohdfs put: cannot be read by ClickHouse (a Go sketch of this variant follows the list)
  • the attached program writing to a local file, then uploading with hdfs dfs -put (Hadoop tools): can be read by ClickHouse
  • the attached program writing directly to HDFS, followed by hdfs dfs -get and hdfs dfs -put: can be read by ClickHouse, nothing missing
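For reference, this is roughly what the "write locally, then upload" variant looks like in Go. This is only a sketch: I am assuming client.CopyToRemote is more or less what gohdfs put does under the hood, and the local path is a placeholder.

package main

import (
	"log"
	"os"

	"github.com/colinmarc/hdfs/v2"
)

func main() {
	// Write the CSV to a local file first (row contents elided; same rows as the main program below).
	local := "/tmp/flat.csv"
	if err := os.WriteFile(local, []byte("20,0,50.000000\n"), 0o644); err != nil {
		log.Fatalln("local write failed:", err)
	}

	// Empty address: the namenode is taken from the Hadoop configuration.
	client, err := hdfs.New("")
	if err != nil {
		log.Fatalln("can't create hdfs client:", err)
	}

	// Upload the local file to HDFS.
	if err := client.CopyToRemote(local, "/random/yet/existing/path/flat.csv"); err != nil {
		log.Fatalln("upload failed:", err)
	}
	log.Println("upload finished")
}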

This is the program that writes directly to HDFS:

package main

import (
	"fmt"
	"log"

	"github.com/colinmarc/hdfs/v2"
)

func main() {
	// Empty address: the namenode is read from the Hadoop configuration.
	client, err := hdfs.New("")
	if err != nil {
		log.Println("Can't create hdfs client", err)
		return
	}

	// Remove any previous copy, then create the file on HDFS.
	_ = client.Remove("/random/yet/existing/path/flat.csv")
	fw, err := client.Create("/random/yet/existing/path/flat.csv")
	if err != nil {
		log.Println("Can't create writer", err)
		return
	}

	// Write 100 simple CSV rows: int32, int64, float32.
	num := 100
	for i := 0; i < num; i++ {
		if _, err = fmt.Fprintf(fw, "%d,%d,%f\n", int32(20+i%5), int64(i), float32(50.0)); err != nil {
			log.Println("Write error", err)
		}
	}
	log.Println("Write finished")
	if err = fw.Close(); err != nil {
		log.Println("Issue closing file", err)
	}
	log.Println("Wrote", num, "rows")
}
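To rule out a problem visible from the gohdfs side, the file can also be read back through the same library after closing it. A minimal sketch (same path as above; it just drains the file and counts the bytes):

package main

import (
	"io"
	"log"

	"github.com/colinmarc/hdfs/v2"
)

func main() {
	client, err := hdfs.New("")
	if err != nil {
		log.Fatalln("can't create hdfs client:", err)
	}

	// Open the freshly written file and drain it, counting the bytes.
	fr, err := client.Open("/random/yet/existing/path/flat.csv")
	if err != nil {
		log.Fatalln("can't open file for reading:", err)
	}
	defer fr.Close()

	n, err := io.Copy(io.Discard, fr)
	if err != nil {
		log.Fatalln("read error:", err)
	}
	log.Println("read back", n, "bytes")
}

Given that hdfs dfs -get retrieves the complete file, I would expect this read-back to succeed as well.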

This is the response from running the latest clickhouse-local:

server.internal :) select * from hdfs('hdfs://nameservice1/random/yet/existing/path/flat.csv', 'CSV')

SELECT *
FROM hdfs('hdfs://nameservice1/random/yet/existing/path/flat.csv', 'CSV')

Query id: e33d9bf7-41b0-4025-a5fc-8dc6ebb65c0f


0 rows in set. Elapsed: 60.292 sec.

Received exception:
Code: 210. DB::Exception: Fail to read from HDFS: hdfs://nameservice1, file path: /random/yet/existing/path/flat.csv. Error:
HdfsIOException: InputStreamImpl: cannot read file: /random/yet/existing/path/flat.csv, from position 0, size: 1048576.
Caused by: HdfsIOException: InputStreamImpl: all nodes have been tried and no valid replica can be read for Block: [block pool ID: BP-2134387385-192.168.12.6-1648216715170 block ID 1367614010_294283603].: Cannot extract table structure from CSV format file. You can specify the structure manually. (NETWORK_ERROR)

I'm using Hadoop 2.7.3.2.6.5.0-292. Using strace, I can see that ClickHouse is trying to access different data nodes, so there is a lot of network traffic.
