
[Bug]: Error reading parquet file

Open pauloalexsb opened this issue 1 year ago • 4 comments

Apache Hop version?

2.7.0

Java version?

21.0.2

Operating system

Windows

What happened?

I'm using Parquet Input. Following the documentation, I started with Get File Names and followed it with Parquet File Input; I chose filename as the Filename Field and browsed to a parquet file to pull the fields. Everything was fine up to that point, but the following error appears when the pipeline tries to read the files:

Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException : Relative path in absolute URI: ParquetStream of file 'E:%5CQSR%5CTEMP%5CSIPNI_COVID19_OUTROS_21012024_06_50.parquet'

I'm on a Windows machine and the absolute path of the file is:

E:\QSR\TEMP\SIPNI_COVID19_OUTROS_21012024_06_50.parquet

I'm trying to read several files, and after some tests I realized that the error was triggered by a single file. I then pulled the fields from this file alone and found that two of its field types (Type) differed from those in the other files. For example, in file 1 the column type was Integer and in file 2 the same column was of type Number. After changing these fields to String, the error no longer appeared and I was able to read all the files.

I think the error message should state more explicitly whether the file was not found or whether reading it failed. In my tests I received a correct error when the file did not exist. But when the files existed and had the same number of columns and the same column names, with only a data type differing between the files, the path error shown above appeared instead.

erro_hop

Issue Priority

Priority: 3

Issue Component

Component: Hop Gui, Component: Pipelines

pauloalexsb avatar Jan 31 '24 17:01 pauloalexsb

I forgot... to reproduce the error, take two parquet files with the same column names and the same number of columns. In file 1 a column must be of type Integer, and in file 2 the same column must be of type Number. Use Get File Names to pick up both files, choose one of the two in Parquet File Input to get the fields, and run the pipeline.

pauloalexsb avatar Jan 31 '24 17:01 pauloalexsb

Would you please create two small parquet files and an hpl file that reproduce the issue? That will make it much easier to isolate the problem. The parquet files can be super tiny.

usbrandon avatar Feb 02 '24 03:02 usbrandon

Hello, here are two parquet files with two fields and the same field names. One of them has the field as type Integer, and the other has the same field as type Number. As you can see in the hpl file, I picked up both files and chose one of them to get the fields, and the error occurs. If I change the type to String, reading works normally. Please note that the error message suggests a file location problem, not a field type problem. parquet_hpl_file.tar.gz

pauloalexsb avatar Feb 02 '24 19:02 pauloalexsb

@pauloalexsb I was looking into your problem. First, I replaced the backslashes in the path with forward slashes by means of a replace transform. With that change the error is different and clearer:

2024/02/28 19:26:04 - Parquet File Input.0 - Caused by: java.net.URISyntaxException: Illegal character in scheme name at index 13: ParquetStream of file 'C:/java/hop/hop/config/projects/default/parquet_hpl_file/interger.parquet'

Can you tell me on which platform your files were generated?

sramazzina avatar Feb 28 '24 22:02 sramazzina
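
Both error messages quoted in this thread can be reproduced with plain `java.net.URI`, which may help explain why they look like path errors. The sketch below is only an illustration under an assumption: that somewhere in the read path a description string such as `ParquetStream of file '...'` ends up being parsed as a URI, with the text before the first colon taken as the scheme. Whether Hop or the underlying Parquet/Hadoop code actually does this is not confirmed here. The class `UriDescriptionDemo`, the helper `describeFailure`, and the `example.parquet` paths are hypothetical names used for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

class UriDescriptionDemo {

    // Hypothetical helper: treats everything before the first ':' as the URI
    // scheme and everything after it as the path (an assumption about how the
    // description string might get split), then builds a java.net.URI from them.
    static String describeFailure(String text) {
        int colon = text.indexOf(':');
        String scheme = text.substring(0, colon);
        String path = text.substring(colon + 1);
        try {
            new URI(scheme, null, path, null, null);
            return "no error";
        } catch (URISyntaxException e) {
            // The message holds the reason, an optional index, and the input.
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Backslashes: the path part does not start with '/', so the URI
        // constructor rejects it as a relative path in an absolute URI.
        System.out.println(describeFailure(
            "ParquetStream of file 'E:\\QSR\\TEMP\\example.parquet'"));

        // Forward slashes: the path part is now absolute, so parsing goes
        // further and fails on the space inside the supposed scheme name.
        System.out.println(describeFailure(
            "ParquetStream of file 'E:/QSR/TEMP/example.parquet'"));
    }
}
```

In both cases the string being parsed is a human-readable description rather than the file path itself, so switching the slash direction changes the message without addressing the Integer versus Number mismatch described above.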