ApacheOrcDotNet
Use for reading .ORC files without knowing their schema
Hi, thanks for writing this lib! I am trying to understand whether I can use it to read .orc files. My service receives many .orc files from customers, and I wish to read the first n columns and the first k rows from each file without knowing the file schema. Can I do that with this lib?
Hi @YuvalGrMic, take a look at https://github.com/ddrinka/ApacheOrcDotNet/blob/master/src/ApacheOrcDotNet.ReaderTest.App/Program.cs which uses the low-level API to dynamically read data from an ORC file without needing to pre-create POCOs to deserialize the data into. Maybe you could use that code to accomplish what you're aiming to do?
If you're feeling ambitious, you could also try out the new reader functionality which has less testing but uses modern .Net concepts to avoid high levels of garbage collection. There's not an explicit example of dynamically reading columns, but combining this sample code with the code above, you could accomplish your requirement: https://github.com/ddrinka/ApacheOrcDotNet/blob/ab93abd5198e6f4a56ba491a988643cb049518d3/src/ApacheOrcDotNet.OptimizedReaderTest.App/ReadAllApp.cs
If I were tasked with your requirements today, rather than using a lower-level library like ApacheOrcDotNet, I would use Pandas in Python. Here's some code ChatGPT wrote:
I have some ORC files with different columns in each file. Write some code to load all the ORC files in a directory, determine the column names and types, and read the first 10 rows of each file, outputting a CSV. Use Python.
```python
import os
import pyarrow.orc as orc
import pandas as pd

def read_orc_files_in_directory(directory_path):
    orc_files = [file for file in os.listdir(directory_path) if file.endswith(".orc")]
    for file in orc_files:
        file_path = os.path.join(directory_path, file)
        reader = orc.ORCFile(file_path)
        schema = reader.schema
        print(f"File: {file}")
        print(f"Column Names: {schema.names}")
        print(f"Column Types: {schema.types}")
        # Read the first 10 rows
        num_rows = min(10, reader.nrows)
        rows = reader.read().to_pandas().head(num_rows)
        output_csv(file, rows)

def output_csv(file_name, data_frame):
    csv_file_name = os.path.splitext(file_name)[0] + ".csv"
    data_frame.to_csv(csv_file_name, index=False)
    print(f"CSV output saved as: {csv_file_name}")

if __name__ == "__main__":
    directory_path = "/path/to/orc/directory"
    read_orc_files_in_directory(directory_path)
```
Note GPT's caveat:
Please note that the code assumes that the ORC files are not too large to fit into memory. If your files are very large, you might need to consider processing them in chunks.
Hi @ddrinka, thank you for your reply. I am using a .NET service and can't run Python code on my machine, therefore I want to use this .NET lib. I am running this reader and can read some files, but I don't know what the 'FileTail' really does. My end goal is to sample only the first 100 rows of a file. But when I use the Reader.Read() I get all of the rows in the column (this can be very heavy on my machine's memory). What would you recommend I do? Again, thanks a lot in advance.
Hi @YuvalGrMic, if you're going to work with the low-level API you'll need to have a firm understanding of the ORC file format. The FileTail is part of the ORC format.
The OrcReader.Read call returns an IEnumerable. Just stop enumerating when you have read enough rows. It's loading them in batches and passing them on to the caller, so if you stop enumerating after 100 rows, you'll only have read the first stripe of data in the file.
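The early-exit behavior described above can be sketched in Python terms (the language the thread already uses for examples): a lazy generator yields rows batch by batch, so batches past your cutoff are never loaded. The function and names here are purely illustrative, not part of ApacheOrcDotNet or pyarrow:

```python
import itertools

def read_rows_in_batches(batches, loaded=None):
    # Mimics a lazy reader like OrcReader.Read: rows are produced
    # batch by batch, and a batch is only "loaded" when the consumer
    # actually asks for rows from it.
    for batch in batches:
        if loaded is not None:
            loaded.append(batch)  # record which batches were actually read
        for row in batch:
            yield row

loaded = []
batches = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
# Take only the first 4 rows, then stop enumerating.
first_four = list(itertools.islice(read_rows_in_batches(batches, loaded), 4))
# Only the first two batches were ever touched; the third was never read.
```

Stopping a `foreach` early over the `IEnumerable` in C# has the same effect: iteration simply never reaches the later stripes.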