bigdata-file-viewer

Request support for LZ4 compression?

Open sharpe5 opened this issue 5 years ago • 4 comments

Fantastic work on this utility, thanks for developing it!

I'm wondering if it would be possible to compile in support for LZ4 compression? The viewer already supports Snappy. LZ4 is about 50% faster than Snappy at compression, so newer Parquet files may tend to use it instead. I was wondering why the viewer couldn't read the files in my archive, and it turns out this was the cause. I worked around the issue by migrating the data from LZ4 to Snappy.

I believe that the latest version of Arrow/Parquet supports all compression codecs by default?

sharpe5 avatar Mar 25 '20 08:03 sharpe5

Good point, marked your comment as an enhancement. Thanks for your contribution.

Eugene-Mark avatar Mar 29 '20 06:03 Eugene-Mark

@sharpe5 Hi sharpe5, could you provide some sample Parquet files compressed with LZ4 or another codec? I need them for testing.

Eugene-Mark avatar Apr 02 '20 14:04 Eugene-Mark

Here you go:

type=blockStream,rowCount=1000,compression=LZ4.zip

GitHub accepts .zip attachments, so the .parquet file is zipped; unzip it first. There should be 6 columns of random doubles and, going by the file name, 1000 rows.

Anything else, let me know!

sharpe5 avatar Apr 02 '20 16:04 sharpe5

C++ code to create said file (missing functions; demo only). Arrow Parquet library was installed using vcpkg. Compiles with MSVC and gcc.

void demo3()
{
	using namespace std;
	using namespace fmt;
	using namespace System::Diagnostics;

	print("Demo 3: Open a file, flush blocks of rows to it until done:\n");
	
	{
		print("  - Test:\n");
		double r1 { drand() };
		print("    - r1={}\n", r1);
	}

	//const int maxRows = 1'000'000;
	const int maxRows = 500;
	vector<tuple<double, double, double, double, double, double>> rows;	
	{	
		rows.reserve(maxRows);

		print("  - Creating raw data:\n");
		Stopwatch sw = Stopwatch::StartNew();
		for (int i=0;i<maxRows;i++)
		{
			rows.push_back({drand(), drand(), drand(), drand(), drand(), drand()});
		}
		sw.Stop();
		print("    - rows.size(): {}\n", rows.size());
		print("    - Done: {} milliseconds\n", sw.Elapsed().TotalMilliseconds());
	}
	
	shared_ptr<arrow::Table> arrowTable;
	{
		const vector<string> names ={"col1", "col2", "col3", "col4", "col5", "col6"};		
		print("  - Creating Parquet table:\n");
		Stopwatch sw = Stopwatch::StartNew();
		if (!arrow::stl::TableFromTupleRange(arrow::default_memory_pool(), rows, names, &arrowTable).ok()) 
		{
			// Error handling code should go here.
			print("    - Error when creating table.\n");
			return;
		}
		sw.Stop();
		print("    - Done: {} milliseconds\n", sw.Elapsed().TotalMilliseconds());
	}

	string filepath;
	{
		const string filename=format("type=blockStream,rowCount={},compression=LZ4.parquet",maxRows * 2); // As we are writing two chunks (see below).

		print("  - Write Parquet table:\n");
		Stopwatch sw = Stopwatch::StartNew();

		// Open the output file once; PARQUET_ASSIGN_OR_THROW throws if this fails.
		std::shared_ptr<arrow::io::FileOutputStream> outfile;
		PARQUET_ASSIGN_OR_THROW(outfile,arrow::io::FileOutputStream::Open(filename));

		// Ask the Parquet writer to compress every column with LZ4.
		parquet::WriterProperties::Builder propertiesBuilder;
		propertiesBuilder.compression(parquet::Compression::LZ4);
		const auto properties = propertiesBuilder.build();

		// https://stackoverflow.com/questions/45572962/how-can-i-write-streaming-row-oriented-data-using-parquet-cpp-without-buffering
		std::unique_ptr<parquet::arrow::FileWriter> writer;
		PARQUET_THROW_NOT_OK(parquet::arrow::FileWriter::Open(*(arrowTable->schema()), ::arrow::default_memory_pool(), outfile, properties, parquet::default_arrow_writer_properties(), &writer));

		const int chunkSize = static_cast<int>(rows.size());
		// Demonstrates writing data in blocks: each WriteTable call appends the whole table again.
		PARQUET_THROW_NOT_OK(writer->WriteTable(*arrowTable, chunkSize));
		PARQUET_THROW_NOT_OK(writer->WriteTable(*arrowTable, chunkSize));
		PARQUET_THROW_NOT_OK(writer->Close());
		sw.Stop();

		print("    - Compression: LZ4\n");
		print("    - Block size: {}\n", chunkSize);
		print("    - Done: {} milliseconds\n", sw.Elapsed().TotalMilliseconds());
		const string dir = System::IO::Directory::GetCurrentDirectoryAlt();
		filepath = Path::Combine(dir, filename);
	}

	{
		print("  - Output file: {}\n", filepath);
	}
}
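
The demo above calls drand() and a C#-style Stopwatch that are not included in the post. For anyone who wants to compile it, a minimal stand-in using only the C++ standard library might look like the sketch below. The names mirror the ones used in the demo, but the implementations are assumptions, not the original helpers, and TimeSpan is a hypothetical name for whatever Elapsed() returned. Path::Combine and Directory::GetCurrentDirectoryAlt are not covered here.

```cpp
#include <chrono>
#include <random>

// Assumed stand-in for the demo's drand(): a uniformly distributed double in [0, 1).
inline double drand()
{
	static std::mt19937_64 engine{std::random_device{}()};
	static std::uniform_real_distribution<double> dist(0.0, 1.0);
	return dist(engine);
}

namespace System { namespace Diagnostics {

// Hypothetical return type of Stopwatch::Elapsed(); wraps a duration.
class TimeSpan
{
public:
	explicit TimeSpan(std::chrono::steady_clock::duration d) : d_(d) {}

	// Elapsed time as fractional milliseconds.
	double TotalMilliseconds() const
	{
		return std::chrono::duration<double, std::milli>(d_).count();
	}

private:
	std::chrono::steady_clock::duration d_;
};

// Assumed stand-in for the C#-style Stopwatch used in the demo.
class Stopwatch
{
	using Clock = std::chrono::steady_clock;

public:
	// Creates a stopwatch that is already running.
	static Stopwatch StartNew()
	{
		Stopwatch sw;
		sw.start_ = Clock::now();
		sw.running_ = true;
		return sw;
	}

	// Freezes the elapsed time; calling Stop() twice has no further effect.
	void Stop()
	{
		if (running_)
		{
			accumulated_ += Clock::now() - start_;
			running_ = false;
		}
	}

	// Time measured so far, including the current run if still running.
	TimeSpan Elapsed() const
	{
		Clock::duration total = accumulated_;
		if (running_)
		{
			total += Clock::now() - start_;
		}
		return TimeSpan(total);
	}

private:
	Clock::time_point start_{};
	Clock::duration accumulated_{};
	bool running_ = false;
};

}} // namespace System::Diagnostics
```

With these in scope, the `using namespace System::Diagnostics;` line in the demo resolves Stopwatch as written. std::chrono::steady_clock is used rather than system_clock so that timings are unaffected by wall-clock adjustments.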

sharpe5 avatar Apr 02 '20 16:04 sharpe5

Closing this issue since it has been open for years. I will reopen it if the feature makes it onto the roadmap.

Eugene-Mark avatar Sep 04 '23 01:09 Eugene-Mark