
Add SplitOnChunkSize() to FileInfo

Open adamfisher opened this issue 6 years ago • 4 comments

This issue proposes adding SplitOnChunkSize() to FileInfo. The method would split a file into multiple files and return an array of the newly created files. The main challenges are handling line breaks when breakOnNewlines is true and, for large files, buffering one chunk of data at a time so as not to exhaust system resources.

/// <summary>
/// Splits a file into multiple files based on the specified chunk size of each file.
/// </summary>
/// <param name="file">The file.</param>
/// <param name="chunkSize">The maximum number of bytes to store in each file.
/// If a chunk size is not provided, files will be split into 1 MB chunks by default.
/// The breakOnNewlines parameter can slightly affect the size of each file.</param>
/// <param name="targetPath">The destination where the split files will be saved.</param>
/// <param name="deleteAfterSplit">if set to <c>true</c>, the original file is deleted after creating the newly split files.</param>
/// <param name="breakOnNewlines">if set to <c>true</c> break the file on the next newline once the chunk size limit is reached.</param>
/// <returns>
/// An array of references to the split files.
/// </returns>
/// <exception cref="ArgumentNullException">file</exception>
/// <exception cref="ArgumentOutOfRangeException">chunkSize - The chunk size must be larger than 0 bytes.</exception>
public static FileInfo[] SplitOnChunkSize(
	this FileInfo file,
	int chunkSize = 1000000,
	DirectoryInfo targetPath = null,
	bool deleteAfterSplit = false,
	bool breakOnNewlines = true
	)
{
	if (file == null)
		throw new ArgumentNullException(nameof(file));

	if (chunkSize < 1)
		throw new ArgumentOutOfRangeException(nameof(chunkSize), chunkSize,
			"The chunk size must be larger than 0 bytes.");

	if (file.Length <= chunkSize)
		return new[] {file};

	var buffer = new byte[chunkSize];
	var extraBuffer = new List<byte>();
	targetPath = targetPath ?? file.Directory;
	var chunkedFiles = new List<FileInfo>((int)(file.Length / chunkSize) + 1);

	using (var input = file.OpenRead())
	{
		var index = 1;

		while (input.Position < input.Length)
		{
			var chunkFileName = new FileInfo(Path.Combine(targetPath.FullName, $"{file.Name}.CHUNK_{index++}"));
			chunkedFiles.Add(chunkFileName);
			using (var output = chunkFileName.Create())
			{
				var chunkBytesRead = 0;
				while (chunkBytesRead < chunkSize)
				{
					var bytesRead = input.Read(buffer,
						chunkBytesRead,
						chunkSize - chunkBytesRead);

					if (bytesRead == 0)
					{
						break;
					}

					chunkBytesRead += bytesRead;
				}

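				// Always write the fixed-size part of the chunk, whether or not
				// breakOnNewlines is set.
				output.Write(buffer, 0, chunkBytesRead);

				// Only a full chunk can be extended; a short chunk means the input
				// is exhausted. Check the last byte actually read, not stale data.
				if (breakOnNewlines && chunkBytesRead == chunkSize && buffer[chunkBytesRead - 1] != '\n')
				{
					// Read byte by byte up to and including the next newline.
					int next;
					while ((next = input.ReadByte()) != -1)
					{
						extraBuffer.Add((byte)next);
						if (next == '\n')
							break;
					}

					if (extraBuffer.Count > 0)
						output.Write(extraBuffer.ToArray(), 0, extraBuffer.Count);

					extraBuffer.Clear();
				}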
			}
		}
	}

	if (deleteAfterSplit)
		file.Delete();

	return chunkedFiles.ToArray();
}
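
The listing above allocates a buffer of chunkSize bytes, so a very large chunk size means a very large allocation. One possible way to keep memory use flat is to copy each chunk through a small reusable buffer instead. The following is only a sketch under that assumption: `ChunkCopySketch` and `CopyBytes` are hypothetical names and are not part of Z.ExtensionMethods.

```csharp
using System;
using System.IO;

public static class ChunkCopySketch
{
	// Hypothetical helper: copies up to `count` bytes from input to output
	// through a caller-supplied buffer of any non-zero size.
	// Returns the number of bytes actually copied (less than count at end of input).
	public static long CopyBytes(Stream input, Stream output, long count, byte[] buffer)
	{
		if (input == null) throw new ArgumentNullException(nameof(input));
		if (output == null) throw new ArgumentNullException(nameof(output));
		if (buffer == null || buffer.Length == 0)
			throw new ArgumentException("The buffer must not be empty.", nameof(buffer));

		long copied = 0;
		while (copied < count)
		{
			// Never request more than what is left of the chunk.
			var toRead = (int)Math.Min(buffer.Length, count - copied);
			var bytesRead = input.Read(buffer, 0, toRead);
			if (bytesRead == 0)
				break; // end of input

			output.Write(buffer, 0, bytesRead);
			copied += bytesRead;
		}

		return copied;
	}
}
```

With a helper like this, the inner read loop could become a single call such as `CopyBytes(input, output, chunkSize, copyBuffer)`, where `copyBuffer` is allocated once (for example 64 KB). The newline handling would then need to track the last byte copied separately, which this sketch does not do.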

adamfisher, Dec 27 '18 18:12

Calling it Split() instead of SplitOnChunkSize() might also be OK if we want overloads in the future that handle other scenarios, such as splitting on the number of lines per file.
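
To make the "number of lines per file" scenario concrete, here is a rough sketch of what such a method might look like. Everything in it is an assumption for illustration: the name `SplitOnLineCount`, its parameters, and the `LineSplitSketch` class are hypothetical and not part of Z.ExtensionMethods. It also assumes a text file in the default encoding and rewrites line endings as Environment.NewLine.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class LineSplitSketch
{
	// Hypothetical method: writes every `linesPerFile` lines of a text file
	// to a new ".CHUNK_n" file and returns the files it created.
	public static FileInfo[] SplitOnLineCount(
		this FileInfo file,
		int linesPerFile,
		DirectoryInfo targetPath = null)
	{
		if (file == null)
			throw new ArgumentNullException(nameof(file));
		if (linesPerFile < 1)
			throw new ArgumentOutOfRangeException(nameof(linesPerFile), linesPerFile,
				"The line count must be larger than 0.");

		targetPath = targetPath ?? file.Directory;
		var chunkedFiles = new List<FileInfo>();
		StreamWriter writer = null;
		var linesInChunk = 0;

		try
		{
			// File.ReadLines enumerates lazily, so the whole file is never in memory.
			foreach (var line in File.ReadLines(file.FullName))
			{
				if (writer == null)
				{
					// Start the next chunk file only when there is a line to put in it.
					var chunk = new FileInfo(Path.Combine(targetPath.FullName,
						$"{file.Name}.CHUNK_{chunkedFiles.Count + 1}"));
					chunkedFiles.Add(chunk);
					writer = chunk.CreateText();
				}

				writer.WriteLine(line);

				if (++linesInChunk == linesPerFile)
				{
					// Chunk is full: close it so the next line opens a new file.
					writer.Dispose();
					writer = null;
					linesInChunk = 0;
				}
			}
		}
		finally
		{
			writer?.Dispose();
		}

		return chunkedFiles.ToArray();
	}
}
```

A call might look like `new FileInfo("data.csv").SplitOnLineCount(1000)`, where the file name is only a placeholder.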

adamfisher, Dec 27 '18 18:12

Thanks, @adamfisher, for all your new extensions.

My employee will review all of them when he is back from his vacation in one week.

Best Regards,

Jonathan

JonathanMagnan, Dec 28 '18 15:12

No worries Jonathan. Very cool library you guys have. I would rather contribute to it instead of creating yet another one-off NuGet package 😃

adamfisher, Dec 31 '18 16:12

@JonathanMagnan Any movement on this?

adamfisher, Apr 02 '20 18:04