CsvReader
Why is CsvReader forward only?
More of a question than a bug report, though I initially thought it was a bug.
I need to know how many lines are in my CSV file, so I called myCsvReader.Count() (LINQ) to get it. Of course, this moves the cursor forward to the end, since Count() has to iterate the whole file. But then I called myCsvReader.GetEnumerator().Reset(), and even myCsvReader.MoveTo(-1), and neither has any effect.
I found this line of code in MoveTo:
if (record < _currentRecordIndex) return false;
which confirms my suspicion that this is a forward-only reader.
This msdn page (https://msdn.microsoft.com/en-us/library/65zzykke(v=vs.100).aspx) says that
Iterators do not support the IEnumerator.Reset method. To re-iterate from the beginning, you must obtain a new iterator.
But I've tried var en2 = myCsvReader.GetEnumerator(); and that doesn't return a new iterator. It returns the existing iterator, which still points to the end.
Why? Why can't I go back to the beginning?
I didn't do the original design, but I would guess it's about two things:
- Parsing forwards is relatively easy; backwards, much less so.
- Keeping all the data around on the off chance you need it is expensive (think files in the 1 GB+ range).
Also, the most typical processing for CSV files is...
- Read a row
- Process a row
- Next
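That row-at-a-time pattern looks roughly like this with CsvReader (a sketch; the file name is a placeholder):

```csharp
using System.IO;
using LumenWorks.Framework.IO.Csv;

// Forward-only, row-at-a-time processing: read a row, handle it, move on.
// Nothing is kept in memory beyond the current record.
using (var csv = new CsvReader(new StreamReader("data.csv"), true /* hasHeaders */))
{
    int fieldCount = csv.FieldCount;
    while (csv.ReadNextRecord())
    {
        for (int i = 0; i < fieldCount; i++)
        {
            string value = csv[i]; // process the current row's fields here
        }
    }
}
```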
Bear in mind there is an included class called CachedCsvReader, which does allow you to go backwards, but you have to take the memory hit.
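If you can afford that memory hit, a sketch of the cached approach might look like this (file name is a placeholder, and I'm assuming CachedCsvReader's MoveTo accepts earlier record indexes, since that's the point of the cache):

```csharp
using System.IO;
using LumenWorks.Framework.IO.Csv;

using (var csv = new CachedCsvReader(new StreamReader("data.csv"), true))
{
    // First pass: count the records; the cache keeps them in memory.
    long count = 0;
    while (csv.ReadNextRecord())
        count++;

    // Unlike the plain CsvReader, the cached reader can rewind.
    csv.MoveTo(0);
    // ...re-read from the first record, now knowing the total count...
}
```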
Just to give a real-world scenario for this: for my use I don't need to process the CSV line by line, as all I'm doing is loading a CSV file into a SQL Server table using SqlBulkCopy.
For large CSV files with millions of rows I really need the row count in advance so I can update a progress bar as the upload progresses. For larger files an upload might take 30 minutes, so having a progress bar is really essential IMO. But because it's a large file I would also like to minimize the memory footprint, so I'd prefer not to use CachedCsvReader if possible.
So it's frustrating that the basic CsvReader, which would otherwise have been ideal, doesn't have the record count.
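For reference, the streaming half of this scenario can be sketched as follows: CsvReader implements IDataReader, so it can be handed straight to SqlBulkCopy.WriteToServer, and NotifyAfter plus the SqlRowsCopied event drive the progress updates (the connection string, table name, and file name below are placeholders):

```csharp
using System;
using System.Data.SqlClient;
using System.IO;
using LumenWorks.Framework.IO.Csv;

using (var csv = new CsvReader(new StreamReader("data.csv"), true))
using (var bulk = new SqlBulkCopy("Server=.;Database=MyDb;Integrated Security=true"))
{
    bulk.DestinationTableName = "dbo.MyTable";
    bulk.NotifyAfter = 10000; // raise SqlRowsCopied every 10,000 rows
    bulk.SqlRowsCopied += (sender, e) =>
        Console.WriteLine("{0} rows copied so far", e.RowsCopied);

    // CsvReader implements IDataReader, so rows stream straight through
    // without the whole file being loaded into memory.
    bulk.WriteToServer(csv);
}
```

This gives you running progress without a total; turning it into a percentage still needs the row count up front.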
@hugobyrne Same reasons as before: parsing is (potentially) tough. But you could do a couple of things.
- Do a quick first pass that counts "\r\n" which would give you a row count if the data is not complex
- Re-work this idea as part of CsvReader so we can have it as an option e.g. NaiveCount or EstimatedCount
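A naive first pass along those lines could be a plain line count that doesn't use CsvReader at all. This is only a sketch: it assumes one line per record, so it miscounts if quoted fields contain embedded line breaks, which is exactly why it would deserve a name like NaiveCount.

```csharp
using System.IO;

// Naive row count: one line per record. Wrong if quoted fields
// contain embedded newlines, but cheap and memory-constant.
static long CountLines(string path)
{
    long count = 0;
    using (var reader = new StreamReader(path))
    {
        while (reader.ReadLine() != null)
            count++;
    }
    return count;
}
```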
On a slightly related point, have you looked at this: http://sqlblog.com/blogs/alberto_ferrari/archive/2009/11/30/sqlbulkcopy-performance-analysis.aspx? It's relatively old, but it gives a lot of performance analysis and ideas for SqlBulkCopy.
Thank you very much Paul for the fast reply. I'll see what I can work out based on your advice. And that SqlBulkCopy performance analysis is indeed very useful; there are some excellent tips in there which I'll be following up on.
Just use a second reader instance to go through the file a second time.
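Concretely, that means opening the file twice: a throwaway CsvReader pass just to count, then a fresh reader for the actual work, since a new instance starts back at the first record (file name is a placeholder):

```csharp
using System.IO;
using LumenWorks.Framework.IO.Csv;

// Pass 1: count records with a throwaway reader.
long rowCount = 0;
using (var csv = new CsvReader(new StreamReader("data.csv"), true))
{
    while (csv.ReadNextRecord())
        rowCount++;
}

// Pass 2: a brand-new reader begins at the first record again.
using (var csv = new CsvReader(new StreamReader("data.csv"), true))
{
    // ...do the real processing here, with rowCount available
    // for progress reporting...
}
```

Unlike CachedCsvReader this costs a second read of the file rather than memory, which suits the large-file case above.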