CSVImporter

Automatically determine the encoding of the file

Open dkalinai opened this issue 6 years ago • 7 comments

Hi there, again thanks for making this since it saves tons of time.

Could you point me to the code, or explain, how the importer determines which encoding a file uses when importing? I need to extract this information and I'm not sure how to do that. Maybe you can give me a hint where to look. This is not a bug, more a request for information. And is there actually automatic encoding detection, or am I misinterpreting things?

```swift
guard let csv = CSVImporter<[String: String]>(url: fileURL) else { return }

csv.startImportingRecords(structure: { headerValues in
    print(headerValues)
}) { $0 }.onFinish { importedRecords in
    print(importedRecords)
}
```

dkalinai avatar Feb 11 '18 20:02 dkalinai

I think what you're looking for is this line. This means that when creating a CSVImporter object, you can pass a parameter named encoding with your file's encoding. By default it's set to .utf8.

Jeehut avatar Feb 12 '18 09:02 Jeehut

I see, so there is no way to determine the encoding automatically then. Tough :( There is a method on NSString that computes it from an NSData object. I will try to look into that on my own, then. Thanks for having a look.
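For reference, the NSString method mentioned here is presumably `NSString.stringEncoding(for:encodingOptions:convertedString:usedLossyConversion:)`, which is available on Apple platforms. The sketch below is not that method: it is a simpler, portable stand-in that only checks for a byte-order mark (BOM) and otherwise tests whether the bytes are valid UTF-8. The function name `sniffEncoding` is hypothetical and not part of CSVImporter.

```swift
import Foundation

/// Hypothetical helper (not part of CSVImporter): guesses an encoding from
/// a byte-order mark, falling back to a strict UTF-8 validity check.
func sniffEncoding(of data: Data) -> String.Encoding? {
    let bytes = [UInt8](data.prefix(4))
    // Check 4-byte BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
    if bytes.starts(with: [0x00, 0x00, 0xFE, 0xFF]) { return .utf32BigEndian }
    if bytes.starts(with: [0xFF, 0xFE, 0x00, 0x00]) { return .utf32LittleEndian }
    if bytes.starts(with: [0xEF, 0xBB, 0xBF]) { return .utf8 }
    if bytes.starts(with: [0xFE, 0xFF]) { return .utf16BigEndian }
    if bytes.starts(with: [0xFF, 0xFE]) { return .utf16LittleEndian }
    // No BOM: accept UTF-8 only if every byte sequence decodes cleanly.
    if String(data: data, encoding: .utf8) != nil { return .utf8 }
    // Anything else (e.g. a legacy 8-bit encoding) cannot be told apart here.
    return nil
}
```

A BOM check cannot distinguish legacy 8-bit encodings such as Latin-1 from one another, so a `nil` result would still need a user-supplied encoding.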

dkalinai avatar Feb 12 '18 10:02 dkalinai

As you can see here, we already have logic in place that automatically determines the file's line ending when .unknown is specified by the user. I'm not against using the exact same strategy for encoding as well. This could be done, for example, by making encoding an optional parameter on the init method; if it is nil, we could use the FileSource to determine the encoding.

Would you be up for adding this feature yourself and sending a PR with tests and docs updated? If there's a method on NSString/NSData that can handle that, then it should be pretty straightforward to implement, since the same logic for line endings is already in place. That would be really awesome. I'm reopening this issue and renaming it to describe this feature.
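The "detect only when unspecified" strategy described above can be sketched roughly as follows. This is an illustration, not the library's actual implementation: the type `LineEndingGuess` and both function names are hypothetical, and an optional `String.Encoding?` parameter could be resolved with the same pattern.

```swift
import Foundation

// Hypothetical stand-in for a line-ending type; names are illustrative only.
enum LineEndingGuess: String {
    case nl = "\n"
    case cr = "\r"
    case crlf = "\r\n"
}

/// Guesses the line ending from the first line break found in the text.
func detectLineEnding(in text: String) -> LineEndingGuess? {
    // Work on raw UTF-8 bytes, because Swift treats "\r\n" as one Character.
    let bytes = Array(text.utf8)
    for (index, byte) in bytes.enumerated() {
        if byte == 0x0A { return .nl }          // "\n" seen first
        if byte == 0x0D {                       // "\r": CRLF if "\n" follows
            let next = index + 1 < bytes.count ? bytes[index + 1] : 0
            return next == 0x0A ? .crlf : .cr
        }
    }
    return nil                                  // no line break in the sample
}

/// Uses the caller's value when given; detects it only when it is nil.
func resolveLineEnding(_ requested: LineEndingGuess?, sample: String) -> LineEndingGuess {
    return requested ?? detectLineEnding(in: sample) ?? .nl
}
```

Falling back to `.nl` when nothing can be detected is an arbitrary choice made for this sketch.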

Jeehut avatar Feb 13 '18 07:02 Jeehut

I can make a PR, probably in the next few days. I have already found a solution to this, by the way, and made a simple String extension that returns the encoding to me in String.Encoding format.

The only other issue, and it's a bit off topic here, is the delimiters (the delimiter can sometimes be ; as well) and whether one can process a string from memory as a CSV file. The NSString method I am referring to not only guesses the encoding but also returns the converted string, which would potentially need to be handled by the importer on the fly rather than from a file.
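As an illustration of the delimiter question, a delimiter could be guessed from an in-memory string by counting candidate characters in the first line. This is only a sketch under that assumption; `guessDelimiter` is a hypothetical name, not a CSVImporter API.

```swift
import Foundation

/// Hypothetical helper: returns the candidate that occurs most often in the
/// first line of `text`, ignoring anything inside double quotes.
func guessDelimiter(in text: String,
                    candidates: [Character] = [",", ";", "\t"]) -> Character? {
    var counts: [Character: Int] = [:]
    var insideQuotes = false
    for character in text {
        if character == "\"" { insideQuotes.toggle(); continue }
        if insideQuotes { continue }            // quoted text may contain delimiters
        if character.isNewline { break }        // only inspect the first line
        if candidates.contains(character) { counts[character, default: 0] += 1 }
    }
    // Pick the most frequent candidate; ties go to the earlier candidate.
    var best: Character? = nil
    var bestCount = 0
    for candidate in candidates {
        let count = counts[candidate] ?? 0
        if count > bestCount {
            best = candidate
            bestCount = count
        }
    }
    return best
}
```

A single-column file contains no delimiter at all, so the function returns `nil` in that case and the caller has to fall back to a default.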

dkalinai avatar Feb 13 '18 07:02 dkalinai

Sounds good. Note that one of the advantages of CSVImporter is that it's able to read big files faster and more safely since it doesn't read the entire file at once, which your solution probably does. So that's another plus for implementing this in CSVImporter.
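To keep that advantage, any detection step would ideally look at only the beginning of the file. A minimal sketch of reading a bounded prefix with `FileHandle`, assuming a sample of a few kilobytes is enough for sniffing (the helper name `readPrefix` and the default size are hypothetical):

```swift
import Foundation

/// Hypothetical helper: reads at most `byteCount` bytes from the start of a
/// file, so large files are never loaded into memory as a whole.
func readPrefix(of url: URL, byteCount: Int = 4096) throws -> Data {
    let handle = try FileHandle(forReadingFrom: url)
    defer { handle.closeFile() }                // always release the file handle
    return handle.readData(ofLength: byteCount)
}
```

Note that a fixed-size prefix can end in the middle of a multi-byte character, so a strict decoding check on the sample may fail even when the file as a whole is valid.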

I don't really understand your other problems though. You probably would need to post some code so I can understand. Note though, that if it's a different problem than this one, it's probably better you open another issue for each problem.

Jeehut avatar Feb 13 '18 07:02 Jeehut

How does CSVImporter handle garbage? I have a specific data structure, but it can be corrupted, or fields can be missing or added, so I need to add some regex.

gaming-hacker avatar Mar 09 '18 18:03 gaming-hacker

CSVImporter generally expects a valid CSV file according to RFC 4180 which specifies:

> Within the header and each record, there may be one or more fields, separated by commas. Each line should contain the same number of fields throughout the file. Spaces are considered part of a field and should not be ignored. The last field in the record must not be followed by a comma.

When a line for example doesn't have the same number of fields, then – at the moment – the entire line is simply ignored. That's not required by the RFC though (that's why it's a "should" not a "must"), so we could implement multiple different fallback strategies and let the user choose between them.
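To make the idea of selectable fallback strategies concrete, here is a rough sketch. None of these names exist in CSVImporter; `MismatchStrategy` and `normalize` are hypothetical, and the rows are assumed to be already split into fields.

```swift
import Foundation

// Hypothetical opt-in strategies for lines with an unexpected field count.
enum MismatchStrategy {
    case skipLine        // drop the line, as described above
    case padOrTruncate   // fill missing fields with "", drop surplus fields
}

/// Applies the chosen strategy to every row whose field count is wrong.
func normalize(rows: [[String]],
               expectedCount: Int,
               strategy: MismatchStrategy) -> [[String]] {
    return rows.compactMap { fields -> [String]? in
        if fields.count == expectedCount { return fields }
        switch strategy {
        case .skipLine:
            return nil
        case .padOrTruncate:
            let missing = max(0, expectedCount - fields.count)
            let padded = fields + Array(repeating: "", count: missing)
            return Array(padded.prefix(expectedCount))
        }
    }
}
```

With `.skipLine` as the default, existing behavior would stay unchanged and the more lenient strategy would remain opt-in.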

Can you give examples of lines and how they are "corrupted"? Depending on the case, I'm perfectly okay with a little more accommodating behavior, so long as it doesn't conflict with the RFC.

Feel free to post a PR with the changes you need and I'll have a look. As long as it is an opt-in feature, is documented (in the README) and is covered by tests (your corrupted file), I'm happy to merge it!

Jeehut avatar Mar 12 '18 09:03 Jeehut