pyemu

Delim whitespace

Open wkitlasten opened this issue 3 years ago • 7 comments

Bugging out when reading whitespace-delimited files with multiple spaces. Added an option for sep='w' to trigger delim_whitespace=True in read_csv. Replaced sep='w' in mult2model with a single space. Lots of bonus auto-formatting courtesy of PyCharm.

wkitlasten avatar Aug 11 '22 08:08 wkitlasten

Coverage remained the same at 78.122% when pulling 41c4329ebb37bc96b94e0a09790fb0b8a94bd68c on delim_whitespace into 29b5a75689a2bd8ff63d39cc3960b6eeff3cb1ec on develop.

coveralls avatar Aug 11 '22 10:08 coveralls

I could do with looking into this a bit. I thought we were supporting multiple delims already (with some "cheap" assumptions relating to file extensions if sep was not passed). Anyway, I wonder if passing mfile_sep="\s+" is sufficient?
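For anyone following along, here is a minimal sketch of what a regex separator does in plain pandas. The file contents are made up for illustration, and this is not pyemu code; it only shows that sep=r"\s+" collapses runs of spaces when calling read_csv directly.

```python
import io

import pandas as pd

# Hypothetical model-file contents: columns separated by a varying number of spaces.
text = (
    "k    i   j     value\n"
    "1  10    4   0.5\n"
    "2   3  12     1.25\n"
)

# A regex separator of "one or more whitespace characters" treats each run
# of spaces as a single delimiter.
df = pd.read_csv(io.StringIO(text), sep=r"\s+")

print(df.columns.tolist())
print(df.shape)
```

Whether mfile_sep is passed straight through to read_csv in every code path is not something this sketch checks.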

briochh avatar Aug 11 '22 22:08 briochh

Seems to be equivalent. I was unaware of '\s+' syntax. How will '\s+' be represented/written in the mult2model table?

wkitlasten avatar Aug 12 '22 00:08 wkitlasten

If the file extension is not '.csv' and sep=None (which I think is the default), the model file should be read with sep='\s+', as long as fmt='free' (which I think is also the default). Unless I am missing something (highly probable), I think a "whitespace" delimited model file should "work" by default. If your extension is '.csv', you should just be able to override the default sep (',') with sep='\s+'. (I think)
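The rule described above could be sketched roughly like this. The function name resolve_sep and its signature are hypothetical, made up for illustration; this is not how pyemu actually implements it, and the handling of non-free formats is an assumption.

```python
import os


def resolve_sep(filename, sep=None, fmt="free"):
    """Hypothetical helper (not pyemu's code): choose a separator for a model file."""
    if sep is not None:
        # An explicitly passed separator overrides any extension-based guess.
        return sep
    if fmt != "free":
        # Assumption: fixed-format files are not split on a separator at all.
        return None
    ext = os.path.splitext(filename)[1].lower()
    # '.csv' defaults to a comma; anything else is treated as whitespace delimited.
    return "," if ext == ".csv" else r"\s+"


print(resolve_sep("hk.dat"))               # whitespace default
print(resolve_sep("hk.csv"))               # comma default
print(resolve_sep("hk.csv", sep=r"\s+"))   # explicit override
```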

briochh avatar Aug 12 '22 19:08 briochh

It appears to have been an issue with whitespace before the comment character. Disregard my meddling.

wkitlasten avatar Aug 16 '22 21:08 wkitlasten

Ahhh! There may be some inefficiencies/deficiencies with how comments are handled. Raise an issue on that if you spot something of concern!

briochh avatar Aug 16 '22 21:08 briochh

Reopening. Whitespace before comment was one issue (not pyemu related). But there still seems to be an issue with sep='\s+' when writing to mult2model_info file.

wkitlasten avatar Aug 16 '22 22:08 wkitlasten

I suspect that there will be issues relating to the '\'; the string probably needs to be r'\s+'.

More generally though, I think we need to consider what use case we are trying to cover off here:

If the extension is not ".csv" and the file is a list-like file, then the default is '\s+' internally, so no change is necessary. (Actually ' ' is what gets written to our mult2model info file. We collapse the spaces in the resulting model input files--sorry, not sorry.)

If the extension is ".csv" but we actually have a space-delimited list-like file, you might be able to get away with r'\s+' (in the case when the number of spaces used as the delimiter is > 1 and variable).

If the file is ".csv" and array-like and is actually space delimited with multiple spaces, we may have a bit of a challenge with the current code. This uses the numpy engine which, when sep=None, treats multiple delims as one (I believe). The issue we might have is that if we pass sep=None for these files, internally we store sep=',' for csvs. So this is probably where we need a change.

If the user explicitly says that an array-type file is space delimited (sep=' '), we might need to allow for multiple delims as one (internally change to sep=None).
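As a quick check of the numpy behaviour mentioned above, here is a small sketch using np.loadtxt directly. The array contents are made up, and this assumes the array-like files end up going through something equivalent to loadtxt; with the default delimiter=None, runs of whitespace are treated as a single delimiter.

```python
import io

import numpy as np

# Hypothetical array-like model file with a varying number of spaces between values.
text = "1.0   2.0  3.0\n4.0 5.0     6.0\n"

# delimiter=None (the default) splits on any run of whitespace.
arr = np.loadtxt(io.StringIO(text))

print(arr.shape)
print(arr)
```

This only covers the sep=None case; it does not show what happens once sep=',' has been stored internally for a ".csv" file.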

briochh avatar Aug 17 '22 10:08 briochh

I'm just trying to catch up on this convo. I think generally (given that it is now 2022) we should treat "whitespace" as any combination of one or more spaces and/or tabs. I think that is what sep="\s+" does, and I assume that when we recreate the files for the model at runtime, we can use that same definition of "whitespace": a single space is sufficient (like B said - sorry, not sorry). So about the file extension tho... if sep is not passed, then I think we have to rely on the extension, right? If sep is passed, then ignore the extension?
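To make that definition of "whitespace" concrete, here is a small standard-library sketch. The helper names are hypothetical and this is not pyemu code; it just splits on any run of spaces and/or tabs and writes the fields back with a single space.

```python
import re


def split_whitespace(line):
    """Split on any run of spaces and/or tabs (hypothetical helper)."""
    return re.split(r"\s+", line.strip())


def normalize_line(line):
    """Rewrite a line with exactly one space between fields."""
    return " ".join(split_whitespace(line))


raw = "1 \t 2    3.5\tabc\n"
print(split_whitespace(raw))
print(repr(normalize_line(raw)))
```

Reading the normalized line back with the same r"\s+" pattern gives the same fields, which is why a single space is enough on the way out.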

jtwhite79 avatar Aug 25 '22 16:08 jtwhite79

It was an issue with a space before a comment_char, combined with my inability to navigate the complexity of pst_from. In a whitespace-delimited file, pandas thinks there is a column before the comment_char (as it would with something like ",#comment", which makes sense). Sorry my coding skills are still stuck in 1997 (a great year... if any of you were alive back then)!
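A plain-Python illustration of the idea (this is not pandas' actual parser, and the helper name is made up): once the comment is dropped, a separator left in front of it produces one extra, empty field.

```python
def fields_before_comment(line, sep, comment_char="#"):
    """Drop everything from the comment character on, then split on a literal separator."""
    data = line.split(comment_char, 1)[0]
    return data.split(sep)


# A space before the comment leaves a trailing separator, hence an empty third field.
print(fields_before_comment("1 2 #comment", " "))   # ['1', '2', '']
# The comma-delimited analogue behaves the same way.
print(fields_before_comment("1,2,#comment", ","))   # ['1', '2', '']
# No separator before the comment character: only the two real fields.
print(fields_before_comment("1 2#comment", " "))    # ['1', '2']
```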

wkitlasten avatar Aug 25 '22 22:08 wkitlasten