bcpandas icon indicating copy to clipboard operation
bcpandas copied to clipboard

Support more possible delimiter and quote characters

Open owen-synergy opened this issue 4 years ago • 7 comments

Hi Josh, I'm not sure if this is possible, but can BCP and Pandas to_csv support more delimiter and quote characters than just the ones you've specified so far? I'm possibly trying to push the limits of bcpandas too far by writing DataFrames with big json blob elements (I know you don't recommend using it with messy text data) but I didn't know if there was still potential to improve it in that regard.

Happy to try creating a PR if you think there's still potential improvements here.

owen-synergy avatar Jul 09 '20 03:07 owen-synergy

Fair point. Technically, I think every single character is fair game to be used as a delimiter or quote character, as long as it doesn't appear in the actual data. We can (and do) check for this in constants.py. I only specified those specific options because they are the most common ones. So we could change it to allow for all characters in ASCII 32-127. I was just unsure how this would actually work in real life as it is a bit unorthodox, and I've never tried it. But we can try it.

yehoshuadimarsky avatar Jul 09 '20 15:07 yehoshuadimarsky

I'll have to double check the BCP docs again... I think I remember it having to be a visible character or a few others (maybe that is the 32 - 127 range...). Using other characters for quotes does seem a bit funny to me too... I didn't see anything in the Pandas documentation as what characters it would allow for that.

I'll see if I can find some time to experiment. I was going to try and extract the set of all characters that appear in the DataFrame to check if there's two left over that can be used.

owen-synergy avatar Jul 10 '20 01:07 owen-synergy

Taking a stab now at working through it... Here's the link for Microsoft docs about terminators https://docs.microsoft.com/en-us/sql/relational-databases/import-export/specify-field-and-row-terminators-sql-server

yehoshuadimarsky avatar Jul 10 '20 03:07 yehoshuadimarsky

Nice, I made this code snippet to grab all characters from all strings in a DataFrame, not sure if there's a more efficient way...

all_chars = set()
for col in df.columns:
    try:
        all_chars.update(set(df[col].str.cat(sep="")))
    except AttributeError:
        pass

owen-synergy avatar Jul 10 '20 03:07 owen-synergy

There is a better way - it's easier to just go through all the possible valid ASCII values, and find one that it's not in the DataFrame using the same code we currently use

https://github.com/yehoshuadimarsky/bcpandas/blob/d6b16141e40bbabc28ea12ed21ed05eaaf999c51/bcpandas/constants.py#L64-L68

(learned how to do this here)

yehoshuadimarsky avatar Jul 10 '20 03:07 yehoshuadimarsky

Any chance to add | (pipeline char) as the delimiter? That seems to be less likely to appear in the actual data, and somewhat commonly used as a delimiter.

puruzio avatar Apr 21 '21 05:04 puruzio

It already is one of the delimiter options https://github.com/yehoshuadimarsky/bcpandas/blob/db37a2c1ffba7f19a01019b833ba526622ae577c/bcpandas/constants.py#L38

yehoshuadimarsky avatar Apr 21 '21 13:04 yehoshuadimarsky