Escaping non-ASCII data in JSONL output based on user's choice?
Hello!
This is really meant to be a suggestion/question:
In JSONL output, could snscrape pass ensure_ascii={true|false} to json.dump() based on a new command-line parameter?
Currently snscrape does not appear to set this flag in its json.dump() call, and since it defaults to True, all non-ASCII data in the JSONL output is unconditionally escaped.
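For reference, the behaviour in question can be reproduced with the standard library alone (the sample record below is made up, not actual snscrape output):

```python
import json

record = {"content": "नमस्ते"}  # made-up record with Devanagari text

# Default (ensure_ascii=True): every non-ASCII character becomes \uXXXX.
escaped = json.dumps(record)

# ensure_ascii=False: the characters pass through unchanged.
unescaped = json.dumps(record, ensure_ascii=False)

print(escaped)    # {"content": "\u0928\u092e\u0938\u094d\u0924\u0947"}
print(unescaped)  # {"content": "नमस्ते"}

# Both serialisations parse back to the same object.
assert json.loads(escaped) == json.loads(unescaped) == record
```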
Is it already done? Is there an undocumented command line param that does this? Or am I missing something?
Thanks.
Why, out of curiosity? Isn't escaping the data safer? (Sure, it uses more space, but...)
Escaping the data is surely safer, and even space is not much of a constraint. But one primary use case of snscrape is to automate the collection and analysis of data, and if you expect Unicode data and know that you can handle it safely, then this default escaping just adds an extra step of converting the escaped output back to binary. Escaping also removes the human readability of this data. So escaping is definitely good and needed; it should just be conditional based on the user's choice.
@JustAnotherArchivist can you please comment if there's already an undocumented way to achieve this?
There is not, unless you implement your own JSON serialisation stuff or do ugly monkeypatches.
> this default escaping just adds an extra step of converting the escaped output back to binary.
If you process JSON data by hand, you're going to have a bad time anyway. Use a proper JSON parser, and the escaping will be handled correctly without any extra steps from you.
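For example, any JSON parser transparently decodes the escapes, so no manual unescaping step is needed (the line below is a made-up example, not actual snscrape output):

```python
import json

# An escaped JSONL line, as produced with ensure_ascii=True.
line = '{"content": "\\u00e9t\\u00e9"}'

obj = json.loads(line)
print(obj["content"])  # été — the parser decodes the \uXXXX escapes itself
```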
There is one technical argument where disabling the escaping could be useful, and that's if there are unpaired surrogate characters in the data. However, no social networks should allow such content, snscrape's modules would likely blow up with decoding errors if they ever occurred, and storing such characters in JSON is undefined behaviour anyway. In all other cases, escaped data should not pose any issues, but unescaped data may well do so in broken environments unable to process UTF-8 (e.g. #122).
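The surrogate corner case can be illustrated in CPython (just a sketch of the standard-library behaviour, not something snscrape itself does):

```python
import json

lone = "\ud800"  # an unpaired high surrogate

# With escaping, the encoder happily emits it as \ud800 ...
print(json.dumps(lone))  # "\ud800"

# ... but the unescaped form cannot be written as valid UTF-8.
try:
    json.dumps(lone, ensure_ascii=False).encode("utf-8")
except UnicodeEncodeError:
    print("unpaired surrogate cannot be encoded to UTF-8")
```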
Do you have a use case for unescaped data which does not involve manual JSON parsing?
Ok. Well, I'm not processing the data by hand, but I do still need to be able to take an occasional cursory look at it. And since some of my data can be in Indic languages/scripts, I need the output in human-readable form because I don't know which random portion of it might get manually reviewed. As for the issues with unescaped data (e.g. #122 above), that seemed like an environment-specific issue, not a generic issue with a platform. In general, all modern versions of major operating systems support Unicode either natively or fully, so not escaping the data should not be that "unsafe". In the rare scenario that the source data itself is corrupt, I believe there would be errors even upstream of escaping and writing it, i.e. snscrape (or any module for that matter) would itself have problems reading such data, wouldn't it?
But, in summary, to your question: No, I don't have any use case for unescaped data that doesn't involve manual parsing/reading. So we can close this issue as a "non-issue".
Sincerely thank you for your quick and detailed response! :)
If I was in your position, I'd create a script that unescapes the JSON. That way you can unescape it without this feature being added to snscrape. :-)
:) That is what I have done. But then I thought it might be more appropriate to add a switch to the snscrape itself.
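For completeness, such an unescaping script is a very short filter. A minimal sketch (the function name is made up, not part of snscrape):

```python
import json
import sys

def unescape_jsonl(lines):
    """Re-serialise each JSONL line with ensure_ascii=False.

    Parsing decodes the ``\\uXXXX`` escapes; dumping with
    ensure_ascii=False keeps the characters human-readable.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        yield json.dumps(json.loads(line), ensure_ascii=False)

# Used as a pipeline filter, e.g.:
#   for out in unescape_jsonl(sys.stdin):
#       print(out)
```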