Support for type casting (design document)
Overview
- Introduction
- How each type will be handled
  - string
  - number
  - literal names (null, true, and false)
  - object
  - array
- Using the feature
  - New command-line options
  - New module options
Introduction
Currently (v0.1.4), all CSV cell values are assumed to be strings and are encoded as such in the JSON output. For example, a cell containing 4.0 becomes "4.0".
However, according to the JSON standard (STD 90, i.e. RFC 8259), a JSON value can be any one of the following:
- string (anything surrounded by quotes)
- number ([ minus ] int [ frac ] [ exp ])
- literal names (null, true, false)
- object (begin-object [ member *( value-separator member ) ] end-object)
- array (begin-array [ value *( value-separator value ) ] end-array)
A type casting feature was requested in #4, so this document describes how the feature will be implemented.
To start, each cell value can be safely evaluated using either ast.literal_eval (for string, number, null, true, false) or json.JSONDecoder (for object and array). This gives us the Python representation of that value using the most appropriate type available. From there, we have to encode each Python type as the appropriate JSON data type. The details of how this will be done are described below.
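The evaluation step might look roughly like the sketch below. The function name evaluate_cell is hypothetical and not part of the project. Note that ast.literal_eval only understands Python spellings (None, True, False), so this sketch keeps numbers only and leaves every other cell as the original string; literal names are assumed to be handled by the user-supplied mappings described later.

```python
import ast
import json


def evaluate_cell(cell):
    """Return a Python value for one CSV cell (hypothetical helper).

    Falls back to the original string whenever the cell cannot be
    evaluated as a number, a JSON object, or a JSON array.
    """
    text = cell.strip()

    # Cells that look like an object or an array go to the JSON decoder.
    looks_like_object = text.startswith("{") and text.endswith("}")
    looks_like_array = text.startswith("[") and text.endswith("]")
    if looks_like_object or looks_like_array:
        try:
            return json.JSONDecoder().decode(text)
        except json.JSONDecodeError:
            return cell

    # Everything else is tried as a Python literal.
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return cell

    # Keep numbers only; bool is excluded because it subclasses int.
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return value
    return cell
```

With this sketch, a cell containing 4.0 evaluates to a float, while a cell containing hello or True stays a string.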
How each type will be handled
string
The handling of this type will not change.
number
JSON supports integers, fractions, and exponents. Converting a number encoded as a string to an integer or float is trivial in Python, but we have to be careful to preserve scientific notation. We want 4E10 to be encoded as 4E10, not 40000000000.0.
Unfortunately, we may lose the original scientific notation when we convert the cell value to a float, so the default JSONEncoder will not know to keep the number in scientific notation when the values are converted to JSON. Here is the solution that I propose:
- Cell values that can be evaluated to a float but also contain an exponent (e or E) will be wrapped in decimal.Decimal.
- Python's JSONEncoder will be extended to support the encoding of decimal.Decimal so that we can preserve the original scientific notation of a value.
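One possible shape for the two steps above is sketched here. The names wrap_number and DecimalEncoder are hypothetical, and the placeholder technique is an assumption rather than a settled design: JSONEncoder.default cannot return raw number text directly, so the sketch emits a unique string token and swaps it for the number afterwards. Also note that str(Decimal("4E10")) is 4E+10, which is a valid JSON number but not byte-identical to the input, and that Decimal may write some small exponents out in full (Decimal("1.5e-3") prints as 0.0015).

```python
import json
import uuid
from decimal import Decimal


def wrap_number(text):
    """Return a float, or a Decimal when the text contains an exponent."""
    value = float(text)  # raises ValueError if text is not a number
    if "e" in text.lower():
        return Decimal(text)
    return value


class DecimalEncoder(json.JSONEncoder):
    """Hypothetical encoder that writes Decimal values as bare numbers."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._raw = {}

    def default(self, o):
        if isinstance(o, Decimal):
            # Emit a unique string token now; encode() replaces it later.
            token = "__decimal_{}__".format(uuid.uuid4().hex)
            self._raw[token] = str(o)
            return token
        return super().default(o)

    def encode(self, o):
        # Only encode() does the replacement; iterencode() is not covered.
        self._raw = {}
        text = super().encode(o)
        for token, literal in self._raw.items():
            # The token was written as a quoted JSON string, so the
            # surrounding quotes are removed together with it.
            text = text.replace('"{}"'.format(token), literal)
        return text
```

A call such as json.dumps({"n": wrap_number("4E10")}, cls=DecimalEncoder) then keeps the exponent in the output instead of writing 40000000000.0.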
literal names (null, true, and false)
Currently, if a cell in a row of the CSV file is left out, its corresponding field is not included in the output JSON object. This is a sensible default. However, the user should be able to map values to null if there's a semantic reason for doing so. For example, the string "N/A" would be a good candidate for null replacement.
As I mentioned in #4, there is no standard way to represent a boolean value in a CSV file. Because of this, if the user wants to automatically convert cell values to JSON boolean values, they must specify a mapping.
The takeaway from this section is that if the user wants to use null, true, or false in the output JSON, they must specify a literal name mapping for those values. For example:
{
true : [1, "1", "yes", "YES", "y", "Y", "true", "True", "T"],
false : [0, "0", "no", "NO", "n", "N", "false", "False", "F"],
null : ["N/A", "None", "none"]
}
The user will be able to express the above using individual flags for true values, false values, and null values (instead of a single map for all literals).
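As a rough illustration of how such mappings could be applied, the sketch below turns three value lists into a single lookup table. The names build_lookup and cast_literal are hypothetical, and comparing cells by their string form is an assumption made here so that 1 and "1" behave the same.

```python
import json


def build_lookup(true_values=(), false_values=(), null_values=()):
    """Map each user-supplied cell value to True, False, or None."""
    lookup = {}
    groups = ((True, true_values), (False, false_values), (None, null_values))
    for literal, values in groups:
        for value in values:
            # CSV cells are read as strings, so keys are stored as strings.
            lookup[str(value)] = literal
    return lookup


def cast_literal(cell, lookup):
    """Return the mapped literal, or the cell unchanged if it is unmapped."""
    return lookup.get(cell, cell)


lookup = build_lookup(
    true_values=[1, "yes", "Y"],
    false_values=[0, "no", "N"],
    null_values=["N/A"],
)
row = {"active": cast_literal("yes", lookup), "note": cast_literal("N/A", lookup)}
print(json.dumps(row))
```

Values that appear in none of the lists pass through untouched, which keeps the current string behavior as the default.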
object
If type casting for objects is enabled, the assumption will be that any cell value that starts with an opening curly brace and ends with a closing curly brace contains a JSON object. JSONDecoder will be used to decode the object.
However, Python's JSONDecoder won't work out of the box because we need to preserve scientific notation in floats and to parse boolean values as described in the previous sections. Thus, we will extend the decoder similarly to how we will extend the encoder (see the number section).
array
If type casting for arrays is enabled, the assumption will be that any cell value that starts with an opening square bracket and ends with a closing square bracket contains a JSON array. We will handle this in the same way we handle objects: by using a custom JSON decoder that takes into account our special handling of boolean and float values.
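A minimal sketch of such a decoder for both objects and arrays follows. It relies on the parse_float hook that json.JSONDecoder already provides; the helper names and the recursive literal pass are assumptions for illustration, not the final design.

```python
import json
from decimal import Decimal


def parse_float_keep_exponent(text):
    """Keep numbers written with an exponent as Decimal, others as float."""
    if "e" in text.lower():
        return Decimal(text)
    return float(text)


def apply_literals(value, lookup):
    """Recursively replace mapped strings with True, False, or None."""
    if isinstance(value, str):
        return lookup.get(value, value)
    if isinstance(value, list):
        return [apply_literals(item, lookup) for item in value]
    if isinstance(value, dict):
        return {key: apply_literals(item, lookup) for key, item in value.items()}
    return value


def decode_container(cell, lookup):
    """Decode a cell that looks like a JSON object or array.

    Returns the original string if the cell has neither shape or if it
    is not valid JSON.
    """
    text = cell.strip()
    is_object = text.startswith("{") and text.endswith("}")
    is_array = text.startswith("[") and text.endswith("]")
    if not (is_object or is_array):
        return cell
    decoder = json.JSONDecoder(parse_float=parse_float_keep_exponent)
    try:
        decoded = decoder.decode(text)
    except json.JSONDecodeError:
        return cell
    return apply_literals(decoded, lookup)
```

Here lookup is assumed to be a plain dict from cell strings to True, False, or None, for example {"yes": True, "no": False, "N/A": None}.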
Using the feature
New command-line options
--type-cast
-t --type-cast ["string", "int", "float", "object", "array", "true", "false", "null"]
- When possible, convert CSV values to one or more specified data types supported by JSON.
- The default is ["string", "int", "float", "object", "array"].
- If "true" is included, the values to convert to true must be specified using --true-values.
- If "false" is included, the values to convert to false must be specified using --false-values.
- If "null" is included, the values to convert to null must be specified using --null-values.
--true-values
-t --true-values [value1, value2, ...]
- Used to specify which values should be converted to true in the JSON output.
- Required when "true" is included in the input array for --type-cast.
- [value1, value2, ...] should be a Python list.
--false-values
-t --false-values [value1, value2, ...]
- Used to specify which values should be converted to false in the JSON output.
- Required when "false" is included in the input array for --type-cast.
- [value1, value2, ...] should be a Python list.
--null-values
-n --null-values [value1, value2, ...]
- Used to specify which values should be converted to null in the JSON output.
- Required when "null" is included in the input array for --type-cast.
- [value1, value2, ...] should be a Python list.
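The options above could be wired up with argparse along the lines of the following sketch. It is an assumption-heavy illustration: only the long option names from this document are used, the short flags are left out, the program name is made up, and parsing the "Python list" arguments with ast.literal_eval is one possible choice rather than a decided one.

```python
import argparse
import ast

TYPE_CHOICES = ["string", "int", "float", "object", "array", "true", "false", "null"]
DEFAULT_TYPES = ["string", "int", "float", "object", "array"]


def python_list(text):
    """Parse a command-line argument written as a Python list."""
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        raise argparse.ArgumentTypeError("expected a Python list, got %r" % text)
    if not isinstance(value, list):
        raise argparse.ArgumentTypeError("expected a Python list, got %r" % text)
    return value


def parse_options(argv):
    """Parse and validate the proposed options (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="type-cast-sketch")
    parser.add_argument("--type-cast", type=python_list, default=DEFAULT_TYPES)
    parser.add_argument("--true-values", type=python_list)
    parser.add_argument("--false-values", type=python_list)
    parser.add_argument("--null-values", type=python_list)
    args = parser.parse_args(argv)

    unknown = [name for name in args.type_cast if name not in TYPE_CHOICES]
    if unknown:
        parser.error("unknown types for --type-cast: %s" % unknown)

    # Each literal name needs its matching list of values.
    for literal in ("true", "false", "null"):
        values = getattr(args, literal + "_values")
        if literal in args.type_cast and values is None:
            parser.error(
                '--%s-values is required when "%s" is in --type-cast'
                % (literal, literal)
            )
    return args
```

For example, parse_options(["--type-cast", '["int", "true"]', "--true-values", '["yes", 1]']) succeeds, while requesting "null" without --null-values exits with a usage error.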
New module options
This may change after #1 is implemented, so this will be finalized later.