
Support for type casting

Open chamkank opened this issue 5 years ago • 0 comments

Support for type casting (design document)

Overview

  • Introduction
  • How each type will be handled
    • string
    • number
    • literal names (null, true, and false)
    • object
    • array
  • Using the feature
    • New command-line options
    • New module options

Introduction

Currently (v0.1.4), all CSV cell values are assumed to be strings and are encoded as such in the JSON output. For example, a cell containing 4.0 becomes "4.0".

However, according to the JSON standard (STD 90, i.e. RFC 8259), a JSON value can be any one of the following:

  • string (anything surrounded by quotes)
  • number ([ minus ] int [ frac ] [ exp ])
  • literal names
    • null
    • true
    • false
  • object (begin-object [ member *( value-separator member ) ] end-object)
  • array (begin-array [ value *( value-separator value ) ] end-array)

A type casting feature was requested in #4, so this document describes how the feature will be implemented.

To start, each cell value can be safely evaluated using either ast.literal_eval (for string, number, null, true, false) or json.JSONDecoder (for object and array). This gives us the Python representation of the value in the most appropriate type available. From there, we have to encode each type as the appropriate JSON data type. The details of how this will be done are described below.
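As a rough illustration of the evaluation step, here is a minimal sketch. The function name evaluate_cell is hypothetical, and the choice to fall back to the original string when evaluation fails (or yields a type that needs an explicit mapping, such as a boolean) is an assumption, not something this document has settled.

```python
import ast
import json


def evaluate_cell(cell):
    """Hypothetical helper: best-effort Python value for one CSV cell."""
    text = cell.strip()

    # Objects and arrays go through the JSON decoder.
    if text.startswith(("{", "[")):
        try:
            return json.JSONDecoder().decode(text)
        except json.JSONDecodeError:
            return cell

    # Everything else goes through ast.literal_eval, which only accepts
    # Python literals and never executes code.
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError, TypeError):
        return cell

    # Assumption: only plain numbers and quoted strings are accepted here.
    # bool is excluded because booleans need a user-supplied mapping.
    if isinstance(value, (int, float, str)) and not isinstance(value, bool):
        return value
    return cell
```

With this sketch, a cell such as N/A is not a valid Python literal, so it stays a string until a literal name mapping says otherwise.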

How each type will be handled

string

The handling of this type will not change.

number

JSON supports integers, fractions, and exponents. Converting a number encoded as a string to an integer or float is trivial in Python, but we have to be careful to preserve scientific notation. We want 4E10 to be encoded as 4E10, not 40000000000.0.

Unfortunately, we may lose the original scientific notation when we convert the cell value to a float, so when the values are converted to JSON, the default JSONEncoder will not know to keep the number in scientific notation. Here is the solution that I propose:

  1. Cell values that can be evaluated to a float but also contain an exponent (e or E) will be wrapped in decimal.Decimal.
  2. Python's JSONEncoder will be extended to support the encoding of decimal.Decimal so that we can preserve the original scientific notation of a value.
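The two steps above could look roughly like the following. This is only a sketch: cast_number and DecimalEncoder are hypothetical names, and the marker-and-substitute trick is one possible workaround, chosen because JSONEncoder.default can only return values the encoder already knows how to serialize. It assumes no real string value starts with the marker, and it only covers json.dumps (json.dump uses iterencode instead of encode).

```python
import json
import re
from decimal import Decimal


def cast_number(cell):
    """Return int, float, or Decimal for numeric cells; else the cell unchanged."""
    text = cell.strip()
    try:
        return int(text)
    except ValueError:
        pass
    try:
        as_float = float(text)
    except ValueError:
        return cell
    # Step 1: a float written with an exponent is wrapped in Decimal.
    if "e" in text.lower():
        return Decimal(text)
    return as_float


class DecimalEncoder(json.JSONEncoder):
    """Hypothetical encoder that writes Decimal values as bare JSON numbers."""

    _MARKER = "__decimal__:"  # assumed never to appear in real cell values

    def default(self, o):
        # Step 2: tag the Decimal as a marked string for the base encoder.
        if isinstance(o, Decimal):
            return self._MARKER + str(o)
        return super().default(o)

    def encode(self, o):
        # Strip the quotes and the marker so the number is emitted unquoted.
        text = super().encode(o)
        return re.sub(r'"__decimal__:([^"]+)"', r"\1", text)


row = {"mass": cast_number("4E10"), "count": cast_number("7")}
print(json.dumps(row, cls=DecimalEncoder))  # {"mass": 4E+10, "count": 7}
```

Note that decimal.Decimal normalizes the exponent when converted to a string, so 4E10 comes out as 4E+10. That is still a valid JSON number in scientific notation, but it is not byte-for-byte the original cell text.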

literal names (null, true, and false)

Currently, if a cell in a row of the CSV file is left out, its corresponding field is not included in the output JSON object. This is a sensible default. However, the user should be able to map values to null if there's a semantic reason for doing so. For example, the string "N/A" would be a good candidate for null replacement.

As I mentioned in #4, there is no standard way to represent a boolean value in a CSV file. Due to this, if the user wants to automatically convert cell values to JSON boolean values, they must specify a mapping.

The takeaway from this section is that if the user wants to use null, true, or false in the output JSON, they must specify a literal name mapping for those values. For example:

{
  true : [1, "1", "yes", "YES", "y", "Y", "true", "True", "T"],
  false : [0, "0", "no", "NO", "n", "N", "false", "False", "F"],
  null : ["N/A", "None", "none"]
}

The user will be able to express the above using individual flags for true values, false values, and null values (instead of a single map for all literals).
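One way the three value lists could be combined into a single lookup is sketched below. The names build_literal_lookup, cast_literal, and NO_MATCH are hypothetical. The sketch assumes matching is exact and case-sensitive, and that non-string entries such as 1 are compared by their string form, since CSV cells arrive as strings.

```python
import json

# Sentinel for "no mapping applies"; None cannot be used because it is
# itself a valid result (JSON null).
NO_MATCH = object()


def build_literal_lookup(true_values=(), false_values=(), null_values=()):
    """Merge the three user-supplied lists into one cell-text lookup."""
    lookup = {}
    groups = ((True, true_values), (False, false_values), (None, null_values))
    for literal, values in groups:
        for value in values:
            lookup[str(value)] = literal
    return lookup


def cast_literal(cell, lookup):
    """Return True, False, or None for mapped cells, else NO_MATCH."""
    return lookup.get(cell, NO_MATCH)


lookup = build_literal_lookup(
    true_values=[1, "yes", "Y"],
    false_values=[0, "no", "N"],
    null_values=["N/A", "None"],
)
cells = ["yes", "0", "N/A"]
print(json.dumps([cast_literal(c, lookup) for c in cells]))  # [true, false, null]
```

Cells that return NO_MATCH would then fall through to the other casts (number, object, array) or stay strings.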

object

If type casting for objects is enabled, the assumption will be that any cell value that starts with an opening curly brace and ends with a closing curly brace will contain a JSON object. JSONDecoder will be used to decode the object.

However, Python's JSONDecoder won't work out-of-the-box because of our need to preserve scientific notation in floats, and to parse boolean values as mentioned in the previous sections. Thus, we will extend the decoder similarly to how we will extend the encoder (see number section).

array

If type casting for arrays is enabled, the assumption will be that any cell value that starts with an opening square bracket and ends with a closing square bracket will contain a JSON array. We will handle this in the same way we handle objects: by using a custom JSON decoder that takes into account our special handling of boolean and float values.
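For the float half of that custom decoding, the standard library already offers a hook: json.loads accepts a parse_float callable that receives the raw text of every JSON number with a fraction or exponent. The sketch below uses it for both objects and arrays. The names keep_exponent and cast_container are hypothetical, the user-defined boolean mapping is left out, and returning the original string for invalid JSON is an assumption.

```python
import json
from decimal import Decimal


def keep_exponent(number_text):
    """parse_float hook: keep exponent-form numbers as Decimal."""
    if "e" in number_text.lower():
        return Decimal(number_text)
    return float(number_text)


def cast_container(cell, allow_object=True, allow_array=True):
    """Decode a cell that looks like a JSON object or array, else return it as-is."""
    text = cell.strip()
    is_object = allow_object and text.startswith("{") and text.endswith("}")
    is_array = allow_array and text.startswith("[") and text.endswith("]")
    if not (is_object or is_array):
        return cell
    try:
        return json.loads(text, parse_float=keep_exponent)
    except json.JSONDecodeError:
        # Looks like a container but is not valid JSON: keep the string.
        return cell
```

For example, cast_container('{"a": 4E10}') yields a dict whose "a" value is a decimal.Decimal rather than a float, so an encoder that understands Decimal can write it back in scientific notation.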

Using the feature

New command-line options

--type-cast

-t --type-cast ["string", "int", "float", "object", "array", "true", "false", "null"]
  - When possible, convert CSV values to one or more specified data types supported by JSON. 
  - The default is ["string", "int", "float", "object", "array"].
  - If "true" is included, the values to convert to true must be specified using --true-values.
  - If "false" is included, the values to convert to false must be specified using --false-values.
  - If "null" is included, the values to convert to null must be specified using --null-values.

--true-values

-t --true-values [value1, value2, ...]
  - Used to specify which values should be converted to true in the JSON output.
  - Required when "true" is included in the input array for --type-cast.
  - [value1, value2, ...] should be a Python list.

--false-values

-t --false-values [value1, value2, ...]
  - Used to specify which values should be converted to false in the JSON output.
  - Required when "false" is included in the input array for --type-cast.
  - [value1, value2, ...] should be a Python list.

--null-values

-n --null-values [value1, value2, ...]
  - Used to specify which values should be converted to null in the JSON output.
  - Required when "null" is included in the input array for --type-cast.
  - [value1, value2, ...] should be a Python list.
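To make the interaction between these options concrete, here is a sketch using argparse. It is not the final interface: only the long option names are registered (the short flags are omitted here), the helper names python_list and parse_options are hypothetical, and parsing each list with ast.literal_eval is an assumption based on the "should be a Python list" requirement.

```python
import argparse
import ast

DEFAULT_TYPES = ["string", "int", "float", "object", "array"]


def python_list(text):
    """argparse type: parse the option value as a Python list literal."""
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError) as exc:
        raise argparse.ArgumentTypeError(f"not a Python list: {text!r}") from exc
    if not isinstance(value, list):
        raise argparse.ArgumentTypeError(f"not a Python list: {text!r}")
    return value


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("--type-cast", type=python_list, default=DEFAULT_TYPES)
    parser.add_argument("--true-values", type=python_list)
    parser.add_argument("--false-values", type=python_list)
    parser.add_argument("--null-values", type=python_list)
    return parser


def parse_options(argv):
    """Parse argv and enforce the 'required when included' rules."""
    parser = build_parser()
    args = parser.parse_args(argv)
    for literal in ("true", "false", "null"):
        values = getattr(args, f"{literal}_values")
        if literal in args.type_cast and values is None:
            parser.error(
                f'--{literal}-values is required when "{literal}" is in --type-cast'
            )
    return args
```

A call such as parse_options(['--type-cast', '["int", "true"]', '--true-values', '["yes", "Y"]']) would succeed, while requesting "null" without --null-values exits with a usage error.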

New module options

This may change after #1 is implemented, so this will be finalized later.

chamkank avatar May 28 '20 03:05 chamkank