Csv Schema Inference

A tool to automatically infer column data types in .csv files

Check the article here: Building a Schema Inference Data Pipeline for Large CSV files

Installing csv-schema-inference 🔧

pip install csv-schema-inference
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting csv-schema-inference
  Downloading csv_schema_inference-0.0.9-py3-none-any.whl (7.3 kB)
Installing collected packages: csv-schema-inference
Successfully installed csv-schema-inference-0.0.9

Importing csv-schema-inference library

from csv_schema_inference import csv_schema_inference

Setting csv-schema-inference configuration


# If the inferred data type is INTEGER but FLOAT values are also present in the results, the final type will be FLOAT
conditions = {"INTEGER":"FLOAT"}

csv_infer = csv_schema_inference.CsvSchemaInference(
    portion=0.9,
    max_length=100,
    batch_size=200000,
    acc=0.8,
    seed=2,
    header=True,
    sep=",",
    conditions=conditions,
)
pathfile = "/content/file__500k.csv"

Run inference 🏃

aprox_schema = csv_infer.run_inference(pathfile)

Showing the approximate data type inference for each column 🔍

csv_infer.pretty(aprox_schema)
0
	name
		id
	type
		INTEGER
	nullable
		False
1
	name
		full_name
	type
		STRING
	nullable
		True
2
	name
		age
	type
		INTEGER
	nullable
		False
3
	name
		city
	type
		STRING
	nullable
		True
4
	name
		weight
	type
		FLOAT
	nullable
		False
5
	name
		height
	type
		FLOAT
	nullable
		False
6
	name
		isActive
	type
		BOOLEAN
	nullable
		False
7
	name
		col_int1
	type
		INTEGER
	nullable
		False
8
	name
		col_int2
	type
		INTEGER
	nullable
		False
9
	name
		col_int3
	type
		INTEGER
	nullable
		False
10
	name
		col_float1
	type
		FLOAT
	nullable
		False
11
	name
		col_float2
	type
		FLOAT
	nullable
		False
12
	name
		col_float3
	type
		FLOAT
	nullable
		False
13
	name
		col_float4
	type
		FLOAT
	nullable
		False
14
	name
		col_float5
	type
		FLOAT
	nullable
		False
15
	name
		col_float6
	type
		FLOAT
	nullable
		False
16
	name
		col_float7
	type
		FLOAT
	nullable
		False
17
	name
		col_float8
	type
		FLOAT
	nullable
		False
18
	name
		col_float9
	type
		FLOAT
	nullable
		False
19
	name
		col_float10
	type
		FLOAT
	nullable
		False
20
	name
		test_column
	type
		FLOAT
	nullable
		False
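
To make the listing above easier to read, here is a minimal sketch of how a single column could be summarized into a name, the types observed, and a nullable flag. It is an illustration only, not the library's implementation: the helpers `detect_type` and `summarize_column` and the sample values are hypothetical, and the real detection rules may differ.

```python
from collections import Counter


def detect_type(value: str) -> str:
    """Classify one raw CSV cell (hypothetical rules, for illustration only)."""
    text = value.strip()
    if text == "":
        return "NULL"
    if text.lower() in {"true", "false"}:
        return "BOOLEAN"
    try:
        int(text)
        return "INTEGER"
    except ValueError:
        pass
    try:
        float(text)
        return "FLOAT"
    except ValueError:
        return "STRING"


def summarize_column(name: str, values: list) -> dict:
    """Count the types seen in a column and flag it nullable if any cell is empty."""
    counts = Counter(detect_type(v) for v in values)
    nullable = counts.pop("NULL", 0) > 0
    return {"name": name, "types_found": dict(counts), "nullable": nullable}


# Made-up sample values, not taken from the benchmark files.
print(summarize_column("city", ["Lima", "", "Quito"]))
print(summarize_column("weight", ["70", "68.5", "81"]))
```

A column that mixes INTEGER and FLOAT cells, like the second call, is the case the `conditions` setting is meant to resolve.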

Checking schema values for specific columns

result = csv_infer.get_schema_columns(columns = {"test_column"})
csv_infer.pretty(result)
20
	_name
		test_column
	types_found
		INTEGER
			cnt
				406130
		FLOAT
			cnt
				50964
	nullable
		False
	type
		FLOAT

Explore all possible data types for a specific column

result = csv_infer.explore_schema_column(column = "test_column")
csv_infer.pretty(result)
20
	name
		test_column
	types_found
		INTEGER
			88.85043339006856
		FLOAT
			11.149566609931437
	nullable
		False
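
The two outputs for test_column are related: the percentages are the counts divided by the total number of typed values, and the final FLOAT type follows from the `conditions` mapping. The sketch below reproduces that reasoning under stated assumptions. The functions `to_percentages` and `resolve_type` are hypothetical names, the rule "promote the most frequent type when its mapped type was also observed" is inferred from the configuration comment, and the `acc` threshold is deliberately left out because its exact role is not described here.

```python
def to_percentages(counts: dict) -> dict:
    """Convert per-type counts into percentage shares of the column."""
    total = sum(counts.values())
    return {type_name: 100 * count / total for type_name, count in counts.items()}


def resolve_type(counts: dict, conditions: dict) -> str:
    """Pick the most frequent type, then apply the promotion rule if it matches.

    Assumed rule: if the dominant type has an entry in `conditions` and the
    mapped type was also found in the column, report the mapped type.
    """
    dominant = max(counts, key=counts.get)
    promoted = conditions.get(dominant)
    if promoted is not None and counts.get(promoted, 0) > 0:
        return promoted
    return dominant


# Counts taken from the get_schema_columns output for test_column.
counts = {"INTEGER": 406130, "FLOAT": 50964}
conditions = {"INTEGER": "FLOAT"}

shares = to_percentages(counts)
final_type = resolve_type(counts, conditions)
print(shares)      # INTEGER is roughly 88.85%, FLOAT roughly 11.15%
print(final_type)  # FLOAT, because FLOAT values were also observed
```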

Benchmark

The tests were run on 9 .csv files with 21 columns each and different sizes and numbers of records. For each process (shuffling and inference), the reported time is the average of 5 executions.

  • file__20m.csv: 20 million records
  • file__15m.csv: 15 million records
  • file__12m.csv: 12 million records
  • file__10m.csv: 10 million records
  • And so on...

If you want to know more about the shuffling process, you can check this other repository: A tool to automatically Shuffle lines in .csv files. The shuffling process helps us to:

  1. Increase the probability of finding all the data types present in a single column.
  2. Avoid iterating over the entire dataset.
  3. Avoid biases in the data that may be part of its organic behavior and that we cannot anticipate without knowing how it was constructed.
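
The first point can be quantified with a small worked example. Once the rows are shuffled, reading the first n rows is equivalent to drawing a uniform random sample without replacement, so the chance of seeing a rare type at least once has a closed form. The function name `prob_type_seen` and the numbers below are illustrative assumptions, not measurements from the benchmark.

```python
from math import comb


def prob_type_seen(total_rows: int, rows_with_type: int, sample_size: int) -> float:
    """Probability that a uniform sample without replacement contains
    at least one of the rows holding the given type."""
    # Ways to draw the sample entirely from rows that do NOT hold the type.
    missed = comb(total_rows - rows_with_type, sample_size)
    return 1 - missed / comb(total_rows, sample_size)


# Hypothetical column: 1,000 rows, only 10 of which hold FLOAT values.
for sample_size in (50, 200, 500):
    print(sample_size, prob_type_seen(1000, 10, sample_size))
```

Without shuffling, a file whose rare values all sit near the end would show none of them in its first rows, no matter what this probability suggests. That ordering effect is what shuffling removes.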

Contributing and Feedback

Any ideas or feedback about this repository? Help me improve it.

Authors

License

This project is licensed under the terms of the MIT License.