type_infer
Distributed type inference
The current implementation of type_infer is not suitable for distributed compute environments (i.e. it does not scale): type_infer can only be executed on a single node and needs to load all of the data into memory. This makes type_infer unsuitable for analyzing large datasets that do not fit in RAM.
The internal workings of type_infer allow for a relatively straightforward extension to distributed execution: sub-sets of a column (each subset loaded into a different worker) could be used to infer the data type, with a voting mechanism then choosing one type over the others, as sketched below.
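A minimal sketch of the partitioning and a naive vote, assuming pandas columns; partition_column and majority_vote are hypothetical names for illustration, not part of the current type_infer API:

```python
from collections import Counter

import pandas as pd


def partition_column(column: pd.Series, n_workers: int) -> list[pd.Series]:
    """Split a column into roughly equal chunks, one per worker."""
    chunk = -(-len(column) // n_workers)  # ceiling division
    return [column.iloc[i * chunk:(i + 1) * chunk] for i in range(n_workers)]


def majority_vote(types: list[str]) -> str:
    """Naive vote: pick the type inferred by the most workers."""
    return Counter(types).most_common(1)[0][0]
```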
The voting mechanism shall be aware of the data type hierarchy. For example, consider the case of having 4 workers: worker 1 identifies a subset of a column to be of type text, while workers 2, 3, and 4 identify the rest of the subsets as being of type integer. Because text is a more general data type than integer (one level higher in the data type hierarchy), the entire column should be cast as text instead of integer, even though there are more votes for the latter. It is worth mentioning that the current implementation does not handle this situation, which might seem like an edge case but is likely very common.
The proposed implementation shall use torch.distributed to distribute the work across nodes. Because torch is a heavy dependency, this capability shall be available only if the user installs type_infer by running
pip install type_infer[distributed]
All of the distributed modules should be encapsulated in a sub-module called distributed to avoid breaking the existing code-base.
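As a hedged sketch of what an entry point in such a type_infer.distributed sub-module might look like, the function below reuses partition_column and resolve_votes from the sketches above; infer_subset_type is a stand-in for the existing single-node inference, and the process group is assumed to already be initialized (e.g. by launching with torchrun):

```python
import pandas as pd
import torch.distributed as dist


def infer_subset_type(values: pd.Series) -> str:
    """Stand-in for type_infer's single-node inference (hypothetical)."""
    numeric = pd.to_numeric(values, errors="coerce")
    if numeric.notna().all():
        return "integer" if (numeric % 1 == 0).all() else "float"
    return "text"


def infer_column_type_distributed(column: pd.Series) -> str:
    """Each rank infers a type for its own chunk of the column, then all
    ranks exchange votes and resolve them in a hierarchy-aware way.

    Assumes dist.init_process_group() has already been called.
    """
    rank, world = dist.get_rank(), dist.get_world_size()
    local_chunk = partition_column(column, world)[rank]
    local_type = infer_subset_type(local_chunk)
    votes = [None] * world
    dist.all_gather_object(votes, local_type)  # collect every rank's vote
    return resolve_votes(votes)
```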