
Distributed type inference

Open pedrofluxa opened this issue 10 months ago • 0 comments

The current implementation of type_infer is not suitable for distributed compute environments (i.e., it does not scale): type_infer can only be executed on a single node and needs to load the full dataset into memory. This makes type_infer unsuitable for analyzing large datasets that do not fit in RAM.

The internal workings of type_infer allow for a relatively straightforward path to distributed execution: split each column into subsets, load each subset on a different worker, infer a data type per subset, and then apply something like a voting mechanism to choose the final type. A minimal sketch of this flow is shown below.
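
The sketch assumes torch.distributed has already been initialized; infer_column_type (a hypothetical wrapper around the existing single-node logic) and resolve_votes (sketched further down) are placeholder names, not existing type_infer APIs:

```python
import torch.distributed as dist

def infer_type_distributed(column_chunk):
    """Infer a type vote on this worker's chunk, then combine all votes."""
    # Assumes dist.init_process_group(...) was called at startup.
    local_vote = infer_column_type(column_chunk)  # hypothetical wrapper around
                                                  # the existing single-node inference
    votes = [None] * dist.get_world_size()
    dist.all_gather_object(votes, local_vote)     # collect every worker's vote
    return resolve_votes(votes)                   # hierarchy-aware resolution (below)
```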

The voting mechanism shall be aware of the data type hierarchy. For example, consider the case of 4 workers: worker 1 identifies its subset of a column as type text, while workers 2, 3, and 4 identify their subsets as type integer. Because text is a more general data type than integer (one level higher in the data type hierarchy), the entire column should be cast as text rather than integer, even though integer received more votes. It is worth mentioning that the current implementation does not handle this situation; it may look like an edge case, but it is likely very common in practice. A sketch of such a resolution step follows.
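
The ordering used here is an assumption for illustration; type_infer's actual dtype hierarchy may differ:

```python
# Assumed ordering from most specific to most general; the real hierarchy
# in type_infer may differ.
HIERARCHY = ["binary", "integer", "float", "datetime", "text"]

def resolve_votes(votes):
    # Choose the most general type any worker saw rather than a plain
    # majority, so one "text" vote overrides three "integer" votes.
    return max(votes, key=HIERARCHY.index)
```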

The proposed implementation shall use torch.distributed to distribute the work across nodes. Because torch is a heavy dependency, this capability shall be available only if the user installs type_infer by running

pip install type_infer[distributed]
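
One way to wire up that extra, sketched against a setuptools-style build (the project's actual build configuration may differ):

```python
from setuptools import setup, find_packages

setup(
    name="type_infer",
    packages=find_packages(),
    extras_require={
        # torch is only installed when the user opts into the extra
        "distributed": ["torch"],
    },
)
```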

All of the distributed functionality should be encapsulated in a sub-module called distributed to avoid breaking the existing codebase; the optional import could be guarded as sketched below.
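
For instance, type_infer/distributed/__init__.py could fail with an actionable message when the extra is missing, leaving the core package importable without torch; a minimal sketch:

```python
try:
    import torch.distributed  # noqa: F401
except ImportError as err:
    raise ImportError(
        "Distributed type inference requires torch; install it with "
        "`pip install type_infer[distributed]`"
    ) from err
```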

pedrofluxa Oct 18 '23 09:10