Trill
Columnar data format efficiency: create extra columns for needed expressions
For a data structure with fields a, b, and c, if the downstream query operators never refer to field a directly but instead refer to a.d.z, or a["bacon"], or some other constant expression, it may make sense to have a column representing a.d.z or a["bacon"] instead of a. This change would require an alteration of the data type structure of the generated columnar batch, and it would change the way that generated operators over those columns reference fields.
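To make the idea concrete, here is a minimal sketch of the two layouts. It is written in Python purely for illustration and is not Trill's generated C# code; the `Event`, `A`, and `Inner` types, the `to_columns` helper, and the sample query are all hypothetical names made up for this example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Inner:
    z: int


@dataclass
class A:
    d: Inner


@dataclass
class Event:
    # Hypothetical payload type with fields a, b, and c.
    a: A
    b: int
    c: str


def to_columns(rows: List[Event],
               extractors: Dict[str, Callable[[Event], Any]]) -> Dict[str, List[Any]]:
    """Build one column (a plain list) per named extractor expression."""
    return {name: [fn(row) for row in rows] for name, fn in extractors.items()}


rows = [
    Event(A(Inner(1)), 100, "x"),
    Event(A(Inner(2)), 200, "y"),
    Event(A(Inner(3)), 300, "z"),
]

# Naive layout: one column per top-level field.
# Column "a" stores whole objects, so operators must dereference a.d.z per row.
naive = to_columns(rows, {
    "a": lambda r: r.a,
    "b": lambda r: r.b,
    "c": lambda r: r.c,
})

# Projected layout: the query only ever touches a.d.z, so that expression
# gets its own column and the column for "a" is dropped entirely.
projected = to_columns(rows, {
    "a.d.z": lambda r: r.a.d.z,
    "b": lambda r: r.b,
    "c": lambda r: r.c,
})

# Sample query: sum of a.d.z over rows where b > 150.
naive_result = sum(a.d.z for a, b in zip(naive["a"], naive["b"]) if b > 150)
projected_result = sum(z for z, b in zip(projected["a.d.z"], projected["b"]) if b > 150)
```

Both queries compute the same value, but the second one reads a flat column of integers instead of chasing two references per row. It also shows the cost mentioned above: the batch no longer has an `a` column, so every operator that used to reference `a` has to be generated against `a.d.z` instead.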
There are multiple discussions around this topic, I think. I have linked the other places at https://github.com/dotnet/corefx/issues/26845 and https://github.com/dotnet/machinelearning/issues/69. It appears the industry is converging around Apache Arrow (https://arrow.apache.org/) as the columnar format, and an initial C# implementation landed just recently (https://github.com/apache/arrow/tree/master/csharp). It might make sense to coordinate a bit around this to make a good case for .NET at large (as a side note, heterogeneous computing, Arrow, machine learning parameter tuning and other things were tangentially discussed at https://github.com/dotnet/orleans). :)
For readers coming from other links, Trill has a few other related issues: https://github.com/Microsoft/Trill/issues/7 and https://github.com/Microsoft/Trill/issues/6
That's a fantastic idea. If there is already data arriving natively in Arrow format then making Trill operate directly and efficiently on it would be fantastic.