fst
any plans for python/julia interfaces
Are there any plans to make interfaces to your binary format from other languages? With a Python or Julia interface we could easily move data between different platforms. That is what feather was meant to do, but it is slow and crashes in R even for a 500 MB CSV input.
Hi @jangorecki, thanks for asking, the answer is yes, absolutely!
The underlying lib that powers fst is called fstlib (and is available here). It compiles on all major platforms, and I recently updated the Travis builds to also include Windows.
To add a new client language, a wrapper for fstlib has to be created and the fstlib API needs to be implemented (mainly to create and delete memory and map the specific types to native types).
I don't have much experience using Julia but creating a new wrapper is very important I think (any help in that department would be much appreciated :-)
By the way, @xiaodaigh created a wrapper for the fst package in Julia (see here). So that package is not a direct implementation of the fstlib library but rather a wrapper around the R package.
(the same holds true for Python. Getting a package out is high on the priority list.)
hello there! as you may have noticed, I think there is a need for an efficient storage format that works with R and Python. Do you have a timeline for the fst python bindings?? Happy to do some testing if needed.
Thanks!!!
Hi @randomgambit, thanks for the heads-up! Yes, there seems to be a void between R and Python that could be filled nicely with fst bindings for Python I think. My plan is to get a Python package operational before the end of this year.
Your offer to help with testing is much appreciated!
before the end of this year.
I really wish you said before the end of the month instead!! :D
Ha @randomgambit, yes, the same here! If only I had more time... I'll make sure to talk to my 'day-time-job' director on your behalf :-)
Just curious if there is any update on the progress on the Python side? I use R, but a lot of people in my research field (astronomy) use Python. Feather seems the current best bet in this regard, but FST seems like it could be a decent step up given its subsetting and compression capabilities.
Hi @asgr,
thanks for your question. Yes, the Python bindings are long overdue and the fst format could be a faster and more dynamic bridge between R and Python than feather.
The Python bindings could basically follow the same strategy as the R bindings: the fstlib library generates 1D numpy arrays from the stored column data, and those arrays can be wrapped into a pandas data frame.
I'll try to get a repository up and running soon with an initial package version and we can work from there (user input would be much appreciated :-)).
pydt would be useful too, fyi @st-pasha
that sounds like a great default return type for python's read_fst() 😺
@MarcusKlik Do you have any documentation for the fst file format?
Hi @st-pasha, thanks for asking, are you interested in a specification of the format meta-data, data-block design, etc or the C++ API documentation?
(neither is readily available at the moment, but just so I know where to direct my efforts 😸)
format meta-data, data-block design for me as I am writing a Julia serializer
Ha @xiaodaigh, that's great to hear. I suspect that you won't need the exact details of the fstlib implementation and fst format (but you will definitely need good API documentation)
(that is, unless you mean you'd like to write your own format, in that case format specs are of interest of course)
The fstlib library has an abstract representation (C++) of a table and its columns, and it will take some effort to write an implementation for that using the Julia C/C++ API and internal data layout.
Please let me know if I can help you there, implementing a Julia binding will be a very good test of the flexibility of fstlib :-)
(see also this issue in fstlib)
Hi @MarcusKlik, sorry I should have given more context for my question.
So, I'm a primary developer of the Python datatable library. This library provides a data frame object and facilities to manipulate this data frame. So, I guess it's pretty close to fstlib in functionality. We also have our own format for storing data on disk, called Jay.
Some time in the near future (maybe around winter) we were planning to add integrations for other on-disk data formats, foremost arrow (feather) and parquet. And, as @jangorecki points out, the fst format is another good candidate to consider.
In other words, I'm not looking to use fstlib itself, just to add the ability to read (and possibly write) fst files produced by some source (say, a file can be written in R and then read in Python by datatable). This is, of course, conditional on whether you'd be ok with making your file format open for 3rd-party libraries to implement and use, especially if those other libraries are not GPL-based.
So, if this all sounds agreeable to you, I would be looking for a document describing how to interpret data stored in a .fst file. Something similar to our Jay format description linked above.
Hi @st-pasha, no problem!
I'm very familiar with your work on (py)datatable (big fan ;-)) and just wanted to get clear how you would integrate fst into the package.
In short, fstlib is similar to the parquet and feather libraries; it contains the code to wrap an existing data structure (e.g. a (py)datatable) and serialize that to disk (or RAM in a future upgrade).
So it was explicitly not designed to manipulate in-memory datasets, like datatable and pandas (and arrow).
What fstlib will be able to do is run custom functions while loading data from disk. So during a load, each chunk can be processed on main or background threads. This will enable fast calculations on on-disk data. But the actual methods used for these calculations will be provided by the user.
This differs from the goals of arrow, for example, because arrow does provide code for operations on its internal dataframe structure. So it aims to be a universal dataframe manipulation framework that can be used from different languages and systems.
With fstlib, calculations are done by the client, leveraging the strong points of specific languages and their functions.
The fst format is tightly bound to the fstlib library, as data-blocks and meta-data are compressed using optimized algorithms that are (only) available in fstlib. For example, compression usually involves a bit-shifting filter to speed up results. This filter is part of the fstlib library.
For datatable to add the fst format to its reading and writing capabilities, fstlib will have to be compiled and integrated with datatable. Then, a (zero copy) wrapper for a datatable object can be created so that fstlib can (de-)serialize data to disk.
Currently, fstlib has an AGPL-3.0 license, and the LZ4 and ZSTD compression libraries have their own licenses (BSD). So that cannot easily be re-licensed to datatable's MPL-2.0 at the moment (I think). One option is to create a package fst for Python that returns a datatable, which would give a separation of licenses. Another option would be to create a special license for use by datatable.
please let me know what you think, thanks!
Hi Marcus,
Based on your description it looks like the fst format is sufficiently complicated that it doesn't make sense to create an independent reader. In that case the simplest solution would be to have a separate fst library wrapping fstlib.
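Then in datatable we could have simple wrappers such as

    class Frame:
        def to_fst(self, path):
            # `fst` is the proposed (not yet existing) Python package
            import fst
            fst.write(self, path)

    def fread(path):
        if path.endswith(".fst"):
            # Imported lazily, so fst stays an optional dependency
            import fst
            return fst.read(path, output_format="datatable")
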
We also have a feature proposal (https://github.com/h2oai/datatable/issues/1950) for implementing xread(), which reads data and performs computations on that data at the same time. We will need to think about how to integrate this with fst properly.
For now, however, there are 2 main questions:
- How can the fst package create a datatable Frame? We'll have to add an API function to datatable for that, which is not that hard. Ultimately, a Frame is just a list of named columns, and all we need is to understand what notion of a "column" fstlib exposes. Specifically, we'll need to know how fst encodes NAs, string data (including non-ascii), datetime objects, etc.
- How can the fst package read an existing datatable Frame? We already have an API for accessing a frame's raw data, but that works for "material" data only. Generally, datatable supports "virtual" (computed) columns too, and I wonder whether fst can be made to read those columns directly without materializing them?
Hi @st-pasha,
thanks, that sounds excellent. On your 2 main questions:
- The (py)fst library implements virtual C++ classes from fstlib. These (relatively simple) virtual classes include a table factory and column factory. The implemented (py)fst C++ classes allow creation of Frames and the correct columns. So the details of mapping specific column types from the fst format to Python are contained in the (py)fst package. Conversion between the different representations of NAs will be handled in the fstlib library however, as that is a cross-language problem (same with strings). At first, I think the creation of columns and Tables can be done by calling Python code from the C++ side. The overhead should be minimal because there are only relatively few of these calls for each read.
- The (py)fst library also has to implement a virtual class to represent a Frame. That C++ implementation includes member functions to access the underlying raw data, and these can use the existing datatable API to access that.
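To make the NA point concrete: R stores a missing integer as the smallest 32-bit signed value, whereas Python-side libraries typically use a separate mask or `None`. The snippet below is only a rough sketch of that kind of conversion. The helper names (`decode_int32_column`, `to_python_list`) are made up for illustration and are not part of fstlib, fst or datatable.

```python
from array import array

# R represents NA_integer_ as the minimum 32-bit signed integer.
R_INT32_NA = -2**31


def decode_int32_column(raw):
    """Split a raw int32 column into typed values and a parallel NA mask."""
    values = array("i", raw)  # typed buffer, comparable to a 1D numeric column
    mask = [v == R_INT32_NA for v in values]
    return values, mask


def to_python_list(values, mask):
    """Materialize the column as Python objects, using None for NA."""
    return [None if is_na else v for v, is_na in zip(values, mask)]


raw = [1, R_INT32_NA, 3]
values, mask = decode_int32_column(raw)
print(to_python_list(values, mask))  # [1, None, 3]
```

Keeping the values and the mask separate avoids creating one Python object per element until the caller actually asks for a list.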
There are currently a few virtual columns in the fst format, but only for boundary cases like a factor column with just a single factor (which can be represented by a few numbers). Columns like sequences from n to m will also be encoded in dense format later on. Virtual columns would be a tremendous enhancement to that and I would very much like to see how we can support that. The challenge is to provide a cross-language way of encoding common expressions and constants. Virtual columns that depend on other virtual columns should also be possible. Does datatable have such a universal implementation?
Interestingly, on the R side, virtual columns can be implemented using the new ALTREP framework that was released with R 3.5.
Your xread() proposal is very interesting and something similar is planned for fst. The idea is that during reading, additional transforms can be added to the processing. Now fst does read -> decompress -> bit-shift for each (16 kB) data-block, but additional transformations can be added like row-selection, or (custom-) functions. The plan is to restrict these transforms to the main thread and do the reading and decompression on the background threads. That way, the user can call native R or Python methods and not get into trouble with memory management. This only works for methods that have a map-reduce like implementation (think sum(), min(), max()). Other methods might only be applicable when full columns are read first (think median(), my_custom_func()). The same applies to reads where a by argument is selected, except for sorted tables.
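The map-reduce distinction above can be illustrated with a small, self-contained sketch. It does not use fst at all: `chunks` merely stands in for a block-wise reader, and `chunked_reduce` is a hypothetical helper name.

```python
from statistics import median


def chunks(column, size):
    """Yield consecutive blocks of a column, as a block-wise reader would."""
    for start in range(0, len(column), size):
        yield column[start:start + size]


def chunked_reduce(blocks, map_fn, combine_fn):
    """Apply map_fn to each block and fold the partial results together."""
    result = None
    for block in blocks:
        partial = map_fn(block)
        result = partial if result is None else combine_fn(result, partial)
    return result


column = [9, 9, 9, 1, 2, 3, 1]

# sum, min and max can be combined from per-block partial results.
total = chunked_reduce(chunks(column, 3), sum, lambda a, b: a + b)
smallest = chunked_reduce(chunks(column, 3), min, min)
largest = chunked_reduce(chunks(column, 3), max, max)

# A median of per-block medians is generally not the median of the column,
# so a function like median() needs the full column to be read first.
block_medians = [median(block) for block in chunks(column, 3)]
print(total, smallest, largest)
print(median(block_medians), median(column))
```

Only the partial result of each block has to stay in memory, which is what makes this style attractive for on-disk data.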
So, bottom line, the setup is very similar to the setup used for the fst package in R. We will need implementations of interfaces for factories, virtual tables and virtual columns in a (py)fst package. That package has a dependency on datatable as the implementations require constructors and the API of datatable. Seen from datatable the impact is very small; methods like the ones you posted above are probably sufficient.
Thanks!
Hi Marcus,
I presume you have much more experience with developing R libraries than Python extensions, so let me point out a few peculiarities of Python that could be relevant to the design process.
- Python has only one native list type: a list of objects (or more precisely, a PyList of PyObjects). Unlike in R, there are no native types for "list of ints", or "list of strings", etc. The closest alternative that we have is a numpy array, or a pandas series, or a datatable frame, or arrow dataframe -- all of them keep their data in C structures, exposing to python only a "frontend" object that marshals all methods to the backend implementation.
- For the same reason, calling native python functions for data transformations would not work: you'd need numpy methods, or datatable methods, or arrow methods, etc.
- Since we want to create a native C object from another C library, this calls for a C API between the libraries. Luckily, Python supports this use case via so-called capsule objects. This is neat, because you'd only need a single .h file when compiling your library, and they'll be linked dynamically at runtime. In fact, fst wouldn't even need to list datatable as a dependency: it can attempt to import the functionality at runtime, and then fail with a graceful error if the user doesn't have the module installed.
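At the Python level, the "import at runtime, fail gracefully" idea from the last point could look roughly like the sketch below. It only covers the optional import, not the capsule-based C API, and `load_optional_backend` is a hypothetical helper name.

```python
import importlib


def load_optional_backend(module_name):
    """Import a backend module on demand, failing with a readable message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            f"This output format requires the '{module_name}' package, "
            f"which does not appear to be installed."
        ) from exc


# A module that is present is returned as usual ...
json_backend = load_optional_backend("json")
print(json_backend.dumps([1, 2]))

# ... while a missing one produces a clear error instead of a bare traceback.
try:
    load_optional_backend("definitely_not_installed_module_xyz")
except RuntimeError as err:
    print(err)
```

Because the import happens inside the function, the optional package is only needed by users who actually request that output format.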
Virtual columns that depend on other virtual columns should also be possible. Does datatable have such an universal implementation?
Virtual columns are a new functionality in datatable, their implementation is largely complete, though there's still some refactoring to do to make sure the existing code uses the new functionality to full extent. And yes, in our design a "virtual column" is an object that knows how to calculate its i-th element. For example, a binary_plus column could look like this:

    template <typename T>
    class binary_plus : public ColumnImpl {
      Column lhs, rhs;

     public:
      // Writes element i to *out and returns the `isna` flag.
      bool get_element(size_t i, T* out) {
        T x, y;
        bool lhs_isna = lhs.get_element(i, &x);
        if (lhs_isna) return true;
        bool rhs_isna = rhs.get_element(i, &y);
        if (rhs_isna) return true;
        *out = x + y;
        return false;
      }
    };

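For readers more at home in Python, the same idea (a column that computes its i-th element on request, and that can be built on top of other virtual columns) can be sketched as follows. This is an illustrative analogue only, not datatable's actual implementation. The class names are hypothetical, and `None` plays the role of the `isna` flag.

```python
class MaterialColumn:
    """A column backed by stored data; None marks a missing value."""

    def __init__(self, data):
        self._data = list(data)

    def __len__(self):
        return len(self._data)

    def get_element(self, i):
        return self._data[i]


class BinaryPlus:
    """A virtual column: element i is computed from its operands on demand."""

    def __init__(self, lhs, rhs):
        if len(lhs) != len(rhs):
            raise ValueError("operand columns must have the same length")
        self.lhs, self.rhs = lhs, rhs

    def __len__(self):
        return len(self.lhs)

    def get_element(self, i):
        x = self.lhs.get_element(i)
        if x is None:  # NA propagates without touching the right operand
            return None
        y = self.rhs.get_element(i)
        if y is None:
            return None
        return x + y


def materialize(column):
    """Evaluate every element, turning a virtual column into a plain list."""
    return [column.get_element(i) for i in range(len(column))]


a = MaterialColumn([1, None, 3])
b = MaterialColumn([10, 20, None])
c = MaterialColumn([100, 100, 100])

# A virtual column may depend on another virtual column.
nested = BinaryPlus(BinaryPlus(a, b), c)
print(materialize(nested))  # [111, None, None]
```

A writer that only ever calls `get_element` could in principle serialize such a column without materializing it first, which is the question raised earlier in the thread.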
Hi @st-pasha,
thanks for the pointers! And yes, your assumption is very correct :-)
About your second point, could we:
- Let the fst.read() method materialize Frames from subsets of the stored data.
- Do transformations on those subsets.
- Combine the transformed Frames into a single larger frame.
That way, wouldn't it be possible to use datatable operations (from the python API) to do the transformations in step 2?
Or, perhaps simpler, when column A is being transformed, column B can be read into memory on background threads. When that's finished, column B can be transformed while column C is being read, etc.
Obviously, we would have datatable and fst competing for thread resources, so we need some way of tuning that...
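The "transform column A while column B is being read" idea can be sketched with a single background reader thread. Everything here is a stand-in: `STORE` pretends to be on-disk data, and `read_column`, `transform` and `pipelined_read` are hypothetical names, not fst or datatable functions.

```python
from concurrent.futures import ThreadPoolExecutor

# Pretend on-disk data; a real reader would decompress blocks from a file.
STORE = {"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]}


def read_column(name):
    """Stand-in for reading one column from disk on a background thread."""
    return list(STORE[name])


def transform(values):
    """Stand-in for a transformation run on the main thread."""
    return [v * 2 for v in values]


def pipelined_read(names, reader_threads=1):
    """Transform each column while the next one is read in the background."""
    results = {}
    if not names:
        return results
    with ThreadPoolExecutor(max_workers=reader_threads) as pool:
        pending = pool.submit(read_column, names[0])
        for i, name in enumerate(names):
            values = pending.result()  # wait for the current column
            if i + 1 < len(names):
                # Start reading the next column before transforming this one.
                pending = pool.submit(read_column, names[i + 1])
            results[name] = transform(values)
    return results


print(pipelined_read(["A", "B", "C"]))
```

The `reader_threads` parameter is one simple knob for the thread competition mentioned above: the reader gets a fixed budget and the rest is left to the transformation library.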
thanks!
In my team some use R and others use python, so we had to use hdf5 because fst is only for R. But I like fst better.
Hi @ssh352,
thanks, I'm happy to hear that fst works for you and your team!
Yes, the python bindings are an important next step for the fst format, and the goal is to roll out a package in 2020.
Hope you and your team can wait for that :-)
Just wanted to express that being able to read in Python would be extremely useful :)
This issue has been quiet for a while. Has any progress been made with Python access to fst? (I'm very excited for this feature!)
Hi @richierocks, thanks for checking in on the progress. Unfortunately, I haven't had much time to work on a python package (as you have noticed :-)).
I do think the python bindings are very important, it's just that time is a real bottleneck here. The package will probably have to wait until later in 2021, apologies for that!
This SO answer can be improved once fst is available in python: https://stackoverflow.com/a/64880745/2490497
Hi @jangorecki, thanks for the heads up, when fst has a python interface, I will make sure to add the timings to your SO answer!
Has there been any progress on the project to create a python interface for fstlib?
Hey, just a heads up: we in the genomics community are starting to experiment with this format, and it is a very powerful substitute for older formats like bam files. I do believe that if you do not have time to work on this yourself, funding could be acquired through grants etc., and supervised master's students could also do project courses to implement simpler/smaller parts. Many possibilities here, let me know if this could be of interest.