traceml
Engine for ML/Data tracking, visualization, explainability, drift detection, and dashboards for Polyaxon.
Datatile
A library for managing, summarizing, and visualizing data.
N.B.:
pandas-summary was renamed to datatile, a more ambitious project with several planned features and enhancements, including support for visualizations, quality checks, linking summaries to versions, and integrations with third-party libraries.
Installation
The module can be easily installed with pip:
> pip install datatile
This module depends on numpy and pandas. Optionally, you can also get some nice visualisations if you have matplotlib installed.
Tests
To run the tests, execute the command python setup.py test
Usage
DataFrameSummary
An extension to the pandas DataFrame describe() function.
The module contains a DataFrameSummary object that extends describe() with:
- properties
  - dfs.columns_stats: counts, uniques, missing, missing_perc, and type per column
  - dfs.columns_types: a count of the types of columns
  - dfs[column]: a more in-depth summary of the column
- function
  - summary(): extends the describe() function with the values from columns_stats

DataFrameSummary expects a pandas DataFrame to summarise.
from datatile.summary.df import DataFrameSummary

# df is any existing pandas DataFrame
dfs = DataFrameSummary(df)
Getting the column types:
dfs.columns_types
numeric     9
bool        3
categorical 2
unique      1
date        1
constant    1
dtype: int64
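To make the categories in this output more concrete, here is a minimal pandas-only sketch of how columns could be sorted into such types. The function `guess_column_type` and its rules are hypothetical heuristics written for illustration; they are not taken from datatile and may differ from what the library actually does.

```python
import pandas as pd
from pandas.api import types as ptypes


def guess_column_type(series: pd.Series) -> str:
    """Hypothetical heuristic, NOT datatile's own classification rules."""
    non_null = series.dropna()
    n_unique = non_null.nunique()
    if n_unique <= 1:
        return "constant"      # at most one distinct value
    if n_unique == 2:
        return "bool"          # assumption: two distinct values count as boolean
    if ptypes.is_datetime64_any_dtype(series):
        return "date"
    if ptypes.is_numeric_dtype(series):
        return "numeric"
    if n_unique == len(non_null):
        return "unique"        # every non-null value is distinct
    return "categorical"


example = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "price": [1.5, 2.5, 3.5, 2.5],
    "flag": [True, False, True, True],
    "color": ["red", "red", "blue", "green"],
    "when": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04"]),
    "source": ["web", "web", "web", "web"],
})

# One guessed type per column, then a count per type (similar in shape to columns_types)
column_types = pd.Series({name: guess_column_type(example[name]) for name in example.columns})
type_counts = column_types.value_counts()
print(type_counts)
```

The order of the checks matters in this sketch: the cardinality checks run first, so a numeric column with only two distinct values would be reported as bool rather than numeric.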
Getting the column stats:
dfs.columns_stats
                      A            B        C              D              E
counts             5802         5794     5781           5781           4617
uniques            5802            3     5771            128            121
missing               0            8       21             21           1185
missing_perc         0%        0.14%    0.36%          0.36%         20.42%
types            unique  categorical  numeric        numeric        numeric
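For readers who want to see what these rows mean, the first four can be reproduced with plain pandas. The sketch below is illustrative only: `basic_column_stats` is a hypothetical helper, not part of datatile, and its percentage formatting is an assumption. The definition of missing_perc as missing values divided by the total row count is consistent with the table above (for example, 8 missing out of 5802 rows gives 0.14%).

```python
import pandas as pd


def basic_column_stats(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column counts, uniques, missing and missing_perc (illustrative only)."""
    counts = df.count()          # non-null values per column
    uniques = df.nunique()       # distinct non-null values per column
    missing = df.isna().sum()    # null values per column
    # Share of missing values relative to the total number of rows
    missing_perc = (missing / len(df) * 100).round(2).astype(str) + "%"
    return pd.DataFrame(
        {
            "counts": counts,
            "uniques": uniques,
            "missing": missing,
            "missing_perc": missing_perc,
        }
    ).T  # transpose so that columns of df stay as columns, as in the table above


example = pd.DataFrame({"A": [1, 2, 3, 4], "B": ["x", None, "x", "y"]})
stats = basic_column_stats(example)
print(stats)
```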
Getting a single column summary, e.g. for a numerical column:
# a column can also be accessed by its integer position
dfs['A']
std                                                                 0.2827146
max                                                                  1.072792
min                                                                         0
variance                                                           0.07992753
mean                                                                0.5548516
5%                                                                  0.1603367
25%                                                                 0.3199776
50%                                                                 0.4968588
75%                                                                 0.8274732
95%                                                                  1.011255
iqr                                                                 0.5074956
kurtosis                                                            -1.208469
skewness                                                            0.2679559
sum                                                                  3207.597
mad                                                                 0.2459508
cv                                                                  0.5095319
zeros_num                                                                  11
zeros_perc                                                               0,1%
deviating_of_mean                                                          21
deviating_of_mean_perc                                                  0.36%
deviating_of_median                                                        21
deviating_of_median_perc                                                0.36%
top_correlations                         {u'D': 0.702240243124, u'E': -0.663}
counts                                                                   5781
uniques                                                                  5771
missing                                                                    21
missing_perc                                                            0.36%
types                                                                 numeric
Name: A, dtype: object
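Several of the derived values in this summary follow directly from the basic ones: in the output above, iqr equals the 75% quantile minus the 25% quantile, cv equals std divided by mean, and variance equals std squared. A minimal pandas sketch of those relationships is shown below; `numeric_extras` is a hypothetical helper, not datatile's API, and treating zeros_num as a plain count of zero values is an assumption.

```python
import pandas as pd


def numeric_extras(series: pd.Series) -> dict:
    """A few derived statistics for a numeric column (illustrative only)."""
    s = series.dropna()
    q25, q75 = s.quantile(0.25), s.quantile(0.75)
    return {
        "iqr": q75 - q25,                  # interquartile range
        "cv": s.std() / s.mean(),          # coefficient of variation
        "variance": s.var(),               # sample variance, i.e. std squared
        "zeros_num": int((s == 0).sum()),  # assumption: number of exact zeros
    }


values = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0])
extras = numeric_extras(values)
print(extras)
```

Note that the coefficient of variation is undefined for a column whose mean is zero, so a real implementation would need to guard against that case.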
Future development
Summaries
- [ ] Add summary analysis between columns, i.e. dfs[[1, 2]]
Visualizations
- [ ] Add summary visualization with matplotlib.
- [ ] Add summary visualization with plotly.
- [ ] Add summary visualization with altair.
- [ ] Add predefined profiling.
Catalog and Versions
- [ ] Add possibility to persist summary and link to a specific version.
- [ ] Integrate with quality libraries.