Goal

As a developer, I want to add features to the existing machine learning model (XGBoost), so that I can develop a more accurate machine learning model.

Consider

Consider using cryptocurrency volume data, already integrated in Kryptos (see volume() method), as features
Consider adding external data sources, already integrated in Kryptos (see task #8), as features:

Google Search Volume (see trends.py)
Blockchain Info (see bchain_activity.py)

Consider using blue-yonder's tsfresh for automatic extraction of more than 184 time series features, such as:

abs_energy(x) | Returns the absolute energy of the time series which is the sum over the squared values
absolute_sum_of_changes(x) | Returns the sum over the absolute value of consecutive changes in the series x
agg_autocorrelation(x, param) | Calculates the value of an aggregation function f_agg (e.g.
agg_linear_trend(x, param) | Calculates a linear least-squares regression for values of the time series that were aggregated over chunks versus the sequence from 0 up to the number of chunks minus one.
approximate_entropy(x, m, r) | Implements a vectorized Approximate entropy algorithm.
ar_coefficient(x, param) | This feature calculator fits the unconditional maximum likelihood of an autoregressive AR(k) process.
augmented_dickey_fuller(x, param) | The Augmented Dickey-Fuller test is a hypothesis test which checks whether a unit root is present in a time series sample.
autocorrelation(x, lag) | Calculates the autocorrelation of the specified lag, according to the formula [1]
binned_entropy(x, max_bins) | First bins the values of x into max_bins equidistant bins.
c3(x, lag) | This function calculates the value of
change_quantiles(x, ql, qh, isabs, f_agg) | First fixes a corridor given by the quantiles ql and qh of the distribution of x.
count_above_mean(x) | Returns the number of values in x that are higher than the mean of x
count_below_mean(x) | Returns the number of values in x that are lower than the mean of x
cwt_coefficients(x, param) | Calculates a Continuous wavelet transform for the Ricker wavelet, also known as the “Mexican hat wavelet” which is
energy_ratio_by_chunks(x, param) | Calculates the sum of squares of chunk i out of N chunks expressed as a ratio with the sum of squares over the whole
fft_coefficient(x, param) | Calculates the fourier coefficients of the one-dimensional discrete Fourier Transform for real input by fast
first_location_of_maximum(x) | Returns the first location of the maximum value of x.
first_location_of_minimum(x) | Returns the first location of the minimal value of x.
friedrich_coefficients(x, param) | Coefficients of polynomial h(x), which has been fitted to
has_duplicate(x) | Checks if any value in x occurs more than once
has_duplicate_max(x) | Checks if the maximum value of x is observed more than once
has_duplicate_min(x) | Checks if the minimal value of x is observed more than once
index_mass_quantile(x, param) | Those apply features calculate the relative index i where q% of the mass of the time series x lie left of i.
kurtosis(x) | Returns the kurtosis of x (calculated with the adjusted Fisher-Pearson standardized moment coefficient G2).
large_standard_deviation(x, r) | Boolean variable denoting if the standard dev of x is higher than ‘r’ times the range = difference between max and min of x.
last_location_of_maximum(x) | Returns the relative last location of the maximum value of x.
last_location_of_minimum(x) | Returns the last location of the minimal value of x.
length(x) | Returns the length of x
linear_trend(x, param) | Calculate a linear least-squares regression for the values of the time series versus the sequence from 0 to length of the time series minus one.
longest_strike_above_mean(x) | Returns the length of the longest consecutive subsequence in x that is bigger than the mean of x
longest_strike_below_mean(x) | Returns the length of the longest consecutive subsequence in x that is smaller than the mean of x
max_langevin_fixed_point(x, r, m) | Largest fixed point of dynamics argmax_x {h(x)=0} estimated from polynomial h(x)
maximum(x) | Calculates the highest value of the time series x.
mean(x) | Returns the mean of x
mean_abs_change(x) | Returns the mean over the absolute differences between subsequent time series values which is
mean_change(x) | Returns the mean over the differences between subsequent time series values
mean_second_derivate_central(x) | Returns the mean value of a central approximation of the second derivative
median(x) | Returns the median of x
minimum(x) | Calculates the lowest value of the time series x.
number_crossing_m(x, m) | Calculates the number of crossings of x on m.
number_cwt_peaks(x, n) | This feature calculator searches for different peaks in x.
number_peaks(x, n) | Calculates the number of peaks of at least support n in the time series x.
partial_autocorrelation(x, param) | Calculates the value of the partial autocorrelation function at the given lag.
percentage_of_reoccurring_datapoints_to_all_datapoints(x) | Returns the percentage of unique values, that are present in the time series more than once.
percentage_of_reoccurring_values_to_all_values(x) | Returns the ratio of unique values, that are present in the time series more than once.
quantile(x, q) | Calculates the q quantile of x.
range_count(x, min, max) | Count observed values within the interval [min, max).
ratio_beyond_r_sigma(x, r) | Ratio of values that are more than r*std(x) (so r sigma) away from the mean of x.
ratio_value_number_to_time_series_length(x) | Returns a factor which is 1 if all values in the time series occur only once, and below one if this is not the case.
sample_entropy(x) | Calculate and return sample entropy of x.
set_property(key, value) | This method returns a decorator that sets the property key of the function to value
skewness(x) | Returns the sample skewness of x (calculated with the adjusted Fisher-Pearson standardized moment coefficient G1).
spkt_welch_density(x, param) | This feature calculator estimates the cross power spectral density of the time series x at different frequencies.
standard_deviation(x) | Returns the standard deviation of x
sum_of_reoccurring_data_points(x) | Returns the sum of all data points, that are present in the time series more than once.
sum_of_reoccurring_values(x) | Returns the sum of all values, that are present in the time series more than once.
sum_values(x) | Calculates the sum over the time series values
symmetry_looking(x, param) | Boolean variable denoting if the distribution of x looks symmetric.
time_reversal_asymmetry_statistic(x, lag) | This function calculates the value of
value_count(x, value) | Count occurrences of value in time series x.
variance(x) | Returns the variance of x
variance_larger_than_standard_deviation(x) | Boolean variable denoting if the variance of x is greater than its standard deviation.
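To make a few of the calculators listed above concrete, here is a minimal sketch that re-implements four of them by hand, following the one-line descriptions in the list. It is not tsfresh's own code, and the helper extract_basic_features and the sample closes window are hypothetical names and values chosen only for illustration.

```python
from statistics import mean


def abs_energy(x):
    """Sum over the squared values of the series."""
    return sum(v * v for v in x)


def absolute_sum_of_changes(x):
    """Sum over the absolute value of consecutive changes in the series."""
    return sum(abs(b - a) for a, b in zip(x, x[1:]))


def count_above_mean(x):
    """Number of values in x that are higher than the mean of x."""
    m = mean(x)
    return sum(1 for v in x if v > m)


def longest_strike_above_mean(x):
    """Length of the longest consecutive run of values above the mean."""
    m = mean(x)
    best = run = 0
    for v in x:
        run = run + 1 if v > m else 0
        best = max(best, run)
    return best


def extract_basic_features(x):
    """Hypothetical helper: build one feature row for a window of values."""
    return {
        "abs_energy": abs_energy(x),
        "absolute_sum_of_changes": absolute_sum_of_changes(x),
        "count_above_mean": count_above_mean(x),
        "longest_strike_above_mean": longest_strike_above_mean(x),
    }


# Made-up window of closing prices, for illustration only.
closes = [10.0, 12.0, 11.0, 15.0, 14.0]
features = extract_basic_features(closes)
print(features)
```

Each window of the price series would yield one such row, which could then be passed to the model alongside the existing columns.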

Inspiration

What generally improves a model's score more on average, feature engineering or hyperparameter tuning? Feature engineering, without a doubt.

Jun 09 '18 21:06 slavakurilyak

I have included tsfresh in the platform

https://github.com/produvia/cryptocurrency-trading-platform/commit/91895ad5c9e2eae3b55c04e954851a17c6da4ecd

Jun 21 '18 16:06 bukosabino

We could add technical analysis features too (ta-lib). Sounds good to you?

Jun 21 '18 16:06 bukosabino

Yes! Let's use technical analysis (ta-lib) as features (see #64) for machine learning.
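As a rough illustration of what technical analysis features look like, the sketch below computes a simple moving average and a rate of change in plain Python. It does not use the ta-lib API. The function names, window sizes, and sample prices are assumptions made for the example only.

```python
def simple_moving_average(prices, window):
    """SMA at each position; None until a full window is available."""
    out = []
    for i in range(len(prices)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(prices[i + 1 - window:i + 1]) / window)
    return out


def rate_of_change(prices, period):
    """Percentage change versus the price `period` steps earlier."""
    out = []
    for i, price in enumerate(prices):
        if i < period:
            out.append(None)
        else:
            prev = prices[i - period]
            out.append((price - prev) / prev * 100.0)
    return out


# Made-up closing prices, for illustration only.
closes = [10.0, 12.0, 11.0, 15.0, 14.0]
sma3 = simple_moving_average(closes, 3)
roc1 = rate_of_change(closes, 1)
print(sma3)
print(roc1)
```

Both indicators use only past and current prices at each step, so they can be added as feature columns without leaking future information. The leading None entries would need to be dropped or filled before training.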

Jun 22 '18 01:06 slavakurilyak

I'm working on adding technical analysis features.

This article could be useful for us in order to add more features. Let me know if you agree that we should work on this.

Jun 22 '18 19:06 bukosabino

I'm working on adding technical analysis features.

I am looking forward to it

This article could be useful for us in order to add more features.

Thanks for sharing this practical article on the Enigma data marketplace.

Let me know if you agree that we should work on this.

Let's implement Kryptos' existing datasets as features. We already support Google Trends and Quandl data sources (see #8).

Jun 23 '18 07:06 slavakurilyak

Let's add non-pricing datasets as features. We can use cryptocurrency volume data, Blockchain Info, and Google Search Volume.

Jul 01 '18 19:07 slavakurilyak

I have added some external data sources (Google Search Volume and Blockchain Info) as features for Machine Learning models.

However, I don't completely understand what you mean by cryptocurrency volume data:

We are already using the volume as a feature: https://github.com/produvia/cryptocurrency-trading-platform/blob/49951f284edbc13c77689d5a69ab67a30b59353e/kryptos/platform/strategy/strategy.py#L227

Edit: At this moment Google Search Volume is fine, but the Quandl dataset is unstable. So, we can use:

$ strat -d google -c "bitcoin" -c "btc" -ml xgboost
or
$ strat -ml xgboost -d google -c "bitcoin" -c "btc"

Jul 04 '18 17:07 bukosabino

I have added some external data sources (Google Search Volume and Blockchain Info) as features for Machine Learning models.

Excellent! Since we now have multiple machine learning models, let's compare the differences between them in terms of accuracy.

We are already using the volume as a feature.

Perfect!

At this moment Google Search Volume is fine, but the Quandl dataset is unstable.

Can you clarify what you mean by Quandl dataset being unstable?

Jul 04 '18 19:07 slavakurilyak

There were some bugs when merging the Quandl dataset into the system. It is fine now. Some examples:

strat -ml xgboost -d google -c "bitcoin" -c "btc" 
strat -ml xgboost -d quandl -c 'MKTCP' -c 'NTRAN'

Jul 05 '18 09:07 bukosabino

Excellent work! Now we can combine all of our existing datasets, including:

google dataset (see manager.py#L243), using Google search terms such as "btc usd" for the btc/usd cryptoasset,
quandl datasets (see manager.py#L398; there are currently 32 datasets),
pricing & volume datasets (see manager.py#L57).
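One way to picture combining these datasets is to align them by date and keep only the dates every source covers. The sketch below is a simplified illustration under that assumption and is not how manager.py does it. The function combine_features, the column names, and all values are hypothetical.

```python
def combine_features(*named_series):
    """Join several (name, {date: value}) series on the dates they share."""
    common = set(named_series[0][1])
    for _, series in named_series[1:]:
        common &= set(series)

    rows = []
    # ISO date strings sort chronologically when sorted as text.
    for date in sorted(common):
        row = {"date": date}
        for name, series in named_series:
            row[name] = series[date]
        rows.append(row)
    return rows


# Made-up values, for illustration only.
price = {"2018-07-01": 100.0, "2018-07-02": 101.0, "2018-07-03": 99.0}
volume = {"2018-07-01": 5.0, "2018-07-02": 7.0, "2018-07-03": 6.0}
google_btc_usd = {"2018-07-01": 12, "2018-07-02": 14}  # one day missing

rows = combine_features(
    ("price", price),
    ("volume", volume),
    ("google_btc_usd", google_btc_usd),
)
for row in rows:
    print(row)
```

Dropping dates that any source lacks is the simplest policy. Forward-filling slower sources (for example weekly search volume) would be an alternative, at the cost of repeating stale values.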

Jul 06 '18 20:07 slavakurilyak
