kryptos
Feature Engineering
Goal
As a developer, I want to add features to the existing machine learning model (XGBoost), so that I can develop a more accurate machine learning model.
Consider
- Consider using cryptocurrency volume data, already integrated in Kryptos (see the volume() method), as features
- Consider adding external data sources, already integrated in Kryptos (see task #8), as features:
- Google Search Volume (see trends.py)
- Blockchain Info (see bchain_activity.py)
- Consider using blue-yonder's tsfresh for automatic extraction of 184+ time series features, such as:
- abs_energy(x) | Returns the absolute energy of the time series which is the sum over the squared values
- absolute_sum_of_changes(x) | Returns the sum over the absolute value of consecutive changes in the series x
- agg_autocorrelation(x, param) | Calculates the value of an aggregation function f_agg (e.g.
- agg_linear_trend(x, param) | Calculates a linear least-squares regression for values of the time series that were aggregated over chunks versus the sequence from 0 up to the number of chunks minus one.
- approximate_entropy(x, m, r) | Implements a vectorized Approximate entropy algorithm.
- ar_coefficient(x, param) | This feature calculator fits the unconditional maximum likelihood of an autoregressive AR(k) process.
- augmented_dickey_fuller(x, param) | The Augmented Dickey-Fuller test is a hypothesis test which checks whether a unit root is present in a time series sample.
- autocorrelation(x, lag) | Calculates the autocorrelation of the specified lag, according to the formula [1]
- binned_entropy(x, max_bins) | First bins the values of x into max_bins equidistant bins.
- c3(x, lag) | This function calculates the value of
- change_quantiles(x, ql, qh, isabs, f_agg) | First fixes a corridor given by the quantiles ql and qh of the distribution of x.
- count_above_mean(x) | Returns the number of values in x that are higher than the mean of x
- count_below_mean(x) | Returns the number of values in x that are lower than the mean of x
- cwt_coefficients(x, param) | Calculates a Continuous wavelet transform for the Ricker wavelet, also known as the “Mexican hat wavelet” which is
- energy_ratio_by_chunks(x, param) | Calculates the sum of squares of chunk i out of N chunks expressed as a ratio with the sum of squares over the whole
- fft_coefficient(x, param) | Calculates the fourier coefficients of the one-dimensional discrete Fourier Transform for real input by fast
- first_location_of_maximum(x) | Returns the first location of the maximum value of x.
- first_location_of_minimum(x) | Returns the first location of the minimal value of x.
- friedrich_coefficients(x, param) | Coefficients of polynomial , which has been fitted to
- has_duplicate(x) | Checks if any value in x occurs more than once
- has_duplicate_max(x) | Checks if the maximum value of x is observed more than once
- has_duplicate_min(x) | Checks if the minimal value of x is observed more than once
- index_mass_quantile(x, param) | Those apply features calculate the relative index i where q% of the mass of the time series x lie left of i.
- kurtosis(x) | Returns the kurtosis of x (calculated with the adjusted Fisher-Pearson standardized moment coefficient G2).
- large_standard_deviation(x, r) | Boolean variable denoting if the standard dev of x is higher than ‘r’ times the range = difference between max and min of x.
- last_location_of_maximum(x) | Returns the relative last location of the maximum value of x.
- last_location_of_minimum(x) | Returns the last location of the minimal value of x.
- length(x) | Returns the length of x
- linear_trend(x, param) | Calculate a linear least-squares regression for the values of the time series versus the sequence from 0 to length of the time series minus one.
- longest_strike_above_mean(x) | Returns the length of the longest consecutive subsequence in x that is bigger than the mean of x
- longest_strike_below_mean(x) | Returns the length of the longest consecutive subsequence in x that is smaller than the mean of x
- max_langevin_fixed_point(x, r, m) | Largest fixed point of dynamics argmax_x {h(x)=0} estimated from polynomial ,
- maximum(x) | Calculates the highest value of the time series x.
- mean(x) | Returns the mean of x
- mean_abs_change(x) | Returns the mean over the absolute differences between subsequent time series values which is
- mean_change(x) | Returns the mean over the differences between subsequent time series values which is
- mean_second_derivate_central |
- median(x) | Returns the median of x
- minimum(x) | Calculates the lowest value of the time series x.
- number_crossing_m(x, m) | Calculates the number of crossings of x on m.
- number_cwt_peaks(x, n) | This feature calculator searches for different peaks in x.
- number_peaks(x, n) | Calculates the number of peaks of at least support n in the time series x.
- partial_autocorrelation(x, param) | Calculates the value of the partial autocorrelation function at the given lag.
- percentage_of_reoccurring_datapoints_to_all_datapoints(x) | Returns the percentage of unique values, that are present in the time series more than once.
- percentage_of_reoccurring_values_to_all_values(x) | Returns the ratio of unique values, that are present in the time series more than once.
- quantile(x, q) | Calculates the q quantile of x.
- range_count(x, min, max) | Count observed values within the interval [min, max).
- ratio_beyond_r_sigma(x, r) | Ratio of values that are more than r*std(x) (so r sigma) away from the mean of x.
- ratio_value_number_to_time_series_length(x) | Returns a factor which is 1 if all values in the time series occur only once, and below one if this is not the case.
- sample_entropy(x) | Calculate and return sample entropy of x.
- set_property(key, value) | This method returns a decorator that sets the property key of the function to value
- skewness(x) | Returns the sample skewness of x (calculated with the adjusted Fisher-Pearson standardized moment coefficient G1).
- spkt_welch_density(x, param) | This feature calculator estimates the cross power spectral density of the time series x at different frequencies.
- standard_deviation(x) | Returns the standard deviation of x
- sum_of_reoccurring_data_points(x) | Returns the sum of all data points, that are present in the time series more than once.
- sum_of_reoccurring_values(x) | Returns the sum of all values, that are present in the time series more than once.
- sum_values(x) | Calculates the sum over the time series values
- symmetry_looking(x, param) | Boolean variable denoting if the distribution of x looks symmetric.
- time_reversal_asymmetry_statistic(x, lag) | This function calculates the value of
- value_count(x, value) | Count occurrences of value in time series x.
- variance(x) | Returns the variance of x
- variance_larger_than_standard_deviation(x) | Boolean variable denoting if the variance of x is greater than its standard deviation.
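A few of the simpler calculators listed above can be written by hand, which helps when sanity-checking what tsfresh returns. The sketch below is a plain-Python approximation based only on the one-line descriptions in the list. It is not tsfresh's implementation, and the sample series `prices` is made up for illustration.

```python
from statistics import mean


def abs_energy(x):
    """Sum over the squared values of the series."""
    return sum(v * v for v in x)


def absolute_sum_of_changes(x):
    """Sum over the absolute value of consecutive changes."""
    return sum(abs(b - a) for a, b in zip(x, x[1:]))


def count_above_mean(x):
    """Number of values strictly higher than the mean."""
    m = mean(x)
    return sum(1 for v in x if v > m)


def longest_strike_above_mean(x):
    """Length of the longest consecutive run of values above the mean."""
    m = mean(x)
    best = run = 0
    for v in x:
        run = run + 1 if v > m else 0
        best = max(best, run)
    return best


# Hypothetical closing prices, used only to exercise the functions.
prices = [1.0, 2.0, 4.0, 3.0, 5.0]

features = {
    "abs_energy": abs_energy(prices),
    "absolute_sum_of_changes": absolute_sum_of_changes(prices),
    "count_above_mean": count_above_mean(prices),
    "longest_strike_above_mean": longest_strike_above_mean(prices),
}
print(features)
```

Each value in `features` would become one column of the model's input, computed per rolling window of the price or volume series.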
Inspiration
What generally improves a model's score more on average, feature engineering or hyperparameter tuning? Feature engineering, without a doubt.
I have included tsfresh in the platform
https://github.com/produvia/cryptocurrency-trading-platform/commit/91895ad5c9e2eae3b55c04e954851a17c6da4ecd
We could add technical analysis features too (ta-lib). Sounds good to you?
Yes! Let's use technical analysis (ta-lib) as features (see #64) for machine learning.
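As a rough idea of what technical analysis features look like as model inputs, the sketch below computes a simple moving average and a rate of change in plain Python. It does not call ta-lib. The function names, window sizes, and the `closes` series are assumptions for illustration only.

```python
def simple_moving_average(values, window):
    """Trailing mean over `window` points; None until enough history exists."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window : i + 1]) / window)
    return out


def rate_of_change(values, period):
    """Percentage change versus the value `period` steps earlier."""
    out = []
    for i, v in enumerate(values):
        if i < period:
            out.append(None)
        else:
            previous = values[i - period]
            out.append((v - previous) / previous * 100.0)
    return out


# Hypothetical closing prices.
closes = [10.0, 11.0, 12.0, 11.0, 13.0]
sma_3 = simple_moving_average(closes, 3)
roc_1 = rate_of_change(closes, 1)
```

The leading `None` entries mark the warm-up period of each indicator. Those rows would need to be dropped or converted to missing values before training.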
I'm working on adding technical analysis features.
This article could be useful for us in order to add more features. Let me know if you agree, and I will work on this.
I'm working on adding technical analysis features.
I am looking forward to it
This article could be useful for us in order to add more features.
Thanks for sharing this practical article on the enigma data marketplace
Let me know if you agree, and I will work on this.
Let's use Kryptos' existing datasets as features. We already support the Google Trends and Quandl data sources (see #8).
Let's add non-pricing datasets as features. We can use cryptocurrency volume data, Blockchain Info, and Google Search Volume.
I have added some external data sources (Google Search Volume and Blockchain Info) as features for Machine Learning models.
However, I don't fully understand what you mean by cryptocurrency volume data:
We are already using the volume as a feature: https://github.com/produvia/cryptocurrency-trading-platform/blob/49951f284edbc13c77689d5a69ab67a30b59353e/kryptos/platform/strategy/strategy.py#L227
Edit: At this moment Google Search Volume is fine, but the Quandl dataset is unstable. So, we can use:
$ strat -d google -c "bitcoin" -c "btc" -ml xgboost
or
$ strat -ml xgboost -d google -c "bitcoin" -c "btc"
I have added some external data sources (Google Search Volume and Blockchain Info) as features for Machine Learning models.
Excellent! Since we now have multiple machine learning models, let's compare the differences between them in terms of accuracy.
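One simple way to compare models is directional accuracy: the share of periods where the predicted direction of the price move matches the actual one. The sketch below assumes this metric, which the thread does not specify. The model names and numbers are placeholders, not results from Kryptos.

```python
def directional_accuracy(actual, predicted):
    """Fraction of steps where predicted and actual changes share a sign."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("series must be non-empty and of equal length")
    hits = sum(1 for a, p in zip(actual, predicted) if (a > 0) == (p > 0))
    return hits / len(actual)


# Placeholder per-period returns; not real results.
actual_returns = [0.02, -0.01, 0.03, -0.02]
model_outputs = {
    "model_a": [0.01, -0.02, -0.01, -0.01],
    "model_b": [0.01, 0.02, 0.01, 0.01],
}

scores = {
    name: directional_accuracy(actual_returns, preds)
    for name, preds in model_outputs.items()
}
best_model = max(scores, key=scores.get)
print(scores, best_model)
```

Scoring every model on the same held-out period keeps the comparison fair, since the features and the market regime are then identical for each model.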
We are already using the volume as a feature.
Perfect!
At this moment Google Search Volume is fine, but the Quandl dataset is unstable.
Can you clarify what you mean by Quandl dataset being unstable?
There were some bugs when merging the Quandl dataset into the system. They are fixed now. Some examples:
$ strat -ml xgboost -d google -c "bitcoin" -c "btc"
$ strat -ml xgboost -d quandl -c 'MKTCP' -c 'NTRAN'
Excellent work! Now we can combine all of our existing datasets, including:
- Google dataset (see manager.py#L243) (use Google search terms: "btc usd" associated with the btc/usd cryptoasset)
- Quandl datasets (see manager.py#L398) (there are currently 32 datasets)
- pricing & volume datasets (see manager.py#L57)
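Combining these datasets amounts to joining them on a shared time index, so that each row holds pricing, volume, and external columns for the same date. The sketch below shows one way to do that in plain Python. The `combine_by_date` helper, the dictionary layout, and all values are hypothetical and are not taken from manager.py.

```python
def combine_by_date(pricing, *extras):
    """Left-join extra datasets onto the pricing rows, keyed by date string.

    Each dataset maps date -> {column: value}. Dates missing from an extra
    dataset leave that dataset's columns as None.
    """
    extra_columns = [
        sorted({column for cols in extra.values() for column in cols})
        for extra in extras
    ]
    combined = {}
    for date in sorted(pricing):
        row = dict(pricing[date])
        for extra, columns in zip(extras, extra_columns):
            found = extra.get(date, {})
            for column in columns:
                row[column] = found.get(column)
        combined[date] = row
    return combined


# Made-up sample values; only the column names echo the thread.
pricing = {
    "2018-01-01": {"close": 100.0, "volume": 5.0},
    "2018-01-02": {"close": 110.0, "volume": 7.0},
}
google = {"2018-01-01": {"btc usd": 40}}
quandl = {
    "2018-01-01": {"MKTCP": 1.0, "NTRAN": 2.0},
    "2018-01-02": {"MKTCP": 1.1, "NTRAN": 2.5},
}

combined = combine_by_date(pricing, google, quandl)
```

Keeping pricing as the left side of the join means every tradable period survives, and gaps in the external sources show up explicitly as missing values rather than dropped rows.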