time-series-machine-learning why shuffling data?

Hello, nice and interesting work, I learned a lot. While building the training and test datasets, why are you shuffling the data? I thought that with time series we should not shuffle the data.

data_utils.py

    def split_dataset(dataset, ratio=None):
        size = dataset.size
        if ratio is None:
            ratio = _choose_optimal_train_ratio(size)

        mask = np.zeros(size, dtype=np.bool_)
        train_size = int(size * ratio)
        mask[:train_size] = True
        np.random.shuffle(mask)

        train_x = dataset.x[mask, :]
        train_y = dataset.y[mask]

        mask = np.invert(mask)
        test_x = dataset.x[mask, :]
        test_y = dataset.y[mask]

        return DataSet(train_x, train_y), DataSet(test_x, test_y)

Regards,

Apr 20 '19 07:04 ochoch

Hi @ochoch I think you're right. At that time I thought it was a good idea to shuffle the data, but now I'd say it leads to overfitting and forward-looking bias.
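For illustration only, a minimal sketch of what a non-shuffled split could look like. This is not code from the repository: `split_dataset_chronological` is a hypothetical name, the `DataSet` namedtuple is a stand-in for the project's own class, and the fixed `ratio=0.8` default is an assumption. The idea is simply that the first part of the series is used for training and the rest for testing, so no test row precedes a training row.

```python
from collections import namedtuple

# Hypothetical stand-in for the repository's DataSet class (assumption).
DataSet = namedtuple("DataSet", ["x", "y"])


def split_dataset_chronological(dataset, ratio=0.8):
    """Split without shuffling: the first `ratio` of rows train, the rest test."""
    size = len(dataset.y)
    train_size = int(size * ratio)
    # Slicing keeps time order, so every test row comes after every training row.
    train = DataSet(dataset.x[:train_size], dataset.y[:train_size])
    test = DataSet(dataset.x[train_size:], dataset.y[train_size:])
    return train, test


# Toy data: ten time-ordered rows, one feature each.
data = DataSet(x=[[t] for t in range(10)], y=list(range(10)))
train, test = split_dataset_chronological(data, ratio=0.8)
```

Because it only slices, the same function should also work on NumPy arrays, which is what the original `split_dataset` appears to operate on.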

Apr 20 '19 09:04 maxim5

Hi Maxim, thanks for your reply. I played a bit with your implementation and added a provider (FXCM) using fxcmpy (https://github.com/fxcm/RestAPI/tree/master/fxcmpy).

In the end, since connecting to the FXCM servers is time-consuming and they do not deliver the last bar (!), I integrated your Python scripts with MT4. On each tick I export the latest data to a CSV file (this replaces the get_latest_data method) and load it into the raw_df dataframe with read_csv. Then I run predict.py, get the prediction for the next bar, and draw the result on a chart...

[image: image.png]

At this stage, I am also calculating some accuracy figures... And to be honest, it is quite hard to get tradable predictions...

In forward testing I get roughly the following accuracy:

| TF  | High Accuracy (%) | Low Accuracy (%) |
|-----|-------------------|------------------|
| m15 | 57.25             | 56.29            |
| H4  | 56.25             | 63.55            |
| D1  | 65.63             | 57.29            |
| W1  | 52.08             | 58.33            |

Maybe we should add some additional features, chosen with a feature selection algorithm. Any insights?

Regards,

och

Apr 24 '19 13:04 ochoch

Hi @ochoch sorry for the delay.

Unfortunately that's the way it is: there is so much noise and so little signal in financial data. If you are able to find a reliable signal more than 50% accurate, it's good enough and you can make money.
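One rough way to check whether an observed accuracy is distinguishable from a coin flip is an exact one-sided binomial tail probability. The sketch below is an illustration, not something used in the thread: `p_value_above_chance` is a hypothetical helper, and the sample sizes in the usage lines are made up, since the number of forward-test bars is not stated here.

```python
from math import comb


def p_value_above_chance(hits, n, p0=0.5):
    """One-sided exact binomial tail: P(X >= hits) for X ~ Binomial(n, p0)."""
    return sum(comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(hits, n + 1))


# Hypothetical sample sizes (assumption): the same 57% hit rate on 100 vs 1000 bars.
print(p_value_above_chance(57, 100))
print(p_value_above_chance(570, 1000))
```

The same hit rate gives a much smaller tail probability on the larger sample, so the number of forward-tested bars matters as much as the percentage itself.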

In terms of features: that's the key question. All ML algorithms that make money boil down to features. I haven't worked much on crypto data since then. Do you have any ideas in mind?

May 16 '19 13:05 maxim5
