ML
Predict market trend up or down, unlabeled?
Hello,
I'm starting out with ML and trying to predict whether a stock will trend up or down based on its history. There are two challenges I cannot seem to solve at the moment.
First, my stock history contains the price, volume and number of trades at a given point in time. I think I need to treat this as unlabeled data, since I have not labeled which trend each datapoint belongs to. Am I correct in this? When I train on the history data I get a warning that labels are missing, so I'm somewhat lost as to how to handle and train on unlabeled data.
Secondly, a timeline is also in play. I do not know how to handle this in the library.
Any help is much appreciated.
Thanks, Bastiaan
Hi @BasvanH thanks for the great question
See this issue regarding time-series datasets https://github.com/RubixML/RubixML/issues/35 - in short, we do not directly support time-series data yet
Since stock price is non-stationary, you will get the best results from an algorithm that directly supports time series
That said ...
Supervised learners such as classifiers and regressors require a training signal in the form of labels
Unsupervised learners such as clusterers and anomaly detectors do not require labels
Your problem can be viewed as a classification problem, in which the prediction is the trend ('up' or 'down'), or as a regression problem, in which the prediction is the direction (+/-) and degree of the trend relative to a baseline (e.g. 0).
Your problem can also potentially fit into a clustering one, in which case, you can try to isolate clusters of up and down trend. You can also use an anomaly detector to predict when a stock is abnormally trending up or down.
So there are multiple ways, and also combinations of methods, that you can go about building a stock predicting system. I would avoid the unsupervised methods for now and focus on the supervised methods to start. Again, you will need a good Labeled dataset.
Are you able to automate the labeling process in any way?
Can you discretize the 'price' variable such that the label is 'up' if the price is above a rolling (windowed) average and 'down' if it is below that moving average?
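To make the idea concrete, here is a minimal sketch of that labeling rule. It is written in plain Python only to keep it short and is not part of the library's API. The function name, the 30-sample window, the choice to average the *previous* prices (excluding the current one), and labeling ties as 'down' are all assumptions made for illustration.

```python
from collections import deque


def label_by_trailing_average(prices, window=30):
    """Label each price 'up' or 'down' against the mean of the previous
    `window` prices. Entries without enough history are labeled None.

    Hypothetical helper: the window size and tie handling are assumptions.
    """
    history = deque(maxlen=window)  # holds at most the last `window` prices
    labels = []

    for price in prices:
        if len(history) == window:
            average = sum(history) / window
            # A price exactly equal to the average is labeled 'down' here.
            labels.append('up' if price > average else 'down')
        else:
            labels.append(None)  # not enough history yet

        history.append(price)  # the oldest price is dropped automatically

    return labels


labels = label_by_trailing_average([1, 2, 3, 4, 3, 2, 1], window=3)
```

The samples labeled None would be dropped before training, since a supervised learner needs a label for every sample.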
Hi @andrewdalpino,
Thank you for taking the time to write such a detailed answer, much appreciated!
I have PHP experience, so I chose your library because I think it's the most advanced and complete one in PHP. Looking at libraries in other languages would mean much more time for me to learn ML. So I will stick with you despite there not being a time-based algorithm yet :-) . You've already done a great job!
So labeling is the way to go. Yes, I can process the history with a moving average and determine the trend based on whether the price is above or below it. I will move ahead and write this part.
First I want to start relatively simple, so with a classifier. Do you have any advice on which one to use?
@BasvanH No problem, welcome to our community!
Are you able to obtain more features for your dataset or do you just have the 3 that you mentioned?
How many samples do you have?
I would recommend starting with either Logistic Regression or Random Forest.
Logistic Regression is a simple linear classifier that has an associated tutorial here. The nice thing about Logistic Regression is that it can be partially trained (implements the Online interface) - thus, you can train it with new data as soon as it comes in. This will help the model to compensate for the fact that the data is non-stationary.
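As a rough illustration of what partial training means, here is a minimal sketch of a logistic regression model updated by stochastic gradient descent, written in plain Python. It is not the library's implementation, and the class name, `partial_fit` method, and learning rate are hypothetical names chosen for this example.

```python
import math


class OnlineLogisticRegression:
    """Toy binary logistic regression trained one sample at a time.

    Illustrative only: names and defaults here are assumptions.
    """

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, sample):
        """Return the estimated probability of the positive class (label 1)."""
        z = self.bias + sum(w * x for w, x in zip(self.weights, sample))
        z = max(-60.0, min(60.0, z))  # clamp to keep exp() from overflowing
        return 1.0 / (1.0 + math.exp(-z))

    def partial_fit(self, samples, labels):
        """Update the existing weights with a new batch; nothing is reset."""
        for sample, label in zip(samples, labels):
            error = self.predict_proba(sample) - label  # log-loss gradient
            self.weights = [
                w - self.learning_rate * error * x
                for w, x in zip(self.weights, sample)
            ]
            self.bias -= self.learning_rate * error


model = OnlineLogisticRegression(n_features=1)
for _ in range(300):
    # Each call stands in for a fresh batch of labeled data arriving.
    model.partial_fit([[1.0], [-1.0]], [1, 0])
```

Because each call only nudges the current weights, the most recent batches have the most influence, which is what lets the model follow a drifting signal.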
Random Forest is a non-linear ensemble method that you can try if you need a more flexible model.
Once you have enough labeled data, make sure to set about 20% of it aside to use as testing data and validate your model. The F Beta metric will give you a good idea as to how well it performs.
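A minimal sketch of the holdout and the metric, in plain Python rather than the library's own classes. The helper names are hypothetical. Splitting chronologically (the newest 20% becomes the test set) is an assumption made here because the data is time ordered, and 'up' is assumed to be the positive class.

```python
def chronological_split(samples, labels, test_ratio=0.2):
    """Hold out the most recent `test_ratio` of the data for testing.

    Assumes samples are already sorted from oldest to newest.
    """
    cut = len(samples) - int(len(samples) * test_ratio)
    return samples[:cut], labels[:cut], samples[cut:], labels[cut:]


def f_beta(actual, predicted, positive='up', beta=1.0):
    """F-beta score for one class: beta > 1 weights recall more heavily,
    beta < 1 weights precision more heavily."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)

    denominator = (1 + beta ** 2) * tp + beta ** 2 * fn + fp
    return (1 + beta ** 2) * tp / denominator if denominator else 0.0


train_x, train_y, test_x, test_y = chronological_split(
    list(range(10)), ['up'] * 10
)
score = f_beta(
    ['up', 'up', 'down', 'down', 'up'],
    ['up', 'down', 'down', 'up', 'up'],
)
```

In this toy example there are 2 true positives, 1 false positive and 1 false negative, so the F1 score works out to 4/6.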
My dataset has a datapoint every full minute and contains the following features:
- start (date of data point)
- open (candle open value)
- high (candle highest value during interval)
- low (candle lowest value during interval)
- close (candle close value)
- volume (trade volume during interval)
- trades (number of trades during interval)
For each datapoint I calculate an SMA over the next 30 datapoints (minutes).
I'm adding trend to my dataset:
- trend (sma > current price = up else down)
I will use trend as my label, but I'm also wondering whether it would make sense to add the difference as a label to indicate how strong the up or down trend is.
- difference (open / sma, indicator of how much up or down trend)
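For reference, the labeling described above could be sketched roughly as follows, in plain Python. The function name is hypothetical, and using a single price series for both the SMA and the comparison is an assumption, since the post does not say which candle value the SMA is computed from.

```python
def label_with_forward_sma(prices, horizon=30):
    """For each price, compare it with the simple moving average of the
    next `horizon` prices.

    Returns one (trend, ratio) pair per price that has a full look-ahead
    window; the last `horizon` prices are skipped.
    """
    rows = []
    for i in range(len(prices) - horizon):
        future = prices[i + 1 : i + 1 + horizon]
        sma = sum(future) / horizon

        trend = 'up' if sma > prices[i] else 'down'
        ratio = prices[i] / sma  # below 1.0 means the average ahead is higher

        rows.append((trend, ratio))
    return rows


rows = label_with_forward_sma([10, 11, 12, 13, 9, 8, 7], horizon=3)
```

Because the SMA looks ahead in time, both columns are only usable as training targets. They cannot be fed in as input features, since they would not be known at prediction time.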
I have my dataset ready, and I'm going to move on to reading up on the Logistic Regression classifier.
Looks like you are well on your way @BasvanH
Keep us updated with your progress and don't hesitate to follow up with questions
Also, given the recent interest (https://github.com/RubixML/RubixML/issues/40, https://github.com/RubixML/RubixML/issues/35), we may start implementing time series features if they will better serve our users
Hey @BasvanH did you have any luck with this? I'm trying to also use the RandomForest algo but the link above is broken and I did not find any examples using this algo on any demo pages.
Here is a link to the current Random Forest documentation
https://docs.rubixml.com/latest/classifiers/random-forest.html