GBDT training error
When I train, I got an error.
Originally posted by @wqh17101 in https://github.com/Angel-ML/sona/issues/48#issuecomment-539522052
My script:
Please confirm whether the data contains only labels with no features.
@rachelsunrh This is your sample data for multiclass_classification.
Also, there is no instance in the dataset that contains only a label with no features.
Please request smaller resources because the dataset is very small, and check num.class.
Maybe you can set:
--driver-memory 1g \
--num-executors 2 \
--executor-cores 1 \
--executor-memory 1g
Please give it a try.
Let me try. Some advice: you could provide your test submit script and dataset in the repo, so that everybody can run the sample and get the same result if nothing is wrong. Documentation is very important for users; please make sure it is correct.
Well, it worked. @rachelsunrh Do you know why it happened?
Thanks for your advice, I will check the docs later. As for this problem, it happened because some executors had no data.
Why is my feature_importance empty after training? @rachelsunrh
Also, it seems that there is no parameter like ml.log.path to save the training log apart from the YARN logs, is there?
Yes, the training log is not saved to a path. feature_importance may be empty because the number of features is too small; you can try again with another dataset that has more features.
Is there any way to save the training log? It is hard to read it from the YARN logs, which contain too many system logs that I don't care about. @rachelsunrh
There was an ml.log.path parameter in the old version of Angel; why did you remove it?
Also, could you teach me how to calculate and set these parameters?
--driver-memory 1g \
--num-executors 2 \
--executor-cores 1 \
--executor-memory 1g
Maybe you can add that guidance to the docs. @rachelsunrh
Now I am using a big dataset of about 250GB
with this script:
and I got this:
@rachelsunrh What can I do? 29g per executor is the maximum memory I can get.
The dataset details:
I set a larger feature index range to avoid out-of-bounds errors.
Also, when I set parallel mode = fp with the other parameters the same as above, I got a Java NullPointerException.
And when I set the parameters as in this script,
I got this:
@wqh17101 the GBDT initialization requires loading the full dataset in memory and turning floating-point feature values into integer histogram bin indexes (binning). Due to the memory management in Spark, we cannot unpersist the original dataset until the binning one is cached. So I am afraid your executor memory cannot support the 250GB dataset.
Maybe you can try to persist the datasets at the memory-and-disk storage level.
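(This is not the actual sona code, just a minimal Scala sketch of the pattern described above, with hypothetical names for the raw dataset and the split points, to show why both copies occupy storage at the same time during initialization.)

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Hypothetical sketch: the original floating-point dataset has to stay cached
// until the binned (integer histogram index) copy is materialized, so both
// copies occupy storage at the same time during GBDT initialization.
def binDataset(raw: RDD[Array[Double]], splits: Array[Array[Double]]): RDD[Array[Int]] = {
  val binned = raw.map { row =>
    row.zipWithIndex.map { case (value, fid) =>
      // bin index = insertion point of the value among this feature's split points
      val idx = java.util.Arrays.binarySearch(splits(fid), value)
      if (idx >= 0) idx else -idx - 1
    }
  }.persist(StorageLevel.MEMORY_AND_DISK) // spill to disk instead of failing on OOM

  binned.count()  // force materialization of the binned copy first...
  raw.unpersist() // ...only then can the original dataset be released
  binned
}
```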
So, how do I persist the datasets at the memory-and-disk level? And how much executor memory would this need? Maybe I can try to apply for more resources. @ccchengff
@wqh17101
Change the cache() calls
to persist(StorageLevel.MEMORY_AND_DISK)
here and there (see the sketch below).
We will also try to support large-scale datasets with two-phase loading.
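(A hedged sketch of the suggested change: cache() is shorthand for persist(MEMORY_ONLY), so swapping the storage level lets Spark spill partitions to local disk instead of failing; the helper name here is illustrative, the real call sites are inside the trainer.)

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Illustrative helper: where the trainer currently calls data.cache(),
// use MEMORY_AND_DISK so partitions that do not fit in memory spill to disk.
def cacheWithSpill[T](data: RDD[T]): RDD[T] =
  data.persist(StorageLevel.MEMORY_AND_DISK) // was: data.cache()
```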
So I see that you only recommend modifying DPGBDTTrainer. What about FPGBDTTrainer, should I modify it too? Because it causes the same problem I mentioned above. @ccchengff
@wqh17101 Yep, change the storage level or apply for more resources.
May I ask where the null pointer error occurs?
Sure.
In the log, the last message printed is 'collect label'.
It should be the same error: the dataset is not cached.
But it caught a different error, which confuses me.
Maybe it should be the same error? @ccchengff
After I modified the code, DP mode started to throw this error:
It confuses me: why does it throw a NullPointerException when I increase the number of executors, and a memory error when I reduce the number of executors?
@wqh17101 How many training instances do you have?
Besides, can you train with a part of the dataset (e.g., use 10% of the dataset, also decrease the number of executors) and see whether the same error occurs again?
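(If it helps, a minimal Spark sketch of cutting a reproducible 10% sample beforehand; the paths are placeholders and this assumes the input is plain text with one instance per line.)

```scala
import org.apache.spark.sql.SparkSession

// Write a reproducible 10% sample of the raw input so the same submit script
// can be pointed at the smaller file.
val spark = SparkSession.builder().appName("sample-10-percent").getOrCreate()
spark.sparkContext
  .textFile("hdfs:///placeholder/path/to/train")             // placeholder input path
  .sample(withReplacement = false, fraction = 0.1, seed = 42L)
  .saveAsTextFile("hdfs:///placeholder/path/to/train_10pct") // placeholder output path
```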
Well, it can load the data now and has started to train, just by reducing the number of executors!
But it still failed with a memory error without even completing one round of training.
It seems that too many executors causes the NullPointerException, and modifying the code as you said fixes the memory error.
So it seems that they are not the same error. Do you agree?
Script:
@ccchengff
How do I check the number of training instances?
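(For reference, one way to count instances with plain Spark, assuming the input is text with one instance per line; the path is a placeholder.)

```scala
import org.apache.spark.sql.SparkSession

// Count training instances by counting non-empty lines of the input file.
val spark = SparkSession.builder().appName("count-instances").getOrCreate()
val numInstances = spark.sparkContext
  .textFile("hdfs:///placeholder/path/to/train") // placeholder input path
  .filter(_.trim.nonEmpty)
  .count()
println(s"number of training instances: $numInstances")
```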