
GBDT training error

Open wqh17101 opened this issue 5 years ago • 48 comments

And I got this (image) when I train.

Originally posted by @wqh17101 in https://github.com/Angel-ML/sona/issues/48#issuecomment-539522052

wqh17101 avatar Oct 08 '19 13:10 wqh17101

My script (image)

wqh17101 avatar Oct 08 '19 13:10 wqh17101

Please confirm whether the data contains only labels with no features.

rachelsunrh avatar Oct 10 '19 07:10 rachelsunrh

@rachelsunrh this is your sample data for multiclass_classification

wqh17101 avatar Oct 10 '19 08:10 wqh17101

Also, there are no records in the dataset that contain only labels with no features.

wqh17101 avatar Oct 10 '19 08:10 wqh17101

Please request smaller resources because the dataset is very small, and check num.class. Maybe you can set:
--driver-memory 1g \
--num-executors 2 \
--executor-cores 1 \
--executor-memory 1g
Please have a try.

rachelsunrh avatar Oct 10 '19 09:10 rachelsunrh
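For context, a minimal sketch of what a full submit command with those settings could look like; the class name, jar, and argument placeholders below are hypothetical, not the actual sona example, so substitute your own:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 1g \
  --num-executors 2 \
  --executor-cores 1 \
  --executor-memory 1g \
  --class <your GBDT runner class> \
  <your sona jar> \
  <your training arguments, including the num.class setting>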

Let me try. Some advice: you could provide your test submit script and dataset in the repo, so that everybody can run the sample and get the same result if nothing is wrong. Documentation is very important for users; please make sure it is right.

wqh17101 avatar Oct 10 '19 09:10 wqh17101

Well, it worked. @rachelsunrh So do you know why it happened?

wqh17101 avatar Oct 10 '19 09:10 wqh17101

Thanks for your advice, I will check the docs later. As for this problem, it happened because some executors had no data.

rachelsunrh avatar Oct 10 '19 09:10 rachelsunrh

Why is my feature_importance empty after training? @rachelsunrh (image)

wqh17101 avatar Oct 10 '19 09:10 wqh17101

Also, it seems that there is no parameter like ml.log.path to save the training log separately from the YARN logs, is there?

wqh17101 avatar Oct 10 '19 09:10 wqh17101

Yes, the training log is not saved to a path. feature_importance may be empty because the number of features is too small; you can try again with another dataset that has more features.

rachelsunrh avatar Oct 10 '19 10:10 rachelsunrh

Is there any way to save the training log? It is hard to read it from the YARN logs, which contain too many system logs that I don't care about. @rachelsunrh

wqh17101 avatar Oct 10 '19 11:10 wqh17101

There was an ml.log.path parameter in the old version of Angel; why did you remove it?

wqh17101 avatar Oct 10 '19 11:10 wqh17101

Also, could you teach me how to calculate and set these parameters?

--driver-memory 1g \
--num-executors 2 \
--executor-cores 1 \
--executor-memory 1g

Maybe you can add the method to the docs. @rachelsunrh

wqh17101 avatar Oct 10 '19 12:10 wqh17101

Now I use a big dataset of about 250 GB with this script (image), and I got this (image).

@rachelsunrh What can I do? 29g per executor is the maximum memory I can get.

wqh17101 avatar Oct 10 '19 12:10 wqh17101

The dataset details (image). I set a larger feature index range to avoid out-of-bounds errors.

wqh17101 avatar Oct 10 '19 12:10 wqh17101

Also, when I set parallel mode = fp and kept the other parameters the same as above, I got a Java NullPointerException.

wqh17101 avatar Oct 11 '19 06:10 wqh17101

And when I set the parameters as in this script (image), I got this (image).

wqh17101 avatar Oct 11 '19 06:10 wqh17101

@wqh17101 the GBDT initialization requires loading the full dataset in memory and turning floating-point feature values into integer histogram bin indexes (binning). Due to the memory management in Spark, we cannot unpersist the original dataset until the binning one is cached. So I am afraid your executor memory cannot support the 250GB dataset.

Maybe you can try to persist the datasets at the memory-and-disk storage level.

ccchengff avatar Oct 11 '19 06:10 ccchengff
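To illustrate the memory pressure described above, here is a simplified Scala sketch (not the actual sona trainer code; the function and variable names are made up): during initialization the raw feature rows and the binned copy are both held until the binned RDD is fully materialized, and only then can the raw data be unpersisted.

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Simplified view of GBDT initialization: the raw data and the binned data
// coexist until the binned copy is cached, which is why memory peaks here.
def binDataset(raw: RDD[Array[Double]], splits: Array[Array[Double]]): RDD[Array[Int]] = {
  val binned = raw.map { row =>
    row.zipWithIndex.map { case (value, j) =>
      // map each floating-point feature value to its histogram bin index
      val idx = java.util.Arrays.binarySearch(splits(j), value)
      if (idx >= 0) idx else -idx - 1
    }
  }
  binned.persist(StorageLevel.MEMORY_AND_DISK) // spill to disk instead of failing
  binned.count()                               // force materialization of the binned copy
  raw.unpersist()                              // only now is it safe to drop the raw data
  binned
}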

So, how do I persist the datasets at the memory-and-disk level? And how much executor memory is needed to support this? Maybe I can try to apply for more resources. @ccchengff

wqh17101 avatar Oct 11 '19 07:10 wqh17101

@wqh17101 Change the cache() to persist(StorageLevel.MEMORY_AND_DISK) here and there

We will also try to support large-scale datasets with two-phase loading.

ccchengff avatar Oct 11 '19 07:10 ccchengff
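In other words, the suggested change at the two linked cache() call sites is just a different storage level. A hedged sketch (the function and parameter names are placeholders, not actual fields of DPGBDTTrainer):

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Before: dataset.cache() pins partitions in memory only (MEMORY_ONLY) and the
// job fails when they do not fit. After: partitions spill to local disk instead.
def cacheWithSpill[T](dataset: RDD[T]): RDD[T] =
  dataset.persist(StorageLevel.MEMORY_AND_DISK)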

So I found that you only recommended modifying DPGBDTTrainer. What about FPGBDTTrainer, should I modify it too? It causes the same problem I mentioned above. @ccchengff

wqh17101 avatar Oct 11 '19 07:10 wqh17101

@wqh17101 yep, change the storage level or apply for more resources

may I ask where the null pointer error occurs?

ccchengff avatar Oct 11 '19 07:10 ccchengff

Sure (image). In the log, the last message printed is 'collect label'.

wqh17101 avatar Oct 11 '19 07:10 wqh17101

It should be the same error: the dataset is not cached.

ccchengff avatar Oct 11 '19 07:10 ccchengff

But it raised a different error. That confuses me.

wqh17101 avatar Oct 11 '19 08:10 wqh17101

(image) (image) Maybe it should be the same error. @ccchengff After I modified the code, dp mode started to hit this error.

wqh17101 avatar Oct 11 '19 08:10 wqh17101

It confuses me: why does increasing the number of executors cause the NullPointerException, while reducing the number of executors causes the memory error?

wqh17101 avatar Oct 11 '19 08:10 wqh17101

@wqh17101 How many training instances do you have?

Besides, can you train with a part of the dataset (e.g., use 10% of the dataset, also decrease the number of executors) and see whether the same error occurs again?

ccchengff avatar Oct 11 '19 08:10 ccchengff
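In case it helps with the suggestion above and with counting instances, a small Scala sketch (trainData is a placeholder for however the training set is loaded, not a sona API):

import org.apache.spark.rdd.RDD

// Count the training instances and draw a ~10% sample for a smaller debugging run.
def inspectAndSample[T](trainData: RDD[T]): RDD[T] = {
  val numInstances = trainData.count() // answers "how many training instances do you have"
  println(s"training instances: $numInstances")
  trainData.sample(withReplacement = false, fraction = 0.1, seed = 42L)
}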

Well, it can load the data now and has started to train when I just reduce the number of executors! But it still failed with a memory error without even completing one training round. (image) (image)

(image)

It seems that too many executors cause the NullPointerException, while modifying the code as you said fixes the memory error.

So it seems they are not the same error. Do you agree? Script: (image) @ccchengff How do I check the number of training instances?

wqh17101 avatar Oct 11 '19 08:10 wqh17101