
GBDT training error

Open wqh17101 opened this issue 5 years ago • 48 comments

And I got this (image) when I train.

Originally posted by @wqh17101 in https://github.com/Angel-ML/sona/issues/48#issuecomment-539522052

wqh17101 avatar Oct 08 '19 13:10 wqh17101

My script (image)

wqh17101 avatar Oct 08 '19 13:10 wqh17101

Please confirm whether the data contains only labels with no features.

rachelsunrh avatar Oct 10 '19 07:10 rachelsunrh

@rachelsunrh this is your sample data for multiclass_classification

wqh17101 avatar Oct 10 '19 08:10 wqh17101

Also, there are no records in the dataset that contain only labels with no features.

wqh17101 avatar Oct 10 '19 08:10 wqh17101

Please request smaller resources because the dataset is very small, and check num.class. Maybe you can set:
--driver-memory 1g \
--num-executors 2 \
--executor-cores 1 \
--executor-memory 1g
Please have a try.

rachelsunrh avatar Oct 10 '19 09:10 rachelsunrh
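For context, a minimal sketch of what a full submit command with those settings could look like; the class name, jar, and argument placeholders below are hypothetical, not the actual sona example, so substitute your own:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 1g \
  --num-executors 2 \
  --executor-cores 1 \
  --executor-memory 1g \
  --class <your GBDT runner class> \
  <your sona jar> \
  <your training arguments, including the num.class setting>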

Let me try. Some advice: you could provide your test submit script and dataset in the repo, so that everybody can run the sample and get the same result if nothing is wrong. Documentation is very important for users; please make sure it is right.

wqh17101 avatar Oct 10 '19 09:10 wqh17101

Well, it worked. @rachelsunrh So do you know why it happened?

wqh17101 avatar Oct 10 '19 09:10 wqh17101

Thanks for your advice, I will check the docs later. As for this problem, it happened because some executors had no data.

rachelsunrh avatar Oct 10 '19 09:10 rachelsunrh

Why is my feature_importance empty after training? @rachelsunrh (image)

wqh17101 avatar Oct 10 '19 09:10 wqh17101

Also, it seems that there is no parameter like ml.log.path to save the training log separately from the YARN logs, is there?

wqh17101 avatar Oct 10 '19 09:10 wqh17101

Yes, the training log is not saved to a path. feature_importance may be empty because the number of features is too small; you can try again with another dataset that has more features.

rachelsunrh avatar Oct 10 '19 10:10 rachelsunrh

Is there any way to save the training log? It is hard to read it from the YARN logs, which contain too many system logs that I don't care about. @rachelsunrh

wqh17101 avatar Oct 10 '19 11:10 wqh17101

There was an ml.log.path parameter in the old version of Angel; why did you remove it?

wqh17101 avatar Oct 10 '19 11:10 wqh17101

Also, could you teach me how to calculate and set these parameters?

--driver-memory 1g \
--num-executors 2 \
--executor-cores 1 \
--executor-memory 1g

Maybe you can add the method to the docs. @rachelsunrh

wqh17101 avatar Oct 10 '19 12:10 wqh17101

Now I use a big dataset of about 250 GB with this script (image), and I got this (image).

@rachelsunrh What can I do? 29g per executor is the maximum memory I can get.

wqh17101 avatar Oct 10 '19 12:10 wqh17101

The dataset details (image). I set a larger feature index range to avoid out-of-bounds errors.

wqh17101 avatar Oct 10 '19 12:10 wqh17101

Also, when I set parallel mode = fp and kept the other parameters the same as above, I got a Java NullPointerException.

wqh17101 avatar Oct 11 '19 06:10 wqh17101

And when I set the parameters as in this script (image), I got this (image).

wqh17101 avatar Oct 11 '19 06:10 wqh17101

@wqh17101 the GBDT initialization requires loading the full dataset in memory and turning floating-point feature values into integer histogram bin indexes (binning). Due to the memory management in Spark, we cannot unpersist the original dataset until the binning one is cached. So I am afraid your executor memory cannot support the 250GB dataset.

Maybe you can try to persist the datasets at the memory-and-disk storage level.

ccchengff avatar Oct 11 '19 06:10 ccchengff
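To illustrate the memory pressure described above, here is a simplified Scala sketch (not the actual sona trainer code; the function and variable names are made up): during initialization the raw feature rows and the binned copy are both held until the binned RDD is fully materialized, and only then can the raw data be unpersisted.

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Simplified view of GBDT initialization: the raw data and the binned data
// coexist until the binned copy is cached, which is why memory peaks here.
def binDataset(raw: RDD[Array[Double]], splits: Array[Array[Double]]): RDD[Array[Int]] = {
  val binned = raw.map { row =>
    row.zipWithIndex.map { case (value, j) =>
      // map each floating-point feature value to its histogram bin index
      val idx = java.util.Arrays.binarySearch(splits(j), value)
      if (idx >= 0) idx else -idx - 1
    }
  }
  binned.persist(StorageLevel.MEMORY_AND_DISK) // spill to disk instead of failing
  binned.count()                               // force materialization of the binned copy
  raw.unpersist()                              // only now is it safe to drop the raw data
  binned
}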

So, how do I persist the datasets at the memory-and-disk level? And how much executor memory is needed to support this? Maybe I can try to apply for more resources. @ccchengff

wqh17101 avatar Oct 11 '19 07:10 wqh17101

@wqh17101 Change the cache() to persist(StorageLevel.MEMORY_AND_DISK) here and there

We will also try to support large-scale datasets with two-phase loading.

ccchengff avatar Oct 11 '19 07:10 ccchengff
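In other words, the suggested change at the two linked cache() call sites is just a different storage level. A hedged sketch (the function and parameter names are placeholders, not actual fields of DPGBDTTrainer):

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Before: dataset.cache() pins partitions in memory only (MEMORY_ONLY) and the
// job fails when they do not fit. After: partitions spill to local disk instead.
def cacheWithSpill[T](dataset: RDD[T]): RDD[T] =
  dataset.persist(StorageLevel.MEMORY_AND_DISK)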

So I found that you only recommended modifying DPGBDTTrainer. What about FPGBDTTrainer, should I modify it too? It causes the same problem I mentioned above. @ccchengff

wqh17101 avatar Oct 11 '19 07:10 wqh17101

@wqh17101 yep, change the storage level or apply for more resources

may I ask where the null pointer error occurs?

ccchengff avatar Oct 11 '19 07:10 ccchengff

Sure (image). In the log, the last message printed is 'collect label'.

wqh17101 avatar Oct 11 '19 07:10 wqh17101

It should be the same error: the dataset is not cached.

ccchengff avatar Oct 11 '19 07:10 ccchengff

But it raised a different error. That confuses me.

wqh17101 avatar Oct 11 '19 08:10 wqh17101

(image) (image) Maybe it should be the same error. @ccchengff After I modified the code, dp mode started to hit this error.

wqh17101 avatar Oct 11 '19 08:10 wqh17101

It confuses me: why does increasing the number of executors cause the NullPointerException, while reducing the number of executors causes the memory error?

wqh17101 avatar Oct 11 '19 08:10 wqh17101

@wqh17101 How many training instances do you have?

Besides, can you train with a part of the dataset (e.g., use 10% of the dataset, also decrease the number of executors) and see whether the same error occurs again?

ccchengff avatar Oct 11 '19 08:10 ccchengff
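In case it helps with the suggestion above and with counting instances, a small Scala sketch (trainData is a placeholder for however the training set is loaded, not a sona API):

import org.apache.spark.rdd.RDD

// Count the training instances and draw a ~10% sample for a smaller debugging run.
def inspectAndSample[T](trainData: RDD[T]): RDD[T] = {
  val numInstances = trainData.count() // answers "how many training instances do you have"
  println(s"training instances: $numInstances")
  trainData.sample(withReplacement = false, fraction = 0.1, seed = 42L)
}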

Well, it can load the data now and has started to train when I just reduce the number of executors! But it still failed with a memory error without even completing one training round. (image) (image)

(image)

It seems that too many executors cause the NullPointerException, while modifying the code as you said fixes the memory error.

So it seems they are not the same error. Do you agree? Script: (image) @ccchengff How do I check the number of training instances?

wqh17101 avatar Oct 11 '19 08:10 wqh17101