
add binning capabilities

Open · jovo opened this issue on Jul 31 '20 · 7 comments

For each feature, rather than evaluating all possible split points, downsample the data into a fixed number of bins (e.g., 128) and then choose the best bin boundary to split on.

jovo · Jul 31 '20
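A minimal sketch of the binning idea above, in Python. The function names, the equal-width binning via `np.histogram_bin_edges`, and the Gini criterion are illustrative assumptions here, not ProgLearn's actual splitter API:

```python
import numpy as np

def gini(labels, classes):
    """Gini impurity of a 1-D array of class labels."""
    p = np.array([(labels == c).mean() for c in classes])
    return 1.0 - np.sum(p ** 2)

def best_binned_split(x, y, n_bins=128):
    """Illustrative sketch: pick a split threshold for one feature by
    scanning bin edges instead of every midpoint between sorted unique
    values."""
    # Candidate thresholds: interior edges of an equal-width histogram.
    edges = np.histogram_bin_edges(x, bins=n_bins)[1:-1]
    classes = np.unique(y)

    best_edge, best_score = None, np.inf
    for t in edges:
        left, right = y[x <= t], y[x > t]
        if left.size == 0 or right.size == 0:
            continue  # degenerate split, skip
        # Weighted Gini impurity of the two children; lower is better.
        score = (left.size * gini(left, classes)
                 + right.size * gini(right, classes)) / y.size
        if score < best_score:
            best_edge, best_score = t, score
    return best_edge, best_score
```

Compared with checking every unique value of the feature, this caps the number of candidate thresholds at n_bins - 1 per feature, which is where the speedup comes from.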

Could I be added to this issue for sprint 1?

emilyachang · Sep 10 '20

CatBoost paper: https://arxiv.org/abs/1706.09516. Also interested in this issue!

p-teng · Sep 10 '20

Will somebody let me know whether the binning idea is in fact in that paper?

jovo · Sep 10 '20

From what I've read so far, CatBoost is their name for the combination of ordered boosting (their modification to standard gradient boosting) and a modified procedure for processing categorical features, so it appears that binning isn't the main idea of the paper.

p-teng · Sep 11 '20

Oh right! OK, the correct paper to read for this is LightGBM: https://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree.pdf

The CatBoost paper has another cool idea called 'folding', which brings forests and deep nets one step closer together. Maybe one of you can work on folding, and the other on binning?

jovo · Sep 11 '20

see https://github.com/neurodata/progressive-learning/issues/24

jovo · Sep 11 '20

Would the DoD (definition of done) for this issue be:

Write a binning function that takes a feature column of X, the labels Y, and a bin count as arguments, downsamples the feature via bin means, and returns the best bin boundary to split on?

I'm now realizing I may need more guidance on this issue. How would the best bin split be calculated? Would we use the same number of bins for all features?

p-teng · Sep 25 '20
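One possible reading of that DoD, as a rough sketch only (the function name `best_binned_split_over_features` and the reuse of the hypothetical `best_binned_split` helper sketched earlier in the thread are assumptions, not guidance from the maintainers): apply the same `n_bins` to every feature and keep the lowest-impurity boundary.

```python
import numpy as np

def best_binned_split_over_features(X, y, n_bins=128):
    """Apply the same bin count to every feature and return the
    (feature index, boundary, score) triple with the lowest weighted
    impurity. Relies on the hypothetical best_binned_split(x, y, n_bins)
    helper sketched earlier in this thread."""
    best_j, best_boundary, best_score = None, None, np.inf
    for j in range(X.shape[1]):
        boundary, score = best_binned_split(X[:, j], y, n_bins=n_bins)
        if boundary is not None and score < best_score:
            best_j, best_boundary, best_score = j, boundary, score
    return best_j, best_boundary, best_score
```

Whether `n_bins` should vary per feature is left open here; this sketch just uses one shared value for every column.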