ProgLearn
add binning capabilities
For each feature, rather than evaluating all possible splits, downsample the data into 128 bins (for example), and then choose the best bin edge to split on.
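To make the proposal concrete, here is a minimal sketch (not ProgLearn code; `candidate_splits_binned` is a hypothetical helper) of how binning shrinks the candidate-split set for one feature, assuming quantile-based bin edges:

```python
import numpy as np

def candidate_splits_binned(x, n_bins=128):
    """Return candidate split thresholds for one feature.

    Instead of considering every midpoint between sorted sample
    values (up to n-1 candidates), downsample the feature into
    quantile bins and only consider the bin edges as thresholds.
    """
    # Quantile-based edges so each bin holds roughly the same
    # number of samples; np.unique drops duplicate edges from ties.
    quantiles = np.linspace(0, 1, n_bins + 1)[1:-1]
    edges = np.unique(np.quantile(x, quantiles))
    return edges

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
splits = candidate_splits_binned(x, n_bins=128)
print(len(splits))  # at most 127 candidate thresholds instead of ~9,999
```

The split search then scans only these edges, which is the core speedup behind histogram-based tree learners.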
Could I be added onto this issue for sprint 1?
CatBoost paper (https://arxiv.org/abs/1706.09516). Also interested in this issue!
Will somebody let me know whether the binning idea is in fact in that paper?
From what I've read so far, CatBoost is their name for the combination of ordered boosting (their modification to standard gradient boosting) and a modified procedure for processing categorical features, so it appears that binning isn't the main idea of the paper.
oh right! ok, the correct paper to read for this is LightGBM: https://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree.pdf
The CatBoost paper has another cool idea called 'folding', which brings forests and deep nets one step closer together. Maybe one of you can work on folding, and the other on binning?
see https://github.com/neurodata/progressive-learning/issues/24
Would the Definition of Done (DoD) for this issue be:
Write a binning function that takes a feature column of X, the labels Y, and the number of bins as arguments, downsamples the feature into bins, and returns the best bin boundary to split on?
I'm now realizing I may need more guidance on this issue. How would the best bin split be calculated? Would we use the same number of bins for all features?
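One plausible reading of the DoD above, sketched as code (hypothetical helper, not part of ProgLearn; this assumes quantile bin edges as the candidate thresholds and weighted Gini impurity as the split criterion, which is one reasonable choice among several):

```python
import numpy as np

def best_binned_split(x, y, n_bins=128):
    """Bin one feature and return the bin edge that minimizes the
    weighted Gini impurity of the two resulting child nodes.

    x : 1-D array of feature values for one column of X
    y : 1-D array of class labels
    """
    # Candidate thresholds: interior quantile bin edges.
    quantiles = np.linspace(0, 1, n_bins + 1)[1:-1]
    edges = np.unique(np.quantile(x, quantiles))

    n = len(y)
    classes = np.unique(y)

    def gini(labels):
        # Gini impurity: 1 - sum of squared class proportions.
        if len(labels) == 0:
            return 0.0
        p = np.array([np.mean(labels == c) for c in classes])
        return 1.0 - np.sum(p ** 2)

    best_edge, best_score = None, np.inf
    for t in edges:
        left, right = y[x <= t], y[x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_edge, best_score = t, score
    return best_edge, best_score

# Usage on a toy two-class problem that is separable near 0:
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 0.5, 500), rng.normal(2.0, 0.5, 500)])
y = np.array([0] * 500 + [1] * 500)
edge, score = best_binned_split(x, y)
print(edge, score)
```

On whether all features share the same bin count: histogram-based libraries typically use one global `n_bins` for every feature, which keeps the implementation simple, so that seems like a reasonable default here too.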