datatable
[FR] Option to add a name to grouping in ``by``, especially for boolean expressions
Instead of the default C0, it would be nice to be able to give the grouping column a relevant name.
Example:
from datatable import dt, f, by

grades = [48, 99, 75, 80, 42, 80, 72, 68, 36, 78]
data = {'ID': ["x%d" % r for r in range(10)],
        'Gender': ['F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'M', 'M'],
        'ExamYear': [2007, 2007, 2007, 2008, 2008,
                     2008, 2008, 2009, 2009, 2009],
        'Class': ['algebra', 'stats', 'bio', 'algebra',
                  'algebra', 'stats', 'stats', 'algebra', 'bio', 'bio'],
        'Participated': ['yes', 'yes', 'yes', 'yes', 'no',
                         'yes', 'yes', 'yes', 'yes', 'yes'],
        'Passed': ['yes' if x > 50 else 'no' for x in grades],
        'Employed': [True, True, True, False,
                     False, False, False, True, True, False],
        'Grade': grades}
df = dt.Frame(data)

df[:, dt.mean(f.Grade), by(f.ExamYear < 2009)]
   | C0 | Grade
---+----+---------
 0 |  0 | 60.6667
 1 |  1 | 70.8571
Suggested form:
df[:, dt.mean(f.Grade), by(name = f.ExamYear < 2009)]
@samukweku What if a name is not provided (as is the case now)? Any suggestions for which column name to use then?
@pradkrish If no name is provided, we fall back to datatable's default names (C0, C1, ...), as in the example shared above.
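The fallback rule discussed in these two comments can be sketched in plain Python. This is not datatable's implementation: resolve_by_names is a hypothetical helper, and numbering the default names by position is an assumption based on the C0, C1, ... pattern mentioned above.

```python
from typing import List, Optional, Sequence


def resolve_by_names(names: Sequence[Optional[str]]) -> List[str]:
    """Return one output column name per grouping expression.

    A user-supplied name is kept as-is; a missing name (None) falls back
    to "C<position>". Name collisions are not handled in this sketch.
    """
    return [name if name is not None else f"C{i}" for i, name in enumerate(names)]


print(resolve_by_names(["before_2009"]))         # name supplied by the user
print(resolve_by_names([None]))                  # no name: default is used
print(resolve_by_names([None, "passed", None]))  # mixed case
```

With this rule, existing code that relies on C0 keeps working, and a name only takes effect when the user supplies one.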