Generate value for a nullable column with a percentage

Open semisft opened this issue 5 years ago • 3 comments

Some columns must be populated for only a percentage of rows; for example, one field must be 10% filled and another 30% filled in the same profile. For the 10% case I created a field backed by a weighted inSet file and used it in an if statement, but the results come out at roughly 50%. How can I configure this?

percent10.csv

1,10
0,90

profile.json

{
	"fields": [
		{
			"name": "percent10",
			"type": "integer"
		},
		{
			"name": "name",
			"type": "firstname",
			"nullable": true
		}
	],
	"constraints": [
		{
			"field": "percent10",
			"inSet": "percent10.csv"
		},
		{
			"if": {
				"field": "percent10",
				"equalTo": 1
			},
			"then": {
				"field": "name",
				"isNull": false
			},
			"else": {
				"field": "name",
				"isNull": true
			}
		}
	]
}

semisft • Sep 10 '20 06:09

Hi @semisft, this appears to be a bug in DataHelix. I have raised an issue for it here: https://github.com/finos/datahelix/issues/1705

Tom-hayden • Sep 11 '20 11:09

I've run the above profile against the latest version of the code to check whether the issue still exists. A sample of the output (30 rows) is below:

percent10,name
1,Rory
1,Lily
1,Finn
0,
0,
0,
0,
0,
1,Amelia
1,Thea
1,Zara
1,Christina
1,Jake
0,
1,Maya
1,Liam
0,
1,Zac
1,Hamish
0,
0,
0,
0,
1,Lila
0,
0,
0,
1,Frank
0,
1,Phoebe

This shows a 50/50 split between the two values of percent10 (15 rows each), whereas the weights in percent10.csv call for 10% of rows (3 of 30) with 1 and 90% (27 of 30) with 0. The issue is confirmed as still valid; I will investigate further.
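As a quick check of the sample, the sketch below tallies the percent10 column from the 30 rows above and compares it with the split implied by the weights. It assumes each line of percent10.csv is a value,weight pair (so 1 has weight 10 and 0 has weight 90); the variable names are illustrative only.

```python
from collections import Counter

# percent10 column copied from the 30-row sample output above.
observed = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1,
            1, 1, 1, 0, 1, 1, 0, 1, 1, 0,
            0, 0, 0, 1, 0, 0, 0, 1, 0, 1]

# Assumption: percent10.csv maps value -> weight (1 -> 10, 0 -> 90).
weights = {1: 10, 0: 90}

counts = Counter(observed)
total_weight = sum(weights.values())

# Expected number of rows per value if the weights were honoured.
expected = {value: len(observed) * weight / total_weight
            for value, weight in weights.items()}

for value in weights:
    print(f"value {value}: observed {counts[value]}, expected {expected[value]:.0f}")
```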

sl-slaing • Jan 14 '21 09:01

Investigation: In RandomRowSpecDecisionTreeWalker, a list of rowSpecs is generated, representing the kinds of row that can be produced. For this profile they are:

  1. name=not null & in (names) and percent10=not null & in (1)
  2. name=null and percent10=not null & in (0)

The generator then selects randomly between these two items to generate each row. However, the items carry no weighting (which could have been inherited from the weights on the percent10 values), so the generator picks each of the two specs with equal probability and the output is an even split.

One of the following (or something more elegant) would be required:

  1. The items above indicate their weighting, i.e. item1 = 10% and item2 = 90%, and the getRandomRowSpec() method uses it when selecting
  2. The items above are duplicated as many times as needed to create a representative spread, i.e. 9 copies of item2 for every 1 item1, giving a pool of row specs to select from at random
  3. Something else
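The first two options can be sketched as follows. This is a minimal illustration in Python, not DataHelix code: the row specs are stand-in strings, and `pick_uniform`, `pick_weighted` and `expand_by_weight` are hypothetical helpers, with weights assumed to be inherited from percent10.csv.

```python
import random
from functools import reduce
from math import gcd

# Hypothetical stand-ins for the two row specs listed above.
ITEM1 = "percent10 = 1, name not null"
ITEM2 = "percent10 = 0, name null"

# Assumption: each row spec inherits the weight of its percent10 value.
ROW_SPECS = [(ITEM1, 10), (ITEM2, 90)]


def pick_uniform(rng):
    """Behaviour described in the investigation: weights are ignored."""
    return rng.choice(ROW_SPECS)[0]


def pick_weighted(rng):
    """Option 1: select a row spec in proportion to its weight."""
    specs = [spec for spec, _ in ROW_SPECS]
    weights = [weight for _, weight in ROW_SPECS]
    return rng.choices(specs, weights=weights, k=1)[0]


def expand_by_weight(row_specs):
    """Option 2: duplicate each row spec so a uniform pick is representative."""
    # Reduce the weights by their greatest common divisor to keep the pool small.
    divisor = reduce(gcd, (weight for _, weight in row_specs))
    return [spec for spec, weight in row_specs
            for _ in range(weight // divisor)]
```

Over many draws, `pick_weighted` should return item1 about 10% of the time, while `pick_uniform` returns it about 50% of the time, as in the sample output. `expand_by_weight(ROW_SPECS)` yields a pool of ten entries (one item1, nine item2) that can then be sampled uniformly.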

sl-slaing • Jan 14 '21 12:01