CausalTransformer
Help with data structures in a dictionary produced by SyntheticCancerDataset
Good day @Valentyn1997,
I am privileged to explore your excellent paper and its implementation for my thesis work!
My current aim is to transform the Tumor Growth dataset into a tabular format so that I can use it to train another model. However, I am struggling to understand the data structures produced by an instance of SyntheticCancerDataset.
For example, when I run a simple snippet like this:
import pandas as pd
import numpy as np
from src.data.cancer_sim.dataset import SyntheticCancerDatasetCollection
from src.data.cancer_sim.dataset import SyntheticCancerDataset
# Define the parameters
chemo_coeff = 0.5
radio_coeff = 0.5
num_patients = 10
seed = 5
window_size = 15
seq_length = 10
subset_name = 'train'
mode = 'factual'
projection_horizon = 10
lag = 0
cf_seq_mode = 'sliding_treatment'
treatment_mode = 'multiclass'
# Create an instance of the class
df = SyntheticCancerDataset(
    chemo_coeff,
    radio_coeff,
    num_patients,
    window_size,
    seq_length,
    subset_name,
    mode,
    projection_horizon,
    seed,
    lag,
    cf_seq_mode,
    treatment_mode
)
scaling_params = df.get_scaling_params()
df.process_data(scaling_params)
# Get the data for the first patient
first_patient_data = df[0]
print(first_patient_data)
I get a dictionary with multiple arrays of different lengths:
- cancer_volume: 10
- chemo_dosage: 10
- radio_dosage: 10
- chemo_application: 10
- radio_application: 10
- chemo_probabilities: 10
- radio_probabilities: 10
- sequence_lengths: Not an iterable
- death_flags: 10
- recovery_flags: 10
- patient_types: Not an iterable
- prev_treatments: 9
- current_treatments: 9
- current_covariates: 9
- outputs: 9
- active_entries: 9
- unscaled_outputs: 9
- prev_outputs: 9
- static_features: 1
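For reference, a summary like the one above can be gathered with a small helper such as the sketch below. `summarize_lengths` is a hypothetical name of my own, not something from the repository, and the toy dictionary only stands in for `df[0]`.

```python
from typing import Any, Dict, Union


# Hypothetical helper, not part of the CausalTransformer code base.
def summarize_lengths(sample: Dict[str, Any]) -> Dict[str, Union[int, str]]:
    """Map each key to the length of its value, or to a marker if it has no length."""
    summary: Dict[str, Union[int, str]] = {}
    for key, value in sample.items():
        try:
            summary[key] = len(value)
        except TypeError:
            # Scalars (and 0-d arrays) do not support len().
            summary[key] = "Not an iterable"
    return summary


# Toy stand-in for one patient; the values are made up for illustration only.
toy_sample = {
    "cancer_volume": [1.0, 1.2, 0.9],
    "prev_treatments": [[1, 0], [0, 1]],
    "sequence_lengths": 3,
}
length_summary = summarize_lengths(toy_sample)
```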
Could you help me understand why some arrays have 10 items, whereas others have only 9? Similarly, could you give me some pointers on how to transform this dictionary with the data for one patient into a tabular format? I am mainly interested in one-hot encoded covariates for historical radio/chemo application and in the historical tumour volume.
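To make the goal concrete, this is roughly the per-timestep table I have in mind. It is only a sketch under my own assumptions: `patient_to_rows` is a hypothetical helper, it reads only the three full-length arrays (`cancer_volume`, `chemo_application`, `radio_application`), it assumes the two application arrays are 0/1 flags, and the four one-hot column names and their ordering are my choice rather than anything taken from the repository.

```python
from typing import Any, Dict, List, Sequence


# Hypothetical helper, not part of the CausalTransformer code base.
def patient_to_rows(patient: Dict[str, Sequence[float]], patient_id: int = 0) -> List[Dict[str, Any]]:
    """Flatten the full-length per-timestep arrays of one patient into row dicts."""
    rows: List[Dict[str, Any]] = []
    per_step = zip(
        patient["cancer_volume"],
        patient["chemo_application"],
        patient["radio_application"],
    )
    for timestep, (volume, chemo, radio) in enumerate(per_step):
        # Assumption: the application arrays hold 0/1 flags.
        has_chemo = chemo > 0.5
        has_radio = radio > 0.5
        rows.append({
            "patient_id": patient_id,
            "timestep": timestep,
            "cancer_volume": float(volume),
            # One-hot over the four treatment combinations (ordering assumed).
            "treat_none": int(not has_chemo and not has_radio),
            "treat_chemo": int(has_chemo and not has_radio),
            "treat_radio": int(has_radio and not has_chemo),
            "treat_both": int(has_chemo and has_radio),
        })
    # Note: padded or NaN entries beyond the real sequence length are not handled here.
    return rows


# Toy stand-in for df[0]; the values are made up for illustration only.
toy_patient = {
    "cancer_volume": [1.0, 1.2, 0.9],
    "chemo_application": [0, 1, 1],
    "radio_application": [0, 0, 1],
}
rows = patient_to_rows(toy_patient)
```

On the real data I would call it as `pd.DataFrame(patient_to_rows(df[0]))`, but I am unsure how the length-9 arrays should line up with these rows.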
Thank you very much in advance!