How to use GenerativeProteomics
If you simply want to impute a dataset, the most straightforward way to use GenerativeProteomics is to run:
python generativeproteomics.py -i /path/to/file_to_impute.csv
Running it this way triggers two separate training phases.
- Evaluation run:
In this run, a percentage of the values (10% by default) is concealed during the training phase and the dataset is then imputed. The RMSE (Root Mean Square Error) is calculated using those hidden values as targets. At the end of the training phase, a test_imputed.csv file is created containing the original hidden values and the corresponding imputed values. This gives you an estimate of the imputation accuracy.
- Imputation run:
Afterwards, the model is trained on the entire dataset, and an imputed.csv file containing the imputed dataset is created.
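The evaluation run described above can be sketched in plain Python to show how concealing values yields an accuracy estimate. This is only an illustration of the idea, not the tool's actual code: the function names are hypothetical, and a simple column-mean imputer stands in for the GenerativeProteomics model.

```python
import math
import random


def conceal_values(data, miss_rate=0.1, seed=0):
    """Hide a fraction of the observed entries.

    Returns a masked copy of the data and a dict mapping each hidden
    (row, column) cell to its true value.
    """
    rng = random.Random(seed)
    observed = [(r, c) for r, row in enumerate(data)
                for c, value in enumerate(row) if value is not None]
    n_hidden = int(round(miss_rate * len(observed)))
    masked = [list(row) for row in data]
    targets = {}
    for r, c in rng.sample(observed, n_hidden):
        targets[(r, c)] = masked[r][c]
        masked[r][c] = None
    return masked, targets


def impute_column_mean(data):
    """Placeholder imputer (NOT the GenerativeProteomics model)."""
    means = []
    for c in range(len(data[0])):
        column = [row[c] for row in data if row[c] is not None]
        means.append(sum(column) / len(column) if column else 0.0)
    return [[means[c] if value is None else value
             for c, value in enumerate(row)] for row in data]


def rmse(imputed, targets):
    """Root mean square error over the concealed cells only."""
    squared = [(imputed[r][c] - true) ** 2 for (r, c), true in targets.items()]
    return math.sqrt(sum(squared) / len(squared))


# Toy dataset: 20 rows, 5 columns, one genuinely missing value.
data = [[float(r + c) for c in range(5)] for r in range(20)]
data[0][0] = None

masked, targets = conceal_values(data, miss_rate=0.1)
imputed = impute_column_mean(masked)
score = rmse(imputed, targets)
print(f"hidden cells: {len(targets)}, RMSE on hidden cells: {score:.3f}")
```

Because the hidden values are known, the error measured on them serves as a proxy for the error on the values that are truly missing.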
However, you may want to change some of the arguments. You can do this with a parameters.json file (an example is available in GenerativeProteomics/breast/parameters.json) or by passing them directly on the command line.
Run with a parameters.json file:
python generativeproteomics.py --parameters /path/to/parameters.json
Run with command line arguments:
python generativeproteomics.py -i /path/to/file_to_impute.csv -o imputed_name --ofolder ./results/ --it 2001
Arguments:
-i: Path to file to impute
-o: Name of imputed file
--ofolder: Path to the output folder
--it: Number of iterations to train the model
--miss: The percentage of values to be concealed during the evaluation run (from 0 to 1)
--outall: Set this argument to 1 if you want to output every metric
--override: Set this argument to 1 if you want to delete the previously created files when writing the new output
--model: Choose the model to use (None if GenerativeProteomics, otherwise provide name of the pre-trained model)
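If you launch runs from a script, the arguments listed above can be assembled programmatically. The sketch below is a hypothetical helper (`build_command` is not part of GenerativeProteomics) that uses only the flags documented here. It assumes the script is called as `python generativeproteomics.py` from the repository folder.

```python
import shlex


def build_command(input_path, output_name=None, output_folder=None,
                  iterations=None, miss=None, outall=False,
                  override=False, model=None,
                  script="generativeproteomics.py"):
    """Assemble the argument list for a GenerativeProteomics run.

    Hypothetical helper: it only maps keyword arguments to the
    command-line flags listed in this guide.
    """
    cmd = ["python", script, "-i", str(input_path)]
    if output_name is not None:
        cmd += ["-o", str(output_name)]
    if output_folder is not None:
        cmd += ["--ofolder", str(output_folder)]
    if iterations is not None:
        cmd += ["--it", str(int(iterations))]
    if miss is not None:
        # The guide gives the range of --miss as 0 to 1.
        if not 0 <= miss <= 1:
            raise ValueError("miss must be between 0 and 1")
        cmd += ["--miss", str(miss)]
    if outall:
        cmd += ["--outall", "1"]
    if override:
        cmd += ["--override", "1"]
    if model is not None:
        cmd += ["--model", str(model)]
    return cmd


cmd = build_command("/path/to/file_to_impute.csv",
                    output_name="imputed_name",
                    output_folder="./results/",
                    iterations=2001)
print(shlex.join(cmd))
# To actually launch the run:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the command as a list rather than a single string avoids quoting problems when paths contain spaces.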
If you want to assess the accuracy of the imputation, you may provide a reference file containing a complete version of the dataset (without missing values):
python generativeproteomics.py -i /path/to/file_to_impute.csv --ref /path/to/complete_dataset.csv
Running it this way calculates the RMSE of the imputation relative to the complete dataset.
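As a rough illustration of what such a comparison involves, the sketch below computes the RMSE between an imputed dataset and a complete reference. The function name is hypothetical, and restricting the error to the cells that were missing in the original file is an assumption; the tool may aggregate the error differently.

```python
import math


def rmse_vs_reference(original, imputed, reference):
    """RMSE over the cells that were missing (None) in the original data.

    Assumption: only originally missing cells count towards the error.
    """
    errors = [
        (imputed[r][c] - reference[r][c]) ** 2
        for r, row in enumerate(original)
        for c, value in enumerate(row)
        if value is None
    ]
    if not errors:
        raise ValueError("the original dataset has no missing values")
    return math.sqrt(sum(errors) / len(errors))


original = [[1.0, None], [None, 4.0]]   # dataset with gaps
imputed = [[1.0, 2.5], [2.0, 4.0]]      # output of an imputation
reference = [[1.0, 2.0], [3.0, 4.0]]    # complete ground truth

score = rmse_vs_reference(original, imputed, reference)
print(f"RMSE against the reference: {score:.3f}")
```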