How to use GenerativeProteomics

If you simply want to impute a dataset, the most straightforward way to use GenerativeProteomics is to run:

python generativeproteomics.py -i /path/to/file_to_impute.csv

Running it this way triggers two separate training phases.

  1. Evaluation run:

    In this run, a percentage of the values (10% by default) is concealed during the training phase and the dataset is then imputed. The RMSE (Root Mean Square Error) is calculated using the hidden values as targets. At the end of the training phase, a test_imputed.csv file is created containing the original hidden values and their imputed counterparts, which gives you an estimate of the imputation accuracy.

  2. Imputation run:

    Afterwards, a full training phase takes place using the entire dataset, and an imputed.csv file is created containing the imputed dataset.
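The evaluation idea can be illustrated with a minimal sketch in plain Python. It is not the GenerativeProteomics implementation: the helper names (conceal, impute_column_mean, rmse) are hypothetical, and a simple column-mean fill stands in for the trained model. It only shows how concealing known values makes it possible to score an imputation.

```python
import math
import random


def conceal(data, fraction=0.1, seed=0):
    """Hide a fraction of the observed (non-None) cells.

    Returns a masked copy of the data and the list of hidden (row, col) cells.
    """
    rng = random.Random(seed)
    observed = [(r, c) for r, row in enumerate(data)
                for c, v in enumerate(row) if v is not None]
    hidden = rng.sample(observed, max(1, round(fraction * len(observed))))
    masked = [list(row) for row in data]
    for r, c in hidden:
        masked[r][c] = None
    return masked, hidden


def impute_column_mean(data):
    """Stand-in imputer (assumption): fill each missing cell with its column mean."""
    means = []
    for c in range(len(data[0])):
        col = [row[c] for row in data if row[c] is not None]
        means.append(sum(col) / len(col) if col else 0.0)
    return [[means[c] if v is None else v for c, v in enumerate(row)]
            for row in data]


def rmse(original, imputed, cells):
    """Root mean square error over the given (row, col) cells."""
    squared = [(original[r][c] - imputed[r][c]) ** 2 for r, c in cells]
    return math.sqrt(sum(squared) / len(squared))


# Toy dataset; None marks a value that is genuinely missing.
data = [
    [1.0, 2.0, None],
    [1.5, None, 3.0],
    [2.0, 2.5, 3.5],
    [2.5, 3.0, 4.0],
]
masked, hidden = conceal(data, fraction=0.1)
imputed = impute_column_mean(masked)
print(f"hidden cells: {hidden}, RMSE: {rmse(data, imputed, hidden):.3f}")
```

Only the concealed cells enter the error, because those are the only imputed positions whose true values are known.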

However, you may want to change some of the arguments. You can do this with a parameters.json file (an example is available in GenerativeProteomics/breast/parameters.json) or by setting them directly on the command line.

Run with a parameters.json file:

python generativeproteomics.py --parameters /path/to/parameters.json

Run with command line arguments:

python generativeproteomics.py -i /path/to/file_to_impute.csv -o imputed_name --ofolder ./results/ --it 2001

Arguments:

  • -i: Path to file to impute

  • -o: Name of imputed file

  • --ofolder: Path to the output folder

  • --it: Number of iterations to train the model

  • --miss: The fraction of values to be concealed during the evaluation run (from 0 to 1)

  • --outall: Set this argument to 1 if you want to output every metric

  • --override: Set this argument to 1 if you want to delete the previously created files when writing the new output

  • --model: Choose the model to use (None if GenerativeProteomics, otherwise provide name of the pre-trained model)
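If you launch runs from a script, the arguments above can be assembled programmatically. The following is a small sketch: build_command is a hypothetical helper, not part of GenerativeProteomics, and it assumes generativeproteomics.py is in the current working directory. It uses only flags listed above and leaves out any option you do not set, so the tool's defaults apply.

```python
import shlex
import sys


def build_command(input_path, output_name=None, output_folder=None,
                  iterations=None, miss=None, script="generativeproteomics.py"):
    """Assemble the argument list for a run; unset options are omitted."""
    if miss is not None and not 0 <= miss <= 1:
        raise ValueError("--miss must be between 0 and 1")
    cmd = [sys.executable, script, "-i", str(input_path)]
    options = [("-o", output_name), ("--ofolder", output_folder),
               ("--it", iterations), ("--miss", miss)]
    for flag, value in options:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


cmd = build_command("/path/to/file_to_impute.csv", output_name="imputed_name",
                    output_folder="./results/", iterations=2001, miss=0.2)
print(shlex.join(cmd))
# To actually start the run: subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids quoting problems when paths contain spaces.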

If you want to assess the accuracy of the imputation, you can provide a reference file containing a complete version of the dataset (without missing values):

python generativeproteomics.py -i /path/to/file_to_impute.csv --ref /path/to/complete_dataset.csv

Running it this way calculates the RMSE of the imputation relative to the complete dataset.
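As a rough illustration of such a comparison, the sketch below scores an imputed table against a complete reference. It is not the tool's own code: rmse_against_reference is a hypothetical name, and restricting the error to the cells that were missing in the input is an assumption about how the comparison is made.

```python
import math


def rmse_against_reference(incomplete, imputed, reference):
    """RMSE between imputed and reference values at the cells missing from the input."""
    errors = [
        (imputed[r][c] - reference[r][c]) ** 2
        for r, row in enumerate(incomplete)
        for c, value in enumerate(row)
        if value is None  # score only the positions that had to be imputed
    ]
    if not errors:
        raise ValueError("the input has no missing cells to score")
    return math.sqrt(sum(errors) / len(errors))


incomplete = [[1.0, None], [None, 4.0]]   # file to impute (None = missing)
imputed = [[1.0, 2.5], [2.0, 4.0]]        # output of an imputation run
reference = [[1.0, 2.0], [3.0, 4.0]]      # complete dataset
print(round(rmse_against_reference(incomplete, imputed, reference), 4))
```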