Utils Class

The Utils class collects the helper functions used during data preprocessing and when evaluating imputation quality. It handles utility tasks such as data normalization, data scaling, data splitting, and the calculation of evaluation metrics.

Functions

  • create_csv(data, name: str, header)

    Creates a CSV file from the given dataset.

    Steps:

    1. Converts data into a Pandas DataFrame.

    2. Saves the DataFrame as a CSV file with the given name.

    3. Uses the provided header for column naming.
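    The three steps above can be sketched as follows. This is a minimal illustration, not the class's actual code: appending ".csv" to name, dropping the row index, and returning the path are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def create_csv(data, name: str, header):
    """Write `data` to `<name>.csv`, using `header` as the column names."""
    # Steps 1 and 3: wrap the data in a DataFrame with the given column names.
    df = pd.DataFrame(data, columns=header)
    # Step 2: save to disk. Appending ".csv" and omitting the row index
    # are assumptions of this sketch.
    path = f"{name}.csv"
    df.to_csv(path, index=False)
    return path


if __name__ == "__main__":
    toy = np.array([[1.0, 2.0], [3.0, 4.0]])
    print(create_csv(toy, "toy_dataset", ["a", "b"]))
```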

  • create_dist(size: int, dim: int, name: str)

    Generates a synthetic dataset based on a normal distribution.

    Steps:

    1. Creates a size x dim matrix of normally distributed values.

    2. Applies a linear transformation (A matrix) and shift (b vector).

    3. Saves the transformed dataset as a CSV file.
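    A rough sketch of these steps is shown below. How A and b are chosen is not stated here, so the example draws both at random; the seed parameter, the NumPy generator, and the placeholder column names x0, x1, ... are likewise assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def create_dist(size: int, dim: int, name: str, seed: int = 0):
    """Sample `size` rows of `dim` normal features, apply x @ A + b, save to CSV."""
    rng = np.random.default_rng(seed)  # `seed` is a hypothetical extra parameter
    # Step 1: size x dim matrix of standard normal draws.
    x = rng.standard_normal((size, dim))
    # Step 2: affine map. Drawing A and b at random is an assumption of this sketch.
    A = rng.standard_normal((dim, dim))
    b = rng.standard_normal(dim)
    y = x @ A + b  # b is broadcast across all rows
    # Step 3: save the transformed data; the column names are placeholders.
    header = [f"x{j}" for j in range(dim)]
    pd.DataFrame(y, columns=header).to_csv(f"{name}.csv", index=False)
    return y
```

    Because the rows are an affine image of a standard normal sample, the columns of the saved dataset are correlated through A and shifted by b.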

  • create_missing(data, miss_rate: float, name: str, header)

    Generates a dataset with missing values by randomly removing observations.

    Steps:

    1. Initializes a zero matrix mask of the same shape as data.

    2. Iterates over each feature, drawing one uniform random value per row. Entries whose value exceeds miss_rate are flagged to be kept.

        chance = torch.rand(size)
        miss = chance > miss_rate

    3. Stores the flags in column i of the mask, then replaces every entry whose mask value is 0 with NaN.

        mask[:, i] = miss
        missing_data = np.where(mask < 1, np.nan, data)

    4. Saves the new dataset with missing values as a CSV file.
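    The masking logic can be tried in isolation with the sketch below. It is not the class's code: make_missing is a hypothetical helper name, NumPy's random generator stands in for torch.rand, the seed parameter is added for reproducibility, and the CSV-saving step is left out.

```python
import numpy as np


def make_missing(data: np.ndarray, miss_rate: float, seed: int = 0):
    """Return a copy of `data` with roughly `miss_rate` of its entries set to NaN."""
    rng = np.random.default_rng(seed)  # stands in for torch.rand in the text above
    n_rows, n_cols = data.shape
    # Step 1: zero mask with the same shape as the data (1 = keep, 0 = missing).
    mask = np.zeros((n_rows, n_cols))
    for i in range(n_cols):
        # Step 2: one uniform draw per row; values above miss_rate are kept.
        chance = rng.random(n_rows)
        mask[:, i] = chance > miss_rate
    # Step 3: entries with mask value 0 become NaN.
    missing_data = np.where(mask < 1, np.nan, data)
    return missing_data, mask


if __name__ == "__main__":
    toy = np.arange(20, dtype=float).reshape(5, 4)
    corrupted, mask = make_missing(toy, miss_rate=0.3)
    print(corrupted)
```

    Each entry is dropped independently, so the observed fraction of NaN values only approaches miss_rate on large datasets.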

  • create_output(data, path: str, override: int)

    Handles the output file generation, either overriding or appending new data.

    Steps:

    1. If override is 1, saves data as a new CSV file.

    2. Otherwise, if the file exists, reads it and appends the new data columns.

    3. Concatenates the updated DataFrame and saves it.
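    The override and append branches might look like the sketch below. The steps do not say what happens when override is not 1 and the file is missing, so writing a new file in that case is an assumption, as is returning the saved DataFrame.

```python
import os

import pandas as pd


def create_output(data, path: str, override: int):
    """Write `data` to `path`, or append it as new columns to an existing CSV."""
    new = pd.DataFrame(data)
    if override == 1 or not os.path.exists(path):
        # Step 1: start a fresh file. Treating a missing file like an
        # override is an assumption of this sketch.
        new.to_csv(path, index=False)
        return new
    # Step 2: read what is already stored.
    old = pd.read_csv(path)
    # Step 3: place the new columns next to the old ones and save.
    combined = pd.concat([old, new], axis=1)
    combined.to_csv(path, index=False)
    return combined
```

    For example, calling create_output({"run1": [0.9, 0.7]}, "lossD.csv", 1) followed by create_output({"run2": [0.8, 0.6]}, "lossD.csv", 0) leaves a file with the two columns run1 and run2 (the column names are made up for this example).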

  • output()

    Stores multiple outputs related to training, including metrics and system resource usage.

    Steps:

    1. Calls create_csv to save imputed data.

    2. Calls create_output for:
      • Discriminator loss (lossD.csv).

      • Generator loss (lossG.csv).

      • Training and test loss (lossMSE_train.csv, lossMSE_test.csv).

      • System performance logs (cpu.csv, ram.csv, ram_percentage.csv).

  • sample_idx(m, n)

    Generates a random sample of n indices from m elements.

    Steps:

    1. Creates a random permutation of integers from 0 to m-1.

    2. Selects the first n elements as the sampled indices.

    3. Returns the selected indices.
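    A minimal NumPy sketch of this sampling scheme follows. The optional seed parameter is an addition for reproducibility and is not part of the documented signature.

```python
import numpy as np


def sample_idx(m: int, n: int, seed=None):
    """Return `n` distinct indices drawn uniformly from 0..m-1."""
    rng = np.random.default_rng(seed)  # `seed` is a hypothetical extra parameter
    # Step 1: random permutation of the integers 0..m-1.
    perm = rng.permutation(m)
    # Steps 2 and 3: the first n entries form the sample.
    return perm[:n]


if __name__ == "__main__":
    batch = sample_idx(100, 8, seed=0)
    print(batch)
```

    Slicing a permutation gives sampling without replacement, so no index appears twice in a batch.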

  • build_protein_matrix(tsv_file)

    Processes a TSV file containing proteomics data and restructures it into a matrix.

    Steps:

    1. Reads the TSV file, skipping initial metadata rows.

    2. Extracts relevant columns (protein, sample_accession, ribaq).

    3. Converts data into a pivoted matrix (proteins as index, samples as columns).

    4. Returns the formatted matrix.
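    The restructuring can be sketched with pandas as below. How the metadata rows are recognized is not specified, so the example assumes they start with "#"; it also assumes each (protein, sample_accession) pair occurs at most once, since DataFrame.pivot raises an error on duplicates.

```python
import pandas as pd


def build_protein_matrix(tsv_file):
    """Pivot long-format proteomics rows into a protein x sample matrix."""
    # Step 1: read the TSV. Skipping lines that begin with "#" is an
    # assumption about how the metadata rows are marked.
    df = pd.read_csv(tsv_file, sep="\t", comment="#")
    # Step 2: keep only the columns needed for the matrix.
    df = df[["protein", "sample_accession", "ribaq"]]
    # Step 3: proteins become the index, samples the columns.
    matrix = df.pivot(index="protein", columns="sample_accession", values="ribaq")
    # Step 4: return the matrix; unmeasured combinations are NaN.
    return matrix
```

    Protein and sample pairs that are absent from the file appear as NaN in the result, which is the form an imputation method expects as input.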

  • handle_parquet(parquet_file)

    Reads a Parquet file.

    Steps:

    1. Uses polars to read the Parquet file.

    2. Processes the data in the file.

    3. Returns the resulting DataFrame.