Utils Class
========================

.. automodule:: ProtoGain.utils
   :members:
   :undoc-members:
   :show-inheritance:

Here you will find the class `Utils` used during the process of data preprocessing and evaluation of the imputation quality.
The `Utils` class is responsible for handling various utility functions such as data normalization, data scaling, data splitting, 
and evaluation metrics calculation.

Functions
----------


- create_csv(data, name: str, header)

    Creates a CSV file from the given dataset.

    **Steps:**

    1. Converts data into a Pandas DataFrame.
    2. Saves the DataFrame as a CSV file with the given name.
    3. Uses the provided header for column naming.

- create_dist(size: int, dim: int, name: str)

    Generates a synthetic dataset based on a normal distribution.

    **Steps:**

    1. Creates a size x dim matrix of normally distributed values.
    2. Applies a linear transformation (A matrix) and shift (b vector).
    3. Saves the transformed dataset as a CSV file.

- create_missing(data, miss_rate: float, name: str, header)

    Generates a dataset with missing values by randomly removing observations.

    **Steps:**

    1. Initializes a zero matrix mask of the same shape as data.
    2. Iterates over each feature, generating a probability for missing values.

    .. code-block:: python

        chance = torch.rand(size)

        miss = chance > miss_rate

    3. Masks values according to miss_rate, replacing them with NaN.

    .. code-block:: python

        mask[:, i] = miss

        missing_data = np.where(mask < 1, np.nan, data)

    4. Saves the new dataset with missing values as a CSV file.

- create_output(data, path: str, override: int)

    Handles the output file generation, either overriding or appending new data.

    **Steps:**

    1. If override is 1, saves data as a new CSV file.
    2. Otherwise, if the file exists, reads it and appends the new data columns.
    3. Concatenates the updated DataFrame and saves it.

- output()

    Stores multiple outputs related to training, including metrics and system resource usage.

    **Steps:**

    1. Calls create_csv to save imputed data.
    2. Calls create_output for:
        - Discriminator loss (lossD.csv).
        - Generator loss (lossG.csv).
        - Training and test loss (lossMSE_train.csv, lossMSE_test.csv).
        - System performance logs (cpu.csv, ram.csv, ram_percentage.csv).

- sample_idx(m, n)

    Generates a random sample of n indices from m elements.

    **Steps:**

    1. Creates a random permutation of integers from 0 to m-1.
    2. Selects the first n elements as the sampled indices.
    3. Returns the selected indices.

- build_protein_matrix(tsv_file)

    Processes a TSV file containing proteomics data and restructures it into a matrix.

    **Steps:**

    1. Reads the TSV file, skipping initial metadata rows.
    2. Extracts relevant columns (protein, sample_accession, ribaq).
    3. Converts data into a pivoted matrix (proteins as index, samples as columns).
    4. Returns the formatted matrix.

- handle_parquet(parquet_file)

    Reads a Parquet file.

    **Steps**

    1. Use polars to read the Parquet file.
    2. Process the data in the file 
    3. Return the resulting DataFrame.