evomap.preprocessing
====================

.. py:module:: evomap.preprocessing

.. autoapi-nested-parse::

   Module for data pre-processing, including transformations between different data formats.


Functions
---------

.. autoapisummary::

   evomap.preprocessing.diss2sim
   evomap.preprocessing.sim2diss
   evomap.preprocessing.coocc2sim
   evomap.preprocessing.edgelist2matrix
   evomap.preprocessing.edgelist2matrices
   evomap.preprocessing.normalize_diss_mat
   evomap.preprocessing.normalize_diss_mats
   evomap.preprocessing.expand_matrices
   evomap.preprocessing.calc_distances


Module Contents
---------------

.. py:function:: diss2sim(diss_mat, transformation='inverse', eps=0.001)

   Transform a dissimilarity matrix to a similarity matrix.

   :param diss_mat: Matrix of pairwise dissimilarities.
   :type diss_mat: ndarray of shape (n_samples, n_samples)
   :param transformation: Transformation function, either 'inverse' or 'mirror', by default 'inverse'
   :type transformation: str, optional
   :param eps: Incremental constant to avoid division by zero, by default 1e-3
   :type eps: float, optional

   :returns: Matrix of pairwise similarities.
   :rtype: ndarray of shape (n_samples, n_samples)


.. py:function:: sim2diss(sim_mat, transformation='inverse', eps=0.0001)

   Transform a similarity matrix to a dissimilarity matrix.

   :param sim_mat: Matrix of pairwise similarities.
   :type sim_mat: ndarray of shape (n_samples, n_samples)
   :param transformation: Transformation function, either 'inverse' or 'mirror', by default 'inverse'.
                          'inverse' - Transforms by taking the reciprocal of the similarity scores.
                          'mirror' - Transforms by reflecting the similarity scores about 0.5 (1 - similarity).
   :type transformation: str, optional
   :param eps: Incremental constant to avoid division by zero, by default 1e-4
   :type eps: float, optional

   :returns: Matrix of pairwise dissimilarities.
   :rtype: ndarray of shape (n_samples, n_samples)
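
The two transformations described above can be sketched in NumPy as follows. This is an illustrative reading of the parameter descriptions, not evomap's implementation: the helper name ``sim2diss_sketch`` is hypothetical, and adding ``eps`` to the denominator is an assumption about how division by zero is avoided.

```python
import numpy as np

def sim2diss_sketch(sim_mat, transformation="inverse", eps=1e-4):
    """Hypothetical stand-in illustrating the two documented transformations."""
    sim_mat = np.asarray(sim_mat, dtype=float)
    if transformation == "inverse":
        # Reciprocal of the similarity; eps keeps the denominator non-zero
        # (where eps enters is an assumption, not taken from the library).
        return 1.0 / (sim_mat + eps)
    if transformation == "mirror":
        # Reflect about 0.5, i.e. 1 - similarity.
        return 1.0 - sim_mat
    raise ValueError("transformation must be 'inverse' or 'mirror'")

sim = np.array([[1.0, 0.8, 0.0],
                [0.8, 1.0, 0.5],
                [0.0, 0.5, 1.0]])
diss_mirror = sim2diss_sketch(sim, transformation="mirror")
diss_inverse = sim2diss_sketch(sim, transformation="inverse")
```

Under this reading, 'mirror' suits similarities already bounded in [0, 1], while 'inverse' also handles unbounded scores but maps zero similarities to very large dissimilarities.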


.. py:function:: coocc2sim(coocc_mat)

   Transform a matrix with co-occurrence counts to a similarity matrix.

   :param coocc_mat: Matrix of co-occurrence counts. Assumes a non-negative, symmetric matrix where
                     the diagonal can be ignored (usually representing self-co-occurrence).
   :type coocc_mat: ndarray of shape (n_samples, n_samples)

   :returns: Matrix of pairwise similarities, normalized such that each element is a proportion
             of the maximum co-occurrence for that row.
   :rtype: ndarray of shape (n_samples, n_samples)

   .. rubric:: Notes

   The function sets the diagonal to zero to prevent self-similarity from skewing the results.
   If a row's total co-occurrence is zero, it sets the entire row's similarity to zero to avoid
   division by zero.


.. py:function:: edgelist2matrix(df, score_var, id_var_i, id_var_j, time_var=None, time_selected=None)

   Transform an edgelist to a relationship matrix.

   :param df: Data containing the edgelist. Each row should include a pair. Must contain
              two id variables and a score variable. Can optionally include a time variable.
   :type df: DataFrame
   :param score_var: The score variable.
   :type score_var: string
   :param id_var_i: The first id variable.
   :type id_var_i: string
   :param id_var_j: The second id variable.
   :type id_var_j: string
   :param time_var: The time variable, by default None.
   :type time_var: string, optional
   :param time_selected: The selected time, by default None.
   :type time_selected: int, optional

   :returns: * **S** (*ndarray of shape (n_samples, n_samples)*) -- A matrix of pairwise relationships.
             * **ids** (*ndarray*) -- Identifiers for each element of the matrix.

   :raises ValueError: If required columns are missing in the DataFrame.
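
To make the expected input and output concrete, the sketch below builds a small edgelist and converts it to a matrix by hand. The column names (``firm_i``, ``firm_j``, ``score``) and the helper ``edgelist_to_matrix_sketch`` are hypothetical. Filling unobserved pairs with zero and sorting the identifiers are assumptions made for illustration, not documented behavior.

```python
import numpy as np
import pandas as pd

# Toy edgelist; the column names are made up for this example.
df = pd.DataFrame({
    "firm_i": ["A", "A", "B"],
    "firm_j": ["B", "C", "C"],
    "score": [0.9, 0.2, 0.5],
})

def edgelist_to_matrix_sketch(df, score_var, id_var_i, id_var_j):
    """Hypothetical helper: place each pair's score at position [i, j]."""
    missing = {score_var, id_var_i, id_var_j} - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")
    # Sorted union of all identifiers appearing on either side of a pair.
    ids = np.unique(np.concatenate([df[id_var_i].to_numpy(),
                                    df[id_var_j].to_numpy()]))
    pos = {label: k for k, label in enumerate(ids)}
    S = np.zeros((len(ids), len(ids)))
    for i, j, score in zip(df[id_var_i], df[id_var_j], df[score_var]):
        S[pos[i], pos[j]] = score  # pairs absent from the edgelist stay 0
    return S, ids

S, ids = edgelist_to_matrix_sketch(df, "score", "firm_i", "firm_j")
```

Going by the signature above, the corresponding library call would pass the same column names as ``score_var``, ``id_var_i`` and ``id_var_j``, plus ``time_var`` and ``time_selected`` when the edgelist covers several periods.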


.. py:function:: edgelist2matrices(df, score_var, id_var_i, id_var_j, time_var)

   Transform a time-indexed edgelist into a sequence of relationship matrices.

   :param df: Data containing the edgelist. Each row should include a pair and must contain
              two id variables, a score variable, and a time variable.
   :type df: DataFrame
   :param score_var: The score variable used to assign values in the matrix.
   :type score_var: string
   :param id_var_i: The first id variable corresponding to rows in the matrix.
   :type id_var_i: string
   :param id_var_j: The second id variable corresponding to columns in the matrix.
   :type id_var_j: string
   :param time_var: The time variable used to split the data into different matrices.
   :type time_var: string

   :returns: * **S_t** (*list of ndarray*) -- A list of relationship matrices, each corresponding to a different time period.
             * **ids_t** (*ndarray*) -- Array of identifiers for each element of the matrices.

   :raises ValueError: If the DataFrame is missing any required columns or if there are no valid entries for any time period.


.. py:function:: normalize_diss_mat(D)

   Normalize a dissimilarity matrix by the maximum dissimilarity observed in the matrix.

   :param D: A dissimilarity matrix.
   :type D: ndarray of shape (n_samples, n_samples)

   :returns: Normalized dissimilarity matrix.
   :rtype: ndarray of shape (n_samples, n_samples)

   :raises ValueError: If the input matrix is not square or if the maximum dissimilarity is zero.


.. py:function:: normalize_diss_mats(D_ts)

   Normalize a sequence of dissimilarity matrices by a common factor
   (the maximum dissimilarity within the sequence).

   :param D_ts: Sequence of dissimilarity matrices.
   :type D_ts: list of ndarray, each of shape (n_samples, n_samples)

   :returns: Sequence of dissimilarity matrices, normalized by the maximum dissimilarity within
             the input sequence.
   :rtype: list of ndarray

   :raises ValueError: If any matrix is not square or if the list is empty.
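
Normalizing by a common factor, as described above, can be sketched like this. The helper ``normalize_diss_mats_sketch`` is a hypothetical stand-in written from the docstring, not the library's code.

```python
import numpy as np

def normalize_diss_mats_sketch(D_ts):
    """Hypothetical helper: divide every matrix by the largest value in the sequence."""
    if len(D_ts) == 0:
        raise ValueError("D_ts must contain at least one matrix")
    for D in D_ts:
        if D.ndim != 2 or D.shape[0] != D.shape[1]:
            raise ValueError("all matrices must be square")
    # One shared factor keeps dissimilarities comparable across periods.
    max_diss = max(D.max() for D in D_ts)
    return [D / max_diss for D in D_ts]

D_1 = np.array([[0.0, 2.0], [2.0, 0.0]])
D_2 = np.array([[0.0, 4.0], [4.0, 0.0]])
D_norm = normalize_diss_mats_sketch([D_1, D_2])
```

Here both matrices are divided by 4, so the first keeps a maximum of 0.5. Normalizing each matrix separately would scale both to a maximum of 1 and hide the change between periods.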


.. py:function:: expand_matrices(X_ts, labels_ts)

   Expand a list of similarity matrices (X_ts) to equal shape and calculate inclusion vectors.

   :param X_ts: List of similarity matrices for each time point.
   :type X_ts: list of ndarray
   :param labels_ts: List of labels corresponding to each matrix in X_ts.
   :type labels_ts: list of list

   :returns: A tuple containing the list of expanded similarity matrices, the inclusion vectors, and all labels.
   :rtype: tuple
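
One plausible reading of this expansion is sketched below. Everything here is an assumption made for illustration: the helper name ``expand_matrices_sketch``, the use of the sorted union of labels as the common index, zero-filling for objects absent in a period, and 0/1 inclusion vectors.

```python
import numpy as np

def expand_matrices_sketch(X_ts, labels_ts):
    """Hypothetical helper: pad each matrix to the union of all labels."""
    all_labels = sorted(set().union(*labels_ts))
    pos = {label: k for k, label in enumerate(all_labels)}
    n = len(all_labels)
    X_expanded, inclusions = [], []
    for X, labels in zip(X_ts, labels_ts):
        idx = np.array([pos[label] for label in labels])
        X_full = np.zeros((n, n))
        # Copy the observed block into the rows/columns of the present labels.
        X_full[np.ix_(idx, idx)] = X
        inc = np.zeros(n, dtype=int)
        inc[idx] = 1  # 1 = object present in this period, 0 = absent
        X_expanded.append(X_full)
        inclusions.append(inc)
    return X_expanded, inclusions, all_labels

X_1 = np.array([[1.0, 0.3],
                [0.3, 1.0]])
X_2 = np.array([[1.0, 0.4, 0.1],
                [0.4, 1.0, 0.6],
                [0.1, 0.6, 1.0]])
X_exp, inc_ts, all_labels = expand_matrices_sketch(
    [X_1, X_2], [["a", "b"], ["a", "b", "c"]]
)
```

In this sketch the first period lacks object ``c``, so its expanded matrix gains an all-zero row and column and its inclusion vector marks ``c`` as absent.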


.. py:function:: calc_distances(X, metric='euclidean')

   Calculate matrix of pairwise distances among the rows of an input matrix.

   :param X: Input matrix containing samples for which pairwise distances will be calculated.
   :type X: ndarray of shape (n_samples, n_dims)
   :param metric: The distance metric to use. Can be any of those supported by `scipy.spatial.distance.pdist`,
                  such as 'euclidean', 'cityblock', 'cosine', etc. Defaults to 'euclidean'.
   :type metric: str, optional

   :returns: A matrix of pairwise distances, where each element (i, j) is the distance
             between the i-th and j-th rows of the input matrix X according to the specified metric.
   :rtype: ndarray of shape (n_samples, n_samples)

   :raises ValueError: If the metric specified is not supported by `scipy.spatial.distance.pdist`.
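
The documented output (a square, symmetric matrix of row-to-row distances) can be reproduced directly with SciPy, as in the minimal sketch below. Whether ``calc_distances`` is implemented this way internally is an assumption. Only ``pdist`` and ``squareform`` from ``scipy.spatial.distance`` are used.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [6.0, 8.0]])

# pdist returns the condensed upper triangle of pairwise distances;
# squareform expands it to a full (n_samples, n_samples) matrix.
D = squareform(pdist(X, metric="euclidean"))
```

Swapping ``metric="euclidean"`` for another name accepted by ``pdist``, such as ``"cityblock"`` or ``"cosine"``, changes the distance definition without changing the output shape.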