evomap.preprocessing

Module for data pre-processing, including transformations between different data formats.

Functions

diss2sim(diss_mat[, transformation, eps])

Transform a dissimilarity matrix to a similarity matrix.

sim2diss(sim_mat[, transformation, eps])

Transform a similarity matrix to a dissimilarity matrix.

coocc2sim(coocc_mat)

Transform a matrix with co-occurrence counts to a similarity matrix.

edgelist2matrix(df, score_var, id_var_i, id_var_j[, ...])

Transform an edgelist to a relationship matrix.

edgelist2matrices(df, score_var, id_var_i, id_var_j, ...)

Transform a time-indexed edgelist into a sequence of relationship matrices.

normalize_diss_mat(D)

Normalize a dissimilarity matrix by the maximum dissimilarity observed in the matrix.

normalize_diss_mats(D_ts)

Normalize a sequence of dissimilarity matrices by a common factor (the maximum dissimilarity within the sequence).

expand_matrices(X_ts, labels_ts)

Expand a list of similarity matrices (X_ts) to equal shape and calculate inclusion vectors.

calc_distances(X[, metric])

Calculate matrix of pairwise distances among the rows of an input matrix.

Module Contents

evomap.preprocessing.diss2sim(diss_mat, transformation='inverse', eps=0.001)

Transform a dissimilarity matrix to a similarity matrix.

Parameters:
  • diss_mat (ndarray of shape (n_samples, n_samples)) – Matrix of pairwise dissimilarities.

  • transformation (str, optional) – Transformation function, either ‘inverse’ or ‘mirror’, by default ‘inverse’

  • eps (float, optional) – Incremental constant to avoid division by zero, by default 1e-3

Returns:

Matrix of pairwise similarities.

Return type:

ndarray of shape (n_samples, n_samples)

evomap.preprocessing.sim2diss(sim_mat, transformation='inverse', eps=0.0001)

Transform a similarity matrix to a dissimilarity matrix.

Parameters:
  • sim_mat (ndarray of shape (n_samples, n_samples)) – Matrix of pairwise similarities.

  • transformation (str, optional) – Transformation function, either ‘inverse’ or ‘mirror’, by default ‘inverse’. ‘inverse’ - Transforms by taking the reciprocal of the similarity scores. ‘mirror’ - Transforms by reflecting the similarity scores about 0.5 (1 - similarity).

  • eps (float, optional) – Incremental constant to avoid division by zero, by default 1e-4

Returns:

Matrix of pairwise dissimilarities.

Return type:

ndarray of shape (n_samples, n_samples)
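The two transformations described for sim2diss can be illustrated with a small NumPy sketch. This is not evomap's source code: the helper name sim2diss_sketch is hypothetical, and the exact placement of eps (added to the denominator before taking the reciprocal) is an assumption based on its description as a constant that avoids division by zero.

```python
import numpy as np

def sim2diss_sketch(sim_mat, transformation="inverse", eps=1e-4):
    """Hypothetical stand-in for illustration; not the library's implementation."""
    sim_mat = np.asarray(sim_mat, dtype=float)
    if transformation == "inverse":
        # Assumption: eps is added to the denominator to avoid division by zero.
        return 1.0 / (sim_mat + eps)
    if transformation == "mirror":
        # Documented behavior: 1 - similarity.
        return 1.0 - sim_mat
    raise ValueError(f"Unknown transformation: {transformation!r}")

S = np.array([[1.0, 0.8, 0.2],
              [0.8, 1.0, 0.5],
              [0.2, 0.5, 1.0]])

D_mirror = sim2diss_sketch(S, "mirror")    # diagonal becomes 0
D_inverse = sim2diss_sketch(S, "inverse")  # low similarity maps to high dissimilarity
```

Note that 'mirror' only yields non-negative dissimilarities when the similarities lie in [0, 1], whereas 'inverse' works for any non-negative scale.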

evomap.preprocessing.coocc2sim(coocc_mat)

Transform a matrix with co-occurrence counts to a similarity matrix.

Parameters:

coocc_mat (ndarray of shape (n_samples, n_samples)) – Matrix of co-occurrence counts. Assumes a non-negative, symmetric matrix where the diagonal can be ignored (usually representing self-co-occurrence).

Returns:

Matrix of pairwise similarities, normalized such that each element is a proportion of the maximum co-occurrence for that row.

Return type:

ndarray of shape (n_samples, n_samples)

Notes

The function sets the diagonal to zero to prevent self-similarity from skewing the results. If a row’s total co-occurrence is zero, it sets the entire row’s similarity to zero to avoid division by zero.
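One way to read the description above is the following sketch. It is an interpretation, not the library's code: the helper name coocc2sim_sketch is hypothetical, and dividing each row by its maximum off-diagonal count is an assumption taken from the Returns text.

```python
import numpy as np

def coocc2sim_sketch(coocc_mat):
    """Hypothetical illustration of row-wise normalization of co-occurrence counts."""
    C = np.array(coocc_mat, dtype=float)  # np.array copies, so the input stays untouched
    np.fill_diagonal(C, 0.0)              # ignore self-co-occurrence
    row_max = C.max(axis=1, keepdims=True)
    # Rows without any co-occurrence stay all zero instead of dividing by zero.
    safe_max = np.where(row_max > 0, row_max, 1.0)
    return C / safe_max

C = np.array([[5, 2, 4, 0],
              [2, 7, 1, 0],
              [4, 1, 3, 0],
              [0, 0, 0, 9]])
sim = coocc2sim_sketch(C)
```

Under this reading the result is generally not symmetric, because each row is scaled by its own maximum.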

evomap.preprocessing.edgelist2matrix(df, score_var, id_var_i, id_var_j, time_var=None, time_selected=None)

Transform an edgelist to a relationship matrix.

Parameters:
  • df (DataFrame) – Data containing the edgelist. Each row should include a pair. Must contain two id variables and a score variable. Can optionally include a time variable.

  • score_var (string) – The score variable.

  • id_var_i (string) – The first id variable.

  • id_var_j (string) – The second id variable.

  • time_var (string, optional) – The time variable, by default None.

  • time_selected (int, optional) – The selected time, by default None.

Returns:

  • S (ndarray of shape (n_samples, n_samples)) – A matrix of pairwise relationships.

  • ids (ndarray) – Identifiers for each element of the matrix.

Raises:

ValueError – If required columns are missing in the DataFrame.
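The conversion from an edgelist to a matrix can be sketched with pandas and NumPy as follows. This is a minimal illustration under stated assumptions, not the library's implementation: the helper name edgelist2matrix_sketch and the column names firm_a, firm_b, score, and year are hypothetical, the relationship is assumed to be undirected (each score is written to both [i, j] and [j, i]), and pairs absent from the edgelist are assumed to be zero.

```python
import numpy as np
import pandas as pd

def edgelist2matrix_sketch(df, score_var, id_var_i, id_var_j,
                           time_var=None, time_selected=None):
    """Hypothetical illustration; mirrors the documented signature only."""
    required = {score_var, id_var_i, id_var_j}
    if time_var is not None:
        required.add(time_var)
    missing = required - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")
    if time_var is not None:
        # Assumption: keep only the rows of the selected period.
        df = df[df[time_var] == time_selected]
    ids = np.unique(np.concatenate([df[id_var_i].to_numpy(),
                                    df[id_var_j].to_numpy()]))
    pos = {label: k for k, label in enumerate(ids)}
    S = np.zeros((len(ids), len(ids)))
    for i, j, score in zip(df[id_var_i], df[id_var_j], df[score_var]):
        S[pos[i], pos[j]] = score
        S[pos[j], pos[i]] = score  # assumption: undirected relationship
    return S, ids

edges = pd.DataFrame({
    "firm_a": ["A", "A", "B", "A"],
    "firm_b": ["B", "C", "C", "B"],
    "score": [0.9, 0.4, 0.7, 0.6],
    "year": [2020, 2020, 2020, 2021],
})
S, ids = edgelist2matrix_sketch(edges, "score", "firm_a", "firm_b",
                                time_var="year", time_selected=2020)
```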

evomap.preprocessing.edgelist2matrices(df, score_var, id_var_i, id_var_j, time_var)

Transform a time-indexed edgelist into a sequence of relationship matrices.

Parameters:
  • df (DataFrame) – Data containing the edgelist. Each row should include a pair and must contain two id variables, a score variable, and a time variable.

  • score_var (string) – The score variable used to assign values in the matrix.

  • id_var_i (string) – The first id variable corresponding to rows in the matrix.

  • id_var_j (string) – The second id variable corresponding to columns in the matrix.

  • time_var (string) – The time variable used to split the data into different matrices.

Returns:

  • S_t (list of ndarray) – A list of relationship matrices, each corresponding to a different time period.

  • ids_t (ndarray) – Array of identifiers for each element of the matrices.

Raises:

ValueError – If the DataFrame is missing any required columns or if there are no valid entries for any time period.

evomap.preprocessing.normalize_diss_mat(D)

Normalize a dissimilarity matrix by the maximum dissimilarity observed in the matrix.

Parameters:

D (ndarray of shape (n_samples, n_samples)) – A dissimilarity matrix.

Returns:

Normalized dissimilarity matrix.

Return type:

ndarray of shape (n_samples, n_samples)

Raises:

ValueError – If the input matrix is not square or if the maximum dissimilarity is zero.
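The documented behavior (division by the largest entry, with the two error conditions listed above) can be sketched as follows. The helper name normalize_diss_mat_sketch is hypothetical and the code is an illustration, not the library's source.

```python
import numpy as np

def normalize_diss_mat_sketch(D):
    """Hypothetical illustration of max-normalization of one dissimilarity matrix."""
    D = np.asarray(D, dtype=float)
    if D.ndim != 2 or D.shape[0] != D.shape[1]:
        raise ValueError("D must be a square matrix.")
    max_diss = D.max()
    if max_diss == 0:
        raise ValueError("Maximum dissimilarity is zero; cannot normalize.")
    return D / max_diss

D = np.array([[0.0, 2.0, 4.0],
              [2.0, 0.0, 1.0],
              [4.0, 1.0, 0.0]])
D_norm = normalize_diss_mat_sketch(D)  # largest entry becomes 1.0
```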

evomap.preprocessing.normalize_diss_mats(D_ts)

Normalize a sequence of dissimilarity matrices by a common factor (the maximum dissimilarity within the sequence).

Parameters:

D_ts (list of ndarray, each of shape (n_samples, n_samples)) – Sequence of dissimilarity matrices.

Returns:

Sequence of dissimilarity matrices, normalized by the maximum dissimilarity within the input sequence.

Return type:

list of ndarray

Raises:

ValueError – If any matrix is not square or if the list is empty.
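Using one common factor for the whole sequence keeps the relative scale across periods intact, which per-matrix normalization would erase. A minimal sketch, with the hypothetical helper name normalize_diss_mats_sketch; the guard against an all-zero sequence is an assumption and is not listed among the documented errors.

```python
import numpy as np

def normalize_diss_mats_sketch(D_ts):
    """Hypothetical illustration of normalizing a sequence by its overall maximum."""
    if len(D_ts) == 0:
        raise ValueError("D_ts must contain at least one matrix.")
    mats = [np.asarray(D, dtype=float) for D in D_ts]
    for D in mats:
        if D.ndim != 2 or D.shape[0] != D.shape[1]:
            raise ValueError("Every matrix in D_ts must be square.")
    common_max = max(D.max() for D in mats)
    if common_max == 0:
        # Assumption: treated as an error here to avoid dividing by zero.
        raise ValueError("All dissimilarities are zero; cannot normalize.")
    return [D / common_max for D in mats]

D_1 = np.array([[0.0, 2.0], [2.0, 0.0]])
D_2 = np.array([[0.0, 4.0], [4.0, 0.0]])
D_1n, D_2n = normalize_diss_mats_sketch([D_1, D_2])
```

Here the first period's maximum ends up at 0.5 and the second's at 1.0, so the growth in dissimilarity between the periods remains visible.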

evomap.preprocessing.expand_matrices(X_ts, labels_ts)

Expand a list of similarity matrices (X_ts) to equal shape and calculate inclusion vectors.

Parameters:
  • X_ts (list of ndarray) – List of similarity matrices for each time point.

  • labels_ts (list of list) – List of labels corresponding to each matrix in X_ts.

Returns:

A tuple containing the list of expanded similarity matrices, the inclusion vectors, and all labels.

Return type:

tuple
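The idea of expanding matrices to a common shape can be sketched as below. Several details are assumptions rather than documented behavior: the helper name expand_matrices_sketch is hypothetical, the combined label set is sorted, entries for absent objects are filled with zeros, and inclusion vectors are encoded as 0/1 integers.

```python
import numpy as np

def expand_matrices_sketch(X_ts, labels_ts):
    """Hypothetical illustration; fill value and label order are assumptions."""
    all_labels = sorted(set().union(*labels_ts))
    pos = {label: k for k, label in enumerate(all_labels)}
    n = len(all_labels)
    expanded, inclusions = [], []
    for X, labels in zip(X_ts, labels_ts):
        idx = np.array([pos[label] for label in labels])
        X_full = np.zeros((n, n))
        # Copy the observed block into the rows/columns of its labels.
        X_full[np.ix_(idx, idx)] = np.asarray(X, dtype=float)
        inc = np.zeros(n, dtype=int)
        inc[idx] = 1  # 1 = object present in this period
        expanded.append(X_full)
        inclusions.append(inc)
    return expanded, inclusions, all_labels

X_1 = np.array([[1.0, 0.3],
                [0.3, 1.0]])
X_2 = np.array([[1.0, 0.4, 0.6],
                [0.4, 1.0, 0.2],
                [0.6, 0.2, 1.0]])
X_exp, inc_ts, labels = expand_matrices_sketch(
    [X_1, X_2], [["A", "B"], ["A", "B", "C"]])
```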

evomap.preprocessing.calc_distances(X, metric='euclidean')

Calculate matrix of pairwise distances among the rows of an input matrix.

Parameters:
  • X (ndarray of shape (n_samples, n_dims)) – Input matrix containing samples for which pairwise distances will be calculated.

  • metric (str, optional) – The distance metric to use. Can be any of those supported by scipy.spatial.distance.pdist, such as ‘euclidean’, ‘cityblock’, ‘cosine’, etc. Defaults to ‘euclidean’.

Returns:

A matrix of pairwise distances, where each element (i, j) is the distance between the i-th and j-th rows of the input matrix X according to the specified metric.

Return type:

ndarray of shape (n_samples, n_samples)

Raises:

ValueError – If the metric specified is not supported by scipy.spatial.distance.pdist.
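Because the metric is passed on to scipy.spatial.distance.pdist, the result can be reproduced directly with SciPy. The sketch below assumes the function is equivalent to converting pdist's condensed output into a square matrix with squareform; that equivalence is an assumption, not confirmed by the text above.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [6.0, 8.0]])

# pdist returns a condensed vector; squareform turns it into an (n, n) matrix.
D_euclidean = squareform(pdist(X, metric="euclidean"))
D_cityblock = squareform(pdist(X, metric="cityblock"))
```

In this example the Euclidean distance between the first two rows is 5.0 and the city-block distance is 7.0.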