evomap.preprocessing
Module for data pre-processing, including transformations between different data formats.
Functions
| diss2sim | Transform a dissimilarity matrix to a similarity matrix. |
| sim2diss | Transform a similarity matrix to a dissimilarity matrix. |
| coocc2sim | Transform a matrix with co-occurrence counts to a similarity matrix. |
| edgelist2matrix | Transform an edgelist to a relationship matrix. |
| edgelist2matrices | Transform a time-indexed edgelist into a sequence of relationship matrices. |
| normalize_diss_mat | Normalize a dissimilarity matrix by the maximum dissimilarity observed in the matrix. |
| normalize_diss_mats | Normalize a sequence of dissimilarity matrices by a common factor. |
| expand_matrices | Expand a list of similarity matrices (X_ts) to equal shape and calculate inclusion vectors. |
| calc_distances | Calculate matrix of pairwise distances among the rows of an input matrix. |
Module Contents
- evomap.preprocessing.diss2sim(diss_mat, transformation='inverse', eps=0.001)
Transform a dissimilarity matrix to a similarity matrix.
- Parameters:
diss_mat (ndarray of shape (n_samples, n_samples)) – Matrix of pairwise dissimilarities.
transformation (str, optional) – Transformation function, either ‘inverse’ or ‘mirror’, by default ‘inverse’
eps (float, optional) – Incremental constant to avoid division by zero, by default 1e-3
- Returns:
Matrix of pairwise similarities.
- Return type:
ndarray of shape (n_samples, n_samples)
- evomap.preprocessing.sim2diss(sim_mat, transformation='inverse', eps=0.0001)
Transform a similarity matrix to a dissimilarity matrix.
- Parameters:
sim_mat (ndarray of shape (n_samples, n_samples)) – Matrix of pairwise similarities.
transformation (str, optional) – Transformation function, either ‘inverse’ or ‘mirror’, by default ‘inverse’. ‘inverse’ - Transforms by taking the reciprocal of the similarity scores. ‘mirror’ - Transforms by reflecting the similarity scores about 0.5 (1 - similarity).
eps (float, optional) – Incremental constant to avoid division by zero, by default 1e-4
- Returns:
Matrix of pairwise dissimilarities.
- Return type:
ndarray of shape (n_samples, n_samples)
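The two transformations described above can be sketched in plain NumPy. This illustrates the documented behaviour and is not evomap's implementation: the helper name `sim2diss_sketch` is hypothetical, and adding `eps` to the denominator is an assumption about how the constant is applied.

```python
import numpy as np

def sim2diss_sketch(sim_mat, transformation="inverse", eps=1e-4):
    """Hypothetical helper mirroring the documented options of sim2diss."""
    sim_mat = np.asarray(sim_mat, dtype=float)
    if transformation == "inverse":
        # Reciprocal of the similarities; eps guards against division by zero
        # (assumed here to be added to the denominator).
        return 1.0 / (sim_mat + eps)
    if transformation == "mirror":
        # Reflect about 0.5, i.e. 1 - similarity.
        return 1.0 - sim_mat
    raise ValueError("transformation must be 'inverse' or 'mirror'")

sim = np.array([[1.0, 0.8, 0.0],
                [0.8, 1.0, 0.5],
                [0.0, 0.5, 1.0]])
diss_mirror = sim2diss_sketch(sim, transformation="mirror")
diss_inverse = sim2diss_sketch(sim, transformation="inverse")
```

With 'mirror', similarities in [0, 1] stay in [0, 1]; with 'inverse', pairs with zero similarity receive very large dissimilarities whose size depends on `eps`.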
- evomap.preprocessing.coocc2sim(coocc_mat)
Transform a matrix with co-occurrence counts to a similarity matrix.
- Parameters:
coocc_mat (ndarray of shape (n_samples, n_samples)) – Matrix of co-occurrence counts. Assumes a non-negative, symmetric matrix where the diagonal can be ignored (usually representing self-co-occurrence).
- Returns:
Matrix of pairwise similarities, normalized such that each element is a proportion of the maximum co-occurrence for that row.
- Return type:
ndarray of shape (n_samples, n_samples)
Notes
The function sets the diagonal to zero to prevent self-similarity from skewing the results. If a row’s total co-occurrence is zero, it sets the entire row’s similarity to zero to avoid division by zero.
- evomap.preprocessing.edgelist2matrix(df, score_var, id_var_i, id_var_j, time_var=None, time_selected=None)
Transform an edgelist to a relationship matrix.
- Parameters:
df (DataFrame) – Data containing the edgelist. Each row should include a pair. Must contain two id variables and a score variable. Can optionally include a time variable.
score_var (string) – The score variable.
id_var_i (string) – The first id variable.
id_var_j (string) – The second id variable.
time_var (string, optional) – The time variable, by default None.
time_selected (int, optional) – The selected time, by default None.
- Returns:
S (ndarray of shape (n_samples, n_samples)) – A matrix of pairwise relationships.
ids (ndarray) – Identifiers for each element of the matrix.
- Raises:
ValueError – If required columns are missing in the DataFrame.
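To make the edgelist-to-matrix idea concrete, here is a minimal sketch using pandas and NumPy. It is not evomap's code: the helper `edgelist_to_matrix_sketch`, the column names, the sorted ordering of ids, the zero fill for unobserved pairs, and the symmetric treatment of pairs are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical edgelist; the column names are illustrative only.
df = pd.DataFrame({
    "firm_i": ["A", "A", "B"],
    "firm_j": ["B", "C", "C"],
    "score": [0.9, 0.2, 0.5],
})

def edgelist_to_matrix_sketch(df, score_var, id_var_i, id_var_j):
    """Hypothetical helper: build a square matrix from an edgelist."""
    missing = {score_var, id_var_i, id_var_j} - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")
    # The sorted union of both id columns defines the row/column order.
    labels = sorted(set(df[id_var_i]) | set(df[id_var_j]))
    index = {label: k for k, label in enumerate(labels)}
    S = np.zeros((len(labels), len(labels)))
    for i, j, score in zip(df[id_var_i], df[id_var_j], df[score_var]):
        # Assumption: pairs are undirected, so both cells are filled.
        S[index[i], index[j]] = score
        S[index[j], index[i]] = score
    return S, np.array(labels)

S, ids = edgelist_to_matrix_sketch(df, "score", "firm_i", "firm_j")
```

Returning `ids` alongside `S` keeps the mapping from matrix rows to the original identifiers, which matches the two return values documented above.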
- evomap.preprocessing.edgelist2matrices(df, score_var, id_var_i, id_var_j, time_var)
Transform a time-indexed edgelist into a sequence of relationship matrices.
- Parameters:
df (DataFrame) – Data containing the edgelist. Each row should include a pair and must contain two id variables, a score variable, and a time variable.
score_var (string) – The score variable used to assign values in the matrix.
id_var_i (string) – The first id variable corresponding to rows in the matrix.
id_var_j (string) – The second id variable corresponding to columns in the matrix.
time_var (string) – The time variable used to split the data into different matrices.
- Returns:
S_t (list of ndarray) – A list of relationship matrices, each corresponding to a different time period.
ids_t (ndarray) – Array of identifiers for each element of the matrices.
- Raises:
ValueError – If the DataFrame is missing any required columns or if there are no valid entries for any time period.
- evomap.preprocessing.normalize_diss_mat(D)
Normalize a dissimilarity matrix by the maximum dissimilarity observed in the matrix.
- Parameters:
D (ndarray of shape (n_samples, n_samples)) – A dissimilarity matrix.
- Returns:
Normalized dissimilarity matrix.
- Return type:
ndarray of shape (n_samples, n_samples)
- Raises:
ValueError – If the input matrix is not square or if the maximum dissimilarity is zero.
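The normalization and the two documented error conditions can be sketched as follows. The helper name `normalize_by_max_sketch` is hypothetical, and this is an illustration of the described behaviour rather than the library's source.

```python
import numpy as np

def normalize_by_max_sketch(D):
    """Hypothetical helper: divide a dissimilarity matrix by its maximum."""
    D = np.asarray(D, dtype=float)
    if D.ndim != 2 or D.shape[0] != D.shape[1]:
        raise ValueError("D must be a square matrix")
    max_val = D.max()
    if max_val == 0:
        raise ValueError("maximum dissimilarity is zero")
    return D / max_val

D = np.array([[0.0, 2.0, 4.0],
              [2.0, 0.0, 1.0],
              [4.0, 1.0, 0.0]])
D_norm = normalize_by_max_sketch(D)
```

After this step the largest entry equals 1 and all ratios between dissimilarities are unchanged.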
- evomap.preprocessing.normalize_diss_mats(D_ts)
Normalize a sequence of dissimilarity matrices by a common factor (the maximum dissimilarity within the sequence).
- Parameters:
D_ts (list of ndarray, each of shape (n_samples, n_samples)) – Sequence of dissimilarity matrices.
- Returns:
Sequence of dissimilarity matrices, normalized by the maximum dissimilarity within the input sequence.
- Return type:
list of ndarray
- Raises:
ValueError – If any matrix is not square or if the list is empty.
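Dividing every matrix by one shared maximum, rather than by its own maximum, keeps dissimilarities comparable across time periods. A minimal sketch under that reading follows; the helper name `normalize_sequence_sketch` is hypothetical, and an all-zero sequence is not handled here because the documentation does not say how that case is treated.

```python
import numpy as np

def normalize_sequence_sketch(D_ts):
    """Hypothetical helper: scale all matrices by the sequence-wide maximum."""
    if len(D_ts) == 0:
        raise ValueError("D_ts must not be empty")
    D_ts = [np.asarray(D, dtype=float) for D in D_ts]
    for D in D_ts:
        if D.ndim != 2 or D.shape[0] != D.shape[1]:
            raise ValueError("every matrix must be square")
    # One common factor for the whole sequence.
    common_max = max(D.max() for D in D_ts)
    return [D / common_max for D in D_ts]

D_ts = [np.array([[0.0, 1.0], [1.0, 0.0]]),
        np.array([[0.0, 4.0], [4.0, 0.0]])]
D_ts_norm = normalize_sequence_sketch(D_ts)
```

In this example the first period's maximum becomes 0.25 and the second period's becomes 1.0, so the growth in dissimilarity between the two periods is preserved.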
- evomap.preprocessing.expand_matrices(X_ts, labels_ts)
Expand a list of similarity matrices (X_ts) to equal shape and calculate inclusion vectors.
- Parameters:
X_ts (list of ndarray) – List of similarity matrices for each time point.
labels_ts (list of list) – List of labels corresponding to each matrix in X_ts.
- Returns:
Tuple containing the list of expanded similarity matrices, the inclusion vectors, and all labels.
- Return type:
tuple
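One way to picture the expansion: take the union of all labels, embed each period's matrix into a matrix of that full size, and record which labels were present in each period. The sketch below makes several assumptions that the documentation does not confirm: the helper name `expand_matrices_sketch`, the sorted label order, zero fill for absent objects, and 0/1 integer inclusion vectors.

```python
import numpy as np

def expand_matrices_sketch(X_ts, labels_ts):
    """Hypothetical helper: pad matrices to a common shape."""
    all_labels = sorted(set().union(*labels_ts))
    position = {label: k for k, label in enumerate(all_labels)}
    n = len(all_labels)
    expanded, inclusions = [], []
    for X, labels in zip(X_ts, labels_ts):
        idx = np.array([position[label] for label in labels])
        X_full = np.zeros((n, n))
        # Place the observed block at the rows/columns of its labels.
        X_full[np.ix_(idx, idx)] = X
        inc = np.zeros(n, dtype=int)
        inc[idx] = 1  # 1 marks objects present in this period
        expanded.append(X_full)
        inclusions.append(inc)
    return expanded, inclusions, all_labels

X_ts = [np.array([[1.0, 0.3],
                  [0.3, 1.0]]),
        np.array([[1.0, 0.6, 0.1],
                  [0.6, 1.0, 0.4],
                  [0.1, 0.4, 1.0]])]
labels_ts = [["A", "B"], ["A", "B", "C"]]
expanded, inclusions, all_labels = expand_matrices_sketch(X_ts, labels_ts)
```

Here object "C" is missing in the first period, so its row and column are zero in the first expanded matrix and its inclusion entry is 0.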
- evomap.preprocessing.calc_distances(X, metric='euclidean')
Calculate matrix of pairwise distances among the rows of an input matrix.
- Parameters:
X (ndarray of shape (n_samples, n_dims)) – Input matrix containing samples for which pairwise distances will be calculated.
metric (str, optional) – The distance metric to use. Can be any of those supported by scipy.spatial.distance.pdist, such as ‘euclidean’, ‘cityblock’, ‘cosine’, etc. Defaults to ‘euclidean’.
- Returns:
A matrix of pairwise distances, where each element (i, j) is the distance between the i-th and j-th rows of the input matrix X according to the specified metric.
- Return type:
ndarray of shape (n_samples, n_samples)
- Raises:
ValueError – If the metric specified is not supported by scipy.spatial.distance.pdist.
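Since the supported metrics are those of `scipy.spatial.distance.pdist`, a comparable square distance matrix can be produced directly with SciPy. Whether `calc_distances` is implemented exactly this way is an assumption; the snippet only shows the kind of output described above.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Three samples in two dimensions.
X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [6.0, 8.0]])

# pdist returns the condensed upper triangle; squareform expands it
# into a symmetric (n_samples, n_samples) matrix with a zero diagonal.
D = squareform(pdist(X, metric="euclidean"))
```

Swapping `metric="euclidean"` for another `pdist` metric such as `"cityblock"` or `"cosine"` changes the distance definition without changing the output shape.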