API

chame.util.seq.count_gc(sequence)

Counts the frequency of capital G and C letters in a given string or array of strings.

The input is not validated: there is no check that the string contains only A, T, G and C.

Args:
sequence: str or array

A sequence or an array of sequences

Returns:

A float if a string was provided. A numpy array if an array was provided.

Return type

float
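The behavior described above can be illustrated with a small stand-in. This is a sketch only, not chame's implementation: the name `count_gc_sketch` is hypothetical, "frequency" is assumed to mean the fraction of characters that are uppercase G or C, and returning 0.0 for an empty string is likewise an assumption.

```python
import numpy as np


def count_gc_sketch(sequence):
    """Hypothetical stand-in for the documented behavior of count_gc."""

    def gc_fraction(s):
        # Only uppercase G and C are counted; no other validation is done.
        # Assumption: an empty string yields 0.0 instead of raising.
        return (s.count("G") + s.count("C")) / len(s) if s else 0.0

    if isinstance(sequence, str):
        # A single string gives a single float.
        return gc_fraction(sequence)
    # Any other iterable of strings gives one value per sequence.
    return np.array([gc_fraction(s) for s in sequence])


single = count_gc_sketch("GATTACA")  # 2 of the 7 letters are G or C
batch = count_gc_sketch(["GGCC", "ATAT", "gcgc"])  # lowercase is not counted
print(single, batch)
```

Under these assumptions, a string input yields a float and a list of strings yields a numpy array with one entry per sequence.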

chame.util.seq.sequence_to_onehot(sequence, mapping={'A': 0, 'C': 1, 'G': 2, 'T': 3}, map_unknown_to_x=False)

Maps the sequence into a one-hot encoded matrix.

Follows the interface in AlphaFold.

Args:
sequence:

A sequence such as a sequence of nucleotides

mapping (optional):

A dictionary mapping possible sequence items (nucleotides) to integers, { ACGT -> 0123 } by default.

map_unknown_to_x (optional):

If True, items not in the mapping are mapped to “X”; if the mapping has no “X” entry, an error is raised. False by default.

Returns:

A numpy array of shape (seq_len, num_unique_items) with one-hot encoding of the sequence.

Raises:
ValueError:

If the mapping doesn’t contain values from 0 to num_unique_items - 1 without gaps.

Return type

ndarray
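The documented contract can be sketched as follows. This is an illustration under stated assumptions, not chame's source: the name `onehot_sketch` is hypothetical, the `int32` dtype is an assumption, and a missing “X” entry is assumed to surface as a `KeyError`.

```python
import numpy as np

DEFAULT_MAPPING = {"A": 0, "C": 1, "G": 2, "T": 3}


def onehot_sketch(sequence, mapping=None, map_unknown_to_x=False):
    """Hypothetical stand-in for the documented behavior of sequence_to_onehot."""
    mapping = DEFAULT_MAPPING if mapping is None else mapping
    num_entries = max(mapping.values()) + 1
    # The mapping values must cover 0 .. num_entries - 1 without gaps.
    if sorted(set(mapping.values())) != list(range(num_entries)):
        raise ValueError("mapping values must run from 0 to num_unique_items - 1 without gaps")

    # Assumption: int32 output; the documentation only promises a numpy array.
    one_hot = np.zeros((len(sequence), num_entries), dtype=np.int32)
    for position, item in enumerate(sequence):
        if item not in mapping and map_unknown_to_x:
            item = "X"  # fall back to the "X" column for unknown items
        # Raises KeyError if the item (or "X") is absent from the mapping.
        one_hot[position, mapping[item]] = 1
    return one_hot


encoded = onehot_sketch("ACGT")  # shape (4, 4), one column per nucleotide
with_x = onehot_sketch(
    "ANT", {"A": 0, "C": 1, "G": 2, "T": 3, "X": 4}, map_unknown_to_x=True
)  # "N" is not in the mapping, so it lands in the "X" column
print(encoded.shape, with_x.shape)
```

In this sketch each row has exactly one nonzero entry, and the number of columns follows from the largest value in the mapping, so adding an “X” entry widens the matrix by one column.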