ngs_tools.sequence

Module Contents

Functions

alignment_to_cigar(→ str)

Convert an alignment to a CIGAR string.

complement_sequence(→ str)

Complement the given sequence, with optional reversing.

_sequence_to_array(→ numpy.ndarray)

_qualities_to_array(→ numpy.ndarray)

_most_likely_array(→ numpy.ndarray)

_most_likely_sequence(→ str)

_disambiguate_sequence(→ List[str])

_calculate_positional_probs(→ numpy.ndarray)

call_consensus(sequences[, proportion])

Call consensus sequences from a set of sequences. Internally, this

call_consensus_with_qualities(→ Union[Tuple[List[str], ...)

Given a list of sequences and their base qualities, constructs a set of consensus

levenshtein_distance(→ int)

Calculate the Levenshtein (edit) distance between two sequences.

_mismatch_mask(→ int)

_mismatch_masks(→ numpy.ndarray)

_hamming_distance(→ int)

hamming_distance(→ int)

Calculate the hamming distance between two sequences.

_hamming_distances(→ numpy.ndarray)

hamming_distances(→ numpy.ndarray)

Calculate the hamming distance between a sequence and a list of sequences.

_hamming_distance_matrix(→ numpy.ndarray)

hamming_distance_matrix(→ numpy.ndarray)

Calculate all pairwise hamming distances between two lists of sequences.

_pairwise_hamming_distances(→ numpy.ndarray)

pairwise_hamming_distances(→ numpy.ndarray)

Calculate all pairwise hamming distances between combinations of sequences

_correct_to_whitelist(→ Tuple[int, float])

correct_sequences_to_whitelist(→ List[Union[str, None]])

Correct a list of sequences to a whitelist within d hamming distance.

correct_sequences_to_whitelist_simple(→ Dict[str, ...)

Correct a list of sequences to a whitelist within d hamming distance.

Attributes

NeedlemanWunsch

NUCLEOTIDES_STRICT

NUCLEOTIDES_PERMISSIVE

NUCLEOTIDES

NUCLEOTIDES_AMBIGUOUS

NUCLEOTIDE_COMPLEMENT

NUCLEOTIDE_MASKS

MASK_TO_NUCLEOTIDE

LEVENSHTEIN_DISTANCE_ALIGNER

SEQUENCE_PARSER

ngs_tools.sequence.NeedlemanWunsch
ngs_tools.sequence.NUCLEOTIDES_STRICT = ['A', 'C', 'G', 'T']
ngs_tools.sequence.NUCLEOTIDES_PERMISSIVE = ['R', 'Y', 'S', 'W', 'K', 'M', 'B', 'D', 'H', 'V', 'N', '-']
ngs_tools.sequence.NUCLEOTIDES
ngs_tools.sequence.NUCLEOTIDES_AMBIGUOUS
ngs_tools.sequence.NUCLEOTIDE_COMPLEMENT
ngs_tools.sequence.NUCLEOTIDE_MASKS
ngs_tools.sequence.MASK_TO_NUCLEOTIDE
ngs_tools.sequence.LEVENSHTEIN_DISTANCE_ALIGNER
ngs_tools.sequence.SEQUENCE_PARSER
exception ngs_tools.sequence.SequenceError

Bases: Exception

Common base class for all non-exit exceptions.

ngs_tools.sequence.alignment_to_cigar(reference: str, query: str, mismatch: bool = False) str

Convert an alignment to a CIGAR string.

The CIGAR is always constructed relative to the reference (i.e. as insertions/deletions from the reference). The provided sequences must represent their alignments, containing the “-” character at positions where there are gaps in one or another.

Parameters:
  • reference – The reference sequence alignment

  • query – The query sequence alignment

  • mismatch – Whether or not to use the “X” CIGAR operation in places of mismatches. Defaults to True, such that all non-gaps are considered matches with the “M” CIGAR operation. When there are ambiguous characters, a mismatch occurs only when the two sets of possible nucleotides are exclusive.

Returns:

The CIGAR string representing the provided alignment

Raises:

SequenceError – if the reference and query do not have the same length, or if there are gaps in both alignments at the same position

ngs_tools.sequence.complement_sequence(sequence: str, reverse: bool = False) str

Complement the given sequence, with optional reversing.

Parameters:
  • sequence – Input sequence

  • reverse – Whether or not to perform reverse complementation

Returns:

Complemented (and optionally reversed) string

ngs_tools.sequence._sequence_to_array(sequence: str, l: Optional[int] = None) numpy.ndarray
ngs_tools.sequence._qualities_to_array(qualities: Union[str, array.array], l: Optional[int] = None) numpy.ndarray
ngs_tools.sequence._most_likely_array(positional_probs: numpy.ndarray) numpy.ndarray
ngs_tools.sequence._most_likely_sequence(positional_probs: numpy.ndarray, allow_ambiguous: bool = False) str
ngs_tools.sequence._disambiguate_sequence(sequence: numpy.ndarray) List[str]
ngs_tools.sequence._calculate_positional_probs(sequences: numpy.ndarray, qualities: numpy.ndarray) numpy.ndarray
ngs_tools.sequence.call_consensus(sequences: List[str], proportion: float = 0.05)

Call consensus sequences from a set of sequences. Internally, this function calls call_consensus_with_qualities() with all qualities set to 1 and q_threshold set to 1. See the documentation of this function for details on how consensuses are called.

Parameters:
  • sequences – Sequences to call consensus for

  • proportion – Proportion of each sequence to allow mismatched bases to be above q_threshold

Returns:

List of consensus sequences Numpy array of assignments for each sequence in sequences

ngs_tools.sequence.call_consensus_with_qualities(sequences: List[str], qualities: Union[List[str], List[array.array], List[numpy.ndarray]], q_threshold: Optional[int] = None, proportion: float = 0.05, allow_ambiguous: bool = False, return_qualities: bool = False) Union[Tuple[List[str], numpy.ndarray], Tuple[List[str], numpy.ndarray, List[str]]]

Given a list of sequences and their base qualities, constructs a set of consensus sequences by iteratively constructing a consensus (by selecting the most likely base at each position) and assigning sequences with match probability <= max(min(match probability), q_threshold * (proportion * length of longest sequence)) to this consensus. Then, the consensus is updated by constructing the consensus only among these sequences. The match probability of a sequence to a consensus is the sum of the quality values where they do not match (equivalent to negative log probability that all mismatches were sequencing errors).

Note

This function does not perform any alignment among consensus sequences. To detect any insertions/deletions, call this function and then perform alignment among the called consensus sequences.

Parameters:
  • sequences – Sequences to call consensus for

  • qualities – Quality scores for the sequences

  • q_threshold – Quality threshold. Defaults to the median quality score among all bases in all sequences.

  • proportion – Proportion of each sequence to allow mismatched bases to be above q_threshold. Defaults to 0.05.

  • allow_ambiguous – Allow ambiguous bases in the consensus sequences. Defaults to False, which, on ties, selects a single base lexicographically. This option only has an effect when constructing the final consensus sequence as a string, not when calculating error probabilities.

  • return_qualities – Whether or not to return qualities for the consensuses. Defaults to False.

Returns:

List of consensus sequences Numpy array of assignments for each sequence in sequences Qualities for each of the consensus sequences, if return_qualities is True

Raises:

SequenceError – if any sequence-quality pair have different lengths or number of provided sequences does not match number of provided qualities

ngs_tools.sequence.levenshtein_distance(sequence1: str, sequence2: str) int

Calculate the Levenshtein (edit) distance between two sequences. This is calculated by calling pyseq_align.NeedlemanWunsch.align() with the appropriate scores/penalties.

Parameters:
  • sequence1 – First sequence

  • sequence2 – Second sequence

Returns:

Levenshtein distance

ngs_tools.sequence._mismatch_mask(sequence1: numpy.ndarray, sequence2: numpy.ndarray) int
ngs_tools.sequence._mismatch_masks(sequence: numpy.ndarray, whitelist: numpy.ndarray, d: int) numpy.ndarray
ngs_tools.sequence._hamming_distance(sequence1: numpy.ndarray, sequence2: numpy.ndarray) int
ngs_tools.sequence.hamming_distance(sequence1: str, sequence2: str) int

Calculate the hamming distance between two sequences.

Parameters:
  • sequence1 – First sequence

  • sequence2 – Second sequence

Returns:

Hamming distance

Raises:

SequenceError – When the sequences are unequal lengths

ngs_tools.sequence._hamming_distances(sequence: numpy.ndarray, sequences: numpy.ndarray) numpy.ndarray
ngs_tools.sequence.hamming_distances(sequence: str, sequences: List[str]) numpy.ndarray

Calculate the hamming distance between a sequence and a list of sequences.

Parameters:
  • sequence – Sequence

  • sequences – List of sequences

Returns:

Numpy array of hamming distances, where each index i corresponds to the hamming distance between sequence and sequences[i]

Raises:

SequenceError – When any of the sequences are unequal length

ngs_tools.sequence._hamming_distance_matrix(sequences1: numpy.ndarray, sequences2: numpy.ndarray) numpy.ndarray
ngs_tools.sequence.hamming_distance_matrix(sequences1: List[str], sequences2: List[str]) numpy.ndarray

Calculate all pairwise hamming distances between two lists of sequences.

Parameters:
  • sequences1 – List of sequences

  • sequences2 – List of sequences

Returns:

Numpy array of hamming distances, where each index (i, j) corresponds to the hamming distance between sequences1[i] and sequences2[j]

Raises:

SequenceError – When any of the sequences are unequal length

ngs_tools.sequence._pairwise_hamming_distances(sequences: numpy.ndarray) numpy.ndarray
ngs_tools.sequence.pairwise_hamming_distances(sequences: List[str]) numpy.ndarray

Calculate all pairwise hamming distances between combinations of sequences from a single list.

Parameters:

sequences – List of sequences

Returns:

Numpy array of hamming distances, where each index (i, j) corresponds to the hamming distance between sequences[i] and sequences[j]

Raises:

SequenceError – When any of the sequences are unequal length

ngs_tools.sequence._correct_to_whitelist(qualities: numpy.ndarray, indices: numpy.ndarray, masks: numpy.ndarray, log10_proportions: numpy.ndarray) Tuple[int, float]
ngs_tools.sequence.correct_sequences_to_whitelist(sequences: List[str], qualities: Union[List[str], List[array.array]], whitelist: List[str], d: int = 1, confidence: float = 0.95, n_threads: int = 1, show_progress: bool = False) List[Union[str, None]]

Correct a list of sequences to a whitelist within d hamming distance. Note that sequences can contain duplicates, but whitelist can not.

For a given sequence, if there are multiple barcodes in the whitelist to which its distance is <= d, the sequence is assigned to a barcode by using the prior probability that the sequence originated from the barcode. If the confidence that the sequence originated from the most likely barcode is less than confidence, assignment is skipped.

This procedure follows the barcode correction procedure in Cell Ranger by 10X Genomics. Some modifications were made to support ambiguous bases and prevent floating-point underflow. https://kb.10xgenomics.com/hc/en-us/articles/115003822406-How-does-Cell-Ranger-correct-barcode-sequencing-errors

Note

Only hamming distance is supported (not Levenshtein distance).

Parameters:
  • sequences – List of sequence strings

  • qualities – List of quality strings or list of array of qualities (as returned by pysam.qualitystring_to_array())

  • whitelist – List of whitelist sequences to correct to

  • d – Hamming distance threshold. Sequences will be corrected to the whitelist with hamming distance <= d. Defaults to 1.

  • confidence – When a sequence has the same minimum hamming distance to multiple whitelisted sequences, the sequence is assigned to the best whitelisted sequence (using prior probabilities) if the likelihood ratio of this and the sum of all likelihoods is greater than or equal to this value. Defaults to 0.95.

  • n_threads – Number of threads to use. Defaults to 1.

  • show_progress – Whether to display a progress bar. Defaults to True.

Raises:

SequenceError – If all the lengths of each sequence, qualitiy and whitelisted sequence are not equal, the number of sequences and qualities provided are not equal or the whitelist contains duplicates.

Returns:

The corrections as a list of whitelisted sequences. For sequences that could not be corrected, the corresponding position contains None.

ngs_tools.sequence.correct_sequences_to_whitelist_simple(sequences: List[str], whitelist: List[str], d: int = 1, n_threads: int = 1, show_progress: bool = False) Dict[str, Union[str, None]]

Correct a list of sequences to a whitelist within d hamming distance. Note that sequences can contain duplicates, but whitelist can not. Unlike correct_sequences_to_whitelist(), this function takes a naive approach and discards any sequences that can be corrected to multiple whitelisted sequences.

Parameters:
  • sequences – List of sequence strings

  • whitelist – List of whitelist sequences to correct to

  • d – Hamming distance threshold. Sequences will be corrected to the whitelist with hamming distance <= d. Defaults to 1.

  • n_threads – Number of threads to use. Defaults to 1.

  • show_progress – Whether to display a progress bar. Defaults to True.

Raises:

SequenceError – If all the lengths of each sequence, qualitiy and whitelisted sequence are not equal, the number of sequences and qualities provided are not equal or the whitelist contains duplicates.

Returns:

The corrections as a dict of sequence to correction mappings. Note that the return type is different from correct_sequences_to_whitelist().