ngs_tools.sequence
Module Contents
Functions
|
Convert an alignment to a CIGAR string. |
|
Complement the given sequence, with optional reversing. |
|
|
|
|
|
|
|
|
|
|
|
|
|
Call consensus sequences from a set of sequences. Internally, this |
|
Given a list of sequences and their base qualities, constructs a set of consensus |
|
Calculate the Levenshtein (edit) distance between two sequences. |
|
|
|
|
|
|
|
Calculate the hamming distance between two sequences. |
|
|
|
Calculate the hamming distance between a sequence and a list of sequences. |
|
|
|
Calculate all pairwise hamming distances between two lists of sequences. |
|
|
|
Calculate all pairwise hamming distances between combinations of sequences |
|
|
|
Correct a list of sequences to a whitelist within d hamming distance. |
|
Correct a list of sequences to a whitelist within d hamming distance. |
Attributes
- ngs_tools.sequence.NeedlemanWunsch
- ngs_tools.sequence.NUCLEOTIDES_STRICT = ['A', 'C', 'G', 'T']
- ngs_tools.sequence.NUCLEOTIDES_PERMISSIVE = ['R', 'Y', 'S', 'W', 'K', 'M', 'B', 'D', 'H', 'V', 'N', '-']
- ngs_tools.sequence.NUCLEOTIDES
- ngs_tools.sequence.NUCLEOTIDES_AMBIGUOUS
- ngs_tools.sequence.NUCLEOTIDE_COMPLEMENT
- ngs_tools.sequence.NUCLEOTIDE_MASKS
- ngs_tools.sequence.MASK_TO_NUCLEOTIDE
- ngs_tools.sequence.LEVENSHTEIN_DISTANCE_ALIGNER
- ngs_tools.sequence.SEQUENCE_PARSER
- exception ngs_tools.sequence.SequenceError
Bases:
Exception
Common base class for all non-exit exceptions.
- ngs_tools.sequence.alignment_to_cigar(reference: str, query: str, mismatch: bool = False) str
Convert an alignment to a CIGAR string.
The CIGAR is always constructed relative to the reference (i.e. as insertions/deletions from the reference). The provided sequences must represent their alignments, containing the “-” character at positions where there are gaps in one or another.
- Parameters:
reference – The reference sequence alignment
query – The query sequence alignment
mismatch – Whether or not to use the “X” CIGAR operation in places of mismatches. Defaults to True, such that all non-gaps are considered matches with the “M” CIGAR operation. When there are ambiguous characters, a mismatch occurs only when the two sets of possible nucleotides are exclusive.
- Returns:
The CIGAR string representing the provided alignment
- Raises:
SequenceError – if the reference and query do not have the same length, or if there are gaps in both alignments at the same position
- ngs_tools.sequence.complement_sequence(sequence: str, reverse: bool = False) str
Complement the given sequence, with optional reversing.
- Parameters:
sequence – Input sequence
reverse – Whether or not to perform reverse complementation
- Returns:
Complemented (and optionally reversed) string
- ngs_tools.sequence._sequence_to_array(sequence: str, l: Optional[int] = None) numpy.ndarray
- ngs_tools.sequence._qualities_to_array(qualities: Union[str, array.array], l: Optional[int] = None) numpy.ndarray
- ngs_tools.sequence._most_likely_array(positional_probs: numpy.ndarray) numpy.ndarray
- ngs_tools.sequence._most_likely_sequence(positional_probs: numpy.ndarray, allow_ambiguous: bool = False) str
- ngs_tools.sequence._disambiguate_sequence(sequence: numpy.ndarray) List[str]
- ngs_tools.sequence._calculate_positional_probs(sequences: numpy.ndarray, qualities: numpy.ndarray) numpy.ndarray
- ngs_tools.sequence.call_consensus(sequences: List[str], proportion: float = 0.05)
Call consensus sequences from a set of sequences. Internally, this function calls
call_consensus_with_qualities()
with all qualities set to 1 andq_threshold
set to 1. See the documentation of this function for details on how consensuses are called.- Parameters:
sequences – Sequences to call consensus for
proportion – Proportion of each sequence to allow mismatched bases to be above
q_threshold
- Returns:
List of consensus sequences Numpy array of assignments for each sequence in
sequences
- ngs_tools.sequence.call_consensus_with_qualities(sequences: List[str], qualities: Union[List[str], List[array.array], List[numpy.ndarray]], q_threshold: Optional[int] = None, proportion: float = 0.05, allow_ambiguous: bool = False, return_qualities: bool = False) Union[Tuple[List[str], numpy.ndarray], Tuple[List[str], numpy.ndarray, List[str]]]
Given a list of sequences and their base qualities, constructs a set of consensus sequences by iteratively constructing a consensus (by selecting the most likely base at each position) and assigning sequences with match probability <= max(min(match probability), q_threshold * (proportion * length of longest sequence)) to this consensus. Then, the consensus is updated by constructing the consensus only among these sequences. The match probability of a sequence to a consensus is the sum of the quality values where they do not match (equivalent to negative log probability that all mismatches were sequencing errors).
Note
This function does not perform any alignment among consensus sequences. To detect any insertions/deletions, call this function and then perform alignment among the called consensus sequences.
- Parameters:
sequences – Sequences to call consensus for
qualities – Quality scores for the sequences
q_threshold – Quality threshold. Defaults to the median quality score among all bases in all sequences.
proportion – Proportion of each sequence to allow mismatched bases to be above
q_threshold
. Defaults to 0.05.allow_ambiguous – Allow ambiguous bases in the consensus sequences. Defaults to False, which, on ties, selects a single base lexicographically. This option only has an effect when constructing the final consensus sequence as a string, not when calculating error probabilities.
return_qualities – Whether or not to return qualities for the consensuses. Defaults to False.
- Returns:
List of consensus sequences Numpy array of assignments for each sequence in
sequences
Qualities for each of the consensus sequences, ifreturn_qualities
is True- Raises:
SequenceError – if any sequence-quality pair have different lengths or number of provided sequences does not match number of provided qualities
- ngs_tools.sequence.levenshtein_distance(sequence1: str, sequence2: str) int
Calculate the Levenshtein (edit) distance between two sequences. This is calculated by calling
pyseq_align.NeedlemanWunsch.align()
with the appropriate scores/penalties.- Parameters:
sequence1 – First sequence
sequence2 – Second sequence
- Returns:
Levenshtein distance
- ngs_tools.sequence._mismatch_mask(sequence1: numpy.ndarray, sequence2: numpy.ndarray) int
- ngs_tools.sequence._mismatch_masks(sequence: numpy.ndarray, whitelist: numpy.ndarray, d: int) numpy.ndarray
- ngs_tools.sequence._hamming_distance(sequence1: numpy.ndarray, sequence2: numpy.ndarray) int
- ngs_tools.sequence.hamming_distance(sequence1: str, sequence2: str) int
Calculate the hamming distance between two sequences.
- Parameters:
sequence1 – First sequence
sequence2 – Second sequence
- Returns:
Hamming distance
- Raises:
SequenceError – When the sequences are unequal lengths
- ngs_tools.sequence._hamming_distances(sequence: numpy.ndarray, sequences: numpy.ndarray) numpy.ndarray
- ngs_tools.sequence.hamming_distances(sequence: str, sequences: List[str]) numpy.ndarray
Calculate the hamming distance between a sequence and a list of sequences.
- Parameters:
sequence – Sequence
sequences – List of sequences
- Returns:
Numpy array of hamming distances, where each index
i
corresponds to the hamming distance betweensequence
andsequences[i]
- Raises:
SequenceError – When any of the sequences are unequal length
- ngs_tools.sequence._hamming_distance_matrix(sequences1: numpy.ndarray, sequences2: numpy.ndarray) numpy.ndarray
- ngs_tools.sequence.hamming_distance_matrix(sequences1: List[str], sequences2: List[str]) numpy.ndarray
Calculate all pairwise hamming distances between two lists of sequences.
- Parameters:
sequences1 – List of sequences
sequences2 – List of sequences
- Returns:
Numpy array of hamming distances, where each index
(i, j)
corresponds to the hamming distance betweensequences1[i]
andsequences2[j]
- Raises:
SequenceError – When any of the sequences are unequal length
- ngs_tools.sequence._pairwise_hamming_distances(sequences: numpy.ndarray) numpy.ndarray
- ngs_tools.sequence.pairwise_hamming_distances(sequences: List[str]) numpy.ndarray
Calculate all pairwise hamming distances between combinations of sequences from a single list.
- Parameters:
sequences – List of sequences
- Returns:
Numpy array of hamming distances, where each index
(i, j)
corresponds to the hamming distance betweensequences[i]
andsequences[j]
- Raises:
SequenceError – When any of the sequences are unequal length
- ngs_tools.sequence._correct_to_whitelist(qualities: numpy.ndarray, indices: numpy.ndarray, masks: numpy.ndarray, log10_proportions: numpy.ndarray) Tuple[int, float]
- ngs_tools.sequence.correct_sequences_to_whitelist(sequences: List[str], qualities: Union[List[str], List[array.array]], whitelist: List[str], d: int = 1, confidence: float = 0.95, n_threads: int = 1, show_progress: bool = False) List[Union[str, None]]
Correct a list of sequences to a whitelist within d hamming distance. Note that sequences can contain duplicates, but whitelist can not.
For a given sequence, if there are multiple barcodes in the whitelist to which its distance is <= d, the sequence is assigned to a barcode by using the prior probability that the sequence originated from the barcode. If the confidence that the sequence originated from the most likely barcode is less than confidence, assignment is skipped.
This procedure follows the barcode correction procedure in Cell Ranger by 10X Genomics. Some modifications were made to support ambiguous bases and prevent floating-point underflow. https://kb.10xgenomics.com/hc/en-us/articles/115003822406-How-does-Cell-Ranger-correct-barcode-sequencing-errors
Note
Only hamming distance is supported (not Levenshtein distance).
- Parameters:
sequences – List of sequence strings
qualities – List of quality strings or list of array of qualities (as returned by
pysam.qualitystring_to_array()
)whitelist – List of whitelist sequences to correct to
d – Hamming distance threshold. Sequences will be corrected to the whitelist with hamming distance <=
d
. Defaults to 1.confidence – When a sequence has the same minimum hamming distance to multiple whitelisted sequences, the sequence is assigned to the best whitelisted sequence (using prior probabilities) if the likelihood ratio of this and the sum of all likelihoods is greater than or equal to this value. Defaults to 0.95.
n_threads – Number of threads to use. Defaults to 1.
show_progress – Whether to display a progress bar. Defaults to True.
- Raises:
SequenceError – If all the lengths of each sequence, qualitiy and whitelisted sequence are not equal, the number of sequences and qualities provided are not equal or the whitelist contains duplicates.
- Returns:
The corrections as a list of whitelisted sequences. For sequences that could not be corrected, the corresponding position contains None.
- ngs_tools.sequence.correct_sequences_to_whitelist_simple(sequences: List[str], whitelist: List[str], d: int = 1, n_threads: int = 1, show_progress: bool = False) Dict[str, Union[str, None]]
Correct a list of sequences to a whitelist within d hamming distance. Note that sequences can contain duplicates, but whitelist can not. Unlike
correct_sequences_to_whitelist()
, this function takes a naive approach and discards any sequences that can be corrected to multiple whitelisted sequences.- Parameters:
sequences – List of sequence strings
whitelist – List of whitelist sequences to correct to
d – Hamming distance threshold. Sequences will be corrected to the whitelist with hamming distance <=
d
. Defaults to 1.n_threads – Number of threads to use. Defaults to 1.
show_progress – Whether to display a progress bar. Defaults to True.
- Raises:
SequenceError – If all the lengths of each sequence, qualitiy and whitelisted sequence are not equal, the number of sequences and qualities provided are not equal or the whitelist contains duplicates.
- Returns:
The corrections as a dict of sequence to correction mappings. Note that the return type is different from
correct_sequences_to_whitelist()
.