ngs-tools: Reusable tools for working with next-generation sequencing (NGS) data
ngs_tools
Subpackages
ngs_tools.binary
Submodules
ngs_tools.binary.Argument
Module Contents
Classes
Abstract class representing a single argument. |
|
Abstract class representing a single argument. |
|
Abstract class representing a single argument. |
|
Abstract class representing a single argument. |
- exception ngs_tools.binary.Argument.ArgumentError
Bases:
Exception
Common base class for all non-exit exceptions.
- class ngs_tools.binary.Argument.Argument(name: str, pre_validator: ngs_tools.binary.ArgumentValidator.ArgumentValidator = NoValidator(), post_validator: ngs_tools.binary.ArgumentValidator.ArgumentValidator = NoValidator(), required: bool = True)
Bases:
abc.ABC
Abstract class representing a single argument.
- property name: str
- property required: bool
- pre_execute(arg: str)
- post_execute(arg: str)
- render(arg: str) List[str]
- class ngs_tools.binary.Argument.PositionalArgument(name: str, pre_validator: ngs_tools.binary.ArgumentValidator.ArgumentValidator = NoValidator(), post_validator: ngs_tools.binary.ArgumentValidator.ArgumentValidator = NoValidator(), required: bool = True)
Bases:
Argument
Abstract class representing a single argument.
- class ngs_tools.binary.Argument.ConstantArgument(name: str, pre_validator: ngs_tools.binary.ArgumentValidator.ArgumentValidator = NoValidator(), post_validator: ngs_tools.binary.ArgumentValidator.ArgumentValidator = NoValidator(), required: bool = True)
Bases:
Argument
Abstract class representing a single argument.
- render(arg: Any) List[str]
- class ngs_tools.binary.Argument.NamedArgument(name: str, pre_validator: ngs_tools.binary.ArgumentValidator.ArgumentValidator = NoValidator(), post_validator: ngs_tools.binary.ArgumentValidator.ArgumentValidator = NoValidator(), required: bool = True)
Bases:
Argument
Abstract class representing a single argument.
- render(arg: str) List[str]
ngs_tools.binary.ArgumentValidator
Module Contents
Classes
Abstract argument validator. |
|
Abstract argument validator. |
|
Abstract argument validator. |
|
Abstract argument validator. |
|
Abstract argument validator. |
|
Abstract argument validator. |
|
Abstract argument validator. |
|
Abstract argument validator. |
|
Abstract argument validator. |
|
Abstract argument validator. |
|
Abstract argument validator. |
Attributes
- class ngs_tools.binary.ArgumentValidator.ArgumentValidator
Bases:
abc.ABC
Abstract argument validator.
- abstract __call__(arg) bool
- __and__(other: ArgumentValidator) ArgumentValidator
- __or__(other: ArgumentValidator) ArgumentValidator
- __invert__() ArgumentValidator
- __str__() str
Return str(self).
- class ngs_tools.binary.ArgumentValidator.AndValidator(left: ArgumentValidator, right: ArgumentValidator)
Bases:
ArgumentValidator
Abstract argument validator.
- __call__(arg) bool
- __str__() str
Return str(self).
- class ngs_tools.binary.ArgumentValidator.OrValidator(left: ArgumentValidator, right: ArgumentValidator)
Bases:
ArgumentValidator
Abstract argument validator.
- __call__(arg) bool
- __str__() str
Return str(self).
- class ngs_tools.binary.ArgumentValidator.NotValidator(validator: ArgumentValidator)
Bases:
ArgumentValidator
Abstract argument validator.
- __call__(arg) bool
- __str__() str
Return str(self).
- class ngs_tools.binary.ArgumentValidator.NoValidator
Bases:
ArgumentValidator
Abstract argument validator.
- __call__(arg) bool
- __str__() str
Return str(self).
- class ngs_tools.binary.ArgumentValidator.IntegerValidator
Bases:
ArgumentValidator
Abstract argument validator.
- __call__(arg) bool
- __str__() str
Return str(self).
- class ngs_tools.binary.ArgumentValidator.FloatValidator
Bases:
ArgumentValidator
Abstract argument validator.
- __call__(arg) bool
- __str__() str
Return str(self).
- class ngs_tools.binary.ArgumentValidator.PositiveValidator
Bases:
ArgumentValidator
Abstract argument validator.
- __call__(arg) bool
- __str__() str
Return str(self).
- class ngs_tools.binary.ArgumentValidator.FileValidator
Bases:
ArgumentValidator
Abstract argument validator.
- __call__(arg) bool
- __str__() str
Return str(self).
- class ngs_tools.binary.ArgumentValidator.DirValidator
Bases:
ArgumentValidator
Abstract argument validator.
- __call__(arg) bool
- __str__() str
Return str(self).
- class ngs_tools.binary.ArgumentValidator.ConstantValidator(constant: str)
Bases:
ArgumentValidator
Abstract argument validator.
- __call__(arg) bool
- __str__() str
Return str(self).
- ngs_tools.binary.ArgumentValidator.IsInteger
- ngs_tools.binary.ArgumentValidator.IsFloat
- ngs_tools.binary.ArgumentValidator.IsPositive
- ngs_tools.binary.ArgumentValidator.IsPositiveInteger
- ngs_tools.binary.ArgumentValidator.IsDir
- ngs_tools.binary.ArgumentValidator.IsFile
ngs_tools.binary.Binary
Module Contents
Classes
Typed version of namedtuple. |
|
Wrapper around a binary file that provides a object-oriented interface to |
|
Class that handles execution of binaries with specified arguments. |
- exception ngs_tools.binary.Binary.BinaryError
Bases:
Exception
Common base class for all non-exit exceptions.
- class ngs_tools.binary.Binary.BinaryResult
Bases:
NamedTuple
Typed version of namedtuple.
Usage in Python versions >= 3.6:
class Employee(NamedTuple): name: str id: int
This is equivalent to:
Employee = collections.namedtuple('Employee', ['name', 'id'])
The resulting class has extra __annotations__ and _field_types attributes, giving an ordered dict mapping field names to types. __annotations__ should be preferred, while _field_types is kept to maintain pre PEP 526 compatibility. (The field names are in the _fields attribute, which is part of the namedtuple API.) Alternative equivalent keyword syntax is also accepted:
Employee = NamedTuple('Employee', name=str, id=int)
In Python versions <= 3.5 use:
Employee = NamedTuple('Employee', [('name', str), ('id', int)])
- command: List[str]
- process: subprocess.Popen
- stdout: Optional[str]
- stderr: Optional[str]
- class ngs_tools.binary.Binary.Binary(path: str)
Wrapper around a binary file that provides a object-oriented interface to
run_executable()
.- property path: str
- __str__() str
Return str(self).
- pre_execute(command: str, *args, **kwargs)
- post_execute(command: str, result: BinaryResult, *args, **kwargs)
- __call__(command: List[str], *args, **kwargs) BinaryResult
Wrapper around the
run_executable()
function. The first element of the command argument to the function will always be the path to the current binary.
- class ngs_tools.binary.Binary.BinaryExecutor(binary: Binary, *args)
Class that handles execution of binaries with specified arguments.
- pre_execute(command: str)
- post_execute(command: str, result: BinaryResult)
- __call__(values: Dict[str, Optional[str]], *args, **kwargs)
Run the binary with the specified values for each of the arguments.
ngs_tools.chemistry
Submodules
ngs_tools.chemistry.Chemistry
Module Contents
Classes
Definition of a subsequence. This class is used to parse a subsequence out from |
|
Class that uses a collection of |
|
Base class to represent any kind of chemistry. |
|
Generic enumeration. |
|
Base class to represent a sequencing chemistry. |
Attributes
- ngs_tools.chemistry.Chemistry.WHITELISTS_DIR
- exception ngs_tools.chemistry.Chemistry.SubSequenceDefinitionError
Bases:
Exception
Common base class for all non-exit exceptions.
- class ngs_tools.chemistry.Chemistry.SubSequenceDefinition(index: int, start: Optional[int] = None, length: Optional[int] = None)
Definition of a subsequence. This class is used to parse a subsequence out from a list of sequences.
TODO: anchoring
- _length
Length of the subsequence; for internal use only. Use
length
instead.
- property index: int
Sequence index
- property start: Optional[int]
Substring starting position
- property length: Optional[int]
Substring length. None if not provided on initialization.
- is_overlapping(other: SubSequenceDefinition) bool
Whether this subsequence overlaps with another subsequence.
- Parameters:
other – The other
SubSequenceDefinition
instance to compare to- Returns:
True if they are overlapping. False otherwise.
- parse(s: List[str]) str
Parse the given list of strings according to the arguments used to initialize this instance. If
start
andlength
was not provided, then this is simply the entire string at indexindex
. Otherwise, the substring from positionstart
of lengthlength
is extracted from the string at indexindex
.- Parameters:
s – List of strings to parse
- Returns:
The parsed string
- __eq__(other: SubSequenceDefinition)
Return self==value.
- __repr__()
Return repr(self).
- __str__()
Return str(self).
- exception ngs_tools.chemistry.Chemistry.SubSequenceParserError
Bases:
Exception
Common base class for all non-exit exceptions.
- class ngs_tools.chemistry.Chemistry.SubSequenceParser(*definitions: SubSequenceDefinition)
Class that uses a collection of
SubSequenceDefinition
instances to parse an entire subsequence from a list of strings.- _definitions
List of
SubSequenceDefinition
instances; for internal use only.
- property definitions
- is_overlapping(other: SubSequenceParser) bool
Whether this parser overlaps with another parser. Checks all pairwise combinations and returns True if any two
SubSequenceDefinition
instances overlap.- Parameters:
other – The other
SubSequenceParser
instance to compare to- Returns:
True if they are overlapping. False otherwise.
- parse(sequences: List[str], concatenate: bool = False) Union[str, Tuple[str]]
Iteratively constructs a full subsequence by applying each
SubSequenceDefinition
in_definitions
on the list of provided sequences. Ifconcatenate=False
, then this function returns a tuple of length equal to the number of definitions. Each element of the tuple is a string that was parsed by each definition. Otherwise, all the parsed strings are concatenated into a single string.- Parameters:
sequences – List of sequences to parse
concatenate – Whether or not to concatenate the parsed strings. Defaults to False.
- Returns:
Concatenated parsed sequence (if
concatenate=True
). Otherwise, a tuple of parsed strings.
- parse_reads(reads: List[ngs_tools.fastq.Read.Read], concatenate: bool = False) Tuple[Union[str, Tuple[str]], Union[str, Tuple[str]]]
Behaves identically to
parse()
, but instead on a list ofngs_tools.fastq.Read
instances.parse()
is called on the read sequences and qualities separately.- Parameters:
reads – List of reads to parse
concatenate – Whether or not to concatenate the parsed strings. Defaults to False.
- Returns:
Parsed sequence from read sequences Parsed sequence from quality sequences
- __eq__(other: SubSequenceParser)
Check whether this parser equals another. The order of definitions must also be equal.
- __iter__()
- __len__()
- __getitem__(i)
- __repr__()
Return repr(self).
- __str__()
Return str(self).
- exception ngs_tools.chemistry.Chemistry.ChemistryError
Bases:
Exception
Common base class for all non-exit exceptions.
- class ngs_tools.chemistry.Chemistry.Chemistry(name: str, description: str, files: Optional[Dict[str, str]] = None)
Bases:
abc.ABC
Base class to represent any kind of chemistry.
- _description
Chemistry description; for internal use only. Use
description
instead.
- _files
Dictionary containing files related to this chemistry. For internal use only.
- property name: str
Chemistry name
- property description: str
Chemistry description
- get_file(name: str) bool
Get a file path by its name
- class ngs_tools.chemistry.Chemistry.SequencingStrand
Bases:
enum.Enum
Generic enumeration.
Derive from this class to define new enumerations.
- UNSTRANDED = 0
- FORWARD = 1
- REVERSE
- class ngs_tools.chemistry.Chemistry.SequencingChemistry(n: int, strand: SequencingStrand, parsers: Dict[str, SubSequenceParser], *args, **kwargs)
Bases:
Chemistry
Base class to represent a sequencing chemistry.
- property n: int
Number of sequences to parse at once
- property parsers: Dict[str, SubSequenceParser]
Retrieve a copy of the
_parsers
dictionary.
- property strand: SequencingStrand
Retrieve the strandedness of the chemistry.
- property lengths: Tuple[int, Ellipsis]
The expected length for each sequence, based on
parsers
. None indicates any length is expected.
- property has_barcode: bool
- abstract property barcode_parser: SubSequenceParser
- property has_umi: bool
- abstract property umi_parser: SubSequenceParser
- property has_whitelist: bool
- abstract property whitelist_path
- get_parser(name: str) SubSequenceParser
Get a
SubSequenceParser
by its name
- has_parser(name: str) bool
Whether
_parsers
contains a parser with the specified name
- reorder(reordering: List[int]) Chemistry
Reorder the file indices according to the
reordering
list. This list reorders the file at each index to the value at that index.- Parameters:
reordering – List containing how to reorder file indices, where the file at index
i
of this index will now be at indexreordering[i]
.- Returns:
A new
Chemistry
instance (or the subclass)
- parse(sequences: List[str], concatenate: bool = False) Dict[str, Union[str, Tuple[str]]]
Parse a list of strings using the parsers in
_parsers
and return a dictionary with keys corresponding to those in_parsers
.- Parameters:
sequences – List of strings
concatenate – Whether or not to concatenate the parsed strings. Defaults to False.
- Returns:
Dictionary containing parsed strings
- Raises:
ChemistryError – If the number sequences does not equal
n
- parse_reads(reads: List[ngs_tools.fastq.Read.Read], concatenate: bool = False, check_name: bool = True) Dict[str, Tuple[Union[str, Tuple[str]], Union[str, Tuple[str]]]]
Behaves identically to
parse()
but on a list ofngs_tools.fastq.Read
instances. The resulting dictionary contains tuple values, where the first element corresponds to the parsed read sequences, while the second corresponds to the parsed quality strings.- Parameters:
reads – List of
ngs_tools.fastq.Read
instancesconcatenate – Whether or not to concatenate the parsed strings. Defaults to False.
check_name – If True, raises
ChemistryError
if all the reads do not have the same name. Defaults to True.
- Returns:
Dictionary containing tuples of parsed read sequences and quality strings
- Raises:
ChemistryError – If the number sequences does not equal
n
, orcheck_name=True
and not all reads have the same name.
- __str__()
Return str(self).
- __repr__()
Return repr(self).
- to_kallisto_bus_arguments() Dict[str, str]
Convert this spatial chemistry definition to arguments that can be used as input to kallisto bus. https://www.kallistobus.tools/
- Returns:
A Dictionary of arguments-to-value mappings. For this particular function, the dictionary has a single -x key and the value is a custom technology definition string, as specified in the kallisto manual.
- to_starsolo_arguments() Dict[str, str]
Converts this spatial chemistry definition to arguments that can be used as input to STARsolo. https://github.com/alexdobin/STAR/blob/master/docs/STARsolo.md
- Returns:
A Dictionary of arguments-to-value mappings.
ngs_tools.chemistry.MultimodalChemistry
Module Contents
Classes
Represents any chemistry that is a combination of multiple chemistries. |
Attributes
- exception ngs_tools.chemistry.MultimodalChemistry.MultimodalChemistryError
Bases:
Exception
Common base class for all non-exit exceptions.
- class ngs_tools.chemistry.MultimodalChemistry.MultimodalChemistry(name: str, description: str, chemistries: Dict[str, ngs_tools.chemistry.Chemistry.Chemistry])
Represents any chemistry that is a combination of multiple chemistries. For example, 10x Multiome. Note that this is not a subclass of
Chemistry
.TODO: Add properties similar to Chemistry class.
- property name: str
Chemistry name
- property description: str
Chemistry description
- property chemistries: Dict[str, ngs_tools.chemistry.Chemistry.Chemistry]
- chemistry(key: str) ngs_tools.chemistry.Chemistry.Chemistry
- ngs_tools.chemistry.MultimodalChemistry._10X_FEATUREBARCODE
- ngs_tools.chemistry.MultimodalChemistry._10X_MULTIOME
- ngs_tools.chemistry.MultimodalChemistry.MULTIMODAL_CHEMISTRIES
ngs_tools.chemistry.SingleCellChemistry
Module Contents
Classes
Extends |
Attributes
- exception ngs_tools.chemistry.SingleCellChemistry.SingleCellChemistryError
Bases:
Exception
Common base class for all non-exit exceptions.
- class ngs_tools.chemistry.SingleCellChemistry.SingleCellChemistry(name: str, description: str, n: int, strand: ngs_tools.chemistry.Chemistry.SequencingStrand, cdna_parser: ngs_tools.chemistry.Chemistry.SubSequenceParser, cell_barcode_parser: Optional[ngs_tools.chemistry.Chemistry.SubSequenceParser] = None, umi_parser: Optional[ngs_tools.chemistry.Chemistry.SubSequenceParser] = None, whitelist_path: Optional[str] = None)
Bases:
ngs_tools.chemistry.Chemistry.SequencingChemistry
Extends
SequencingChemistry
to be able to handle common single-cell chemistries.- property cell_barcode_parser: ngs_tools.chemistry.Chemistry.SubSequenceParser
Get the cell barcode parser
- property barcode_parser: ngs_tools.chemistry.Chemistry.SubSequenceParser
Get the cell barcode parser
- property umi_parser: ngs_tools.chemistry.Chemistry.SubSequenceParser
Get the UMI parser
- property cdna_parser: ngs_tools.chemistry.Chemistry.SubSequenceParser
Get the cDNA parser
- property has_cell_barcode: bool
Whether the chemistry has a cell barcode
- property has_barcode: bool
Whether the chemistry has a cell barcode
- property has_umi: bool
Whether the chemistry has a UMI
- property has_whitelist: bool
Whether the chemistry has a fixed predefined cell barcode whitelist
- property whitelist_path: str
Path to the whitelist
- ngs_tools.chemistry.SingleCellChemistry._10X_V1
- ngs_tools.chemistry.SingleCellChemistry._10X_V2
- ngs_tools.chemistry.SingleCellChemistry._10X_V3
- ngs_tools.chemistry.SingleCellChemistry._10X_V3_ULTIMA
- ngs_tools.chemistry.SingleCellChemistry._10X_FB
- ngs_tools.chemistry.SingleCellChemistry._10X_ATAC
- ngs_tools.chemistry.SingleCellChemistry._DROPSEQ
- ngs_tools.chemistry.SingleCellChemistry._CELSEQ_V1
- ngs_tools.chemistry.SingleCellChemistry._CELSEQ_V2
- ngs_tools.chemistry.SingleCellChemistry._INDROPS_V1
- ngs_tools.chemistry.SingleCellChemistry._INDROPS_V2
- ngs_tools.chemistry.SingleCellChemistry._INDROPS_V3
- ngs_tools.chemistry.SingleCellChemistry._SCRBSEQ
- ngs_tools.chemistry.SingleCellChemistry._SURECELL
- ngs_tools.chemistry.SingleCellChemistry._SMARTSEQ_V2
- ngs_tools.chemistry.SingleCellChemistry._SMARTSEQ_V3
- ngs_tools.chemistry.SingleCellChemistry._STORMSEQ
- ngs_tools.chemistry.SingleCellChemistry._SCI_FATE
- ngs_tools.chemistry.SingleCellChemistry._BDWTA
- ngs_tools.chemistry.SingleCellChemistry._SPLITSEQ
- ngs_tools.chemistry.SingleCellChemistry._PLATE_SINGLE_CELL_CHEMISTRIES
- ngs_tools.chemistry.SingleCellChemistry._DROPLET_SINGLE_CELL_CHEMISTRIES
- ngs_tools.chemistry.SingleCellChemistry._OTHER_SINGLE_CELL_CHEMISTRIES
- ngs_tools.chemistry.SingleCellChemistry.SINGLE_CELL_CHEMISTRIES
ngs_tools.chemistry.SpatialChemistry
Module Contents
Classes
Typed version of namedtuple. |
|
Extends |
|
Extends |
Attributes
- class ngs_tools.chemistry.SpatialChemistry.SpatialResolution
Bases:
NamedTuple
Typed version of namedtuple.
Usage in Python versions >= 3.6:
class Employee(NamedTuple): name: str id: int
This is equivalent to:
Employee = collections.namedtuple('Employee', ['name', 'id'])
The resulting class has extra __annotations__ and _field_types attributes, giving an ordered dict mapping field names to types. __annotations__ should be preferred, while _field_types is kept to maintain pre PEP 526 compatibility. (The field names are in the _fields attribute, which is part of the namedtuple API.) Alternative equivalent keyword syntax is also accepted:
Employee = NamedTuple('Employee', name=str, id=int)
In Python versions <= 3.5 use:
Employee = NamedTuple('Employee', [('name', str), ('id', int)])
- scale: float = 1.0
- unit: Optional[typing_extensions.Literal[nm, um, mm]]
- exception ngs_tools.chemistry.SpatialChemistry.SpatialChemistryError
Bases:
Exception
Common base class for all non-exit exceptions.
- class ngs_tools.chemistry.SpatialChemistry.SpatialChemistry(resolution: SpatialResolution, *args, **kwargs)
Bases:
ngs_tools.chemistry.Chemistry.Chemistry
Extends
Chemistry
to be able to handle spatial chemistries.- property resolution: SpatialResolution
Get the spatial resolution as a
SpatialResolution
object.
- class ngs_tools.chemistry.SpatialChemistry.SpatialSequencingChemistry(name: str, description: str, resolution: SpatialResolution, n: int, strand: ngs_tools.chemistry.Chemistry.SequencingStrand, cdna_parser: ngs_tools.chemistry.Chemistry.SubSequenceParser, spot_barcode_parser: Optional[ngs_tools.chemistry.Chemistry.SubSequenceParser] = None, umi_parser: Optional[ngs_tools.chemistry.Chemistry.SubSequenceParser] = None, whitelist_path: Optional[str] = None)
Bases:
SpatialChemistry
,ngs_tools.chemistry.Chemistry.SequencingChemistry
Extends
SequencingChemistry
to be able to handle common spatial chemistries.- property spot_barcode_parser: ngs_tools.chemistry.Chemistry.SubSequenceParser
Get the spot barcode parser
- property barcode_parser: ngs_tools.chemistry.Chemistry.SubSequenceParser
Get the spot barcode parser
- property umi_parser: ngs_tools.chemistry.Chemistry.SubSequenceParser
Get the UMI parser
- property cdna_parser: ngs_tools.chemistry.Chemistry.SubSequenceParser
Get the cDNA parser
- property has_spot_barcode: bool
Whether the chemistry has a spot barcode
- property has_barcode: bool
Whether the chemistry has a spot barcode
- property has_umi: bool
Whether the chemistry has a UMI
- property has_whitelist: bool
Whether the chemistry has a fixed predefined spot barcode whitelist
- property whitelist_path: Optional[str]
Path to the whitelist. None if it does not exist.
- ngs_tools.chemistry.SpatialChemistry._SLIDESEQ_V2
- ngs_tools.chemistry.SpatialChemistry._VISIUM
- ngs_tools.chemistry.SpatialChemistry._STEREOSEQ
- ngs_tools.chemistry.SpatialChemistry._COSMX
- ngs_tools.chemistry.SpatialChemistry._SEQSCOPE_HDMI_DRAI
- ngs_tools.chemistry.SpatialChemistry._SEQSCOPE_HDMI32_DRA1
- ngs_tools.chemistry.SpatialChemistry._STARMAP
- ngs_tools.chemistry.SpatialChemistry._SEQFISH
- ngs_tools.chemistry.SpatialChemistry._MERFISH
- ngs_tools.chemistry.SpatialChemistry._SEQUENCING_SPATIAL_CHEMISTRIES
- ngs_tools.chemistry.SpatialChemistry._INSITU_SPATIAL_CHEMISTRIES
- ngs_tools.chemistry.SpatialChemistry.SPATIAL_CHEMISTRIES
Package Contents
Functions
|
Internal helper function to clean chemistry names. |
|
Fetch a |
Attributes
- ngs_tools.chemistry.VERSION_PARSER
- ngs_tools.chemistry.CHEMISTRIES
- ngs_tools.chemistry._clean_name(name: str) Tuple[str, Optional[int]]
Internal helper function to clean chemistry names.
- Parameters:
name – String name of the chemistry.
- Returns:
Tuple of the cleaned name and version
- ngs_tools.chemistry.get_chemistry(name: str) Chemistry.Chemistry
Fetch a
Chemistry
definition by name. Uses some regex magic to correctly deal with chemistry versioning at the end of the name. For instance,10x2
is interpreted the same as10xv2
.See
SingleCellChemistry
andSpatialChemistry
for available chemistries.- Parameters:
name – String name of the chemistry. Any dashes (-) or capitalization are ignored.
- Returns:
The matching chemistry.
- Raises:
ChemistryError – If the chemistry could not be found.
ngs_tools.fasta
Submodules
ngs_tools.fasta.Fasta
Module Contents
Classes
Represents a single FASTA file. |
- exception ngs_tools.fasta.Fasta.FastaError
Bases:
Exception
Common base class for all non-exit exceptions.
- class ngs_tools.fasta.Fasta.Fasta(path: str, mode: typing_extensions.Literal[r, w] = 'r')
Bases:
ngs_tools.utils.FileWrapper
Represents a single FASTA file.
- _header
Variable that temporarily holds the header string for the next FASTA entry; for internal use only.
- read() ngs_tools.fasta.FastaEntry.FastaEntry
Read a single FASTA entry as a
FastaEntry
instance.- Returns:
The next FASTA entry
- Raises:
FastaError – If the file was not opened for reading, or the file was closed.
StopIteration – When there are no more entries to read.
- write(entry: ngs_tools.fasta.FastaEntry.FastaEntry)
Write a single FASTA entry.
- Parameters:
entry – The FASTA entry to write
- Raises:
FastaError – If the file was not opened for writing, or the file was closed.
ngs_tools.fasta.FastaEntry
Module Contents
Classes
Represents a single FASTA entry, which consists of a header and a sequence. |
- exception ngs_tools.fasta.FastaEntry.FastaEntryError
Bases:
Exception
Common base class for all non-exit exceptions.
- class ngs_tools.fasta.FastaEntry.FastaEntry(header: str, sequence: str)
Represents a single FASTA entry, which consists of a header and a sequence.
- ATTRIBUTE_PARSER
Static attribute that is a compiled regex. Used to parse attributes.
- property header: str
Header string, including the
>
character
- property sequence: str
Sequence string
- property name: str
Name of the sequence, which comes immedately after
>
in the header
- property attributes: Dict[str, str]
Dictionary of entry attributes, parsed from the substring of the header after the first space character.
- ATTRIBUTE_PARSER
- static make_header(name: str, attributes: Dict[str, str]) str
Static method to construct a header string from a name and attributes.
- Parameters:
name – entry name
attributes – dictionary containing entry attributes
Package Contents
Functions
|
Split a genomic FASTA into cDNA by using gene and transcript information |
Split a genomic FASTA into introns by using gene and transcript information |
|
Split a genomic FASTA into nascent transcripts by using gene information |
- ngs_tools.fasta.split_genomic_fasta_to_cdna(fasta_path: str, out_path: str, gene_infos: dict, transcript_infos: dict, show_progress: bool = False) str
Split a genomic FASTA into cDNA by using gene and transcript information generated from extracting information from a GTF.
- Parameters:
fasta_path – Path to FASTA containing genomic sequences
out_path – Path to output FASTA that will contain cDNA sequences
gene_infos – Dictionary containing gene information, as returned by
ngs_tools.gtf.genes_and_transcripts_from_gtf()
transcript_infos – Dictionary containing transcript information, as returned by
ngs_tools.gtf.genes_and_transcripts_from_gtf()
show_progress – Whether to display a progress bar. Defaults to False.
- Returns:
Path to written FASTA
- ngs_tools.fasta.split_genomic_fasta_to_intron(fasta_path: str, out_path: str, gene_infos: dict, transcript_infos: dict, flank: int = 30, show_progress: bool = False) str
Split a genomic FASTA into introns by using gene and transcript information generated from extracting information from a GTF. Optionally append flanking sequences and collapse introns that have overlapping flanking regions.
- Parameters:
fasta_path – Path to FASTA containing genomic sequences
out_path – Path to output FASTA that will contain cDNA sequences
gene_infos – Dictionary containing gene information, as returned by
ngs_tools.gtf.genes_and_transcripts_from_gtf()
transcript_infos – Dictionary containing transcript information, as returned by
ngs_tools.gtf.genes_and_transcripts_from_gtf()
flank – Number of flanking bases to include for each intron. Defaults to 30.
show_progress – Whether to display a progress bar. Defaults to False.
- Returns:
Path to written FASTA
- ngs_tools.fasta.split_genomic_fasta_to_nascent(fasta_path: str, out_path: str, gene_infos: dict, suffix='', show_progress: bool = False) str
Split a genomic FASTA into nascent transcripts by using gene information generated from extracting information from a GTF.
- Parameters:
fasta_path – Path to FASTA containing genomic sequences
out_path – Path to output FASTA that will contain cDNA sequences
gene_infos – Dictionary containing gene information, as returned by
ngs_tools.gtf.genes_and_transcripts_from_gtf()
suffix – Suffix to append to output FASTA entry names. Defaults to “”.
show_progress – Whether to display a progress bar. Defaults to False.
- Returns:
Path to written FASTA
ngs_tools.fastq
Submodules
ngs_tools.fastq.Fastq
Module Contents
Classes
Class that represents a FASTQ file. |
- exception ngs_tools.fastq.Fastq.FastqError
Bases:
Exception
Common base class for all non-exit exceptions.
- class ngs_tools.fastq.Fastq.Fastq(path: str, mode: typing_extensions.Literal[r, w] = 'r')
Bases:
ngs_tools.utils.FileWrapper
Class that represents a FASTQ file.
- read() ngs_tools.fastq.Read.Read
Read a single FASTQ entry.
- Returns:
The next read as a
Read
instance- Raises:
FastqError – If the file was not opened for reading, or the file was closed.
StopIteration – When there are no more entries to read.
- write(entry: ngs_tools.fastq.Read.Read)
Write a single FASTQ entry.
- Parameters:
entry –
Read
instance to write to the FASTQ- Raises:
FastaError – If the file was not opened for writing, or the file was closed.
ngs_tools.fastq.Read
Module Contents
Classes
Represents a Phred33 quality string. |
|
Class that represents a FASTQ read. Once the class is initialized, with the |
- exception ngs_tools.fastq.Read.ReadError
Bases:
Exception
Common base class for all non-exit exceptions.
- class ngs_tools.fastq.Read.Quality(qualities: str)
Represents a Phred33 quality string.
- property string: str
Raw quality string
- property values: str
List of quality values
- property probs: str
The quality values converted to probabilities of error
- class ngs_tools.fastq.Read.Read(header: str, sequence: str, qualities: str)
Class that represents a FASTQ read. Once the class is initialized, with the header, sequence, and quality strings, they should not be changed (hence using @property). Only Phred33 quality scores are supported, but there is no check for this.
- _qualities
Quality
instance representing the sequence qualities; for internal use only. Usequalities
instead.
- property header: str
Raw header string
- property sequence: str
Raw sequence string
- property name: str
Name of the sequence, which comes immediately after the
@
- property attributes: str
String of read attributes. This is the substring after the first space in the header.
Package Contents
Functions
|
Convert a Fastq to unmapped BAM. |
|
Convert FASTQs to an unmapped BAM according to an arbitrary function. |
|
Convert FASTQs to an unmapped BAM according to the provided |
- ngs_tools.fastq.fastq_to_bam(fastq_path: str, bam_path: str, name: Optional[str] = None, n_threads: int = 1, show_progress: bool = False) str
Convert a Fastq to unmapped BAM.
- Parameters:
fastq_path – Path to the input FASTQ
bam_path – Path to the output BAM
name – Name for this set of reads. Defaults to None. If not provided, a random string is generated by calling
shortuuid.uuid()
. This value is added as the read group (RG tag) for all the reads in the BAM.n_threads – Number of threads to use. Defaults to 1.
show_progress – Whether to display a progress bar. Defaults to False.
- Returns:
Path to BAM
- ngs_tools.fastq.fastqs_to_bam(fastq_paths: List[str], parse_func: Callable[[Tuple[Read.Read, Ellipsis], pysam.AlignmentHeader], pysam.AlignedSegment], bam_path: str, name: Optional[str] = None, n_threads: int = 1, show_progress: bool = False) str
Convert FASTQs to an unmapped BAM according to an arbitrary function.
- Parameters:
fastq_paths – List of FASTQ paths.
parse_func – Function that accepts a tuple of
ngs_tools.fastq.Read
objects (one from each FASTQ) and apysam.AlignmentHeader
object as the second argument, and returns a newpysam.AlignedSegment
object to write into the BAM. Note that the second argument must be used for the header argument when initializing the newpysam.AlignedSegment
. Whenever this function returns None, the read will not be written to the BAM.name – Name for this set of reads. Defaults to None. If not provided, a random string is generated by calling
shortuuid.uuid()
. This value is added as the read group (RG tag) for all the reads in the BAM.bam_path – Path to the output BAM
n_threads – Number of threads to use. Defaults to 1.
show_progress – Whether to display a progress bar. Defaults to False.
- Returns:
Path to BAM
- ngs_tools.fastq.fastqs_to_bam_with_chemistry(fastq_paths: List[str], chemistry: ngs_tools.chemistry.Chemistry, tag_map: Dict[str, Tuple[str, str]], bam_path: str, name: Optional[str] = None, sequence_key: str = 'cdna', n_threads: int = 1, show_progress: bool = False) str
Convert FASTQs to an unmapped BAM according to the provided
ngs_tools.chemistry.Chemistry
instance.Note that any split features (i.e. split barcode where barcode is in multiple positions) are concatenated.
- Parameters:
fastq_paths – List of FASTQ paths. The order must match that of the chemistry.
chemistry –
ngs_tools.chemistry.Chemistry
instance to use to parse the reads.tag_map – Mapping of parser names to their corresponding BAM tags. The keys are the parser names, and the values must be a tuple of
(sequence BAM tag, quality BAM tag)
, where the former is the tag that will be used for the nucleotide sequence, and the latter is the tag that will be used for the quality scores.bam_path – Path to the output BAM
name – Name for this set of reads. Defaults to None. If not provided, a random string is generated by calling
shortuuid.uuid()
. This value is added as the read group (RG tag) for all the reads in the BAM.sequence_key – Parser key to use as the actual alignment sequence. Defaults to cdna.
n_threads – Number of threads to use. Defaults to 1.
show_progress – Whether to display a progress bar. Defaults to False.
- Returns:
Path to BAM
- Raises:
FastqError – If the number of FASTQs provided does not meet the number required for the specified chemistry, if the tag map provides keys that do not exist for the chemistry, or if the tag map contains multiple BAM tags.
ngs_tools.gtf
Submodules
ngs_tools.gtf.Gtf
Module Contents
Classes
Class that represents a GTF file. |
- exception ngs_tools.gtf.Gtf.GtfError
Bases:
Exception
Common base class for all non-exit exceptions.
- class ngs_tools.gtf.Gtf.Gtf(path: str, mode: typing_extensions.Literal[r, w] = 'r')
Bases:
ngs_tools.utils.FileWrapper
Class that represents a GTF file.
- read() ngs_tools.gtf.GtfEntry.GtfEntry
Read a single GTF entry as a
GtfEntry
instance.- Returns:
The next GTF entry
- Raises:
GtfError – If the file was not opened for reading, or the file was closed.
StopIteration – When there are no more entries to read.
- write(entry: ngs_tools.gtf.GtfEntry.GtfEntry)
Write a single GTF entry.
- Parameters:
entry – The GTF entry to write
- Raises:
FastaError – If the file was not opened for writing, or the file was closed.
ngs_tools.gtf.GtfEntry
Module Contents
Classes
Represents a single GTF entry. |
- exception ngs_tools.gtf.GtfEntry.GtfEntryError
Bases:
Exception
Common base class for all non-exit exceptions.
- class ngs_tools.gtf.GtfEntry.GtfEntry(line: str)
Represents a single GTF entry.
- PARSER
Static attribute that contains a compiled regex. Used to parse a GTF line.
- ATTRIBUTE_PARSER
Static attribute that contains a compiled regex. Used to parse GTF entry attributes.
- _chromosome
Chromosome; for internal use only. Use
chromosome
instead.
- _attribute_str
Raw GTF entry attribute string; for internal use only. Use
attributes
instead.
- property line: str
Raw GTF line
- property chromosome: str
Chromosome
- property feature: str
Feature
- property start: int
Start, 1-indexed
- property end: int
End, 1-indexed, inclusive
- property strand: str
Strand
- property attributes: Dict[str, str]
Dictionary of attributes
- PARSER
- ATTRIBUTE_PARSER
- to_segment() ngs_tools.gtf.Segment.Segment
Convert this GTF entry into a
Segment
.- Returns:
The new segment
ngs_tools.gtf.Segment
Module Contents
Classes
Class to represent an integer interval segment, zero-indexed, start-inclusive |
- exception ngs_tools.gtf.Segment.SegmentError
Bases:
Exception
Common base class for all non-exit exceptions.
- class ngs_tools.gtf.Segment.Segment(start: int, end: int)
Class to represent an integer interval segment, zero-indexed, start-inclusive and end-exclusive.
- property start: int
Segment start
- property end: int
Segment end
- property width: int
Segment width
- is_in(i: int) bool
Evaluate whether an integer is contained within the segment.
- Parameters:
i – Integer number to check
- Returns:
True or False
- is_exclusive(segment: Segment) bool
Evaluate whether this segment is exclusive of another segment.
- Parameters:
segment –
Segment
object to check- Returns:
True or False
- is_overlapping(segment: Segment) bool
Evaluate whether this segment overlaps with another segment.
- Parameters:
segment –
Segment
object to check- Returns:
True or False
- is_subset(segment: Segment) bool
Evaluate whether this segment is a subset of another segment.
- Parameters:
segment –
Segment
object to check- Returns:
True or False
- is_superset(segment: Segment) bool
Evaluate whether this segment is a superset of another segment.
- Parameters:
segment –
Segment
object to check- Returns:
True or False
- flank(l: int, left: Optional[int] = None, right: Optional[int] = None) Segment
Construct a new segment with start and end flanks of length
l
. Optionally, limit the size of the new segment.- Parameters:
l – Flank length
left – Clip the resulting segment’s start to this value. Defaults to None. If not provided, no clipping is performed.
right – Clip the resulting segment’s end to this value. Defaults to None. If not provided, no clipping is performed.
- Returns:
The new segment
- __iter__()
- __eq__(other)
Return self==value.
- __lt__(other)
Return self<value.
- __gt__(other)
Return self>value.
- __repr__()
Return repr(self).
ngs_tools.gtf.SegmentCollection
Module Contents
Classes
Class to represent a collection of integer interval segments, zero-indexed. |
- exception ngs_tools.gtf.SegmentCollection.SegmentCollectionError
Bases:
Exception
Common base class for all non-exit exceptions.
- class ngs_tools.gtf.SegmentCollection.SegmentCollection(segments: Optional[List[ngs_tools.gtf.Segment.Segment]] = None)
Class to represent a collection of integer interval segments, zero-indexed. The segments are always sorted and overlaps are collapsed.
- property segments: List[ngs_tools.gtf.Segment.Segment]
Get a list of
Segment
instances
- property start: int
Leftmost value of all segments. 0 if there are no segments.
- property end: int
Rightmost value of all segments. 0 if there are no segments.
- add_segment(segment: ngs_tools.gtf.Segment.Segment)
Add a segment to the collection.
- Parameters:
segment – Segment to add
- add_collection(collection: SegmentCollection)
Add a collection of segments to the collection.
- Parameters:
collection – Collection to add
- __iter__()
- __getitem__(i)
- __len__()
- __bool__()
- __eq__(other: SegmentCollection)
Return self==value.
- invert(bounds: ngs_tools.gtf.Segment.Segment) SegmentCollection
Invert this
SegmentCollection
within the givenbounds
.- Parameters:
bounds – The bounds to invert with respect to.
- Returns:
A new
SegmentCollection
that is inverted- Raises:
SegmentCollectionError – If
bounds
does not entirely contain this collection
- collapse()
Collapse the segments in this collection such that there are no overlapping segments. Any overlapping segments are merged into a single large segment.
- span_is_exclusive(collection: SegmentCollection) bool
Evaluate whether the span of this collection is exclusive of that of another collection.
- Parameters:
collection –
SegmentCollection
object to check- Returns:
True or False
- is_overlapping(collection: SegmentCollection) bool
Evaluate whether this collection overlaps with another collection.
- Parameters:
collection –
SegmentCollection
object to check- Returns:
True or False
- is_subset(collection: SegmentCollection) bool
Evaluate whether this collection is a subset of another collection.
- Parameters:
collection –
SegmentCollection
object to check- Returns:
True or False
- is_superset(collection: SegmentCollection) bool
Evaluate whether this collection is a superset of another collection.
- Parameters:
collection –
SegmentCollection
object to check- Returns:
True or False
- flank(l: int, left: Optional[int] = None, right: Optional[int] = None) SegmentCollection
Construct a new segment collection where all the segments have start and end flanks of length
l
. Optionally, limit the span of the new segment collection. This is done by callingSegment.flank()
on all the segments and initializing a newSegmentCollection
. Any overlaps are collapsed.- Parameters:
l – Flank length
left – Clip the resulting collection’s start to this value. Defaults to None. If not provided, no clipping is performed.
right – Clip the resulting collection’s end to this value. Defaults to None. If not provided, no clipping is performed.
- Returns:
The new collection
- classmethod from_positions(positions: Union[List[int], Set[int]]) SegmentCollection
Initialize a new collection given a list or set of integer positions.
- Parameters:
positions – Integer positions to construct the collection from
- Returns:
The new collection
- classmethod from_collections(*collections: SegmentCollection) SegmentCollection
Initialize a new collection given an arbitrary number of collections.
- Parameters:
*collections – The collections to combine
- Returns:
The new collection
- __repr__()
Return repr(self).
Package Contents
Functions
|
Parse GTF and yield only the specified features as |
|
Parse GTF for gene and transcript information. Also, compute the introns of |
- ngs_tools.gtf.parse_gtf(gtf_path: str, filter_func: Callable[[GtfEntry.GtfEntry], bool] = lambda entry: ..., show_progress: bool = False) Generator[GtfEntry.GtfEntry, None, None]
Parse GTF and yield only the specified features as
GtfEntry
instances.- Parameters:
gtf_path – path to GTF file
filter_func – Function that takes a
GtfEntry
instance and returnsTrue
for entries to process andFalse
for entries to ignore. Defaults to no filtering.show_progress – Whether to display a progress bar. Defaults to False.
- Yields:
GTF entries
- ngs_tools.gtf.genes_and_transcripts_from_gtf(gtf_path: str, use_version: bool = False, filter_func: Callable[[GtfEntry.GtfEntry], bool] = lambda entry: ..., show_progress: bool = False) Tuple[dict, dict]
Parse GTF for gene and transcript information. Also, compute the introns of each transcript.
- Parameters:
gtf_path – path to GTF file
use_version – whether or not to use gene and transcript versions
filter_func – Function that takes a
GtfEntry
instance and returnsTrue
for entries to process andFalse
for entries to ignore. Defaults to no filtering.show_progress – Whether to display a progress bar. Defaults to False.
- Returns:
Dictionary containing gene information Dictionary containing transcript information
Submodules
ngs_tools.bam
Module Contents
Functions
|
Generator to map an arbitrary function to every read and return its return |
|
Apply an arbitrary function to every read in a BAM. Reads for which the |
|
Count the number of BAM entries. Optionally, a function may be provided to |
|
Split a BAM into many parts, either by the number of reads or by an |
|
Add tags to BAM entries using sequences from one or more FASTQ files. |
|
Filter a BAM by applying the given function to each |
- exception ngs_tools.bam.BamError
Bases:
Exception
Common base class for all non-exit exceptions.
- ngs_tools.bam.map_bam(bam_path: str, map_func: Callable[[pysam.AlignedSegment], Any], n_threads: int = 1, show_progress: bool = False)
Generator to map an arbitrary function to every read and return its return values.
- Parameters:
bam_path – Path to the BAM file
map_func – Function that takes a
pysam.AlignedSegment
object and returns some valuen_threads – Number of threads to use. Defaults to 1.
show_progress – Whether to display a progress bar. Defaults to False.
- Yields:
map_func
applied to each read in the BAM file
- ngs_tools.bam.apply_bam(bam_path: str, apply_func: Callable[[pysam.AlignedSegment], Optional[pysam.AlignedSegment]], out_path: str, n_threads: int = 1, show_progress: bool = False)
Apply an arbitrary function to every read in a BAM. Reads for which the function returns None are not written to the output BAM.
- Parameters:
bam_path – Path to the BAM file
apply_func – Function that takes a
pysam.AlignedSegment
object and optionally returnspysam.AlignedSegment
objectsout_path – Path to output BAM file
n_threads – Number of threads to use. Defaults to 1.
show_progress – Whether to display a progress bar. Defaults to False.
- Returns:
Path to written BAM
- ngs_tools.bam.count_bam(bam_path: str, filter_func: Optional[Callable[[pysam.AlignedSegment], bool]] = None, n_threads: int = 1, show_progress: bool = False) int
Count the number of BAM entries. Optionally, a function may be provided to only count certain alignments.
- Parameters:
bam_path – Path to BAM
filter_func – Function that takes a
pysam.AlignedSegment
object and returns True for reads to be counted and False otherwisen_threads – Number of threads to use. Defaults to 1.
show_progress – Whether to display a progress bar. Defaults to False.
- Returns:
Number of alignments in BAM
- ngs_tools.bam.split_bam(bam_path: str, split_prefix: str, split_func: Optional[Callable[[pysam.AlignedSegment], str]] = None, n: Optional[int] = None, n_threads: int = 1, check_pair_groups: bool = True, show_progress: bool = False) Dict[str, Tuple[str, int]]
Split a BAM into many parts, either by the number of reads or by an arbitrary function. Only one of
split_func
orn
must be provided. Read pairs are always written to the same file.This function makes two passes through the BAM file. The first pass is to identify which reads must be written together (i.e. are pairs). The second pass is to actually extract the reads and write them to the appropriate split.
The following procedure is used to identify pairs. 1) The
.is_paired
property is checked to be True. 2) If the read is uanligned, at most one other unaligned read with the sameread name is allowed to be in the BAM. This other read is its mate. If the read is aligned, it should have the
HI
BAM tag indicating the alignment index. If noHI
tag is present, then it is assumed only one alignment should be present for each read pair. If any of these constraints are not met, an exception is raised.- Parameters:
bam_path – Path to the BAM file
split_prefix – File path prefix to all the split BAMs
split_func – Function that takes a
pysam.AlignedSegment
object and returns a string ID that is used to group reads into splits. All reads with a given ID will be written to a single BAM. Defaults to None.n – Number of BAMs to split into. Defaults to None.
n_threads – Number of threads to use. Only affects reading. Writing is still serialized. Defaults to 1.
check_pair_groups – When using
split_func
, make sure that paired reads are assigned the same ID (and thus are split into the same BAM). Defaults to True.show_progress – Whether to display a progress bar. Defaults to False.
- Returns:
Dictionary of tuples, where the first element is the path to a split BAM, and the second element is the number of BAM entries written to that split. The keys are either the string ID of each split (if
split_func
is used) or the split index (ifn
is used), and the values are paths.- Raises:
BamError – If any pair constraints are not met.
- ngs_tools.bam.tag_bam_with_fastq(bam_path: str, fastq_path: Union[str, List[str]], tag_func: Union[Callable[[ngs_tools.fastq.Read], dict], List[Callable[[ngs_tools.fastq.Read], dict]]], out_path: str, check_name: bool = True, n_threads: int = 1, show_progress: bool = False)
Add tags to BAM entries using sequences from one or more FASTQ files.
Internally, this function calls
apply_bam()
.Note
The tag keys generated from tag_func must contain unique keys of at most 2 characters.
- Parameters:
bam_path – Path to the BAM file
fastq_path – Path to FASTQ file. This option may be a list to extract tags from multiple FASTQ files. In this case, tag_func must also be a list of functions.
tag_func – Function that takes a
ngs_tools.fastq.Read
object and returns a dictionary of tags. When multiple FASTQs are being parsed simultaneously, each function needs to produce unique dictionary keys. Additionally, BAM tag keys may only be at most 2 characters. However, neither of these conditions are checked in favor of runtime.out_path – Path to output BAM file
check_name – Whether or not to raise a
BamError
if the FASTQ does not contain a read in the BAMn_threads – Number of threads to use. Defaults to 1.
show_progress – Whether to display a progress bar. Defaults to False.
- Returns:
Path to written BAM
- Raises:
BamError – If only one of fastq_path and tag_func is a list, if both are lists but they have different lengths, if check_name=True but there are missing tags.
- ngs_tools.bam.filter_bam(bam_path: str, filter_func: Callable[[pysam.AlignedSegment], bool], out_path: str, n_threads: int = 1, show_progress: bool = False)
Filter a BAM by applying the given function to each
pysam.AlignedSegment
object. When the function returns False, the read is not written to the output BAM.Internally, this function calls
apply_bam()
.- Parameters:
bam_path – Path to the BAM file
filter_func – Function that takes a
pysam.AlignedSegment
object and returns False for reads to be filtered outout_path – Path to output BAM file
n_threads – Number of threads to use. Defaults to 1.
show_progress – Whether to display a progress bar. Defaults to False.
- Returns:
Path to written BAM
ngs_tools.logging
Module Contents
Classes
Custom logger that provides namespacing functionality. |
Functions
|
Given a logger name, silence it completely. |
|
Set the logger to the provided |
Attributes
- ngs_tools.logging.silence_logger(name: str)
Given a logger name, silence it completely.
- Parameters:
name – Logger name to silence
- class ngs_tools.logging.Logger(name: Optional[str] = None)
Custom logger that provides namespacing functionality.
- The following loggers are silenced by default:
anndata, h5py, numba, pysam, pystan
- FORMAT
Static attribute that encodes the logging format string
- logger
The logger object
- ch
A
logging.Streamhandler
object that sets the format
- FORMAT = '[%(asctime)s] %(levelname)7s %(message)s'
- namespaced(namespace: str)
Function decorator to set the logging namespace for the duration of the function.
- Parameters:
namespace – The namespace
- namespaced_context(namespace: str)
Context manager to set the logging namespace.
- Parameters:
namespace – The namespace
- namespace_message(message: str) str
Add namespace information at the beginning of the logging message.
- Parameters:
message – The logging message
- Returns:
The namespaced message
- addHandler(hdlr: logging.Handler, format: bool = True)
- removeHandler(*args, **kwargs)
- setLevel(*args, **kwargs)
- debug(message, *args, **kwargs)
- info(message, *args, **kwargs)
- warning(message, *args, **kwargs)
- exception(message, *args, **kwargs)
- critical(message, *args, **kwargs)
- error(message, *args, **kwargs)
- ngs_tools.logging.logger
ngs_tools.progress
Module Contents
- ngs_tools.progress.progress
ngs_tools.sequence
Module Contents
Functions
|
Convert an alignment to a CIGAR string. |
|
Complement the given sequence, with optional reversing. |
|
|
|
|
|
|
|
|
|
|
|
|
|
Call consensus sequences from a set of sequences. Internally, this |
|
Given a list of sequences and their base qualities, constructs a set of consensus |
|
Calculate the Levenshtein (edit) distance between two sequences. |
|
|
|
|
|
|
|
Calculate the hamming distance between two sequences. |
|
|
|
Calculate the hamming distance between a sequence and a list of sequences. |
|
|
|
Calculate all pairwise hamming distances between two lists of sequences. |
|
|
|
Calculate all pairwise hamming distances between combinations of sequences |
|
|
|
Correct a list of sequences to a whitelist within d hamming distance. |
|
Correct a list of sequences to a whitelist within d hamming distance. |
Attributes
- ngs_tools.sequence.NeedlemanWunsch
- ngs_tools.sequence.NUCLEOTIDES_STRICT = ['A', 'C', 'G', 'T']
- ngs_tools.sequence.NUCLEOTIDES_PERMISSIVE = ['R', 'Y', 'S', 'W', 'K', 'M', 'B', 'D', 'H', 'V', 'N', '-']
- ngs_tools.sequence.NUCLEOTIDES
- ngs_tools.sequence.NUCLEOTIDES_AMBIGUOUS
- ngs_tools.sequence.NUCLEOTIDE_COMPLEMENT
- ngs_tools.sequence.NUCLEOTIDE_MASKS
- ngs_tools.sequence.MASK_TO_NUCLEOTIDE
- ngs_tools.sequence.LEVENSHTEIN_DISTANCE_ALIGNER
- ngs_tools.sequence.SEQUENCE_PARSER
- exception ngs_tools.sequence.SequenceError
Bases:
Exception
Common base class for all non-exit exceptions.
- ngs_tools.sequence.alignment_to_cigar(reference: str, query: str, mismatch: bool = False) str
Convert an alignment to a CIGAR string.
The CIGAR is always constructed relative to the reference (i.e. as insertions/deletions from the reference). The provided sequences must represent their alignments, containing the “-” character at positions where there are gaps in one or another.
- Parameters:
reference – The reference sequence alignment
query – The query sequence alignment
mismatch – Whether or not to use the “X” CIGAR operation in places of mismatches. Defaults to True, such that all non-gaps are considered matches with the “M” CIGAR operation. When there are ambiguous characters, a mismatch occurs only when the two sets of possible nucleotides are exclusive.
- Returns:
The CIGAR string representing the provided alignment
- Raises:
SequenceError – if the reference and query do not have the same length, or if there are gaps in both alignments at the same position
- ngs_tools.sequence.complement_sequence(sequence: str, reverse: bool = False) str
Complement the given sequence, with optional reversing.
- Parameters:
sequence – Input sequence
reverse – Whether or not to perform reverse complementation
- Returns:
Complemented (and optionally reversed) string
- ngs_tools.sequence._sequence_to_array(sequence: str, l: Optional[int] = None) numpy.ndarray
- ngs_tools.sequence._qualities_to_array(qualities: Union[str, array.array], l: Optional[int] = None) numpy.ndarray
- ngs_tools.sequence._most_likely_array(positional_probs: numpy.ndarray) numpy.ndarray
- ngs_tools.sequence._most_likely_sequence(positional_probs: numpy.ndarray, allow_ambiguous: bool = False) str
- ngs_tools.sequence._disambiguate_sequence(sequence: numpy.ndarray) List[str]
- ngs_tools.sequence._calculate_positional_probs(sequences: numpy.ndarray, qualities: numpy.ndarray) numpy.ndarray
- ngs_tools.sequence.call_consensus(sequences: List[str], proportion: float = 0.05)
Call consensus sequences from a set of sequences. Internally, this function calls
call_consensus_with_qualities()
with all qualities set to 1 andq_threshold
set to 1. See the documentation of this function for details on how consensuses are called.- Parameters:
sequences – Sequences to call consensus for
proportion – Proportion of each sequence to allow mismatched bases to be above
q_threshold
- Returns:
List of consensus sequences Numpy array of assignments for each sequence in
sequences
- ngs_tools.sequence.call_consensus_with_qualities(sequences: List[str], qualities: Union[List[str], List[array.array], List[numpy.ndarray]], q_threshold: Optional[int] = None, proportion: float = 0.05, allow_ambiguous: bool = False, return_qualities: bool = False) Union[Tuple[List[str], numpy.ndarray], Tuple[List[str], numpy.ndarray, List[str]]]
Given a list of sequences and their base qualities, constructs a set of consensus sequences by iteratively constructing a consensus (by selecting the most likely base at each position) and assigning sequences with match probability <= max(min(match probability), q_threshold * (proportion * length of longest sequence)) to this consensus. Then, the consensus is updated by constructing the consensus only among these sequences. The match probability of a sequence to a consensus is the sum of the quality values where they do not match (equivalent to negative log probability that all mismatches were sequencing errors).
Note
This function does not perform any alignment among consensus sequences. To detect any insertions/deletions, call this function and then perform alignment among the called consensus sequences.
- Parameters:
sequences – Sequences to call consensus for
qualities – Quality scores for the sequences
q_threshold – Quality threshold. Defaults to the median quality score among all bases in all sequences.
proportion – Proportion of each sequence to allow mismatched bases to be above
q_threshold
. Defaults to 0.05.allow_ambiguous – Allow ambiguous bases in the consensus sequences. Defaults to False, which, on ties, selects a single base lexicographically. This option only has an effect when constructing the final consensus sequence as a string, not when calculating error probabilities.
return_qualities – Whether or not to return qualities for the consensuses. Defaults to False.
- Returns:
List of consensus sequences Numpy array of assignments for each sequence in
sequences
Qualities for each of the consensus sequences, ifreturn_qualities
is True- Raises:
SequenceError – if any sequence-quality pair have different lengths or number of provided sequences does not match number of provided qualities
- ngs_tools.sequence.levenshtein_distance(sequence1: str, sequence2: str) int
Calculate the Levenshtein (edit) distance between two sequences. This is calculated by calling
pyseq_align.NeedlemanWunsch.align()
with the appropriate scores/penalties.- Parameters:
sequence1 – First sequence
sequence2 – Second sequence
- Returns:
Levenshtein distance
- ngs_tools.sequence._mismatch_mask(sequence1: numpy.ndarray, sequence2: numpy.ndarray) int
- ngs_tools.sequence._mismatch_masks(sequence: numpy.ndarray, whitelist: numpy.ndarray, d: int) numpy.ndarray
- ngs_tools.sequence._hamming_distance(sequence1: numpy.ndarray, sequence2: numpy.ndarray) int
- ngs_tools.sequence.hamming_distance(sequence1: str, sequence2: str) int
Calculate the hamming distance between two sequences.
- Parameters:
sequence1 – First sequence
sequence2 – Second sequence
- Returns:
Hamming distance
- Raises:
SequenceError – When the sequences are unequal lengths
- ngs_tools.sequence._hamming_distances(sequence: numpy.ndarray, sequences: numpy.ndarray) numpy.ndarray
- ngs_tools.sequence.hamming_distances(sequence: str, sequences: List[str]) numpy.ndarray
Calculate the hamming distance between a sequence and a list of sequences.
- Parameters:
sequence – Sequence
sequences – List of sequences
- Returns:
Numpy array of hamming distances, where each index
i
corresponds to the hamming distance betweensequence
andsequences[i]
- Raises:
SequenceError – When any of the sequences are unequal length
- ngs_tools.sequence._hamming_distance_matrix(sequences1: numpy.ndarray, sequences2: numpy.ndarray) numpy.ndarray
- ngs_tools.sequence.hamming_distance_matrix(sequences1: List[str], sequences2: List[str]) numpy.ndarray
Calculate all pairwise hamming distances between two lists of sequences.
- Parameters:
sequences1 – List of sequences
sequences2 – List of sequences
- Returns:
Numpy array of hamming distances, where each index
(i, j)
corresponds to the hamming distance betweensequences1[i]
andsequences2[j]
- Raises:
SequenceError – When any of the sequences are unequal length
- ngs_tools.sequence._pairwise_hamming_distances(sequences: numpy.ndarray) numpy.ndarray
- ngs_tools.sequence.pairwise_hamming_distances(sequences: List[str]) numpy.ndarray
Calculate all pairwise hamming distances between combinations of sequences from a single list.
- Parameters:
sequences – List of sequences
- Returns:
Numpy array of hamming distances, where each index
(i, j)
corresponds to the hamming distance betweensequences[i]
andsequences[j]
- Raises:
SequenceError – When any of the sequences are unequal length
- ngs_tools.sequence._correct_to_whitelist(qualities: numpy.ndarray, indices: numpy.ndarray, masks: numpy.ndarray, log10_proportions: numpy.ndarray) Tuple[int, float]
- ngs_tools.sequence.correct_sequences_to_whitelist(sequences: List[str], qualities: Union[List[str], List[array.array]], whitelist: List[str], d: int = 1, confidence: float = 0.95, n_threads: int = 1, show_progress: bool = False) List[Union[str, None]]
Correct a list of sequences to a whitelist within d hamming distance. Note that sequences can contain duplicates, but whitelist can not.
For a given sequence, if there are multiple barcodes in the whitelist to which its distance is <= d, the sequence is assigned to a barcode by using the prior probability that the sequence originated from the barcode. If the confidence that the sequence originated from the most likely barcode is less than confidence, assignment is skipped.
This procedure follows the barcode correction procedure in Cell Ranger by 10X Genomics. Some modifications were made to support ambiguous bases and prevent floating-point underflow. https://kb.10xgenomics.com/hc/en-us/articles/115003822406-How-does-Cell-Ranger-correct-barcode-sequencing-errors
Note
Only hamming distance is supported (not Levenshtein distance).
- Parameters:
sequences – List of sequence strings
qualities – List of quality strings or list of array of qualities (as returned by
pysam.qualitystring_to_array()
)whitelist – List of whitelist sequences to correct to
d – Hamming distance threshold. Sequences will be corrected to the whitelist with hamming distance <=
d
. Defaults to 1.confidence – When a sequence has the same minimum hamming distance to multiple whitelisted sequences, the sequence is assigned to the best whitelisted sequence (using prior probabilities) if the likelihood ratio of this and the sum of all likelihoods is greater than or equal to this value. Defaults to 0.95.
n_threads – Number of threads to use. Defaults to 1.
show_progress – Whether to display a progress bar. Defaults to True.
- Raises:
SequenceError – If all the lengths of each sequence, qualitiy and whitelisted sequence are not equal, the number of sequences and qualities provided are not equal or the whitelist contains duplicates.
- Returns:
The corrections as a list of whitelisted sequences. For sequences that could not be corrected, the corresponding position contains None.
- ngs_tools.sequence.correct_sequences_to_whitelist_simple(sequences: List[str], whitelist: List[str], d: int = 1, n_threads: int = 1, show_progress: bool = False) Dict[str, Union[str, None]]
Correct a list of sequences to a whitelist within d hamming distance. Note that sequences can contain duplicates, but whitelist can not. Unlike
correct_sequences_to_whitelist()
, this function takes a naive approach and discards any sequences that can be corrected to multiple whitelisted sequences.- Parameters:
sequences – List of sequence strings
whitelist – List of whitelist sequences to correct to
d – Hamming distance threshold. Sequences will be corrected to the whitelist with hamming distance <=
d
. Defaults to 1.n_threads – Number of threads to use. Defaults to 1.
show_progress – Whether to display a progress bar. Defaults to True.
- Raises:
SequenceError – If all the lengths of each sequence, qualitiy and whitelisted sequence are not equal, the number of sequences and qualities provided are not equal or the whitelist contains duplicates.
- Returns:
The corrections as a dict of sequence to correction mappings. Note that the return type is different from
correct_sequences_to_whitelist()
.
ngs_tools.utils
Module Contents
Classes
A context manager for doing a "deep suppression" of stdout and stderr in |
|
Wrapper around joblib.Parallel that uses tqdm to print execution progress. |
|
Wrapper around |
|
Generic wrapper class for file-formats. Used to wrap file-format-specific |
Functions
|
Utility function to retry a function some number of times, with optional |
|
Function decorator to retry a function on exceptions. |
|
Execute a single shell command. |
|
Check if a string is a remote URL. |
|
Check if a file is Gzipped by checking the magic string. |
|
Open a (possibly gzipped) file in text mode. |
|
Decompress a gzip file to provided file path. |
|
Compress a file into gzip. |
|
Concatenates an arbitrary number of files into one file. |
|
Concatenates an arbitrary number of files into one TEXT file. |
|
Download a remote file to the provided path while displaying a progress bar. |
|
A context manager that creates a FIFO file to use for piping remote files |
|
Check whether all provided paths exist. |
|
Wrapper for |
|
Pickle a Python object and compress with Gzip. |
|
Load a Python pickle that was compressed with Gzip. |
|
Generator that flattens the given dictionary into 2-element tuples |
|
Generator that flattens the given iterable, except for strings. |
|
Merge two dictionaries, applying an arbitrary function f to duplicate keys. |
|
Extract all values from a nested dictionary. |
|
Set the permissions of a file to be executable. |
- class ngs_tools.utils.suppress_stdout_stderr
A context manager for doing a “deep suppression” of stdout and stderr in Python, i.e. will suppress all print, even if the print originates in a compiled C/Fortran sub-function.
This will not suppress raised exceptions, since exceptions are printed
to stderr just before a script exits, and after the context manager has exited (at least, I think that is why it lets exceptions through). https://github.com/facebook/prophet/issues/223
- __enter__()
- __exit__(*_)
- ngs_tools.utils.retry(function: Callable, retries: int, args: Optional[tuple] = None, kwargs: Optional[dict] = None, retry_every: Optional[int] = None, backoff: bool = False, exceptions: Optional[Tuple[Exception]] = None) Any
Utility function to retry a function some number of times, with optional exponential backoff.
- Parameters:
function – Function to retry
retries – Number of times to retry
args – Function arguments
kwargs – Dictionary of keyword arguments
retry_every – Time to wait in seconds between retries. Defaults to no wait time.
backoff – Whether or not to exponential backoff between retries
exceptions – Tuple of exceptions to expect. Defaults to all exceptions.
- Returns:
Whatever
function
returns
- ngs_tools.utils.retry_decorator(retries: int, retry_every: Optional[int] = None, backoff: bool = False, exceptions: Optional[Tuple[Exception]] = None) Callable
Function decorator to retry a function on exceptions.
- Parameters:
retries – Number of times to retry
retry_every – Time to wait in seconds between retries. Defaults to no wait time.
backoff – Whether or not to exponential backoff between retries
exceptions – Tuple of exceptions to expect. Defaults to all exceptions.
- ngs_tools.utils.run_executable(command: List[str], stdin=None, stdout=subprocess.PIPE, stderr=subprocess.PIPE, wait: bool = True, stream: bool = True, quiet: bool = False, returncode: int = 0, alias: bool = True) Union[subprocess.Popen, Tuple[subprocess.Popen, List[str], List[str]]]
Execute a single shell command.
- Parameters:
command – A list representing a single shell command
stdin – Object to pass into the
stdin
argument for :class:subprocess.Popen
. Defaults to Nonestdout – Object to pass into the stdout argument for :class:
subprocess.Popen
Defaults tosubprocess.PIPE
stderr – Object to pass into the stderr argument for :class:
subprocess.Popen
, Defaults tosubprocess.PIPE
wait – Whether to wait until the command has finished. Defaults to True
stream – Whether to stream the output to the command line. Defaults to True
quiet – Whether to not display anything to the command line and not check the return code. Defaults to False
returncode – The return code expected if the command runs as intended. Defaults to 0
alias – Whether to use the basename of the first element of command. Defaults to True
- Returns:
A tuple of (the spawned process, string printed to stdout, string printed to stderr) if wait=True. Otherwise, just the spawned process.
- Raises:
subprocess.CalledProcessError – If not
quiet
and the process exited with an exit code !=exitcode
- class ngs_tools.utils.ParallelWithProgress(pbar: Optional[tqdm.tqdm] = None, total: Optional[int] = None, desc: Optional[str] = None, disable: bool = False, *args, **kwargs)
Bases:
joblib.Parallel
Wrapper around joblib.Parallel that uses tqdm to print execution progress. Taken from https://stackoverflow.com/a/61900501
- __call__(*args, **kwargs)
- print_progress()
- ngs_tools.utils.is_remote(path: str) bool
Check if a string is a remote URL.
- Parameters:
path – string to check
- Returns:
True or False
- ngs_tools.utils.is_gzip(path: str) bool
Check if a file is Gzipped by checking the magic string.
- Parameters:
path – path to file
- Returns:
True or False
- ngs_tools.utils.open_as_text(path: str, mode: typing_extensions.Literal[r, w]) TextIO
Open a (possibly gzipped) file in text mode.
- Parameters:
path – Path to file
mode – Mode to open file in. Either
r
for read orw
for write.
- Returns:
Opened file pointer that supports
read
andwrite
functions.
- ngs_tools.utils.decompress_gzip(gzip_path: str, out_path: str) str
Decompress a gzip file to provided file path.
- Parameters:
gzip_path – Path to gzip file
out_path – Path to decompressed file
- Returns:
Path to decompressed file
- ngs_tools.utils.compress_gzip(file_path: str, out_path: str) str
Compress a file into gzip.
- Parameters:
file_path – Path to file
out_dir – Path to compressed file
- Returns:
Path to compressed file
- ngs_tools.utils.concatenate_files(*paths: str, out_path: str)
Concatenates an arbitrary number of files into one file.
- Parameters:
*paths – An arbitrary number of paths to files
out_path – Path to place concatenated file
- Returns:
Path to concatenated file
- ngs_tools.utils.concatenate_files_as_text(*paths: str, out_path: str) str
Concatenates an arbitrary number of files into one TEXT file.
Only supports plaintext and gzip files.
- Parameters:
*paths – An arbitrary number of paths to files
out_path – Path to place concatenated file
- Returns:
Path to concatenated file
- class ngs_tools.utils.TqdmUpTo
Bases:
tqdm.tqdm
Wrapper around
tqdm()
so that it can be used withurlretrieve()
. https://github.com/tqdm/tqdm/blob/master/examples/tqdm_wget.py- update_to(b=1, bsize=1, tsize=None)
- ngs_tools.utils.download_file(url: str, path: str) str
Download a remote file to the provided path while displaying a progress bar.
- Parameters:
url – Remote url
path – Local path to download the file to
- Returns:
Path to downloaded file
- ngs_tools.utils.stream_file(url: str, path: str) str
A context manager that creates a FIFO file to use for piping remote files into processes. This function must be used as a context manager (the
with
keyword) so that any exceptions in the streaming thread may be captured.This function spawns a new thread to download the remote file into a FIFO file object. FIFO file objects are only supported on unix systems.
- Parameters:
url – Url to the file
path – Path to place FIFO file
- Yields:
Path to FIFO file
- Raises:
OSError – If the operating system does not support FIFO
- ngs_tools.utils.all_exists(*paths: str) bool
Check whether all provided paths exist.
- Parameters:
*paths – paths to files
- Returns:
True if all files exist, False otherwise
- class ngs_tools.utils.FileWrapper(path: str, mode: typing_extensions.Literal[r, w] = 'r')
Generic wrapper class for file-formats. Used to wrap file-format-specific implementations of reading and writing entries. This class is not designed to be initialized directly. Instead, it should be inherited by children that implements the
read
andwrite
methods appropriately.The file is opened immediately as soon as the class is initialized. This class can also be used as a context manager to safely close the file pointer with a
with
block.- path
Path to the file
- mode
Open mode. Either
r
orw
.
- fp
File pointer
- closed
Whether the file has been closed
- property is_remote: bool
- property is_gzip: bool
- property closed: bool
- __del__()
- __enter__()
- __exit__(*args, **kwargs)
- __iter__()
- _open()
Open the file
- close()
Close the (possibly already-closed) file
- reset()
Reset this wrapper by first closing the file and re-running initialization, which re-opens the file.
- tell() int
Get the current location of the file pointer
- abstract read() Any
Read a single entry. This method must be overridden by children.
- abstract write(entry: Any)
Write a single entry. This method must be overridden by children.
- ngs_tools.utils.mkstemp(dir: Optional[str] = None, delete: bool = False)
Wrapper for
tempfile.mkstemp()
that automatically closes the OS-level file descriptor. This function behaves liketempfile.mkdtemp()
but for files.- Parameters:
dir – Directory to create the temporary file. This value is passed as the
dir
kwarg oftempfile.mkstemp()
. Defaults to None.delete – Whether to delete the temporary file before returning. Defaults to False.
- Returns:
path to the temporary file
- ngs_tools.utils.write_pickle(obj: object, path: str, *args, **kwargs) str
Pickle a Python object and compress with Gzip.
Any additional arguments and keyword arguments are passed to
pickle.dump()
.- Parameters:
obj – Object to pickle
path – Path to save pickle
- Returns:
Saved pickle path
- ngs_tools.utils.read_pickle(path: str) object
Load a Python pickle that was compressed with Gzip.
- Parameters:
path – Path to pickle
- Returns:
Unpickled object
- ngs_tools.utils.flatten_dictionary(d: dict, keys: Optional[tuple] = None) Generator[Tuple[tuple, object], None, None]
Generator that flattens the given dictionary into 2-element tuples containing keys and values. For nested dictionaries, the keys are appended into a tuple.
- Parameters:
d – Dictionary to flatten
keys – Previous keys, defaults to None. Used exclusively for recursion.
- Yields:
Flattened dictionary as (keys, value)
- ngs_tools.utils.flatten_iter(it: Iterable) Generator[object, None, None]
Generator that flattens the given iterable, except for strings.
- Parameters:
lst – Iterable to flatten
- Yields:
Flattened iterable elements
- ngs_tools.utils.merge_dictionaries(d1: dict, d2: dict, f: Callable[[object, object], object] = add, default: object = 0) dict
Merge two dictionaries, applying an arbitrary function f to duplicate keys. Dictionaries may be nested.
- Parameters:
d1 – First dictionary
d2 – Second dictionary
f – Merge function. This function should take two arguments and return one, defaults to +
default – Default value or callable to use for keys not present in either dictionary, defaults to 0
- Returns:
Merged dictionary
- ngs_tools.utils.flatten_dict_values(d: dict) list
Extract all values from a nested dictionary.
- Parameters:
d – Nested dictionary from which to extract values from
- Returns:
All values from the dictionary as a list
- ngs_tools.utils.set_executable(path: str)
Set the permissions of a file to be executable.
Package Contents
- ngs_tools.__version__ = '1.8.4'