ngs_tools.bam

Module Contents

Functions

map_bam(bam_path, map_func[, n_threads, show_progress])

Generator to map an arbitrary function to every read and return its return values.

apply_bam(bam_path, apply_func, out_path[, n_threads, ...])

Apply an arbitrary function to every read in a BAM. Reads for which the function returns None are not written to the output BAM.

count_bam(bam_path[, filter_func, n_threads, ...]) → int

Count the number of BAM entries. Optionally, a function may be provided to only count certain alignments.

split_bam(bam_path, split_prefix[, split_func, n, ...]) → Dict[str, Tuple[str, int]]

Split a BAM into many parts, either by the number of reads or by an arbitrary function.

tag_bam_with_fastq(bam_path, fastq_path, tag_func, ...)

Add tags to BAM entries using sequences from one or more FASTQ files.

filter_bam(bam_path, filter_func, out_path[, ...])

Filter a BAM by applying the given function to each pysam.AlignedSegment object.

exception ngs_tools.bam.BamError

Bases: Exception

Raised by functions in this module, such as split_bam() and tag_bam_with_fastq(), when a BAM-related constraint is violated.

ngs_tools.bam.map_bam(bam_path: str, map_func: Callable[[pysam.AlignedSegment], Any], n_threads: int = 1, show_progress: bool = False)

Generator to map an arbitrary function to every read and return its return values.

Parameters:
  • bam_path – Path to the BAM file

  • map_func – Function that takes a pysam.AlignedSegment object and returns some value

  • n_threads – Number of threads to use. Defaults to 1.

  • show_progress – Whether to display a progress bar. Defaults to False.

Yields:

map_func applied to each read in the BAM file
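A minimal usage sketch, assuming ngs_tools and pysam are installed. The helper names mapq_of and mean_mapq are hypothetical and not part of the library. It shows a callback that reads one attribute from each pysam.AlignedSegment and a consumer that iterates over the generator.

```python
def mapq_of(read):
    """Callback for map_bam: return the mapping quality of one read.

    `read` is expected to be a pysam.AlignedSegment.
    """
    return read.mapping_quality


def mean_mapq(bam_path, n_threads=1):
    """Average mapping quality over every read (hypothetical helper)."""
    # Imported here so that mapq_of can be exercised without ngs_tools installed.
    from ngs_tools.bam import map_bam

    total = 0
    count = 0
    # map_bam is documented as a generator, so we consume one value per read.
    for quality in map_bam(bam_path, mapq_of, n_threads=n_threads):
        total += quality
        count += 1
    return total / count if count else 0.0
```

Because the values are consumed one at a time, nothing forces the per-read results to be collected into a list first.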

ngs_tools.bam.apply_bam(bam_path: str, apply_func: Callable[[pysam.AlignedSegment], Optional[pysam.AlignedSegment]], out_path: str, n_threads: int = 1, show_progress: bool = False)

Apply an arbitrary function to every read in a BAM. Reads for which the function returns None are not written to the output BAM.

Parameters:
  • bam_path – Path to the BAM file

  • apply_func – Function that takes a pysam.AlignedSegment object and returns either a pysam.AlignedSegment object to write, or None to drop the read

  • out_path – Path to output BAM file

  • n_threads – Number of threads to use. Defaults to 1.

  • show_progress – Whether to display a progress bar. Defaults to False.

Returns:

Path to written BAM
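As an illustration of the None convention described above, here is a hedged sketch of an apply_func that drops unmapped reads and tags the rest. The names tag_mapped and write_tagged and the "XL" tag are hypothetical choices for this example. set_tag, is_unmapped and query_length are pysam.AlignedSegment members.

```python
def tag_mapped(read):
    """Callback for apply_bam: drop unmapped reads, tag the rest."""
    if read.is_unmapped:
        # Returning None means the read is not written to the output BAM.
        return None
    # "XL" is an arbitrary user tag chosen for illustration only.
    read.set_tag("XL", read.query_length)
    return read


def write_tagged(bam_path, out_path):
    """Run tag_mapped over a BAM and return the output path (hypothetical helper)."""
    # Imported here so that tag_mapped can be exercised without ngs_tools installed.
    from ngs_tools.bam import apply_bam

    return apply_bam(bam_path, tag_mapped, out_path)
```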

ngs_tools.bam.count_bam(bam_path: str, filter_func: Optional[Callable[[pysam.AlignedSegment], bool]] = None, n_threads: int = 1, show_progress: bool = False) → int

Count the number of BAM entries. Optionally, a function may be provided to only count certain alignments.

Parameters:
  • bam_path – Path to BAM

  • filter_func – Function that takes a pysam.AlignedSegment object and returns True for reads to be counted and False otherwise

  • n_threads – Number of threads to use. Defaults to 1.

  • show_progress – Whether to display a progress bar. Defaults to False.

Returns:

Number of alignments in BAM
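A short sketch of counting with and without a filter, assuming ngs_tools and pysam are installed. The helper names is_primary_mapped and mapped_fraction are hypothetical. The flags used are standard pysam.AlignedSegment properties.

```python
def is_primary_mapped(read):
    """filter_func for count_bam: True only for mapped, primary alignments."""
    return not (read.is_unmapped or read.is_secondary or read.is_supplementary)


def mapped_fraction(bam_path):
    """Fraction of BAM entries that are mapped primary alignments (hypothetical helper)."""
    # Imported here so that the filter can be exercised without ngs_tools installed.
    from ngs_tools.bam import count_bam

    total = count_bam(bam_path)
    if total == 0:
        return 0.0
    return count_bam(bam_path, filter_func=is_primary_mapped) / total
```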

ngs_tools.bam.split_bam(bam_path: str, split_prefix: str, split_func: Optional[Callable[[pysam.AlignedSegment], str]] = None, n: Optional[int] = None, n_threads: int = 1, check_pair_groups: bool = True, show_progress: bool = False) → Dict[str, Tuple[str, int]]

Split a BAM into many parts, either by the number of reads or by an arbitrary function. Exactly one of split_func or n must be provided. Read pairs are always written to the same file.

This function makes two passes through the BAM file. The first pass is to identify which reads must be written together (i.e. are pairs). The second pass is to actually extract the reads and write them to the appropriate split.

The following procedure is used to identify pairs. 1) The .is_paired property is checked to be True. 2) If the read is unaligned, at most one other unaligned read with the same read name is allowed to be in the BAM. This other read is its mate. If the read is aligned, it should have the HI BAM tag indicating the alignment index. If no HI tag is present, then it is assumed only one alignment should be present for each read pair. If any of these constraints is not met, an exception is raised.
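The pairing rules above can be sketched roughly as follows. This is one interpretation of the description, not the library's implementation. The names pair_key and check_pairs are hypothetical, and the sketch raises ValueError rather than BamError so that it stays self-contained.

```python
from collections import Counter


def pair_key(read):
    """Key shared by a read and its mate, or None if the read is not paired."""
    if not read.is_paired:
        return None
    if read.is_unmapped:
        # Unaligned mates are matched on read name alone.
        return (read.query_name, "unaligned")
    # HI holds the alignment index when the aligner emits it; None otherwise.
    hit_index = read.get_tag("HI") if read.has_tag("HI") else None
    return (read.query_name, hit_index)


def check_pairs(reads):
    """Count reads per pair key and fail if any key has more than two reads."""
    counts = Counter(key for key in map(pair_key, reads) if key is not None)
    for key, n_reads in counts.items():
        if n_reads > 2:
            raise ValueError(f"more than two reads share pair key {key!r}")
    return counts
```

Under this interpretation, a multi-mapped pair without HI tags would produce more than two reads with the same key, which corresponds to the constraint violation described above.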

Parameters:
  • bam_path – Path to the BAM file

  • split_prefix – File path prefix to all the split BAMs

  • split_func – Function that takes a pysam.AlignedSegment object and returns a string ID that is used to group reads into splits. All reads with a given ID will be written to a single BAM. Defaults to None.

  • n – Number of BAMs to split into. Defaults to None.

  • n_threads – Number of threads to use. Only affects reading. Writing is still serialized. Defaults to 1.

  • check_pair_groups – When using split_func, make sure that paired reads are assigned the same ID (and thus are split into the same BAM). Defaults to True.

  • show_progress – Whether to display a progress bar. Defaults to False.

Returns:

Dictionary whose keys are either the string ID of each split (if split_func is used) or the split index (if n is used), and whose values are tuples. The first element of each tuple is the path to a split BAM, and the second element is the number of BAM entries written to that split.

Raises:

BamError – If any pair constraints are not met.
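A hedged usage sketch for splitting with split_func, assuming ngs_tools and pysam are installed. The names by_barcode and split_by_barcode and the "no_barcode" fallback ID are hypothetical. CB is the standard SAM tag for a cell barcode, and the sketch assumes both mates of a pair carry the same CB value so that they receive the same ID.

```python
def by_barcode(read):
    """split_func for split_bam: group reads by their CB (cell barcode) tag."""
    # split_func must return a string ID, so reads without CB get a fallback ID.
    return read.get_tag("CB") if read.has_tag("CB") else "no_barcode"


def split_by_barcode(bam_path, split_prefix):
    """Split a BAM per barcode and return {split ID: path} (hypothetical helper)."""
    # Imported here so that by_barcode can be exercised without ngs_tools installed.
    from ngs_tools.bam import split_bam

    splits = split_bam(bam_path, split_prefix, split_func=by_barcode)
    # Each value is a (path, number of entries written) tuple.
    return {
        split_id: path
        for split_id, (path, n_entries) in splits.items()
        if n_entries > 0
    }
```

To split into a fixed number of files instead, pass n and leave split_func as None.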

ngs_tools.bam.tag_bam_with_fastq(bam_path: str, fastq_path: Union[str, List[str]], tag_func: Union[Callable[[ngs_tools.fastq.Read], dict], List[Callable[[ngs_tools.fastq.Read], dict]]], out_path: str, check_name: bool = True, n_threads: int = 1, show_progress: bool = False)

Add tags to BAM entries using sequences from one or more FASTQ files.

Internally, this function calls apply_bam().

Note

The dictionaries returned by tag_func must use unique keys of at most 2 characters.

Parameters:
  • bam_path – Path to the BAM file

  • fastq_path – Path to FASTQ file. This option may be a list to extract tags from multiple FASTQ files. In this case, tag_func must also be a list of functions.

  • tag_func – Function that takes a ngs_tools.fastq.Read object and returns a dictionary of tags. When multiple FASTQs are being parsed simultaneously, each function needs to produce unique dictionary keys. Additionally, BAM tag keys may be at most 2 characters long. However, neither of these conditions is checked, in order to save runtime.

  • out_path – Path to output BAM file

  • check_name – Whether to raise a BamError if the FASTQ does not contain a read that is in the BAM. Defaults to True.

  • n_threads – Number of threads to use. Defaults to 1.

  • show_progress – Whether to display a progress bar. Defaults to False.

Returns:

Path to written BAM

Raises:

BamError – If only one of fastq_path and tag_func is a list, if both are lists but they have different lengths, or if check_name=True but there are missing tags.
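A sketch of a tag_func for a single FASTQ, under several assumptions: that ngs_tools.fastq.Read exposes the read bases as a sequence attribute, and that the first 16 bases are a cell barcode followed by a 12-base UMI (a layout chosen purely for illustration). The names barcode_umi_tags and tag_with_barcodes are hypothetical. CR and UR are the standard SAM tags for raw cell barcode and raw UMI sequences.

```python
def barcode_umi_tags(read):
    """tag_func: derive two-character BAM tags from one FASTQ read.

    Assumes `read.sequence` holds the bases as a string.
    """
    sequence = read.sequence
    # Assumed layout for illustration: 16 bp barcode, then 12 bp UMI.
    return {"CR": sequence[:16], "UR": sequence[16:28]}


def tag_with_barcodes(bam_path, fastq_path, out_path):
    """Write a tagged copy of the BAM and return its path (hypothetical helper)."""
    # Imported here so that the tag_func can be exercised without ngs_tools installed.
    from ngs_tools.bam import tag_bam_with_fastq

    return tag_bam_with_fastq(bam_path, fastq_path, barcode_umi_tags, out_path)
```

With several FASTQs, fastq_path and tag_func would both be lists of equal length, and each function would need to return distinct keys.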

ngs_tools.bam.filter_bam(bam_path: str, filter_func: Callable[[pysam.AlignedSegment], bool], out_path: str, n_threads: int = 1, show_progress: bool = False)

Filter a BAM by applying the given function to each pysam.AlignedSegment object. When the function returns False, the read is not written to the output BAM.

Internally, this function calls apply_bam().

Parameters:
  • bam_path – Path to the BAM file

  • filter_func – Function that takes a pysam.AlignedSegment object and returns False for reads to be filtered out

  • out_path – Path to output BAM file

  • n_threads – Number of threads to use. Defaults to 1.

  • show_progress – Whether to display a progress bar. Defaults to False.

Returns:

Path to written BAM
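A minimal filtering sketch, assuming ngs_tools and pysam are installed. The names make_mapq_filter and keep_confident are hypothetical, and the default threshold of 30 is an arbitrary example value. A closure keeps the threshold configurable while still giving filter_bam a one-argument function.

```python
def make_mapq_filter(min_mapq):
    """Build a filter_func that keeps mapped reads with MAPQ >= min_mapq."""

    def keep(read):
        # Returning False means the read is not written to the output BAM.
        return (not read.is_unmapped) and read.mapping_quality >= min_mapq

    return keep


def keep_confident(bam_path, out_path, min_mapq=30):
    """Write only confidently mapped reads and return the output path (hypothetical helper)."""
    # Imported here so that the filter can be exercised without ngs_tools installed.
    from ngs_tools.bam import filter_bam

    return filter_bam(bam_path, make_mapq_filter(min_mapq), out_path)
```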