Reference Documentation

FastqReader

class matcha.FastqReader(threads=None)[source]

Fastq reading, barcode matching, and optional export of barcoded fastqs

matches

After calling read_chunk, holds the quality of matches for each barcode_name.

Type

Dict[str, matcha.MatchResult]

add_sequence(sequence_name, input_path, output_path='')[source]

Add fastq input and/or output files for barcode matching

Parameters
  • sequence_name (str) – Name of sequence (typically R1, R2, I1, or I2)

  • input_path (str) – Path of fastq file for sequence

  • output_path (str) – Path of output fastq containing matching reads (optional)

add_barcode(barcode_name, matcher, sequence_name, match_start=0)[source]

Add barcode matcher on a sequence

Parameters
  • barcode_name (str) – Name of barcode (used to access match data from this barcode)

  • matcher (Matcher) – matcha.Matcher object holding the valid barcodes

  • sequence_name (str) – Name of sequence to match on (typically R1, R2, I1, or I2)

  • match_start (int) – 0-based index to start matching from in the sequence

set_output_names(pattern)[source]

Set pattern for read names in the output fastq. Available variables for substition are:

  • read_name: The original read name from the input fastq

  • Any barcodes that have been added via inputs.add_barcodes: outputs the label for best match barcode

  • lane, tile, x, y: derived from Illumina read name position info

Parameters

pattern (str) – Python format string.

read_chunk(max_chunk_size)[source]

Read and match barcodes on a chunk of data

Parameters

max_chunk_size (int) – Maximum number of reads to process in the chunk

Returns

Number of records read in chunk

get_match_result(barcode_name, field)[source]

Get barcode match results from a barcode match.

Parameters
  • barcode_name (str) – Name of the barcode as given in add_barcode

  • field (str) – Name of the field to get data for. One of label, dist, second_best_dist, or match. (See matcha.MatchResults)

Returns

numpy array with match results for most recent chunk

get_sequence_read(sequence_name, start=None, end=None)[source]

Get raw sequence reads from an input fastq.

Parameters
  • sequence_name (str) – Name of sequence as given in add_sequence (typically R1, R2, I1, or I2)

  • start (int) – 0-based start index for extracted bases (default to first base)

  • end (int) – 0-based end index for extracted bases (default to last base)

Returns

numpy array of strings containing corresponding sequences for most recent chunk

get_sequence_qual(sequence_name, start=None, end=None)[source]

Get sequence quality strings from an input fastq.

Parameters
  • sequence_name (str) – Name of sequence as given in add_sequence (typically R1, R2, I1, or I2)

  • start (int) – 0-based start index for extracted bases (default to first base)

  • end (int) – 0-based end index for extracted bases (default to last base)

Returns

numpy array of strings containing corresponding qualities for most recent chunk

get_sequence_name(sequence_name)[source]

Get sequence name strings from an input fastq.

Parameters

sequence_name (str) – Name of sequence as given in add_sequence (typically R1, R2, I1, or I2)

Returns

numpy array of strings containing corresponding names for most recent chunk

write_chunk(filter)[source]

Write the current chunk of data in fastq format to output sequence files.

Parameters

filter (array_like[bool]) – Boolean array of same length as last chunk_size. For each element set to True, the corresponding read from the last-read chunk will be output.

close()[source]

Close output files

ListMatcher

class matcha.ListMatcher(sequences, labels=None)[source]

List matcher iterates through possible matches in a list. Slow for many valid barcodes, but has no limits on the maximum number of mismatches. As a rule of thumb, this is the best choice if you have under 20 valid sequences, but you should probably not use it to match over 100 sequences

Parameters
  • sequences (List[str]) – Barcode DNA sequences

  • labels (List[str]) – Labels for barcode sequences (optional)

HashMatcher

class matcha.HashMatcher(sequences, max_mismatches, subsequence_count, labels=None)[source]

Hash matcher uses hash tables of subsequences for barcode search, using the algorithm of Norouzi et al. https://arxiv.org/pdf/1307.2982.pdf. Fast for many valid barcodes and few possible mismatches. Increasing the max_mismatches will decrease performance. For 10x-style barcodes (16bp, ~1M valid barcodes), recommended settings are max_mismatches=1, subsequence_count=2

Parameters
  • sequences (List[str]) – Barcode DNA sequences

  • max_mismatches (int) – Maximum mismatches to match against

  • subsequence_count (int) – Number of subsequence indexes to use. In general, use lower subsequence_count for larger number of valid labels, and higher subsequence_count for searching against a larger number of mismatches.

  • labels (List[str]) – Labels for barcode sequences (optional)

MatchResult

class matcha.MatchResult(match, dist, second_best_dist, labels)[source]

Container type for match results.

label

String array of labels for the best match. Only valid if labels were provided while creating the matcher object

Type

numpy.ndarray

dist

Integer array of distances (number of mismatches) to the best match

Type

numpy.ndarray

second_best_dist

Integer array of distances (number of mismatches) to the second-best match

Type

numpy.ndarray

match

Integer array of indexes of the best match in the Matcher object’s list of valid sequences

Type

numpy.ndarray