Reference Documentation¶

FastqReader¶

class matcha.FastqReader(threads=None)[source]¶

Fastq reading, barcode matching, and optional export of barcoded fastqs

matches¶

After calling read_chunk, holds the quality of matches for each barcode_name.

Type: Dict[str, matcha.MatchResult]

add_sequence(sequence_name, input_path, output_path='')[source]¶

Add fastq input and/or output files for barcode matching

Parameters

sequence_name (str) – Name of sequence (typically R1, R2, I1, or I2)
input_path (str) – Path of fastq file for sequence
output_path (str) – Path of output fastq containing matching reads (optional)

add_barcode(barcode_name, matcher, sequence_name, match_start=0)[source]¶

Add barcode matcher on a sequence

Parameters

barcode_name (str) – Name of barcode (used to access match data from this barcode)
matcher (Matcher) – matcha.Matcher object holding the valid barcodes
sequence_name (str) – Name of sequence to match on (typically R1, R2, I1, or I2)
match_start (int) – 0-based index to start matching from in the sequence

set_output_names(pattern)[source]¶

Set pattern for read names in the output fastq. Available variables for substition are:

read_name: The original read name from the input fastq

Any barcodes that have been added via inputs.add_barcodes: outputs the label for best match barcode

lane, tile, x, y: derived from Illumina read name position info

Parameters: pattern (str) – Python format string.

read_chunk(max_chunk_size)[source]¶

Read and match barcodes on a chunk of data

Parameters: max_chunk_size (int) – Maximum number of reads to process in the chunk
Returns: Number of records read in chunk

get_match_result(barcode_name, field)[source]¶

Get barcode match results from a barcode match.

Parameters

barcode_name (str) – Name of the barcode as given in add_barcode
field (str) – Name of the field to get data for. One of label, dist, second_best_dist, or match. (See matcha.MatchResults)

Returns

numpy array with match results for most recent chunk

get_sequence_read(sequence_name, start=None, end=None)[source]¶

Get raw sequence reads from an input fastq.

Parameters

sequence_name (str) – Name of sequence as given in add_sequence (typically R1, R2, I1, or I2)
start (int) – 0-based start index for extracted bases (default to first base)
end (int) – 0-based end index for extracted bases (default to last base)

Returns

numpy array of strings containing corresponding sequences for most recent chunk

get_sequence_qual(sequence_name, start=None, end=None)[source]¶

Get sequence quality strings from an input fastq.

Parameters

sequence_name (str) – Name of sequence as given in add_sequence (typically R1, R2, I1, or I2)
start (int) – 0-based start index for extracted bases (default to first base)
end (int) – 0-based end index for extracted bases (default to last base)

Returns

numpy array of strings containing corresponding qualities for most recent chunk

get_sequence_name(sequence_name)[source]¶

Get sequence name strings from an input fastq.

Parameters: sequence_name (str) – Name of sequence as given in add_sequence (typically R1, R2, I1, or I2)
Returns: numpy array of strings containing corresponding names for most recent chunk

write_chunk(filter)[source]¶

Write the current chunk of data in fastq format to output sequence files.

Parameters: filter (array_like[bool]) – Boolean array of same length as last chunk_size. For each element set to True, the corresponding read from the last-read chunk will be output.

close()[source]¶: Close output files

ListMatcher¶

class matcha.ListMatcher(sequences, labels=None)[source]¶

List matcher iterates through possible matches in a list. Slow for many valid barcodes, but has no limits on the maximum number of mismatches. As a rule of thumb, this is the best choice if you have under 20 valid sequences, but you should probably not use it to match over 100 sequences

Parameters

sequences (List[str]) – Barcode DNA sequences
labels (List[str]) – Labels for barcode sequences (optional)

HashMatcher¶

class matcha.HashMatcher(sequences, max_mismatches, subsequence_count, labels=None)[source]¶

Hash matcher uses hash tables of subsequences for barcode search, using the algorithm of Norouzi et al. https://arxiv.org/pdf/1307.2982.pdf. Fast for many valid barcodes and few possible mismatches. Increasing the max_mismatches will decrease performance. For 10x-style barcodes (16bp, ~1M valid barcodes), recommended settings are max_mismatches=1, subsequence_count=2

Parameters

sequences (List[str]) – Barcode DNA sequences
max_mismatches (int) – Maximum mismatches to match against
subsequence_count (int) – Number of subsequence indexes to use. In general, use lower subsequence_count for larger number of valid labels, and higher subsequence_count for searching against a larger number of mismatches.
labels (List[str]) – Labels for barcode sequences (optional)