Reference Documentation¶
FastqReader¶
-
class
matcha.
FastqReader
(threads=None)[source]¶ Fastq reading, barcode matching, and optional export of barcoded fastqs
-
matches
¶ After calling read_chunk, holds the quality of matches for each barcode_name.
- Type
Dict[str, matcha.MatchResult]
-
add_sequence
(sequence_name, input_path, output_path='')[source]¶ Add fastq input and/or output files for barcode matching
-
add_barcode
(barcode_name, matcher, sequence_name, match_start=0)[source]¶ Add barcode matcher on a sequence
- Parameters
barcode_name (str) – Name of barcode (used to access match data from this barcode)
matcher (Matcher) – matcha.Matcher object holding the valid barcodes
sequence_name (str) – Name of sequence to match on (typically R1, R2, I1, or I2)
match_start (int) – 0-based index to start matching from in the sequence
-
set_output_names
(pattern)[source]¶ Set pattern for read names in the output fastq. Available variables for substition are:
read_name: The original read name from the input fastq
Any barcodes that have been added via inputs.add_barcodes: outputs the label for best match barcode
lane, tile, x, y: derived from Illumina read name position info
- Parameters
pattern (str) – Python format string.
-
read_chunk
(max_chunk_size)[source]¶ Read and match barcodes on a chunk of data
- Parameters
max_chunk_size (int) – Maximum number of reads to process in the chunk
- Returns
Number of records read in chunk
-
get_sequence_read
(sequence_name, start=None, end=None)[source]¶ Get raw sequence reads from an input fastq.
- Parameters
- Returns
numpy array of strings containing corresponding sequences for most recent chunk
-
get_sequence_qual
(sequence_name, start=None, end=None)[source]¶ Get sequence quality strings from an input fastq.
- Parameters
- Returns
numpy array of strings containing corresponding qualities for most recent chunk
-
get_sequence_name
(sequence_name)[source]¶ Get sequence name strings from an input fastq.
- Parameters
sequence_name (str) – Name of sequence as given in add_sequence (typically R1, R2, I1, or I2)
- Returns
numpy array of strings containing corresponding names for most recent chunk
-
ListMatcher¶
-
class
matcha.
ListMatcher
(sequences, labels=None)[source]¶ List matcher iterates through possible matches in a list. Slow for many valid barcodes, but has no limits on the maximum number of mismatches. As a rule of thumb, this is the best choice if you have under 20 valid sequences, but you should probably not use it to match over 100 sequences
HashMatcher¶
-
class
matcha.
HashMatcher
(sequences, max_mismatches, subsequence_count, labels=None)[source]¶ Hash matcher uses hash tables of subsequences for barcode search, using the algorithm of Norouzi et al. https://arxiv.org/pdf/1307.2982.pdf. Fast for many valid barcodes and few possible mismatches. Increasing the max_mismatches will decrease performance. For 10x-style barcodes (16bp, ~1M valid barcodes), recommended settings are max_mismatches=1, subsequence_count=2
- Parameters
sequences (List[str]) – Barcode DNA sequences
max_mismatches (int) – Maximum mismatches to match against
subsequence_count (int) – Number of subsequence indexes to use. In general, use lower subsequence_count for larger number of valid labels, and higher subsequence_count for searching against a larger number of mismatches.
labels (List[str]) – Labels for barcode sequences (optional)
MatchResult¶
-
class
matcha.
MatchResult
(match, dist, second_best_dist, labels)[source]¶ Container type for match results.
-
label
¶ String array of labels for the best match. Only valid if labels were provided while creating the matcher object
- Type
-
dist
¶ Integer array of distances (number of mismatches) to the best match
- Type
-
second_best_dist
¶ Integer array of distances (number of mismatches) to the second-best match
- Type
-
match
¶ Integer array of indexes of the best match in the Matcher object’s list of valid sequences
- Type
-