iss package¶
Subpackages¶
Submodules¶
iss.abundance module¶
-
iss.abundance.
draft
(genomes, draft, distribution, output)[source]¶ Computes the abundance dictionary for a mix of complete and draft genomes
Parameters: - genomes (list) – list of all input records
- draft (list) – draft genome files
- distribution (function) – distribution function to use
- output (str) – output file
Returns: the abundance dictionary
Return type: dict
-
iss.abundance.
exponential
(record_list)[source]¶ - Generate scaled exponential abundance distribution from a number of
- records
Parameters: record_list (list) – a list of record.id Returns: a dictionary with records as keys, abundance as values Return type: dict
-
iss.abundance.
halfnormal
(record_list)[source]¶ - Generate scaled halfnormal abundance distribution from a number of
- records
Parameters: record_list (list) – a list of record.id Returns: a dictionary with records as keys, abundance as values Return type: dict
-
iss.abundance.
lognormal
(record_list)[source]¶ - Generate scaled lognormal abundance distribution from a number of
- records
Parameters: record_list (list) – a list of record.id Returns: a dictionary with records as keys, abundance as values Return type: dict
-
iss.abundance.
parse_abundance_file
(abundance_file)[source]¶ Parse an abundance or coverage file
The abundance/coverage file is a flat file of the format “genome_id<TAB>abundance” or “genome_id<TAB>coverage”
Parameters: abundance_file (string) – the path to the abundance file Returns: genome_id as keys, abundance as values Return type: dict
-
iss.abundance.
to_coverage
(total_n_reads, species_abundance, read_length, genome_size)[source]¶ Calculate the coverage of a genome in a metagenome given its size and abundance
Parameters: - total_n_reads (int) – total amount of reads in the dataset
- species_abundance (float) – abundance of the species, between 0 and 1
- read_length (int) – length of the reads in the dataset
- genome_size (int) – size of the genome
Returns: coverage of the genome
Return type: float
-
iss.abundance.
to_file
(abundance_dic, output)[source]¶ Write the abundance dictionary to a file
Parameters: - abundance_dic (dict) – the abundance dictionary
- output (str) – the output file name
iss.app module¶
iss.bam module¶
-
iss.bam.
random
() → x in the interval [0, 1).¶
-
iss.bam.
read_bam
(bam_file, n_reads=1000000)[source]¶ Bam file reader. Select random mapped reads from a bam file
Parameters: bam_file (string) – path to a bam file Yields: read – a pysam read object
-
iss.bam.
to_model
(bam_path, output)[source]¶ from a bam file, write all variables needed for modelling reads in a .npz model file
- For a brief description of the variables that will be written to the
- output file, see the bam.write_to_file function
Parameters: - bam_path (string) – path to a bam file
- output (string) – prefix of the output file
-
iss.bam.
write_to_file
(model, read_length, mean_f, mean_r, hist_f, hist_r, sub_f, sub_r, ins_f, ins_r, del_f, del_r, i_size, output)[source]¶ Write variables to a .npz file
Parameters: - model (string) – the type of error model
- read_length (int) – read length of the dataset
- mean_f (list) – list of mean bin sizes
- mean_r (list) – list of mean bin sizes
- hist_f (list) – list of cumulative distribution functions for the forward read quality
- hist_r (list) – list of cumulative distribution functions for the reverse read quality
- sub_f (list) – list of dictionaries representing the substitution probabilities for the forward reads
- sub_r (list) – list of dictionaries representing the substitution probabilities for the reverse reads
- ins_f (list) – list of dictionaries representing the insertion probabilities for the forward reads
- ins_r (list) – list of dictionaries representing the insertion probabilities for the reverse reads
- del_f (list) – list of dictionaries representing the deletion probabilities for the forward reads
- del_r (list) – list of dictionaries representing the deletion probabilities for the reverse reads
- i_size (int) – distribution of insert size for the aligned reads
- output (string) – prefix of the output file
iss.generator module¶
-
iss.generator.
reads
(record, ErrorModel, n_pairs, cpu_number, output, seed, gc_bias=False)[source]¶ Simulate reads from one genome (or sequence) according to an ErrorModel
This function makes use of the simulate_read function to simulate reads and save them in a fastq file
Parameters: - record (SeqRecord) – sequence or genome of reference
- ErrorModel (ErrorModel) – an ErrorModel
- n_pairs (int) – the number of reads to generate
- cpu_number (int) – an int indentifying the cpu that is used by the function. Is used for naming the output file
- output (str) – the output file prefix
- seed (int) – random seed to use
- gc_bias (bool) – if set, the function may skip a read due to abnormal GC content
Returns: the name of the output file
Return type: str
-
iss.generator.
simulate_read
(record, ErrorModel, i, cpu_number)[source]¶ From a read pair from one genome (or sequence) according to an ErrorModel
Each read is a SeqRecord object returns a tuple containing the forward and reverse read.
Parameters: - record (SeqRecord) – sequence or genome of reference
- ErrorModel (ErrorModel) – an ErrorModel class
- i (int) – a number identifying the read
- cpu_number (int) – cpu number. Is added to the read id.
Returns: tuple containg a forward read and a reverse read
Return type: tuple
-
iss.generator.
to_fastq
(generator, output)[source]¶ Write reads to a fastq file
- Take a generator or a list containing read pairs (tuples) and write them
- in two fastq files: output_R1.fastq and output_R2.fastq
Parameters: - generator (generator) – a read generator (or list)
- output (string) – the output files prefix
iss.modeller module¶
-
iss.modeller.
dispatch_indels
(read)[source]¶ Return the x and y position of a insertion or deletion to be inserted in the indel matrix.
The substitution matrix is a 2D array of size 301 * 9 The x axis (301) corresponds to the position in the read, while the y axis (9) represents the match or indel (see the dispatch dict in the function). Positions 0 is match or substitution, other positions in ‘N1’ are insertions, ‘N2 are deletions’
The size of x axis is 301 because we haven’t calculated the read length yet
Parameters: read (read) – an aligned read object Yields: tuple – a tuple with the x, y position for dispatching the indel in the indel matrix
-
iss.modeller.
dispatch_subst
(base, read, read_has_indels)[source]¶ Return the x and y position of a substitution to be inserted in the substitution matrix.
The substitution matrix is a 2D array of size 301 * 16 The x axis (301) corresponds to the position in the read, while the y axis (16) represents the match or substitution (see the dispatch dict in the function). Positions 0, 4, 8 and 12 are matches, other positions are substitutions
The size of x axis is 301 because we haven’t calculated the read length yet
Parameters: - base (tuple) – one base from an aligmnent object. According to the pysam documentation: an alignment is a list of tuples: aligned read (query) and reference positions. the parameter with_seq adds the ref sequence as the 3rd element of the tuples. substitutions are lower-case.
- read (read) – a read object, from which the alignment comes from
- read_has_indels (bool) – a boolean flag to keep track if the read has an indel or not
Returns: x and y position for incrementing the substitution matrix and a third element: True if an indel has been detected, False otherwise
Return type: tuple
-
iss.modeller.
divide_qualities_into_bins
(qualities, n_bins=4)[source]¶ Divides the raw quality scores in bins according to the mean phred quality of the sequence they come from
Parameters: - qualities (list) – raw count of all the phred scores and mean sequence quality
- n_bins (int) – number of bins to create (default: 4)
Returns: a list of lists containing the binned quality scores
Return type: list
-
iss.modeller.
indel_matrix_to_choices
(indel_matrix, read_length)[source]¶ Transform an indel matrix into probabilties of indels for at every position
From the raw indel count at one position, returns a dictionary with probabilties of indel
Parameters: - indel_matrix (np.array) – the substitution matrix is a 2D array of size read_length * 16. fhe x axis (read_length) corresponds to the position in the read, while the y axis (9) represents the match or indel. Positions 0 is match or substitution, other positions in ‘N1’ are insertions, ‘N2 are deletions’
- read_length (int) – read length
Returns: tuple containing two lists of dictionaries representing the insertion or deletion probabilities for a collection of reads
Return type: tuple
-
iss.modeller.
insert_size
(insert_size_distribution)[source]¶ Calculate cumulative distribution function from the raw insert size distributin. Uses 1D kernel density estimation.
Parameters: - insert_size_distribution (list) – list of insert sizes from aligned
- pairs (read) –
Returns: a cumulative density function
Return type: 1darray
-
iss.modeller.
quality_bins_to_histogram
(bin_lists)[source]¶ Wrapper function to generate cdfs for each quality bins
Generate cumulative distribution functions for a number of mean quality bins
Parameters: - bins_lists (list) – list of list containing raw count of all phred
- scores –
Returns: a list of lists containg cumulative density functions
Return type: list
-
iss.modeller.
raw_qualities_to_histogram
(qualities)[source]¶ Approximate the distribution of base quality at each position in a read using a pseudo 2d kernel density estimation
Generate cumulative distribution functions
Parameters: qualities (list) – raw count of all phred scores Returns: - list of cumulative distribution functions. One cdf per base. The
- list has the size of the read length
Return type: list
-
iss.modeller.
subst_matrix_to_choices
(substitution_matrix, read_length)[source]¶ Transform a substitution matrix into probabilties of substitutions for each base and at every position
From the raw mismatches at one position, returns a dictionary with probabilties of substitutions
Parameters: - substitution_matrix (np.array) – the substitution matrix is a 2D array of size read_length * 16. fhe x axis (read_length) corresponds to the position in the read, while the y axis (16) represents the match or substitution. Positions 0, 4, 8 and 12 are matches, other positions are substitutions
- read_length (int) – read length
Returns: - list of dictionaries representing
the substitution probabilities for a collection of reads
Return type: list
iss.util module¶
-
iss.util.
cleanup
(file_list)[source]¶ remove temporary files
Parameters: file_list (list) – a list of files to be removed
-
iss.util.
concatenate
(file_list, output)[source]¶ Concatenate files together
Parameters: - file_list (list) – the list of input files (can be a generator)
- output (string) – the output file name
-
iss.util.
convert_n_reads
(unit)[source]¶ For strings representing a number of bases and ending with k, K, m, M, g, and G converts to a plain old number
Parameters: n (str) – a string representing a number ending with a suffix Returns: a number of reads Return type: float
-
iss.util.
count_records
(fasta_file)[source]¶ Count the number of records in a fasta file and return a list of recods id
Parameters: fasta_file (string) – the path to a fasta file Returns: a list of record ids Return type: list
-
iss.util.
genome_file_exists
(filename)[source]¶ Checks if the output file from the –ncbi option already exists
Parameters: filename (str) – a file name
-
iss.util.
phred_to_prob
(q)[source]¶ Convert a phred score (Sanger or modern Illumina) in probabilty
Given a phred score q, return the probabilty p of the call being right
Parameters: q (int) – phred score Returns: probabilty of basecall being right Return type: float
-
iss.util.
prob_to_phred
(p)[source]¶ Convert a probabilty into a phred score (Sanger or modern Illumina)
Given a probabilty p of the basecall being right, return the phred score q
Parameters: p (int) – probabilty of basecall being right Returns: phred score Return type: int
-
iss.util.
reservoir
(records, record_list, n=None)[source]¶ yield a number of records from a fasta file using reservoir sampling
Parameters: records (obj) – fasta records from SeqIO.parse Yields: record (obj) – a fasta record