iss package¶
Subpackages¶
Submodules¶
iss.abundance module¶
- iss.abundance.coverage_scaling(total_n_reads, abundance_dic, genome_file, read_length)[source]¶
Scale coverage distribution according to the n_reads parameter
- Parameters:
total_n_reads (int) – total amount of reads to simulate
abundance_dic (dict) – a dictionary with records as keys, coverage as values
genome_file (str) – path to input fasta file containing genomes
read_length (int) – length of the reads in the dataset
- Returns:
scaled coverage dictionary
- Return type:
dict
- iss.abundance.draft(genomes, draft, distribution, output, mode='abundance')[source]¶
Computes the abundance dictionary for a mix of complete and draft genomes
- Parameters:
genomes (list) – list of all input records
draft (list) – draft genome files
distribution (function) – distribution function to use
output (str) – output file
- Returns:
the abundance dictionary
- Return type:
dict
- iss.abundance.expand_draft_abundance(abundance_dic, draft, mode='abundance')[source]¶
Calculate abundance for each contig of a draft genome The function takes the abundance dictionary and automatically detects draft genomes. In coverage mode the function simply assign the coverage value to each contig
- Parameters:
abundance_dic (dict) – dict with genome (paths or id) as key and abundance as value
draft (list) – draft genome files
mode (str) – abundance or coverage
- Returns:
abundance dictionary with abundance value for each contig
- Return type:
dict
- iss.abundance.exponential(record_list)[source]¶
- Generate scaled exponential abundance distribution from a number of
records
- Parameters:
record_list (list) – a list of record.id
- Returns:
a dictionary with records as keys, abundance as values
- Return type:
dict
- iss.abundance.halfnormal(record_list)[source]¶
- Generate scaled halfnormal abundance distribution from a number of
records
- Parameters:
record_list (list) – a list of record.id
- Returns:
a dictionary with records as keys, abundance as values
- Return type:
dict
- iss.abundance.lognormal(record_list)[source]¶
- Generate scaled lognormal abundance distribution from a number of
records
- Parameters:
record_list (list) – a list of record.id
- Returns:
a dictionary with records as keys, abundance as values
- Return type:
dict
- iss.abundance.parse_abundance_file(abundance_file)[source]¶
Parse an abundance or coverage file
The abundance/coverage file is a flat file of the format “genome_id<TAB>abundance” or “genome_id<TAB>coverage”
- Parameters:
abundance_file (string) – the path to the abundance file
- Returns:
genome_id as keys, abundance as values
- Return type:
dict
- iss.abundance.to_coverage(total_n_reads, species_abundance, read_length, genome_size)[source]¶
Calculate the coverage of a genome in a metagenome given its size and abundance
- Parameters:
total_n_reads (int) – total amount of reads in the dataset
species_abundance (float) – abundance of the species, between 0 and 1
read_length (int) – length of the reads in the dataset
genome_size (int) – size of the genome
- Returns:
coverage of the genome
- Return type:
float
- iss.abundance.to_file(abundance_dic, output, mode='abundance')[source]¶
Write the abundance dictionary to a file
- Parameters:
abundance_dic (dict) – the abundance dictionary
output (str) – the output file name
iss.app module¶
iss.bam module¶
- iss.bam.random() x in the interval [0, 1). ¶
- iss.bam.read_bam(bam_file, n_reads=1000000)[source]¶
Bam file reader. Select random mapped reads from a bam file
- Parameters:
bam_file (string) – path to a bam file
- Yields:
read – a pysam read object
- iss.bam.to_model(bam_path, output)[source]¶
from a bam file, write all variables needed for modelling reads in a .npz model file
- For a brief description of the variables that will be written to the
output file, see the bam.write_to_file function
- Parameters:
bam_path (string) – path to a bam file
output (string) – prefix of the output file
- iss.bam.write_to_file(model, read_length, mean_f, mean_r, hist_f, hist_r, sub_f, sub_r, ins_f, ins_r, del_f, del_r, i_size, output)[source]¶
Write variables to a .npz file
- Parameters:
model (string) – the type of error model
read_length (int) – read length of the dataset
mean_f (list) – list of mean bin sizes
mean_r (list) – list of mean bin sizes
hist_f (list) – list of cumulative distribution functions for the forward read quality
hist_r (list) – list of cumulative distribution functions for the reverse read quality
sub_f (list) – list of dictionaries representing the substitution probabilities for the forward reads
sub_r (list) – list of dictionaries representing the substitution probabilities for the reverse reads
ins_f (list) – list of dictionaries representing the insertion probabilities for the forward reads
ins_r (list) – list of dictionaries representing the insertion probabilities for the reverse reads
del_f (list) – list of dictionaries representing the deletion probabilities for the forward reads
del_r (list) – list of dictionaries representing the deletion probabilities for the reverse reads
i_size (int) – distribution of insert size for the aligned reads
output (string) – prefix of the output file
iss.generator module¶
- iss.generator.generate_work_divider(fasta_file, readcount_dic, abundance_dic, n_reads, coverage, coverage_file, error_model, output, chunk_size)[source]¶
Yields a list of tuples containing the records and the number of reads to generate for each record
- Yields:
list[tuple[SeqIO.Record, int, str]] –
- a list of tuples containing the records,
the number of reads to generate for each record and the mode
- iss.generator.load_error_model(mode, seed, model, fragment_length, fragment_length_sd, store_mutations)[source]¶
Load the error model based on the specified mode and parameters.
- Parameters:
mode (str) – The mode of the error model. Possible values are ‘kde’, ‘basic’, and ‘perfect’.
seed (int) – The random seed to use for generating random numbers.
model (str) – The model to use for the error model. Only applicable for the ‘kde’ mode.
fragment_length (float) – The mean fragment length for the error model.
fragment_length_sd (float) – The standard deviation of the fragment length for the error model.
- Returns:
The loaded error model based on the specified mode and parameters.
- Return type:
err_mod
- iss.generator.load_genomes(genomes, draft, ncbi, n_genomes_ncbi, output, n_genomes)[source]¶
Load genomes from different sources and concatenate them into a single file.
- Parameters:
genomes (list) – List of paths to genome files.
draft (list) – List of paths to draft genome files.
ncbi (bool) – Flag indicating whether to download genomes from NCBI.
n_genomes_ncbi (list) – List of tuples specifying the NCBI genome IDs and the number of genomes to download.
output (str) – Path to the output file.
n_genomes (int) – Number of genomes to select randomly from the concatenated genome file.
- Returns:
A tuple containing the number of genomes and the path to the concatenated genome file.
- Return type:
tuple
- iss.generator.load_readcount_or_abundance(readcount_file, abundance_file, coverage_file, coverage, abundance_distribution, draft, genome_list, genome_file, n_reads, output, error_model)[source]¶
Load readcount or abundance information based on the provided input parameters.
- Parameters:
readcount_file (str) – Path to the readcount file.
abundance_file (str) – Path to the abundance file.
coverage_file (str) – Path to the coverage file.
coverage (str) – Coverage distribution type.
abundance_distribution (str) – Abundance distribution type.
draft (str) – Draft mode.
genome_list (list) – List of genomes.
genome_file (str) – Path to the genome file.
n_reads (int) – Number of reads.
output (str) – Output file path.
error_model (object) – Error model object.
- Returns:
A tuple containing the readcount dictionary and abundance dictionary.
- Return type:
tuple
- iss.generator.reads_generator(n_pairs, record, error_model, cpu_number, gc_bias, sequence_type)[source]¶
- iss.generator.simulate_read(record, error_model, i, cpu_number, sequence_type)[source]¶
From a read pair from one genome (or sequence) according to an ErrorModel
Each read is a SeqRecord object returns a tuple containing the forward and reverse read.
- Parameters:
record (SeqRecord) – sequence or genome of reference
error_model (ErrorModel) – an ErrorModel class
i (int) – a number identifying the read
cpu_number (int) – cpu number. Is added to the read id.
sequence_type (str) – metagenomics or amplicon sequencing used
- Returns:
tuple containg a forward read and a reverse read
- Return type:
tuple
- iss.generator.simulate_reads(record, error_model, n_pairs, cpu_number, forward_handle, reverse_handle, mutations_handle, sequence_type, gc_bias=False, mode='default')[source]¶
Simulate reads from one genome (or sequence) according to an ErrorModel
This function makes use of the simulate_read function to simulate reads and save them in a fastq file
- Parameters:
record (SeqRecord) – sequence or genome of reference
error_model (ErrorModel) – an ErrorModel
n_pairs (int) – the number of reads to generate
cpu_number (int) – an int indentifying the cpu that is used by the function. Is used for naming the output file
forward_handle (file) – a file handle to write the forward reads to
reverse_handle (file) – a file handle to write the reverse reads to
mutations_handle (file) – a file handle to write the mutations to
sequencing_type (str) – metagenomics or amplicon sequencing used
gc_bias (bool) – if set, the function may skip a read due to abnormal GC content
- iss.generator.to_fastq(generator, output)[source]¶
Write reads to a fastq file
- Take a generator or a list containing read pairs (tuples) and write them
in two fastq files: output_R1.fastq and output_R2.fastq
- Parameters:
generator (generator) – a read generator (or list)
output (string) – the output files prefix
iss.modeller module¶
- iss.modeller.dispatch_indels(read)[source]¶
Return the x and y position of a insertion or deletion to be inserted in the indel matrix.
The substitution matrix is a 2D array of size 301 * 9 The x axis (301) corresponds to the position in the read, while the y axis (9) represents the match or indel (see the dispatch dict in the function). Positions 0 is match or substitution, other positions in ‘N1’ are insertions, ‘N2 are deletions’
The size of x axis is 301 because we haven’t calculated the read length yet
- Parameters:
read (read) – an aligned read object
- Yields:
tuple – a tuple with the x, y position for dispatching the indel in the indel matrix
- iss.modeller.dispatch_subst(base, read, read_has_indels)[source]¶
Return the x and y position of a substitution to be inserted in the substitution matrix.
The substitution matrix is a 2D array of size 301 * 16 The x axis (301) corresponds to the position in the read, while the y axis (16) represents the match or substitution (see the dispatch dict in the function). Positions 0, 4, 8 and 12 are matches, other positions are substitutions
The size of x axis is 301 because we haven’t calculated the read length yet
- Parameters:
base (tuple) – one base from an aligmnent object. According to the pysam documentation: an alignment is a list of tuples: aligned read (query) and reference positions. the parameter with_seq adds the ref sequence as the 3rd element of the tuples. substitutions are lower-case.
read (read) – a read object, from which the alignment comes from
read_has_indels (bool) – a boolean flag to keep track if the read has an indel or not
- Returns:
x and y position for incrementing the substitution matrix and a third element: True if an indel has been detected, False otherwise
- Return type:
tuple
- iss.modeller.divide_qualities_into_bins(qualities, n_bins=4)[source]¶
Divides the raw quality scores in bins according to the mean phred quality of the sequence they come from
- Parameters:
qualities (list) – raw count of all the phred scores and mean sequence quality
n_bins (int) – number of bins to create (default: 4)
- Returns:
a list of lists containing the binned quality scores
- Return type:
list
- iss.modeller.indel_matrix_to_choices(indel_matrix, read_length)[source]¶
Transform an indel matrix into probabilties of indels for at every position
From the raw indel count at one position, returns a dictionary with probabilties of indel
- Parameters:
indel_matrix (np.array) – the substitution matrix is a 2D array of size read_length * 16. fhe x axis (read_length) corresponds to the position in the read, while the y axis (9) represents the match or indel. Positions 0 is match or substitution, other positions in ‘N1’ are insertions, ‘N2 are deletions’
read_length (int) – read length
- Returns:
tuple containing two lists of dictionaries representing the insertion or deletion probabilities for a collection of reads
- Return type:
tuple
- iss.modeller.insert_size(template_length_dist, read_length)[source]¶
Calculate cumulative distribution function from the raw insert size distributin. Uses 1D kernel density estimation.
- Parameters:
template_length_dist (list) – List of template lengths from bam file.
read_length (int) – The length of the read.
- Returns:
a cumulative density function
- Return type:
1darray
- iss.modeller.quality_bins_to_histogram(bin_lists)[source]¶
Wrapper function to generate cdfs for each quality bins
Generate cumulative distribution functions for a number of mean quality bins
- Parameters:
bins_lists (list) – list of list containing raw count of all phred
scores –
- Returns:
a list of lists containg cumulative density functions
- Return type:
list
- iss.modeller.raw_qualities_to_histogram(qualities)[source]¶
Approximate the distribution of base quality at each position in a read using a pseudo 2d kernel density estimation
Generate cumulative distribution functions
- Parameters:
qualities (list) – raw count of all phred scores
- Returns:
- list of cumulative distribution functions. One cdf per base. The
list has the size of the read length
- Return type:
list
- iss.modeller.subst_matrix_to_choices(substitution_matrix, read_length)[source]¶
Transform a substitution matrix into probabilties of substitutions for each base and at every position
From the raw mismatches at one position, returns a dictionary with probabilties of substitutions
- Parameters:
substitution_matrix (np.array) – the substitution matrix is a 2D array of size read_length * 16. fhe x axis (read_length) corresponds to the position in the read, while the y axis (16) represents the match or substitution. Positions 0, 4, 8 and 12 are matches, other positions are substitutions
read_length (int) – read length
- Returns:
- list of dictionaries representing
the substitution probabilities for a collection of reads
- Return type:
list
iss.util module¶
- iss.util.cleanup(file_list)[source]¶
remove temporary files
- Parameters:
file_list (list) – a list of files to be removed
- iss.util.compress(filename, remove=True)[source]¶
gzip a file
- Parameters:
filename (string) – name of file to be compressed
- iss.util.concatenate(file_list, output, header=None)[source]¶
Concatenate files together
- Parameters:
file_list (list) – the list of input files (can be a generator)
output (string) – the output file name
- iss.util.convert_n_reads(unit)[source]¶
For strings representing a number of bases and ending with k, K, m, M, g, and G converts to a plain old number
- Parameters:
n (str) – a string representing a number ending with a suffix
- Returns:
a number of reads
- Return type:
float
- iss.util.count_records(fasta_file)[source]¶
Count the number of records in a fasta file and return a list of recods id
- Parameters:
fasta_file (string) – the path to a fasta file
- Returns:
a list of record ids
- Return type:
list
- iss.util.dump(object, output)[source]¶
dump an object, like pickle.dump. This function uses pickle.dumps to dump large objects
- Parameters:
object (object) – a python object
- iss.util.genome_file_exists(filename)[source]¶
Checks if the output file from the –ncbi option already exists
- Parameters:
filename (str) – a file name
- iss.util.load(filename)[source]¶
load a pickle from disk This function uses pickle.loads to load large objects
- Parameters:
filename (string) – the path of the pickle to load
- iss.util.phred_to_prob(q)[source]¶
Convert a phred score (Sanger or modern Illumina) in probabilty
Given a phred score q, return the probabilty p of the call being right
- Parameters:
q (int) – phred score
- Returns:
probabilty of basecall being right
- Return type:
float
- iss.util.prob_to_phred(p)[source]¶
Convert a probabilty into a phred score (Sanger or modern Illumina)
Given a probabilty p of the basecall being right, return the phred score q
- Parameters:
p (int) – probabilty of basecall being right
- Returns:
phred score
- Return type:
int
- iss.util.reservoir(records, record_list, n=None)[source]¶
yield a number of records from a fasta file using reservoir sampling
- Parameters:
records (obj) – fasta records from SeqIO.parse
- Yields:
record (obj) – a fasta record