iss package¶

Subpackages¶

iss.error_models package

Submodules¶

iss.abundance module¶

iss.abundance.draft(genomes, draft, distribution, output)[source]¶

Computes the abundance dictionary for a mix of complete and draft genomes

Parameters:

genomes (list) – list of all input records
draft (list) – draft genome files
distribution (function) – distribution function to use
output (str) – output file
Returns:	the abundance dictionary
Return type:	dict

iss.abundance.exponential(record_list)[source]¶

Generate scaled exponential abundance distribution from a number of records

Parameters:	record_list (list) – a list of record.id
Returns:	a dictionary with records as keys, abundance as values
Return type:	dict

iss.abundance.halfnormal(record_list)[source]¶

Generate scaled halfnormal abundance distribution from a number of records

Parameters:	record_list (list) – a list of record.id
Returns:	a dictionary with records as keys, abundance as values
Return type:	dict

iss.abundance.lognormal(record_list)[source]¶

Generate scaled lognormal abundance distribution from a number of records

Parameters:	record_list (list) – a list of record.id
Returns:	a dictionary with records as keys, abundance as values
Return type:	dict
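As a rough illustration of what "scaled" means for these distributions, the sketch below draws one lognormal value per record and divides by the total so that the abundances sum to 1. It uses only the standard library. The function name, the mu and sigma values, and the seed argument are assumptions made for this example; they are not taken from iss.abundance.lognormal.

```python
import random


def lognormal_sketch(record_list, mu=1.0, sigma=2.0, seed=None):
    """Hypothetical stand-in for a scaled lognormal abundance distribution."""
    rng = random.Random(seed)
    # One positive draw per record id (mu and sigma are assumed values).
    draws = [rng.lognormvariate(mu, sigma) for _ in record_list]
    total = sum(draws)
    # Scale the draws so that the abundances sum to 1.
    return {record: draw / total for record, draw in zip(record_list, draws)}


abundances = lognormal_sketch(["genome_A", "genome_B", "genome_C"], seed=42)
```

The other scaled distributions could follow the same pattern with a different draw function.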

iss.abundance.parse_abundance_file(abundance_file)[source]¶

Parse an abundance or coverage file

The abundance/coverage file is a flat file of the format “genome_id<TAB>abundance” or “genome_id<TAB>coverage”

Parameters:	abundance_file (string) – the path to the abundance file
Returns:	genome_id as keys, abundance as values
Return type:	dict
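A minimal sketch of parsing such a file, assuming one "genome_id<TAB>value" pair per line and no header. The function name and the handling of blank lines are assumptions for illustration, not the behaviour of iss.abundance.parse_abundance_file.

```python
import os
import tempfile


def parse_abundance_sketch(abundance_file):
    """Hypothetical parser for a 'genome_id<TAB>abundance' flat file."""
    abundance_dic = {}
    with open(abundance_file) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines (an assumption of this sketch)
            genome_id, value = line.split("\t")
            abundance_dic[genome_id] = float(value)
    return abundance_dic


# Write a small example file, parse it, then remove it.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("genome_A\t0.7\ngenome_B\t0.3\n")
parsed = parse_abundance_sketch(tmp.name)
os.remove(tmp.name)
```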

iss.abundance.to_coverage(total_n_reads, species_abundance, read_length, genome_size)[source]¶

Calculate the coverage of a genome in a metagenome given its size and abundance

Parameters:

total_n_reads (int) – total number of reads in the dataset
species_abundance (float) – abundance of the species, between 0 and 1
read_length (int) – length of the reads in the dataset
genome_size (int) – size of the genome
Returns:	coverage of the genome
Return type:	float
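The relation between these four quantities can be sketched as follows, assuming that the number of reads assigned to a genome is total_n_reads * species_abundance and that coverage is the number of sequenced bases divided by the genome size. The function name is hypothetical and the formula is the usual definition of coverage, not a copy of the library code.

```python
def to_coverage_sketch(total_n_reads, species_abundance, read_length, genome_size):
    """Hypothetical coverage calculation for one genome in a metagenome."""
    # Reads attributed to this genome (assumed proportional to abundance).
    n_reads = total_n_reads * species_abundance
    # Coverage: sequenced bases divided by genome size.
    return (n_reads * read_length) / genome_size


# 1 million reads of 150 bp, 10% of them from a 5 Mbp genome.
coverage = to_coverage_sketch(1_000_000, 0.1, 150, 5_000_000)
```

With these example numbers the genome receives about 100,000 reads, or roughly 3x coverage.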

iss.abundance.to_file(abundance_dic, output)[source]¶

Write the abundance dictionary to a file

Parameters:

abundance_dic (dict) – the abundance dictionary
output (str) – the output file name

iss.abundance.uniform(record_list)[source]¶

Generate uniform abundance distribution from a number of records

Parameters:	record_list (list) – a list of record.id
Returns:	a dictionary with records as keys, abundance as values
Return type:	dict
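A uniform distribution gives every record the same share. A minimal sketch, assuming the abundances are expressed as fractions that sum to 1 (the function name is hypothetical):

```python
def uniform_sketch(record_list):
    """Hypothetical uniform abundance: every record gets 1 / n."""
    abundance = 1 / len(record_list)
    return {record: abundance for record in record_list}


even = uniform_sketch(["genome_A", "genome_B", "genome_C", "genome_D"])
```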

iss.abundance.zero_inflated_lognormal(record_list)[source]¶

Generate scaled zero-inflated lognormal abundance distribution from a number of records

Parameters:	record_list (list) – a list of record.id
Returns:	a dictionary with records as keys, abundance as values
Return type:	dict

iss.app module¶

iss.app.generate_reads(args)[source]¶

Main function for the iss generate submodule

This submodule generates reads from an ErrorModel and writes them to args.output + _R(1|2).fastq

Parameters:	args (object) – the command-line arguments from argparse

iss.app.main()[source]¶

iss.app.model_from_bam(args)[source]¶

Main function for the iss model submodule

This submodule writes all variables necessary for building an ErrorModel to args.output + .npz

Parameters:	args (object) – the command-line arguments from argparse

iss.bam module¶

iss.bam.random() → x in the interval [0, 1).¶

iss.bam.read_bam(bam_file, n_reads=1000000)[source]¶

BAM file reader. Selects random mapped reads from a BAM file

Parameters:	bam_file (string) – path to a bam file
Yields:	read – a pysam read object

iss.bam.to_model(bam_path, output)[source]¶

From a BAM file, write all variables needed for modelling reads to a .npz model file

For a brief description of the variables that will be written to the output file, see the bam.write_to_file function

Parameters:

bam_path (string) – path to a BAM file
output (string) – prefix of the output file

iss.bam.write_to_file(model, read_length, mean_f, mean_r, hist_f, hist_r, sub_f, sub_r, ins_f, ins_r, del_f, del_r, i_size, output)[source]¶

Write variables to a .npz file

Parameters:

model (string) – the type of error model
read_length (int) – read length of the dataset
mean_f (list) – list of mean bin sizes
mean_r (list) – list of mean bin sizes
hist_f (list) – list of cumulative distribution functions for the forward read quality
hist_r (list) – list of cumulative distribution functions for the reverse read quality
sub_f (list) – list of dictionaries representing the substitution probabilities for the forward reads
sub_r (list) – list of dictionaries representing the substitution probabilities for the reverse reads
ins_f (list) – list of dictionaries representing the insertion probabilities for the forward reads
ins_r (list) – list of dictionaries representing the insertion probabilities for the reverse reads
del_f (list) – list of dictionaries representing the deletion probabilities for the forward reads
del_r (list) – list of dictionaries representing the deletion probabilities for the reverse reads
i_size (int) – distribution of insert size for the aligned reads
output (string) – prefix of the output file

iss.generator module¶

iss.generator.reads(record, ErrorModel, n_pairs, cpu_number, output, seed, gc_bias=False)[source]¶

Simulate reads from one genome (or sequence) according to an ErrorModel

This function makes use of the simulate_read function to simulate reads and save them in a fastq file

Parameters:

record (SeqRecord) – sequence or genome of reference
ErrorModel (ErrorModel) – an ErrorModel
n_pairs (int) – the number of reads to generate
cpu_number (int) – an int identifying the cpu that is used by the function. It is used for naming the output file
output (str) – the output file prefix
seed (int) – random seed to use
gc_bias (bool) – if set, the function may skip a read due to abnormal GC content
Returns:	the name of the output file
Return type:	str

iss.generator.simulate_read(record, ErrorModel, i, cpu_number)[source]¶

Form a read pair from one genome (or sequence) according to an ErrorModel

Each read is a SeqRecord object. Returns a tuple containing the forward and reverse reads.

Parameters:

record (SeqRecord) – sequence or genome of reference
ErrorModel (ErrorModel) – an ErrorModel class
i (int) – a number identifying the read
cpu_number (int) – cpu number. It is added to the read id

Returns:	tuple containing a forward read and a reverse read
Return type:	tuple

iss.generator.to_fastq(generator, output)[source]¶

Write reads to a fastq file

Take a generator or a list containing read pairs (tuples) and write them to two fastq files: output_R1.fastq and output_R2.fastq

Parameters:

generator (generator) – a read generator (or list)
output (string) – the output files prefix

iss.modeller module¶

iss.modeller.dispatch_indels(read)[source]¶

Return the x and y position of an insertion or deletion to be inserted in the indel matrix.

The indel matrix is a 2D array of size 301 * 9. The x axis (301) corresponds to the position in the read, while the y axis (9) represents the match or indel (see the dispatch dict in the function). Position 0 is a match or substitution; the other positions are insertions (‘N1’) or deletions (‘N2’).

The size of x axis is 301 because we haven’t calculated the read length yet

Parameters:	read (read) – an aligned read object
Yields:	tuple – a tuple with the x, y position for dispatching the indel in the indel matrix

iss.modeller.dispatch_subst(base, read, read_has_indels)[source]¶

Return the x and y position of a substitution to be inserted in the substitution matrix.

The substitution matrix is a 2D array of size 301 * 16. The x axis (301) corresponds to the position in the read, while the y axis (16) represents the match or substitution (see the dispatch dict in the function). Positions 0, 4, 8 and 12 are matches; the other positions are substitutions.

The size of x axis is 301 because we haven’t calculated the read length yet

Parameters:

base (tuple) – one base from an alignment object. According to the pysam documentation, an alignment is a list of tuples: aligned read (query) and reference positions. The parameter with_seq adds the reference sequence as the third element of the tuples. Substitutions are lower-case.
read (read) – the read object from which the alignment comes
read_has_indels (bool) – a boolean flag tracking whether the read has an indel
Returns:	x and y position for incrementing the substitution matrix and a third element: True if an indel has been detected, False otherwise
Return type:	tuple

iss.modeller.divide_qualities_into_bins(qualities, n_bins=4)[source]¶

Divides the raw quality scores into bins according to the mean phred quality of the sequence they come from

Parameters:

qualities (list) – raw count of all the phred scores and mean sequence quality
n_bins (int) – number of bins to create (default: 4)
Returns:	a list of lists containing the binned quality scores
Return type:	list

iss.modeller.indel_matrix_to_choices(indel_matrix, read_length)[source]¶

Transform an indel matrix into probabilities of indels at every position

From the raw indel count at one position, returns a dictionary with probabilities of indel

Parameters:

indel_matrix (np.array) – the indel matrix is a 2D array of size read_length * 9. The x axis (read_length) corresponds to the position in the read, while the y axis (9) represents the match or indel. Position 0 is a match or substitution; the other positions are insertions (‘N1’) or deletions (‘N2’)
read_length (int) – read length
Returns:	tuple containing two lists of dictionaries representing the insertion or deletion probabilities for a collection of reads
Return type:	tuple

iss.modeller.insert_size(insert_size_distribution)[source]¶

Calculate the cumulative distribution function from the raw insert size distribution. Uses 1D kernel density estimation.

Parameters:	insert_size_distribution (list) – list of insert sizes from aligned pairs
Returns:	a cumulative distribution function
Return type:	1darray

iss.modeller.quality_bins_to_histogram(bin_lists)[source]¶

Wrapper function to generate cdfs for each quality bin

Generate cumulative distribution functions for a number of mean quality bins

Parameters:	bin_lists (list) – list of lists containing raw count of all phred scores
Returns:	a list of lists containing cumulative distribution functions
Return type:	list

iss.modeller.raw_qualities_to_histogram(qualities)[source]¶

Approximate the distribution of base quality at each position in a read using a pseudo 2d kernel density estimation

Generate cumulative distribution functions

Parameters:	qualities (list) – raw count of all phred scores
Returns:	list of cumulative distribution functions. One cdf per base. The list has the size of the read length
Return type:	list

iss.modeller.subst_matrix_to_choices(substitution_matrix, read_length)[source]¶

Transform a substitution matrix into probabilities of substitutions for each base and at every position

From the raw mismatches at one position, returns a dictionary with probabilities of substitutions

Parameters:

substitution_matrix (np.array) – the substitution matrix is a 2D array of size read_length * 16. The x axis (read_length) corresponds to the position in the read, while the y axis (16) represents the match or substitution. Positions 0, 4, 8 and 12 are matches; the other positions are substitutions
read_length (int) – read length

Returns:

list of dictionaries representing the substitution probabilities for a collection of reads

Return type:

list
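To make the count-to-probability step concrete, here is a minimal sketch for a single read position. It assumes that the 16 counts form four blocks of four, one block per reference base, with the match count first in each block (consistent with positions 0, 4, 8 and 12 being matches). The base ordering "ATCG", the function name and the shape of the returned dictionary are assumptions for illustration only; the real layout is defined by the dispatch dict in iss.modeller.dispatch_subst.

```python
BASES = "ATCG"  # assumed ordering, for illustration only


def row_to_choices(counts):
    """Convert 16 raw counts for one read position into probabilities.

    Block i (counts[4*i:4*i+4]) is assumed to belong to reference base
    BASES[i]. The first entry of each block is the match count, the other
    three are substitutions to the remaining bases.
    """
    choices = {}
    for i, ref in enumerate(BASES):
        block = counts[4 * i:4 * i + 4]
        alternatives = [b for b in BASES if b != ref]
        total = sum(block)
        if total == 0:
            # No observation for this base: assume it is never substituted.
            choices[ref] = {"match": 1.0,
                            "subst": {alt: 0.0 for alt in alternatives}}
            continue
        choices[ref] = {
            "match": block[0] / total,
            "subst": {alt: n / total for alt, n in zip(alternatives, block[1:])},
        }
    return choices


row = [97, 1, 1, 1, 100, 0, 0, 0, 0, 0, 0, 0, 90, 5, 5, 0]
choices = row_to_choices(row)
```

Applying such a function to every row of the matrix would give one dictionary per read position.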

iss.util module¶

iss.util.cleanup(file_list)[source]¶

Remove temporary files

Parameters:	file_list (list) – a list of files to be removed

iss.util.concatenate(file_list, output)[source]¶

Concatenate files together

Parameters:

file_list (list) – the list of input files (can be a generator)
output (string) – the output file name
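A minimal sketch of file concatenation with the standard library, assuming the inputs are simply appended to the output in order. The function name is hypothetical, and the example files are created in a temporary directory.

```python
import os
import shutil
import tempfile


def concatenate_sketch(file_list, output):
    """Append the content of each input file to `output`, in order."""
    with open(output, "wb") as out_handle:
        for path in file_list:
            with open(path, "rb") as in_handle:
                # Stream the file in chunks instead of loading it in memory.
                shutil.copyfileobj(in_handle, out_handle)


# Build two small input files, merge them, then clean up.
workdir = tempfile.mkdtemp()
parts = []
for i, text in enumerate(["@read1\n", "@read2\n"]):
    path = os.path.join(workdir, f"part_{i}.fastq")
    with open(path, "w") as handle:
        handle.write(text)
    parts.append(path)

merged = os.path.join(workdir, "merged.fastq")
concatenate_sketch(parts, merged)
with open(merged) as handle:
    merged_text = handle.read()
shutil.rmtree(workdir)
```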

iss.util.convert_n_reads(unit)[source]¶

For strings representing a number of bases and ending with k, K, m, M, g, or G, converts to a plain old number

Parameters:	unit (str) – a string representing a number ending with a suffix
Returns:	a number of reads
Return type:	float
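A sketch of the suffix conversion, assuming k, m and g stand for 10^3, 10^6 and 10^9 and that strings without a suffix are accepted as plain numbers. Both points, and the function name, are assumptions of this example.

```python
def convert_n_reads_sketch(unit):
    """Hypothetical conversion of strings such as '1k' or '2.5M' to a float."""
    suffixes = {"k": 1e3, "m": 1e6, "g": 1e9}  # assumed multipliers
    suffix = unit[-1].lower()
    if suffix in suffixes:
        return float(unit[:-1]) * suffixes[suffix]
    # No recognised suffix: treat the whole string as a number.
    return float(unit)


n_reads = convert_n_reads_sketch("2.5M")
```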

iss.util.count_records(fasta_file)[source]¶

Count the number of records in a fasta file and return a list of record ids

Parameters:	fasta_file (string) – the path to a fasta file
Returns:	a list of record ids
Return type:	list

iss.util.genome_file_exists(filename)[source]¶

Checks if the output file from the --ncbi option already exists

Parameters:	filename (str) – a file name

iss.util.nplog(type, flag)[source]¶

iss.util.phred_to_prob(q)[source]¶

Convert a phred score (Sanger or modern Illumina) into a probability

Given a phred score q, return the probability p of the call being right

Parameters:	q (int) – phred score
Returns:	probability of basecall being right
Return type:	float

iss.util.prob_to_phred(p)[source]¶

Convert a probability into a phred score (Sanger or modern Illumina)

Given a probability p of the basecall being right, return the phred score q

Parameters:	p (float) – probability of basecall being right
Returns:	phred score
Return type:	int
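The two conversions follow from the standard phred definition, q = -10 * log10(error probability), with the error probability equal to 1 - p. The sketch below is an illustration of that formula under hypothetical function names; the rounding to the nearest integer is an assumption.

```python
import math


def phred_to_prob_sketch(q):
    """Probability that a basecall with phred score q is right."""
    return 1 - 10 ** (-q / 10)


def prob_to_phred_sketch(p):
    """Phred score for a basecall that is right with probability p (p < 1)."""
    # Rounded to the nearest integer (an assumption of this sketch).
    return int(round(-10 * math.log10(1 - p)))


p_right = phred_to_prob_sketch(30)   # a Q30 call is right 99.9% of the time
q_score = prob_to_phred_sketch(0.999)
```

Note that p = 1 has no finite phred score, so this sketch raises an error for that input.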

iss.util.reservoir(records, record_list, n=None)[source]¶

Yield a number of records from a fasta file using reservoir sampling

Parameters:	records (obj) – fasta records from SeqIO.parse
Yields:	record (obj) – a fasta record
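Reservoir sampling picks n items uniformly at random from a stream of unknown length in a single pass. The sketch below shows the classic "Algorithm R" on a generic iterable. Its signature (records, n, seed) is hypothetical and differs from iss.util.reservoir, and plain strings stand in for fasta records.

```python
import random


def reservoir_sketch(records, n, seed=None):
    """Yield n items sampled uniformly from an iterable (Algorithm R)."""
    rng = random.Random(seed)
    sample = []
    for index, record in enumerate(records):
        if index < n:
            # Fill the reservoir with the first n items.
            sample.append(record)
        else:
            # Keep the new item with probability n / (index + 1).
            j = rng.randint(0, index)
            if j < n:
                sample[j] = record
    yield from sample


stream = (f"record_{i}" for i in range(100))
picked = list(reservoir_sketch(stream, n=5, seed=1))
```

Only the reservoir of n items is held in memory, which is why the technique suits large fasta files.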

iss.util.rev_comp(s)[source]¶

A simple reverse complement implementation working on strings

Parameters:	s (string) – a DNA sequence (IUPAC, can be ambiguous)
Returns:	reverse complement of the input sequence
Return type:	list
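A minimal string-based reverse complement that also handles the IUPAC ambiguity codes might look like the sketch below. The function name is hypothetical; this version upper-cases its input and returns a string, which may differ from the library function.

```python
# Complement of each IUPAC nucleotide code.
COMPLEMENT = {
    "A": "T", "T": "A", "C": "G", "G": "C",
    "R": "Y", "Y": "R", "S": "S", "W": "W",
    "K": "M", "M": "K", "B": "V", "V": "B",
    "D": "H", "H": "D", "N": "N",
}


def rev_comp_sketch(s):
    """Reverse complement of a DNA string (input is upper-cased first)."""
    return "".join(COMPLEMENT[base] for base in reversed(s.upper()))


result = rev_comp_sketch("ATGC")
```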

iss.util.split_list(l, n_parts=1)[source]¶

Split a list in a number of parts

Parameters:

l (list) – a list
n_parts (int) – the number of parts to split the list in
Returns:	a list of n_parts lists
Return type:	list
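One way to split a list into n_parts contiguous, near-equal parts is integer slicing, as sketched below. The function name is hypothetical, and how the remainder is distributed among the parts is an assumption of this example.

```python
def split_list_sketch(l, n_parts=1):
    """Split a list into n_parts contiguous sublists of near-equal size."""
    length = len(l)
    return [
        l[i * length // n_parts:(i + 1) * length // n_parts]
        for i in range(n_parts)
    ]


chunks = split_list_sketch([1, 2, 3, 4, 5], n_parts=2)
```

Such a split is a natural way to hand one sublist to each worker when work is spread over several CPUs.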
