Generating reads¶

InSilicoSeq comes with a set of pre-computed error models to allow the user to easily generate reads from:

HiSeq 2500
MiSeq

Per example generate 1 million MiSeq 150bp reads from a set of input genomes:

iss generate --genomes genomes.fasta --model_file MiSeq \
--output MiSeq_reads

This will create 2 files, MiSeq_reads_R1.fastq and MiSeq_reads_R2.fastq in your current directory

If you have created your custom model, change --model_file MiSeq to your custom model file:

iss generate --genomes genomes.fasta --model_file my_model.npz \
--output my_model_reads

Required input files¶

By default, InSilicoSeq only requires 1 file in order to start generating reads: 1 (multi-)fasta files containing your input genome(s).

If you don’t want to use a multi-fasta file or don’t have one at hand but are equipped with an Internet connection, you can download random genomes from the ncbi:

iss generate --ncbi bacteria --n_genomes 10 --model_file MiSeq \
--output MiSeq_ncbi

In addition the the 2 fastq files, the downloaded genomes will be in the file MiSeq_ncbi_genomes.fasta in your current directory.

Note: If possible, I recommend using InSilicoSeq with a fasta file as input. The eutils utilities from the ncbi can be slow and quirky.

Abundance distribution¶

The abundance of the input genomes is determined (by default) by a log-normal distribution.

Alternatively, you can use other distributions with the --abundance parameter: uniform, halfnormal, exponential or zero-inflated-lognormal

If you wish to fine-tune the distribution of your genomes, InSilicoSeq also accepts an abundance file:

iss generate --genomes genomes.fasta --abundance_file abundance.txt \
--model_file HiSeq2500 --output HiSeq_reads

Example abundance file for a multi-fasta containing 2 genomes: genome_A and genome_B.

genome_A    0.2
genome_B    0.8

For the abundance to make sense, the total abundance in your abundance file must equal 1.

Histograms of the different distribution (drawn with 100 samples)

Full list of options¶

–genomes¶

Input genome(s) from where the reads will originate

–ncbi¶

Download input genomes from RefSeq instead of using –genomes. Requires –n_genomes option. Can be bacteria, viruses or archaea.

–n_genomes¶

How many genomes will be downloaded from the ncbi. Required if –ncbi is set.

–abundance¶

Abundance distribution (default: lognormal). Can be uniform, halfnormal, exponential, lognormal or zero-inflated-lognormal.

–abundance_file¶

Abundance file for coverage calculations (default: None).

–n_reads¶

Number of reads to generate (default: 1000000)

–model¶

Error model. If not specified, using kernel density estimation (default: kde). Can be ‘kde’, ‘cdf’ or ‘basic’

–model_file¶

Error model file. If not specified, using a basic error model instead (default: None). Use ‘HiSeq2500’ or ‘MiSeq’ for a pre-computed error model provided with the software.

–cpus¶

Number of cpus to use. (default: 2).

–quiet¶

Disable info logging

–debug¶

Enable debug logging

–output¶

Output file prefix (Required)