.. _generate:

Generating reads
================

InSilicoSeq comes with a set of pre-computed error models to allow the user
to easily generate reads from:

- HiSeq 2500
- MiSeq

For example, to generate 1 million 150 bp MiSeq reads from a set of input genomes:

.. code-block:: bash

    iss generate --genomes genomes.fasta --model_file MiSeq \
    --output MiSeq_reads

This will create 2 files, `MiSeq_reads_R1.fastq` and `MiSeq_reads_R2.fastq`, in
your current directory.
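
To check how many reads ended up in each file, the lines can be counted and
divided by four. This is only a sketch: it assumes every record spans exactly
four lines (the usual unwrapped FASTQ layout), and the helper name
``count_reads`` is made up for this example.

.. code-block:: bash

    # Count records in a FASTQ file, assuming 4 lines per read.
    count_reads() {
        echo $(( $(wc -l < "$1") / 4 ))
    }

    for f in MiSeq_reads_R1.fastq MiSeq_reads_R2.fastq; do
        # Skip files that do not exist yet (e.g. before running iss generate).
        if [ -f "$f" ]; then
            echo "$f: $(count_reads "$f") reads"
        fi
    done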

If you have built your own error model, replace ``--model_file MiSeq`` with the
path to your model file:

.. code-block:: bash

    iss generate --genomes genomes.fasta --model_file my_model.npz \
    --output my_model_reads


Required input files
--------------------

By default, InSilicoSeq requires only one file to start generating reads:
a (multi-)fasta file containing your input genome(s).
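
If your genomes are stored as one fasta file per genome, they can be
concatenated into a single multi-fasta file first. A minimal sketch, assuming
hypothetical files ``genome_A.fasta`` and ``genome_B.fasta`` (the tiny
placeholder sequences are written here only so that the snippet runs on its
own):

.. code-block:: bash

    # Placeholder inputs so the snippet is self-contained; use your own files.
    printf '>genome_A\nACGTACGTAC\n' > genome_A.fasta
    printf '>genome_B\nTTGACCATGA\n' > genome_B.fasta

    # Concatenate the per-genome files into one multi-fasta file.
    cat genome_A.fasta genome_B.fasta > genomes.fasta

    # Count the records (header lines start with ">") as a sanity check.
    n_records=$(grep -c '^>' genomes.fasta)
    echo "genomes.fasta contains ${n_records} sequences"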

If you don't want to use a multi-fasta file, or don't have one at hand but
have an Internet connection, you can download random genomes from the
NCBI:

.. code-block:: bash

    iss generate --ncbi bacteria --n_genomes 10 --model_file MiSeq \
    --output MiSeq_ncbi

In addition to the 2 fastq files, the downloaded genomes will be in the file
`MiSeq_ncbi_genomes.fasta` in your current directory.

*Note: If possible, I recommend using InSilicoSeq with a fasta file as input.*
*The NCBI E-utilities can be slow and quirky.*


Abundance distribution
----------------------

The abundance of the input genomes is determined (by default) by a log-normal
distribution.

Alternatively, you can use other distributions with the ``--abundance``
parameter: `uniform`, `halfnormal`, `exponential` or `zero-inflated-lognormal`.
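
To compare distributions on the same genomes, one dataset can be generated per
distribution. The loop below is a sketch that covers a few of the values
listed above and uses only options described on this page; the output prefixes
are arbitrary. It prints the commands instead of running them.

.. code-block:: bash

    # Dry run: remove the leading "echo" to actually run the commands.
    for dist in uniform halfnormal exponential lognormal; do
        echo iss generate --genomes genomes.fasta --abundance "$dist" \
            --model_file MiSeq --output "MiSeq_${dist}"
    done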

If you wish to fine-tune the distribution of your genomes, InSilicoSeq also
accepts an abundance file:

.. code-block:: bash

    iss generate --genomes genomes.fasta --abundance_file abundance.txt \
    --model_file HiSeq2500 --output HiSeq_reads

Example abundance file for a multi-fasta containing 2 genomes: genome_A and
genome_B.

.. code-block:: text

    genome_A    0.2
    genome_B    0.8


For the abundance to make sense, the total abundance in your abundance file
must equal 1.
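
If you start from relative weights rather than fractions, ``awk`` can rescale
them and check the result before you run ``iss generate``. This is a sketch:
the file ``weights.txt`` and its values are made up for illustration, and the
tolerance used in the check is an arbitrary choice.

.. code-block:: bash

    # Hypothetical relative weights (for example, expected cell counts).
    printf 'genome_A\t2\ngenome_B\t8\n' > weights.txt

    # Pass 1 sums the weights, pass 2 divides each weight by that sum.
    awk '
        NR == FNR { total += $2; next }
                  { printf "%s\t%g\n", $1, $2 / total }
    ' weights.txt weights.txt > abundance.txt

    # Verify that the abundances sum to 1, allowing for rounding error.
    awk '
        { sum += $2 }
        END {
            diff = sum - 1
            if (diff < 0) diff = -diff
            if (diff > 1e-4) {
                print "abundances sum to " sum ", expected 1"
                exit 1
            }
            print "abundances sum to 1"
        }
    ' abundance.txt

With the weights above, ``abundance.txt`` contains the same two values as the
example file: 0.2 for genome_A and 0.8 for genome_B.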

.. figure:: distributions.png

    Histograms of the different distributions (drawn with 100 samples)


Full list of options
--------------------

--genomes
^^^^^^^^^

Input genome(s) from which the reads will originate.

--ncbi
^^^^^^

Download input genomes from RefSeq instead of using --genomes. Requires
--n_genomes option. Can be bacteria, viruses or archaea.

--n_genomes
^^^^^^^^^^^

Number of genomes to download from the NCBI.
Required if --ncbi is set.

--abundance
^^^^^^^^^^^

Abundance distribution (default: lognormal). Can be uniform, halfnormal,
exponential, lognormal or zero-inflated-lognormal.

--abundance_file
^^^^^^^^^^^^^^^^

Abundance file for coverage calculations (default: None).

--n_reads
^^^^^^^^^

Number of reads to generate (default: 1000000)

--model
^^^^^^^

Error model. If not specified, kernel density estimation is used
(default: kde). Can be 'kde', 'cdf' or 'basic'.

--model_file
^^^^^^^^^^^^

Error model file. If not specified, a basic error model is used instead
(default: None). Use 'HiSeq2500' or 'MiSeq' for a pre-computed error model
provided with the software.

--cpus
^^^^^^

Number of cpus to use (default: 2).

--quiet
^^^^^^^

Disable info logging

--debug
^^^^^^^

Enable debug logging

--output
^^^^^^^^

Output file prefix (Required)
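
As a closing sketch, several of the options listed above can be collected in
shell variables and combined into one call. All values below are hypothetical
and the variable names are made up for this example; only options described on
this page are used. The command is printed rather than executed.

.. code-block:: bash

    # Hypothetical settings; adjust to your data and machine.
    genomes=genomes.fasta
    model=HiSeq2500
    n_reads=500000
    cpus=4
    prefix="${model}_${n_reads}"

    # Dry run: prints the command. Remove "echo" to execute it.
    echo iss generate --genomes "$genomes" --model_file "$model" \
        --n_reads "$n_reads" --cpus "$cpus" --output "$prefix"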