
SOAPdenovo2 (Galaxy tool version 0.1)
The default settings are suitable for most assembly needs. If you want full control, use the Full parameter list.

What it does

SOAPdenovo2 is a short-read assembly method for building de novo draft assemblies of human-sized genomes. It is specially designed to assemble short reads from the Illumina Genome Analyzer (GA).

This genome assembler is designed for assembling large plant and animal genomes, although it also works well on bacterial and fungal genomes. The code runs on 64-bit and 32-bit Linux and Mac OS X systems with a minimum of 5 GB of physical memory. Approximately 150 GB of memory is required to process large genomes such as the human genome.

This Galaxy tool wraps SOAPdenovo version 2.04, which can process kmers of up to 127 in size to handle long reads. SOAPdenovo2 differs from version 1 in the following ways:

  1. The 63mer and 127mer executables have been merged.
  2. A sparse-pregraph module has been created to reduce computational consumption.
  3. The Multi-kmer method has been introduced in the "contig" step to allow the utilization of small and large kmers.
  4. The scaffolding algorithm can now obtain longer and more accurate scaffolds.
  5. Asynchronous Input/Output is available to improve the performance of reading files.
  6. Information for visualization purposes has been made available after scaffolding.

Configuration of SOAPdenovo2

For large genome projects involving deep sequencing, data is usually organized as a series of read sequence files generated from multiple libraries. Since the pregraph step is the start of the SOAPdenovo2 pipeline, it requires a configuration file that tells the SOAPdenovo2 assembler where to find these files and provides other information required for the de novo assembly process. This configuration file is generated automatically from the parameters set on this Galaxy tool page, so the information below is provided for reference or in case you would like to write the configuration file by hand.

The actual configuration file begins with a section for global information. Currently, only the maximum read length parameter, max_rd_len, is required in the global information section. Any reads longer than max_rd_len will be cut to this length.

Information about the sequencing data is then organized in the corresponding library sections. Each library section starts with a [LIB] tag and includes the following parameters:

avg_ins           Indicates the average insert size of this library or the
                  peak value position in the insert size distribution.

reverse_seq       A value which tells the assembler if the read sequences
                  need to be reverse-complemented. Illumina GA produces
                  two types of paired-end libraries: a) forward-reverse,
                  generated from fragmented DNA ends with a typical insert
                  size less than 500 bp; b) forward-forward, generated from
                  circularizing libraries with typical insert size greater
                  than 2 Kb. The reverse_seq parameter should be set to
                  indicate this: 0, forward-reverse; 1, forward-forward.

asm_flags         Indicates in which part(s) of the assembly the reads are
                  used: 1 for only contig assembly, 2 for only scaffold
                  assembly, 3 for both contig and scaffold assembly, and 4
                  for only gap closure.

rd_len_cutoff     The assembler will cut reads from the current library to
                  this length.

rank              Sets an integer value to decide the order for reads to be
                  used for scaffold assembly. Libraries with the same rank
                  are used at the same time during scaffold assembly.

pair_num_cutoff   The number of read pairs required for a reliable connection
                  between two contigs or pre-scaffolds. The minimum value for
                  paired-end reads and mate-pair reads is 3 and 5,
                  respectively.

map_len           This parameter is used in the "map" step and is the minimum
                  alignment length between a read and a contig required for a
                  reliable read location. The minimum length for paired-end
                  reads and mate-pair reads is 32 and 35, respectively.

The assembler accepts read files in FASTA, FASTQ and BAM formats.

Mate-pair relationships can be indicated in two ways: two sequence files in which reads in the same order belong to a pair, or a single file (FASTA only) in which two adjacent reads belong to a pair. If a read in a BAM file fails platform and vendor quality checks, i.e. the flag field 0x0200 is set, both it and its paired read will be ignored.
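The BAM rule above amounts to a bit test on the flag field of each read. The following is a minimal sketch of that test, not the assembler's own code; the function names are hypothetical and chosen only for illustration.

```python
# Bit in the BAM flag field that marks a read as failing
# platform and vendor quality checks (value taken from the text above).
QC_FAIL = 0x0200


def fails_quality_checks(flag):
    """Return True if the quality-check bit is set in a BAM flag value."""
    return bool(flag & QC_FAIL)


def keep_pair(flag_read_1, flag_read_2):
    """A pair is kept only if neither read fails the quality checks."""
    return not (fails_quality_checks(flag_read_1)
                or fails_quality_checks(flag_read_2))
```

With this sketch, a pair is dropped as soon as either of its two flag values has the 0x0200 bit set, whatever the other bits are.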

Single-end files are indicated by "f=/path/filename" or "q=/path/filename" for FASTA or FASTQ format, respectively. Paired reads in two FASTA sequence files are indicated by "f1=" and "f2=", whilst paired reads in two FASTQ sequence files are indicated by "q1=" and "q2=". Paired reads in a single FASTA sequence file are indicated by a "p=" item. Reads in BAM sequence files are indicated by "b=".

All of the above items in each library section are optional since the assembler assigns default values for most of them.
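To make the layout described above concrete, here is a small sketch that writes a configuration file with one library section. It assumes that parameters use the same key=value syntax the text shows for file entries; the helper names, the parameter values and the file paths are hypothetical placeholders, not values recommended by SOAPdenovo2 or produced by this Galaxy tool.

```python
def render_library(params, files):
    """Render one [LIB] section from a parameter dict and (key, path) pairs."""
    lines = ["[LIB]"]
    lines += [f"{key}={value}" for key, value in params.items()]
    lines += [f"{key}={path}" for key, path in files]
    return lines


def render_config(max_rd_len, libraries):
    """Render the global section followed by every library section."""
    lines = [f"max_rd_len={max_rd_len}"]
    for params, files in libraries:
        lines += render_library(params, files)
    return "\n".join(lines) + "\n"


# Hypothetical short-insert paired-end library; the paths are placeholders.
short_insert = (
    {
        "avg_ins": 200,
        "reverse_seq": 0,       # forward-reverse
        "asm_flags": 3,         # contig and scaffold assembly
        "rd_len_cutoff": 100,
        "rank": 1,
        "pair_num_cutoff": 3,
        "map_len": 32,
    },
    [("q1", "/path/to/lib_200bp_read_1.fq"),
     ("q2", "/path/to/lib_200bp_read_2.fq")],
)

config_text = render_config(100, [short_insert])
print(config_text)
```

Further libraries can be added by appending more (parameters, files) tuples to the list passed to render_config, each producing its own [LIB] section.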


Outputs

Two files are generated by SOAPdenovo2:

  1. The contig file contains sequences without mate pair information.
  2. The scafSeq file contains scaffold portions of the genome which have been reconstructed from contigs and gaps.

FAQ

How do I set the K-mer size?

The program accepts odd numbers from 13 upwards; as noted above, the SOAPdenovo version 2.04 wrapped by this tool can process K-mers of up to 127 in size. Larger K-mers will have a higher rate of uniqueness in the genome and make the graph simpler, but they require deeper sequencing and longer reads to guarantee an overlap at every genomic location.
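The constraints on the K-mer size can be checked before a run is started. The sketch below is an illustration only: the function name is hypothetical, the upper bound is supplied by the caller rather than hard-coded, and the requirement that K be shorter than the reads is a simplifying assumption standing in for the read-length condition described above.

```python
def check_kmer_size(k, read_length, max_k):
    """Return a list of problems with a proposed K-mer size (empty if none)."""
    problems = []
    if k % 2 == 0:
        problems.append("K must be odd")
    if k < 13:
        problems.append("K must be at least 13")
    if k > max_k:
        problems.append(f"K must not exceed {max_k}")
    # Assumption: a read no longer than K cannot contribute overlapping K-mers.
    if k >= read_length:
        problems.append("K must be shorter than the reads")
    return problems


print(check_kmer_size(23, read_length=100, max_k=127))
print(check_kmer_size(24, read_length=100, max_k=127))
```

An empty list means the proposed value passed every check; otherwise each entry names one violated constraint.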

How do I set the library rank?

SOAPdenovo2 uses the paired-end libraries in order of insert size, from smallest to largest, to construct scaffolds. Libraries with the same rank will be used at the same time. For example, in a human genome data set, we set five ranks for five libraries with insert sizes of 200 bp, 500 bp, 2 Kb, 5 Kb and 10 Kb, respectively. The pairs in each rank should provide adequate physical coverage of the genome.
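One way to derive rank values that follow this ordering is to sort the distinct insert sizes and number them from 1. This is a sketch of that idea under the assumption that libraries sharing an insert size should share a rank; the function name is hypothetical.

```python
def assign_ranks(insert_sizes):
    """Give each library a rank, smallest insert size first.

    Libraries with equal insert sizes receive the same rank.
    """
    ordered = sorted(set(insert_sizes))
    rank_of = {size: rank for rank, size in enumerate(ordered, start=1)}
    return [rank_of[size] for size in insert_sizes]


# Insert sizes in bp for the five-library example in the text above.
ranks = assign_ranks([200, 500, 2000, 5000, 10000])
print(ranks)
```

The returned list is in the same order as the input, so each value can be written as the rank of the corresponding library section.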


More information

For test data and more detailed information, click here.


This tool was installed from a ToolShed; you may be able to find additional information by following this link: http://gigatoolshed.net/view/peterli/soapdenovo2