What it does
SOAPdenovo2 is a short-read assembly method for building de novo draft assemblies of human-sized genomes. It is specially designed to assemble short reads from the Illumina Genome Analyzer (GA).
This genome assembler is designed for assembling large plant and animal genomes, although it also works well on bacterial and fungal genomes. The code runs on 64-bit and 32-bit Linux and Mac OS X systems with a minimum of 5 GB of physical memory. Approximately 150 GB of memory is required to process large genomes such as the human genome.
This Galaxy tool wraps SOAPdenovo version 2.04, which can process k-mers of up to 127 in size in order to make use of long reads. SOAPdenovo2 differs from version 1 since:
Configuration of SOAPdenovo2
For large genome projects involving deep sequencing, data is usually organized as a series of read sequence files generated from multiple libraries. Since the pregraph step is the start of the SOAPdenovo2 pipeline, it requires a configuration file that tells the SOAPdenovo2 assembler where to find these files and provides other information required for the de novo assembly process. This configuration file is generated automatically from the parameters set on this Galaxy tool page, so the information below is provided for reference or in case you would like to write the configuration file by hand.
The actual configuration file begins with a section for global information. Currently, only the maximum read length parameter, max_rd_len, is required in the global information section. Any reads longer than max_rd_len will be cut to this length.
Information about the sequencing data is then organized in the corresponding library sections. Each library section starts with a [LIB] tag and includes the following parameters:
avg_ins: Indicates the average insert size of this library or the peak value position in the insert size distribution.
reverse_seq: Tells the assembler whether the read sequences need to be reverse-complemented. The Illumina GA produces two types of paired-end libraries: a) forward-reverse, generated from fragmented DNA ends with a typical insert size of less than 500 bp; b) forward-forward, generated from circularizing libraries with a typical insert size greater than 2 Kb. Set reverse_seq to 0 for forward-reverse or 1 for forward-forward.
asm_flags: Indicates in which part(s) of the assembly the reads are used: 1 for contig assembly only, 2 for scaffold assembly only, 3 for both contig and scaffold assembly, and 4 for gap closure only.
rd_len_cutoff: The assembler will cut reads from the current library to this length.
rank: An integer value that decides the order in which reads are used for scaffold assembly. Libraries with the same rank are used at the same time during scaffold assembly.
pair_num_cutoff: The minimum number of read pairs required for a reliable connection between two contigs or pre-scaffolds. The minimum number for paired-end reads and mate-pair reads is 3 and 5, respectively.
map_len: Used in the "map" step. This is the minimum alignment length between a read and a contig required for a reliable read location. The minimum length for paired-end reads and mate-pair reads is 32 and 35, respectively.
The assembler accepts read files in FASTA, FASTQ and BAM formats.
Mate-pair relationships can be indicated in two ways: as two sequence files in which reads in the same order belong to a pair, or as two adjacent reads in a single file (FASTA only) that belong to a pair. If a read in a BAM file fails platform or vendor quality checks, that is, the 0x0200 bit of the flag field is set, both the read and its paired read will be ignored.
Single-end files are indicated by "f=/path/filename" or "q=/path/filename" for FASTA or FASTQ format, respectively. Paired reads in two FASTA sequence files are indicated by "f1=" and "f2=", whilst paired reads in two FASTQ sequence files are indicated by "q1=" and "q2=". Paired reads in a single FASTA sequence file are indicated by a "p=" item. Reads in BAM sequence files are indicated by "b=".
All of the above items in each library section are optional since the assembler assigns default values for most of them.
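As an illustration of the layout described above, the following Python sketch writes the global section followed by one [LIB] section per library. The function name, the dictionary layout and the file paths are hypothetical and exist only for this example. The parameter names and file keys are the ones described above, and the values are placeholders rather than recommendations.

```python
def build_config(max_rd_len, libraries):
    """Return configuration text in the layout described above.

    `libraries` is a list of dicts (a layout chosen for this sketch).
    Each dict maps a parameter name or file key to its value.
    """
    lines = ["max_rd_len={}".format(max_rd_len)]  # global section
    for lib in libraries:
        lines.append("[LIB]")  # every library section starts with this tag
        for key, value in lib.items():
            lines.append("{}={}".format(key, value))
    return "\n".join(lines) + "\n"


# Placeholder values: one forward-reverse library stored in two FASTQ files.
example = build_config(
    100,
    [
        {
            "avg_ins": 200,
            "reverse_seq": 0,      # 0 = forward-reverse
            "asm_flags": 3,        # contig and scaffold assembly
            "rd_len_cutoff": 100,
            "rank": 1,
            "pair_num_cutoff": 3,
            "map_len": 32,
            "q1": "/path/to/reads_1.fq",  # hypothetical path
            "q2": "/path/to/reads_2.fq",  # hypothetical path
        }
    ],
)
print(example)
```

Each additional library would be passed as another dict in the list, which produces another [LIB] section in the output.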
Outputs
Two files are generated by SOAPdenovo2:
FAQ
How do I set the K-mer size?
The program accepts odd numbers from 13 up to the maximum K-mer size of 127 supported by this version. Larger K-mers are more likely to be unique in the genome and make the graph simpler, but they require greater sequencing depth and longer reads to guarantee overlap at every genomic location.
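The constraints on K can be checked before starting a run. The sketch below is a hypothetical helper, not part of SOAPdenovo2. The upper limit is passed in as max_k because it depends on the version in use, and the read-length check is an assumption of this sketch: a read no longer than K yields at most one K-mer and so cannot provide any overlap.

```python
def check_kmer(k, read_len, max_k, min_k=13):
    """Return a list of problems with a proposed K-mer size (empty if none)."""
    problems = []
    if k % 2 == 0:
        problems.append("K must be odd")
    if k < min_k or k > max_k:
        problems.append("K must be between {} and {}".format(min_k, max_k))
    if k >= read_len:
        # Assumption of this sketch: reads must be longer than K.
        problems.append("K must be shorter than the read length")
    return problems


print(check_kmer(31, read_len=100, max_k=127))  # []
print(check_kmer(32, read_len=100, max_k=127))  # ['K must be odd']
```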
How do I set the library rank?
SOAPdenovo2 uses the paired-end libraries in order of insert size, from smallest to largest, to construct scaffolds. Libraries with the same rank are used at the same time. For example, in a human genome data set, we set five ranks for five libraries with insert sizes of 200 bp, 500 bp, 2 Kb, 5 Kb and 10 Kb, respectively. Ideally, the pairs in each rank should provide adequate physical coverage of the genome.
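One simple way to derive ranks that follow this ordering is to sort the distinct insert sizes and number them from 1. The helper below is a hypothetical sketch of that idea. Giving libraries with identical insert sizes the same rank is a choice made for this example, not a requirement stated above.

```python
def assign_ranks(insert_sizes):
    """Map each insert size (in bp) to a rank, smallest insert size first.

    Libraries with identical insert sizes share a rank, so they would be
    used at the same time during scaffold assembly.
    """
    distinct = sorted(set(insert_sizes))
    rank_of = {size: i + 1 for i, size in enumerate(distinct)}
    return [rank_of[size] for size in insert_sizes]


# The five libraries from the example above: 200 bp, 500 bp, 2 Kb, 5 Kb, 10 Kb.
print(assign_ranks([200, 500, 2000, 5000, 10000]))  # [1, 2, 3, 4, 5]
```

The returned list is in the same order as the input, so each value can be written as the rank of the corresponding library section.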
More information
For test data and more detailed information, click here.