Using SOAPdenovo2 to assemble the R. sphaeroides genome

A workflow from Luo et al. (2012), SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience 1:18.

Introduction

In addition to the assembly results for S. aureus, Table 2 of the GigaScience SOAPdenovo2 paper also reports results for short reads sequenced from R. sphaeroides:

Input data

To construct the workflow to replicate the R. sphaeroides results in Table 2, the following short read and genome sequence data are required from the list of Data Libraries:

Workflow

The pipeline used to assemble this bacterium's genome differs from the one used for S. aureus by the addition of SOAPfilter, which pre-processes the R. sphaeroides data set:

Tool execution

1. The R. sphaeroides dataset contains two sets of libraries, and each is pre-processed differently before the SOAPdenovo2 genome assembly step. First, the short-jump reads are cleaned using the SOAPfilter tool, whose parameters have to be set to those shown below:

SOAPfilter produces 3 files: 2 of them are cleaned versions of the 2 short-jump read files, and the third is a statistical summary of the filtering performed by SOAPfilter:

2. The R. sphaeroides frag reads have to be corrected using the KmerFreq_AR and Corrector_AR tools. The screenshot below shows the values of the inputs and parameters for KmerFreq_AR:

3. The kmerfreq.infiles, kmerfreq.cz and kmerfreq.cz.len outputs from KmerFreq_AR should be passed to the Corrector_AR tool as shown below:

Corrector_AR will produce corrected forward and reverse reads in compressed fast format based on the frag_1 and frag_2 library files:
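To illustrate the idea behind k-mer based read correction, here is a minimal Python sketch that counts k-mers across a set of reads and flags the rare ones as candidate sequencing errors. It is not the KmerFreq_AR or Corrector_AR implementation: the function names, the toy reads and the `min_count` threshold are assumptions made for illustration, and the sketch does not merge a k-mer with its reverse complement or handle quality values.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer of length k across an iterable of read sequences."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:  # skip k-mers containing ambiguous bases
                counts[kmer] += 1
    return counts


def low_frequency_kmers(counts, min_count):
    """Return the k-mers seen fewer than min_count times."""
    return {kmer for kmer, n in counts.items() if n < min_count}


# Toy reads (hypothetical): the third read differs from the others in its last base.
reads = ["ACGTACGT", "ACGTACGT", "ACGTACGA"]
counts = count_kmers(reads, k=4)
suspect = low_frequency_kmers(counts, min_count=2)
print(sorted(suspect))
```

With these toy reads, the only k-mer below the threshold is the one that spans the mismatching final base of the third read, which is the kind of signal a k-mer based corrector can use to locate likely errors.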

4. After pre-processing with SOAPfilter, KmerFreq_AR and Corrector_AR, SOAPdenovo2 is applied to the filtered and corrected reads to create a draft assembly. SOAPdenovo2 should be configured as follows:

Running SOAPdenovo2 with these parameters produces the following scaffold and contig files, together with a configuration file:
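For readers unfamiliar with the SOAPdenovo2 configuration file, it is a plain-text file with a global `max_rd_len` entry followed by one `[LIB]` section per read library. The Python sketch below writes such a file for a fragment library and a short-jump library. Every value and file name in it is a placeholder assumption, not the setting used in this workflow; the actual settings are those entered in the Galaxy tool form.

```python
def lib_section(avg_ins, reverse_seq, rank, q1, q2, asm_flags=3):
    """Format one [LIB] section of a SOAPdenovo2 configuration file."""
    return "\n".join([
        "[LIB]",
        f"avg_ins={avg_ins}",          # average insert size of the library
        f"reverse_seq={reverse_seq}",  # 0: forward-reverse, 1: reverse-forward
        f"asm_flags={asm_flags}",      # 3: use reads for contigs and scaffolds
        f"rank={rank}",                # order in which libraries are used for scaffolding
        f"q1={q1}",                    # FASTQ file with the forward reads
        f"q2={q2}",                    # FASTQ file with the reverse reads
    ])


def build_config(max_rd_len, libraries):
    """Assemble the full configuration text from a list of library dicts."""
    parts = [f"max_rd_len={max_rd_len}"]
    parts.extend(lib_section(**lib) for lib in libraries)
    return "\n".join(parts) + "\n"


# Placeholder values and hypothetical file names, for illustration only.
libraries = [
    {"avg_ins": 200, "reverse_seq": 0, "rank": 1,
     "q1": "frag_1.corrected.fq", "q2": "frag_2.corrected.fq"},
    {"avg_ins": 3000, "reverse_seq": 1, "rank": 2,
     "q1": "shortjump_1.clean.fq", "q2": "shortjump_2.clean.fq"},
]
config_text = build_config(max_rd_len=100, libraries=libraries)
print(config_text)
```

Keeping each library as a small dictionary makes it easy to see which settings differ between the fragment and short-jump libraries.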

5. GapCloser is then used to reduce the size of any gaps present in the scaffolds:

This results in the following scaffold file:

6. The Extract ACGT tool is used to split scaffolds containing any remaining gaps to produce a file of contigs (red box):
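The splitting step can be pictured with a short Python sketch that cuts each scaffold wherever a run of N characters occurs. This is an assumption about the general approach rather than the Extract ACGT tool's actual code; the function names, the contig naming scheme and the `min_gap` parameter are hypothetical.

```python
import re


def split_scaffold(sequence, min_gap=1):
    """Split one scaffold into contigs at runs of at least min_gap N characters."""
    gap = re.compile(r"[Nn]{%d,}" % min_gap)
    return [piece for piece in gap.split(sequence) if piece]


def scaffolds_to_contigs(scaffolds):
    """Map {scaffold_name: sequence} to {contig_name: sequence}."""
    contigs = {}
    for name, seq in scaffolds.items():
        for index, piece in enumerate(split_scaffold(seq), start=1):
            contigs[f"{name}_contig{index}"] = piece
    return contigs


# Toy scaffolds (hypothetical): the first contains one gap, the second none.
scaffolds = {"scaffold1": "ACGTNNNNGGCC", "scaffold2": "TTAA"}
contigs = scaffolds_to_contigs(scaffolds)
for name, seq in contigs.items():
    print(name, seq)
```

A scaffold without gaps passes through as a single contig, so the output always contains at least as many sequences as the input.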

7. The GAGE evaluation tool is then called to calculate N50 and corrected N50 statistics:
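As a reminder of what the N50 statistic measures, the Python sketch below computes it from a list of sequence lengths: the length of the sequence at which the running total, taken from longest to shortest, first reaches half of the reference total. The function name and the optional `genome_size` argument are assumptions made for illustration. This is not the GAGE script, and it does not compute the corrected N50, which additionally requires aligning the assembly to the reference genome.

```python
def n50(lengths, genome_size=None):
    """Return the N50 of a list of sequence lengths.

    If genome_size is given, half of that value is used as the target
    instead of half of the summed lengths.
    """
    total = genome_size if genome_size is not None else sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # the sequences cover less than half of genome_size


# Toy contig lengths (hypothetical).
lengths = [80, 70, 50, 40, 30, 20]
print(n50(lengths))
print(n50(lengths, genome_size=500))
```

Passing a fixed `genome_size` makes values comparable between assemblies of the same genome, because the target no longer depends on how much sequence each assembly contains.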

8. The stat tool is then used to extract the statistical results from the GAGE output:

On comparing these R. sphaeroides results with Table 2 from the SOAPdenovo2 paper, you will see that they are identical: