None
Published Pages | peterli | Using SOAPdenovo2 to assemble the S. aureus genome

Using SOAPdenovo2 to assemble the S. aureus genome

A workflow from Luo et al., (2012) SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience 1:18.

Introduction

A new version of SOAPdenovo was published in GigaScience late in 2012. In this paper, Luo and colleagues described improvements made to their de novogenome assembler which was tested on a number of short read data sets including the YH Asian genome. Through our new GigaGalaxy platform, we have made available Galaxy workflows that replicates the data analysis procedures used by Luo et al., (2012) to assemble genomes from short read data sets obtained from S. aureus and R. sphaeroides. Specifically, these two workflows replicate the metrics shown in Table 2 of the GigaScience paper:

Input data

To construct the workflow and replicate the S. aureus results, the following data sets are required from the S. aureus shorts reads shared data library:

These data sets are stored as Data Libraries which need to be loaded into your history panel. Click on the Shared Data drop down menu and select Data Libraries:

 

A list of available datasets for analysis is now displayed. Select the S. aureus short reads and ref sequence:

The contents of this data set is now displayed. Import all of the files into your history:

Workflow

The pipeline that reproduces the above S. aureus results requires the following steps:

Tool execution

1. The first step in the workflow involves processing the S. aureus data using KmerFreq_AR to produce a kmer frequency spectrum. Select the KmerFreq_AR tool from the left hand tool panel, and configure the input and parameters as shown below:

2. A number of outputs are generated by KmerFreq_AR including a file called KmerFreq.stats which is a file containing summary statistics of the KmerFreq process. The other files are kmerfreq.infiles which contains the paths of the files processed by KmerFreq, kmerfreq.cz, kmerfreq.len and genome.estimate.

3. The kmerfreq.infiles, kmerfreq.cz and kmerfreq.cz.len outputs should be passed to Corrector_AR to perform the actual correction of short reads. A number of parameters in Corrector_AR should also be configured as shown below:

4. If the Corrector_AR tool is now executed, it will produce a number of outputs including the corrected forward and reverse reads, as well as an HTML file which provides links to other supplementary files that Corrector_AR produces:

5. We are now in a position to use SOAPdenovo2 with the corrected short read data sets. If you select SOAPdenovo2 from the tool panel, you will see its tool interface page:

What we now need to do is inform SOAPdenovo2 of the location of the short reads that it needs to work on and provide it with information about the reads. This is done by creating a configuration file via SOAPdenovo2's interface page. For the first library of short reads which were corrected by KmerFreq_AR and Corrector_AR, the parameters should be set as below:

The second set of reads, which are labelled as short_jump_1.fastq and short_jump_2.fastq in the history panel, should have its configuration file parameters set as below:

In addition, set the SOAPdenovo2 parameters; specifically the k-value, delete kmers with frequency no larger than, resolve repeats and fill gaps, as above too. If you click Execute, the SOAPdenovo2 process will run and generate 3 outputs: a file containing contigs, a file containing scaffolds and the SOAP configuration file used to configure its execution:

6. The scaffolds generated from the de novo assembly processes can contain gaps in between contigs. These gaps can be closed using the GapCloser tool present in the tool panel. The library configuration file from SOAPdenovo2 is used to control its execution whilst its remaining inputs and parameters should be set as shown below:

Two files are created by GapCloser including a new file containing scaffolds:

7. Despite the use of GapCloser, the scaffolds might still contain gaps represented by 'N's in the sequences. To obtain a more accurate result, we can remove these Ns, thereby splitting the scaffolds into smaller sequences using the Extract ACGT tool:

8. We can now evaluate the results of SOAPdenovo2 and GapCloser by calculating the N50 and corrected N50 scores. To do this, select the GAGE evaluation tool and configure its input ports as shown below:

A single output called gage.output is produced which contains statistics of the contigs and scaffolds including the N50 score, a measure of the quality of the assembly. These statistics can be extracted using the stat tool which produces this output:

9. When you compare the results produced by this pipeline with the results in Table 2 of the SOAPdenovo2 paper, you will see that they are the same: