In addition to the assembly results of S. aureus, Table 2 in the GigaScience SOAPdenovo2 paper also shows results from short reads sequenced from R. sphaeroides:
To construct the workflow to replicate the R. sphaeroides results in Table 2, the following short read and genome sequence data are required from the list of Data Libraries:
The pipeline used to generate the assembly for this bacterium differs from the one used for S. aureus in that SOAPfilter is added to pre-process the R. sphaeroides data set:
1. There are two sets of libraries in the R. sphaeroides dataset, and each is pre-processed differently before the SOAPdenovo2 genome assembly step. First, the short-jump reads are cleaned using the SOAPfilter tool, whose parameters must be set to those shown below:
SOAPfilter produces three files: two are cleaned versions of the two short-jump read files, and the third is a statistical summary of the filtering process performed by SOAPfilter:
2. The R. sphaeroides frag reads have to be corrected using the KmerFreq_AR and Corrector_AR tools. The screenshot below shows the values of the inputs and parameters for KmerFreq_AR:
3. The kmerfreq.infiles, kmerfreq.cz and kmerfreq.cz.len outputs from KmerFreq_AR should be passed to the Corrector_AR tool as shown below:
Corrector_AR will produce corrected forward and reverse reads in compressed fast format based on the frag_1 and frag_2 library files:
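The idea behind steps 2 and 3 can be illustrated with a minimal Python sketch. It is not the KmerFreq_AR or Corrector_AR implementation: it only shows the general principle of counting k-mer frequencies across reads and flagging rare k-mers as likely sequencing errors. The function names, the toy reads, `k=4` and `min_count=2` are all assumptions made for illustration.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every substring of length k across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def low_frequency_kmers(counts, min_count):
    """Return the k-mers seen fewer than min_count times.

    Rare k-mers are more likely to contain a sequencing error than
    k-mers that recur in many reads.
    """
    return {kmer for kmer, n in counts.items() if n < min_count}


# Toy reads (hypothetical): the third read differs from the others in its last base.
reads = ["ACGTACGT", "ACGTACGT", "ACGTACGA"]
counts = count_kmers(reads, k=4)
suspect = low_frequency_kmers(counts, min_count=2)
print(sorted(suspect))
```

In this toy input, the only rare k-mer is the one covering the mismatched final base of the third read, which is the kind of signal a k-mer based corrector can act on.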
4. After pre-processing with SOAPfilter, KmerFreq_AR and Corrector_AR, SOAPdenovo2 is applied to the filtered and corrected reads to create a draft assembly. SOAPdenovo2 should be configured as follows:
Execution of SOAPdenovo2 with these parameters produces the following scaffold and contig files, as well as a configuration file:
5. GapCloser is then used to reduce the size of any gaps present in the scaffolds:
This results in the following scaffold file:
6. The Extract ACGT tool is used to split scaffolds containing any remaining gaps to produce a file of contigs (red box):
7. The GAGE evaluation tool is then called to calculate N50 and corrected N50 statistics:
8. The stat tool is then used to extract the statistical results from the GAGE output:
On comparing these R. sphaeroides results with Table 2 from the SOAPdenovo2 paper, you will see that they are identical:
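To make steps 6 to 8 more concrete, the following Python sketch shows how scaffolds can be split into contigs at gaps and how an N50 value is computed from a list of sequence lengths. It is an illustration only, not the code of the Extract ACGT, GAGE or stat tools. The function names, the `min_gap` parameter and the toy sequences are assumptions, gaps are assumed to be runs of the character N, and GAGE's corrected N50, which also takes assembly errors against the reference genome into account, is not reproduced here.

```python
import re


def split_scaffold(sequence, min_gap=1):
    """Split a scaffold into contigs at runs of at least min_gap 'N' characters."""
    pieces = re.split(r"[Nn]{%d,}" % min_gap, sequence)
    # Drop empty strings produced by leading or trailing gaps.
    return [p for p in pieces if p]


def n50(lengths):
    """Return the largest length L such that sequences of length >= L
    together cover at least half of the total length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


# Toy scaffolds (hypothetical), two of which contain gaps.
scaffolds = ["ACGTACGTNNNNACGT", "ACGTAC", "NNACGTACGTAC"]
contigs = [c for s in scaffolds for c in split_scaffold(s)]
contig_lengths = [len(c) for c in contigs]

print("scaffold N50:", n50([len(s) for s in scaffolds]))
print("contig N50:", n50(contig_lengths))
```

Splitting at gaps can only shorten sequences, so the contig N50 is never larger than the scaffold N50 of the same assembly.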
peterli