Alignment_mapping (Galaxy tool version 0.1)

Select a reference sequence:

What type of mapping do you want to perform?:

FASTA file:

SOAP settings to use:

The default settings are suitable for most mapping needs. If you want full control, use the full parameter list.

What it does

SOAP performs efficient gapped and ungapped alignment of short oligonucleotides onto reference sequences. The program is designed to handle short reads generated by parallel sequencing using the new generation Illumina-Solexa sequencing technology. SOAP is compatible with numerous applications, including single-read or pair-end resequencing, small RNA discovery and mRNA tag sequence mapping. SOAP supports multi-threaded parallel computing, and has a batch mode for querying multiple data sets.

Single-end sequencing

SOAP allows a certain number of mismatches or one continuous gap when aligning a read onto a reference sequence. For each read, the best hit is reported, that is, the hit with the fewest mismatches or the smallest gap. For multiple equal best hits, the user can instruct SOAP to report all hits, report a random one, or disregard all of them. Since the typical read length is 25-50 bp, hits with too many mismatches are unreliable and hard to distinguish from random matches. By default, the program allows at most two mismatches. Between two haplotype genome sequences, single nucleotide polymorphisms occur much more often than small insertions or deletions, so ungapped hits take precedence over gapped hits. For gapped alignment, only one continuous gap of 1 to 3 bp is accepted, and no mismatches are permitted in the flanking regions, to avoid ambiguous gaps. The gap can be either an insertion or a deletion in the query or reference sequence.
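
The three ways of handling multiple equal best hits can be sketched as follows. This is only an illustration: the function name and the mode names "all", "random" and "none" are hypothetical and are not SOAP's actual option values.

```python
import random


def resolve_equal_best_hits(hits, mode="all", rng=None):
    """Apply a reporting policy to a list of equally good best hits.

    The mode names are illustrative assumptions, not SOAP's real flags.
    A single best hit is always reported, whatever the mode.
    """
    if len(hits) <= 1 or mode == "all":
        return list(hits)
    if mode == "random":
        # Pick one of the equal best hits at random.
        return [(rng or random).choice(hits)]
    if mode == "none":
        # Disregard reads that map equally well to several places.
        return []
    raise ValueError(f"unknown mode: {mode!r}")
```

For example, resolve_equal_best_hits([5, 17], "none") returns an empty list, while a read with a single best hit is reported under every mode.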

An intrinsic characteristic of sequencing technology is the accumulation of errors during the sequencing process. Reads tend to exhibit a higher number of sequencing errors at the 3'-end, which sometimes prevents them from being aligned onto reference sequences. To deal with this problem, SOAP can iteratively trim several base pairs at the 3'-end and redo the alignment, until hits are detected or the remaining sequence is too short for specific alignment.
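
A minimal sketch of the iterative trimming loop, assuming exact string search as a stand-in for the real aligner. The names align_with_trimming and find_exact_hits, and the default values of trim_step and min_length, are hypothetical and chosen only for illustration. Positions here are 0-based.

```python
def find_exact_hits(read, reference):
    """Return all 0-based positions where read occurs exactly in reference."""
    hits, start = [], reference.find(read)
    while start != -1:
        hits.append(start)
        start = reference.find(read, start + 1)
    return hits


def align_with_trimming(read, reference, trim_step=3, min_length=10):
    """Trim bases from the 3'-end until a hit is found or the read is too short.

    trim_step and min_length are illustrative defaults, not SOAP's values.
    Returns the (possibly trimmed) read and its hits, or (None, []).
    """
    current = read
    while len(current) >= min_length:
        hits = find_exact_hits(current, reference)
        if hits:
            return current, hits
        # No hit: drop trim_step bases from the 3'-end and try again.
        current = current[:-trim_step]
    return None, []
```

With the reference "GATTACAGATTACACCGGTTAACC" and the read "CAGATTACACCGGAAA", whose last three bases are errors, one round of trimming yields the 13-bp read "CAGATTACACCGG", which is found at position 5.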

Pair-end sequencing

This methodology involves sequencing both ends of a DNA fragment. A pair of reads has a fixed relative orientation and an approximate distance from each other on the genome. This characteristic can significantly improve the accuracy of resequencing mapping, and is a powerful method for detecting structural variants including copy number variations (CNVs), rearrangements and inversions. SOAP is able to align a pair of reads simultaneously. A pair is aligned when the two reads are mapped with the right relative orientation and a proper distance. As in single-read alignment, a certain number of mismatches are allowed in one or both reads of the pair. For gapped alignment, a gap is permitted on only one read, and the other read must match exactly.
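
The orientation and distance test for a pair can be sketched as below. This assumes that a proper pair has one read on the + chain and the other on the - chain, with the + read upstream, and that the insert size is measured from the start of the + read to the end of the - read. The function name and the parameters min_insert and max_insert are hypothetical.

```python
def is_proper_pair(hit_a, hit_b, read_length, min_insert, max_insert):
    """Check whether two hits form a properly oriented, properly spaced pair.

    Each hit is a (position, strand) tuple with strand "+" or "-".
    The orientation rule and insert-size definition are assumptions
    made for this sketch.
    """
    (pos_a, strand_a), (pos_b, strand_b) = hit_a, hit_b
    if strand_a == strand_b:
        return False  # both reads on the same chain: wrong orientation
    forward, reverse = (pos_a, pos_b) if strand_a == "+" else (pos_b, pos_a)
    if reverse < forward:
        return False  # reads point away from each other
    insert_size = reverse + read_length - forward
    return min_insert <= insert_size <= max_insert
```

For instance, with 40-bp reads and an accepted insert range of 400 to 600 bp, hits at (100, "+") and (460, "-") span 400 bp and are accepted, while hits at (100, "+") and (1000, "-") are rejected.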

Output file format

Each line corresponds to an alignment hit between a read and its reference, and has the following columns:

Column        Description
------------  --------------------------------------------------------
1  Id         An identifier for the read sequence.
2  Seq        Full read sequence. It will be converted to the
              complementary sequence if mapped on the reverse chain of
              the reference.
3  Qual       Sequence quality.
4  Numhits    Number of best hits. Reads with no hits are ignored.
5  a/b        A flag used for pair-end alignment that distinguishes
              which file the read belongs to.
6  Length     Read length. If aligned after trimming, it will report
              the length of the trimmed read.
7  +/-        Denotes whether alignment occurs on the direct (+) or
              reverse (-) chain of the reference.
8  Chr        Identifier of the reference sequence.
9  Location   Location of the first base pair on the reference,
              counted from 1.
10 Types      Type of hit associated with the read:

              0: exact match.

              1-100 "RefAllele->OffsetQueryAlleleQual": number of
              mismatches, followed by detailed mutation sites and
              switch of allele types. Offset is relative to the
              initial location on the reference. "OffsetAlleleQual":
              offset, allele and quality.
              Example: "2 A->10T30 C->13A32" means there are two
              mismatches, one at location+10 of the reference and the
              other at location+13 of the reference. The alleles on
              the reference are A and C respectively, while the query
              allele types and their qualities are T,30 and A,32.

              100+n Offset: n-bp insertion on the read.
              Example: "101 15" means a 1-bp insertion on the read,
              starting after location+15 on the reference.

              200+n Offset: n-bp deletion on the read.
              Example: "202 16" means a 2-bp deletion on the query,
              starting after 16 bp on the reference.
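
The encoding of the Types column can be illustrated with a small parser built from the three examples above. This is a sketch, not part of SOAP: the function name and the returned dictionary layout are hypothetical, and it assumes the type value and its details arrive as one whitespace-separated string, since the delimiter used in the actual output file is not stated here.

```python
import re

# One mismatch site, e.g. "A->10T30": reference allele, offset,
# query allele, quality.
MISMATCH_SITE = re.compile(r"([A-Za-z])->(\d+)([A-Za-z])(\d+)")


def parse_hit_type(field):
    """Decode a Types value such as "2 A->10T30 C->13A32" or "101 15"."""
    tokens = field.split()
    code = int(tokens[0])
    if code == 0:
        return {"kind": "exact"}
    if code <= 100:
        sites = []
        for token in tokens[1:]:
            match = MISMATCH_SITE.fullmatch(token)
            if match is None:
                raise ValueError(f"unrecognised mismatch site: {token!r}")
            ref, offset, query, qual = match.groups()
            sites.append({"ref": ref, "offset": int(offset),
                          "query": query, "qual": int(qual)})
        return {"kind": "mismatch", "count": code, "sites": sites}
    if code <= 200:
        # 100+n: n-bp insertion on the read.
        return {"kind": "insertion", "size": code - 100,
                "offset": int(tokens[1])}
    # 200+n: n-bp deletion on the read.
    return {"kind": "deletion", "size": code - 200, "offset": int(tokens[1])}
```

Calling parse_hit_type("202 16") gives a deletion of size 2 at offset 16, matching the last example in the table.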

Algorithm

SOAP1 initially loads reference sequences into memory and then creates hash tables for seed indexing. For each query, a search is performed for seeded hits followed by the alignment of the read to the reference.

1. Load in reference sequences. In contrast to Eland and Maq, which load reads into RAM, SOAP stores the reference sequences in memory. Two bits are used for each base, so one byte can store 4 bp. In theory, a reference with total sequence size L needs L/4 bytes.
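
The 2-bit packing can be sketched as follows. The particular code assignment (A=0, C=1, G=2, T=3) and the bit order within each byte are assumptions for illustration; SOAP's internal layout may differ, and ambiguous bases such as N are not handled here.

```python
# Assumed 2-bit codes; not taken from the SOAP source.
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = "ACGT"


def pack_2bit(seq):
    """Pack a DNA string into bytes, four bases per byte."""
    packed = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        # Base i lives in byte i // 4, at bit offset 2 * (i % 4).
        packed[i // 4] |= BASE_TO_CODE[base] << (2 * (i % 4))
    return bytes(packed)


def base_at(packed, i):
    """Read base i back out of a packed sequence."""
    return CODE_TO_BASE[(packed[i // 4] >> (2 * (i % 4))) & 3]
```

A 9-bp sequence therefore occupies 3 bytes, in line with the L/4 estimate rounded up.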

2. Suppose a read is split into 4 parts: a, b, c and d. Two mismatches can fall in at most two of the 4 parts at the same time. So if we use a combination of two parts as the seed, and check for mismatches in the remaining parts, we are able to get all hits with up to 2 mismatches. There are six combinations (ab, ac, ad, bc, bd and cd), and essentially 3 types of seeds (ab, ac, ad). So we build 3 index tables. To save memory, we set a skip of 3 bp on the reference. The strategy is essentially the same as that used in the Eland and Maq programs.
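
The reasoning behind the seed combinations can be checked with a short sketch. It splits a read into four parts and lists which two-part seeds still match exactly against an equal-length stretch of reference. The function names are hypothetical, and the index tables and the 3-bp skip are not modelled.

```python
from itertools import combinations

# The six two-part seed combinations named in the text.
ALL_SEED_PAIRS = ["".join(pair) for pair in combinations("abcd", 2)]


def split_into_parts(length, n_parts=4):
    """Return (start, end) bounds of n_parts nearly equal parts."""
    base, extra = divmod(length, n_parts)
    bounds, start = [], 0
    for i in range(n_parts):
        end = start + base + (1 if i < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def intact_seed_pairs(read, target):
    """List the seed pairs whose two parts both match target exactly.

    target is the reference stretch the read is compared against and
    must have the same length as read.
    """
    bounds = split_into_parts(len(read))
    intact = [name for name, (s, e) in zip("abcd", bounds)
              if read[s:e] == target[s:e]]
    return ["".join(pair) for pair in combinations(intact, 2)]
```

With two mismatches, at most two parts are damaged, so at least two parts stay intact and at least one seed pair survives. For the read "ACGTACGTACGTACGT" against "AGGTACGTAGGTACGT", parts a and c are damaged and the surviving seed is "bd".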

3. Lookup table. We use a lookup table to determine how many mismatches are present between a reference and a read. For best efficiency, the table uses 3 bytes to check a 12-bp fragment at a time. The table occupies 2^24 = 16 MB of RAM.
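
One way such a table can work is sketched below, under the assumption that two 2-bit encoded fragments are XORed and the result is used as the table index. To keep the example quick to run, the table is built for 4-bp fragments (2^8 entries) rather than the 12-bp fragments (2^24 entries) described above; the function names and the base encoding are hypothetical.

```python
def build_mismatch_table(fragment_bp):
    """Map every XOR pattern of two encoded fragments to a mismatch count.

    Each base takes 2 bits, so the table has 4 ** fragment_bp entries.
    A base position mismatches when its 2-bit XOR group is non-zero.
    """
    size = 1 << (2 * fragment_bp)
    table = bytearray(size)
    for x in range(1, size):
        # Count for x = count for the higher groups + lowest 2-bit group.
        table[x] = table[x >> 2] + (1 if x & 3 else 0)
    return table


def encode(seq):
    """Encode a DNA string as an integer, 2 bits per base (A, C, G, T)."""
    value = 0
    for base in seq:
        value = (value << 2) | "ACGT".index(base)
    return value


def count_mismatches(a, b, table):
    """Count mismatching bases between two equal-length fragments."""
    return table[encode(a) ^ encode(b)]


TABLE = build_mismatch_table(4)  # scaled down from 12 bp for illustration
```

Here count_mismatches("ACGT", "AGGA", TABLE) finds the two differing positions with a single table access instead of a base-by-base comparison.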

4. Search for hits. Identical hits are identified first. If no hits are found then 1-mismatch hits will be picked up followed by 2-mismatch hits. Finally, gapped hits are identified.
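
The search order for ungapped hits can be sketched as a tiered scan. This brute-force version compares the read against every reference window instead of using seeded index tables, and it leaves out the final gapped step, so it only illustrates the order in which hit classes are reported. The function names are hypothetical. Positions are counted from 1, as in the output format.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def best_ungapped_hits(read, reference, max_mismatches=2):
    """Return (mismatches, positions) for the best ungapped hit class.

    Exact hits win over 1-mismatch hits, which win over 2-mismatch hits.
    Returns (None, []) when no window is within max_mismatches.
    """
    by_mismatches = {}
    for start in range(len(reference) - len(read) + 1):
        distance = hamming(read, reference[start:start + len(read)])
        if distance <= max_mismatches:
            by_mismatches.setdefault(distance, []).append(start + 1)
    for distance in range(max_mismatches + 1):
        if distance in by_mismatches:
            return distance, by_mismatches[distance]
    return None, []
```

Against the reference "TTTTACGTACCATTTTACGTAGCATTTT", the read "ACGTACCA" has an exact hit at position 5, so its 1-mismatch hit at position 17 is not reported. The read "ACGTATCA" has no exact hit, and both 1-mismatch positions are returned.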

Important notes

1. The use of very short sequences should be avoided in the reference genome. The program will break if there are reference sequences which are shorter than the query reads so these should be removed.

Simple rules for setting the seed size parameter:

S*2+3=Min(Read length)
Hash size=4^S, normally S=12bp
Larger S will be faster
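
As a worked example of these rules, the sketch below solves S*2+3 = Min(Read length) for S and computes the resulting hash size. Rounding down when the minimum read length is even is an assumption, and the function names are hypothetical.

```python
def seed_size(min_read_length):
    """Largest S satisfying S * 2 + 3 <= min_read_length."""
    return (min_read_length - 3) // 2


def hash_size(s):
    """Number of possible seeds of S bases: 4 ** S."""
    return 4 ** s
```

For a minimum read length of 27 bp this gives S = 12, the value the rules call normal, and a hash size of 4^12 = 16777216.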

Citation

If you use SOAP in your research, we would appreciate the citation of its paper:

Ruiqiang Li, et al. SOAP: short oligonucleotide alignment program. Bioinformatics. 2008 24: 713-714.

More information

For test data and more detailed information, click here.

This tool was installed from a ToolShed; you may be able to find additional information by following this link: http://gigatoolshed.net/view/peterli/soap1