See also
UPARSE home page
OTU benchmark results
UPARSE algorithm
Example UPARSE command lines
A UPARSE pipeline clusters NGS amplicon reads into OTUs using the cluster_otus command. This page discusses the pre- and post-processing steps that are typically required to get the best results from cluster_otus in practice.
It is not possible to give a single set of command lines that will work for all datasets because there are many variations in the input data, especially in read layout. This page summarizes the steps that are usually performed by a pipeline, and provides links to further discussion and details of the command lines that can be used. The example command lines linked above are a good starting point.
Reads with Phred (quality) scores
I strongly recommend starting from "raw" reads, i.e. the reads
originally provided by the sequencing machine base-calling software. Phred
scores should be retained, and you should do quality filtering with USEARCH
rather than using reads that have already been filtered by third-party
software. Start by converting to FASTQ format, if needed. If you have 454
reads in FASTA + QUAL format, you can use the faqual2fastq.py script to
convert.
Sample pooling
I recommend combining reads from as many samples as possible if they contain
similar communities and/or if you are planning to compare samples using
measures such as beta diversity. See sample pooling
for discussion.
Read quality filtering
Quality filtering of the reads should be done using USEARCH because the
maximum expected error filtering method is much more effective at suppressing
reads with high error rates than other filters, e.g. those based on
average Q scores. A maximum expected error threshold of 1.0
is a good default choice (the -fastq_maxee 1.0 option to
fastq_filter).
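As a sketch (the filenames here are placeholders), the filtering command might look like:

```shell
# Discard reads with more than 1.0 expected errors and write
# the surviving reads in FASTA format, ready for dereplication.
usearch -fastq_filter reads.fastq -fastq_maxee 1.0 -fastaout filtered.fa
```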
Demultiplexed Illumina reads
See here for discussion of adding sample labels to
demultiplexed Illumina reads, i.e. reads that
are already split into separate FASTQ files by barcode/sample identifier.
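If your scripts expect a barcodelabel=NAME; annotation in the read labels, one convenient way to add it to per-sample files before pooling (a sketch assuming GNU sed and one FASTQ file per sample; the filenames and sample name are placeholders) is:

```shell
# Prefix every FASTQ header line (every 4th line, starting at
# line 1) with a sample annotation before pooling the files.
sed '1~4 s/^@/@barcodelabel=SampleA;/' SampleA.fastq > SampleA_labeled.fastq
```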
Flowgram denoising
If you have 454 reads, then as an alternative to quality filtering with
USEARCH you can generate FASTA format reads by denoising flowgrams using a
third-party algorithm [Pubmed:20805793,
Pubmed:19668203]. This
may give a small improvement in OTU sequence accuracy compared to Phred score
quality filtering, but denoising can be very computationally intensive and I
generally don't consider it worth the effort. If you do choose to use denoising,
then you should convert the output from the denoising program so that
size annotations are added to the labels in
USEARCH format, remove barcodes (adding sample identifiers to the read labels),
and then skip ahead to the abundance sort step below. If you use a denoising
package (e.g. AmpliconNoise)
that includes a chimera filter (Perseus in the case of AN), then you should turn
off the chimera filter, i.e., extract denoised reads before any chimera
filtering step.
FASTA reads
See "Flowgram denoising" above if the FASTA reads were produced by a
denoising program. If "raw" reads or reads that have been quality-filtered by a
third-party program are only available in FASTA format, then you should start by
trimming them to a fixed length (see
fastx_truncate), unless the reads contain full-length amplicons,
in which case this step may not be necessary. See
global trimming for discussion. Since quality information is not available,
you cannot choose the trim length based on predicted error rates. Instead, you
could choose a value that is, say, a few percent shorter than the average length
in order to maximize the number of bases retained. However, you should be cautious
here because quality tends to get worse towards the end of a read. For example,
if you have 454 reads that are, say, 400 bases or longer, then it might be
better to truncate to a shorter length, e.g. 250 or 300 bases, as this could
substantially reduce the error rate.
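For example, assuming a USEARCH version that supports the fastx_truncate command (the filenames and the 250-base length are illustrative):

```shell
# Truncate all reads to a fixed length of 250 bases; reads
# shorter than the truncation length are discarded.
usearch -fastx_truncate reads.fa -trunclen 250 -fastaout trimmed.fa
```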
Length trimming
Trimming to a fixed position is
critically important for achieving the best results. For unpaired
reads, trim to a fixed length. For overlapping paired reads, the reverse read
should start at an amplification primer, which achieves an equivalent result.
The important point is that identical or very similar reads must be globally
alignable with no terminal gaps. See
global trimming for discussion.
Paired reads
Paired reads should be merged using the
fastq_mergepairs command before quality filtering. After merging, you should use fastq_filter with a
maximum expected error threshold. Length truncation is typically not needed
since the merged pairs usually cover full-length amplicons (see
global trimming for discussion).
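A minimal sketch of merging followed by filtering (filenames are placeholders):

```shell
# Merge overlapping forward/reverse read pairs.
usearch -fastq_mergepairs fwd.fastq -reverse rev.fastq -fastqout merged.fastq

# Quality-filter the merged reads by maximum expected errors.
usearch -fastq_filter merged.fastq -fastq_maxee 1.0 -fastaout filtered.fa
```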
Barcodes
Barcodes and any other non-biological sequence must be stripped from the
reads before dereplication. This can be done using the
fastq_strip_barcode_relabel.py script or any other convenient method. The
barcode must be removed before dereplication so that identical copies of the
same biological sequence can be recognized as duplicates. The barcode sequence
or a sample label should be inserted into the read label so that each read can
later be mapped back to an OTU and to its sample. It is recommended to strip
barcodes and other non-biological sequence before quality filtering.
Dereplication
Input to dereplication is a set of reads in
FASTA format with non-biological sequences such as barcodes stripped. The reads
should be globally trimmed before
dereplication, and quality filtered if
possible as described above. You should use the
derep_fulllength command with the -sizeout option. I recommend
pooling samples before dereplication.
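For example (the output option is -output in USEARCH v7; later versions use -fastaout; filenames are placeholders):

```shell
# Collapse identical full-length reads; -sizeout adds size=N;
# abundance annotations to the sequence labels.
usearch -derep_fulllength filtered.fa -sizeout -output derep.fa
```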
Abundance sort
Use the sortbysize command to sort the
dereplicated reads by decreasing abundance. To discard
singletons (usually recommended), use the
-minsize 2 option.
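For example (assuming the v7-style -output option; filenames are placeholders):

```shell
# Sort by decreasing abundance, discarding singletons (size=1).
usearch -sortbysize derep.fa -minsize 2 -output sorted.fa
```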
OTU clustering
To create OTUs, run the cluster_otus command
with the abundance-sorted reads as input. This will generate a set of OTU
representative sequences. I recommend using the following options: -otus otus.fa
-uparseout results.txt -relabel OTU_ -sizeout. The uparseout file classifies
each input sequence, the -relabel OTU_ option creates new labels OTU_1, OTU_2...
for the OTU centroid sequences (otus.fa), and the -sizeout option records how
many input sequences were assigned to each OTU using a
size annotation.
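Putting the recommended options together (filenames are placeholders):

```shell
# Cluster the sorted, dereplicated reads into OTUs.
usearch -cluster_otus sorted.fa -otus otus.fa -uparseout results.txt \
  -relabel OTU_ -sizeout
```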
Chimera filtering
The cluster_otus command discards reads that have chimeric models built from
more abundant reads. However, a few chimeras may be missed, especially if they
have parents that are absent from the reads or are present with very low
abundance. It is therefore recommended to add a reference-based chimera filtering
step using UCHIME if a suitable database is
available. Use the uchime_ref command for this
step with the OTU representative sequences as input and the
-nonchimeras option to get a chimera-filtered set of OTU sequences. For
the 16S gene, I recommend the
gold database (do
not use a large 16S database like Greengenes). For
the ITS region, you could try using the UNITE
database as a reference.
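For example, with the reference database downloaded as gold.fa (filenames are placeholders; -strand plus assumes the OTU sequences are in the same orientation as the database):

```shell
# Reference-based chimera filtering of the OTU sequences.
usearch -uchime_ref otus.fa -db gold.fa -strand plus -nonchimeras otus_nochim.fa
```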
Creating an OTU table
To create an OTU table, you should first map
reads to OTUs. Then you can use the
uc2otutab.py
script to generate the OTU table.
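A sketch of both steps, assuming the reads are mapped to the OTU sequences with usearch_global at 97% identity (filenames are placeholders):

```shell
# Map the reads to the OTU sequences, writing a .uc mapping file.
usearch -usearch_global reads.fa -db otus.fa -strand plus -id 0.97 -uc map.uc

# Convert the .uc mapping file into a tabbed OTU table.
python uc2otutab.py map.uc > otu_table.txt
```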
Taxonomy assignment
You can use the utax command to assign taxonomy
to the OTU representative sequences.
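Command-line details for utax vary between USEARCH versions, so treat this as a sketch only (the database filename is a placeholder; consult the utax documentation for your version):

```shell
# Assign taxonomy to the OTU sequences using a UTAX database.
usearch -utax otus.fa -db utax_db.udb -strand both -utaxout taxonomy.txt
```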