HIV-1 Ultradeep Sequence Analysis Pipeline
Manuscript currently in preparation
Phase 0: Quality Filtering
Sequences of low quality (as determined by their PHRED scores) are removed before alignment. The current defaults retain only reads with a minimum length of 100 bp
and PHRED scores > 20. A PHRED score of 10 corresponds to approximately 90% base-call accuracy, whereas a score of 20 corresponds to approximately 99%. The HIV-1 454 analysis pipeline is also available for download.
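The relationship between a PHRED score Q and the base-call error probability is P = 10^(-Q/10). A minimal sketch of the filter follows. The function names are hypothetical, and averaging the per-base scores over the read is an assumption, since the text does not say how scores are summarised per read.

```python
def phred_to_error_prob(q):
    """Convert a PHRED score Q to the probability that the base call is wrong."""
    return 10 ** (-q / 10)


def passes_filter(quals, min_length=100, min_quality=20):
    """Hypothetical read filter.

    quals: list of per-base PHRED scores for one read.
    Assumption: the read-level score is the mean of the per-base scores.
    """
    if len(quals) < min_length:
        return False
    return sum(quals) / len(quals) > min_quality


if __name__ == "__main__":
    print(phred_to_error_prob(20))    # error probability at Q20
    print(passes_filter([30] * 120))  # long, high-quality read
```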
Phase 1: Amino acid and nucleotide alignment
The alignment phase first aligns each read to a chosen reference sequence at the amino acid level. Only alignments that exceed an alignment score
threshold are retained, where the threshold is 5 times the alignment score expected from a read of equal length and identical base composition. The next step tries to recover reads that failed the
amino acid alignment by aligning them pairwise, at the nucleotide level, to the consensus of the reads that passed. A read is included in this second step if its per-nucleotide alignment
score exceeds the median per-nucleotide score of all reads included in the amino acid alignment step.
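The inclusion rule for the second step can be sketched as follows. This is an illustration only: the tuple layout and function names are hypothetical, and the alignment scores are assumed to have been computed already by whatever aligner is in use.

```python
from statistics import median


def per_nucleotide_score(score, length):
    """Alignment score normalised by read length."""
    return score / length


def rescue_reads(passed, failed):
    """Return ids of failed reads whose per-nucleotide score beats the median.

    passed, failed: lists of (read_id, alignment_score, read_length) tuples
    (a hypothetical layout chosen for this sketch).
    """
    cutoff = median(per_nucleotide_score(s, n) for _, s, n in passed)
    return [rid for rid, s, n in failed if per_nucleotide_score(s, n) > cutoff]


if __name__ == "__main__":
    passed = [("a", 200, 100), ("b", 300, 100), ("c", 400, 100)]
    failed = [("x", 350, 100), ("y", 250, 100)]
    print(rescue_reads(passed, failed))
```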
Phase 2: Estimation of summary statistics
This phase reports summary statistics on read length, depth and frequencies of minority variants.
Phase 3: Diversity Analysis
This phase estimates nucleotide diversity in sliding windows that meet the minimum coverage criteria. Phylogenies are also estimated within sliding windows, and bootstrap resampling is
applied to the window that has at least 4 variants and the maximum nucleotide diversity. The bootstrap analysis is useful for detecting dual or multiple infection, although the power to recover well-supported trees is
reduced because reads are typically short (<200 bp).
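As an illustration of the sliding window estimate, the sketch below computes nucleotide diversity as the mean, over the sites in a window, of the probability that two distinct reads differ at that site. The window size, step and coverage defaults are placeholders rather than the pipeline's actual settings, and reads are assumed to be padded to a common alignment length with '-'.

```python
from collections import Counter


def site_diversity(column):
    """Probability that two distinct reads drawn at random differ at this site."""
    bases = [b for b in column if b in "ACGT"]
    n = len(bases)
    same = sum(c * (c - 1) for c in Counter(bases).values())
    return 1 - same / (n * (n - 1))


def sliding_window_diversity(reads, window=50, step=10, min_coverage=10):
    """Return (window_start, diversity) for windows meeting the coverage rule.

    reads: equal-length aligned strings; '-' marks positions a read does not cover.
    A window is skipped if any of its sites has fewer than min_coverage bases.
    """
    needed = max(2, min_coverage)  # at least two reads are required for a pair
    results = []
    for start in range(0, len(reads[0]) - window + 1, step):
        values = []
        for pos in range(start, start + window):
            column = [r[pos] for r in reads]
            if sum(b in "ACGT" for b in column) < needed:
                values = None
                break
            values.append(site_diversity(column))
        if values is not None:
            results.append((start, sum(values) / window))
    return results


if __name__ == "__main__":
    reads = ["ACGT", "ACGA", "ACGT"]
    print(sliding_window_diversity(reads, window=2, step=1, min_coverage=2))
```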
Phase 4: Mutation rate estimation
The number of mutation rate classes is estimated using a binomial mixture model. Briefly, we fit a model with a single rate class and estimate the mutation rate from a binomial distribution with the number of successes
equal to the number of observed mutations at a site, and the number of trials equal to the observed coverage at a site. Additional rate classes are added using a mixture of binomial models until model fit (evaluated using
AIC) is no longer improved. The parameters of the binomial mixture model (i.e. rates and their respective proportions) are estimated using maximum likelihood.
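The procedure above can be sketched in plain Python. This is not the pipeline's own code: it assumes expectation-maximisation (EM) as the maximum likelihood optimiser, a quantile-based starting point, and a cap on the number of classes, none of which is specified in the text.

```python
import math


def log_binom_pmf(k, n, p):
    """Log-probability of k mutations in n reads at mutation rate p."""
    p = min(max(p, 1e-12), 1 - 1e-12)  # keep log() finite at the boundaries
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log(1 - p)


def site_log_terms(k, n, rates, weights):
    """Log of weight * binomial probability for each rate class at one site."""
    return [math.log(w) + log_binom_pmf(k, n, r) for r, w in zip(rates, weights)]


def log_sum_exp(values):
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def fit_mixture(counts, coverage, n_classes, iterations=200):
    """Fit a binomial mixture by EM; returns (rates, weights, log_likelihood)."""
    sites = list(zip(counts, coverage))
    observed = sorted(k / n for k, n in sites)
    # Assumption: start the rates at evenly spaced quantiles of the per-site rates.
    rates = [observed[int((j + 0.5) * len(observed) / n_classes)]
             for j in range(n_classes)]
    weights = [1.0 / n_classes] * n_classes
    for _ in range(iterations):
        # E-step: posterior probability of each class for each site.
        resp = []
        for k, n in sites:
            terms = site_log_terms(k, n, rates, weights)
            total = log_sum_exp(terms)
            resp.append([math.exp(t - total) for t in terms])
        # M-step: update class proportions and rates.
        for j in range(n_classes):
            mass = sum(row[j] for row in resp)
            weights[j] = max(mass / len(sites), 1e-12)
            trials = sum(row[j] * n for row, (k, n) in zip(resp, sites))
            if trials > 0:
                rates[j] = sum(row[j] * k for row, (k, n) in zip(resp, sites)) / trials
    loglik = sum(log_sum_exp(site_log_terms(k, n, rates, weights)) for k, n in sites)
    return rates, weights, loglik


def select_rate_classes(counts, coverage, max_classes=5):
    """Add rate classes until AIC stops improving; returns (aic, rates, weights)."""
    best = None
    for k in range(1, max_classes + 1):
        rates, weights, loglik = fit_mixture(counts, coverage, k)
        aic = 2 * (2 * k - 1) - 2 * loglik  # k rates plus k - 1 free proportions
        if best is not None and aic >= best[0]:
            break
        best = (aic, rates, weights)
    return best
```

The responsibilities computed in the E-step are the per-site posterior probabilities of class membership, so the same quantities could be reused when assigning individual sites to rate classes.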
Phase 5: Selection analysis
Selection at sites is evaluated using all pairwise comparisons between reads. We estimate the ratio of observed non-synonymous to synonymous substitutions (weighted by the number of pairwise comparisons) and compare
this to that expected given the observed codon frequencies and the genetic code.
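The observed half of this comparison can be illustrated with a short sketch that counts non-synonymous and synonymous differences over all read pairs at one codon site, weighting each codon pair by the number of read pairs that carry it. It uses the standard genetic code. Restricting the count to codon pairs that differ at a single nucleotide is a simplifying assumption of this sketch, and the expected ratio under the observed codon frequencies is not shown.

```python
from collections import Counter
from itertools import combinations, product

# Standard genetic code in the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {"".join(c): aa
                for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def pairwise_substitution_counts(codons):
    """Return (nonsynonymous, synonymous) counts over all read pairs at a site.

    codons: the codon observed in each read at one site.
    Only codon pairs differing at exactly one nucleotide are classified.
    """
    counts = Counter(c for c in codons if c in GENETIC_CODE)
    nonsyn = syn = 0
    for (a, n_a), (b, n_b) in combinations(counts.items(), 2):
        if sum(x != y for x, y in zip(a, b)) != 1:
            continue  # multi-nucleotide differences are ignored in this sketch
        pairs = n_a * n_b  # number of read pairs carrying this codon pair
        if GENETIC_CODE[a] == GENETIC_CODE[b]:
            syn += pairs
        else:
            nonsyn += pairs
    return nonsyn, syn


if __name__ == "__main__":
    site = ["AAA"] * 3 + ["AAG"] * 2 + ["AGA"]
    print(pairwise_substitution_counts(site))
```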
Phase 6: Drug resistant mutation analysis
For each drug resistant site we estimate the mutation rank (i.e. the rank of its mutation rate with respect to all other sites) and calculate the median mutation rank of all drug resistant sites. The probability (P)
that the median mutation rank at drug resistant sites is greater than that of an equivalent-sized sample of non-drug resistant sites is evaluated with permutations (n=1000). These data can be used to determine whether mutation
properties at drug resistant sites differ from those at other sites. Furthermore, we classify drug resistant sites into mutation rate classes using the same methods described in the mutation rate estimation phase. Here we evaluate
the posterior probability that a drug resistant site falls within a particular mutation rate class.
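A minimal sketch of the permutation step follows. The function names, the tie-breaking rule for ranks and the fixed random seed are assumptions made for illustration. P is taken to be the fraction of random samples of non-drug resistant sites whose median rank is below the observed median.

```python
import random
from statistics import median


def rank_sites(rates):
    """Rank sites by mutation rate (1 = lowest rate); ties broken by site order."""
    order = sorted(range(len(rates)), key=lambda i: rates[i])
    ranks = [0] * len(rates)
    for rank, site in enumerate(order, start=1):
        ranks[site] = rank
    return ranks


def permutation_p_value(rates, dr_sites, n_perm=1000, seed=0):
    """Fraction of permutations in which the drug resistant median rank is greater.

    rates: per-site mutation rates; dr_sites: indices of drug resistant sites.
    Requires at least as many non-drug resistant sites as drug resistant ones.
    """
    ranks = rank_sites(rates)
    dr = set(dr_sites)
    observed = median(ranks[i] for i in dr)
    others = [ranks[i] for i in range(len(rates)) if i not in dr]
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(others, len(dr))  # equivalent-sized random sample
        if observed > median(sample):
            hits += 1
    return hits / n_perm


if __name__ == "__main__":
    rates = [0.001 * i for i in range(20)]
    print(permutation_p_value(rates, dr_sites=[17, 18, 19]))
```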
Phase 7: Identification of drug resistant compensatory mutations
This analysis phase screens reads for the occurrence of both drug resistant and compensatory mutation sites. A Fisher's exact test is performed to determine whether drug resistant mutations and compensatory mutations
co-occur on the same read more frequently than expected by chance.
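The test can be sketched from first principles using the hypergeometric distribution. The helper names and the input layout are hypothetical, and a one-sided alternative (more co-occurrence than expected) is assumed because the text asks whether the mutations occur together more often than chance would predict.

```python
from math import comb


def cooccurrence_table(reads):
    """Build a 2x2 table from (has_resistance, has_compensatory) flags per read.

    Returns (a, b, c, d): both mutations, resistance only,
    compensatory only, neither.
    """
    a = b = c = d = 0
    for resistant, compensatory in reads:
        if resistant and compensatory:
            a += 1
        elif resistant:
            b += 1
        elif compensatory:
            c += 1
        else:
            d += 1
    return a, b, c, d


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact p-value: P(X >= a) under the hypergeometric null."""
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    tail = sum(comb(row1, x) * comb(n - row1, col1 - x)
               for x in range(a, upper + 1))
    return tail / comb(n, col1)


if __name__ == "__main__":
    reads = ([(True, True)] * 3 + [(True, False)]
             + [(False, True)] + [(False, False)] * 3)
    print(fisher_exact_greater(*cooccurrence_table(reads)))
```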
All results are presented online. A result database is also available for each processed gene for downstream analysis. We are writing dedicated HyPhy scripts for this purpose, which will be
made available here.
UCSD Viral Evolution Group 2004-2017