Tag Clustering methodology

Tag clustering is a methodology that measures the errors introduced into the data during library construction and sequencing, and also identifies SAGE tag sequences that are of the highest quality.

Using the Phred scores (a quality estimate) associated with each tag sequence, similar tags (single-base discrepancies or singletons) are clustered so that tags likely to originate from a common transcript in the original RNA are combined. The hypothesis is that tag sequences which fail to match an available sequence resource (e.g. RefSeq, MGC, Ensembl) and differ from a highly abundant tag at only one base position are most likely artifacts of the library construction or sequencing process. These tags are therefore combined with their "parent" tag to reflect the abundance of the transcript more accurately.

Off-by-one tag sequences that matched multiple tags were clustered with the most abundant of those tags. The total library error rate was then calculated as the ratio of reclassified tags to parent tags.
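The clustering step can be illustrated with a minimal Python sketch. It is not the original implementation: the function names, the restriction of candidate "child" tags to singletons, and the choice of denominator (the pre-merge counts of all parent tags) are assumptions made for illustration only.

```python
from collections import Counter

BASES = "ACGT"


def one_off_neighbours(tag):
    """Yield every sequence that differs from `tag` at exactly one position."""
    for i, base in enumerate(tag):
        for alt in BASES:
            if alt != base:
                yield tag[:i] + alt + tag[i + 1:]


def cluster_tags(counts, known_tags):
    """Fold unmatched singleton tags into their most abundant one-off neighbour.

    counts     -- mapping of tag sequence to observed count
    known_tags -- tags that match a sequence resource; these are never merged
    Returns (clustered counts, error rate), where the error rate is taken to be
    reclassified tags / parent tags (assumption: parent counts before merging).
    """
    clustered = Counter(counts)
    reclassified = 0
    parents = set()
    for tag, n in counts.items():
        # Assumption: only unmatched singletons are candidates for merging.
        if tag in known_tags or n != 1:
            continue
        # Candidate parents are strictly more abundant one-off neighbours.
        candidates = [t for t in one_off_neighbours(tag) if counts.get(t, 0) > n]
        if not candidates:
            continue
        # With several candidates, the most abundant one wins.
        parent = max(candidates, key=lambda t: counts[t])
        clustered[parent] += n
        del clustered[tag]
        reclassified += n
        parents.add(parent)
    parent_total = sum(counts[p] for p in parents)
    error_rate = reclassified / parent_total if parent_total else 0.0
    return clustered, error_rate


# Hypothetical 4-base tags, used only to keep the example short.
example = {"AAGT": 5, "ACGA": 9, "ACGT": 1}
merged, rate = cluster_tags(example, known_tags=set())
```

Here the singleton ACGT is one base away from both AAGT and ACGA, and it is merged into ACGA because that tag is more abundant.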

This error rate combines the DNA sequencing error rate with the library construction (reverse transcription, PCR, ligation) error rate. The library construction error rate was taken to be the asymptotic limit of the total error rate computed after the removal of tags whose Phred scores indicated a high error probability.

The p-value for each tag in a library is calculated as:

ptag(tag, library) = 1-((1-pseq(tag, library))*(1-pPCR(library)))
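As a worked illustration, the formula can be written in Python as follows. The conversion from a Phred score Q to an error probability, 10^(-Q/10), is the standard Phred definition. How pseq is derived from the per-base scores is not stated here, so the `p_seq` helper below (probability that at least one base call is wrong, assuming independent bases) is an assumption, as are the function names and the example values.

```python
def phred_to_error(q):
    """Standard Phred definition: error probability = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def p_seq(phred_scores):
    """Assumed definition: probability that at least one base call in the tag
    is wrong, treating the per-base errors as independent."""
    p_all_correct = 1.0
    for q in phred_scores:
        p_all_correct *= 1.0 - phred_to_error(q)
    return 1.0 - p_all_correct


def p_tag(p_seq_value, p_pcr):
    """ptag = 1 - ((1 - pseq) * (1 - pPCR)), as in the formula above."""
    return 1.0 - (1.0 - p_seq_value) * (1.0 - p_pcr)


# Hypothetical inputs: a 10-base tag with Phred 30 at every base and a
# library construction error rate of 0.02.
example_p_tag = p_tag(p_seq([30] * 10), p_pcr=0.02)
```

The tag is counted as error-free only if both the sequencing step and the library construction step are error-free, which is why the two complements are multiplied.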

A separate p-value was also computed for each tag sequence, which is dependent on its occurrence in the library. Setting n to the count for a tag in the library yielded:

ptype(tag, library) = 1-((1-ptag(tag, library))^n)
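A short Python sketch of this second p-value follows. It assumes the expression is read as 1 - (1 - p)^n, with p being the per-tag value ptag(tag, library) and n the tag count; the function name and the example numbers are hypothetical.

```python
def p_type(p_tag_value, n):
    """ptype = 1 - (1 - ptag)^n, where n is the count of the tag in the library.

    Assumption: p in the formula is the per-tag p-value ptag(tag, library).
    """
    return 1.0 - (1.0 - p_tag_value) ** n


# Hypothetical example: a per-tag p-value of 0.1 for a tag observed twice.
example_p_type = p_type(0.1, 2)
```

For n = 1 the result reduces to the per-tag value itself.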