Summer project update - Sequence editing almost all done!
And just like that, it's 5 months since my last update. Darn. Time really does fly.
It's been a very busy five months, what with a deluge of AP and IB tests, college move-in preparations, job applications, summer reading and a host of other obligations tackling me right when I thought I had the perfect several weeks set aside for working solely on my research. I haven't been able to update this page nearly as much as I would've wanted to (that is to say, at all), so sorry for that... However, I'm finally back, one month before my official move-in day as an incoming freshman, so I hope to post two or three lab notes before I inevitably disappear again for a few weeks. (That first quarter is gonna be a tough one.)
Anyway, a summary of what I've been working on this summer: sequence editing.
A lot of sequence editing.

This summer, with no homework from high school and an extra month's length thanks to the quarter system, has been the perfect time for me to sit down and really polish all of the sequences I've obtained from the two rounds of sequencing associated with this project: the first one in August of last year (feels like almost yesterday!) and the second one from about half a year ago now, in February. This process has mainly consisted of ironing out the reads with alignment-based proofreading and piecing together the halves of the longer gene fragments I sequenced (namely rbcL, 18S, and 26S).
The former task I accomplished thanks to the GenBank database and BLAST. The general procedure: I match my sequenced fragment with the most identical sequences suggested online, and slowly and manually check every non-identical base in the aligned portion output by the program. If there is a "spelling" error (a miscalled, uncalled, extra or missing base), I make a manual correction using an ab1 file viewer (Finch TV); if the bases are non-identical but the ab1 trace strongly supports the base that differs from the database, I leave it as is. As I don't have Phred scores or another objective means for determining the accuracy of each base call in these non-identical cases (and many of these calls would, I would guess, fall below any reasonable confidence threshold anyway), my decision to keep the original base or change it in favor of the database's suggestion may be somewhat subjective.
Which is why, especially for the longer sequences, part two of the editing process is so important. After I finish editing the forward and reverse fragments of each gene region (which are, of course, reverse complements of each other where they overlap), I use MEGA7 to align them (with the ClustalX plugin for rbcL, and manually for the 18S and 26S; the latter two regions only overlap by about 100-300 bases, so ClustalX doesn't really know how to align them).
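For the curious, the base-by-base check against a database match could be sketched in Python roughly as below. This is only an illustration under assumptions, not the actual workflow (which was done by eye with the BLAST output and Finch TV): the function name `list_differences` and the two example rows are made up, and the input is assumed to be the two gapped rows of a pairwise alignment, with `-` marking a gap.

```python
def list_differences(read_aln, ref_aln):
    """Report every alignment column where the read and the reference differ.

    Both arguments are gapped rows of equal length from a pairwise alignment.
    Returns tuples of (position in the ungapped read, read base, reference
    base, kind of difference). For a missing base, the position is that of
    the last read base before the gap.
    """
    if len(read_aln) != len(ref_aln):
        raise ValueError("aligned rows must have the same length")

    differences = []
    read_pos = 0  # 1-based position in the ungapped read
    for read_base, ref_base in zip(read_aln.upper(), ref_aln.upper()):
        if read_base != "-":
            read_pos += 1
        if read_base == ref_base:
            continue
        if read_base == "-":
            kind = "missing base"    # reference has a base the read lacks
        elif ref_base == "-":
            kind = "extra base"      # read has a base the reference lacks
        elif read_base == "N":
            kind = "uncalled base"   # base caller could not decide
        else:
            kind = "mismatch"        # possible miscall, or a real difference
        differences.append((read_pos, read_base, ref_base, kind))
    return differences


# Made-up example rows, not real data from this project.
read_row = "ACGTNACG-TAC"
ref_row = "ACGTTAC-ATAG"
for difference in list_differences(read_row, ref_row):
    print(difference)
```

A list like this only says where to look. Whether a flagged base is a sequencing error or a genuine difference from the database still has to be judged from the trace itself.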
Afterwards, I open a Word document, paste the forward portion of the FASTA text in there, and then examine every non-identical base in the overlapping region. This is sort of my quality check and another way for me to proofread the sequences in their weakest, least reliable stretches. In many cases, the forward and reverse reads can each be as long as 1100 bases once cleaned up with the GenBank alignments and BLAST, so after the reverse read is reverse-complemented, it's the weakest 200 to 300 bases at the end of each read that get paired together. In addition to the GenBank comparisons, therefore, I compare the two reads to each other and have one final chance to make edits before I grab the second half of the reverse read, paste it onto the end of the edited forward read, and create my full gene region.
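The stitching step can be sketched in Python too. Again, this is a simplified illustration and not the MEGA7-plus-Word procedure described above: the names `reverse_complement` and `merge_reads`, the `min_overlap` and `max_mismatches` parameters, and the toy sequences are all assumptions. It also assumes the overlap contains no inserted or missing bases, which real read ends often violate.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (N stays N)."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_reads(forward, reverse, min_overlap=20, max_mismatches=2):
    """Join a forward read to the reverse complement of a reverse read.

    Tries the longest candidate overlap first and accepts the first one with
    at most max_mismatches disagreements. Bases inside the overlap are taken
    from the forward read; each disagreement is returned as
    (position in forward read, forward base, reverse-read base) so it can be
    checked by hand against the traces.
    """
    rc = reverse_complement(reverse)
    longest = min(len(forward), len(rc))
    for size in range(longest, min_overlap - 1, -1):
        tail = forward[-size:]  # end of the forward read
        head = rc[:size]        # start of the reverse-complemented read
        offset = len(forward) - size
        disagreements = [
            (offset + i + 1, f_base, r_base)
            for i, (f_base, r_base) in enumerate(zip(tail, head))
            if f_base != r_base
        ]
        if len(disagreements) <= max_mismatches:
            # Forward read, then the part of the reverse read past the overlap.
            return forward + rc[size:], disagreements
    raise ValueError("no acceptable overlap found")


# Toy reads, not real data: the forward read has one deliberate miscall near its end.
forward_read = "ATGCGTACCGGACT"
reverse_read = "TGGACTGTAATCCGGT"
merged, to_check = merge_reads(forward_read, reverse_read, min_overlap=6, max_mismatches=1)
print(merged)
print(to_check)  # [(13, 'C', 'T')]
```

Keeping the forward read's base in the overlap is just the simplest choice for a sketch. In practice each flagged position is resolved by looking at both traces, since the overlap sits in the weakest part of each read.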
Two months later, and most everything is complete! I'm finishing up the editing process for perhaps a dozen ugly, messy sequences I saved for the very end, but otherwise more than 90% of the sequences are ready to go for phylogenetic tree construction, accession to GenBank, and more fun stuff! I already reported on the UPA sequences, which I completed in March ('cause those were really short and there were just a few of them), but here's a summary of everything else that's been done since.
UPA: 17 sequences suitable for analysis, almost all coming from the second round of sequencing (antibiotic-treated cultures); the first round was almost entirely useless, thanks to UPA picking up lots of bacterial DNA. The exact strains with good sequences: JIACs 1, 3, 4, 5, 6, 7, 10, 12, 13, 13i, 15, 27, 31, 32, 33, 36, and VolvFL. Average length runs around 350 bp.
tufA: 30 sequences survived! About one-quarter came from the first round of sequencing; the other three-quarters had to be repeated with the antibiotic-treated cultures after the first attempts pulled up lots of bacterial DNA. The exact strains with good sequences: JIACs 1, 2, 3, 4, 5, 8, 9, 10, 11, 12, 13, 13i, 15, 18, 19, 20, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, and Sel (a designation for some very old Ankistrodesmus-like frozen material collected and cultured back in Chico). Average length runs around 850 bp.
ITS: 28 sequences, a much higher number than I expected from the painstaking process of trying to amplify them. (They got so much benefit of the doubt from me - I reran some of those PCRs like 5 times.) An even mix from each round of sequencing. The exact strains with good sequences: Fila1, Fila2, Fila3, JIACs 1, 2, 3, 4 (accessed very long ago), 5, 6, 9, 10, 11, 12, 18, 19, 20, 21, 22, 24, 25, 26, 30, 31, 34, 35, 36, OedoSF, "Rhizo" (a sample of filamentous algal material washed and collected from Chico), and VolvFL. A good length for these is in the low to mid 600s bp.
rbcL: 21 full sequences. Sadly, many of the Scenedesmaceae could not be pieced together because they were just maybe 50 bp short of an overlap; budget being a budget, and convenience being a big factor in determining the best barcode, I did not elect to sequence the middle of the gene independently using conserved regions as primers. Oh well. At least full sequences could be amplified and portions sequenced just as well from both rounds of experiments. The exact strains with full sequences: Fila2 and OedoSF (to be discussed at a later time), JIACs 1, 2, 3, 4, 7, 10, 11, 13, 15, 17, 18, 21, 22, 24, 26, 27, 32, 33, and Spiro. Full length runs around 1300-1350 bp; partial sequences are between 600 and 1000 bp, and could theoretically be partitioned for analysis alongside full sequences. I'll work that out later.
18S: 18 full sequences so far. While both rounds of sequencing generated a number of useful sequences, the second round had a noticeable systematic issue where the forward read would "fade" really quickly, leaving only about 600 bp of alignable nucleotides. This phenomenon was not observed in the first round of forward reads, which could stretch over 1000 bp easily. In any case, many of the second-round strains' sequences were successfully assembled. The exact strains with full sequences: JIACs 2, 3, 4, 10, 12, 15, 21, 22, 23, 24, 25, 26, 28, 30, 31, 33, Sel, and VolvFL. Good lengths ranged from 1650 to maybe 1750 bp?
26S: 14 full sequences so far, but editing is still largely a work in progress, so I hope this number grows. Both rounds of sequencing generated fabulous reads; the only problem is that it's just such a large fragment that without dedicated sequencing of the middle of the gene, it's difficult to assemble the full read. (Thank GOODNESS for 1100 bp+ forward and reverse reads for many strains.) The exact strains with full sequences: JIACs 3, 4, 10, 15, 19, 22, 25, 26, 27, 28, 30, 31, 35, and Spiro. The average length for one of these is around 2000-2050 bp.

Anyway, I should be finishing up the final edits in the next two weeks or so, after which I start to leave the computer running overnight for Bayesian and ML phylogenetics tests. I can't wait to share those, so stay tuned!