Data!
Hi backers!
We finally received the DNA sequencing data from uBiome. Altogether it's about 12 GB of data in FASTQ format: basically fancy text files containing strings of A, T, C, and G (the four bases of DNA), quality/confidence scores, and other metadata.
To analyze the data, I'm currently building a computational data analysis pipeline in Python, which processes the data and calculates metrics.
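For the curious, here is a minimal sketch of what reading a FASTQ file can look like in Python. It is an illustration only, not the project's pipeline code. It assumes the common four-line record layout (header, sequence, a `+` separator, quality string) and Phred+33 quality encoding; the function name `read_fastq` and the example read are made up for this sketch.

```python
from io import StringIO


def read_fastq(handle):
    """Yield (read_id, sequence, phred_scores) for each 4-line FASTQ record.

    Assumes unwrapped 4-line records and Phred+33 quality encoding.
    """
    while True:
        header = handle.readline().rstrip()
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip()
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip()
        # Each quality character encodes one score: ASCII code minus 33.
        scores = [ord(ch) - 33 for ch in quality]
        yield header[1:], sequence, scores  # drop the leading '@'


# Hypothetical single-record example, held in memory instead of on disk.
example = StringIO("@read1\nACGT\n+\nIIII\n")
records = list(read_fastq(example))
```

Each record comes back as a read name, its bases, and one confidence score per base, which is the information the later filtering steps work from.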
The main steps in processing the data are:
Merge the reads. It's a bit like taking two photos side by side, then combining the photos where they overlap. In the same way, we stitch together a longer DNA sequence from two shorter ones.
Trimming and quality filtering. Artifacts from DNA amplification and sequencing are present in sequencing data, so we discard reads that do not meet our quality thresholds.
OTU (Operational Taxonomic Unit) clustering. This groups similar sequences in each sample into what are called OTUs, which are representative of species. From here, we can look up each OTU against large 16S databases to figure out what bacteria each DNA read came from.
Calculating diversity metrics and identifying significant changes between the Soylent and regular diet groups.
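To make the steps above a bit more concrete, here is a toy sketch of three of them: merging two overlapping reads, filtering on quality, and computing a diversity metric. This is not the actual pipeline. Tools such as UPARSE, QIIME, and mothur do these steps far more carefully (for example, tolerating mismatches in the overlap). The sketch assumes paired reads where the second read is the reverse complement of the far end of the fragment, uses a mean-quality cutoff of 20 purely as an example threshold, and uses the Shannon index as one common diversity metric. All function names and parameters are hypothetical.

```python
import math

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(forward, reverse, min_overlap=4):
    """Stitch two reads together where they overlap exactly.

    Tries the longest overlap first; returns None if no overlap of at
    least min_overlap bases is found. Exact matching is a simplification.
    """
    rc = reverse_complement(reverse)
    for overlap in range(min(len(forward), len(rc)), min_overlap - 1, -1):
        if forward[-overlap:] == rc[:overlap]:
            return forward + rc[overlap:]
    return None


def passes_quality(scores, min_mean=20):
    """Keep a read only if its mean quality score meets the threshold."""
    return bool(scores) and sum(scores) / len(scores) >= min_mean


def shannon_index(otu_counts):
    """Shannon diversity (natural log) from a dict of OTU -> read count."""
    total = sum(otu_counts.values())
    return -sum(
        (n / total) * math.log(n / total)
        for n in otu_counts.values()
        if n > 0
    )


# Made-up reads that overlap by five bases ("CCGGG").
merged = merge_pair("AAACCCGGG", "AAACCCGG")
# Made-up sample with two equally abundant OTUs.
diversity = shannon_index({"otu_a": 5, "otu_b": 5})
```

A sample dominated by a single OTU scores close to zero on the Shannon index, while a sample with many evenly represented OTUs scores higher, which is why this kind of metric is useful for comparing diet groups.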

Here's a screenshot showing some of the process. On the top left we have an open FASTQ file with sequences of DNA. As you can tell, it's not really meant for humans to read. On the bottom left we have part of the analysis pipeline running. On the right side, there's some of the code I've written to manipulate the data. Some of the bioinformatic tools we are going to use include UPARSE, QIIME, and mothur.
Once we have our initial analysis complete, we will be meeting with faculty at UC Berkeley to get some more perspectives on the data. We also have a few members in the Arkin Lab who are guiding us in formally writing up our findings.
We'll keep you up to date with what we find!