Bioinformatics Plan

Oct 31, 2023


This lab note describes in detail the bioinformatics portion of our project and how it supports our goals. We are flexible about the configurations, software, and tools we use and will most likely experiment to see what works best. The steps below should therefore be treated as an initial plan that is subject to change. We draw a lot of inspiration from the bioinformatics work done by Tobias Messmer in his single-cell analysis of bovine cells (https://doi.org/10.3389/fnut.2023.1212196); however, most of the tasks are quite general and carry over to other single-cell analysis projects.
Our goals:
Identify and classify the different cell types and subpopulations => useful for learning about cell composition and their relationship with phenotypic changes.
Identification of cell surface markers for each cell type => useful for FACS and eventual antibody development.
Gain insight on how the changes in gene expression and chromatin accessibility affect the proliferation/differentiation and activity of cells => useful for optimizing growth conditions and the eventual development of serum free medium.
Create a killifish single-cell foundation model using our data => useful for extracting additional bioinformatics insights via machine learning and helping future researchers speed up their data analysis.
[SECONDARY GOAL] Create a cultivated meat single-cell data atlas platform => useful for promoting the sharing of data among researchers in cultivated meat development.


Step 1: Preprocessing and Cleaning
Demultiplexing
Use CellRanger’s mkfastq function
Align to reference genome
Reads will be aligned to the Fundulus heteroclitus (mummichog/killifish) genome: https://useast.ensembl.org/Fundulus_heteroclitus/Info/Annotation
Quality control 
All cells must pass a quality control filter that retains only cells within 3 Median Absolute Deviations (MAD) of the median for three metrics: number of expressed genes, total counts, and percentage of mitochondrial genes.
Explanation:
What is “Within X Median Absolute Deviations (MAD) of the Median”: The MAD is a robust measure of the variability of a set of values: the median of the absolute differences between each value and the overall median. In this context, a cell is retained only if its number of expressed genes, its total counts, and its percentage of mitochondrial genes each lie within X MADs of the corresponding median across all cells.
Why Expressed Genes: This refers to the number of unique genes that are detected as being actively expressed in a single cell. Higher numbers of expressed genes can indicate higher data quality.
Why Total Counts: This represents the total number of sequencing reads or fragments mapped to genes in a single cell. It's an indication of how much data was collected for each cell.
Why Percentage of Mitochondrial Genes: High percentages of mitochondrial genes can be indicative of poor cell quality, as damaged or dying cells tend to have higher expression of mitochondrial genes.
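
The MAD filter described above can be sketched in a few lines of plain Python. This is only an illustration of the rule as stated (symmetric bounds of 3 MADs on each metric). The function names, the metric keys (n_genes, total_counts, pct_mito) and the toy values are assumptions made for this example, not part of our pipeline.

```python
from statistics import median

def mad_bounds(values, n_mads=3.0):
    """Return (lower, upper) bounds lying n_mads MADs either side of the median."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    return med - n_mads * mad, med + n_mads * mad

def passes_qc(cells, metrics=("n_genes", "total_counts", "pct_mito"), n_mads=3.0):
    """Return one boolean per cell: True if every metric lies within its bounds."""
    bounds = {m: mad_bounds([c[m] for c in cells], n_mads) for m in metrics}
    return [all(bounds[m][0] <= c[m] <= bounds[m][1] for m in metrics) for c in cells]

# Toy data (made up): the last cell has few genes, few counts and a high
# mitochondrial percentage, so it should be removed.
cells = [
    {"n_genes": 2000, "total_counts": 8000, "pct_mito": 4.0},
    {"n_genes": 2100, "total_counts": 8500, "pct_mito": 5.0},
    {"n_genes": 1900, "total_counts": 7800, "pct_mito": 4.5},
    {"n_genes": 2050, "total_counts": 8200, "pct_mito": 5.5},
    {"n_genes": 300, "total_counts": 900, "pct_mito": 40.0},
]
keep = passes_qc(cells)
filtered = [c for c, ok in zip(cells, keep) if ok]
```

With real data the same rule would be applied to per-cell metrics computed from the count matrix. Whether to apply it on a log scale, or only on one side for the mitochondrial percentage, is a choice we would settle by inspecting the distributions.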

Normalization
The raw gene expression counts (the number of times a specific gene was sequenced in each cell) will be normalized to account for factors such as percentage of mitochondrial genes, library size, number of genes, and cell cycle effects. This ensures the data can be compared across timepoints. The Seurat library’s sctransform can be used to accomplish this. We will start with sctransform's default parameters and adjust as needed.
Explanation:
Why Percentage of Mitochondrial Genes: A high percentage of mitochondrial genes can be indicative of poor cell quality. "Regressing out" means that the effect of mitochondrial gene expression on the data is statistically adjusted for, effectively removing this potential source of variation.
Why Library Size: The total number of sequencing reads or fragments mapped to genes in each cell can vary. This is referred to as library size. Regressing out library size helps to correct for differences in sequencing depth between cells.
Why Number of Genes: Some cells might express a higher number of genes than others. This can be due to differences in cell type, activation state, or other biological factors. Regressing out the number of genes helps to account for this variation.
Why Cell Cycle Effects: The cell cycle is a natural process in which cells grow, divide, and replicate their DNA. Cell cycle effects can introduce variation in gene expression. By regressing out cell cycle effects, the data is adjusted to minimize the impact of cell cycle-related variations.
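
As a rough illustration of what "regressing out" a covariate means, the sketch below log-normalizes raw counts by library size and then removes the linear effect of a single covariate (here, percentage of mitochondrial genes) from one gene's expression. This is not sctransform, which fits a regularized negative binomial model per gene. The function names, the scale factor of 10,000 and the toy values are assumptions for illustration only.

```python
import math
from statistics import mean

def log_normalize(counts, scale=10_000):
    """Scale one cell's counts to a common library size, then apply log1p."""
    total = sum(counts)
    return [math.log1p(c / total * scale) for c in counts]

def regress_out(y, x):
    """Return the residuals of y after a least-squares fit on one covariate x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx if sxx else 0.0
    return [yi - (my + slope * (xi - mx)) for xi, yi in zip(x, y)]

# One cell with three genes (toy counts).
normalized = log_normalize([1, 1, 2])

# One gene measured across four cells, with each cell's mitochondrial percentage.
gene_expr = [1.0, 2.5, 2.5, 4.0]
pct_mito = [2.0, 4.0, 6.0, 8.0]
residuals = regress_out(gene_expr, pct_mito)
```

The residuals are what remains of the gene's expression once the part explained by the covariate has been subtracted. Regressing out several covariates at once (library size, number of genes, cell cycle scores) follows the same idea with a multiple regression.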


Step 2: Manual data analysis
Image sources: All example images were taken from Figure 2 in https://doi.org/10.3389/fnut.2023.1212196

Differential gene analysis
scRNA UMAP + clustering: Run UMAP dimensionality reduction at each timepoint and then cluster the cells. This will reveal the number of cell types along with their distinct gene expression profiles.
 
 
scATAC UMAP + clustering:  Run UMAP dimensionality reduction at each timepoint and then cluster the cells. This will reveal the number of cell types along with their distinct chromatin accessibility profile.
Combined scRNA+scATAC UMAP and clustering: Run the above steps except the data vector of each cell will be a concatenation of its gene expression and chromatin accessibility data. That way, the scATAC is essentially providing additional dimensionality onto the existing scRNA data and this will be factored into the UMAP dimensionality reduction.
This step will give us insights into the following:
How does the composition of cell types change over time? Which cell types persist in culture until timepoint 3?
Do the gene expression and chromatin accessibility profiles produce different clusterings? A difference may indicate that some cell types correspond to multiple profiles, for example one gene expression profile but two or more chromatin accessibility profiles. How does this change when gene expression and chromatin accessibility data are combined?
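
The combined scRNA+scATAC input described above (one concatenated vector per cell) could look roughly like the following before it is handed to UMAP and clustering. Scaling each feature before concatenation is our assumption, so that neither modality dominates purely because of its units. The function names and the toy matrices are hypothetical.

```python
from statistics import mean, pstdev

def zscore_columns(matrix):
    """Z-score each feature (column) across cells (rows); constant columns become 0."""
    columns = list(zip(*matrix))
    stats = [(mean(col), pstdev(col)) for col in columns]
    return [
        [(value - m) / s if s else 0.0 for value, (m, s) in zip(row, stats)]
        for row in matrix
    ]

def concat_modalities(rna, atac):
    """Concatenate per-cell RNA and ATAC feature vectors after scaling each modality."""
    if len(rna) != len(atac):
        raise ValueError("both modalities must describe the same cells in the same order")
    return [r + a for r, a in zip(zscore_columns(rna), zscore_columns(atac))]

rna = [[5.0, 0.0], [3.0, 1.0], [1.0, 2.0]]                    # 3 cells x 2 genes
atac = [[1.0, 0.0, 1.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]    # 3 cells x 3 peaks
joint = concat_modalities(rna, atac)                          # 3 cells x 5 features
```

Each row of joint is then the data vector for one cell. In practice the two modalities have very different numbers of features, so some per-modality dimensionality reduction or weighting would likely be needed before concatenation.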

Regulatory element identification
Annotate the accessible genomic regions in the scATAC data with their regulatory elements. Combined with the scRNA data, this lets us identify genes in close proximity to accessible regions, which helps link regulatory elements to the genes they are likely to regulate.


This step will give us insights into the following:
What are the regulatory elements that control gene expression? Can we identify the regulatory elements that control each gene?
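
A minimal sketch of the proximity-based linking described above: each accessible region (peak) is linked to genes on the same chromosome whose transcription start site (TSS) lies within a fixed distance of the peak centre. The 50 kb window, the peak and gene names and the coordinates are all illustrative assumptions. Proximity only suggests a candidate link and does not prove regulation.

```python
def link_peaks_to_genes(peaks, genes, max_distance=50_000):
    """Map each peak to nearby genes (same chromosome, TSS within max_distance bp).

    peaks: {peak_id: (chrom, start, end)}
    genes: {gene_id: (chrom, tss)}
    Genes are returned nearest first.
    """
    links = {}
    for peak_id, (chrom, start, end) in peaks.items():
        center = (start + end) // 2
        nearby = [
            (abs(center - tss), gene)
            for gene, (gene_chrom, tss) in genes.items()
            if gene_chrom == chrom and abs(center - tss) <= max_distance
        ]
        links[peak_id] = [gene for _, gene in sorted(nearby)]
    return links

# Hypothetical peaks and genes.
peaks = {
    "peak1": ("chr1", 900, 1_100),
    "peak2": ("chr1", 500_000, 500_400),
    "peak3": ("chr2", 100, 300),
}
genes = {"geneA": ("chr1", 2_000), "geneB": ("chr1", 40_000), "geneC": ("chr2", 90_000)}
links = link_peaks_to_genes(peaks, genes)
```

Candidate links found this way can then be checked against the scRNA data, for example by asking whether accessibility of a peak and expression of its linked gene vary together across cells.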

Functional enrichment analysis 
Retrieve a list of the most significantly enriched GO terms between timepoints. The EnrichR package can be used; EnrichR is a widely used web-based tool and software package for functional enrichment analysis. Functional enrichment analysis takes a list of differentially expressed genes (up-regulated and down-regulated) and finds the terms/processes most significantly associated with the up-regulated and down-regulated genes (the example image from Figure 2 of the paper cited above shows the top 5 terms).
 
 
This step will give us insights into the following:
What biological processes or activities are overrepresented or underrepresented at each timepoint? 
Do different cell types have different lists of significantly enriched GO terms?
Do changes in the GO terms correspond to phenotypical changes of the cells in culture?
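
To make the idea of "significantly enriched" concrete, the sketch below scores each term with a one-sided hypergeometric test: given how many genes in the background carry the term, how surprising is the overlap with our differentially expressed list? This is a generic over-representation test written for illustration and is not a reimplementation of EnrichR. The gene identifiers and term names are made up, and no multiple-testing correction is applied.

```python
from math import comb

def hypergeom_pvalue(n_universe, n_term, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn from n_universe,
    of which n_term are annotated with the term."""
    upper = min(n_term, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_term, k) * comb(n_universe - n_term, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

def rank_terms(de_genes, term_to_genes, universe):
    """Return (p_value, term, overlap) tuples sorted from most to least enriched."""
    universe = set(universe)
    de = set(de_genes) & universe
    results = []
    for term, members in term_to_genes.items():
        members = set(members) & universe
        overlap = len(de & members)
        p = hypergeom_pvalue(len(universe), len(members), len(de), overlap)
        results.append((p, term, overlap))
    return sorted(results)

# Toy example: 20 background genes, two hypothetical terms of 5 genes each.
universe = [f"g{i}" for i in range(20)]
term_to_genes = {
    "muscle cell differentiation": [f"g{i}" for i in range(0, 5)],
    "cell cycle": [f"g{i}" for i in range(5, 10)],
}
de_genes = ["g0", "g1", "g2", "g3", "g10"]
ranked = rank_terms(de_genes, term_to_genes, universe)
```

Running the same ranking separately on the up-regulated and down-regulated gene lists for each pair of timepoints gives the per-timepoint term lists discussed above.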

Surface Receptor Analysis
Identify the expression of surface receptors in each different cell type at each timepoint. Surface receptors can be identified by filtering for protein-coding genes that are located in the plasma membrane. This can be visualized using gene expression plots or violin plots. Finding the receptors with the highest expression levels for each cell type will be useful for eventual physical cell identification and physical sorting.
  
 
A common method to isolate cells is fluorescence-activated cell sorting (FACS). In a FACS panel, fluorescently labeled antibodies bind to specific cell surface receptors. Engineering these antibodies first requires knowing which cell surface receptors exist on different fish cells, knowledge that is currently missing and that requires more data and analysis to obtain.

This step will give us insights into the following:
Which receptors can we use for flow cytometry staining and a subsequent FACS panel?
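
One possible way to implement the filter described above is sketched here: keep genes whose biotype is protein_coding and whose annotation includes the GO cellular component term for plasma membrane (GO:0005886), then rank the remaining genes by mean expression within each cluster. The annotation layout, gene names and expression values are hypothetical. A real implementation would also need to handle more specific child terms of GO:0005886 and gaps in the killifish annotation.

```python
PLASMA_MEMBRANE = "GO:0005886"  # GO cellular component term "plasma membrane"

def surface_candidates(annotations):
    """Return protein-coding genes annotated to the plasma membrane."""
    return {
        gene
        for gene, info in annotations.items()
        if info["biotype"] == "protein_coding" and PLASMA_MEMBRANE in info["go_terms"]
    }

def top_surface_markers(mean_expression, candidates, n_top=2):
    """For each cluster, rank candidate genes by mean expression and keep the top n_top."""
    ranked = {}
    for cluster, expr in mean_expression.items():
        hits = [(value, gene) for gene, value in expr.items() if gene in candidates]
        ranked[cluster] = [gene for _, gene in sorted(hits, reverse=True)[:n_top]]
    return ranked

# Hypothetical annotation table and per-cluster mean expression.
annotations = {
    "geneA": {"biotype": "protein_coding", "go_terms": {"GO:0005886"}},
    "geneB": {"biotype": "protein_coding", "go_terms": {"GO:0005634"}},  # nucleus
    "geneC": {"biotype": "protein_coding", "go_terms": {"GO:0005886", "GO:0005634"}},
    "geneD": {"biotype": "lncRNA", "go_terms": {"GO:0005886"}},
}
mean_expression = {
    "cluster0": {"geneA": 3.2, "geneB": 5.0, "geneC": 0.4, "geneD": 1.0},
    "cluster1": {"geneA": 0.1, "geneB": 4.8, "geneC": 2.7, "geneD": 0.9},
}
candidates = surface_candidates(annotations)
markers = top_surface_markers(mean_expression, candidates, n_top=1)
```

A plasma membrane annotation does not by itself show that a protein has an extracellular domain accessible to an antibody, so the ranked genes are a starting list for further checks rather than a finished FACS panel.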

Killifish-specific Analysis 
At this point, we will have most likely identified numerous cell types. We also know that the KFE-5 killifish cell line consists of two core phenotypes: mononucleated fibroblastic cells, and elongated myoblastic cells with the capability to differentiate into myocytic cells. We essentially would like to run an analysis to identify the relationship between the different cell types and the two core phenotypes. 

This step will give us insights into the following:
Are certain cell types associated with a certain phenotype? Or perhaps there is no relationship?
How do cells differentiate into the different phenotypes and can any of the information we gathered in the previous analysis steps help us here?


Step 3: Building a foundational single cell omics model
One of our goals in generating single-cell data is to develop a foundational single-cell omics model for killifish, and eventually for other species we may work with. We intend to start with a model that follows the same architecture and training procedure as scGPT (https://www.biorxiv.org/content/10.1101/2023.04.30.538439v2.full.pdf), a human single-cell foundation model trained on over 33 million cells. Because we won’t have this many cells in our killifish datasets, we’ll explore two options: training a model from scratch on our smaller dataset, or using transfer learning to fine-tune a human scGPT model on killifish data.

scGPT achieves great performance on many downstream tasks such as cell-type annotation, multi-batch integration, multi-omic integration, genetic perturbation prediction, and gene network inference. The following will especially be of interest to us:
Cell type classification: UMAP visualization of scGPT’s cell embeddings to reveal its classification of cell types. We can then compare its results to our step 2 manual analysis pipeline.
Multi-Omic Integration: scGPT can learn joint embeddings for the data from both scRNA/transcriptomic and scATAC/epigenomic data that can be used for downstream tasks.
Gene regulatory network: Create a unified network of gene-gene interactions, regulatory elements interactions, perturbations, pathways, functionally related genes, and gene activity across different states/timepoints.

We have access to the SHARCNET high performance computing cluster which gives us the ability to use GPUs to train our machine learning model.
We hope that having a foundational single-cell model for killifish will help us in automating the data analysis of new data. It will also give us the technical knowledge to create future similar models for other fish species. By open-sourcing our model along with the datasets, we hope to create a repository of these models that researchers can use to understand fish biology. Having this model available will speed up their data analysis pipeline and help them gain more insights from their own single-cell multiomic data which will ultimately speed up scientific development in cultivated seafood.


Step 4: Creation of a cultivated meat single-cell RNA atlas platform

Given the lack of publicly available data in the space, there currently is no centralized ‘cultivated meat atlas’ resource, where researchers can easily upload and analyze their data, along with all other existing data in the field. The idea of creating such a platform is one that has been met with very strong enthusiasm from our mentors/collaborators, especially Tobias Messmer of Mosa Meat. Tobias’s research has already produced many single-cell RNA transcriptomic datasets that he’s willing to share contingent on the creation of such a platform, which would pair nicely with the fish-related datasets we will create. Many people in the cultivated meat space have also expressed informal interest in a data-sharing initiative through an organized well-maintained platform that could benefit alternative protein research with data that otherwise would remain private. If created, we plan to upload all our datasets to this platform.

6 comments

Breanna Duffy

Do the other species you plan to apply this to next have established reference genomes? If not, how would that change the application of the Killifish work to more commercially relevant species?

Nov 13, 2023

Kevin Shen (Researcher)

Yup! There are reference genomes available for our next three species.
1) Cyprinus carpio (Common Carp): https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_018340385.1/ Killifish is similar to carp, so carp is the species for which we expect our initial work to yield the most translatable insights.
2) Sander vitreus (Walleye): https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_031162955.1/ To our understanding, walleye is the worst annotated of the three, but it is the species that Dr. Vo has the most expertise in. He has used perch and sea bass (which have better understood genomes) to approximate walleye very successfully, and is familiar with which reagents (media, antibodies, etc.) work best for walleye cells.
3) Oncorhynchus mykiss (Rainbow Trout): https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_013265735.2/

Nov 14, 2023

Breanna Duffy

How useful would this model be for other species? Or would a new species require starting from scratch and developing a new model?

Nov 13, 2023

Rikard Saqe (Researcher)

This is the first attempt at building such a generalized model in a cultivated meat context. To optimize performance on downstream tasks, a new model would have to be developed for every species. However, building these new models would not mean starting from scratch; it would be significantly easier. Once we build our killifish model, there will be two software outcomes: a retraining pipeline that can easily integrate new killifish data generated in the future into the model, and a generalized pipeline for training a new species model with scGPT (or whatever other model we use) as a foundation. With adequate software engineering planning, these pipelines make creating and updating models extremely easy and efficient, even if the models are wholly different. As an example, a semi-supervised transformer pipeline that took 4 months to build in a prior internship of mine took less than a week to repurpose for an entirely different type of biological data (metabolomic vs genomic). This was also several years ago, when transformer infrastructure was significantly less mature than it is now. We see the main uncertainty/time sink in getting the transfer learning to work the first time and establishing the killifish model, with the difficulties touched on in the previous comment above.

Nov 14, 2023

Erin Wilson

explore transfer learning methods

I think transfer learning/pretraining will be really important! Training large models from very little training data is really tough >.<

Nov 09, 2023

Kevin Shen (Researcher)

We think so too! The scGPT model we found was trained on a combined/integrated dataset of 33 million human cells. The only two existing papers we have found that have generated single-cell data in a cultivated meat context total ~50k cells (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE211428, https://www.biorxiv.org/content/10.1101/2023.09.06.556523v1.full.pdf)! scGPT does support transfer learning, but its documentation doesn’t cover transfer learning on non-human single-cell data. Depending on the task, resolving the differences in the set of genes/vocabulary between species may or may not be trivial. In the best case, enough of the biology will be conserved that it will help us understand the unknowns (e.g., we identify novel species-specific promoters because there is an unknown sequence located upstream of a conserved gene). In the worst case, this could perhaps look like operating/training only on the common/conserved set of genes? Regardless, should a significant amount of the biology be conserved, the contextual benefits of transfer learning for downstream biological tasks have, in our experience, consistently been very strong. Also, we asked the authors of scGPT about this, and they said: “I think one strategy can be (step 1) to find matched genes based on one-to-one homologs (step 2) fine-tuning on this subset of matched genes. This has been used in https://doi.org/10.1093/nargab/lqad070 and this https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9825607/ could be one of the database you can use to match them.” If this suggested approach or alternative strategies do not work, we will have to explore models that specialize in minimal amounts of data (something in the realm of few-shot learning).

Nov 14, 2023

Erin Wilson

filtering for protein-coding genes that are located in the plasma membrane

This seems non-trivial - can you explain more about how this filtering would work?

Nov 09, 2023

Kevin Shen (Researcher)

Feel free to check the reply in the solution statement for the answer to this question.

Nov 14, 2023

Erin Wilson

Will GO term results feed back into your analysis of calling distinct cell types? Or just help you label which cell type is what likely function?

Nov 09, 2023

Kevin Shen (Researcher)

It would be more of the latter - identifying functions/processes of cell types. The clustering of cell types could be done without GO terms, so we won’t need to feed these GO terms back into cell type classification, though they can serve as a useful sanity check in that analysis! The GO term functional enrichment analysis is more related to time point analysis - essentially seeing how the expressions of a certain set of genes correlate to cellular processes and how they change over time. Of course, once we have the information we could also factor in cell types and extend our time point analysis to see how cellular processes change for each cell type over time. We intend to leverage RNA velocity analysis (https://www.nature.com/articles/s41586-018-0414-6).

Nov 14, 2023

Erin Wilson

I'm still curious to understand more about how this contributes to your cell surface receptor ID efforts, or are they unrelated?

Nov 09, 2023

Kevin Shen (Researcher)

They are related because of scATAC-seq. Please feel free to refer to the comment in the solution statement.

Nov 14, 2023

About This Project

Cultivated Seafood: Single-Cell Multiomic Analysis of Killifish for Antibodies Development

The development of cultivated seafood requires identifying and isolating specific cell types. Cell sorting with specialized antibodies is pivotal in this process. This entails the design of antibodies tailored to specific cell surface receptors. This pilot project on killifish aims to generate a single-cell multi-omics dataset for analysis, identifying cell types and surface receptors. Cultivated seafood could potentially reduce global fishery/aquaculture carbon emissions by 68%.