• Home
  • Blog
  • About
  • Research
    • Species
    • Publications
    • Presentations
    • GitHub
    • Videos & Recordings
  • Contact Me
  • Other Fun Stuff
    • Photography
    • Flash Mob
    • Curryosities
    • Experiment.com
My Life is Crap

Megabubbles

3/18/2022

This is a really niche post intended for geneticists who are doing de novo genome assemblies of 10X linked-reads sequencing data using Supernova (see… niche), but it might also be interesting to people who want to know a little bit about what the heck it is I do for a living.
A good amount of bioinformatics is Googling.
  • Need to figure out how to code something new (or something you should really already know but never got around to memorizing), Google it.
  • Get an error, Google it.
  • Not entirely sure what the heck you’re doing, Google it.
There’s forum upon forum full of people who have likely had the exact question you have, and people who have answered it. However, every so often, I can’t find the answer I’m looking for. So, when that happens and I somehow figure it out, I am going to start posting it here.

Supernova is an assembly program built specifically for linked-reads sequencing data generated by 10X Genomics sequencing technology. Explaining exactly what linked-reads are could be a whole other post, so if you want those details, go here. But for now, all you need to know is that it’s the raw data I am using to generate reference genomes from scratch (a.k.a. de novo).
The end product of a de novo assembly is something called a FASTA. This is a large text file that contains all the As, Gs, Cs, and Ts of that genome, in order. This file is what’s used as a guide for future genomic analyses of the species.
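If you've never opened a FASTA, here's a minimal sketch of reading one in plain Python. The function name `read_fasta` and the two tiny records are made up for illustration; for real work you'd likely reach for an established parser instead.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes out the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            # Sequence can wrap over several lines, so collect the pieces.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-record FASTA, invented for this example.
example = io.StringIO(">contig_1\nACGTAC\nGT\n>contig_2\nNNNNAC\n")
records = list(read_fasta(example))
for name, seq in records:
    print(name, len(seq))
```

Each record is just a `>` header line followed by sequence, and a run of Ns marks a stretch where the bases are unknown.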
In Supernova, there are 4 options for this output: raw, megabubbles, pseudohap, and pseudohap2. The online documentation didn't go into enough depth on how the 4 options work for me to really know which would be best for what we’re going to be using them for, and Googling didn’t get me any closer to an answer. So, I generated all 4 types of output, got assembly stats (QUAST & BUSCO) on all of them, and that gave me enough info to figure out which option would be best.
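To give a feel for the kind of numbers I was comparing, here's a rough sketch of a few contiguity stats (contig count, total length, largest contig, N50) computed from a list of contig lengths. This is not QUAST itself, which reports far more, and it doesn't touch what BUSCO does (searching for expected orthologs). The function name and the toy lengths are assumptions made up for the example.

```python
def assembly_stats(lengths):
    """Return contig count, total length, largest contig and N50."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    n50 = 0
    # N50: the length of the contig at which the running total,
    # going from longest to shortest, reaches half the assembly size.
    for length in ordered:
        running += length
        if running * 2 >= total:
            n50 = length
            break
    return {
        "contigs": len(ordered),
        "total_bp": total,
        "largest": ordered[0] if ordered else 0,
        "n50": n50,
    }


# Toy contig lengths (invented): the same three long contigs,
# with and without a pile of short ones added on top.
toy = {
    "few_long_contigs": [900, 800, 300],
    "many_short_contigs": [900, 800, 300] + [50] * 40,
}
for name, lengths in toy.items():
    print(name, assembly_stats(lengths))
```

In this toy case, adding lots of small contigs doubles the total length and drags the N50 down from 800 to 300, even though the long contigs haven't changed.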
So, here's my very simplified takeaway from this endeavor:
Supernova represents sequences in the raw assembly as “microbubbles” and “megabubbles”.
As far as I can tell, a “bubble” is a spot where more than one sequence was assembled at the same place in a contig, with bubbles separated by "gaps" (single sequences or runs of Ns). Collections of microbubbles make up megabubbles. These megabubbles then have to be flattened into a single sequence. This is where the output options come into play.
So, here’s the full breakdown that sounds like a riddle but isn’t a riddle:
raw is ALL the bubbles, even microbubbles within megabubbles, in one FASTA
megabubbles flattens the microbubbles within each megabubble arm, with all arms in one FASTA
pseudohap flattens all the megabubbles into one FASTA
pseudohap2 puts each flattened megabubble arm in a separate FASTA
  • Our "best" output is the pseudohap2 option. This option provides two FASTAs, like a "maternal" and "paternal" strand, and seemingly generated a more complete assembly (i.e., the largest contigs and lots of orthologs).
  • For our purposes, the raw option isn't ideal. It keeps all the assembly information, the micro- and megabubbles, and puts them into a single FASTA. While this might be informative, because it has all the bubbles it results in wonky stats and a HUGE FASTA (8.6 vs 2.2 Gb) with lots of small contigs.
  • The megabubbles option flattens microbubbles based on the highest coverage but each megabubble arm has a record in a single FASTA. This also results in a large FASTA (4.0 vs 2.2 Gb). Though not AS large as raw, it’s still not ideal.
  • The pseudohap option flattens megabubbles at random without recording separate megabubble arms. This one results in a single FASTA similar to the two pseudohap2 FASTAs. So, I say, why not do pseudohap2 and have two non-random FASTAs to pick from?
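If the riddle still reads like a riddle, here's a toy model of the four options in Python. To be very clear, this is my simplified picture of the behavior described above, not Supernova's actual algorithm or data structures: the `ASSEMBLY` layout, the function names, the sequences, and the coverage numbers are all invented for illustration.

```python
import random

# Toy assembly: a plain string is unambiguous sequence; a dict is a
# megabubble with two arms. Inside an arm, a list of (sequence, coverage)
# tuples is a microbubble. All values are made up.
ASSEMBLY = [
    "AAAA",
    {"arms": [
        ["CC", [("G", 30), ("T", 5)], "CC"],
        ["CA", [("G", 12), ("A", 20)], "CC"],
    ]},
    "TTTT",
]


def flatten_arm(arm):
    """Collapse each microbubble in an arm to its highest-coverage branch."""
    parts = []
    for piece in arm:
        if isinstance(piece, str):
            parts.append(piece)
        else:
            best_seq, _ = max(piece, key=lambda branch: branch[1])
            parts.append(best_seq)
    return "".join(parts)


def raw(assembly):
    """Every piece and every bubble branch becomes its own record."""
    records = []
    for seg in assembly:
        if isinstance(seg, str):
            records.append(seg)
            continue
        for arm in seg["arms"]:
            for piece in arm:
                if isinstance(piece, str):
                    records.append(piece)
                else:
                    records.extend(seq for seq, _ in piece)
    return records


def megabubbles(assembly):
    """Microbubbles are flattened, but each megabubble arm stays a record."""
    records = []
    for seg in assembly:
        if isinstance(seg, str):
            records.append(seg)
        else:
            records.extend(flatten_arm(arm) for arm in seg["arms"])
    return records


def pseudohap(assembly, rng):
    """One sequence: pick one arm per megabubble at random."""
    parts = []
    for seg in assembly:
        if isinstance(seg, str):
            parts.append(seg)
        else:
            parts.append(flatten_arm(rng.choice(seg["arms"])))
    return "".join(parts)


def pseudohap2(assembly):
    """Two sequences: first arms in one, second arms in the other."""
    haps = []
    for arm_index in (0, 1):
        parts = [
            seg if isinstance(seg, str) else flatten_arm(seg["arms"][arm_index])
            for seg in assembly
        ]
        haps.append("".join(parts))
    return haps


print(len(raw(ASSEMBLY)), "raw records")
print(len(megabubbles(ASSEMBLY)), "megabubbles records")
print(pseudohap(ASSEMBLY, random.Random(0)))
print(pseudohap2(ASSEMBLY))
```

With this one tiny megabubble, raw spits out 10 little records, megabubbles gives 4, and pseudohap2 gives two full-length sequences, one per arm. The pseudohap result is always one of those two, which mirrors why its FASTA ends up looking so much like a pseudohap2 FASTA.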
I hope this helps some of my fellow Googling bioinfomagicians out there and didn’t make everyone else just super confused.
© Caitlin Curry All Rights Reserved