Abstract: DNA sequencing is an important method in modern biology. The predominant technique is shotgun sequencing, in which fragments called 'reads', each a short string of base pairs, are extracted from randomly located positions in a DNA sequence. These reads are later 'stitched' together to reconstruct the original sequence. The minimum number of reads required to stitch the sequence reliably is an important quantity.
In this talk, I will show an analogy between shotgun sequencing and Shannon's communication model, and discuss the 'sequencing capacity': the maximum number of DNA base pairs that can be resolved reliably per read, which is also a fundamental limit on the performance of any stitching algorithm. I will derive the sequencing capacity for a simple model of shotgun sequencing.
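To make the read-sampling picture concrete, here is a minimal Python sketch. It assumes reads start at uniformly random positions, are noise-free, and all have the same length; these modelling choices, and the names `sample_reads` and `covers_sequence`, are illustrative assumptions rather than the model used in the talk. The sketch only checks whether the reads cover every position, which is a necessary condition for stitching but not a sufficient one.

```python
import random


def sample_reads(sequence, num_reads, read_length, rng):
    """Draw reads whose start positions are uniform over the valid starts.

    Assumption: the sequence is linear (no wraparound) and reads are noise-free.
    Returns a list of (start, read) pairs.
    """
    last_start = len(sequence) - read_length
    starts = [rng.randint(0, last_start) for _ in range(num_reads)]
    return [(s, sequence[s:s + read_length]) for s in starts]


def covers_sequence(reads, sequence_length):
    """Return True if every position lies inside at least one read."""
    covered = [False] * sequence_length
    for start, read in reads:
        for i in range(start, start + len(read)):
            covered[i] = True
    return all(covered)


rng = random.Random(0)  # fixed seed so the run is repeatable
genome = "".join(rng.choice("ACGT") for _ in range(1000))

# Five reads of length 50 span at most 250 positions, so they cannot cover
# a sequence of length 1000.
few_reads = sample_reads(genome, 5, 50, rng)
print(covers_sequence(few_reads, len(genome)))
```

Raising `num_reads` makes full coverage increasingly likely, but coverage alone does not guarantee a unique reconstruction, since repeated substrings can make different stitchings consistent with the same reads.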
References:
1. S. Motahari, G. Bresler, and D. Tse, “Information theory of DNA sequencing,” http://arxiv.org/abs/1203.6233, 2012.
2. S. Motahari, G. Bresler, and D. Tse, “Information Theory for DNA Sequencing: Part 1: A Basic Model,” Proc. IEEE International Symposium on Information Theory, pp. 2741–2745, Cambridge, MA, July 2012.
3. J. Miller, S. Koren, and G. Sutton, “Assembly algorithms for next-generation sequencing data,” Genomics, vol. 95, pp. 315–327, 2010.