As part of my second-year lectures on protein evolution, I cover (albeit far too briefly) the topic of where new proteins come from. The most common answer to this question is "other proteins". Gene duplication is common and happens at a variety of levels, from individual domains (independent folding units of proteins) to full-length proteins, and from multiple proteins (on larger chromosomal stretches) to whole-genome duplication, in which every single gene is duplicated.
Duplication is so important because it generates a copy that can remain essentially unchanged, performing the ancestral function, and a copy that is free to accumulate mutations. Often - usually - these mutations will ultimately disable one of the copies, generating what are known as "pseudogenes", the easily identified relics of past mutational dead ends. Occasionally, however, a new function will arise, such as the binding of a new enzyme substrate or the reception of a different wavelength of light, and both copies will be retained.
This is fascinating stuff in itself (and I may post more on it later) but it is not what I want to talk about here. Important as they are, duplications rely on something - an existing protein-coding gene - being there to be duplicated. Where do these proteins come from in the first place? Are all modern proteins just duplicated and edited versions of a suite of ancestral proteins? Does this mean that supporters of the Intelligent Design community are right and that some kind of intelligent intervention is necessary to increase the repertoire of proteins that are available?
Well, no. Far from being a statistically improbable event unlikely ever to occur, the "de novo" creation of new proteins from non-coding DNA seems to be a reasonably common event. In an article just published in PLoS Genetics, for example, "De Novo Origin of Human Protein-Coding Genes" (PLoS Genet 7(11): e1002379), Wu, Irwin & Zhang describe sixty such proteins that appear to have arisen in the human lineage since we shared a common ancestor with chimps.
This and other work in the area is nicely summarised in the same issue by Guerzoni & McLysaght in a two-page article that is well worth the read, "De Novo Origins of Human Genes" (PLoS Genet 7(11): e1002381). The Wu paper itself builds on earlier work from the McLysaght lab, and they include a nice summary of the type of evidence the work looks for (see the paper for details):
It is important to bear in mind that not all of the sixty novel protein-coding genes might be "true positive" results in the strictest sense. As Guerzoni & McLysaght point out:
"The operational definition of a de novo gene used by Wu et al. means that there may be an ORF (and thus potentially a protein-coding gene) in the chimpanzee genome that is up to 80% of the length of the human gene (for about a third of the genes the chimpanzee ORF is at least 50% of the length of the human gene). This is a more lenient criterion than employed by other studies, and this may partly explain the comparatively high number of de novo genes identified. Some of these cases may be human-specific extensions of pre-existing genes, rather than entirely de novo genes—an interesting, but distinct, phenomenon."
Nevertheless, this is still the creation of de novo protein sequence, just not whole genes. Indeed, if one were to relax the criteria, how much more might this happen within proteins? Domains are frequently linked by disordered regions and also have a tendency to be flanked by intron/exon boundaries, making it quite plausible for extra exons to be picked up along their evolutionary journey. In some ways, therefore, this sixty is likely to be an underestimate.
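To make the length criterion in the Guerzoni & McLysaght quote a little more concrete, here is a minimal Python sketch of the idea: find the longest open reading frame (ORF) in a human sequence and in the matching chimp sequence, then ask whether the chimp ORF is no more than 80% of the human one. This is a toy illustration only, not the pipeline Wu et al. actually used. The function names, the forward-strand-only ORF scan and the made-up sequences are all my own assumptions.

```python
# Toy illustration of an "ORF length ratio" criterion.
# Assumptions (not from the paper): forward strand only, ORF = ATG ... stop,
# lengths counted in nucleotides including the stop codon.

STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_length(dna: str) -> int:
    """Return the length (nt) of the longest ATG-to-stop ORF in the
    three forward reading frames, or 0 if there is none."""
    dna = dna.upper()
    best = 0
    for frame in range(3):
        start = None  # position of the first ATG of the currently open ORF
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None  # close this ORF and look for the next start
    return best


def passes_length_criterion(human_len: int, chimp_len: int,
                            max_fraction: float = 0.8) -> bool:
    """True if the chimp ORF is at most `max_fraction` of the human ORF.
    The 0.8 default mirrors the "up to 80%" figure quoted above."""
    if human_len == 0:
        return False  # no human ORF, so nothing to call a de novo gene
    return chimp_len <= max_fraction * human_len


# Hypothetical sequences: a single substitution in the chimp copy
# introduces an early stop codon and truncates the ORF.
human = "ATGGCTGCTGCTGCTGCTTAA"
chimp = "ATGGCTTAAGCTGCTGCTTAA"

human_len = longest_orf_length(human)
chimp_len = longest_orf_length(chimp)
print(human_len, chimp_len, passes_length_criterion(human_len, chimp_len))
```

Lowering `max_fraction` makes the test stricter, which is one way to see why a lenient threshold would be expected to report more candidate genes.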
There are other issues with the identification process itself that make real estimates difficult. For robustness, the authors limit themselves to stretches of DNA that are present in both humans and chimps, to rule out the loss of an ancestral protein, rather than the gain of a human-specific one. Insertions and deletions ("indels") are quite common and actually account for most of the difference (in absolute genetic terms) between humans and chimps:
"It is now common knowledge human and chimpanzee DNA differ by only 1% (more accurately, they differ in 1% of alignable regions of genome, with a further 3% divergence due to lineage-specific indels)."
It is quite possible, therefore, that a few extra protein sequences have arisen in non-coding human-lineage DNA that was deleted in the chimp lineage. The other limiting factor - and probably one that has a greater impact - is that of protein annotation. Discovery relies on proteins that are annotated as such in the human genome. The problem here is that such annotations are normally made using conservation of proteins from other organisms - something that is, by definition, lacking for lineage-specific proteins. This issue is not insurmountable and there are sources of evidence that are used but it is still likely to result in underestimation.
An important caveat remains, however. Although we have some examples of de novo proteins that appear to have function, for the majority of them we have no idea what they do, if indeed they do anything at all. This opens up a more interesting question: when do you define translated DNA as a protein? Leaky transcription and translation almost certainly happen (although I have no idea how much), so we might be detecting products that really are there but only rarely. Alternatively, a sequence might be consistently transcribed and translated but have no function whatsoever - a neutral mutation that has drifted randomly to high frequency but may have no long-term future.
These questions and issues are unlikely to be resolved soon. For now, we just know that de novo creation of proteins is more common than we used to think and is likely to be a substantial source of raw material on which natural selection (and random drift) can act to evolve new functions. What the ultimate fate of these proteins tends to be, only time will tell.