Sunday, 12 July 2015

Developments in high throughput sequencing (June 2015 Edition)

This is nearly a month old now, but Keith Bradnam’s ACGT blog drew my attention to the June 2015 edition of Lex Nederbragt’s Developments in high throughput sequencing, in which he plots Gigabases* per run against (log) read length (*the human genome is about 3Gb):

I’m particularly excited by the two technologies on the right of this graph, which represent the latest single molecule “long read” sequencing technologies, both of which we now have access to through the Ramaciotti Centre for Genomics. In fact, we got our first data from the PacBio RS II (right) and it’s looking good! (More on that later.)

Despite being a bioinformatician with a background in genetics, I have been keeping my distance a bit from “next generation sequencing” as the technical challenges of dealing with short read data far eclipse the scientific interest. (For me, that is - the kinds of things that I am most interested in do not suit short read data.) The new long read technologies are a real game changer, and I see a lot more genomics in my (and this blog’s) future.

Thursday, 9 July 2015

Sydney Sunset

Today, I met up with an ex-student who moved to Sydney this week. After lunch at Coogee and a bit of the afternoon at UNSW, we headed into the city and ended up at Circular Quay for sunset.

I’ve said it before and I’ll say it again: Sydney does do a good sunset.

Thursday, 25 June 2015

What's Really Warming the World?

It’s hard to believe in 2015 that there are people out there who have still not accepted the reality of man-made climate change. But then some people still use homeopathic medicine and think that vaccination causes autism. Sigh. Anyway, if you have any doubts about the causes of increasing global temperatures, or just really like slick infographics, you can do a lot worse than check out Bloomberg’s page on What’s Really Warming the World?

Sunday, 21 June 2015

The importance of knowing how your data are scaled

A few weeks ago, there was a post on WEIT, The correlation between rejection of evolution and rejection of environmental regulation: what does it mean? It was triggered by a tweet by the Washington Post about a graph comparing attitudes to the environment and attitudes to evolution, broken down by religious affiliation:

We’ll get to the tweet later. First, the graph. It was from a US National Center for Science Education blog post based on 2007 data from the Pew Religious Landscape Study, examining two binary choice statements:

y-axis. Stricter environmental laws and regulations cost too many jobs and hurt the economy; or Stricter environmental laws and regulations are worth the cost.

x-axis. Evolution is the best explanation for the origins of human life on earth. (Agree/disagree)

The data were normalised onto a percentile scale, with each circle representing (1) by position, the normalised percentile of that group’s response, and (2) by area, the size of that group. (36,000 people were surveyed in total.)

The percentile normalisation method was based on a previous analysis of different Pew questions by Toby Grant, who explains it thus:

Geek note on measurement

The range of each dimension ranges from zero to 100. These scores were calculated by calculating the percentage of each religion giving each answer. The percentages were then subtracted (e.g., percent saying “smaller government” minus percent saying “bigger government”). The scores were then standardized using the mean and standard deviation for all of the scores. Finally, I converted the standardized scores into percentiles by mapping the standardized scores onto the standard Gaussian/normal distribution. The result is a score that represents the group’s average graded on the curve, literally.
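For the record, Grant’s recipe is easy to reproduce. Here is a minimal Python sketch of the same pipeline; the group names and net percentages below are invented purely for illustration:

```python
from statistics import NormalDist

# Invented example data: percent of each group giving one answer minus
# percent giving the other (Grant's first two steps combined).
net_scores = {"Group A": 40.0, "Group B": 10.0, "Group C": -5.0, "Group D": -45.0}

# Standardise using the mean and standard deviation of all the scores.
vals = list(net_scores.values())
mean = sum(vals) / len(vals)
sd = (sum((v - mean) ** 2 for v in vals) / len(vals)) ** 0.5

# Map each z-score onto the standard normal CDF to get a 0-100 "percentile".
percentiles = {
    g: 100 * NormalDist().cdf((v - mean) / sd) for g, v in net_scores.items()
}
```

Note that the output preserves only the rank order and relative spread of the groups; the absolute percentages are gone after the first step.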

A few things annoy me about this:

  1. This is not simply a “Geek note”. Knowing what was done to data is vital for understanding what a plot means. To be fair to Grant, he does mention that he is plotting percentiles in the graph legend. (As far as I can see, Robineau, who wrote the NCSE post, does not mention it anywhere!)
  2. By first normalising to the mean and then converting everything to percentiles, there is a double loss of quantitative information. Following the first normalisation, all you can do is compare groups - there is no absolute information about responses. Following the second, you cannot even compare the degree of difference. What this plot is basically doing is pulling in the outliers to make them look more similar to the mean, and spreading out those near the mean to make them look more different.
  3. When converting to percentiles, the additional normalisations seem pointless. Unless I've misunderstood, if the data are truly normally distributed then the percentile of the fitted data should be the same as the percentile rank of the raw data. If not, you shouldn’t do the normalisation in the first place. Either way, I think you are just adding error and confusion. (No data are presented to support the assumption that these opinions are normally distributed.)
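A quick simulation (Python standard library only) illustrates point 3: for genuinely normal data, the standardise-then-CDF percentile and the plain empirical percentile rank agree to within sampling error, so the extra steps buy nothing:

```python
import bisect
import random
from statistics import NormalDist, mean, pstdev

random.seed(1)
# Draw genuinely normal data and fit its mean and standard deviation.
data = sorted(random.gauss(0, 1) for _ in range(10_000))
m, s = mean(data), pstdev(data)

def empirical_percentile(x):
    # Percentile rank: fraction of observations at or below x.
    return 100 * bisect.bisect_right(data, x) / len(data)

def fitted_percentile(x):
    # Grant-style percentile: z-score mapped onto the normal CDF.
    return 100 * NormalDist().cdf((x - m) / s)

# For truly normal data the two agree to within sampling error.
for x in (-1.5, 0.0, 1.5):
    assert abs(empirical_percentile(x) - fitted_percentile(x)) < 2
```

If the raw responses were not normal (heavy-tailed, say), the two would diverge, which is exactly the case where the CDF mapping distorts the picture.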

It is also worth noting that, to the unwary, the circle sizes could be misleading. The bigger the circle, the more data and the more accurate the estimation of the value. The small circles might have much more random sampling bias in their positions. (Under a null model where all groups are the same, you would expect the large circles to gravitate towards the mean, while the smaller circles should be the outliers.) Most importantly, circles that overlap are not more similar than circles that do not.

It would be more useful to have estimated standard errors plotted for each group. Again, because we have lost the quantitative information, we cannot tell whether a small difference in responses (possibly within measurement error) would make a big difference in percentiles. There are 36,000 people in total, but some of the groups make up less than 0.5% of the sample and therefore contain fewer than 200 people.
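A back-of-envelope sketch shows how much wider the uncertainty is for those small groups; the 60% agreement rate here is hypothetical:

```python
# Standard error of a sample proportion: sqrt(p * (1 - p) / n).
def prop_se(p, n):
    return (p * (1 - p) / n) ** 0.5

# A group making up 0.5% of 36,000 respondents has only 180 people;
# assume (hypothetically) that 60% of them agree with a statement.
p = 0.6
se_small = prop_se(p, 36_000 * 0.005)   # n = 180
se_large = prop_se(p, 36_000 * 0.25)    # n = 9,000

print(f"n=180:  +/- {100 * se_small:.1f} percentage points")
print(f"n=9000: +/- {100 * se_large:.1f} percentage points")
```

The small group’s estimate is uncertain by several percentage points either way, roughly seven times the uncertainty of a large group, and none of that is visible once everything is collapsed to percentiles.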

Robineau’s plot uses the same method although he:

“didn’t rescale to the 0-100 scale, since I didn’t want this to seem like a percentage when it isn’t.”

It's not a percentage but it is a percentile, so 0-100 is entirely appropriate. Leaving it as -1.0 to +1.0 is in fact very misleading, as it implies that people are positive or negative with respect to the questions. In reality, positive just means “above average” and negative is “below average”. I have an above average number of arms: two. This does not mean that I have lots of arms, it just means that some people have fewer arms than me.

These things aside, Robineau asks:

“So what does this tell us?”

Thanks to the scaling, the only thing this graph tells us is that (a) there is a rank correlation between the answers to the two questions, and (b) some religious groups (particularly evangelical Christians) appear to agree with these statements less than average, while other groups (notably non-Christians) tend to agree with these statements more than average.

These observations could still be of interest. The real problem comes when people start interpreting this graph as if the normalisations and rescaling have not been done to it. Robineau first:

“First, look at all those groups whose members support evolution. There are way more of them than there are of the creationist groups, and those circles are bigger. We need to get more of the pro-evolution religious out of the closet.

Second, look at all those religious groups whose members support climate change action. Catholics fall a bit below the zero line on average, but I have to suspect that the forthcoming papal encyclical on the environment will shake that up.”

This in turn was apparently interpreted by the Washington Post to mean this:

The fact is, the normalisation has removed all hope of actually knowing whether there is conflict or not. The percentile scaling removes almost all of the quantitative information on the axes, so proximity on the scale means nothing with respect to proximity of answer. All the groups inside the small top-right cluster could have >90% support for the scientific evidence and all of the groups outside <10% support, and you could still get that plot. (It’s hard to tell, but the top-right cluster looks closer to 1.0 than the bottom-left groups are to -1.0, indicating that those groups might deviate much more from the mean thanks to the mapping onto a normal distribution. This implies that the data were not normally distributed in the first place and probably follow a heavy-tailed or bimodal distribution instead.)
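This is easy to demonstrate. The two invented datasets below sit at opposite ends of the raw scale (70-95% agreement versus 5-30%), yet are indistinguishable after standardising and mapping to percentiles:

```python
from statistics import NormalDist, mean, pstdev

def to_percentiles(scores):
    # Standardise, then map z-scores onto the normal CDF (0-100 scale).
    m, s = mean(scores), pstdev(scores)
    return [round(100 * NormalDist().cdf((x - m) / s), 1) for x in scores]

# Scenario 1: every group supports the statement (70-95% agreement).
high_support = [95, 90, 85, 80, 75, 70]
# Scenario 2: every group opposes it (5-30% agreement), same spacing.
low_support = [30, 25, 20, 15, 10, 5]

# After the rescaling the two scenarios are identical, so the plotted
# positions carry no absolute information about support at all.
assert to_percentiles(high_support) == to_percentiles(low_support)
```

Any two datasets with the same shape relative to their own mean and standard deviation collapse onto the same plot, however different the raw responses were.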

Critically, it is impossible to conclude that any groups “support evolution” or “support climate change action”. As the graph is scaled by percentiles, 0.0 is essentially the point where 50% are above and 50% below. Because the vast majority of groups are religious, of course there are many religious groups above the line. There essentially have to be, unless all religious groups were identical (in which case they would group very slightly below 0.0).

To many, the stand-out thing is that atheists and agnostics are all in the top right. This graph could easily have been branded “the conflict between science and religion in one chart”! But it cannot even really say that: every group could disagree with the two statements and thus be in conflict with the scientific evidence, and you would still get the same plot after the rescaling.

My big question from all of this is: why not make the plot using the raw percentage responses? What do the normalisations actually achieve?

And my big take home message: if you are going to infer things from plots, make sure that you understand how the data were scaled.

Saturday, 20 June 2015

MapTime lives!

MapTime

It’s fair to say that MapTime has been somewhat neglected in the past couple of years, now that the core team is spread over three continents. However, having given the website a long-overdue once-over today (after it went down a while ago), I am pleased to report that it still works! I even added a new TimePoint to the Organic Evolution TimeLine.

The cocksure cockatoo

As well as my first wild wallaby, my Lorne trip earlier this year featured some good photo opportunities with sulphur-crested cockatoos. Although I included a couple of photos in the earlier post, I thought they deserved their own. They are very handsome birds and sometimes, when they pose, you think they just know it.

Of course, the “cock” in cockatoo does not really derive from cocksure. Instead, cockatoo is a derivative of the Indonesian name kaka(k)tua. [The cock in cocksure is a euphemism for God according to Google.]

As well as the posers, I also got some good shots of cockatoos eating. They can actually be pests and a flock of cockatoos can (apparently) strip a fruit tree in a few hours - not so fun if that tree is in your garden or farm!

Tuesday, 26 May 2015

Forget MC Hammer, meet PFC Hammer

Ok, so most people have forgotten MC Hammer. Anyway, here’s a feel-good story (courtesy of WEIT) about PFC Hammer, a cat adopted by U.S. troops in Iraq:

In 2004, a tiny Egyptian Mau kitten wandered into U.S. Army headquarters in Iraq. Dubbed PFC [Private First Class] Hammer, he became a ratter, morale booster, and important stress reliever to the soldiers. When the battalion was set to ship back to Colorado, Staff Sgt. Rick Bousfield contacted Alley Cat Allies and Military Mascots for help in getting PFC Hammer back to the States. PFC Hammer was vetted and quarantined before traveling to Colorado Springs, where he took up permanent residence with Staff Sgt. Bousfield…

And my favourite bit:

…When Hammer was being carried to Bousfield, he heard Rick’s voice and began purring and kneading the arm of the transporter. As it turns out, he remembered his Army buddy after all.

Read the rest of the story here.