### The Doll curve

Peter Lee, reporting on Burch’s lecture to the Royal Statistical Society, complained that Burch did not take into account Doll’s then-current views on the dose-response relationship between smoking and lung cancer. This was unfair. It is true that Doll, or rather Doll and Peto, rethought the relationship, but their paper *Cigarette smoking and bronchial carcinoma: dose and time relationships among regular smokers and lifelong non-smokers* appeared early in 1978, too late for Burch to consider it in a lecture delivered in April from a text submitted in draft months before.

The Doll-Peto paper was a long overdue attempt to answer the tricky question posed by Armitage to Doll at the Royal Statistical Society in 1970. As they mildly put it:

Epidemiological data on smoking and bronchial carcinoma are more extensive than for any other cause of human carcinomas, so it may be profitable to ask which stage or process smoking affects most strongly. This has already been done (Doll, 1971), but the epidemiological evidence was difficult to fit together plausibly (Armitage, 1971).

The aim of the paper is to present arguments using data from the Doctors’ Study which “may allow circumvention of the difficulty discussed by Armitage.”

The data relates to cases of lung cancer in British male doctors who were either lifelong non-smokers or reported a constant rate of smoking over the lifetime of the study. From these subjects the statistical analysis excluded those who were aged under 40 or over 80, smoked more than 40 a day, or did not start smoking between the ages of 16 and 25. The final sample consisted of 539 men. Lung cancer incidence in these subjects, broken down by rates of smoking into rows, is given in table 4 of the paper. Successive columns present the raw numbers as modified by calculations explained in the text:

- absolute values
- values expected if lung cancer rates were the same for all smoking rates
- values expected if all age groups had the same smoking rates (calculated in the same step as the previous column)
- relative risk calculated by two different methods
- onset rates in the general population predicted by the preferred measure of relative risk.

The final step simply involves multiplication by just over 112, as that makes it easier “to grasp the medical significance of these relative risks”. Figure 1 of the paper plots this final result against absolute rates of smoking (these are not plotted at regular intervals for reasons given in the text).

Two lines run through the eight data points in figure 1: a straight line and a curve representing proposed theoretical dose-response relationships, one linear and one quadratic. Each is the best fit of its kind to the data and both fit the data within 90% confidence limits. This is, frankly, unimpressive: 95% confidence limits are customary in the social sciences, and 99% or better in the physical sciences. Though there is nothing much to choose between the two lines, Doll and Peto suggest that “the curved line does appear to fit considerably better than the straight line”.

A ninth data point lies nowhere near either line, though it technically fits the straight line within 90% confidence limits. It represents men who smoked more than 40 cigarettes a day. This group had a far lower lung cancer rate than you might expect (six cases in about 100 men), lower than the 35-40 a day group and about the same as the 30 a day group. Doll and Peto excluded it from their analysis, giving three reasons for doing so.

1. It is possible that heavy smokers may have a different constitution (essentially, different genes) from other groups.

2. It is possible that the men in this group are exaggerating or lying.

3. It is possible that heavy smokers are predominantly inhalers, and inhalation has the ‘paradoxical effect’ of protecting against lung cancer.

These things are indeed possible.

The authors then proceed to revise their account of the relationship of lung cancer rates to age. The old dose-response relationship, of the form $cd^{n}$, and the new one, of the form $c^{2}d^{n}$, will both be straight lines when plotted on a logarithmic scale, but the constants need to be given different values in order to fit the data, which is given in table 5.

Structurally, it is much the same as table 4 except that it is organised as age groups standardised for dose rather than the other way round (no details of the standardisation procedures are given). Of the innumerable possible lines, figure 2 displays three (drawn through three different versions of the data set which assume different ages at which the carcinogenic mechanism begins). The three lines all fit the data points within 90% confidence limits. Doll and Peto prefer the middle one, originating at age 22, because lung cancer rates are supposed to be a function of years of smoking rather than the past five or ten years or years since birth.
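Why power-law forms of this kind plot as straight lines on logarithmic axes can be seen by taking logarithms. The following is a sketch, not the paper's own derivation: $I$ (incidence) and $k$ (a constant of proportionality) are labels introduced here for illustration, and $c$ and $d$ are read as smoking rate and duration of the carcinogenic mechanism, which is an interpretation of the notation used above.

$$
I = k\,c^{2}d^{n} \quad\Longrightarrow\quad \log I = \log k + 2\log c + n\log d
$$

At a fixed smoking rate, $\log I$ plotted against $\log d$ is therefore a straight line with slope $n$. Changing the assumed age at which the mechanism begins changes $d$ for every data point, which is consistent with each of the three lines in figure 2 carrying a different exponent.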

There are, however, two slight snags with this. One is that the candidate lines, projected upwards into the age groups excluded from analysis, ought to graze a ninth data point. In fact it lies lower than both the seventh and the eighth points. The other is that the formulae involve an *n*th power of duration of the mechanism. If, as with the left-hand line, the power *n* is three, the model assumes a three-stage process in which a single cell undergoes three changes (‘hits’) to become cancerous. In the right-hand line, *n* is 7.2 and in the middle one it is 4.5, but there is no such thing as 0.2 or 0.5 of a hit. The authors conclude that “It is clear that the only whole-number exponents of smoking duration which can possibly fit these data are 4 or 5.”

It is not difficult to imagine what Burch will have made of this. He nowhere accorded the Doll-Peto paper an extended discussion but there are brief references to it in his later work. The two most important of these are in the otherwise minor *Smoking, lung cancer and hypothesis testing* (1981) and in *The Surgeon General’s Criteria* (1983). The first briefly rehearses the Armitage-Doll exchange and continues:

These particular difficulties have now been circumvented by a two-hit model in which incidence in men smoking 40 or fewer cigarettes per day is said to be proportional to $(\text{cigarettes per day} + 6)^{2} \cdot (\text{age} - 22.5)^{4.5}$. However, incidence at ages below 22.5 years becomes meaningless – although observed values are not negligible – and it is inconceivable that in non-smokers (at least), the first step(s) in a multi-hit, multi-stage mechanism will not have occurred before 22.5 years. Probably any initiator model based on realistic premises will fail to account for the observed lack of dependence of mortality ratios on age, the average age of onset in light and heavy smokers, and the dependence of age-specific death-rates on age in countries of low and high levels of lung cancer.
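The proportionality Burch quotes can be evaluated directly. The sketch below is illustrative only: the function name `relative_incidence` is hypothetical, the result is unnormalised (the quotation gives no constant of proportionality), and raising an error below age 22.5 is simply one way of representing the region where, in Burch's words, incidence "becomes meaningless".

```python
def relative_incidence(cigs_per_day: float, age: float) -> float:
    """Unnormalised incidence under the form quoted by Burch:
    (cigarettes per day + 6)^2 * (age - 22.5)^4.5

    Hypothetical helper for illustration; not taken from the paper.
    """
    if cigs_per_day < 0:
        raise ValueError("smoking rate cannot be negative")
    if cigs_per_day > 40:
        # The quoted form is stated only for 40 or fewer cigarettes per day.
        raise ValueError("form is stated only for 40 or fewer cigarettes per day")

    duration = age - 22.5
    if duration < 0:
        # A negative base raised to a non-integer power has no real value;
        # in Python, (-2.5) ** 4.5 evaluates to a complex number.
        raise ValueError("no real-valued incidence below age 22.5")

    return (cigs_per_day + 6) ** 2 * duration ** 4.5


if __name__ == "__main__":
    # Ratio of a 30-a-day smoker to a non-smoker at the same age:
    # the age term cancels, leaving ((30 + 6) / 6) ** 2.
    print(relative_incidence(30, 50) / relative_incidence(0, 50))

    # Effect of doubling the duration term (20 -> 40 years) at a fixed rate.
    print(relative_incidence(20, 62.5) / relative_incidence(20, 42.5))
```

Under this form, doubling the duration term multiplies incidence by $2^{4.5}$, roughly 22.6, whatever the smoking rate.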

The second reads:

Many of us will have been impressed by the striking linearity of the graph of annual death-rates vs average smoking rate in Doll and Hill’s study of British doctors. However, theory suggested that a multi-hit (at least 2-hit) mechanism of tobacco-carcinogenesis is necessary to account for the age-dependent features of lung cancer. Doll and Peto were able to show that a quadratic relationship gives a better interpretation of suitably analysed data for British doctors (based on 539 deaths) provided that the point for men smoking more than 40 cigarettes a day is ignored. The Surgeon General’s Report… shows a sub-linear “dose-response” relationship for Japanese males of all ages. Invariance of the form of the “dose-response” relation (or better, association) appears to be absent.

In short, the new dose-response relationship was tortured out of a small data set, raises as many problems as it solves and only fits an unrepresentative sample of British males. Burch did not bother further with it in his later work on smoking and lung cancer.