Reproducing Burch in R
Burch produced the diagrams which illustrate his explanation of lung cancer by plotting real-world data as points on graph paper and superimposing theoretical curves with the parameter values which best fit the data. I have reproduced them, using the statistical programming language R, partly to demonstrate that it can be done and partly as an aid to explaining Burch’s work. Two tasks were involved: assembling the data he used and generating his curves. The curves are graphs of this function:
$$\frac{dP}{dt} = n\,r\,k\,S\,t^{\,r-1}\exp\!\left(-k t^{r}\right)\left(1-\exp\!\left(-k t^{r}\right)\right)^{n-1}$$
for the different values of the parameters S, k, n and r which best fit each dataset. (The derivation is given in Burch on Cancer.) For data relating to men, n and r are always 2 and 5, while for women they are 1 and 6. The values of S and k need to be ascertained separately for each dataset.
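As a minimal sketch, the function can be written in R as follows. The name `burch_rate` and the values given to S and k are illustrative assumptions, not taken from Burch or from the code in the download archive; only the formula and the values of n and r come from the text above.

```r
# Burch's curve: dP/dt as a function of age t.
# S, k, n and r are the parameters named in the text;
# the function name `burch_rate` is an illustrative assumption.
burch_rate <- function(t, S, k, n, r) {
  n * r * k * S * t^(r - 1) * exp(-k * t^r) * (1 - exp(-k * t^r))^(n - 1)
}

ages <- seq(30, 85, by = 5)

# n and r as stated in the text; S and k are placeholders chosen
# only to give finite, positive values, not fitted to any dataset.
men   <- burch_rate(ages, S = 0.1, k = 1e-10, n = 2, r = 5)
women <- burch_rate(ages, S = 0.1, k = 1e-12, n = 1, r = 6)

print(data.frame(age = ages, men = men, women = women))
```

Because the function is vectorised over `t`, the same call can be used both for a handful of age groups and for a fine grid when drawing a smooth curve.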
Most of the data for England and Wales 1901-70 is available in printed sources in exactly the form Burch used, but some needs to be pieced together as explained below. The main difficulty is establishing what values of his parameters S and k give the curves which fit the data most closely. Burch does not provide a full listing but specifies some of the values in his discussion in The Biology of Cancer and elsewhere. These are adequate to make a full listing of values of k but not S. However, full values of S can be read off from the Four Tides and Twin Peaks diagrams: I therefore reproduced these figures first and used the values of S I obtained to reproduce the Seven Decades diagrams.
Burch specifies his source of data for the 24 Countries diagram in the publications of Segi and Kurihara and provides a full listing of the values of S and k which he used. Unfortunately, I was unable to locate the 1964 edition (relating to 1961) and had to use the 1966 edition (relating to 1965). A slight modification of the 1966 values of S and k gives a satisfactory fit to this data.
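The values of S and k used here were taken from Burch's publications and diagrams and then adjusted where necessary. For readers who would rather estimate them numerically, the following is a hedged sketch of a least-squares fit using base R's `optim`. It is not the method used in this reproduction. The "observed" rates are noise-free synthetic values generated from assumed parameters, and the names `burch_rate`, `loss`, `S_hat` and `k_hat` are all illustrative.

```r
burch_rate <- function(t, S, k, n, r) {
  n * r * k * S * t^(r - 1) * exp(-k * t^r) * (1 - exp(-k * t^r))^(n - 1)
}

# Synthetic, noise-free "observations" from assumed parameters (not real data).
ages   <- seq(32.5, 82.5, by = 5)
true_S <- 0.1
true_k <- 1e-10
observed <- burch_rate(ages, true_S, true_k, n = 2, r = 5)

# Sum of squared differences on the log scale.
# S and k are optimised as logarithms so that they stay positive.
loss <- function(p) {
  fitted <- burch_rate(ages, exp(p[1]), exp(p[2]), n = 2, r = 5)
  sum((log(observed) - log(fitted))^2)
}

# Deliberately wrong starting values.
start <- c(log(0.05), log(5e-11))
fit <- optim(start, loss, method = "Nelder-Mead",
             control = list(reltol = 1e-12, maxit = 5000))

S_hat <- exp(fit$par[1])
k_hat <- exp(fit$par[2])
print(c(S = S_hat, k = k_hat, loss = fit$value))
```

With real data the residual sum of squares in `fit$value` gives a rough way to compare one candidate pair of S and k with another, for example Burch's published values against slightly modified ones.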
The data sources are
G. Todd, Statistics of smoking in the United Kingdom (6th edition, London, 1972).
M. Segi and M. Kurihara, Cancer mortality for selected sites in 24 countries, No. 3 (1960-61) (Sendai, 1964).
M. Segi and M. Kurihara, Cancer mortality for selected sites in 24 countries, No. 4 (1962-63) (Sendai, 1966).
M. Segi and M. Kurihara, Cancer mortality for selected sites in 24 countries, No. 5 (1964-65) (Sendai, 1966).
(The first two editions of Segi and Kurihara are those cited by Burch, the third is the one used here.)
The Registrar General’s Statistical Review of England and Wales for the Year 1950: Text, Medical (London, HMSO, 1954).
General Register Office, Studies on Medical and Population Subjects No. 13: Cancer Statistics for England and Wales 1901-1955, a Study Relating to Mortality and Morbidity (London, HMSO, 1958).
The first of these publications covers the years 1901-1950 and the second the years 1911-1956. It is therefore necessary to assemble the data for 1956-70 from the original sources.
This can be found in the 20th Century Mortality Files, a data set giving information on population and deaths for 1901-2000 which is available online here (the unusual structure of the URL is correct).
These files consist of spreadsheets designed to be read into the R structure known as a data frame. I first removed all data not referring to lung cancer as defined by numerical codes in the International Classification of Disease (ICD). This international standard is designed to harmonise the reporting of death in different countries and has been through ten versions since the first revision in 1900.
| ICD version | Years in force | Code for lung cancer |
| ICD1 | 1901-10 | 680 (all carcinoma) |
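The filtering step can be sketched as a join against a lookup table of ICD versions and codes. The column names (`icd_version`, `cause_code`, `n_deaths` and so on) and the toy rows are assumptions made for illustration; the real files may be laid out differently. Only the ICD1 row of the lookup is taken from the table above, and rows for later ICD versions would need to be added in the same way.

```r
# Toy rows standing in for the mortality files.
# Column names and counts are invented for illustration.
deaths <- data.frame(
  year        = c(1905, 1905, 1905),
  icd_version = c("ICD1", "ICD1", "ICD1"),
  cause_code  = c("680", "010", "680"),
  sex         = c("M", "M", "F"),
  age_group   = c("35-39", "35-39", "35-39"),
  n_deaths    = c(3, 40, 1),
  stringsAsFactors = FALSE
)

# Lookup of the code to keep for each ICD version.
# Only the ICD1 row appears in the text; add further rows as required.
lung_codes <- data.frame(
  icd_version = "ICD1",
  cause_code  = "680",
  stringsAsFactors = FALSE
)

# merge() performs an inner join by default, so rows whose
# version/code pair is not in the lookup are dropped.
lung <- merge(deaths, lung_codes, by = c("icd_version", "cause_code"))
print(lung)
```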
The cells of the data frame now contain absolute numbers of lung cancer deaths in each year, broken down by sex and age group: for instance, in 1958 the number of women who died of lung cancer between the ages of 35 and 39 was 16. Reproducing Burch’s analysis involved two steps:
1. grouping the data by five-year period
2. converting absolute numbers into a proportion of the population.
This involved the operation known as binning (replacing multiple cells with a single cell holding their sum). I
1. binned lung cancer deaths by five-year period,
2. binned population by five-year period, and
3. divided the first bin by the second to get the population-adjusted five-year average.
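The binning and division just described can be sketched with base R's `aggregate`. The annual figures below are invented and the column names are assumptions; the real data would also carry sex and age group, which would be added as further grouping variables on the right-hand side of the formula.

```r
# Invented annual figures for a single sex and age group.
annual <- data.frame(
  year        = 1956:1965,
  lung_deaths = c(10, 12, 16, 15, 17, 20, 22, 21, 25, 27),
  population  = rep(1500000, 10)
)

# Label each year with its five-year period, e.g. "1956-60".
start <- as.integer(1956 + 5 * ((annual$year - 1956) %/% 5))
annual$period <- sprintf("%d-%02d", start, (start + 4L) %% 100L)

# Bin deaths and population by period (sum within each bin).
binned <- aggregate(cbind(lung_deaths, population) ~ period,
                    data = annual, FUN = sum)

# Deaths divided by person-years gives the average annual rate
# for the five-year period.
binned$rate <- binned$lung_deaths / binned$population
print(binned)
```

Summing the population over the five years before dividing, rather than averaging five annual rates, weights each year by the number of people at risk.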
I performed this calculation for the entire data set from 1901 to 1970 and compared the result with the Register Office tables. There is good but not exact agreement back to 1941, the start of the ICD5 period. Annotations in some of Burch’s diagrams give absolute five-year totals for lung cancer deaths, and these match the calculated totals exactly. Since the death counts agree, the remaining discrepancy must lie in the population data. Population estimates can and do vary, particularly for the 1940s, when large numbers of people were abroad serving in the armed forces and there had been no census since 1931.
For the period before 1941 the discrepancy between my calculated values and the Register Office tables is too large to be explained in this way. The problem is almost certainly the diagnostic criteria used. ICD1 contains no diagnostic category equivalent to lung cancer and it seems that the Register Office must have taken its data from some other source. A comparison of my plots with those in Burch shows that my figures are accurate enough for current purposes.
My plots were mainly drawn using Hadley Wickham’s ggplot2 R library, though some of them required R base graphics. Apart from the use of colour, I have tried to reproduce the period look and feel of the originals, partly for fun and partly to help the reader identify them easily in Burch’s papers.
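The basic pattern of each diagram, data points with a theoretical curve superimposed, can be sketched in base graphics as follows. This is not the plotting code in the download archive. The points are invented by nudging the curve, the parameter values are placeholders, and the logarithmic axes and axis labels are assumptions. The plot is written to a temporary PDF so that the sketch also runs non-interactively; drop the `pdf()` and `dev.off()` lines to draw on screen.

```r
burch_rate <- function(t, S, k, n, r) {
  n * r * k * S * t^(r - 1) * exp(-k * t^r) * (1 - exp(-k * t^r))^(n - 1)
}

# Placeholder parameters (n and r as given in the text for men).
S <- 0.1; k <- 1e-10; n <- 2; r <- 5

# Invented "data": curve values nudged by a few per cent.
ages <- seq(32.5, 82.5, by = 5)
nudge <- c(1.05, 0.96, 1.02, 0.98, 1.04, 0.97, 1.01, 0.99, 1.03, 0.95, 1.02)
points_y <- burch_rate(ages, S, k, n, r) * nudge

# Fine grid for a smooth theoretical curve.
grid <- seq(30, 85, length.out = 200)
curve_y <- burch_rate(grid, S, k, n, r)

out_file <- file.path(tempdir(), "burch_sketch.pdf")
pdf(out_file)
plot(ages, points_y, log = "xy", pch = 16,
     xlab = "Age (years)", ylab = "Death rate (dP/dt)")
lines(grid, curve_y)
dev.off()
```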
Code and data for download. A zip archive containing these files:
Extracts and groups lung cancer data from the 20th Century Mortality Files.
The output of mortality.R.
Lung cancer data from The Registrar General’s Statistical Review (1954).
Lung cancer data from General Register Office Studies (1958).
Lung cancer data from Segi and Kurihara no. 5 (1966).
Data on smoking rates from Todd (1972).
Draws the diagrams using the above data.