A study of geographic clusters through shared borders between states with high cancer incidence and mortality rates in the United States, 2009-2013
The purpose of this investigation is to find all (or most) of the large geographical areas in the U.S. that have high cancer incidence or mortality rates compared to other U.S. regions.
It's known that cancer rates in the U.S. vary among states and regions and that different regions in the U.S. are affected by different geography-specific environmental exposures, lifestyle factors, and different health care management practices [1, 2].
A comparison of cancer incidence and mortality rates between different states and census-defined regions and divisions is provided on the CDC website. Data, graphs, and possible reasons for the differences are provided.
In this study, I instead look at the states that have the highest incidence or mortality rates and count the number of shared borders between them in order to find geographical clusters (i.e. large regions containing multiple states). Specifically, for every combination of cancer type, gender, and race/ethnicity, I count the number of shared borders among the 15 states with the highest incidence or mortality rates.
This study is the first to look at the number of shared borders among states with the highest cancer rates in order to find geographic clusters.
Data Source and Usage
I downloaded the data files from the United States Centers for Disease Control and Prevention (CDC). When selecting the data on the website, I selected the age-adjusted data from the combined 2009-2013 period. Rates are per 100,000 persons and are age-adjusted to the 2000 U.S. standard population.
For most cancers, there are four relevant files to download:
- Men incidence rate
- Men mortality rate
- Women incidence rate
- Women mortality rate
Each file contains data for the white, black, and Hispanic race/ethnicity groups in each state in the United States, for a specific cancer.
Hispanic origin is not mutually exclusive with the race categories (white, black).
Data is available for the following 26 cancers:
1 - Brain and Other Nervous System
2 - Cervix
3 - Colon and Rectum
4 - Corpus and Uterus
5 - Esophagus
6 - Female Breast
7 - Female Breast In Situ
8 - Hodgkin Lymphoma
9 - Kaposi Sarcoma
10 - Kidney and Renal Pelvis
11 - Larynx
12 - Leukemias
13 - Liver and Intrahepatic Bile Duct
14 - Lung and Bronchus
15 - Melanoma
16 - Mesothelioma
17 - Myeloma
18 - Non-Hodgkin Lymphoma
19 - Oral Cavity and Pharynx
20 - Ovary
21 - Pancreas
22 - Prostate
23 - Stomach
24 - Testis
25 - Thyroid
26 - Urinary Bladder
Some cancers occur exclusively in women (cervix, corpus and uterus, female breast and female breast in situ, and ovary), and two cancers occur exclusively in men (prostate and testis).
Male breast cancer data was not available on the CDC site and therefore was not part of this study.
Three cancer incidence rate data files did not have an associated death rate data file: male and female Kaposi sarcoma, and female breast in situ.
Altogether I downloaded 87 data files:
- Men incidence rate - 21 files
- Men mortality rate - 20 files
- Women incidence rate - 24 files
- Women mortality rate - 22 files
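The file counts above follow directly from the list of 26 cancers and the sex-specific and missing-file rules stated earlier. The following is a minimal sketch of that arithmetic; the constant and function names are illustrative assumptions, not part of the CDC data.

```python
# The 26 cancers for which data is available.
ALL_CANCERS = [
    "Brain and Other Nervous System", "Cervix", "Colon and Rectum",
    "Corpus and Uterus", "Esophagus", "Female Breast", "Female Breast In Situ",
    "Hodgkin Lymphoma", "Kaposi Sarcoma", "Kidney and Renal Pelvis", "Larynx",
    "Leukemias", "Liver and Intrahepatic Bile Duct", "Lung and Bronchus",
    "Melanoma", "Mesothelioma", "Myeloma", "Non-Hodgkin Lymphoma",
    "Oral Cavity and Pharynx", "Ovary", "Pancreas", "Prostate", "Stomach",
    "Testis", "Thyroid", "Urinary Bladder",
]
# Cancers with files for one sex only.
WOMEN_ONLY = {"Cervix", "Corpus and Uterus", "Female Breast",
              "Female Breast In Situ", "Ovary"}
MEN_ONLY = {"Prostate", "Testis"}
# Cancers with an incidence file but no associated mortality file.
NO_MORTALITY_FILE = {"Kaposi Sarcoma", "Female Breast In Situ"}


def file_counts():
    """Return the expected number of data files per group."""
    men_incidence = [c for c in ALL_CANCERS if c not in WOMEN_ONLY]
    women_incidence = [c for c in ALL_CANCERS if c not in MEN_ONLY]
    men_mortality = [c for c in men_incidence if c not in NO_MORTALITY_FILE]
    women_mortality = [c for c in women_incidence if c not in NO_MORTALITY_FILE]
    return {
        "men_incidence": len(men_incidence),
        "men_mortality": len(men_mortality),
        "women_incidence": len(women_incidence),
        "women_mortality": len(women_mortality),
    }


counts = file_counts()
print(counts, sum(counts.values()))
```

The four counts come out as 21, 20, 24, and 22, for a total of 87 files.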
The files contained 261 records. I define a "record" as the distribution of age-adjusted cancer rates (incidence or mortality) for a specific gender and race/ethnicity across the U.S. during the period 2009-2013. As an example, the distribution of colon cancer mortality rates of Hispanic women for the period 2009-2013 across the U.S. is a single record.
Some cancer rates in individual states are not provided in the CDC files (e.g. melanoma incidence in black men in Connecticut for the period 2009-2013), either because the number of cases is very low (i.e. fewer than 16 cases for that race/ethnicity) or because the rate has been suppressed at the state's request. Since my goal was to search for clusters among the 15 states with the highest cancer rates, a record was not used if it had numerical rates for fewer than 15 states. When reporting the results, I indicate for each cluster the number of states in the CDC file that had numerical values.
As a result of insufficient data, 36 records were excluded from the analysis as follows:
- Men incidence rate - 2 records excluded
- Men mortality rate - 14 records excluded
- Women incidence rate - 7 records excluded
- Women mortality rate - 13 records excluded
In addition, some records were excluded because of small variability among states. Specifically, if the highest rate in a record was less than 1.0 (per 100,000 people), the record was excluded. As an example, the highest mortality rate for Hodgkin lymphoma among white men was 0.6 per 100,000 and the record was thus excluded.
As a result of small variability among states, an additional 9 records were excluded from the analysis as follows:
- Men mortality rate - 3 records excluded
- Women incidence rate - 1 record excluded
- Women mortality rate - 5 records excluded
After the above exclusions, the 87 data files I downloaded contained 216 viable records, each record containing the required data to search for geographic clusters of states with high incidence or mortality rates. All 26 cancers were included in the final 216 records.
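The two exclusion rules and the resulting record count can be sketched as follows. This is an illustration only: it assumes that a missing or suppressed rate is represented as `None`, and the names `is_viable_record`, `MIN_STATES`, and `MIN_PEAK_RATE` are hypothetical.

```python
from typing import Optional, Sequence

MIN_STATES = 15       # a record needs numerical rates for at least 15 states
MIN_PEAK_RATE = 1.0   # the highest rate must be at least 1.0 per 100,000


def is_viable_record(rates: Sequence[Optional[float]]) -> bool:
    """Apply both exclusion rules to one record (one rate per state)."""
    numeric = [r for r in rates if r is not None]
    if len(numeric) < MIN_STATES:
        return False  # insufficient data
    return max(numeric) >= MIN_PEAK_RATE  # otherwise too little variability


# Bookkeeping from the text: 87 files, 3 race/ethnicity records per file.
total_records = 87 * 3                   # 261
viable_records = total_records - 36 - 9  # 216
print(total_records, viable_records)
```

For instance, a record whose highest rate is 0.6 per 100,000 is rejected by the second rule even if every state reports a value.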
As stated in the introduction, I looked for large geographical areas in the U.S. that had high cancer incidence or mortality rates by counting the number of shared borders among the 15 states with highest incidence and mortality rates for every combination of cancer, gender, and race/ethnicity.
I chose 15 states because the number had to be large enough to reveal geographic patterns but small enough to represent only the states with the highest cancer rates. In a separate investigation, I examined how the number of shared borders (or, more accurately, the shared-border ratio, defined below) is affected by selecting a different number of states (anywhere from 10 to 20).
I define a "shared-border ratio" (expressed as percentage) as the number of shared borders divided by the number of states:
Shared-border ratio [%] = (Number of shared borders * 100) / (Number of states)
As an example, 15 shared borders among 15 states is a shared-border ratio of 100%. If there are 24 shared borders among 15 states then the ratio is 160%. More specifically, the four states Minnesota, Iowa, Wisconsin and Illinois have five shared borders among them, or a shared-border ratio of 125%.
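The shared-border ratio can be computed with a few lines of code. The sketch below reproduces the four-state example; the `MIDWEST_BORDERS` set lists only the borders needed for this example (a full analysis would need the complete adjacency list of the contiguous states), and the function names are illustrative assumptions.

```python
from itertools import combinations

# Partial adjacency data: only the borders among the four example states.
MIDWEST_BORDERS = {
    frozenset(pair)
    for pair in [("MN", "IA"), ("MN", "WI"), ("IA", "WI"),
                 ("IA", "IL"), ("WI", "IL")]
}


def count_shared_borders(states, borders):
    """Count the pairs of listed states that share a border."""
    unique = sorted(set(states))
    return sum(1 for pair in combinations(unique, 2)
               if frozenset(pair) in borders)


def shared_border_ratio(states, borders):
    """Shared-border ratio in percent: borders * 100 / states."""
    unique = set(states)
    return 100.0 * count_shared_borders(unique, borders) / len(unique)


example = ["MN", "IA", "WI", "IL"]
print(count_shared_borders(example, MIDWEST_BORDERS))  # 5
print(shared_border_ratio(example, MIDWEST_BORDERS))   # 125.0
```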
In situations where states beyond the "15 states with highest cancer rates" had identical cancer rates to the 15th state, I included those states in the investigation. For example, in the list of states with highest cancer rates, if the 15th, 16th, and 17th states all had identical rates, I calculated the shared-border ratio of the 17 states with highest cancer rates. When reporting the results, I indicate for each cluster the number of states that were included in the analysis.
For simplicity, throughout the rest of the discussion, whenever possible I will use the term "15 states with highest cancer rates" while it should be clear that sometimes the number is higher than 15, and the correct number is always specified in the results.
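The tie rule can be sketched as a small selection function. The state labels and rates below are made up for illustration and are not taken from the CDC files; `top_states_with_ties` is a hypothetical name.

```python
from typing import Dict, List


def top_states_with_ties(rates: Dict[str, float], n: int = 15) -> List[str]:
    """Return the n highest-rate states plus any states tied with the n-th."""
    ranked = sorted(rates.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) <= n:
        return [state for state, _ in ranked]
    cutoff = ranked[n - 1][1]  # rate of the n-th state
    return [state for state, rate in ranked if rate >= cutoff]


# Made-up record: 14 distinct high rates, a three-way tie, then lower rates.
rates = {f"S{i:02d}": 30.0 - i for i in range(1, 15)}
rates.update({"S15": 15.0, "S16": 15.0, "S17": 15.0})
rates.update({"S18": 14.0, "S19": 13.0, "S20": 12.0})

selected = top_states_with_ties(rates)
print(len(selected))  # 17
```

Because the 15th, 16th, and 17th states have identical rates in this made-up record, the list grows to 17 states, matching the example in the text.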
Because this study looks for shared borders, I used cancer data only from the contiguous United States, consisting of the 48 adjoining U.S. states plus Washington, D.C. (District of Columbia). For ease of discussion, the District of Columbia is counted as a state, since it reports its own data and is not part of any state.
Considering a list of 15 states with the highest cancer rates, I define a "cluster," and provide the results in this report, if the following three conditions are met:
- At least 80% of states in the list have shared borders, e.g. 12 out of 15 (80%), or 14 out of 17 (82%).
- The shared-border ratio is 100% or higher.
- A cluster may contain two (or more) disconnected groups of states, but a group has to contain at least 3 states with shared borders to be part of a cluster.
The above definition of what constitutes a "cluster" is partly common sense but partly arbitrary, since there is no scientific definition of what constitutes a "cluster." However, the definition is consistent with how the CDC defines cancer clusters.
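The three conditions in the definition above can be sketched as a single check. This is one possible reading, not necessarily the exact procedure used in the study: it assumes that the first condition counts every state that shares at least one border with another listed state, and that the third condition requires at least one connected group of three or more states. All names and the partial border set are illustrative.

```python
from itertools import combinations


def connected_groups(states, borders):
    """Split the listed states into groups linked by shared borders."""
    neighbors = {s: set() for s in states}
    for a, b in combinations(states, 2):
        if frozenset((a, b)) in borders:
            neighbors[a].add(b)
            neighbors[b].add(a)
    groups, seen = [], set()
    for start in states:
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:  # depth-first walk over bordering states
            current = stack.pop()
            if current in group:
                continue
            group.add(current)
            stack.extend(neighbors[current] - group)
        seen |= group
        groups.append(group)
    return groups


def is_cluster(states, borders):
    """Check the three cluster conditions (under the assumptions above)."""
    states = sorted(set(states))
    n_borders = sum(1 for pair in combinations(states, 2)
                    if frozenset(pair) in borders)
    groups = connected_groups(states, borders)
    with_border = sum(len(g) for g in groups if len(g) >= 2)
    at_least_80_percent = 5 * with_border >= 4 * len(states)
    ratio_at_least_100 = n_borders >= len(states)
    has_group_of_three = any(len(g) >= 3 for g in groups)
    return at_least_80_percent and ratio_at_least_100 and has_group_of_three


# Partial adjacency data for the four-state example used earlier.
MIDWEST = {frozenset(p) for p in [("MN", "IA"), ("MN", "WI"), ("IA", "WI"),
                                  ("IA", "IL"), ("WI", "IL")]}
print(is_cluster(["MN", "IA", "WI", "IL"], MIDWEST))  # True
```

The 80% test uses integer arithmetic (`5 * with_border >= 4 * len(states)`) to avoid floating-point rounding at the threshold.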
Below you can find links to all the clusters that were found in this study, including maps of the United States and lists of 15 states with the highest cancer rates (including the cancer rate in each state and the shared-border ratio of the entire list).
For each cancer, any clusters that were found are displayed in the following order:
- Men incidence rate
- Men mortality rate
- Women incidence rate
- Women mortality rate
Within each of the above four groups, races/ethnicities are displayed in the following order:
To understand the methodology behind the maps (e.g. why groups of states are in different colors) please read the map legend.
- Brain and Other Nervous System (1 cluster)
- Cervix (3 clusters)
- Colon and Rectum (10 clusters)
- Corpus and Uterus (4 clusters)
- Esophagus (7 clusters)
- Female Breast (3 clusters)
- Female Breast In Situ (1 cluster)
- Hodgkin Lymphoma (3 clusters)
- Kaposi Sarcoma (0 clusters)
- Kidney and Renal Pelvis (8 clusters)
- Larynx (5 clusters)
- Leukemias (2 clusters)
- Liver and Intrahepatic Bile Duct (4 clusters)
- Lung and Bronchus (7 clusters)
- Melanoma (3 clusters)
- Mesothelioma (1 cluster)
- Myeloma (3 clusters)
- Non-Hodgkin Lymphoma (4 clusters)
- Oral Cavity and Pharynx (5 clusters)
- Ovary (1 cluster)
- Pancreas (7 clusters)
- Prostate (1 cluster)
- Stomach (2 clusters)
- Testis (2 clusters)
- Thyroid (2 clusters)
- Urinary Bladder (6 clusters)
The following are the key points of the results:
- Geographic clusters of states with high cancer incidence or mortality rates were found in 25 out of 26 cancers (the only exception being Kaposi sarcoma which had two viable records but no clusters were found).
- Out of 216 data records that were analyzed, this study found 95 records (44%) that contained clusters of states with high incidence or mortality rates.
- Records of men and women had an identical percentage of clusters (44% of their respective records).
- Considerably fewer clusters were found for Hispanic origin (as a percentage of their records) than for white and black race/ethnicity.
- Considerably more clusters were found when analyzing mortality rates than when analyzing incidence rates (as a percentage of their respective records).
- The highest percentage of clusters was found when analyzing black women mortality rates (72% of the records). The lowest percentage of clusters was found when analyzing Hispanic women incidence rates (5% of the records).
The two tables below display the points above in more detail:
| Category | Records analyzed | Records with clusters | Records with clusters [%] |
|---|---|---|---|
| Total (entire dataset) | 216 | 95 | 44 |
| Men (all races/ethnicities, including both incidence and mortality records) | 104 | 46 | 44 |
| Women (all races/ethnicities, including both incidence and mortality records) | 112 | 49 | 44 |
| White (men and women, including both incidence and mortality records) | 78 | 42 | 54 |
| Black (men and women, including both incidence and mortality records) | 76 | 39 | 51 |
| Hispanic (men and women, including both incidence and mortality records) | 62 | 14 | 23 |
| Incidence (all genders/races/ethnicities) | 125 | 46 | 37 |
| Mortality (all genders/races/ethnicities) | 91 | 49 | 54 |

| Category | Records analyzed | Records with clusters | Records with clusters [%] |
|---|---|---|---|
| White men incidence | 21 | 12 | 57 |
| White men mortality | 17 | 9 | 53 |
| Black men incidence | 21 | 7 | 33 |
| Black men mortality | 15 | 8 | 53 |
| Hispanic men incidence | 19 | 5 | 26 |
| Hispanic men mortality | 11 | 5 | 45 |
| White women incidence | 22 | 10 | 45 |
| White women mortality | 18 | 11 | 61 |
| Black women incidence | 22 | 11 | 50 |
| Black women mortality | 18 | 13 | 72 |
| Hispanic women incidence | 20 | 1 | 5 |
| Hispanic women mortality | 12 | 3 | 25 |
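The percentages and subtotals in the two tables can be re-derived from the twelve detailed rows. The sketch below does this as a consistency check; the data structure and function names are illustrative assumptions, and the numbers are copied from the tables above.

```python
# (race/ethnicity, gender, measure): (records analyzed, records with clusters)
ROWS = {
    ("White", "Men", "Incidence"): (21, 12),
    ("White", "Men", "Mortality"): (17, 9),
    ("Black", "Men", "Incidence"): (21, 7),
    ("Black", "Men", "Mortality"): (15, 8),
    ("Hispanic", "Men", "Incidence"): (19, 5),
    ("Hispanic", "Men", "Mortality"): (11, 5),
    ("White", "Women", "Incidence"): (22, 10),
    ("White", "Women", "Mortality"): (18, 11),
    ("Black", "Women", "Incidence"): (22, 11),
    ("Black", "Women", "Mortality"): (18, 13),
    ("Hispanic", "Women", "Incidence"): (20, 1),
    ("Hispanic", "Women", "Mortality"): (12, 3),
}


def percent(records, clusters):
    """Share of records that contain a cluster, rounded to a whole percent."""
    return round(100 * clusters / records)


def subtotal(position=None, value=None):
    """Sum records and clusters over rows whose key matches value at position."""
    selected = [counts for key, counts in ROWS.items()
                if position is None or key[position] == value]
    return (sum(r for r, _ in selected), sum(c for _, c in selected))


print(subtotal(), percent(*subtotal()))                   # (216, 95) 44
print(subtotal(0, "Hispanic"), subtotal(2, "Mortality"))  # (62, 14) (91, 49)
```

Summing the detailed rows reproduces every row of the first table, including the 216 records and 95 clusters of the entire dataset.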
This study has two limitations:
- The number of clusters depends on the definition of what constitutes a "cluster." For example, had I investigated a list of "10 states with the highest cancer rates" (instead of 15) the number of clusters would have been significantly reduced.
- The shared-border ratio as a function of number of states is inherently unstable since the addition (or removal) of a single state from the list can cause the value of the shared-border ratio to change considerably.
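The instability described in the second limitation can be illustrated with a deliberately small list. The sketch below uses the four-state example from the methodology rather than data from the study, and the border set is partial (only the pairs needed here); the function name is illustrative.

```python
from itertools import combinations

# Partial adjacency data: only the borders needed for this illustration.
BORDERS = {
    frozenset(pair)
    for pair in [("MN", "IA"), ("MN", "WI"), ("IA", "WI"), ("IA", "IL"),
                 ("WI", "IL"), ("MO", "IA"), ("MO", "IL")]
}


def shared_border_ratio(states):
    """Shared-border ratio in percent for a list of state codes."""
    unique = sorted(set(states))
    n_borders = sum(1 for pair in combinations(unique, 2)
                    if frozenset(pair) in BORDERS)
    return 100.0 * n_borders / len(unique)


base = ["MN", "IA", "WI", "IL"]
print(shared_border_ratio(base))                         # 125.0
print(shared_border_ratio(base + ["MO"]))                # 140.0 (adjacent state added)
print(shared_border_ratio(base + ["FL"]))                # 100.0 (distant state added)
print(round(shared_border_ratio(["MN", "WI", "IL"]), 1)) # 66.7 (Iowa removed)
```

With only four states the swings are exaggerated, but the same mechanism applies to a list of 15: a single state can add or remove several borders at once.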
However, I believe that searching for geographic clusters of states with high cancer incidence or mortality rates through shared borders, as has been done in this study, is viable and meaningful as exemplified by the results of this study.
By counting the number of shared borders among the 15 states with highest incidence and mortality rates for every combination of cancer, gender, and race/ethnicity, geographic clusters were found in 25 out of 26 cancers, and in 95 out of 216 records (44%). Maps of the clusters, as well as numerical results, are displayed in the results section of this study.
A possible refinement for future work is to incorporate into the definition of a cluster (or the definition of the shared-border ratio) a weighting function based on:
- The geographical size of states (in square kilometers)
- The population size of states
- The geographical dispersion of states within the cluster (i.e. how concentrated the states are around the geographical center of the cluster)