Introduction
This vignette gives an example of the analysis of a typical social science data set. It is the data file of the American National Election Study of 1948, available from the American National Election Studies website. The data file contains data from two USA-wide surveys conducted in October and November 1948 by the Survey Research Center, University of Michigan (principal investigators: Angus Campbell and Robert L. Kahn). The total number of cases in the data set is 662 and the number of variables is 65 (more details about this data set can be found at https://electionstudies.org/studypages/1948prepost/1948prepost.htm).
With 662 cases and 65 variables, the 1948 ANES data set is relatively small compared with current social science data sets. Larger data sets can be processed along the same lines as in this vignette. Unlike the 1948 ANES data, however, they cannot be included in the package because of their size and, in some cases, legal restrictions.
This vignette starts with a demonstration of how a data file can be examined before loading it and how a subset of the data can be loaded into memory. After loading this subset into memory, some descriptive analyses are conducted that showcase the construction of contingency tables and of general tables of descriptive statistics using the genTable function. In addition, a logit analysis is demonstrated, and several sets of logit coefficients are collected into a comprehensive table by the mtable function.
It should be noted that the analyses reported in the following are conducted only for the purpose of demonstrating the features of the package and are not to be considered conclusive scientific evidence of any kind.
This vignette is run with the help of the knitr package. This makes it possible to showcase not only the data management facilities provided by memisc; the following code also demonstrates how output created with some of the facilities of memisc can be neatly integrated into reports generated with knitr. Before we start, we adjust knitr’s output (with which this vignette is formatted) to produce HTML where possible.
knit_print.codebook <- function(x,...)
  knitr::asis_output(format_html(x,...))
knit_print.descriptions <- function(x,...)
  knitr::asis_output(format_html(x,...))
knit_print.ftable <- function(x,options,...)
  knitr::asis_output(
    format_html(x,
                digits=if(length(options$ftable.digits))
                         options$ftable.digits
                       else 0,
                ...))
# We can now adjust the number of digits after the decimal point
# for each column, e.g. by adding an `ftable.digits` option
# to an R chunk, as in ```{r,ftable.digits=c(2,2,0)}
knit_print.mtable <- function(x,...)
  knitr::asis_output(format_html(x,...))
Reading in a “portable” SPSS data file
We start with importing the data into R. The following code extracts the SPSS portable file NES1948.POR from the zip file NES1948.ZIP delivered with the memisc package.
library(memisc)
options(digits=3)
nes1948.por <- unzip(system.file("anes/NES1948.ZIP",package="memisc"),
"NES1948.POR",exdir=tempfile())
Now the portable file is in a temporary directory and the path to the file is contained in the string variable nes1948.por. In the next step, the file is declared as an SPSS/PSPP portable file using the function spss.portable.file, which takes the path to the file as its first argument. spss.portable.file reads in the information about the variables contained in the data set and counts the number of cases in the file. That is, standard I/O operations are used on the file, but the data read in are just thrown away without allocating core memory for the data. This counting of cases can, of course, be suppressed if it would take too long.
nes1948 <- spss.portable.file(nes1948.por)
Warning: 9 variables have duplicated labels:
V480004, V480012, V480020, V480021A, V480021B, V480033A, V480033B,
V480034A, V480034B
print(nes1948)
SPSS portable file '/tmp/RtmpqGo13L/file1c5f79d6390c/NES1948.POR'
with 67 variables and 662 observations
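The case counting mentioned above can be switched off when the file is declared. The following is a minimal sketch, assuming that spss.portable.file accepts a count.cases argument, as the package documentation describes for the SPSS importer functions. The object names por.path and nes1948.quick are hypothetical and used only for illustration.

```r
library(memisc)

# Extract the portable file from the zip archive, as before
por.path <- unzip(system.file("anes/NES1948.ZIP", package = "memisc"),
                  "NES1948.POR", exdir = tempfile())

# Assumption: count.cases = FALSE skips the pass over the data that
# counts the observations; only the variable information is read.
nes1948.quick <- spss.portable.file(por.path, count.cases = FALSE)

# The variable names are available either way
head(names(nes1948.quick))
```

For a file as small as this one the difference is negligible; the option is mainly of interest for very large files.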
At this stage, the data have not yet been loaded into memory, but we can already see which variables exist in the data set:
names(nes1948)
[1] "VVERSION" "VDSETNO" "V480001" "V480002" "V480003" "V480004"
[7] "V480005" "V480006" "V480007" "V480008" "V480009" "V480010"
[13] "V480011" "V480012" "V480013" "V480014A" "V480014B" "V480015A"
[19] "V480015B" "V480016A" "V480016B" "V480017A" "V480017B" "V480018"
[25] "V480019" "V480020" "V480021A" "V480021B" "V480022A" "V480022B"
[31] "V480023" "V480024" "V480025A" "V480025B" "V480026" "V480027"
[37] "V480028" "V480029" "V480030" "V480031A" "V480031B" "V480031C"
[43] "V480032A" "V480032B" "V480032C" "V480033A" "V480033B" "V480034A"
[49] "V480034B" "V480035A" "V480035B" "V480036A" "V480036B" "V480037"
[55] "V480038" "V480039" "V480040" "V480041" "V480042" "V480043"
[61] "V480044" "V480045" "V480046" "V480047" "V480048" "V480049"
[67] "V480050"
Note that the variable names appear in uppercase here, as they are stored in the file (SPSS does not distinguish uppercase and lowercase variable names). Whether the names are folded to lowercase is controlled by the tolower argument, as in the call spss.portable.file(nes1948.por,tolower=FALSE).
We can also ask for a description (“variable label”) for each variable:
description(nes1948)
Variable | Description |
VVERSION | NES VERSION NUMBER |
VDSETNO | NES DATASET NUMBER |
V480001 | ICPSR ARCHIVE NUMBER |
V480002 | INTERVIEW NUMBER |
V480003 | POP CLASSIFICATION |
V480004 | CODER |
V480005 | NUMBER OF CALLS TO R |
V480006 | R REMEMBER PREVIOUS INT |
V480007 | INTR INTERVIEW THIS R |
V480008 | PRVS PRE-ELCTN R REINT |
V480009 | R INT IN PRE/POSTELCTN |
V480010 | RENT CNTRL KEPT/DROPPED |
V480011 | GOVT CONTROL PRICES |
V480012 | WHAT TO DO W TFT-HT ACT |
V480013 | PRESLELCTN OTCM SURPRISE |
V480014A | WHY PPL VTD FOR TRUMAN 1 |
V480014B | WHY PPL VTD FOR TRUMAN 2 |
V480015A | WHY PPL VTD AGNST TRUMAN 1 |
V480015B | WHY PPL VTD AGNST TRUMAN 2 |
V480016A | WHY PPL VTD FOR DEWEY 1 |
V480016B | WHY PPL VTD FOR DEWEY 2 |
V480017A | WHY PPL VTD AGNST DEWEY 1 |
V480017B | WHY PPL VTD AGNST DEWEY 2 |
V480018 | DID R VOTE/FOR WHOM |
V480019 | WN DECIDE FOR WHOM TO VT |
V480020 | CNSD VT FOR SOMEONE ELSE |
V480021A | XWHY DID NOT VT FOR HIM 1 |
V480021B | XWHY DID NOT VT FOR HIM 2 |
V480022A | WHY VT THE WAY YOU DID 1 |
V480022B | WHY VT THE WAY YOU DID 2 |
V480023 | VOTED STRAIGHT TICKET |
V480024 | R NOT VT-IF VT,FOR WHOM |
V480025A | R NOT VT-WHY DID NOT VT 1 |
V480025B | R NOT VT-WHY DID NOT VT 2 |
V480026 | R NOT VT-WAS R REG TO VT |
V480027 | VTD IN PRVS PRESL ELCTN |
V480028 | VTD FOR WHOM IN 1944 |
V480029 | OCCUPATION OF HEAD |
V480030 | HEAD BELONG TO LBR UN |
V480031A | GRPS IDENTIFIED W TRUMAN 1 |
V480031B | GRPS IDENTIFIED W TRUMAN 2 |
V480031C | GRPS IDENTIFIED W TRUMAN 3 |
V480032A | GRPS IDENTIFIED W DEWEY 1 |
V480032B | GRPS IDENTIFIED W DEWEY 2 |
V480032C | GRPS IDENTIFIED W DEWEY 3 |
V480033A | ISSUES CONNECTED W TRMN 1 |
V480033B | ISSUES CONNECTED W TRMN 2 |
V480034A | ISSUES CONNECTED W DEWEY 1 |
V480034B | ISSUES CONNECTED W DEWEY 2 |
V480035A | PERSONAL ATTRIBUTE TRMN 1 |
V480035B | PERSONAL ATTRIBUTE TRMN 2 |
V480036A | PERSONAL ATTRIBUTE DEWEY 1 |
V480036B | PERSONAL ATTRIBUTE DEWEY 2 |
V480037 | CMPN INCIDENTS MENTIONED |
V480038 | 41-PRESLELCTN PLAN TO VT |
V480039 | 41-PLAN TO VT REP/DEM |
V480040 | 41-USA’S CNCRN W OTHERS |
V480041 | 41-SATISD USA TWRD RUSS |
V480042 | 41-INFORMATION LEVEL |
V480043 | 41-USA GV IN,AGRT RUSS |
V480044 | 41-USA-RUSS AGRT VIA U.N |
V480045 | SEX OF RESPONDENT |
V480046 | RACE OF RESPONDENT |
V480047 | AGE OF RESPONDENT |
V480048 | EDUCATION OF RESPONDENT |
V480049 | TOTAL 1948 INCOME |
V480050 | RELIGIOUS PREFERENCE |
or even a codebook using
codebook(nes1948)
(this is not shown here because the output would take more than thirty pages). We can also get a codebook of the first few variables instead, with
codebook(nes1948[1:5])
VVERSION
— ‘NES VERSION NUMBER’
Storage mode: | double |
Measurement: | interval |
Min: | 1.000 |
Max: | 1.000 |
Mean: | 1.000 |
Std.Dev.: | 0.000 |
VDSETNO
— ‘NES DATASET NUMBER’
Storage mode: | character |
Measurement: | nominal |
Min: | “1948 | T” | |
Max: | “1948 | T” |
V480001
— ‘ICPSR ARCHIVE NUMBER’
Storage mode: | double |
Measurement: | interval |
Min: | 7218.000 |
Max: | 7218.000 |
Mean: | 7218.000 |
Std.Dev.: | 0.000 |
V480002
— ‘INTERVIEW NUMBER’
Storage mode: | double |
Measurement: | interval |
Min: | 1001.000 |
Max: | 1662.000 |
Mean: | 1331.500 |
Std.Dev.: | 191.103 |
V480003
— ‘POP CLASSIFICATION’
Storage mode: | double |
Measurement: | nominal |
Value | Label | N | Percent |
1 | ‘METROPOLITAN AREA’ | 182 | 27.5 |
2 | ‘TOWN OR CITY’ | 354 | 53.5 |
3 | ‘OPEN COUNTRY’ | 126 | 19.0 |
Reading in a subset of the data
After we have decided which variables to use we can read in a subset of the data:
vote.48 <- subset(nes1948,
select=c(
V480018,
V480029,
V480030,
V480045,
V480046,
V480047,
V480048,
V480049,
V480050
))
The subset of the ANES 1948 data we read in is now contained in the variable vote.48, which contains an object of class data.set. A data.set is an “embellished” version of a data.frame, a data structure intended to contain labelled vectors. Labelled vectors contain all the special information attached to the variables in the original data set, such as variable labels, value labels, and general missing values. A short summary of this special information shows up after a call to str.
str(vote.48)
Data set with 662 obs. of 9 variables:
$ V480018: Nmnl. item w/ 7 labels for 1,2,3,... + ms.v. num 1 2 1 2 1 2 2 1 2 1 ...
$ V480029: Nmnl. item w/ 12 labels for 10,20,30,... + ms.v. num 70 30 40 10 10 20 80 80 40 40 ...
$ V480030: Nmnl. item w/ 4 labels for 1,2,8,... + ms.v. num 1 2 2 2 2 2 2 2 1 1 ...
$ V480045: Nmnl. item w/ 3 labels for 1,2,9 + ms.v. num 1 2 2 2 1 2 1 2 1 1 ...
$ V480046: Nmnl. item w/ 4 labels for 1,2,3,... + ms.v. num 1 1 1 1 1 1 1 1 1 1 ...
$ V480047: Nmnl. item w/ 7 labels for 1,2,3,... + ms.v. num 3 3 2 3 2 3 4 5 2 2 ...
$ V480048: Nmnl. item w/ 4 labels for 1,2,3,... + ms.v. num 1 2 2 3 3 2 1 1 2 2 ...
$ V480049: Nmnl. item w/ 8 labels for 1,2,3,... + ms.v. num 4 7 5 7 5 7 5 2 5 6 ...
$ V480050: Nmnl. item w/ 6 labels for 1,2,3,... + ms.v. num 1 1 2 1 2 1 1 1 1 2 ...
This output shows, for example, that variable V480018, which has the description (variable label) “DID R VOTE/FOR WHOM”, is considered as having a nominal level of measurement, has seven value labels, and has one defined missing value.
Since the variable names in the ANES data set are not very mnemonic, we rename the variables:
vote.48 <- rename(vote.48,
V480018 = "vote",
V480029 = "occupation.hh",
V480030 = "unionized.hh",
V480045 = "gender",
V480046 = "race",
V480047 = "age",
V480048 = "education",
V480049 = "total.income",
V480050 = "religious.pref"
)
Since many data sets available from public repositories have non-mnemonic variable names like those in this example, it might be convenient to do the data loading and renaming in one step. This is indeed possible:
vote.48 <- subset(nes1948,
select=c(
vote = V480018,
occupation.hh = V480029,
unionized.hh = V480030,
gender = V480045,
race = V480046,
age = V480047,
education = V480048,
total.income = V480049,
religious.pref = V480050
))
Before we start with analyses, we take a closer look at the data.
codebook(vote.48)
vote
— ‘DID R VOTE/FOR WHOM’
Storage mode: | double |
Measurement: | nominal |
Missing values: | 9 |
Value | Label | N | Valid | Total |
1 | ‘VOTED - FOR TRUMAN’ | 212 | 32.1 | 32.0 |
2 | ‘VOTED - FOR DEWEY’ | 178 | 27.0 | 26.9 |
3 | ‘VOTED - FOR WALLACE’ | 1 | 0.2 | 0.2 |
4 | ‘VOTED - FOR OTHER’ | 11 | 1.7 | 1.7 |
5 | ‘VOTED - NA FOR WHOM’ | 20 | 3.0 | 3.0 |
6 | ‘DID NOT VOTE’ | 238 | 36.1 | 36.0 |
9 M | ‘NA WHETHER VOTED’ | 2 | | 0.3 |
occupation.hh
— ‘OCCUPATION OF HEAD’
Storage mode: | double |
Measurement: | nominal |
Missing values: | 99 |
Value | Label | N | Valid | Total |
10 | ‘PROFESSIONAL, SEMI-PROFESSIONAL’ | 44 | 6.9 | 6.6 |
20 | ‘SELF-EMPLOYED, MANAGERIAL, SUPERVISORY’ | 73 | 11.5 | 11.0 |
30 | ‘OTHER WHITE-COLLAR (CLERICAL, SALES, ET’ | 79 | 12.5 | 11.9 |
40 | ‘SKILLED AND SEMI-SKILLED’ | 164 | 25.9 | 24.8 |
60 | ‘PROTECTIVE SERVICE’ | 6 | 0.9 | 0.9 |
70 | ‘UNSKILLED, INCLUDING FARM AND SERVICE W’ | 85 | 13.4 | 12.8 |
80 | ‘FARM OPERATORS AND MANAGERS’ | 105 | 16.6 | 15.9 |
92 | ‘STUDENT’ | 7 | 1.1 | 1.1 |
94 | ‘UNEMPLOYED’ | 5 | 0.8 | 0.8 |
95 | ‘RETIRED, TOO OLD OR UNABLE TO WORK’ | 38 | 6.0 | 5.7 |
96 | ‘HOUSEWIFE’ | 28 | 4.4 | 4.2 |
99 M | ‘NA’ | 28 | | 4.2 |
unionized.hh
— ‘HEAD BELONG TO LBR UN’
Storage mode: | double |
Measurement: | nominal |
Missing values: | 8 - Inf |
Value | Label | N | Valid | Total |
1 | ‘YES’ | 150 | 23.3 | 22.7 |
2 | ‘NO’ | 493 | 76.7 | 74.5 |
8 M | ‘DK’ | 5 | | 0.8 |
9 M | ‘NA’ | 14 | | 2.1 |
gender
— ‘SEX OF RESPONDENT’
Storage mode: | double |
Measurement: | nominal |
Missing values: | 9 |
Value | Label | N | Valid | Total |
1 | ‘MALE’ | 302 | 45.8 | 45.6 |
2 | ‘FEMALE’ | 357 | 54.2 | 53.9 |
9 M | ‘NA’ | 3 | | 0.5 |
race
— ‘RACE OF RESPONDENT’
Storage mode: | double |
Measurement: | nominal |
Missing values: | 9 |
Value | Label | N | Valid | Total |
1 | ‘WHITE’ | 585 | 90.7 | 88.4 |
2 | ‘NEGRO’ | 60 | 9.3 | 9.1 |
3 | ‘OTHER’ | 0 | 0.0 | 0.0 |
9 M | ‘NA’ | 17 | | 2.6 |
age
— ‘AGE OF RESPONDENT’
Storage mode: | double |
Measurement: | nominal |
Missing values: | 9 |
Value | Label | N | Valid | Total |
1 | ‘18-24’ | 57 | 8.7 | 8.6 |
2 | ‘25-34’ | 142 | 21.7 | 21.5 |
3 | ‘35-44’ | 174 | 26.6 | 26.3 |
4 | ‘45-54’ | 125 | 19.1 | 18.9 |
5 | ‘55-64’ | 86 | 13.1 | 13.0 |
6 | ‘65 AND OVER’ | 70 | 10.7 | 10.6 |
9 M | ‘NA’ | 8 | | 1.2 |
education
— ‘EDUCATION OF RESPONDENT’
Storage mode: | double |
Measurement: | nominal |
Missing values: | 9 |
Value | Label | N | Valid | Total |
1 | ‘GRADE SCHOOL’ | 292 | 44.4 | 44.1 |
2 | ‘HIGH SCHOOL’ | 266 | 40.4 | 40.2 |
3 | ‘COLLEGE’ | 100 | 15.2 | 15.1 |
9 M | ‘NA’ | 4 | | 0.6 |
total.income
— ‘TOTAL 1948 INCOME’
Storage mode: | double |
Measurement: | nominal |
Missing values: | 9 |
Value | Label | N | Valid | Total |
1 | ‘UNDER $500’ | 25 | 3.8 | 3.8 |
2 | ‘$500-$999’ | 43 | 6.6 | 6.5 |
3 | ‘$1000-1999’ | 110 | 16.8 | 16.6 |
4 | ‘$2000-2999’ | 185 | 28.2 | 27.9 |
5 | ‘$3000-3999’ | 142 | 21.7 | 21.5 |
6 | ‘$4000-4999’ | 66 | 10.1 | 10.0 |
7 | ‘$5000 AND OVER’ | 84 | 12.8 | 12.7 |
9 M | ‘NA’ | 7 | | 1.1 |
religious.pref
— ‘RELIGIOUS PREFERENCE’
Storage mode: | double |
Measurement: | nominal |
Missing values: | 9 |
Value | Label | N | Valid | Total |
1 | ‘PROTESTANT’ | 460 | 70.0 | 69.5 |
2 | ‘CATHOLIC’ | 140 | 21.3 | 21.1 |
3 | ‘JEWISH’ | 25 | 3.8 | 3.8 |
4 | ‘OTHER’ | 14 | 2.1 | 2.1 |
5 | ‘NONE’ | 18 | 2.7 | 2.7 |
9 M | ‘NA’ | 5 | | 0.8 |
We have now obtained a codebook, which contains information on the class and type of the variables in the data set, the value labels and defined missing values, and counts of the distinct values of the variables.
Analysis
Some descriptive analyses
We start our analyses with a contingency table, but first we make some preparations: We recode the variables of interest into a smaller number of categories in order to get results that are easier to read and interpret.
vote.48 <- within(vote.48,{
  vote3 <- recode(vote,
    1 -> "Truman",
    2 -> "Dewey",
    3:4 -> "Other"
    )
  occup4 <- recode(occupation.hh,
    10:20 -> "Upper white collar",
    30 -> "Other white collar",
    40:70 -> "Blue collar",
    80 -> "Farmer"
    )
  relig3 <- recode(religious.pref,
    1 -> "Protestant",
    2 -> "Catholic",
    3:5 -> "Other,none"
    )
  race2 <- recode(race,
    1 -> "White",
    2 -> "Black"
    )
})
Warning in recode(vote, "Truman" <- 1, "Dewey" <- 2, "Other" <- 3:4): recoding
created 260 NAs
Warning in recode(occupation.hh, "Upper white collar" <- 10:20, "Other white
collar" <- 30, : recoding created 106 NAs
Warning in recode(religious.pref, "Protestant" <- 1, "Catholic" <- 2,
"Other,none" <- 3:5): recoding created 5 NAs
Warning in recode(race, "White" <- 1, "Black" <- 2): recoding created 17 NAs
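The number of NAs reported in the first warning can be cross-checked against the codebook. The following base R sketch rebuilds the vote codes from the counts shown in the codebook above and collapses them in the same way. It does not use memisc's recode, and the object names code.counts, vote.codes, and vote3.check are hypothetical.

```r
# Counts of the codes of 'vote', taken from the codebook shown earlier
code.counts <- c("1" = 212, "2" = 178, "3" = 1, "4" = 11,
                 "5" = 20, "6" = 238, "9" = 2)
vote.codes <- rep(as.integer(names(code.counts)), times = code.counts)

# Map the codes to three categories; every other code becomes NA
vote3.check <- rep(NA_character_, length(vote.codes))
vote3.check[vote.codes == 1] <- "Truman"
vote3.check[vote.codes == 2] <- "Dewey"
vote3.check[vote.codes %in% 3:4] <- "Other"
vote3.check <- factor(vote3.check, levels = c("Truman", "Dewey", "Other"))

table(vote3.check, useNA = "ifany")
sum(is.na(vote3.check))  # 20 + 238 + 2 = 260, as in the warning
```

The NAs thus stem from respondents who did not vote, did not say for whom they voted, or did not say whether they voted.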
Having constructed the unordered factors vote3, occup4, relig3, and race2, we can proceed to examine the association between the vote and occupational class, religious denomination, and race. First, we look at a simple contingency table.
occup4 |
vote3 | Upper white collar | Other white collar | Blue collar | Farmer |
Truman | 17 | 30 | 114 | 26 |
Dewey | 67 | 31 | 36 | 14 |
Other | 2 | 0 | 4 | 3 |
Tables of percentages may seem more informative about the impact of various factors on the vote. So we use the function genTable to obtain such tables of percentages:
gt1 <- genTable(percent(vote3)~occup4,data=vote.48)
## For knitr-ing, we use ```{r, ftable.digits=c(2,2,2,0)} here.
ftable(gt1,row.vars=2)
occup4 | Truman | Dewey | Other | N |
Upper white collar | 19.77 | 77.91 | 2.33 | 86 |
Other white collar | 49.18 | 50.82 | 0.00 | 61 |
Blue collar | 74.03 | 23.38 | 2.60 | 154 |
Farmer | 60.47 | 32.56 | 6.98 | 43 |
NA | 43.10 | 51.72 | 5.17 | 58 |
Obviously, voters from farmer and blue-collar worker households were especially supportive of President Truman, while voters of upper white-collar background largely supported the Republican candidate Dewey.
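The percentages for the four occupational classes can be cross-checked with base R from the counts of the contingency table shown earlier. This is only a sketch that re-enters the counts by hand; it is not how genTable computes its result, and the object names counts, tab, and pct are hypothetical.

```r
# Counts from the contingency table of vote3 by occup4 (entered column-wise)
counts <- matrix(c(17, 67, 2,
                   30, 31, 0,
                   114, 36, 4,
                   26, 14, 3),
                 nrow = 3,
                 dimnames = list(
                   vote3 = c("Truman", "Dewey", "Other"),
                   occup4 = c("Upper white collar", "Other white collar",
                              "Blue collar", "Farmer")))
tab <- as.table(counts)

# Percentages within each occupational class (i.e. within columns)
pct <- 100 * prop.table(tab, margin = 2)

# Transpose so that occupational classes form the rows, as in the table above
round(t(pct), 2)

# Number of cases per occupational class (the N column)
colSums(tab)
```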
relig3 | Truman | Dewey | Other | N |
Protestant | 44.71 | 50.98 | 4.31 | 255 |
Catholic | 66.02 | 33.98 | 0.00 | 103 |
Other,none | 68.18 | 29.55 | 2.27 | 44 |
NA | NaN | NaN | NaN | 0 |
This table shows that Catholics and adherents of other denominations were more supportive of Truman than of Dewey.
race2 | Truman | Dewey | Other | N |
White | 51.33 | 45.48 | 3.19 | 376 |
Black | 64.71 | 35.29 | 0.00 | 17 |
NA | 88.89 | 11.11 | 0.00 | 9 |
African Americans apparently supported Truman by a large majority. The number of members of this group in the sample is very small, however, so that such an inference would be very shaky.
total.income | Truman | Dewey | Other | N |
UNDER $500 | 50.00 | 50.00 | 0.00 | 8 |
$500-$999 | 61.54 | 38.46 | 0.00 | 13 |
$1000-1999 | 64.41 | 32.20 | 3.39 | 59 |
$2000-2999 | 66.99 | 30.10 | 2.91 | 103 |
$3000-3999 | 47.52 | 48.51 | 3.96 | 101 |
$4000-4999 | 45.83 | 50.00 | 4.17 | 48 |
$5000 AND OVER | 31.82 | 68.18 | 0.00 | 66 |
NA | 50.00 | 25.00 | 25.00 | 4 |
The table of percentages of the vote by income suggests that income had a considerable influence on the choice of either Truman or Dewey, but the unequal distribution of income categories warrants a more refined analysis that takes into account the uncertainty about the vote percentages. Therefore, the percentages of support for Truman broken down by income are shown with confidence intervals:
## For knitr-ing, we use ```{r, ftable.digits=c(2,2,2)} here.
inc.tab <- genTable(percent(vote3,ci=TRUE)~total.income,data=vote.48)
ftable(inc.tab,row.vars=c(3,2))
total.income | vote3 | Percentage | lower | upper |
UNDER $500 | Truman | 50.00 | 15.70 | 84.30 |
UNDER $500 | Dewey | 50.00 | 15.70 | 84.30 |
UNDER $500 | Other | 0.00 | 0.00 | 36.94 |
$500-$999 | Truman | 61.54 | 31.58 | 86.14 |
$500-$999 | Dewey | 38.46 | 13.86 | 68.42 |
$500-$999 | Other | 0.00 | 0.00 | 24.71 |
$1000-1999 | Truman | 64.41 | 50.87 | 76.45 |
$1000-1999 | Dewey | 32.20 | 20.62 | 45.64 |
$1000-1999 | Other | 3.39 | 0.41 | 11.71 |
$2000-2999 | Truman | 66.99 | 57.03 | 75.94 |
$2000-2999 | Dewey | 30.10 | 21.45 | 39.92 |
$2000-2999 | Other | 2.91 | 0.60 | 8.28 |
$3000-3999 | Truman | 47.52 | 37.49 | 57.70 |
$3000-3999 | Dewey | 48.51 | 38.45 | 58.67 |
$3000-3999 | Other | 3.96 | 1.09 | 9.83 |
$4000-4999 | Truman | 45.83 | 31.37 | 60.83 |
$4000-4999 | Dewey | 50.00 | 35.23 | 64.77 |
$4000-4999 | Other | 4.17 | 0.51 | 14.25 |
$5000 AND OVER | Truman | 31.82 | 20.89 | 44.44 |
$5000 AND OVER | Dewey | 68.18 | 55.56 | 79.11 |
$5000 AND OVER | Other | 0.00 | 0.00 | 5.44 |
NA | Truman | 50.00 | 6.76 | 93.24 |
NA | Dewey | 25.00 | 0.63 | 80.59 |
NA | Other | 25.00 | 0.63 | 80.59 |
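The interval bounds in this table appear to agree with exact (Clopper-Pearson) binomial confidence intervals. This is an inference from the reported numbers, not something the text states about genTable. Under that assumption, the first income category (8 respondents, of whom 50 percent, i.e. 4, voted for Truman and none for another candidate) can be checked with base R's binom.test; the object names ci.truman and ci.other are hypothetical.

```r
# Exact binomial 95% interval for 4 Truman voters among the 8 respondents
# in the income category "UNDER $500", expressed in percent
ci.truman <- 100 * binom.test(4, 8)$conf.int

# No voters for "Other" candidates among the same 8 respondents
ci.other <- 100 * binom.test(0, 8)$conf.int

round(ci.truman, 2)
round(ci.other, 2)
```

With so few cases, the interval for Truman's share spans most of the range from 0 to 100 percent, which illustrates why the raw percentages for the small income categories should not be over-interpreted.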
Occupational class is more evenly distributed in the sample, thus it may be possible to obtain more precise estimates of the percentages of support for Truman for occupational classes:
occup4 | vote3 | Percentage | lower | upper |
Upper white collar | Truman | 19.77 | 11.96 | 29.75 |
Upper white collar | Dewey | 77.91 | 67.67 | 86.14 |
Upper white collar | Other | 2.33 | 0.28 | 8.15 |
Other white collar | Truman | 49.18 | 36.14 | 62.30 |
Other white collar | Dewey | 50.82 | 37.70 | 63.86 |
Other white collar | Other | 0.00 | 0.00 | 5.87 |
Blue collar | Truman | 74.03 | 66.35 | 80.75 |
Blue collar | Dewey | 23.38 | 16.94 | 30.86 |
Blue collar | Other | 2.60 | 0.71 | 6.52 |
Farmer | Truman | 60.47 | 44.41 | 75.02 |
Farmer | Dewey | 32.56 | 19.08 | 48.54 |
Farmer | Other | 6.98 | 1.46 | 19.06 |
NA | Truman | 43.10 | 30.16 | 56.77 |
NA | Dewey | 51.72 | 38.22 | 65.05 |
NA | Other | 5.17 | 1.08 | 14.38 |
The upper white-collar, lower white-collar, and blue-collar classes are quite distinct with regard to the percentages of support for Truman: the point estimates of the percentages lie outside the confidence intervals of the respective other occupational classes, and the confidence intervals do not even overlap. However, it is not clear whether farmers are distinct from the blue-collar and lower white-collar classes.
Logit modelling of candidate choice
In the following we conduct a logit analysis of the vote for Truman. First, we assign non-standard contrasts to the categorical predictors. Here, the function contr is used to assign treatment (dummy) contrasts to occup4 and total.income with baseline categories 3 and 4, respectively.
vote.48 <- within(vote.48,{
contrasts(occup4) <- contr("treatment",base = 3)
contrasts(total.income) <- contr("treatment",base = 4)
})
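To see what treatment contrasts with a non-default baseline look like, the following base R sketch builds the corresponding contrast matrix for a factor with the four occupational classes. It uses contr.treatment from the stats package rather than memisc's contr; that both yield the same dummy coding is an assumption here, and the object name occup is hypothetical.

```r
# A toy factor with the four occupational classes in the order used above
occup.levels <- c("Upper white collar", "Other white collar",
                  "Blue collar", "Farmer")
occup <- factor(occup.levels, levels = occup.levels)

# Treatment (dummy) contrasts with the third level as baseline
contrasts(occup) <- contr.treatment(levels(occup), base = 3)

# The baseline row ("Blue collar") consists of zeros only; every other
# level gets its own indicator column
contrasts(occup)
```

Each coefficient of a model fitted with these contrasts then compares one class with the blue-collar class.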
We now fit some logistic regression models of the impact of occupational class, income, and religious denomination on the choice to vote for Truman. The contrasts of the occupational class and income factors are such that they compare the choices of the members of the blue-collar class with all other classes and the middle income group ($2000-2999) with the other income groups. The religious denomination factor compares Protestants with Catholics and those with other or no denominations.
model1 <- glm((vote3=="Truman")~occup4,data=vote.48,
family="binomial")
model2 <- glm((vote3=="Truman")~total.income,data=vote.48,
family="binomial")
model3 <- glm((vote3=="Truman")~occup4+total.income,data=vote.48,
family="binomial")
model4 <- glm((vote3=="Truman")~relig3,data=vote.48,
family="binomial")
model5 <- glm((vote3=="Truman")~occup4+relig3,data=vote.48,
family="binomial")
First, we use mtable to construct a comparative table of the estimates of model1, model2, and model3. We thus can compare the impact of occupational class and income on the choice of candidate Truman.
Calls:
model1: glm(formula = (vote3 == "Truman") ~ occup4, family = "binomial",
data = vote.48)
model2: glm(formula = (vote3 == "Truman") ~ total.income, family = "binomial",
data = vote.48)
model3: glm(formula = (vote3 == "Truman") ~ occup4 + total.income, family = "binomial",
data = vote.48)
===============================================================================
model1 model2 model3
-------------------------------------------------------------------------------
(Intercept) 1.047*** 0.708*** 1.316***
(0.184) (0.210) (0.268)
occup4: Upper white collar/Blue collar -2.448*** -2.328***
(0.327) (0.357)
occup4: Other white collar/Blue collar -1.080*** -1.015**
(0.315) (0.323)
occup4: Farmer/Blue collar -0.622 -0.792*
(0.362) (0.383)
total.income: UNDER $500/$2000-2999 -0.708 -0.662
(0.737) (1.056)
total.income: $500-$999/$2000-2999 -0.238 0.912
(0.607) (1.143)
total.income: $1000-1999/$2000-2999 -0.115 0.144
(0.343) (0.440)
total.income: $3000-3999/$2000-2999 -0.807** -0.527
(0.289) (0.338)
total.income: $4000-4999/$2000-2999 -0.875* -0.509
(0.358) (0.411)
total.income: $5000 AND OVER/$2000-2999 -1.470*** -0.535
(0.337) (0.405)
-------------------------------------------------------------------------------
Nagelkerke R-sq. 0.246 0.085 0.274
Deviance 404.190 524.433 390.551
AIC 412.190 538.433 410.551
N 344 398 340
===============================================================================
Significance: *** = p < 0.001; ** = p < 0.01; * = p < 0.05
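Because model1 contains a single categorical predictor, its coefficients can be reproduced from the contingency table of vote3 by occup4 shown earlier. The following base R sketch fits the same logit model to the aggregated counts, with Dewey and Other combined into "not Truman". The data frame agg and its column names are hypothetical, and the counts are re-entered by hand.

```r
occup.levels <- c("Upper white collar", "Other white collar",
                  "Blue collar", "Farmer")

# Aggregated counts from the contingency table of vote3 by occup4
agg <- data.frame(
  occup4     = factor(occup.levels, levels = occup.levels),
  truman     = c(17, 30, 114, 26),
  not.truman = c(67 + 2, 31 + 0, 36 + 4, 14 + 3)
)

# Use the blue-collar class as the reference category
agg$occup4 <- relevel(agg$occup4, ref = "Blue collar")

# Logit model for grouped data: successes and failures per class
fit.agg <- glm(cbind(truman, not.truman) ~ occup4,
               data = agg, family = binomial)
round(coef(fit.agg), 3)
```

The intercept is the log odds of a Truman vote in the blue-collar class, log(114/40), and each further coefficient is the difference between the log odds of one class and those of the blue-collar class.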
mtable returns an object of class "mtable". When formatted, it looks close to the requirements of typical social science publications. Yet at the least we want to change the technical variable names into non-technical ones, for which we can use relabel:
relabel(mtable(
"Model 1"=model1,
"Model 2"=model2,
"Model 3"=model3,
summary.stats=c("Nagelkerke R-sq.","Deviance","AIC","N")),
UNDER="under",
"AND OVER"="and over",
occup4="Occup. class",
total.income="Income",
gsub=TRUE
)
Calls:
Model 1: glm(formula = (vote3 == "Truman") ~ occup4, family = "binomial",
data = vote.48)
Model 2: glm(formula = (vote3 == "Truman") ~ total.income, family = "binomial",
data = vote.48)
Model 3: glm(formula = (vote3 == "Truman") ~ occup4 + total.income, family = "binomial",
data = vote.48)
====================================================================================
Model 1 Model 2 Model 3
------------------------------------------------------------------------------------
(Intercept) 1.047*** 0.708*** 1.316***
(0.184) (0.210) (0.268)
Occup. class: Upper white collar/Blue collar -2.448*** -2.328***
(0.327) (0.357)
Occup. class: Other white collar/Blue collar -1.080*** -1.015**
(0.315) (0.323)
Occup. class: Farmer/Blue collar -0.622 -0.792*
(0.362) (0.383)
Income: under $500/$2000-2999 -0.708 -0.662
(0.737) (1.056)
Income: $500-$999/$2000-2999 -0.238 0.912
(0.607) (1.143)
Income: $1000-1999/$2000-2999 -0.115 0.144
(0.343) (0.440)
Income: $3000-3999/$2000-2999 -0.807** -0.527
(0.289) (0.338)
Income: $4000-4999/$2000-2999 -0.875* -0.509
(0.358) (0.411)
Income: $5000 and over/$2000-2999 -1.470*** -0.535
(0.337) (0.405)
------------------------------------------------------------------------------------
Nagelkerke R-sq. 0.246 0.085 0.274
Deviance 404.190 524.433 390.551
AIC 412.190 538.433 410.551
N 344 398 340
====================================================================================
Significance: *** = p < 0.001; ** = p < 0.01; * = p < 0.05
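The effect of the gsub=TRUE argument can be illustrated with base R: each pattern is replaced wherever it occurs inside a coefficient label, not only where it matches a whole label. The sketch below applies the same substitutions to three of the labels with gsub. That relabel performs literal (fixed) matching internally is an assumption, and the object names are hypothetical.

```r
# Three coefficient labels as they appear in the unrelabelled table
coef.names <- c("occup4: Farmer/Blue collar",
                "total.income: UNDER $500/$2000-2999",
                "total.income: $5000 AND OVER/$2000-2999")

# The substitutions passed to relabel() above, as pattern = replacement
subst <- c("UNDER"        = "under",
           "AND OVER"     = "and over",
           "occup4"       = "Occup. class",
           "total.income" = "Income")

# Apply each substitution in turn, matching the patterns literally
relabelled <- coef.names
for (pattern in names(subst)) {
  relabelled <- gsub(pattern, subst[[pattern]], relabelled, fixed = TRUE)
}
relabelled
```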
The comparison of the pseudo-R-squared values of models 1 and 2 suggests that occupational class has a stronger influence on a preference for Truman than household income. Indeed, if occupational class is taken into account, the effect of income is no longer statistically significant, as the column corresponding to model 3 indicates.
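The pseudo-R-squared of model 1 can be recomputed by hand from the deviances. The sketch below expands the counts of the contingency table of vote3 by occup4 into individual-level data, refits the model with base R, and applies the usual Nagelkerke formula. That mtable computes the statistic in exactly this way is an assumption, although the result agrees with the reported value; all object names are hypothetical.

```r
occup.levels <- c("Upper white collar", "Other white collar",
                  "Blue collar", "Farmer")
n.truman <- c(17, 30, 114, 26)   # Truman voters per class
n.other  <- c(69, 31, 40, 17)    # Dewey and Other voters per class

# One row per respondent: first all Truman voters, then all others
ind <- data.frame(
  occup4 = factor(rep(rep(occup.levels, 2), times = c(n.truman, n.other)),
                  levels = occup.levels),
  truman = rep(c(TRUE, FALSE), times = c(sum(n.truman), sum(n.other)))
)

fit.ind <- glm(truman ~ occup4, data = ind, family = binomial)

n    <- nrow(ind)
dev  <- deviance(fit.ind)        # residual deviance of the model
dev0 <- fit.ind$null.deviance    # deviance of the intercept-only model

# Nagelkerke's R-squared: Cox-Snell R-squared divided by its maximum
r2.nagelkerke <- (1 - exp((dev - dev0) / n)) / (1 - exp(-dev0 / n))
round(c(N = n, Deviance = dev, Nagelkerke = r2.nagelkerke), 3)
```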
Second, we compare the effect of occupational class and religious denomination on the preference for Truman along the same lines as above. We use mtable to collect the estimates of model1, model4, and model5 into a common table.
relabel(mtable(
"Model 1"=model1,
"Model 4"=model4,
"Model 5"=model5,
summary.stats=c("Nagelkerke R-sq.","Deviance","AIC","N")),
occup4="Occup. class",
relig3="Religion",
gsub=TRUE
)
Calls:
Model 1: glm(formula = (vote3 == "Truman") ~ occup4, family = "binomial",
data = vote.48)
Model 4: glm(formula = (vote3 == "Truman") ~ relig3, family = "binomial",
data = vote.48)
Model 5: glm(formula = (vote3 == "Truman") ~ occup4 + relig3, family = "binomial",
data = vote.48)
====================================================================================
Model 1 Model 4 Model 5
------------------------------------------------------------------------------------
(Intercept) 1.047*** -0.213 0.698**
(0.184) (0.126) (0.216)
Occup. class: Upper white collar/Blue collar -2.448*** -2.385***
(0.327) (0.337)
Occup. class: Other white collar/Blue collar -1.080*** -1.098***
(0.315) (0.326)
Occup. class: Farmer/Blue collar -0.622 -0.346
(0.362) (0.374)
Religion: Catholic/Protestant 0.877*** 0.685*
(0.243) (0.292)
Religion: Other,none/Protestant 0.975** 1.191**
(0.347) (0.441)
------------------------------------------------------------------------------------
Nagelkerke R-sq. 0.246 0.060 0.281
Deviance 404.190 537.711 393.105
AIC 412.190 543.711 405.105
N 344 402 344
====================================================================================
Significance: *** = p < 0.001; ** = p < 0.01; * = p < 0.05
A comparison of the pseudo-R-squared values suggests that the effect of religious denomination is also weaker than that of occupational class. However, as the third column in the above table indicates, the effect of religious denomination remains statistically significant.