Data Set Objects
"data.set"
objects are collections of "item"
objects,
with similar semantics as data frames. They are distinguished
from data frames so that coercion by as.data.fame
leads to a data frame that contains only vectors and factors.
Nevertheless most methods for data frames are inherited by
data sets, except for the method for the within
generic
function. For the within
method for data sets, see the details section.
Thus data preparation using data sets retains all information about item annotations, labels, missing values, etc., while the (mostly automatic) conversion of data sets into data frames makes the data amenable to R's statistical functions.
dsView is a function that displays data sets in a manner similar to the way View displays data frames. (View works with data sets as well, but converts them into data frames first.)
Usage
data.set(...,row.names = NULL, check.rows = FALSE, check.names = TRUE,
stringsAsFactors = FALSE, document = NULL)
as.data.set(x, row.names=NULL, ...)
# S4 method for class 'list'
as.data.set(x,row.names=NULL,...)
is.data.set(x)
# S3 method for class 'data.set'
as.data.frame(x, row.names = NULL, optional = FALSE, ...)
# S4 method for class 'data.set'
within(data, expr, ...)
dsView(x)
# S4 method for class 'data.set'
head(x,n=20,...)
# S4 method for class 'data.set'
tail(x,n=20,...)
Arguments
- ...
For the data.set function, several vectors or items; for within, further arguments, which are ignored.
- row.names, check.rows, check.names, stringsAsFactors, optional
arguments as in data.frame or as.data.frame, respectively.
- document
NULL or an optional character vector that contains documentation of the data.
- x
for is.data.set(x), any object; for as.data.frame(x,...) and dsView(x), a "data.set" object.
- data
a data set, that is, an object of class "data.set".
- expr
an expression, or several expressions enclosed in curly braces.
- n
integer; the number of rows to be shown by head or tail.
Details
The as.data.frame method for data sets is just a copy of the method for list. Consequently, all items in the data set are coerced in accordance with their measurement setting; see as.vector,item-method and measurement.
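The coercion described above can be sketched as follows. This is a minimal illustration rather than part of the original examples: the variable names and values are made up, and it assumes that the package providing data.set (memisc) is installed. Under these assumptions, a labelled nominal item is expected to become a factor, with the declared missing value turned into NA, while a ratio item becomes a numeric vector.

```r
library(memisc)  # assumption: the package that provides data.set()

# Hypothetical toy data: 'party' is a coded categorical variable,
# 'age' is a plain numeric one.
ds <- data.set(
  party = c(1, 2, 9, 1),
  age   = c(34, 51, 27, 45)
)
ds <- within(ds, {
  measurement(party) <- "nominal"
  measurement(age)   <- "ratio"
  labels(party) <- c(Left = 1, Right = 2, "No answer" = 9)
  missing.values(party) <- 9
})

df <- as.data.frame(ds)
sapply(df, class)  # inspect the classes of the resulting columns
df$party           # the value declared as missing should appear as NA
```

Changing the measurement setting of an item before the conversion is therefore the way to control which kind of column it ends up as.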
The within method for data sets has the same effect as the within method for data frames, apart from two differences: all results of the computations are coerced into items if they have the appropriate length; otherwise, they are automatically dropped.
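A small sketch of these two differences, using made-up variable names and assuming the memisc package is installed: a computed variable of matching length is kept as an item, while one of the wrong length is removed (a warning is expected).

```r
library(memisc)  # assumption: the package that provides data.set()

ds <- data.set(x = c(1, 2, 3, 4))

# A warning about the dropped variable is expected here.
ds2 <- within(ds, {
  y <- x * 2   # length 4, matches the data set: kept and coerced into an item
  z <- 1:3     # wrong length: dropped from the result
})

names(ds2)                       # shows which variables survived
methods::is(ds2[["y"]], "item")  # checks whether 'y' is an item
```

Dropping rather than recycling means that temporary helper objects created inside the expression do not silently end up as variables in the data set.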
Currently only one method for the generic function as.data.set is defined: a method for "importer" objects.
Value
data.set and the within method for data sets return a "data.set" object, is.data.set returns a logical value, and as.data.frame returns a data frame.
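The return values listed above can be checked with a short sketch (toy data, assuming the memisc package is installed): the object created by data.set should be recognised by is.data.set, while the result of as.data.frame should be an ordinary data frame and no longer a data set.

```r
library(memisc)  # assumption: the package that provides data.set()

ds <- data.set(x = 1:3)
df <- as.data.frame(ds)

is.data.set(ds)    # test on the data set itself
is.data.set(df)    # test on the converted data frame
is.data.frame(df)  # the conversion result is a plain data frame
```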
Examples
Data <- data.set(
vote = sample(c(1,2,3,8,9,97,99),size=300,replace=TRUE),
region = sample(c(rep(1,3),rep(2,2),3,99),size=300,replace=TRUE),
income = exp(rnorm(300,sd=.7))*2000
)
Data <- within(Data,{
description(vote) <- "Vote intention"
description(region) <- "Region of residence"
description(income) <- "Household income"
wording(vote) <- "If a general election would take place next Tuesday,
the candidate of which party would you vote for?"
wording(income) <- "All things taken into account, how much do all
household members earn in sum?"
foreach(x=c(vote,region),{
measurement(x) <- "nominal"
})
measurement(income) <- "ratio"
labels(vote) <- c(
Conservatives = 1,
Labour = 2,
"Liberal Democrats" = 3,
"Don't know" = 8,
"Answer refused" = 9,
"Not applicable" = 97,
"Not asked in survey" = 99)
labels(region) <- c(
England = 1,
Scotland = 2,
Wales = 3,
"Not applicable" = 97,
"Not asked in survey" = 99)
foreach(x=c(vote,region,income),{
annotation(x)["Remark"] <- "This is not a real survey item, of course ..."
})
missing.values(vote) <- c(8,9,97,99)
missing.values(region) <- c(97,99)
# These two variables do not appear in the
# resulting data set, since they have the wrong length.
junk1 <- 1:5
junk2 <- matrix(5,4,4)
})
#> Warning: Variables 'junk1','junk2' have wrong length, removing them.
# Since data sets may be huge, only a
# part of them is 'show'n
Data
#>
#> Data set with 300 observations and 3 variables
#>
#> vote region income
#> 1 Liberal Democrats England 2799.8996
#> 2 Liberal Democrats Wales 8209.4580
#> 3 *Don't know England 4618.5151
#> 4 Labour Scotland 3166.8900
#> 5 Liberal Democrats Scotland 4102.3323
#> 6 *Not asked in survey Wales 2878.5366
#> 7 *Answer refused Wales 4382.3194
#> 8 Liberal Democrats England 2646.0701
#> 9 *Not asked in survey Wales 1003.9857
#> 10 *Answer refused Scotland 1406.8507
#> 11 *Not applicable Scotland 3991.4361
#> 12 Liberal Democrats Scotland 9273.7135
#> 13 Liberal Democrats England 1781.7885
#> 14 *Answer refused England 1237.2840
#> 15 *Don't know Wales 3865.6381
#> 16 *Answer refused Scotland 1783.0354
#> 17 Labour *Not asked in survey 803.7312
#> 18 Liberal Democrats England 1205.2757
#> 19 *Answer refused *Not asked in survey 5292.0691
#> 20 *Not applicable Wales 3223.3965
#> 21 Labour England 2775.7733
#> 22 Conservatives Scotland 2685.2986
#> 23 *Not asked in survey England 1142.1532
#> 24 Labour *Not asked in survey 1420.8097
#> 25 *Answer refused Scotland 2916.6927
#> .. .................... .................... .........
#> (25 of 300 observations shown)
if (FALSE) { # \dontrun{
# If we insist on seeing all, we can use 'print' instead
print(Data)
} # }
str(Data)
#> Data set with 300 obs. of 3 variables:
#> $ vote : Nmnl. item w/ 7 labels for 1,2,3,... + ms.v. num 3 3 8 2 3 99 9 3 99 9 ...
#> $ region: Nmnl. item w/ 5 labels for 1,2,3,... + ms.v. num 1 3 1 2 2 3 3 1 3 2 ...
#> $ income: Rto. item num 2800 8209 4619 3167 4102 ...
summary(Data)
#> vote region income
#> Conservatives :33 England :144 Min. : 367.6
#> Labour :44 Scotland : 84 1st Qu.: 1286.6
#> Liberal Democrats :43 Wales : 44 Median : 1946.9
#> *Don't know :39 *Not asked in survey: 28 Mean : 2489.7
#> *Answer refused :50 3rd Qu.: 3051.9
#> *Not applicable :43 Max. :11977.1
#> *Not asked in survey:48
if (FALSE) { # \dontrun{
# If we want to 'View' a data set we can use 'dsView'
dsView(Data)
# Works also, but changes the data set into a data frame first:
View(Data)
} # }
Data[[1]]
#>
#> Item 'Vote intention' (measurement: nominal, type: double, length = 300)
#>
#> [1:300] Liberal Democrats Liberal Democrats *Don't know Labour ...
Data[1,]
#>
#> Data set with 1 observations and 3 variables
#>
#> vote region income
#> 1 Liberal Democrats England 2799.9
head(as.data.frame(Data))
#> vote region income
#> 1 Liberal Democrats England 2799.900
#> 2 Liberal Democrats Wales 8209.458
#> 3 <NA> England 4618.515
#> 4 Labour Scotland 3166.890
#> 5 Liberal Democrats Scotland 4102.332
#> 6 <NA> Wales 2878.537
EnglandData <- subset(Data,region == "England")
EnglandData
#>
#> Data set with 144 observations and 3 variables
#>
#> vote region income
#> 1 Liberal Democrats England 2799.8996
#> 2 *Don't know England 4618.5151
#> 3 Liberal Democrats England 2646.0701
#> 4 Liberal Democrats England 1781.7885
#> 5 *Answer refused England 1237.2840
#> 6 Liberal Democrats England 1205.2757
#> 7 Labour England 2775.7733
#> 8 *Not asked in survey England 1142.1532
#> 9 *Answer refused England 1665.0582
#> 10 *Don't know England 1395.8343
#> 11 *Don't know England 3434.1108
#> 12 Liberal Democrats England 977.1105
#> 13 *Don't know England 2221.9989
#> 14 *Not applicable England 1153.2038
#> 15 *Not applicable England 1351.7092
#> 16 Liberal Democrats England 755.4447
#> 17 *Not applicable England 1428.9611
#> 18 Labour England 1015.9381
#> 19 *Not asked in survey England 413.6085
#> 20 *Not asked in survey England 3459.1476
#> 21 *Don't know England 1157.0507
#> 22 Labour England 4070.6896
#> 23 Conservatives England 2458.1784
#> 24 Labour England 1625.1921
#> 25 *Not applicable England 1124.7315
#> .. .................... ....... .........
#> (25 of 144 observations shown)
xtabs(~vote+region,data=Data)
#> region
#> vote England Scotland Wales
#> Conservatives 14 12 4
#> Labour 20 13 5
#> Liberal Democrats 25 12 5
xtabs(~vote+region,data=within(Data, vote <- include.missings(vote)))
#> region
#> vote England Scotland Wales
#> Conservatives 14 12 4
#> Labour 20 13 5
#> Liberal Democrats 25 12 5
#> *Don't know 18 12 6
#> *Answer refused 21 15 9
#> *Not applicable 24 10 5
#> *Not asked in survey 22 10 10