Data Set Objects — data.set • memisc

"data.set" objects are collections of "item" objects, with semantics similar to those of data frames. They are distinguished from data frames so that coercion by as.data.frame leads to a data frame that contains only vectors and factors. Nevertheless, most methods for data frames are inherited by data sets, except for the method for the within generic function. For the within method for data sets, see the Details section.

Thus data preparation using data sets retains all information about item annotations, labels, missing values etc., while the (mostly automatic) conversion of data sets into data frames makes the data amenable to R's statistical functions.

dsView is a function that displays data sets in much the same way as View displays data frames. (View works with data sets as well, but converts them into data frames first.)

Usage

data.set(...,row.names = NULL, check.rows = FALSE, check.names = TRUE,
    stringsAsFactors = FALSE, document = NULL)
as.data.set(x, row.names=NULL, ...)
# S4 method for class 'list'
as.data.set(x,row.names=NULL,...)
is.data.set(x)
# S3 method for class 'data.set'
as.data.frame(x, row.names = NULL, optional = FALSE, ...)
# S4 method for class 'data.set'
within(data, expr, ...)

dsView(x)

# S4 method for class 'data.set'
head(x,n=20,...)
# S4 method for class 'data.set'
tail(x,n=20,...)

Arguments

...: for the data.set function, several vectors or items; for within, further arguments, which are ignored.
row.names, check.rows, check.names, stringsAsFactors, optional: arguments as in data.frame or as.data.frame, respectively.
document: NULL or an optional character vector that contains documentation of the data.
x: for is.data.set(x), any object; for as.data.frame(x,...) and dsView(x) a "data.set" object.
data: a data set, that is, an object of class "data.set".
expr: an expression, or several expressions enclosed in curly braces.
n: integer; the number of rows to be shown by head or tail.
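As a minimal sketch of the n argument, the following builds a small data set and takes its first and last three rows. It assumes the memisc package is installed; the variable names id and score are made up for illustration.

```r
library(memisc)

# A toy data set with 30 observations (hypothetical variables)
ds <- data.set(
  id    = 1:30,
  score = seq(0.5, 15, by = 0.5)
)

# Show only the first and the last three rows instead of the default 20
first3 <- head(ds, n = 3)
last3  <- tail(ds, n = 3)

first3
last3
```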

Details

The as.data.frame method for data sets is just a copy of the method for list. Consequently, all items in the data set are coerced in accordance with their measurement setting, see as.vector,item-method and measurement.
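The following sketch illustrates this coercion on a toy data set. It assumes memisc is installed and that, as in the Examples section, a nominal item becomes a factor, a ratio item becomes a numeric vector, and declared missing values become NA. The variable names and values are made up for illustration.

```r
library(memisc)

ds <- data.set(
  vote   = c(1, 2, 8, 1),
  income = c(1000, 2000, 1500, 3000)
)

ds <- within(ds, {
  measurement(vote)    <- "nominal"
  measurement(income)  <- "ratio"
  labels(vote)         <- c(Conservatives = 1, Labour = 2, "Don't know" = 8)
  missing.values(vote) <- 8   # code 8 is treated as missing
})

# Coerce to an ordinary data frame; item metadata is not carried over
df <- as.data.frame(ds)
str(df)
```

The third observation of vote carries the code declared as missing, so it should show up as NA in the data frame.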

The within method for data sets has the same effect as the within method for data frames, apart from two differences: results of the computations that have the appropriate length are coerced into items, and results that do not are dropped automatically (with a warning).
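A small sketch of this behaviour, assuming memisc is installed (the variable names a, b and junk are made up for illustration): a result with one value per observation is kept, while a result of any other length is removed.

```r
library(memisc)

ds <- data.set(a = 1:4)

ds2 <- within(ds, {
  b    <- c(10, 20, 30, 40)  # length 4, matches the number of observations: kept
  junk <- 1:3                # wrong length: dropped, a warning is expected
})

names(ds2)
```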

Besides the method for lists shown in the Usage section, a method for the generic function as.data.set is currently defined for "importer" objects.
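The Usage section lists an as.data.set method for lists. A minimal sketch of how it might be used, assuming memisc is installed and that the list elements are vectors of equal length (the names raw, id and score are made up for illustration):

```r
library(memisc)

# A plain named list of equal-length vectors (hypothetical data)
raw <- list(
  id    = 1:3,
  score = c(2.5, 3.5, 4.5)
)

# Convert the list into a data set and check the result
ds <- as.data.set(raw)
is.data.set(ds)
ds
```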

Value

data.set and the within method for data sets return a "data.set" object, is.data.set returns a logical value, and as.data.frame returns a data frame.
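These return types can be checked directly. The sketch below assumes memisc is installed; the variable names a and b are made up for illustration.

```r
library(memisc)

ds  <- data.set(a = 1:3)
ds2 <- within(ds, b <- c(2, 4, 6))
df  <- as.data.frame(ds2)

is.data.set(ds)    # result of data.set()
is.data.set(ds2)   # result of the within method
is.data.set(df)    # an ordinary data frame is not a data set
is.data.frame(df)  # result of as.data.frame()
```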

Examples

Data <- data.set(
          vote = sample(c(1,2,3,8,9,97,99),size=300,replace=TRUE),
          region = sample(c(rep(1,3),rep(2,2),3,99),size=300,replace=TRUE),
          income = exp(rnorm(300,sd=.7))*2000
          )

Data <- within(Data,{
  description(vote) <- "Vote intention"
  description(region) <- "Region of residence"
  description(income) <- "Household income"
  wording(vote) <- "If a general election would take place next Tuesday,
                    the candidate of which party would you vote for?"
  wording(income) <- "All things taken into account, how much do all
                    household members earn in sum?"
  foreach(x=c(vote,region),{
    measurement(x) <- "nominal"
    })
  measurement(income) <- "ratio"
  labels(vote) <- c(
                    Conservatives         =  1,
                    Labour                =  2,
                    "Liberal Democrats"   =  3,
                    "Don't know"          =  8,
                    "Answer refused"      =  9,
                    "Not applicable"      = 97,
                    "Not asked in survey" = 99)
  labels(region) <- c(
                    England               =  1,
                    Scotland              =  2,
                    Wales                 =  3,
                    "Not applicable"      = 97,
                    "Not asked in survey" = 99)
  foreach(x=c(vote,region,income),{
    annotation(x)["Remark"] <- "This is not a real survey item, of course ..."
    })
  missing.values(vote) <- c(8,9,97,99)
  missing.values(region) <- c(97,99)

  # These two variables do not appear in the
  # resulting data set, since they have the wrong length.
  junk1 <- 1:5
  junk2 <- matrix(5,4,4)
  
})
#> Warning: Variables 'junk1','junk2' have wrong length, removing them.
# Since data sets may be huge, only a
# part of them is 'show'n
Data
#> 
#> Data set with 300 observations and 3 variables
#> 
#>                    vote               region    income
#>  1    Liberal Democrats              England 2799.8996
#>  2    Liberal Democrats                Wales 8209.4580
#>  3          *Don't know              England 4618.5151
#>  4               Labour             Scotland 3166.8900
#>  5    Liberal Democrats             Scotland 4102.3323
#>  6 *Not asked in survey                Wales 2878.5366
#>  7      *Answer refused                Wales 4382.3194
#>  8    Liberal Democrats              England 2646.0701
#>  9 *Not asked in survey                Wales 1003.9857
#> 10      *Answer refused             Scotland 1406.8507
#> 11      *Not applicable             Scotland 3991.4361
#> 12    Liberal Democrats             Scotland 9273.7135
#> 13    Liberal Democrats              England 1781.7885
#> 14      *Answer refused              England 1237.2840
#> 15          *Don't know                Wales 3865.6381
#> 16      *Answer refused             Scotland 1783.0354
#> 17               Labour *Not asked in survey  803.7312
#> 18    Liberal Democrats              England 1205.2757
#> 19      *Answer refused *Not asked in survey 5292.0691
#> 20      *Not applicable                Wales 3223.3965
#> 21               Labour              England 2775.7733
#> 22        Conservatives             Scotland 2685.2986
#> 23 *Not asked in survey              England 1142.1532
#> 24               Labour *Not asked in survey 1420.8097
#> 25      *Answer refused             Scotland 2916.6927
#> .. .................... .................... .........
#> (25 of 300 observations shown)

if (FALSE) { # \dontrun{

# If we insist on seeing all, we can use 'print' instead
print(Data)
} # }

str(Data)
#> Data set with 300 obs. of 3 variables:
#>  $ vote  : Nmnl. item w/ 7 labels for 1,2,3,... + ms.v.  num  3 3 8 2 3 99 9 3 99 9 ...
#>  $ region: Nmnl. item w/ 5 labels for 1,2,3,... + ms.v.  num  1 3 1 2 2 3 3 1 3 2 ...
#>  $ income: Rto. item  num  2800 8209 4619 3167 4102 ...
summary(Data)
#>                    vote                     region        income       
#>  Conservatives       :33   England             :144   Min.   :  367.6  
#>  Labour              :44   Scotland            : 84   1st Qu.: 1286.6  
#>  Liberal Democrats   :43   Wales               : 44   Median : 1946.9  
#>  *Don't know         :39   *Not asked in survey: 28   Mean   : 2489.7  
#>  *Answer refused     :50                              3rd Qu.: 3051.9  
#>  *Not applicable     :43                              Max.   :11977.1  
#>  *Not asked in survey:48                                               

if (FALSE) { # \dontrun{
# If we want to 'View' a data set we can use 'dsView'
dsView(Data)
# Works also, but changes the data set into a data frame first:
View(Data)
} # }

Data[[1]]
#> 
#> Item 'Vote intention' (measurement: nominal, type: double, length = 300) 
#> 
#>  [1:300] Liberal Democrats Liberal Democrats *Don't know Labour ...
Data[1,]
#> 
#> Data set with 1 observations and 3 variables
#> 
#>                vote  region income
#> 1 Liberal Democrats England 2799.9
head(as.data.frame(Data))
#>                vote   region   income
#> 1 Liberal Democrats  England 2799.900
#> 2 Liberal Democrats    Wales 8209.458
#> 3              <NA>  England 4618.515
#> 4            Labour Scotland 3166.890
#> 5 Liberal Democrats Scotland 4102.332
#> 6              <NA>    Wales 2878.537

EnglandData <- subset(Data,region == "England")
EnglandData
#> 
#> Data set with 144 observations and 3 variables
#> 
#>                    vote  region    income
#>  1    Liberal Democrats England 2799.8996
#>  2          *Don't know England 4618.5151
#>  3    Liberal Democrats England 2646.0701
#>  4    Liberal Democrats England 1781.7885
#>  5      *Answer refused England 1237.2840
#>  6    Liberal Democrats England 1205.2757
#>  7               Labour England 2775.7733
#>  8 *Not asked in survey England 1142.1532
#>  9      *Answer refused England 1665.0582
#> 10          *Don't know England 1395.8343
#> 11          *Don't know England 3434.1108
#> 12    Liberal Democrats England  977.1105
#> 13          *Don't know England 2221.9989
#> 14      *Not applicable England 1153.2038
#> 15      *Not applicable England 1351.7092
#> 16    Liberal Democrats England  755.4447
#> 17      *Not applicable England 1428.9611
#> 18               Labour England 1015.9381
#> 19 *Not asked in survey England  413.6085
#> 20 *Not asked in survey England 3459.1476
#> 21          *Don't know England 1157.0507
#> 22               Labour England 4070.6896
#> 23        Conservatives England 2458.1784
#> 24               Labour England 1625.1921
#> 25      *Not applicable England 1124.7315
#> .. .................... ....... .........
#> (25 of 144 observations shown)

xtabs(~vote+region,data=Data)
#>                    region
#> vote                England Scotland Wales
#>   Conservatives          14       12     4
#>   Labour                 20       13     5
#>   Liberal Democrats      25       12     5
xtabs(~vote+region,data=within(Data, vote <- include.missings(vote)))
#>                       region
#> vote                   England Scotland Wales
#>   Conservatives             14       12     4
#>   Labour                    20       13     5
#>   Liberal Democrats         25       12     5
#>   *Don't know               18       12     6
#>   *Answer refused           21       15     9
#>   *Not applicable           24       10     5
#>   *Not asked in survey      22       10    10