
"data.set" objects are collections of "item" objects, with semantics similar to those of data frames. They are distinguished from data frames so that coercion by as.data.frame leads to a data frame that contains only vectors and factors. Nevertheless, most methods for data frames are inherited by data sets, except for the method for the within generic function. For the within method for data sets, see the Details section.

Thus data preparation using data sets retains all information about item annotations, labels, missing values, etc., while the (mostly automatic) conversion of data sets into data frames makes the data amenable to R's statistical functions.

dsView is a function that displays data sets in much the same way that View displays data frames. (View works with data sets as well, but first converts them into data frames.)

Usage

data.set(...,row.names = NULL, check.rows = FALSE, check.names = TRUE,
    stringsAsFactors = FALSE, document = NULL)
as.data.set(x, row.names=NULL, ...)
# S4 method for list
as.data.set(x,row.names=NULL,...)
is.data.set(x)
# S3 method for data.set
as.data.frame(x, row.names = NULL, optional = FALSE, ...)
# S4 method for data.set
within(data, expr, ...)

dsView(x)

# S4 method for data.set
head(x,n=20,...)
# S4 method for data.set
tail(x,n=20,...)

Arguments

...

For the data.set function: several vectors or items. For within: further arguments, which are ignored.

row.names, check.rows, check.names, stringsAsFactors, optional

arguments as in data.frame or as.data.frame, respectively.

document

NULL or an optional character vector that contains documentation of the data.

x

for is.data.set(x), any object; for as.data.frame(x,...) and dsView(x) a "data.set" object.

data

a data set, that is, an object of class "data.set".

expr

an expression, or several expressions enclosed in curly braces.

n

integer; the number of rows to be shown by head or tail

Details

The as.data.frame method for data sets is just a copy of the method for list. Consequently, all items in the data set are coerced in accordance with their measurement setting; see as.vector,item-method and measurement.
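The coercion rule above can be illustrated with a minimal sketch. It assumes the memisc package is installed; the variable names and values are made up for illustration and are not part of the examples further down. A nominal item is expected to become a factor and a ratio item a numeric vector, with declared missing values turned into NA.

```r
library(memisc)

# Hypothetical toy data, for illustration only
ds <- data.set(
  party = c(1, 2, 1, 9),
  age   = c(34, 51, 28, 47)
)

ds <- within(ds, {
  measurement(party) <- "nominal"
  measurement(age)   <- "ratio"
  labels(party) <- c(Left = 1, Right = 2, "No answer" = 9)
  missing.values(party) <- 9   # code 9 is declared missing
})

# Coercion follows the measurement setting of each item
df <- as.data.frame(ds)
sapply(df, class)   # inspect the resulting column classes
is.na(df$party)     # the declared missing value should show up as NA
```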

The within method for data sets has the same effect as the within method for data frames, apart from two differences: results of the computations are coerced into items if they have the appropriate length; otherwise, they are dropped automatically.
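A small sketch of this behaviour, assuming memisc is installed; the names x, y and tmp are hypothetical. A new variable whose length matches the number of observations is kept, while one of a different length is removed with a warning.

```r
library(memisc)

# Hypothetical data set with six observations
ds <- data.set(x = 1:6)

ds2 <- within(ds, {
  y   <- 6:1   # length matches the number of rows: kept
  tmp <- 1:2   # wrong length: dropped, a warning is expected
})

names(ds2)     # 'tmp' should not be among the variable names
```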

Currently only one method for the generic function as.data.set is defined: a method for "importer" objects.

Value

data.set and the within method for data sets return a "data.set" object, is.data.set returns a logical value, and as.data.frame returns a data frame.
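The return values can be checked directly. This is a minimal sketch assuming memisc is installed; the object names are hypothetical.

```r
library(memisc)

# Hypothetical two-variable data set
ds <- data.set(a = 1:3, b = c(2.5, 3.5, 4.5))
df <- as.data.frame(ds)

is.data.set(ds)     # a data set created by data.set()
is.data.set(df)     # a plain data frame is not a data set
is.data.frame(df)   # as.data.frame() yields a data frame
```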

Examples

Data <- data.set(
          vote = sample(c(1,2,3,8,9,97,99),size=300,replace=TRUE),
          region = sample(c(rep(1,3),rep(2,2),3,99),size=300,replace=TRUE),
          income = exp(rnorm(300,sd=.7))*2000
          )

Data <- within(Data,{
  description(vote) <- "Vote intention"
  description(region) <- "Region of residence"
  description(income) <- "Household income"
  wording(vote) <- "If a general election would take place next tuesday,
                    the candidate of which party would you vote for?"
  wording(income) <- "All things taken into account, how much do all
                    household members earn in sum?"
  foreach(x=c(vote,region),{
    measurement(x) <- "nominal"
    })
  measurement(income) <- "ratio"
  labels(vote) <- c(
                    Conservatives         =  1,
                    Labour                =  2,
                    "Liberal Democrats"   =  3,
                    "Don't know"          =  8,
                    "Answer refused"      =  9,
                    "Not applicable"      = 97,
                    "Not asked in survey" = 99)
  labels(region) <- c(
                    England               =  1,
                    Scotland              =  2,
                    Wales                 =  3,
                    "Not applicable"      = 97,
                    "Not asked in survey" = 99)
  foreach(x=c(vote,region,income),{
    annotation(x)["Remark"] <- "This is not a real survey item, of course ..."
    })
  missing.values(vote) <- c(8,9,97,99)
  missing.values(region) <- c(97,99)

  # These two variables do not appear in the
  # resulting data set, since they have the wrong length.
  junk1 <- 1:5
  junk2 <- matrix(5,4,4)
  
})
#> Warning: Variables 'junk1','junk2' have wrong length, removing them.
# Since data sets may be huge, only
# part of a data set is 'show'n
Data
#> 
#> Data set with 300 observations and 3 variables
#> 
#>                    vote               region    income
#>  1      *Answer refused              England 3084.2687
#>  2      *Answer refused              England 1892.9557
#>  3      *Not applicable              England 1156.5737
#>  4      *Answer refused *Not asked in survey 2917.6312
#>  5      *Not applicable                Wales 2867.5674
#>  6      *Answer refused              England 1311.5318
#>  7          *Don't know              England  325.2048
#>  8      *Answer refused              England 1102.6051
#>  9      *Not applicable             Scotland  759.3739
#> 10      *Answer refused *Not asked in survey 4143.9835
#> 11        Conservatives             Scotland  879.8037
#> 12    Liberal Democrats *Not asked in survey  940.6910
#> 13 *Not asked in survey                Wales  782.6869
#> 14 *Not asked in survey             Scotland 4149.1683
#> 15    Liberal Democrats *Not asked in survey 2667.0581
#> 16               Labour              England  892.5669
#> 17               Labour                Wales  696.1730
#> 18               Labour              England 1324.5828
#> 19      *Not applicable              England 2691.7366
#> 20        Conservatives              England 3043.3218
#> 21 *Not asked in survey                Wales 4037.6429
#> 22        Conservatives              England 1462.9063
#> 23 *Not asked in survey              England 7149.0187
#> 24        Conservatives              England 2522.0721
#> 25 *Not asked in survey              England 5090.6997
#> .. .................... .................... .........
#> (25 of 300 observations shown)

if (FALSE) {

# If we insist on seeing all, we can use 'print' instead
print(Data)
}

str(Data)
#> Data set with 300 obs. of 3 variables:
#>  $ vote  : Nmnl. item w/ 7 labels for 1,2,3,... + ms.v.  num  9 9 97 9 97 9 8 9 97 9 ...
#>  $ region: Nmnl. item w/ 5 labels for 1,2,3,... + ms.v.  num  1 1 1 99 3 1 1 1 2 99 ...
#>  $ income: Rto. item  num  3084 1893 1157 2918 2868 ...
summary(Data)
#>                    vote                     region        income       
#>  Conservatives       :51   England             :128   Min.   :  315.5  
#>  Labour              :30   Scotland            : 82   1st Qu.: 1244.5  
#>  Liberal Democrats   :49   Wales               : 44   Median : 2113.7  
#>  *Don't know         :44   *Not asked in survey: 46   Mean   : 2746.1  
#>  *Answer refused     :40                              3rd Qu.: 3432.0  
#>  *Not applicable     :39                              Max.   :12913.8  
#>  *Not asked in survey:47                                               

if (FALSE) {
# If we want to 'View' a data set we can use 'dsView'
dsView(Data)
# Works also, but changes the data set into a data frame first:
View(Data)
}

Data[[1]]
#> 
#> Item 'Vote intention' (measurement: nominal, type: double, length = 300) 
#> 
#>  [1:300] *Answer refused *Answer refused *Not applicable *Answer refused ...
Data[1,]
#> 
#> Data set with 1 observations and 3 variables
#> 
#>              vote  region   income
#> 1 *Answer refused England 3084.269
head(as.data.frame(Data))
#>   vote  region   income
#> 1 <NA> England 3084.269
#> 2 <NA> England 1892.956
#> 3 <NA> England 1156.574
#> 4 <NA>    <NA> 2917.631
#> 5 <NA>   Wales 2867.567
#> 6 <NA> England 1311.532

EnglandData <- subset(Data,region == "England")
EnglandData
#> 
#> Data set with 128 observations and 3 variables
#> 
#>                    vote  region    income
#>  1      *Answer refused England 3084.2687
#>  2      *Answer refused England 1892.9557
#>  3      *Not applicable England 1156.5737
#>  4      *Answer refused England 1311.5318
#>  5          *Don't know England  325.2048
#>  6      *Answer refused England 1102.6051
#>  7               Labour England  892.5669
#>  8               Labour England 1324.5828
#>  9      *Not applicable England 2691.7366
#> 10        Conservatives England 3043.3218
#> 11        Conservatives England 1462.9063
#> 12 *Not asked in survey England 7149.0187
#> 13        Conservatives England 2522.0721
#> 14 *Not asked in survey England 5090.6997
#> 15 *Not asked in survey England 1507.5981
#> 16          *Don't know England 2023.7406
#> 17        Conservatives England 1493.3919
#> 18               Labour England 3029.6432
#> 19 *Not asked in survey England 2465.3593
#> 20          *Don't know England 4519.6520
#> 21        Conservatives England  984.0079
#> 22      *Answer refused England 2054.7803
#> 23        Conservatives England 7624.3885
#> 24      *Answer refused England  638.0687
#> 25               Labour England 2945.0936
#> .. .................... ....... .........
#> (25 of 128 observations shown)

xtabs(~vote+region,data=Data)
#>                    region
#> vote                England Scotland Wales
#>   Conservatives          22       13     6
#>   Labour                 14        7     6
#>   Liberal Democrats      18       14     7
xtabs(~vote+region,data=within(Data, vote <- include.missings(vote)))
#>                       region
#> vote                   England Scotland Wales
#>   Conservatives             22       13     6
#>   Labour                    14        7     6
#>   Liberal Democrats         18       14     7
#>   *Don't know               22        8     8
#>   *Answer refused           20       12     4
#>   *Not applicable           12       12     8
#>   *Not asked in survey      20       16     5