Compare data frames and Plot Differences
uni_compare.Rd
Returns data or a plot showing the difference of two or more data frames The differences are calculated on the base of differing metrics, chosen in the funct argument. All used data frames must contain at least one column named equal in all data frames, that has equal values.
Usage
uni_compare(
dfs,
benchmarks,
variables = NULL,
nboots = 2000,
n_bench = NULL,
boot_all = FALSE,
funct = "rel_mean",
data = TRUE,
type = "comparison",
legendlabels = NULL,
legendtitle = NULL,
colors = NULL,
shapes = NULL,
summetric = "rmse2",
label_x = NULL,
label_y = NULL,
plot_title = NULL,
varlabels = NULL,
name_dfs = NULL,
name_benchmarks = NULL,
summet_size = 4,
silence = TRUE,
conf_level = 0.95,
conf_adjustment = NULL,
percentile_ci = TRUE,
weight = NULL,
id = NULL,
strata = NULL,
weight_bench = NULL,
id_bench = NULL,
strata_bench = NULL,
adjustment_weighting = "raking",
adjustment_vars = NULL,
raking_targets = NULL,
post_targets = NULL,
ndigits = 3,
parallel = FALSE
)
Arguments
- dfs
A character vector containing the names of data frames to compare against the benchmarks.
- benchmarks
A character vector containing the names of benchmarks to compare the data frames against. The vector must either be the same length as
dfs
, or length 1. If it has length 1 every df will be compared against the same benchmark. Benchmarks can either be the name of data frames, the name of a list of tables, or a named vector of means. The tables in the list need to be named as the respective variables in the data frame of comparison. When they are a named vector of means, the means need to be named as the respective variables in the dfs.- variables
A character vector containing the names of the variables for the comparison. If NULL, all variables named similarly in both the
dfs
and the benchmarks will be compared. Variables missing in one of the data frames or the benchmarks will be neglected for this comparison.- nboots
The number of bootstraps used to calculate standard errors. Must either be >2 or 0. If >2 bootstrapping is used to calculate standard errors with
nboots
iterations. If 0, SE is calculated analytically. We do not recommend usingnboots
=0 because this method is not yet suitable for everyfunct
used and every method. Depending on the size of the data and the number of bootstraps,uni_compare
can take a while.- n_bench
A list of vectors containing the number of cases for every variable in the benchmark. This is only needed, if the benchmark is given as a vector. The list should be as long as the number of dataframes
- boot_all
If TURE, both, dfs and benchmarks will be bootstrapped. Otherwise the benchmark estimate is assumed to be constant.
- funct
A character string, indicating the function to calculate the difference between the data frames.
Predefined functions are:
"d_mean"
,"ad_mean"
A function to calculate the (absolute) difference in mean of the variables indfs
and benchmarks with the same name. Only applicable for metric variables."d_prop"
,"ad_prop"
A function to calculate the (absolute) difference in proportions of the variables indfs
and benchmarks with the same name. Only applicable for dummy variables."rel_mean"
,"abs_rel_mean"
A function to calculate the (absolute) relative difference in mean of the variables indfs
and benchmarks with the same name. #' For more information on the formula for difference and analytic variance, see Felderer et al. (2019). Only applicable for metric variables."rel_prop"
,"abs_rel_prop"
A function to calculate the (absolute) relative difference in proportions of the variables indfs
and benchmarks with the same name. It is calculated similar to the relative difference in mean (see Felderer et al., 2019), however the default label for the plot is different. Only applicable for dummy variables.
- data
If TRUE, a uni_compare_object is returned, containing results of the comparison.
- type
Define the type of comparison. Can either be
"comparison"
or"nonresponse"
.- legendlabels
A character string or vector of strings containing a label for the legend.
- legendtitle
A character string containing the title of the legend.
- colors
A vector of colors, that is used in the plot for the different comparisons.
- shapes
A vector of shapes applicable in
ggplot2::ggplot2()
, that is used in the plot for the different comparisons.- summetric
If
"avg1"
,"mse1"
,"rmse1"
, or"R"
the respective measure is calculated for the biases of each survey. The values"mse1"
and"rmse1"
lead to similar results as in"mse2"
and"rmse2"
, with slightly different visualization in the plot. Ifsummetric = NULL
, no summetric will be displayed in the Plot. When"R"
is chosen, alsoresponse_identificator
is needed.- label_x, label_y
A character string or vector of character strings containing a label for the x-axis and y-axis.
- plot_title
A character string containing the title of the plot.
- varlabels
A character string or vector of character strings containing the new names of variables, also used in plot.
- name_dfs, name_benchmarks
A character string or vector of character strings containing the new names of the
dfs
andbenchmarks
, that is also used in plot.- summet_size
A number to determine the size of the displayed
summetric
in the plot.- silence
If
silence = FALSE
a warning will be displayed, if variables a re excluded from either the data frame or benchmark, for not existing in both.- conf_level
A numeric value between zero and one to determine the confidence level of the confidence interval.
- conf_adjustment
If
conf_adjustment = TRUE
the confidence level of the confidence interval will be adjusted with a Bonferroni adjustment, to account for the problem of multiple comparisons.- percentile_ci
If TURE, cofidence intervals will be calculated using the percentile method. If False, they will be calculated using the normal method.
- weight, weight_bench
A character vector determining variables to weight the
dfs
orbenchmarks
. They have to be part of the respective data frame. If only one character is provided, the same variable is used to weigh everydf
orbenchmark
. If a weight variable is provided also anid
variable is needed.For weighting, thesurvey
package is used.- id, id_bench
A character vector determining
id
variables used to weigh thedfs
orbenchmarks
with the help of thesurvey
package. They have to be part of the respective data frame. If only one character is provided, the same variable is used to weigh everydf
orbenchmark
.- strata, strata_bench
A character vector determining strata variables used to weigh the
dfs
orbenchmarks
with the help of thesurvey
package. They have to be part of the respective data frame. If only one character is provided, the same variable is used to weight everydf
orbenchmark
.- adjustment_weighting
A character vector indicating if adjustment weighting should be used. It can either be
"raking"
or"post_start"
.- adjustment_vars
Variables used to adjust the survey when using raking or post stratification.
- raking_targets
A list of raking targets that can be given to the rake function of
rake
, to rake thedfs
.- post_targets
A list of post-stratification targets that can be given to the
postStratify
function, to post-stratify thedfs
.- ndigits
The number of digits to round the numbers in the plot.
- parallel
Can be either
FALSE
or a number of cores that should be used in the function. If it isFALSE
, only one core will be used and otherwise the given number of cores will be used.
Value
A plot based on ggplot2::ggplot2()
(or data frame if data==TRUE)
which shows the difference between two or more data frames on predetermined variables,
named identical in both data frames.
References
Felderer, B., Kirchner, A., & Kreuter, FALSE. (2019). The Effect of Survey Mode on Data Quality: Disentangling Nonresponse and Measurement Error Bias. Journal of Official Statistics, 35(1), 93–115. https://doi.org/10.2478/jos-2019-0005
Examples
## Get Data for comparison
require(wooldridge)
card<-wooldridge::card
north<-wooldridge::card[wooldridge::card$south==0,]
white<-wooldridge::card[wooldridge::card$black==0,]
## use the function to plot the data
univar_comp<-sampcompR::uni_compare(dfs = c("north","white"),
benchmarks = c("card","card"),
variables= c("age","educ","fatheduc","motheduc","wage","IQ"),
funct = "abs_rel_mean",
nboots=200,
summetric="rmse2",
data=FALSE)
#> Error in get(dfs[i]): object 'north' not found
univar_comp
#> Error: object 'univar_comp' not found