Analysis of data is a
process of inspecting, cleaning, transforming, and modeling data with the goal
of highlighting useful information, suggesting conclusions, and supporting
decision making. Data analysis has multiple facets and approaches, encompassing
diverse techniques under a variety of names, in different business, science,
and social science domains.
Data mining is a
particular data analysis technique that focuses on modeling and knowledge
discovery for predictive rather than purely descriptive purposes. Business
intelligence covers data analysis that relies heavily on aggregation, focusing
on business information. In statistical applications, some people divide data
analysis into descriptive statistics, exploratory data analysis (EDA), and
confirmatory data analysis (CDA). EDA focuses on discovering new features in
the data and CDA on confirming or falsifying existing hypotheses. Predictive
analytics focuses on application of statistical or structural models for
predictive forecasting or classification, while text analytics applies
statistical, linguistic, and structural techniques to extract and classify
information from textual sources, a species of unstructured data. All are
varieties of data analysis.
Data integration is a
precursor to data analysis, and data analysis is closely linked to data
visualization and data dissemination. The term data analysis is sometimes used
as a synonym for data modeling.
The process of data analysis
Data analysis is a process, within which several phases can be distinguished:
Data cleaning
Data cleaning is an important
procedure during which the data are inspected, and erroneous data are—if
necessary, preferable, and possible—corrected. Data cleaning can be done during
the stage of data entry. If this is done, it is important that no subjective
decisions are made. The guiding principle provided by Adèr (ref) is: during
subsequent manipulations of the data, information should always be cumulatively
retrievable. In other words, it should always be possible to undo any data set
alterations. Therefore, it is important not to throw information away at any
stage in the data cleaning phase. All information should be saved (i.e., when
altering variables, both the original values and the new values should be kept,
either in a duplicate data set or under a different variable name), and all
alterations to the data set should be carefully and clearly documented, for instance
in a syntax file or a log.
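The principle above (keep original values, store new values under a different name, and log every alteration) can be sketched in a few lines of Python. This is only an illustration: the field names `age` and `age_clean`, the helper `clean_age`, and the plausible range of 0 to 120 are assumptions made for the example, not part of Adèr's text.

```python
from copy import deepcopy

def clean_age(records, log, low=0, high=120):
    """Return a cleaned copy of records; the input list is never modified."""
    cleaned = deepcopy(records)
    for index, row in enumerate(cleaned):
        age = row["age"]
        # The original value stays under "age"; the cleaned value goes to "age_clean".
        if age is not None and not (low <= age <= high):
            row["age_clean"] = None  # treat an implausible value as missing
            log.append(f"row {index}: age={age} outside [{low}, {high}], set to missing")
        else:
            row["age_clean"] = age
    return cleaned

# Hypothetical records, including one obvious entry error (340).
records = [{"id": 1, "age": 34}, {"id": 2, "age": 340}, {"id": 3, "age": None}]
log = []
cleaned = clean_age(records, log)
```

Because the raw value is still available under `age` and the log records what was changed and why, the alteration can be undone at any later stage.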
Initial data analysis
The most important distinction
between the initial data analysis phase and the main analysis phase is that
during initial data analysis one refrains from any analyses that are aimed at
answering the original research question. The initial data analysis phase is
guided by the following four questions:
Quality of data
The quality of the data should be
checked as early as possible. Data quality can be assessed in several ways,
using different types of analyses: frequency counts, descriptive statistics
(mean, standard deviation, median), and normality (skewness, kurtosis, frequency
histograms). Variables are also compared with coding schemes of variables external
to the data set, and possibly corrected if coding schemes are not comparable.
A further check is a test for common-method variance.
The choice of analyses to assess
the data quality during the initial data analysis phase depends on the analyses
that will be conducted in the main analysis phase.
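As a rough illustration of the checks listed above, the following sketch computes frequency counts and basic descriptive statistics with the Python standard library. The function name `describe` and the sample values are made up for the example. Skewness and excess kurtosis are computed here from population moments, which is one of several common conventions.

```python
import statistics
from collections import Counter

def describe(values):
    """Descriptive statistics for a numeric variable (needs non-constant data)."""
    n = len(values)
    mean = statistics.fmean(values)
    # Central moments with divisor n (population convention).
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return {
        "n": n,
        "mean": mean,
        "median": statistics.median(values),
        "sd": statistics.stdev(values),       # sample SD, divisor n - 1
        "skewness": m3 / m2 ** 1.5,           # 0 for a symmetric distribution
        "excess_kurtosis": m4 / m2 ** 2 - 3,  # 0 for a normal distribution
    }

scores = [1, 2, 3, 4, 5]                          # hypothetical numeric variable
summary = describe(scores)
frequencies = Counter(["a", "b", "a", "c", "a"])  # frequency counts for a coded variable
```

Skewness or kurtosis far from zero would be a reason to look at a frequency histogram before choosing the main analyses.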
Quality of measurements
The quality of the measurement
instruments should only be checked during the initial data analysis phase when
this is not the focus or research question of the study. One should check
whether the structure of the measurement instruments corresponds to the structure
reported in the literature.
There are two ways to assess measurement quality:
· Confirmatory factor analysis
· Analysis of homogeneity (internal consistency), which gives an indication of the reliability of a measurement instrument. During this analysis, one inspects the variances of the items and the scales, the Cronbach's α of the scales, and the change in Cronbach's α if an item were deleted from a scale.
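For readers who want to see the homogeneity analysis in concrete terms, here is a minimal sketch of Cronbach's α and of α if an item were deleted, using the standard formula α = k/(k − 1) · (1 − Σ item variances / variance of the total score). The function names and the item scores are hypothetical.

```python
import statistics

def cronbach_alpha(items):
    """items: one list of scores per item, aligned by respondent (at least 2 items)."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # scale score per respondent
    item_variance_sum = sum(statistics.variance(item) for item in items)
    return k / (k - 1) * (1 - item_variance_sum / statistics.variance(totals))

def alpha_if_item_deleted(items):
    """Alpha of the remaining scale after dropping each item in turn."""
    return [cronbach_alpha(items[:i] + items[i + 1:]) for i in range(len(items))]

# Hypothetical scores of five respondents on a three-item scale.
items = [
    [2, 3, 4, 4, 5],
    [1, 3, 3, 4, 5],
    [3, 2, 4, 3, 4],
]
alpha = cronbach_alpha(items)
alphas_without = alpha_if_item_deleted(items)
```

An item whose removal raises α noticeably is a candidate for closer inspection, although that decision belongs to the final stage of the initial data analysis rather than to this computation.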
Initial transformations
After assessing the quality of
the data and of the measurements, one might decide to impute missing data, or
to perform initial transformations of one or more variables, although this can
also be done during the main analysis phase.
Possible transformations of
variables are:
· Square root transformation (if the distribution differs moderately from normal)
· Log-transformation (if the distribution differs substantially from normal)
· Inverse transformation (if the distribution differs severely from normal)
· Make categorical (ordinal / dichotomous) (if the distribution differs severely from normal, and no transformations help)
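The three numeric transformations in the list above can be sketched as follows. The example assumes strictly positive values (the logarithm and the inverse are undefined at zero), and the helper names and data are invented for illustration. A small skewness function is included so the effect of each transformation can be compared.

```python
import math

def skewness(values):
    """Moment-based skewness with divisor n; 0 for a symmetric distribution."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

def sqrt_transform(values):
    return [math.sqrt(x) for x in values]

def log_transform(values):
    return [math.log(x) for x in values]  # natural log; requires x > 0

def inverse_transform(values):
    return [1 / x for x in values]        # requires x != 0; reverses the ordering

raw = [1, 10, 100]  # hypothetical right-skewed variable
skew_raw = skewness(raw)
skew_sqrt = skewness(sqrt_transform(raw))
skew_log = skewness(log_transform(raw))
```

Note that the inverse transformation reverses the order of the values, so the direction of any later effect has to be interpreted accordingly.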
Did the implementation of the study fulfill the intentions of the research design?
One should check the success of
the randomization procedure, for instance by checking whether background and
substantive variables are equally distributed within and across groups.
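One simple way to check whether a background variable is equally distributed across groups is the standardized mean difference. The sketch below is an assumption about how such a check might look, not a procedure prescribed by the source; the function name and the ages are hypothetical.

```python
import math
import statistics

def standardized_mean_difference(group_a, group_b):
    """Difference in means divided by the pooled standard deviation."""
    pooled_sd = math.sqrt(
        (statistics.variance(group_a) + statistics.variance(group_b)) / 2
    )
    return (statistics.fmean(group_a) - statistics.fmean(group_b)) / pooled_sd

# Hypothetical ages in a treatment group and a control group.
treatment_age = [30, 32, 34, 36, 38]
control_age = [31, 33, 35, 37, 39]
smd = standardized_mean_difference(treatment_age, control_age)
```

Values close to zero suggest the groups are balanced on that variable; how large a difference is acceptable is a judgment call that depends on the study.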
If the study did not need and/or
use a randomization procedure, one should check the success of the non-random
sampling, for instance by checking whether all subgroups of the population of
interest are represented in the sample.
Other possible data distortions that should be checked are:
· Dropout (this should be identified during the initial data analysis phase)
· Item nonresponse (whether this is random or not should be assessed during the initial data analysis phase)
· Treatment quality (using manipulation checks).
Characteristics of data sample
In any report or article, the
structure of the sample must be accurately described. It is especially
important to exactly determine the structure of the sample (and specifically
the size of the subgroups) when subgroup analyses will be performed during the
main analysis phase.
The characteristics of the data sample can be assessed by looking at:
· Basic statistics of important variables
· Scatter plots
· Correlations and associations
· Cross-tabulations
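Two of the items in the list above, correlations and cross-tabulations, are easy to sketch with the standard library. The function names and the small data set are hypothetical.

```python
import math
from collections import Counter

def pearson_r(x, y):
    """Pearson product-moment correlation between two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def crosstab(row_values, col_values):
    """Cell counts of a two-way table, keyed by (row category, column category)."""
    return Counter(zip(row_values, col_values))

# Hypothetical sample: two continuous and two categorical variables.
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 70]
sex = ["f", "m", "f", "m", "f"]
passed = ["yes", "no", "yes", "yes", "yes"]

r = pearson_r(hours, score)
table = crosstab(sex, passed)
```

The cell counts in `table` also show directly how large each subgroup is, which matters when subgroup analyses are planned for the main analysis phase.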
Final stage of the initial data analysis
During the final stage, the
findings of the initial data analysis are documented, and necessary,
preferable, and possible corrective actions are taken.
Also, the original plan for the
main data analyses can and should be specified in more detail and/or rewritten.
In order to do this, several decisions about the main data analyses can and should be made:
· In the case of non-normal distributions: should one transform variables; make variables categorical (ordinal/dichotomous); adapt the analysis method?
· In the case of missing data: should one neglect or impute the missing data; which imputation technique should be used?
· In the case of outliers: should one use robust analysis techniques?
· In case items do not fit the scale: should one adapt the measurement instrument by omitting items, or rather ensure comparability with other (uses of the) measurement instrument(s)?
· In the case of (too) small subgroups: should one drop the hypothesis about inter-group differences, or use small sample techniques, like exact tests or bootstrapping?
· In case the randomization procedure seems to be defective: can and should one calculate propensity scores and include them as covariates in the main analyses?
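Bootstrapping, named above as one option for small subgroups, can be sketched as a percentile interval for a difference in group means. This is a minimal illustration under stated assumptions: the function name, the number of resamples, the fixed seed, and the data are all choices made for the example.

```python
import random
import statistics

def bootstrap_mean_difference(group_a, group_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(group_a) - mean(group_b)."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    differences = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping its original size.
        resample_a = rng.choices(group_a, k=len(group_a))
        resample_b = rng.choices(group_b, k=len(group_b))
        differences.append(statistics.fmean(resample_a) - statistics.fmean(resample_b))
    differences.sort()
    lower_index = int(alpha / 2 * n_resamples)
    upper_index = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return differences[lower_index], differences[upper_index]

# Hypothetical scores in two small subgroups.
group_a = [12, 15, 14, 10, 13]
group_b = [8, 9, 11, 7, 10]
lower, upper = bootstrap_mean_difference(group_a, group_b)
```

With subgroups this small the interval is only a rough guide, and an exact test may be the better choice.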
Analyses
Several analyses can be used
during the initial data analysis phase:
· Univariate statistics (single variable)
· Bivariate associations (correlations)
· Graphical techniques (scatter plots)
It is important to take the
measurement levels of the variables into account for the analyses, as special
statistical techniques are available for each level:
· Nominal and ordinal variables
  · Frequency counts (numbers and percentages)
  · Associations
    · Cross-tabulations
    · Hierarchical loglinear analysis (restricted to a maximum of 8 variables)
    · Loglinear analysis (to identify relevant/important variables and possible confounders)
  · Exact tests or bootstrapping (in case subgroups are small)
  · Computation of new variables
· Continuous variables
  · Distribution
    · Statistics (M, SD, variance, skewness, kurtosis)
    · Stem-and-leaf displays
    · Box plots
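As an example of one of the displays for continuous variables, a stem-and-leaf display can be produced with a few lines of Python. The sketch assumes non-negative integers and uses the tens digit as the stem; the function name and data are hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group non-negative integers by stem (tens) with sorted leaves (units)."""
    stems = defaultdict(list)
    for value in sorted(values):
        stems[value // 10].append(value % 10)
    return dict(sorted(stems.items()))

def format_stem_and_leaf(values):
    """Return one text line per stem, for example '2 | 133'."""
    display = stem_and_leaf(values)
    return [f"{stem} | {''.join(str(leaf) for leaf in leaves)}"
            for stem, leaves in display.items()]

ages = [23, 12, 30, 21, 15, 23]  # hypothetical values
lines = format_stem_and_leaf(ages)
```

Unlike a histogram, the display keeps every individual value visible, which makes it handy for spotting entry errors in small data sets.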
References
Adèr, H.J. (2008). Chapter 14:
Phases and initial steps in data analysis. In H.J. Adèr & G.J. Mellenbergh
(Eds.) (with contributions by D.J. Hand), Advising on Research Methods: A
consultant's companion (pp. 333–356). Huizen, the Netherlands: Johannes van
Kessel Publishing.
Adèr, H.J. & Mellenbergh,
G.J. (with contributions by D.J. Hand) (2008). Advising on Research Methods: A
consultant's companion. Huizen, the Netherlands: Johannes van Kessel
Publishing.
ASTM International (2002). Manual
on Presentation of Data and Control Chart Analysis, MNL 7A, ISBN 0-8031-2093-1
Juran, Joseph M.; Godfrey, A.
Blanton (1999). Juran's Quality Handbook. 5th ed. New York: McGraw Hill. ISBN
0-07-034003-X
Lewis-Beck, Michael S. (1995).
Data Analysis: an Introduction, Sage Publications Inc, ISBN 0-8039-5772-6
NIST/SEMATECH (2008). Handbook of
Statistical Methods.
Pyzdek, T. (2003). Quality
Engineering Handbook, ISBN 0-8247-4614-7
Richard Veryard (1984). Pragmatic
data analysis. Oxford : Blackwell Scientific Publications. ISBN 0-632-01311-7
Tabachnick, B.G. & Fidell,
L.S. (2007). Using Multivariate Statistics, Fifth Edition. Boston: Pearson
Education, Inc. / Allyn and Bacon, ISBN 978-0-205-45938-4