Title: | Generate Alluvial Plots with a Single Line of Code |
---|---|
Description: | Alluvial plots are similar to Sankey diagrams and visualise categorical data over multiple dimensions as flows (Rosvall M, Bergstrom CT (2010) Mapping Change in Large Networks. PLoS ONE 5(1): e8694. <doi:10.1371/journal.pone.0008694>). Their graphical grammar, however, is a bit more complex than that of regular x/y plots. The 'ggalluvial' package does a great job of translating that grammar into 'ggplot2' syntax and gives you many options to tweak the appearance of an alluvial plot. However, there still remains a multi-layered complexity that makes it difficult to use 'ggalluvial' for exploratory data analysis. 'easyalluvial' provides a simple interface to this package that allows you to produce a decent alluvial plot from any dataframe in either long or wide format with a single line of code, while also handling continuous data. It is meant to allow a quick visualisation of entire dataframes, with a focus on different colouring options that can make alluvial plots a great tool for data exploration. |
Authors: | Bjoern Koneswarakantha [aut, cre] |
Maintainer: | Bjoern Koneswarakantha <[email protected]> |
License: | CC0 |
Version: | 0.3.2 |
Built: | 2025-01-01 04:22:51 UTC |
Source: | https://github.com/erblast/easyalluvial |
adds bar plot of important features to model response alluvial plot
add_imp_plot(grid, p = NULL, data_input, plot = T, ...)
grid |
gtable or ggplot |
p |
alluvial plot, optional if alluvial plot has already been passed as grid. Default: NULL |
data_input |
dataframe used to generate alluvial plot |
plot |
logical if plot should be drawn or not |
... |
additional parameters passed to |
gtable
## Not run: 
df = mtcars2[, ! names(mtcars2) %in% 'ids' ]

train = caret::train( disp ~ .
                      , df
                      , method = 'rf'
                      , trControl = caret::trainControl( method = 'none' )
                      , importance = TRUE )

pred_train = caret::predict.train(train, df)

# data_input (df) is a required argument of alluvial_model_response_caret()
p = alluvial_model_response_caret(train, df, degree = 4, pred_train = pred_train)

p_grid = add_marginal_histograms(p, data_input = df)
p_grid = add_imp_plot(p_grid, p, data_input = df)
## End(Not run)
will add density histograms and frequency plots of original data to alluvial plot
add_marginal_histograms( p, data_input, top = TRUE, keep_labels = FALSE, plot = TRUE, ... )
p |
alluvial plot |
data_input |
dataframe, input data that was used to create the alluvial plot |
top |
logical, position of histograms, if FALSE adds them at the bottom, Default: TRUE |
keep_labels |
logical, keep title and caption, Default: FALSE |
plot |
logical if plot should be drawn or not |
... |
additional arguments for model response alluvial plot concerning the response variable
|
gtable
## Not run: 
p = alluvial_wide(mtcars2, max_variables = 3)
p_grid = add_marginal_histograms(p, mtcars2)
## End(Not run)
Plots two variables of a dataframe on an alluvial plot. A third variable can be added either to the left or the right of the alluvial plot to provide coloring of the flows. All numerical variables are scaled, centered and YeoJohnson transformed before binning.
alluvial_long( data, key, value, id, fill = NULL, fill_right = T, bins = 5, bin_labels = c("LL", "ML", "M", "MH", "HH"), NA_label = "NA", order_levels_value = NULL, order_levels_key = NULL, order_levels_fill = NULL, complete = TRUE, fill_by = "first_variable", col_vector_flow = palette_qualitative() %>% palette_filter(greys = F), col_vector_value = RColorBrewer::brewer.pal(9, "Greys")[c(3, 6, 4, 7, 5)], verbose = F, stratum_labels = T, stratum_label_type = "label", stratum_label_size = 4.5, stratum_width = 1/4, auto_rotate_xlabs = T, ... )
data |
a dataframe |
key |
unquoted column name or string of x axis variable |
value |
unquoted column name or string of y axis variable |
id |
unquoted column name or string of id column |
fill |
unquoted column name or string of fill variable which will be used to color flows, Default: NULL |
fill_right |
logical, TRUE fill variable is added to the right FALSE to the left, Default: T |
bins |
number of bins for automatic binning of numerical variables, Default: 5 |
bin_labels |
labels for bins, Default: c("LL", "ML", "M", "MH", "HH") |
NA_label |
character vector, defines label for missing data, Default: 'NA' |
order_levels_value |
character vector denoting order of y levels from low to high, does not have to be complete can also just be used to bring levels to the front, Default: NULL |
order_levels_key |
character vector denoting order of x levels from low to high, does not have to be complete can also just be used to bring levels to the front, Default: NULL |
order_levels_fill |
character vector denoting order of color fill variable levels from low to high, does not have to be complete can also just be used to bring levels to the front, Default: NULL |
complete |
logical, insert implicitly missing observations, Default: TRUE |
fill_by |
one_of(c('first_variable', 'last_variable', 'all_flows', 'values')), Default: 'first_variable' |
col_vector_flow |
HEX color values for flows, Default: palette_qualitative() %>% palette_filter(greys = F) |
col_vector_value |
HEX color values for y levels/values, Default: RColorBrewer::brewer.pal(9, 'Greys')[c(3,6,4,7,5)] |
verbose |
logical, print plot summary, Default: F |
stratum_labels |
logical, Default: TRUE |
stratum_label_type |
character, Default: "label" |
stratum_label_size |
numeric, Default: 4.5 |
stratum_width |
double, Default: 1/4 |
auto_rotate_xlabs |
logical, Default: TRUE |
... |
additional parameter passed to |
ggplot2 object
alluvial_wide, geom_flow, geom_stratum, manip_bin_numerics
## Not run: 
data = quarterly_flights

alluvial_long( data, key = qu, value = mean_arr_delay, id = tailnum, fill_by = 'last_variable' )

# more flow coloring variants ------------------------------------

alluvial_long( data, key = qu, value = mean_arr_delay, id = tailnum, fill_by = 'first_variable' )
alluvial_long( data, key = qu, value = mean_arr_delay, id = tailnum, fill_by = 'all_flows' )
alluvial_long( data, key = qu, value = mean_arr_delay, id = tailnum, fill_by = 'value' )

# color by additional variable carrier ---------------------------

alluvial_long( data, key = qu, value = mean_arr_delay, fill = carrier, id = tailnum )

# use same color coding for flows and y levels -------------------

palette = c('green3', 'tomato')

alluvial_long( data, qu, mean_arr_delay, tailnum, fill_by = 'value'
               , col_vector_flow = palette
               , col_vector_value = palette )

# reorder levels ------------------------------------------------

alluvial_long( data, qu, mean_arr_delay, tailnum, fill_by = 'first_variable'
               , order_levels_value = c('on_time', 'late') )

alluvial_long( data, qu, mean_arr_delay, tailnum, fill_by = 'first_variable'
               , order_levels_key = c('Q4', 'Q3', 'Q2', 'Q1') )

require(dplyr)
require(magrittr)

order_by_carrier_size = data %>%
  group_by(carrier) %>%
  count() %>%
  arrange( desc(n) ) %>%
  .[['carrier']]

alluvial_long( data, qu, mean_arr_delay, tailnum, carrier
               , order_levels_fill = order_by_carrier_size )
## End(Not run)
Alluvial plots are capable of displaying higher dimensional data on a plane and thus lend themselves to plotting the response of a statistical model to changes in the input data across multiple dimensions. The practical limit here is 4 dimensions. We need the data space (a sensible range of data calculated based on the importance of the explanatory variables of the model, as created by get_data_space) and the predictions returned by the model in response to the data space.
alluvial_model_response( pred, dspace, imp, degree = 4, bin_labels = c("LL", "ML", "M", "MH", "HH"), col_vector_flow = c("#FF0065", "#009850", "#A56F2B", "#005EAA", "#710500", "#7B5380", "#9DD1D1"), method = "median", force = FALSE, params_bin_numeric_pred = list(bins = 5), pred_train = NULL, stratum_label_size = 3.5, ... )
pred |
vector, predictions, if method = 'pdp' use
|
dspace |
data frame, returned by
|
imp |
dataframe with no more than two columns: one numeric column containing importance measures and one character or factor column containing the corresponding variable names as found in the training data. |
degree |
integer, number of top important variables to select. Plotting more than 4 will result in too many flows and the alluvial plot will not be very readable, Default: 4 |
bin_labels |
labels for prediction bins from low to high, Default: c("LL", "ML", "M", "MH", "HH") |
col_vector_flow |
character vector, defines flow colours, Default: c('#FF0065', '#009850', '#A56F2B', '#005EAA', '#710500', '#7B5380', '#9DD1D1') |
method |
character vector, one of c('median', 'pdp'). Default: 'median' |
force |
logical, force plotting of over 1500 flows, Default: FALSE |
params_bin_numeric_pred |
list, additional parameters passed to
|
pred_train |
numeric vector, base the automated binning of the pred vector on the distribution of the training predictions. This is useful if marginal histograms are added to the plot later. Default = NULL |
stratum_label_size |
numeric, Default: 3.5 |
... |
additional parameters passed to
|
this model visualisation approach follows the "visualising the model in the dataspace" principle as described in Wickham H, Cook D, Hofmann H (2015) Visualizing statistical models: Removing the blindfold. Statistical Analysis and Data Mining 8(4) <doi:10.1002/sam.11271>
ggplot2 object
alluvial_wide, get_data_space, alluvial_model_response_caret
df = mtcars2[, ! names(mtcars2) %in% 'ids' ]
m = randomForest::randomForest( disp ~ ., df)
imp = m$importance
dspace = get_data_space(df, imp, degree = 3)
pred = predict(m, newdata = dspace)
alluvial_model_response(pred, dspace, imp, degree = 3)

# partial dependency plotting method
## Not run: 
pred = get_pdp_predictions(df, imp
                           , .f_predict = randomForest:::predict.randomForest
                           , m
                           , degree = 3
                           , bins = 5)

alluvial_model_response(pred, dspace, imp, degree = 3, method = 'pdp')
## End(Not run)
Wraps alluvial_model_response and get_data_space into one call for caret models.
alluvial_model_response_caret( train, data_input, degree = 4, bins = 5, bin_labels = c("LL", "ML", "M", "MH", "HH"), col_vector_flow = c("#FF0065", "#009850", "#A56F2B", "#005EAA", "#710500", "#7B5380", "#9DD1D1"), method = "median", parallel = FALSE, params_bin_numeric_pred = list(bins = 5), pred_train = NULL, stratum_label_size = 3.5, force = F, resp_var = NULL, ... )
train |
caret train object |
data_input |
dataframe, input data |
degree |
integer, number of top important variables to select. Plotting more than 4 will result in too many flows and the alluvial plot will not be very readable, Default: 4 |
bins |
integer, number of bins for numeric variables, increasing this number might result in too many flows, Default: 5 |
bin_labels |
labels for the bins from low to high, Default: c("LL", "ML", "M", "MH", "HH") |
col_vector_flow |
character vector, defines flow colours, Default: c('#FF0065', '#009850', '#A56F2B', '#005EAA', '#710500', '#7B5380', '#9DD1D1') |
method |
character vector, one of c('median', 'pdp'). Default: 'median' |
parallel |
logical, turn on parallel processing for pdp method. Default: FALSE |
params_bin_numeric_pred |
list, additional parameters passed to
|
pred_train |
numeric vector, base the automated binning of the pred vector on the distribution of the training predictions. This is useful if marginal histograms are added to the plot later. Default = NULL |
stratum_label_size |
numeric, Default: 3.5 |
force |
logical, force plotting of over 1500 flows, Default: FALSE |
resp_var |
character, sometimes target variable cannot be inferred and needs to be passed. Default NULL |
... |
additional parameters passed to
|
this model visualisation approach follows the "visualising the model in the dataspace" principle as described in Wickham H, Cook D, Hofmann H (2015) Visualizing statistical models: Removing the blindfold. Statistical Analysis and Data Mining 8(4) <doi:10.1002/sam.11271>
ggplot2 object
We are using the 'furrr' and 'future' packages to parallelize some of the computational steps for calculating the predictions. It is up to the user to register a compatible backend (see plan).
alluvial_wide, get_data_space, varImp, extractPrediction, get_pdp_predictions
if(check_pkg_installed("caret", raise_error = FALSE)) {
  df = mtcars2[, ! names(mtcars2) %in% 'ids' ]

  train = caret::train( disp ~ .,
                        df,
                        method = 'rf',
                        trControl = caret::trainControl( method = 'none' ),
                        importance = TRUE )

  alluvial_model_response_caret(train, df, degree = 3)
}
# partial dependency plotting method
## Not run: 
future::plan("multisession")
alluvial_model_response_caret(train, df, degree = 3, method = 'pdp', parallel = TRUE)
## End(Not run)
Wraps alluvial_model_response and get_data_space into one call for parsnip models.
alluvial_model_response_parsnip( m, data_input, degree = 4, bins = 5, bin_labels = c("LL", "ML", "M", "MH", "HH"), col_vector_flow = c("#FF0065", "#009850", "#A56F2B", "#005EAA", "#710500", "#7B5380", "#9DD1D1"), method = "median", parallel = FALSE, params_bin_numeric_pred = list(bins = 5), pred_train = NULL, stratum_label_size = 3.5, force = F, resp_var = NULL, .f_imp = vip::vi_model, ... )
m |
parsnip model or trained workflow |
data_input |
dataframe, input data |
degree |
integer, number of top important variables to select. Plotting more than 4 will result in too many flows and the alluvial plot will not be very readable, Default: 4 |
bins |
integer, number of bins for numeric variables, increasing this number might result in too many flows, Default: 5 |
bin_labels |
labels for the bins from low to high, Default: c("LL", "ML", "M", "MH", "HH") |
col_vector_flow |
character vector, defines flow colours, Default: c('#FF0065', '#009850', '#A56F2B', '#005EAA', '#710500', '#7B5380', '#9DD1D1') |
method |
character vector, one of c('median', 'pdp'). Default: 'median' |
parallel |
logical, turn on parallel processing for pdp method. Default: FALSE |
params_bin_numeric_pred |
list, additional parameters passed to
|
pred_train |
numeric vector, base the automated binning of the pred vector on the distribution of the training predictions. This is useful if marginal histograms are added to the plot later. Default = NULL |
stratum_label_size |
numeric, Default: 3.5 |
force |
logical, force plotting of over 1500 flows, Default: FALSE |
resp_var |
character, sometimes target variable cannot be inferred and needs to be passed. Default NULL |
.f_imp |
vip function that calculates feature importance, Default: vip::vi_model |
... |
additional parameters passed to
|
this model visualisation approach follows the "visualising the model in the dataspace" principle as described in Wickham H, Cook D, Hofmann H (2015) Visualizing statistical models: Removing the blindfold. Statistical Analysis and Data Mining 8(4) <doi:10.1002/sam.11271>
ggplot2 object
We are using the 'furrr' and 'future' packages to parallelize some of the computational steps for calculating the predictions. It is up to the user to register a compatible backend (see plan).
alluvial_wide, get_data_space, varImp, extractPrediction, get_pdp_predictions
if(check_pkg_installed("parsnip", raise_error = FALSE) &
   check_pkg_installed("vip", raise_error = FALSE)) {

  df = mtcars2[, ! names(mtcars2) %in% 'ids' ]

  m = parsnip::rand_forest(mode = "regression") %>%
    parsnip::set_engine("randomForest") %>%
    parsnip::fit(disp ~ ., data = df)

  alluvial_model_response_parsnip(m, df, degree = 3)
}
## Not run: 
# workflow ---------------------------------
m <- parsnip::rand_forest(mode = "regression") %>%
  parsnip::set_engine("randomForest")

rec_prep = recipes::recipe(disp ~ ., df) %>%
  recipes::prep()

wf <- workflows::workflow() %>%
  workflows::add_model(m) %>%
  workflows::add_recipe(rec_prep) %>%
  parsnip::fit(df)

alluvial_model_response_parsnip(wf, df, degree = 3)

# partial dependence plotting method -----
future::plan("multisession")
alluvial_model_response_parsnip(m, df, degree = 3, method = 'pdp', parallel = TRUE)
## End(Not run)
Plots a dataframe as an alluvial plot. All numerical variables are scaled, centered and YeoJohnson transformed before binning. Plots all variables in the sequence in which they appear in the dataframe until the maximum number of variables (max_variables) is reached.
alluvial_wide( data, id = NULL, max_variables = 20, bins = 5, bin_labels = c("LL", "ML", "M", "MH", "HH"), NA_label = "NA", order_levels = NULL, fill_by = "first_variable", col_vector_flow = palette_qualitative() %>% palette_filter(greys = F), col_vector_value = RColorBrewer::brewer.pal(9, "Greys")[c(4, 7, 5, 8, 6)], colorful_fill_variable_stratum = T, verbose = F, stratum_labels = T, stratum_label_type = "label", stratum_label_size = 4.5, stratum_width = 1/4, auto_rotate_xlabs = T, ... )
data |
a dataframe |
id |
unquoted column name of id column or character vector with id column name |
max_variables |
maximum number of variables, Default: 20 |
bins |
number of bins for numerical variables, Default: 5 |
bin_labels |
labels for the bins from low to high, Default: c("LL", "ML", "M", "MH", "HH") |
NA_label |
character vector, define label for missing data, Default: 'NA' |
order_levels |
character vector denoting levels to be reordered from low to high |
fill_by |
one_of(c('first_variable', 'last_variable', 'all_flows', 'values')), Default: 'first_variable' |
col_vector_flow |
HEX colors for flows, Default: palette_qualitative() %>% palette_filter(greys = F) |
col_vector_value |
HEX colors for y levels/values, Default: RColorBrewer::brewer.pal(9, "Greys")[c(4, 7, 5, 8, 6)] |
colorful_fill_variable_stratum |
logical, use flow colors to colorize fill variable stratum, Default: TRUE |
verbose |
logical, print plot summary, Default: F |
stratum_labels |
logical, Default: TRUE |
stratum_label_type |
character, Default: "label" |
stratum_label_size |
numeric, Default: 4.5 |
stratum_width |
double, Default: 1/4 |
auto_rotate_xlabs |
logical, Default: TRUE |
... |
additional arguments passed to
|
Under the hood this function converts the wide format into long format. ggalluvial also offers a way to make alluvial plots directly from wide format tables, but it does not allow individual colouring of the stratum segments. The tradeoff is that we can only order levels as a whole and not individually by variable. Thus, if some variables have levels with the same name, the order will be the same. If we want to change the level order independently, we have to assign unique level names first.
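One way to assign unique level names is to prefix each level with its column name. The following is a minimal base R sketch; the helper `make_levels_unique()` and the toy data frame `df_demo` are hypothetical and not part of 'easyalluvial'.

```r
# Hypothetical helper, not part of 'easyalluvial': prefix every factor
# level with its column name so that no two columns share a level name.
make_levels_unique <- function(df) {
  for (col in names(df)) {
    if (is.factor(df[[col]])) {
      levels(df[[col]]) <- paste(col, levels(df[[col]]), sep = "_")
    }
  }
  df
}

# Toy data: both columns use the level names "low" and "high".
df_demo <- data.frame(
  before = factor(c("low", "high", "low")),
  after  = factor(c("high", "high", "low"))
)

df_unique <- make_levels_unique(df_demo)
levels(df_unique$before)  # "before_high" "before_low"
levels(df_unique$after)   # "after_high"  "after_low"
```

The now distinct names could then be listed in order_levels in whatever per-variable order is wanted.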
ggplot2 object
alluvial_wide, geom_flow, geom_stratum, manip_bin_numerics
## Not run: 
alluvial_wide( data = mtcars2, id = ids
               , max_variables = 3
               , fill_by = 'first_variable' )

# more coloring variants----------------------
alluvial_wide( data = mtcars2, id = ids
               , max_variables = 5
               , fill_by = 'last_variable' )

alluvial_wide( data = mtcars2, id = ids
               , max_variables = 5
               , fill_by = 'all_flows' )

alluvial_wide( data = mtcars2, id = ids
               , max_variables = 5
               , fill_by = 'first_variable' )

# manually order variable values and colour by stratum value
alluvial_wide( data = mtcars2, id = ids
               , max_variables = 5
               , fill_by = 'values'
               , order_levels = c('4', '8', '6') )
## End(Not run)
check if package is installed
check_pkg_installed(pkg, raise_error = TRUE)
pkg |
character, package name |
raise_error |
logical |
logical
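A check of this kind can be sketched in base R with requireNamespace(). The function `pkg_available()` below is a hypothetical stand-in written for illustration; the actual implementation in 'easyalluvial' may differ.

```r
# Hypothetical re-implementation for illustration only; the actual
# 'easyalluvial' code may differ.
pkg_available <- function(pkg, raise_error = TRUE) {
  installed <- requireNamespace(pkg, quietly = TRUE)
  if (!installed && raise_error) {
    stop("package '", pkg, "' is required but not installed")
  }
  installed
}

pkg_available("stats")                                    # 'stats' ships with R
pkg_available("notARealPackage123", raise_error = FALSE)  # no error is raised
```

With raise_error = FALSE the result can guard optional code paths, in the same way the examples in this manual wrap calls that depend on 'caret' or 'parsnip'.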
check_pkg_installed("easyalluvial")
calculates a dataspace based on the modeling dataframe and the importance of the explanatory variables. It only considers the most important variables as defined by the degree parameter. It selects a number (defined by bins) of sensible single values spread over the range of the numeric variables and creates all possible value combinations among the most important variables. The values of the remaining variables are set to mode(factors) or median(numerics).
get_data_space(df, imp, degree = 4, bins = 5, max_levels = 10)
df |
dataframe, training data |
imp |
dataframe with no more than two columns: one numeric column containing importance measures and one character or factor column containing the corresponding variable names as found in the training data. |
degree |
integer, number of top important variables to select. Plotting more than 4 will result in too many flows and the alluvial plot will not be very readable, Default: 4 |
bins |
integer, number of bins for numeric variables, and maximum number of levels for factor variables, increasing this number might result in too many flows, Default: 5 |
max_levels |
integer, maximum number of levels per factor variable, Default: 10 |
It selects the most important variables based on the degree parameter and bins the numeric variables using manip_bin_numerics, while leaving categoric variables unchanged. The number of bins for each numeric variable is set to bins - 2. Next, the median is picked for each of the bins, and the min and max values are added for each numeric variable, so that we get (median(bin) x (bins - 2), max, min), that is bins values in total, for each numeric variable. Then all possible combinations between those values and the categoric factor levels are created. The total number of all possible combinations defines the range of the data space. The values of the remaining variables are set to mode(factors) or median(numerics).
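The value selection described above can be sketched in base R as follows. This is an illustration under assumptions, not the package's code: the helper `pick_values()` is hypothetical, it bins with plain equal-range cut() instead of manip_bin_numerics, and it uses the built-in mtcars data with hand-picked variables instead of an importance ranking.

```r
# Sketch only: mimics the described selection of values, not the
# package's actual code. Equal-range bins via cut() are an assumption.
pick_values <- function(x, bins = 5) {
  groups <- cut(x, breaks = bins - 2)   # bins - 2 intervals
  med <- tapply(x, groups, median)      # one median per interval
  med <- med[!is.na(med)]               # drop empty intervals
  sort(unique(c(min(x), as.numeric(med), max(x))))
}

wt_vals <- pick_values(mtcars$wt)       # at most 5 values
hp_vals <- pick_values(mtcars$hp)

# All combinations of the selected values and the categoric levels.
grid <- expand.grid(
  wt  = wt_vals,
  hp  = hp_vals,
  cyl = sort(unique(mtcars$cyl))
)

# Remaining variables are held constant, here at their median.
grid$disp <- median(mtcars$disp)

nrow(grid)  # product of the number of values per selected variable
```

The row count grows multiplicatively with every added variable, which is why degree and bins are kept small.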
this model visualisation approach follows the "visualising the model in the dataspace" principle as described in Wickham H, Cook D, Hofmann H (2015) Visualizing statistical models: Removing the blindfold. Statistical Analysis and Data Mining 8(4) <doi:10.1002/sam.11271>
data frame
alluvial_wide, manip_bin_numerics
df = mtcars2[, ! names(mtcars2) %in% 'ids' ]
m = randomForest::randomForest( disp ~ ., df)
imp = m$importance
dspace = get_data_space(df, imp)
Alluvial plots are capable of displaying higher dimensional data on a plane and thus lend themselves to plotting the response of a statistical model to changes in the input data across multiple dimensions. The practical limit here is 4 dimensions, while conventional partial dependence plots are limited to 2 dimensions.
Briefly, the 4 variables with the highest feature importance for a given model are selected and 5 values spread over the variable range are selected for each. Then a grid of all possible combinations is created. All non-plotted variables are set to the values found in the first row of the training data set. Using this artificial data space, model predictions are generated. This process is then repeated for each row in the training data set and the overall model response is averaged at the end. Each of the possible combinations is plotted as a flow, which is coloured by the bin corresponding to the average model response generated by that particular combination.
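The averaging procedure can be illustrated with a small base R sketch. It is not the package's implementation: lm() stands in for an arbitrary model, only two variables are varied with three values each, and the built-in mtcars data is used.

```r
# Sketch of the averaging procedure described above, using base R only.
# lm() stands in for an arbitrary model; this is not the package's code.
m <- lm(mpg ~ wt + hp + disp, data = mtcars)

# Grid over the two "plotted" variables (3 values each, for brevity).
grid <- expand.grid(
  wt = quantile(mtcars$wt, c(0.1, 0.5, 0.9)),
  hp = quantile(mtcars$hp, c(0.1, 0.5, 0.9))
)

# For every training row: fix the non-plotted variable (disp) at that
# row's value, predict over the whole grid, then average across rows.
pred_per_row <- sapply(seq_len(nrow(mtcars)), function(i) {
  newdata <- grid
  newdata$disp <- mtcars$disp[i]
  predict(m, newdata = newdata)
})

pdp_pred <- rowMeans(pred_per_row)  # one averaged prediction per grid row
length(pdp_pred) == nrow(grid)
```

The cost is one prediction per grid row and training row, which is why the computation can become slow for larger data sets and why a parallel option is offered.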
get_pdp_predictions( df, imp, m, degree = 4, bins = 5, .f_predict = predict, parallel = FALSE )
df |
dataframe, training data |
imp |
dataframe with no more than two columns: one numeric column containing importance measures and one character or factor column containing the corresponding variable names as found in the training data. |
m |
model object |
degree |
integer, number of top important variables to select. Plotting more than 4 will result in too many flows and the alluvial plot will not be very readable, Default: 4 |
bins |
integer, number of bins for numeric variables, increasing this number might result in too many flows, Default: 5 |
.f_predict |
corresponding model predict() function. Needs to accept 'm' as the first parameter and use the 'newdata' parameter. Supply a wrapper for predict functions with x-y syntax. For parallel processing the predict method of object classes will not always get imported correctly to the worker environment. We can pass the correct predict method via this parameter for example randomForest:::predict.randomForest. Note that a lot of modeling packages do not export the predict method explicitly and it can only be found using :::. |
parallel |
logical, turn on parallel processing. Default: FALSE |
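For models with x-y syntax, a wrapper of the shape required by .f_predict might look like the sketch below. It uses base R only; lm.fit() serves as a toy model with a matrix interface, and the names `m_xy` and `predict_xy()` are hypothetical.

```r
# Toy model fitted with an x-y interface: lm.fit() takes a design matrix
# and a response vector and has no predict() method of its own.
x <- cbind(intercept = 1, wt = mtcars$wt, hp = mtcars$hp)
m_xy <- lm.fit(x, mtcars$mpg)

# Hypothetical wrapper with the expected shape: the model comes first,
# the data to predict on is passed as 'newdata'.
predict_xy <- function(m, newdata) {
  x_new <- cbind(intercept = 1, as.matrix(newdata[, c("wt", "hp")]))
  as.numeric(x_new %*% m$coefficients)
}

head(predict_xy(m_xy, newdata = mtcars))
```

A function like this would then be supplied as .f_predict = predict_xy.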
For more on partial dependency plots see [https://christophm.github.io/interpretable-ml-book/pdp.html].
vector, predictions
We are using the 'furrr' and 'future' packages to parallelize some of the computational steps for calculating the predictions. It is up to the user to register a compatible backend (see plan).
df = mtcars2[, ! names(mtcars2) %in% 'ids' ]
m = randomForest::randomForest( disp ~ ., df)
imp = m$importance

pred = get_pdp_predictions(df, imp
                           , m
                           , degree = 3
                           , bins = 5)

# parallel processing --------------------------
## Not run: 
future::plan("multisession")

# note that we have to pass the predict method via .f_predict otherwise
# it will not be available in the worker's environment.

pred = get_pdp_predictions(df, imp
                           , m
                           , degree = 3
                           , bins = 5
                           , parallel = TRUE
                           , .f_predict = randomForest:::predict.randomForest)
## End(Not run)
has been replaced by get_pdp_predictions, which can be parallelized and also handles factor predictions. It is still used to test results.
get_pdp_predictions_seq(df, imp, m, degree = 4, bins = 5, .f_predict = predict)
df |
dataframe, training data |
imp |
dataframe, with not more than two columns, one of them numeric containing importance measures and one character or factor column containing corresponding variable names as found in training data. |
m |
model object |
degree |
integer, number of top important variables to select. For plotting, more than 4 will result in too many flows and the alluvial plot will not be very readable, Default: 4 |
bins |
integer, number of bins for numeric variables, increasing this number might result in too many flows, Default: 5 |
.f_predict |
corresponding model predict() function. Needs to accept 'm' as the first parameter and use the 'newdata' parameter. Supply a wrapper for predict functions with x-y syntax. For parallel processing the predict method of object classes will not always get imported correctly to the worker environment. We can pass the correct predict method via this parameter for example randomForest:::predict.randomForest. Note that a lot of modeling packages do not export the predict method explicitly and it can only be found using :::. |
centers, scales and Yeo-Johnson transforms numeric variables in a dataframe before binning them into n bins of equal range. Outliers based on boxplot stats are capped (set to the min or max of the boxplot stats).
manip_bin_numerics( x, bins = 5, bin_labels = c("LL", "ML", "M", "MH", "HH"), center = T, scale = T, transform = T, round_numeric = T, digits = 2, NA_label = "NA" )
x |
dataframe with numeric variables, or numeric vector |
bins |
number of bins for numerical variables, passed to cut as breaks parameter, Default: 5 |
bin_labels |
labels for the bins from low to high, Default: c("LL", "ML", "M", "MH", "HH"). Can also be one of c('mean', 'median', 'min_max', 'cuts'), the corresponding summary function will supply the labels. |
center |
logical, Default: T |
scale |
logical, Default: T |
transform |
logical, apply Yeo Johnson Transformation, Default: T |
round_numeric |
logical, rounds numeric results if bin_labels is supplied with a supported summary function name. |
digits |
integer, number of digits to round to |
NA_label |
character vector, define label for missing data, Default: 'NA' |
dataframe
summary( mtcars2 )
summary( manip_bin_numerics(mtcars2) )
summary( manip_bin_numerics(mtcars2, bin_labels = 'mean'))
summary( manip_bin_numerics(mtcars2, bin_labels = 'cuts'
  , scale = FALSE, center = FALSE, transform = FALSE))
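The capping and equal-range binning described above can be sketched in base R. This is an illustration under stated assumptions, not the package's implementation: cap_and_bin is a hypothetical helper, and it deliberately omits centering, scaling and the Yeo-Johnson transformation.

```r
# Hypothetical helper: cap outliers at the boxplot whiskers, then cut the
# capped values into `bins` intervals of equal range.
cap_and_bin <- function(x, bins = 5,
                        bin_labels = c("LL", "ML", "M", "MH", "HH")) {
  # boxplot.stats()$stats holds: lower whisker, lower hinge, median,
  # upper hinge, upper whisker
  stats <- grDevices::boxplot.stats(x)$stats
  x_capped <- pmin(pmax(x, stats[1]), stats[5])
  binned <- cut(x_capped, breaks = bins, labels = bin_labels)
  list(capped = x_capped, binned = binned)
}

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)  # 100 is an outlier
res <- cap_and_bin(x)
print(res$capped)
print(table(res$binned))
```

Capping before binning keeps a single extreme value from stretching the range so far that most observations fall into the lowest bin.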
before converting, we check whether the levels contain a number; if they do, the number will be preserved.
manip_factor_2_numeric(vec)
vec |
vector |
vector
fac_num = factor( c(1,3,8) )
fac_chr = factor( c('foo','bar') )
fac_chr_ordered = factor( c('a','b','c'), ordered = TRUE )

manip_factor_2_numeric( fac_num )
manip_factor_2_numeric( fac_chr )
manip_factor_2_numeric( fac_chr_ordered )

# does not work for decimal numbers
manip_factor_2_numeric(factor(c("A12", "B55", "10e4")))
manip_factor_2_numeric(factor(c("1.56", "4.56", "8.4")))
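The idea of preserving numbers found in factor levels can be sketched as follows. This is an assumption-laden illustration and not the package's code: factor_to_numeric_sketch is a hypothetical function, it extracts only the first run of digits from each level, and its results may differ from those of manip_factor_2_numeric.

```r
# Hypothetical sketch: keep the number if every level contains one,
# otherwise fall back to the integer codes of the factor.
factor_to_numeric_sketch <- function(vec) {
  values <- as.character(vec)
  if (all(grepl("[0-9]", values))) {
    # extract the first run of digits from each value
    as.numeric(regmatches(values, regexpr("[0-9]+", values)))
  } else {
    as.integer(as.factor(vec))
  }
}

num_from_digits <- factor_to_numeric_sketch(factor(c(1, 3, 8)))
num_from_codes <- factor_to_numeric_sketch(factor(c("foo", "bar")))
print(num_from_digits)
print(num_from_codes)
```

Because only whole digit runs are matched, a level such as "1.56" would lose its decimal part in this sketch, which illustrates why decimal numbers need separate handling.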
mtcars dataset with cyl, vs, am, gear, carb as factor variables and car model names as id
mtcars2
A data frame with 32 rows and 12 variables
mpg |
Miles/(US) gallon |
cyl |
Number of cylinders |
disp |
Displacement (cu.in.) |
hp |
Gross horsepower |
drat |
Rear axle ratio |
wt |
Weight (1000 lbs) |
qsec |
1/4 mile time |
vs |
Engine |
am |
Transmission |
gear |
Number of forward gears |
carb |
Number of carburetors |
ids |
car model name |
datasets
filters are based on rgb values
palette_filter( palette = palette_qualitative(), similar = F, greys = T, reds = T, greens = T, blues = T, dark = T, medium = T, bright = T, thresh_similar = 25 )
palette |
any vector with hex color values, Default: palette_qualitative() |
similar |
logical, allow similar colours, similar colours are detected using a threshold (thresh_similar), two colours are similar when each value for RGB is within threshold range of the corresponding RGB value of the second colour, Default: F |
greys |
logical, allow grey colours, red == green == blue, Default: T |
reds |
logical, allow red colours, blue < 50 & green < 50 & red > 200 , Default: T |
greens |
logical, allow green colours, green > red & green > blue, Default: T |
blues |
logical, allow blue colours, blue > green & green > red, Default: T |
dark |
logical, allow colours of dark intensity, sum( red, green, blue) < 420 , Default: T |
medium |
logical, allow colours of medium intensity, between( sum( red, green, blue), 420, 600) , Default: T |
bright |
logical, allow colours of bright intensity, sum( red, green, blue) > 600, Default: T |
thresh_similar |
int, threshold for defining similar colours, see similar, Default: 25 |
vector with hex colors
require(magrittr)

palette_qualitative() %>%
  palette_filter(thresh_similar = 0) %>%
  palette_plot_intensity()

## Not run:
# more examples---------------------------
palette_qualitative() %>%
  palette_filter(thresh_similar = 25) %>%
  palette_plot_intensity()

palette_qualitative() %>%
  palette_filter(thresh_similar = 0, blues = FALSE) %>%
  palette_plot_intensity()
## End(Not run)
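The RGB rules listed in the arguments can be checked by hand with grDevices::col2rgb. The sketch below is illustrative only: classify_intensity and is_similar are hypothetical helpers, and treating "within threshold range" as an absolute difference of at most thresh_similar is an assumption.

```r
# Hypothetical helper: classify colours by the sum of their RGB values,
# using the cut-offs documented for dark, medium and bright.
classify_intensity <- function(palette) {
  rgb_sum <- unname(colSums(grDevices::col2rgb(palette)))
  ifelse(rgb_sum < 420, "dark",
         ifelse(rgb_sum > 600, "bright", "medium"))
}

# Hypothetical helper: two colours count as similar when each RGB channel
# differs by no more than the threshold (assumed to be inclusive).
is_similar <- function(col_a, col_b, thresh_similar = 25) {
  diff <- abs(grDevices::col2rgb(col_a) - grDevices::col2rgb(col_b))
  all(diff <= thresh_similar)
}

pal <- c("#FF0065", "#009850", "#9DD1D1", "#FFFFFF")
intensity <- classify_intensity(pal)
print(data.frame(colour = pal, intensity = intensity))
print(is_similar("#FF0065", "#FF0A6E"))
print(is_similar("#FF0065", "#009850"))
```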
works for any vector
palette_increase_length(palette = palette_qualitative(), n = 100)
palette |
any vector, Default: palette_qualitative() |
n |
int, length, Default: 100 |
vector with increased length
require(magrittr)

length(palette_qualitative())

palette_qualitative() %>%
  palette_increase_length(100) %>%
  length()
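One simple way to bring any vector to a target length is to recycle its values. The sketch below shows that idea in base R; increase_length_sketch is a hypothetical name, and whether palette_increase_length recycles values in exactly this order is an assumption.

```r
# Hypothetical sketch: recycle the values of a vector until it has length n.
increase_length_sketch <- function(palette, n = 100) {
  rep(palette, length.out = n)
}

short_palette <- c("#FF0065", "#009850", "#A56F2B")
long_palette <- increase_length_sketch(short_palette, n = 100)
print(length(long_palette))
```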
sum of red, green and blue values
palette_plot_intensity(palette)
palette |
any vector containing color hex values |
ggplot2 plot
## Not run:
if(interactive()){
  require(magrittr)
  palette_qualitative() %>%
    palette_filter( thresh_similar = 25) %>%
    palette_plot_intensity()
}
## End(Not run)
grouped bar chart
palette_plot_rgp(palette)
palette |
any vector containing color hex values |
ggplot2 plot
## Not run:
if(interactive()){
  require(magrittr)
  palette_qualitative() %>%
    palette_filter( thresh_similar = 50) %>%
    palette_plot_rgp()
}
## End(Not run)
uses c('#FF0065','#009850', '#A56F2B', '#005EAA', '#710500', '#7B5380', '#9DD1D1') and then adds all unique values found in all qualitative RColorBrewer palettes
palette_qualitative()
vector with hex values
palette_qualitative()
will create gtable with density histograms and frequency plots of all variables of a given alluvial plot.
plot_all_hists(p, data_input, top = TRUE, keep_labels = FALSE, ...)
p |
alluvial plot |
data_input |
dataframe, input data that was used to create the alluvial plot |
top |
logical, position of histograms, if FALSE adds them at the bottom, Default: TRUE |
keep_labels |
logical, keep title and caption, Default: FALSE |
... |
additional arguments for specific alluvial plot types: pred_train can be used to pass training predictions for model response alluvials |
gtable
## Not run:
p = alluvial_wide(mtcars2, max_variables = 3)
plot_all_hists(p, mtcars2)
## End(Not run)
plotting the condensation potential is meant as a decision aid for choosing which variables to include in an alluvial plot. All variables are transformed to categorical variables. Then the dataframe is grouped and summarized by each pair of variables, and the pair that results in the greatest condensation of the original dataframe is selected. Next, the variable that offers the greatest condensation potential is added, and this is repeated until all variables have been added. The condensation in percent is then plotted for each step along with the number of groups (flows) in the dataframe. In our experience it is not advisable to have more than 1500 flows, because the alluvial plot will then take a long time to render. If there is a particular variable of interest in the dataframe, it can be chosen as the starting variable.
plot_condensation(df, first = NULL)
df |
dataframe |
first |
unquoted expression or string denoting the first variable to be picked for condensation, Default: NULL |
ggplot2 plot
quosure
reexports
RColorBrewer
plot_condensation(mtcars2)
plot_condensation(mtcars2, first = 'disp')
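The greedy selection described for plot_condensation can be sketched in base R. The code below is an illustration, not the package's implementation: n_flows and greedy_condensation are hypothetical helpers, and defining condensation as 100 * (1 - flows / rows) is an assumption.

```r
# Hypothetical helper: number of groups (flows) when grouping by `vars`.
n_flows <- function(df, vars) {
  nrow(unique(df[, vars, drop = FALSE]))
}

# Hypothetical sketch of the greedy ordering: start with the pair (or the
# variable given in `first`) that yields the fewest flows, then keep adding
# the variable that yields the fewest flows at each step.
greedy_condensation <- function(df, first = NULL) {
  remaining <- names(df)
  if (is.null(first)) {
    pairs <- utils::combn(remaining, 2, simplify = FALSE)
    flows <- vapply(pairs, function(p) n_flows(df, p), integer(1))
    chosen <- pairs[[which.min(flows)]]
  } else {
    chosen <- first
  }
  remaining <- setdiff(remaining, chosen)
  while (length(remaining) > 0) {
    flows <- vapply(remaining, function(v) n_flows(df, c(chosen, v)),
                    integer(1))
    nxt <- remaining[which.min(flows)]
    chosen <- c(chosen, nxt)
    remaining <- setdiff(remaining, nxt)
  }
  # flows and condensation after each variable has been added
  cum_flows <- vapply(seq_along(chosen),
                      function(i) n_flows(df, chosen[seq_len(i)]),
                      integer(1))
  data.frame(variable = chosen,
             n_flows = cum_flows,
             condensation_pct = 100 * (1 - cum_flows / nrow(df)),
             stringsAsFactors = FALSE)
}

toy <- data.frame(a = c("x", "x", "y", "y"),
                  b = c("1", "1", "1", "2"),
                  c = c("p", "q", "r", "s"))
res <- greedy_condensation(toy)
print(res)
```

In the toy data, grouping by a and b leaves three flows out of four rows, while adding c makes every row its own flow, so no condensation remains.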
helper function used by add_marginal_histograms
plot_hist(var, p, data_input, ...)
var |
character vector, variable name |
p |
alluvial plot |
data_input |
dataframe used to create alluvial plot |
... |
additional arguments for specific alluvial plot types: pred_train can be used to pass training predictions for model response alluvials |
ggplot object
plot important features of model response alluvial as bars
plot_imp(p, data_input, truncate_at = 50, color = "darkgrey")
p |
alluvial plot |
data_input |
dataframe used to generate alluvial plot |
truncate_at |
integer, limit number of features to that value, Default: 50 |
color |
character vector, Default: 'darkgrey' |
ggplot object
## Not run:
df = mtcars2[, ! names(mtcars2) %in% 'ids' ]

train = caret::train( disp ~ .
                      , df
                      , method = 'rf'
                      , trControl = caret::trainControl( method = 'none' )
                      , importance = TRUE )

pred_train = caret::predict.train(train, df)

p = alluvial_model_response_caret(train, degree = 3, pred_train = pred_train)

plot_imp(p, mtcars2)
## End(Not run)
Created from nycflights13::flights
quarterly_flights
A data frame with 1608 rows and 6 variables
a unique identifier created from tailnum, origin, destination and carrier
carrier code
origin code
destination code
quarter
average delay on arrival as either on_time or late
nycflights13::flights
Quarterly mean relative sunspot numbers from 1749 to 1983
quarterly_sunspots
A data frame with 940 rows and 4 variables
quarter
total number of sunspots
Andrews, D. F. and Herzberg, A. M. (1985) Data: A Collection of Problems from Many Fields for the Student and Research Worker. New York: Springer-Verlag.
returns a dataframe with exactly two columns, vars and imp, and aggregates dummy encoded variables. Helper function called by all functions that take an imp parameter. Can be called manually if the function used for aggregating dummy encoded variables (.f) must be modified.
tidy_imp(imp, df, .f = max, resp_var = NULL)
imp |
dataframe or matrix with feature importance information |
df |
dataframe, modeling training data |
.f |
window function, Default: max |
resp_var |
character, prediction variable, can usually be inferred from imp and df. It does not work for all models and needs to be specified in those cases. |
dataframe
character column with feature names
numerical column, importance values
# randomforest
df = mtcars2[, ! names(mtcars2) %in% 'ids' ]
m = randomForest::randomForest( disp ~ ., df)
imp = m$importance
tidy_imp(imp, df)
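Aggregating dummy encoded variables means mapping names such as cyl6 and cyl8 back to the original column cyl and combining their importance values with .f. The base R sketch below illustrates this under assumptions: imp_raw, train_vars and map_to_original are hypothetical, the importance values are made up, and matching by name prefix is an assumption about how dummy names relate to column names.

```r
# Made-up importance values with dummy encoded names for `cyl`.
imp_raw <- data.frame(vars = c("cyl6", "cyl8", "hp", "wt"),
                      imp = c(0.2, 0.5, 0.9, 0.7),
                      stringsAsFactors = FALSE)

# Column names of the (hypothetical) training data.
train_vars <- c("cyl", "hp", "wt")

# Hypothetical helper: map a dummy name to the longest training column
# name that it starts with.
map_to_original <- function(v, original) {
  hits <- original[startsWith(v, original)]
  if (length(hits) == 0) return(NA_character_)
  hits[which.max(nchar(hits))]
}

imp_raw$vars <- vapply(imp_raw$vars, map_to_original, character(1),
                       original = train_vars, USE.NAMES = FALSE)

# Aggregate with max, mirroring the default .f = max, then sort descending.
imp_tidy <- stats::aggregate(imp ~ vars, data = imp_raw, FUN = max)
imp_tidy <- imp_tidy[order(-imp_tidy$imp), ]
print(imp_tidy)
```

Swapping FUN = max for another summary such as sum or mean corresponds to passing a different .f.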
titanic data set
titanic
A data frame with 891 rows and 10 variables
Survived
Pclass
Sex
Age
SibSp
Parch
Fare
Cabin
Embarked
title
datasets