--- title: "Use Case: Florentine Families" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{08 use case} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- This vignette describes a typical use case of the `netrankr` package. It contrasts the usual centrality analysis based on indices with the dominance based assessment. It is advisable to read the other vignettes before going through this example. ________________________________________________________________________________ ```{r libs, warning=FALSE,message=FALSE} library(igraph) library(netrankr) library(magrittr) ``` ## Data For this tutorial, we use the famous *florentine families* dataset (Padget & Ansell, 1993). The marriage links of families together with the wealth attribute are included in the `netrankr` package. ```{r plot,fig.height=5,fig.width=5,fig.align='center'} data("florentine_m") # Delete Pucci family (isolated) florentine_m <- delete_vertices(florentine_m, which(degree(florentine_m) == 0)) # plot the graph (label size proportional to wealth) set.seed(111) plot(florentine_m, vertex.label.cex = V(florentine_m)$wealth * 0.01, vertex.label.color = "black", vertex.color = "white", vertex.frame.color = "gray" ) ``` We use this dataset below to illustrate how a dominance based assessment of centrality offers an alternative approach to centrality in two ways: * Are the Medici universally the most central family? * Using centrality as explanatory variable (Can centrality explain the wealth attribute?) ________________________________________________________________________________ ## Most central family (Index approach) The network is often used to benchmark new centrality indices. The premise is that the Medici should always emerge as one of the most central once (if not **the** most central). We start by applying some of the standard centrality indices given in the `igraph` package. ```{r cent} cent.df <- data.frame( degree = degree(florentine_m), betweenness = betweenness(florentine_m), closeness = closeness(florentine_m), eigenvector = eigen_centrality(florentine_m)$vector, subgraph = subgraph_centrality(florentine_m) ) # most central family according to the 5 indices V(florentine_m)$name[apply(cent.df, 2, which.max)] ``` In all cases, the Medici are considered to be the most central family. However, it is possible to find indices that rank other families on top. An example is *odd subgraph centrality*, which can be assembled with the `netrankr` package. ```{r cent_new} # odd subgraph centrality sc_odd <- florentine_m %>% indirect_relations(type = "walks", FUN = walks_exp_odd) %>% aggregate_positions(type = "self") # family with highest score V(florentine_m)$name[which.max(sc_odd)] ``` In this example, the Strozzi family are considered to be the most central family and the Medici are ranked third. Although we have found $5$ indices that consider the Medici the most central family, we can not guarantee, that there do not exist hundreds (or thousands?) of indices that would give an entirely different result. ________________________________________________________________________________ ## Most central family (Dominance approach) We start by calculating the neighborhood-inclusion preorder, the most general requirement for any centrality index. ```{r ni} P <- neighborhood_inclusion(florentine_m) ``` With the function `comparable_pairs()` we can assess how many pairs of families are already ordered, before applying any index. ```{r ni_comp} comparable_pairs(P) ``` Only around `r round(comparable_pairs(P)*100,0)`% of pairs of families are comparable, leaving `r 100-round(comparable_pairs(P)*100,0)`% of pairs of families that could be ordered (basically) arbitrarily. If we want to visually assess the dominance relations we can use the function `dominance_graph()`. ```{r dom_graph,fig.height=5,fig.width=5,fig.align='center'} d <- dominance_graph(P) V(d)$name <- V(florentine_m)$name set.seed(113) plot(d, vertex.label.color = "black", vertex.color = "white", vertex.frame.color = "gray", edge.arrow.size = 0.5 ) ``` The Castellan family neither dominates nor is dominated by any other family. This means, that we can find indices that potentially rank them on top or on the bottom, or anything in between. To better assess the potential ranks of nodes, we can plot the rank intervals ```{r rk_inter,fig.align='center',fig.width=5,fig.height=5} plot(rank_intervals(P)) ``` Observe how big the intervals are, indicating that there is ample scope to rank the families differently. These intervals, however, only give us a rough estimate of this arbitrariness and not any rank probabilities. To get all exact probabilities, we use the function `exact_rank_prob()`. ```{r probs} res <- exact_rank_prob(P) ``` There are `r format(res$lin.ext,scientific = F,big.mark = ",")` different possibilities to rank the families! (*the value is stored in `res$lin.ext`. *). This means that theoretically, we can find almost 4 Billion indices that rank the families differently. The *rank probabilities* of families can be found in `res$rank.prob`. They are returned as a matrix where rows are families and columns are ranks. That is, an entry in row $u$ and column $k$ gives the probability that $u$ has rank $k$ (larger $k$ indicate higher ranks) Mostly, you will be interested in the probability to be the most central node of a network. Below, we calculate these probability for all families and return the one's that have a higher probability than $0.1$. ```{r likely_most_central} top_rank_prob <- res$rank.prob[, 15] names(top_rank_prob) <- V(florentine_m)$name round(top_rank_prob[top_rank_prob > 0.1], 3) ``` The Strozzi family, with $0.13$, has the highest probability to be top ranked, followed by the Medici with $0.12$. If we are only interested in a subset of nodes, in our case maybe the Strozzi and Medici, we can assess the *relative rank probabilities* in `res$relative.rank`. Again, probabilities are returned as matrix objects, where an entry in row $u$ and column $v$ gives the probability that $u$ is ranked below $v$. Below we calculate this probability for the Strozzi and Medici. ```{r medici_strozzi,eval=TRUE} id_strozzi <- which(V(florentine_m)$name == "Strozzi") id_medici <- which(V(florentine_m)$name == "Medici") res$relative.rank[id_strozzi, id_medici] ``` The probability that the Strozzi are less central than the Medici is `r round(res$relative.rank[id_strozzi,id_medici],2)` and thus very close to a "fifty-fifty" chance. The last result of interest returned by `exact_rank_prob()` are the *expected ranks* in `res$expected.rank`. The expected ranks, as the name indicates, returns the ranks that we expect families to have in a centrality ranking. ```{r exp_rank,echo=FALSE} tab <- data.frame(Name = V(florentine_m)$name, Expected = round(res$expected.rank, 2)) tab <- tab[order(tab[, 2], decreasing = TRUE), ] knitr::kable(tab, row.names = F) ``` Although the Strozzi have a higher probability to be the most central family, over all we still expect the Medici to be the most central. This very general assessment gives us a general idea of the scope of potential centrality analyses. The more possible rankings we have (as in this case!) the more unreliable an index driven approach can be. We will explore this in more detail in the following section. ________________________________________________________________________________ ## Centrality as explanatory variable (Index approach) Usually, we are not simply interested in a ranking of nodes, but we rather would like to use centrality to explain certain node attributes. In our case, we might be interested in the question: "can an index explain the wealth of families?", or if we already have a more concrete idea "can proximity to other families explain the wealth attribute?" "Proximity" can be translated to the graph-theoretic concept of shortest path distances, such that closeness centrality would be an adequate candidate as an index. We here use the pipeline approach of the `netrankr` package instead of the `closeness()` function of `igraph`. The reasons will become evident in the next section. ```{r closeness_vs_wealth} # Closeness c_C <- florentine_m %>% indirect_relations(type = "dist_sp") %>% aggregate_positions(type = "invsum") cor(c_C, V(florentine_m)$wealth, method = "kendall") ``` The correlation between closeness and wealth (`r round(cor(c_C,V(florentine_m)$wealth,method="kendall"),4)`) is far to low to constitute that "proximity" is related to wealth. However, there exist various other indices, that are based on the shortest path distances in a graph. Refer to the literature for more details on these indices. ```{r harmclos_wealth} # harmonic closeness c_HC <- florentine_m %>% indirect_relations(type = "dist_sp", FUN = dist_inv) %>% aggregate_positions(type = "sum") # residual closeness (Dangalchev,2006) c_RC <- florentine_m %>% indirect_relations(type = "dist_sp", FUN = dist_2pow) %>% aggregate_positions(type = "sum") # integration centrality (Valente & Foreman, 1998) dist_integration <- function(x) { x <- 1 - (x - 1) / max(x) } c_IN <- florentine_m %>% indirect_relations(type = "dist_sp", FUN = dist_integration) %>% aggregate_positions(type = "sum") c( cor(c_HC, V(florentine_m)$wealth, method = "kendall"), cor(c_RC, V(florentine_m)$wealth, method = "kendall"), cor(c_IN, V(florentine_m)$wealth, method = "kendall") ) ``` The highest correlation (0.1159) is achieved for residual closeness, however, this is still too low to conclude that proximity is related to wealth. Besides the already considered indices, there exist further one's that include a free parameter. The idea is that the parameter can be tuned to maximize the correlation between the index and the attribute under consideration. Again, the mathematical details can be found in the respective literature. ```{r distalpha_wealth,warning=FALSE} # generalized closeness (Agneessens et al.,2017) (alpha>0) sum(dist^-alpha) alpha <- c(seq(0.01, 0.99, 0.01), seq(1, 10, 0.1)) scores <- sapply(alpha, function(x) { florentine_m %>% indirect_relations(type = "dist_sp", FUN = dist_dpow, alpha = x) %>% aggregate_positions(type = "sum") }) cors_gc <- apply( scores, 2, function(x) cor(x, V(florentine_m)$wealth, method = "kendall") ) res_gc <- c(max(cors_gc), alpha[which.max(cors_gc)]) # decay centrality (Jackson, 2010) (alpha in [0,1]) sum(alpha^dist) alpha <- seq(0.01, 0.99, 0.01) scores <- sapply(alpha, function(x) { florentine_m %>% indirect_relations(type = "dist_sp", FUN = dist_powd, alpha = x) %>% aggregate_positions(type = "sum") }) cors_dc <- apply( scores, 2, function(x) cor(x, V(florentine_m)$wealth, method = "kendall") ) res_dc <- c(max(cors_dc), alpha[which.max(cors_dc)]) ``` The highest correlation for generalized closeness is `r res_gc[1]` achieved for $\alpha$= `r res_gc[2]` The highest correlation for decay centrality is `r res_dc[1]` achieved for $\alpha$=`r res_dc[2]` We could know accept that there is no index based on shortest path distances that could explain wealth. Or, we could start to craft new indices that might yield a better correlation with wealth. However, we then enter the dilemma that was mentioned at the end of the last section. If we find one, we can not be certain that there might not even be a better one out there. In contrast, if we do not succeed, we can not guarantee that there does not exist an index with a higher correlation. ________________________________________________________________________________ ## Centrality as explanatory variable (Dominance approach) Since we are postulating a connection between proximity and wealth, we compute the pairwise shortest path distances as our *indirect relation* of interest and calculate the positional dominance relations. ```{r pos_dom_hetero} D <- florentine_m %>% indirect_relations(type = "dist_sp") %>% positional_dominance(benefit = F) comparable_pairs(D) ``` Note that exactly the same pairs are comparable as for neighborhood-inclusion. However, with one additional assumption, we will be able to increase the number of comparable pairs significantly (and thus reduce the space of potential rankings). By summing up distances in various ways, as done by the indices above, we assume families to be *homogeneous*. It doesn't matter to whom we have a small distance, it just matters that they are small. If we can safely comply with this assumption, we can use positional dominance *under total homogeneity*. It is important to note, that if a family is dominated by another under this premise, it will have a lower score in **any** distance based centrality index. ```{r pos_dom_homo} D <- florentine_m %>% indirect_relations(type = "dist_sp") %>% positional_dominance(benefit = F, map = T) comparable_pairs(D) ``` The number of comparable pairs increased from $0.15$ to $0.82$, thus reducing the space of potential centrality rankings based on distances significantly. We proceed to explore if there is potential for a distance based ranking to explain wealth perfectly. This is only possible, if families with lower wealth do not dominate wealthier family. Otherwise they would always be ranked higher, prohibiting a perfect correlation. The figure below shows the dominance relations as a directed graph, where the x coordinate of nodes is proportional to the wealth attribute and the y coordinate to the number of dominated families. Any edge pointing to the left (shown in red) denotes a pair of "wrongly" ordered families, i.e. a wealthy family is dominated by a less wealthy one. ```{r plot dist_dom,fig.align="center",fig.height=5,fig.width=5,echo=F} d <- dominance_graph(D) V(d)$name <- V(florentine_m)$name x <- V(florentine_m)$wealth x[9] <- x[9] - 50 x[14] <- x[14] - 50 y <- colSums(D) el <- get.edgelist(d, names = F) E(d)$color <- "gray" col <- apply(el, 1, function(x) V(florentine_m)$wealth[x[1]] > V(florentine_m)$wealth[x[2]]) # wrong:41 E(d)$color[col] <- "indianred" plot(d, layout = cbind(x, y), vertex.label.cex = 0.75, vertex.label.color = "black", vertex.color = "white", vertex.frame.color = "gray", edge.arrow.size = 0.4 ) ``` In total, we find $41$ such pairs ($39$% of all pairs). This implies that we are (potentially) quite far from being able to explain wealth perfectly with shortest path distances. In the following we will explore *how* far away. We start by calculating the rank intervals to illustrate the difference to neighborhood-inclusion. ```{r rk_inter_dist,fig.align='center',fig.width=5,fig.height=5} plot(rank_intervals(D)) ``` All intervals shrunk significantly and even correspond to a single point for two families (Pazzi and Medici). This implies that no matter which distance based index we use, the Pazzi family will always be ranked last and the Medici always on top. For a more exact assessment we again use the function `exact_rank_prob()`. ```{r dist_probs} res <- exact_rank_prob(D) ``` In total, there are `r format(res$lin.ext,scientific = F,big.mark = ",")` distance based rankings possible. This is a huge reduction from the general case where almost $4$ billion are possible. To determine the best possible correlation between wealth and any distance based ranking, we first need to determine all `r format(res$lin.ext,scientific = F,big.mark = ",")` rankings. For this purpose, we rerun the previous analysis with `only.results=FALSE` to obtain the necessary data structure. ```{r dist_probs_lat} res <- exact_rank_prob(D, only.results = FALSE) ``` Now, we can use the function `get_rankings()` which returns all rankings as a matrix. ```{r all_rankings} all_ranks <- get_rankings(res) dim(all_ranks) ``` No we can simply loop over all rankings and calculate the correlation between the ranking and the wealth attribute. ```{r all_cors} dist_cor <- apply( all_ranks, 2, function(x) cor(V(florentine_m)$wealth, x, method = "kendall") ) c(max_cor = max(dist_cor), mean_cor = mean(dist_cor)) ``` The highest achievable correlation is `r round(max(dist_cor),4)`. We can conclude, that there can not be any distance based centrality index that can reasonably explain the wealth attribute. We can additionally consider the correlation between degree and wealth, calculated below. ```{r cor_deg} cor(degree(florentine_m), V(florentine_m)$wealth, method = "kendall") ``` The correlation is higher than any distance based index can have. Thus, we can additionally conclude that marriage ties are more indicative for wealth than proximity in the marriage network.