Introduction

In this tutorial we go through some of the functionality of the NetworkInference package using the example application from Desmarais et al. (2015) and extending it with the recently released State Policy Innovation and Diffusion (SPID) database. In this paper Desmarais et al. infer a latent network for policy diffusion based on adoption of policies in the US states.

netinf infers the optimal diffusion network from a set of nodes (in this case US-states) and a number of so called cascades (here a cascade corresponds to a policy that is adopted by at least two states). A cascade is a series of events occurring at a specified time.

Installing the package

Installing NetworkInference:

install.packages('NetworkInference')

Some other packages required for this tutorial:

install.packages(c('dplyr', 'igraph'))

Exploring SPID Data

The policy adoption data is available in the package:

library(NetworkInference)
# Load the `policies` dataset (?policies for details).
data('policies')

This loads two data.frame-objects. policies contains the adoption events and policies_metadata contains additional information on each policy.

head(policies)

statenam	policy	adopt_year
Alaska	aboldeapen	1957
Delaware	aboldeapen	1958
Hawaii	aboldeapen	1957
Iowa	aboldeapen	1965
Maine	aboldeapen	1887
Michigan	aboldeapen	1846

The first column (statenam) gives the state that adopted the policy in year adopt_year. Let’s take a closer look at the data.

Number of policies:

length(unique(policies$policy))
#> [1] 728

Number of adoption events:

nrow(policies)
#> [1] 17835

Some example policies:

unique(policies$policy)[1:5]
#> [1] "aboldeapen"                "aborparc"                 
#> [3] "aborparn"                  "aborpreroe"               
#> [5] "abortion_consent_1973_199"

Not all the policy abbreviations are understandable. So let’s check the metadata for more information:

library(dplyr)
filter(policies_metadata, policy %in% unique(policy)[100:104]) %>%
    select(-source)

Table continues below
policy	first_year	last_year	adopt_count
civil defense and disaster compact	1951	1978	19
civinjaut	1998	2001	15
class actions	1977	1980	2
clinic_access	1973	2005	16
cogrowman	1961	1998	10

description	majortopic
Bilateral/Mutual Aid For Disasters	Domestic Commerce
Civil Injunction Authority	Law and Crime
Provides Complete Procedures To Govern The Conduct Of A Class Action Law Suit.	Law and Crime
No-Protest Zone Around Abortion Clinic	Civil Rights
Requiring Local Government To Coordinate Growth Management	Housing

Preparing the Data for NetworkInference

Most functionality of the NetworkInference package is based on the cascades data format. So before starting with the analysis we have to transform our data to such an object.

policy_cascades <- as_cascade_long(policies, cascade_node_name = 'statenam',
                                   event_time = 'adopt_year', 
                                   cascade_id = 'policy')

In this case we used the function as_cascade_long. If your data is in wide format you can convert it using the function as_cascade_wide.

The cascade class contains the same data as the policies data.frame, just in a different format. You don’t need to understand how the object is constructed but let’s take a look for clarity:

class(policy_cascades)
#> [1] "cascade" "list"
length(policy_cascades)
#> [1] 3
names(policy_cascades)
#> [1] "cascade_nodes" "cascade_times" "node_names"

The cascade class is basically a list containing three elements:

policy_cascades$cascade_nodes[2:3]
#> $aborparc
#>  [1] "Louisiana"      "Massachusetts"  "Rhode Island"   "Missouri"      
#>  [5] "Indiana"        "Alabama"        "Maine"          "Wyoming"       
#>  [9] "South Carolina" "Michigan"       "Wisconsin"      "Kentucky"      
#> [13] "Pennsylvania"   "North Carolina" "Tennessee"     
#> 
#> $aborparn
#>  [1] "Minnesota"     "Utah"          "Arizona"       "West Virginia"
#>  [5] "Arkansas"      "Connecticut"   "Ohio"          "Georgia"      
#>  [9] "Nebraska"      "Kansas"        "Maryland"      "Tennessee"    
#> [13] "Delaware"      "Idaho"         "South Dakota"  "Virginia"     
#> [17] "Texas"

policy_cascades$cascade_times[2:3]
#> $aborparc
#>  [1] 1981 1981 1982 1983 1984 1987 1989 1989 1990 1991 1992 1994 1994 1996 1999
#> 
#> $aborparn
#>  [1] 1981 1981 1982 1984 1989 1990 1990 1991 1991 1992 1992 1992 1996 1996 1998
#> [16] 1998 2000

There are a few convenience functions to manipulate the cascades (but you can also manipulate the data before converting it to the cascade format).

Subsetting by Cascade

selected_policies <- subset_cascade(cascade = policy_cascades, 
                                    selection = c('clinic_access', 'cogrowman'))
selected_policies[1:2]
#> $cascade_nodes
#> $cascade_nodes$clinic_access
#>  [1] "Oregon"         "Wisconsin"      "Kansas"         "Maryland"      
#>  [5] "California"     "Nevada"         "Colorado"       "Connecticut"   
#>  [9] "Massachusetts"  "Minnesota"      "North Carolina" "Washington"    
#> [13] "Maine"          "Michigan"       "New York"       "Montana"       
#> 
#> $cascade_nodes$cogrowman
#>  [1] "Hawaii"       "Vermont"      "Oregon"       "Florida"      "New Jersey"  
#>  [6] "Rhode Island" "Washington"   "Maryland"     "Arizona"      "Tennessee"   
#> 
#> 
#> $cascade_times
#> $cascade_times$clinic_access
#>  [1] 1973 1985 1988 1989 1990 1991 1993 1993 1993 1993 1993 1993 1995 1995 1999
#> [16] 2005
#> 
#> $cascade_times$cogrowman
#>  [1] 1961 1970 1973 1985 1986 1988 1990 1992 1998 1998

Subsetting in Time

time_constrained <- subset_cascade_time(cascade = selected_policies, 
                                        start_time = 1990, end_time = 2000)
time_constrained[1:2]
#> $cascade_nodes
#> $cascade_nodes$clinic_access
#>  [1] "California"     "Nevada"         "Colorado"       "Connecticut"   
#>  [5] "Massachusetts"  "Minnesota"      "North Carolina" "Washington"    
#>  [9] "Maine"          "Michigan"       "New York"      
#> 
#> $cascade_nodes$cogrowman
#> [1] "Washington" "Maryland"   "Arizona"    "Tennessee" 
#> 
#> 
#> $cascade_times
#> $cascade_times$clinic_access
#>  [1] 1990 1991 1993 1993 1993 1993 1993 1993 1995 1995 1999
#> 
#> $cascade_times$cogrowman
#> [1] 1990 1992 1998 1998

Removing Nodes (States)

less_nodes <- drop_nodes(cascades = time_constrained, 
                         nodes = c('Maryland', 'Washington'))
less_nodes[1:2]
#> $cascade_nodes
#> $cascade_nodes$clinic_access
#>  [1] "California"     "Nevada"         "Colorado"       "Connecticut"   
#>  [5] "Massachusetts"  "Minnesota"      "North Carolina" "Maine"         
#>  [9] "Michigan"       "New York"      
#> 
#> $cascade_nodes$cogrowman
#> [1] "Arizona"   "Tennessee"
#> 
#> 
#> $cascade_times
#> $cascade_times$clinic_access
#>  [1] 1990 1991 1993 1993 1993 1993 1993 1995 1995 1999
#> 
#> $cascade_times$cogrowman
#> [1] 1998 1998

Visually Inspecting Cascades

It’s always good practice to visually inspect the data before working with it. The NetworkInference package provides functionality to visualize the cascade data.

The function summary.cascades() provides quick summary statistics on the cascade data:

summary(policy_cascades)
#> # cascades: 728
#> # nodes: 50
#> # nodes in cascades: 50
#> # possible edges: 2450
#> 
#> Summary statistics for cascade length and number of ties:
#>           length     ties
#> Min.     1.00000  0.00000
#> 1st Qu. 11.00000  4.00000
#> Median  23.00000 11.00000
#> Mean    24.49863 14.53571
#> 3rd Qu. 40.00000 23.00000
#> Max.    50.00000 48.00000

The first four lines provide the number of cascades, the number of nodes in the system, the number of nodes involved in cascades (there might be states that we don’t have diffusion data on, but we still want them represented in the dataset) and the possible number of edges in a potential diffusion network (a diffusion edge between nodes u and v only makes sense if there is at least one cascade in which u experiences an event before v). In this example there are 187 policies and 50 states. Each state is involved in at least one policy cascade and a fully connected diffusion network would have 2450 edges.

It also provides summary statistics on the distribution of the cascade lengths (number of nodes involved in each cascade) and the number of ties in the cascades (two nodes experiencing the same event at the same time). For our example, we can see that the ‘smallest’ policy was adopted by 10 states and the ‘largest’ by all 50 states. From the tie summaries we see that there is at least one policy that was adopted by 45 states in the same year.

The plot() method allows to plot cascades with varying degrees of detail. The argument label_nodes (TRUE/FALSE) provides node labels which require more space but provide more detail. The argument selection allows to pick a subset of cascades to visualize in case there are too many to plot. If label_nodes is set to FALSE each event is depicted by a dot, which allows to visualize more cascades simultaneously.

Let’s first look at the visualization with labels. Here we plot two cascades, selected by their name:

selection <- c('guncontrol_assaultweapon_ba', 'guncontrol_licenses_dealer')
plot(policy_cascades, label_nodes = TRUE, selection = selection)
#> Warning: The `guide` argument in `scale_*()` cannot be `FALSE`. This was deprecated in
#> ggplot2 3.3.4.
#> ℹ Please use "none" instead.
#> ℹ The deprecated feature was likely used in the NetworkInference package.
#>   Please report the issue at
#>   <https://github.com/desmarais-lab/NetworkInference/issues>.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
#> Warning: ggrepel: 17 unlabeled data points (too many overlaps). Consider
#> increasing max.overlaps

We can also plot more cascades with less detail:

selection <- c('waiting', 'threestrikes', 'unionlimits', 'smokeban', 
               'paperterror', 'miglab', 'methpre', 'lott', 'lemon', 'idtheft',
               'harass', 'hatecrime', 'equalpay')
plot(policy_cascades, label_nodes = FALSE, selection = selection)

This produces a ‘violin plot’ for each cascade with the single diffusion events overplotted as dots. As we already saw in the previous visualization, the policy data has a lot of ties (i.e. many states adopted a policy in the same year) which is indicated by the areas of higher density in the violin plot.

Inferring the Latent Diffusion Network

The netinf algorithm is implemented in the netinf() function. The netinf inferrs edges based on a diffusion model. That is we assume a parametric model for the diffusion times between edges. Currently three different diffusion models are implemented: The exponential distribution, the rayleigh distribution and the log-normal distribution. The model can be chosen with the trans_mod argument (default is the exponential distribution).

Classic Netinf

In the original implementation the number of edges to infer had to be fixed and chosen by the researcher. If you want to run netinf in this classic way you can do so by specifiying all parameters and the number of edges:

results <- netinf(policy_cascades, trans_mod = "exponential", n_edges = 100, 
                  params = 0.5, quiet = TRUE)

The exponential model has only one parameter (lambda or the rate). If there are more parameters the details section in the documentation of the netinf function (?netinf) has more detail on how to specify parameters and on the specific parametrization used by the implementation.

n_edges specifies how many edges should be inferred. See @gomez2010inferring and @desmarais2015persistent for guidance on choosing this parameter if running netinf in classic mode. If the number of edges is specified manually, it has to be lower than the maximum number of possible edges. An edge u->v is only possible if in at least one cascade u experiences an event before v. This means, that the maximum number of edges depends on the data. The function count_possible_edges() allows us to compute the maximum number of edges in advance:

npe <- count_possible_edges(policy_cascades)
npe
#> [1] 2450

Automatic Parameter Selection

With version 1.2.0 netinf can be run without manually specifying the number of edges or the parameters of the diffusion model.

Selecting the number of edges automatically

After each iteration of the netinf algorithm, we check if the edge added significant improvement to the network. This is done via a vuong style test. Given the likelihood score for each cascade conditional on the network inferred so far, we penalize the network with one addional edge and test if the increase in likelihood accross all cascades is significant. The user still has to specify a p-value cut-off. If the p-value of an edge is larger than the specified cut-off the algorithm stops inferring more edges. The cut-off is set via the p_value_cutoff argument.

results <- netinf(policy_cascades, trans_mod = "exponential", 
                  p_value_cutoff = 0.1, params = 0.5, quiet = TRUE)
nrow(results)
#> [1] 872

We see that with a fixed lambda of 0.5 and a p-value cut-off of 0.1 the algorithm inferred 872 edges.

Selecting the parameters of the diffusion model

The diffusion model parameters can be selected automatically. Setting the params argument to NULL (default value) makes the netinf function initialize the parameters automatically. The parameters are initialized at the midpoint between the MLE of the minimum diffusion times and the MLE of the maximum diffusion times, across all cascades. Edges are then inferred until either the p-value cut-off or a manually specified number of edges (n_edges) is reached.

results <- netinf(policy_cascades, trans_mod = "exponential", 
                  p_value_cutoff = 0.1, quiet = TRUE)
nrow(results)
#> [1] 866

Netinf output

Let’s take a look at the output of the algorithm. The output is a dataframe containing the inferred latent network in the form of an edgelist:

head(results)

origin_node	destination_node	improvement
California	Oregon	3745
California	Colorado	3363
California	Maryland	3356
California	Washington	3307
California	Illinois	3286
California	North Carolina	3264

Each row corresponds to a directed edge. The first column indicates the origin node, the second the destination node. The third column displays the gain in model fit from each added edge. The last column displays the p-value from the Vuong test of each edge. There is a generic plot method to inspect the results. If more tweaking is required, the results are a dataframe so it should be easy for the more experienced users to make your own plot. With type = "improvement" the improvement from each edge can be plotted:

plot(results, type = "improvement")

We can also quickly check the p-value from the Vuong test associated with each edge addition:

plot(results, type = 'p-value')

In order to produce a quick visualization of the resulting diffusion network we can use the plot method again, this time with type = "network". Note that in order to use this functionality the igraph package has to be installed.

#install.packages('igraph')
# For this functionality the igraph package has to be installed
# This code is only executed if the package is found:
if(requireNamespace("igraph", quietly = TRUE)) {
    plot(results, type = "network")
}

If additional tweaking of the plot is desired, the network can be visualized using igraph explicitly. We refer you you to the igraph documentation for details on how to customize the plot.

if(requireNamespace("igraph", quietly = TRUE)) {
    library(igraph) 
    g <- graph_from_data_frame(d = results[, 1:2])
    plot(g, edge.arrow.size=.3, vertex.color = "grey70")
}

- Introduction

NetworkInference Tutorial: Persistent Policy Diffusion Ties