In this tutorial we go through some of the functionality of the
NetworkInference
package using the example application from
Desmarais et al. (2015) and extending it with the recently released State Policy Innovation and
Diffusion (SPID) database. In this paper Desmarais et al. infer a
latent network for policy diffusion based on adoption of policies in the
US states.
netinf
infers the optimal diffusion network from a set
of nodes (in this case US-states) and a number of so called
cascades (here a cascade corresponds to a policy that is
adopted by at least two states). A cascade is a series of events
occurring at a specified time.
Installing NetworkInference
:
Some other packages required for this tutorial:
The policy adoption data is available in the package:
This loads two data.frame
-objects. policies
contains the adoption events and policies_metadata
contains
additional information on each policy.
statenam | policy | adopt_year |
---|---|---|
Alaska | aboldeapen | 1957 |
Delaware | aboldeapen | 1958 |
Hawaii | aboldeapen | 1957 |
Iowa | aboldeapen | 1965 |
Maine | aboldeapen | 1887 |
Michigan | aboldeapen | 1846 |
The first column (statenam
) gives the state that adopted
the policy
in year adopt_year
. Let’s take a
closer look at the data.
Number of policies:
Number of adoption events:
Some example policies:
unique(policies$policy)[1:5]
#> [1] "aboldeapen" "aborparc"
#> [3] "aborparn" "aborpreroe"
#> [5] "abortion_consent_1973_199"
Not all the policy abbreviations are understandable. So let’s check the metadata for more information:
policy | first_year | last_year | adopt_count |
---|---|---|---|
civil defense and disaster compact | 1951 | 1978 | 19 |
civinjaut | 1998 | 2001 | 15 |
class actions | 1977 | 1980 | 2 |
clinic_access | 1973 | 2005 | 16 |
cogrowman | 1961 | 1998 | 10 |
description | majortopic |
---|---|
Bilateral/Mutual Aid For Disasters | Domestic Commerce |
Civil Injunction Authority | Law and Crime |
Provides Complete Procedures To Govern The Conduct Of A Class Action Law Suit. | Law and Crime |
No-Protest Zone Around Abortion Clinic | Civil Rights |
Requiring Local Government To Coordinate Growth Management | Housing |
Most functionality of the NetworkInference
package is
based on the cascades
data format. So before starting with
the analysis we have to transform our data to such an object.
policy_cascades <- as_cascade_long(policies, cascade_node_name = 'statenam',
event_time = 'adopt_year',
cascade_id = 'policy')
In this case we used the function as_cascade_long
. If
your data is in wide format you can convert it using the function
as_cascade_wide
.
The cascade
class contains the same data as the
policies
data.frame
, just in a different
format. You don’t need to understand how the object is constructed but
let’s take a look for clarity:
class(policy_cascades)
#> [1] "cascade" "list"
length(policy_cascades)
#> [1] 3
names(policy_cascades)
#> [1] "cascade_nodes" "cascade_times" "node_names"
The cascade
class is basically a list containing three
elements:
policy_cascades$cascade_nodes[2:3]
#> $aborparc
#> [1] "Louisiana" "Massachusetts" "Rhode Island" "Missouri"
#> [5] "Indiana" "Alabama" "Maine" "Wyoming"
#> [9] "South Carolina" "Michigan" "Wisconsin" "Kentucky"
#> [13] "Pennsylvania" "North Carolina" "Tennessee"
#>
#> $aborparn
#> [1] "Minnesota" "Utah" "Arizona" "West Virginia"
#> [5] "Arkansas" "Connecticut" "Ohio" "Georgia"
#> [9] "Nebraska" "Kansas" "Maryland" "Tennessee"
#> [13] "Delaware" "Idaho" "South Dakota" "Virginia"
#> [17] "Texas"
policy_cascades$cascade_times[2:3]
#> $aborparc
#> [1] 1981 1981 1982 1983 1984 1987 1989 1989 1990 1991 1992 1994 1994 1996 1999
#>
#> $aborparn
#> [1] 1981 1981 1982 1984 1989 1990 1990 1991 1991 1992 1992 1992 1996 1996 1998
#> [16] 1998 2000
There are a few convenience functions to manipulate the cascades (but
you can also manipulate the data before converting it to the
cascade
format).
selected_policies <- subset_cascade(cascade = policy_cascades,
selection = c('clinic_access', 'cogrowman'))
selected_policies[1:2]
#> $cascade_nodes
#> $cascade_nodes$clinic_access
#> [1] "Oregon" "Wisconsin" "Kansas" "Maryland"
#> [5] "California" "Nevada" "Colorado" "Connecticut"
#> [9] "Massachusetts" "Minnesota" "North Carolina" "Washington"
#> [13] "Maine" "Michigan" "New York" "Montana"
#>
#> $cascade_nodes$cogrowman
#> [1] "Hawaii" "Vermont" "Oregon" "Florida" "New Jersey"
#> [6] "Rhode Island" "Washington" "Maryland" "Arizona" "Tennessee"
#>
#>
#> $cascade_times
#> $cascade_times$clinic_access
#> [1] 1973 1985 1988 1989 1990 1991 1993 1993 1993 1993 1993 1993 1995 1995 1999
#> [16] 2005
#>
#> $cascade_times$cogrowman
#> [1] 1961 1970 1973 1985 1986 1988 1990 1992 1998 1998
time_constrained <- subset_cascade_time(cascade = selected_policies,
start_time = 1990, end_time = 2000)
time_constrained[1:2]
#> $cascade_nodes
#> $cascade_nodes$clinic_access
#> [1] "California" "Nevada" "Colorado" "Connecticut"
#> [5] "Massachusetts" "Minnesota" "North Carolina" "Washington"
#> [9] "Maine" "Michigan" "New York"
#>
#> $cascade_nodes$cogrowman
#> [1] "Washington" "Maryland" "Arizona" "Tennessee"
#>
#>
#> $cascade_times
#> $cascade_times$clinic_access
#> [1] 1990 1991 1993 1993 1993 1993 1993 1993 1995 1995 1999
#>
#> $cascade_times$cogrowman
#> [1] 1990 1992 1998 1998
less_nodes <- drop_nodes(cascades = time_constrained,
nodes = c('Maryland', 'Washington'))
less_nodes[1:2]
#> $cascade_nodes
#> $cascade_nodes$clinic_access
#> [1] "California" "Nevada" "Colorado" "Connecticut"
#> [5] "Massachusetts" "Minnesota" "North Carolina" "Maine"
#> [9] "Michigan" "New York"
#>
#> $cascade_nodes$cogrowman
#> [1] "Arizona" "Tennessee"
#>
#>
#> $cascade_times
#> $cascade_times$clinic_access
#> [1] 1990 1991 1993 1993 1993 1993 1993 1995 1995 1999
#>
#> $cascade_times$cogrowman
#> [1] 1998 1998
It’s always good practice to visually inspect the data before working
with it. The NetworkInference
package provides
functionality to visualize the cascade data.
The function summary.cascades()
provides quick summary
statistics on the cascade data:
summary(policy_cascades)
#> # cascades: 728
#> # nodes: 50
#> # nodes in cascades: 50
#> # possible edges: 2450
#>
#> Summary statistics for cascade length and number of ties:
#> length ties
#> Min. 1.00000 0.00000
#> 1st Qu. 11.00000 4.00000
#> Median 23.00000 11.00000
#> Mean 24.49863 14.53571
#> 3rd Qu. 40.00000 23.00000
#> Max. 50.00000 48.00000
The first four lines provide the number of cascades, the number of
nodes in the system, the number of nodes involved in cascades (there
might be states that we don’t have diffusion data on, but we still want
them represented in the dataset) and the possible number of edges in a
potential diffusion network (a diffusion edge between nodes
u
and v
only makes sense if there is at least
one cascade in which u
experiences an event before
v
). In this example there are 187 policies and 50 states.
Each state is involved in at least one policy cascade and a fully
connected diffusion network would have 2450 edges.
It also provides summary statistics on the distribution of the cascade lengths (number of nodes involved in each cascade) and the number of ties in the cascades (two nodes experiencing the same event at the same time). For our example, we can see that the ‘smallest’ policy was adopted by 10 states and the ‘largest’ by all 50 states. From the tie summaries we see that there is at least one policy that was adopted by 45 states in the same year.
The plot()
method allows to plot cascades with varying
degrees of detail. The argument label_nodes
(TRUE/FALSE
) provides node labels which require more space
but provide more detail. The argument selection
allows to
pick a subset of cascades to visualize in case there are too many to
plot. If label_nodes
is set to FALSE
each
event is depicted by a dot, which allows to visualize more cascades
simultaneously.
Let’s first look at the visualization with labels. Here we plot two cascades, selected by their name:
selection <- c('guncontrol_assaultweapon_ba', 'guncontrol_licenses_dealer')
plot(policy_cascades, label_nodes = TRUE, selection = selection)
#> Warning: The `guide` argument in `scale_*()` cannot be `FALSE`. This was deprecated in
#> ggplot2 3.3.4.
#> ℹ Please use "none" instead.
#> ℹ The deprecated feature was likely used in the NetworkInference package.
#> Please report the issue at
#> <https://github.com/desmarais-lab/NetworkInference/issues>.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
#> Warning: ggrepel: 17 unlabeled data points (too many overlaps). Consider
#> increasing max.overlaps
We can also plot more cascades with less detail:
selection <- c('waiting', 'threestrikes', 'unionlimits', 'smokeban',
'paperterror', 'miglab', 'methpre', 'lott', 'lemon', 'idtheft',
'harass', 'hatecrime', 'equalpay')
plot(policy_cascades, label_nodes = FALSE, selection = selection)
This produces a ‘violin plot’ for each cascade with the single diffusion events overplotted as dots. As we already saw in the previous visualization, the policy data has a lot of ties (i.e. many states adopted a policy in the same year) which is indicated by the areas of higher density in the violin plot.
The netinf
algorithm is implemented in the
netinf()
function. The netinf
inferrs edges
based on a diffusion model. That is we assume a parametric model for the
diffusion times between edges. Currently three different diffusion
models are implemented: The exponential distribution, the rayleigh
distribution and the log-normal distribution. The model can be chosen
with the trans_mod
argument (default is the exponential
distribution).
In the original implementation the number of edges to infer had to be
fixed and chosen by the researcher. If you want to run
netinf
in this classic way you can do so by specifiying all
parameters and the number of edges:
results <- netinf(policy_cascades, trans_mod = "exponential", n_edges = 100,
params = 0.5, quiet = TRUE)
The exponential model has only one parameter (lambda or the rate). If
there are more parameters the details section in the documentation of
the netinf
function (?netinf
) has more detail
on how to specify parameters and on the specific parametrization used by
the implementation.
n_edges
specifies how many edges should be inferred. See
@gomez2010inferring and @desmarais2015persistent for guidance on
choosing this parameter if running netinf in classic mode. If the number
of edges is specified manually, it has to be lower than the maximum
number of possible edges. An edge u->v
is only possible
if in at least one cascade u
experiences an event
before v
. This means, that the maximum number of
edges depends on the data. The function
count_possible_edges()
allows us to compute the maximum
number of edges in advance:
With version 1.2.0 netinf
can be run without manually
specifying the number of edges or the parameters of the diffusion
model.
After each iteration of the netinf algorithm, we check if the edge
added significant improvement to the network. This is done via a vuong
style test. Given the likelihood score for each cascade conditional on
the network inferred so far, we penalize the network with one addional
edge and test if the increase in likelihood accross all cascades is
significant. The user still has to specify a p-value cut-off. If the
p-value of an edge is larger than the specified cut-off the algorithm
stops inferring more edges. The cut-off is set via the
p_value_cutoff
argument.
results <- netinf(policy_cascades, trans_mod = "exponential",
p_value_cutoff = 0.1, params = 0.5, quiet = TRUE)
nrow(results)
#> [1] 872
We see that with a fixed lambda of 0.5 and a p-value cut-off of 0.1 the algorithm inferred 872 edges.
The diffusion model parameters can be selected automatically. Setting
the params
argument to NULL
(default value)
makes the netinf
function initialize the parameters
automatically. The parameters are initialized at the midpoint between
the MLE of the minimum diffusion times and the MLE of the maximum
diffusion times, across all cascades. Edges are then inferred until
either the p-value cut-off or a manually specified number of edges
(n_edges
) is reached.
Let’s take a look at the output of the algorithm. The output is a dataframe containing the inferred latent network in the form of an edgelist:
origin_node | destination_node | improvement | p_value |
---|---|---|---|
California | Oregon | 3745 | 0 |
California | Colorado | 3363 | 0 |
California | Maryland | 3356 | 0 |
California | Washington | 3307 | 0 |
California | Illinois | 3286 | 0 |
California | North Carolina | 3264 | 0 |
Each row corresponds to a directed edge. The first column indicates
the origin node, the second the destination node. The third column
displays the gain in model fit from each added edge. The last column
displays the p-value from the Vuong test of each edge. There is a
generic plot method to inspect the results. If more tweaking is
required, the results are a dataframe so it should be easy for the more
experienced users to make your own plot. With
type = "improvement"
the improvement from each edge can be
plotted:
We can also quickly check the p-value from the Vuong test associated with each edge addition:
In order to produce a quick visualization of the resulting diffusion
network we can use the plot method again, this time with
type = "network"
. Note that in order to use this
functionality the igraph package has to be installed.
#install.packages('igraph')
# For this functionality the igraph package has to be installed
# This code is only executed if the package is found:
if(requireNamespace("igraph", quietly = TRUE)) {
plot(results, type = "network")
}
If additional tweaking of the plot is desired, the network can be
visualized using igraph
explicitly. We refer you you to the
igraph
documentation for details on how to customize the plot.