Title: | Inferring Latent Diffusion Networks |
---|---|
Description: | This is an R implementation of the netinf algorithm (Gomez Rodriguez, Leskovec, and Krause, 2010) <doi:10.1145/1835804.1835933>. Given a set of events that spread between a set of nodes, the algorithm infers the most likely stable diffusion network underlying the diffusion process. |
Authors: | Fridolin Linder [aut, cre], Bruce Desmarais [ctb] |
Maintainer: | Fridolin Linder <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.2.4.9000 |
Built: | 2024-11-12 05:42:27 UTC |
Source: | https://github.com/desmarais-lab/networkinference |
Create a cascade object from data in long format.
as_cascade_long(data, cascade_node_name = "node_name", event_time = "event_time", cascade_id = "cascade_id", node_names = NULL)
data |
data.frame, containing the cascade data with column names corresponding to the arguments provided to cascade_node_name, event_time and cascade_id. |
cascade_node_name |
character, column name of the node identifier. |
event_time |
character, column name of the event time. |
cascade_id |
character, column name of the cascade identifier. |
node_names |
character, factor or numeric vector containing the names for each node. Optional. If not provided, node names are inferred from the cascade data. |
Each row of the data describes one event in the cascade. The data must contain at least three columns:
Cascade node name: The identifier of the node that experiences the event.
Event time: The time when the node experiences the event. Note that if the time column is of class date or any other special time class, it will be converted to an integer with 'as.numeric()'.
Cascade id: The identifier of the cascade that the event pertains to.
The default names for these columns are node_name, event_time and cascade_id. If other names are used in the data object, the names have to be specified in the corresponding arguments (see argument documentation).
An object of class cascade. This is a list containing three (named) elements:
"node_names": A character vector of node names.
"cascade_nodes": A list with one character vector per cascade containing the node names in order of the events.
"cascade_times": A list with one element per cascade containing the event times for the nodes in "cascade_nodes".
df <- simulate_rnd_cascades(10, n_nodes = 20)
cascades <- as_cascade_long(df)
is.cascade(cascades)
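To make the structure of the returned object concrete, the following base R sketch builds the three named elements from a small long-format data frame. It is an illustration only and not the package's implementation; the helper name `long_to_list` and the toy data are hypothetical.

```r
# Toy long-format data: one row per event (hypothetical values).
events <- data.frame(
  node_name  = c("a", "b", "c", "b", "a"),
  event_time = c(1, 3, 2, 1, 4),
  cascade_id = c("c1", "c1", "c1", "c2", "c2"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: sort events by cascade and time, then split by cascade.
long_to_list <- function(data) {
  data <- data[order(data$cascade_id, data$event_time), ]
  list(
    node_names    = sort(unique(data$node_name)),
    cascade_nodes = split(data$node_name, data$cascade_id),
    cascade_times = split(data$event_time, data$cascade_id)
  )
}

toy <- long_to_list(events)
toy$cascade_nodes$c1  # nodes of cascade "c1" in order of their events
toy$cascade_times$c1  # matching event times
```

Keeping nodes and times in two parallel lists means that position i in `cascade_nodes` always refers to the same event as position i in `cascade_times`.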
Create a cascade object from data in wide format.
as_cascade_wide(data, node_names = NULL)
data |
data.frame or matrix, rows corresponding to nodes, columns to cascades. Matrix entries are the event times for each node-cascade pair. Missing values indicate censored observations, that is, nodes that did not have an event. Specify column and row names if cascade and node ids other than integer sequences are desired. Note that if the time column is of class date or any other special time class, it will be converted to an integer with 'as.numeric()'. |
node_names |
character, factor or numeric vector, containing names for each node. Optional. If not provided, node names are inferred from the provided data. |
If data is in wide format, each row corresponds to a node and each column to a cascade. Each cell indicates the event time for a node-cascade combination. If a node did not experience an event for a cascade (the node is censored), the cell entry must be NA.
An object of class cascade. This is a list containing three (named) elements:
"node_names": A character vector of node names.
"cascade_nodes": A list with one character vector per cascade containing the node names in order of the events.
"cascade_times": A list with one element per cascade containing the event times for the nodes in "cascade_nodes".
data("policies")
cascades <- as_cascade_long(policies, cascade_node_name = 'statenam',
                            event_time = 'adopt_year', cascade_id = 'policy')
wide_policies <- as.matrix(cascades)
cascades <- as_cascade_wide(wide_policies)
is.cascade(cascades)
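The wide layout described above can be sketched in base R as follows. This is a minimal illustration with hypothetical toy data, not the package's code: every cell starts as NA (censored) and only observed events are filled in.

```r
# Toy long-format events (hypothetical values); node "c" has no event in "c2".
events <- data.frame(
  node_name  = c("a", "b", "c", "b", "a"),
  event_time = c(1, 3, 2, 1, 4),
  cascade_id = c("c1", "c1", "c1", "c2", "c2"),
  stringsAsFactors = FALSE
)

nodes       <- sort(unique(events$node_name))
cascade_ids <- sort(unique(events$cascade_id))

# Rows are nodes, columns are cascades; start with every cell censored (NA).
wide <- matrix(NA_real_, nrow = length(nodes), ncol = length(cascade_ids),
               dimnames = list(nodes, cascade_ids))

# Fill in observed event times by (row name, column name) pairs.
wide[cbind(events$node_name, events$cascade_id)] <- events$event_time
wide
```

The cell for node "c" in cascade "c2" stays NA, which is how a censored observation is expressed in this format.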
Generates a data frame containing the cascade information in the cascade object.
## S3 method for class 'cascade'
as.data.frame(x, row.names = NULL, optional = FALSE, ...)
x |
Cascade object to convert. |
row.names |
NULL or a character vector giving the row names for the data frame. Missing values are not allowed. |
optional |
logical. If TRUE, setting row names and converting column names (to syntactic names: see make.names) is optional. (Not supported) |
... |
Additional arguments passed to |
A data frame with three columns, containing 1) the names of the nodes ("node_name") that experience an event in each cascade, 2) the event time ("event_time") of the corresponding node, 3) the cascade identifier ("cascade_id").
data(cascades)
as.data.frame(cascades)
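The conversion back to long format amounts to flattening the two parallel lists. The following base R sketch shows the idea on a hypothetical toy list; it is not the package's implementation.

```r
# Toy list with the same element names as a cascade object (hypothetical data).
toy <- list(
  cascade_nodes = list(c1 = c("a", "c", "b"), c2 = c("b", "a")),
  cascade_times = list(c1 = c(1, 2, 3), c2 = c(1, 4))
)

# Flatten both lists and repeat each cascade id once per event.
long <- data.frame(
  node_name  = unlist(toy$cascade_nodes, use.names = FALSE),
  event_time = unlist(toy$cascade_times, use.names = FALSE),
  cascade_id = rep(names(toy$cascade_nodes), lengths(toy$cascade_nodes)),
  stringsAsFactors = FALSE
)
long
```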
Generates a matrix containing the cascade information in the cascade object in wide format. Missing values are used for nodes that do not experience an event in a cascade.
## S3 method for class 'cascade'
as.matrix(x, ...)
x |
cascade object to convert. |
... |
additional arguments to be passed to or from methods. (Currently not supported.) |
A matrix containing all cascade information in wide format. That is, each row of the matrix corresponds to a node and each column to a cascade. Cell entries are event times. Censored nodes have NA for their entry.
data(cascades)
as.matrix(cascades)
An example dataset of 31 nodes and 54 cascades. From the original netinf implementation in SNAP.
data(cascades)
An object of class cascade containing 4 objects
node_names: Character node names.
cascade_nodes: A list of integer vectors, each containing the names of the nodes infected in a cascade, in order of infection.
cascade_times: A list of numeric vectors, each containing the infection times for the corresponding nodes in cascade_nodes.
https://github.com/snap-stanford/snap/blob/master/examples/netinf/example-cascades.txt
Across all cascades, count the edges that are possible. An edge from node u to node v is only possible if in at least one cascade u experienced an event before v.
count_possible_edges(cascades)
cascades |
Object of class cascade containing the data. |
An integer count.
data(cascades)
count_possible_edges(cascades)
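The counting rule can be sketched in base R on a toy example. This is a minimal illustration, assuming "before" means a strictly earlier event time; the helper `count_pairs` and the data are hypothetical and not the package's implementation.

```r
# Toy cascades: named vectors of event times per node (hypothetical data).
cascade_times <- list(
  c1 = c(a = 1, b = 2, c = 3),
  c2 = c(b = 1, a = 4)
)

# Hypothetical helper: collect every ordered pair (u, v) where u has an event
# strictly before v in at least one cascade, then count the unique pairs.
count_pairs <- function(cascade_times) {
  pairs <- character(0)
  for (times in cascade_times) {
    for (u in names(times)) {
      later <- names(times)[times > times[[u]]]
      if (length(later) > 0) {
        pairs <- c(pairs, paste(u, later, sep = "->"))
      }
    }
  }
  length(unique(pairs))
}

count_pairs(cascade_times)
```

In the toy data, cascade "c1" allows a->b, a->c and b->c, and cascade "c2" adds b->a, so both directions between "a" and "b" are possible.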
Drop nodes from a cascade object
drop_nodes(cascades, nodes, drop = TRUE)
cascades |
cascade, object to drop nodes from. |
nodes |
character or integer, vector of node_ids to drop. |
drop |
logical, should empty cascades be dropped? |
An object of class cascade containing the cascades without the dropped nodes.
data(policies)
cascades <- as_cascade_long(policies, cascade_node_name = 'statenam',
                            event_time = 'adopt_year', cascade_id = 'policy')
new_cascades <- drop_nodes(cascades, c("California", "New York"))
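The effect of dropping nodes, including the role of the drop argument, can be illustrated with a base R sketch on a plain list. The helper `drop_nodes_sketch` and the toy data are hypothetical; in particular, removing the dropped nodes from `node_names` is an assumption and may differ from what the package does.

```r
# Toy list with the element names of a cascade object (hypothetical data).
toy <- list(
  node_names    = c("a", "b", "c"),
  cascade_nodes = list(c1 = c("a", "c", "b"), c2 = c("b")),
  cascade_times = list(c1 = c(1, 2, 3), c2 = c(1))
)

# Hypothetical helper: remove events of the given nodes from every cascade.
drop_nodes_sketch <- function(x, nodes, drop = TRUE) {
  for (id in names(x$cascade_nodes)) {
    keep <- !(x$cascade_nodes[[id]] %in% nodes)
    x$cascade_nodes[[id]] <- x$cascade_nodes[[id]][keep]
    x$cascade_times[[id]] <- x$cascade_times[[id]][keep]
  }
  # Assumption: dropped nodes also leave the set of potential nodes.
  x$node_names <- setdiff(x$node_names, nodes)
  if (drop) {
    # Discard cascades that have no events left.
    non_empty <- lengths(x$cascade_nodes) > 0
    x$cascade_nodes <- x$cascade_nodes[non_empty]
    x$cascade_times <- x$cascade_times[non_empty]
  }
  x
}

res <- drop_nodes_sketch(toy, "b")
names(res$cascade_nodes)  # cascade "c2" contained only "b" and is dropped
```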
Is the object of class cascade?
is.cascade(object)
object |
the object to be tested. |
TRUE if object is a cascade, FALSE otherwise.
data(cascades)
is.cascade(cascades)
# > TRUE
is.cascade(1)
# > FALSE
Tests if an object is of class diffnet. The class diffnet is appended to the object returned by netinf for dispatch of appropriate plotting methods.
is.diffnet(object)
object |
the object to be tested. |
TRUE if object is a diffnet, FALSE otherwise.
data(cascades)
result <- netinf(cascades, n_edges = 6, params = 1)
is.diffnet(result)
Infer a network of diffusion ties from a set of cascades. Each cascade is defined by pairs of node ids and infection times.
netinf(cascades, trans_mod = "exponential", n_edges = NULL, p_value_cutoff = NULL, params = NULL, quiet = FALSE, trees = FALSE)
cascades |
an object of class cascade containing node and cascade information. See as_cascade_long and as_cascade_wide. |
trans_mod |
character, indicating the choice of model: "exponential", "rayleigh" or "log-normal". |
n_edges |
integer, number of edges to infer. Leave unspecified if using p_value_cutoff. |
p_value_cutoff |
numeric, in the interval (0, 1). If specified, edges are inferred in each iteration until the Vuong test for edge addition reaches the p-value cutoff or the maximum possible number of edges is reached. Leave unspecified if using n_edges. |
params |
numeric, parameters for the diffusion model. If left unspecified, reasonable parameters are inferred from the data. See details for how to specify parameters for the different distributions. |
quiet |
logical, should progress output be suppressed? |
trees |
logical, should the inferred cascade trees be returned? Note that this changes the structure of the function output. See section Value for details. |
The algorithm is described in detail in Gomez-Rodriguez et al. (2010). Additional information can be found on the netinf website (http://snap.stanford.edu/netinf/).
Exponential distribution: trans_mod = "exponential", params = c(lambda). Parametrization: $\lambda e^{-\lambda x}$.
Rayleigh distribution: trans_mod = "rayleigh", params = c(alpha). Parametrization: $\frac{x}{\alpha^2} e^{-\frac{x^2}{2\alpha^2}}$.
Log-normal distribution: trans_mod = "log-normal", params = c(mu, sigma). Parametrization: $\frac{1}{x \sigma \sqrt{2\pi}} e^{-\frac{(\ln x - \mu)^2}{2\sigma^2}}$.
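To see what the parameters mean in practice, the three densities can be written out in base R. This sketch assumes the standard textbook parametrizations (rate lambda for the exponential, scale alpha for the Rayleigh, and mu and sigma on the log scale for the log-normal); whether the package uses exactly these forms internally is an assumption. The function names are hypothetical.

```r
# Densities under the standard parametrizations (assumed, see text above).
d_exponential <- function(x, lambda) lambda * exp(-lambda * x)
d_rayleigh    <- function(x, alpha) (x / alpha^2) * exp(-x^2 / (2 * alpha^2))
d_lognormal   <- function(x, mu, sigma) {
  exp(-(log(x) - mu)^2 / (2 * sigma^2)) / (x * sigma * sqrt(2 * pi))
}

# Sanity checks against base R densities and numerical integration.
x <- c(0.5, 1, 2)
all.equal(d_exponential(x, 2), dexp(x, rate = 2))
all.equal(d_lognormal(x, 0, 1), dlnorm(x, meanlog = 0, sdlog = 1))
integrate(d_rayleigh, 0, Inf, alpha = 1.5)$value  # should be close to 1
```

Base R has no built-in Rayleigh density, which is why that one is checked by integration rather than against a reference function.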
If higher performance is required and for very large data sets, a faster pure C++ implementation is available in the Stanford Network Analysis Project (SNAP). The software can be downloaded at http://snap.stanford.edu/netinf/.
Returns the inferred diffusion network as an edgelist in an object of class diffnet and data.frame. The first column contains the sender, the second column the receiver node. The third column contains the improvement in fit from adding the edge that is represented by the row. The output additionally has the following attributes:
"diffusion_model": The diffusion model used to infer the diffusion network.
"diffusion_model_parameters": The parameters for the model that have been inferred by the approximate profile MLE procedure.
If the argument trees is set to TRUE, the output is a list with the first element being the data.frame described above, and the second element being the trees in edge-list form in a single data.frame.
M. Gomez-Rodriguez, J. Leskovec, A. Krause. Inferring Networks of Diffusion and Influence. The 16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2010.
# Data already in cascades format:
data(cascades)
out <- netinf(cascades, trans_mod = "exponential", n_edges = 5, params = 1)

# Starting with a dataframe
df <- simulate_rnd_cascades(10, n_nodes = 20)
cascades2 <- as_cascade_long(df, node_names = unique(df$node_name))
out <- netinf(cascades2, trans_mod = "exponential", n_edges = 5, params = 1)
This package provides an R implementation of the netinf algorithm created by Gomez Rodriguez, Leskovec, and Krause (2010). Given a set of events that spread between a set of nodes, the algorithm infers the most likely stable diffusion network underlying the diffusion process.
The package provides three groups of functions: 1) data preparation, 2) estimation and 3) interpretation.
The core estimation function netinf requires an object of class cascade (see as_cascade_long and as_cascade_wide). Cascade data contains information on the potential nodes in the network as well as on event times for each node in each cascade.
Diffusion networks are estimated using the netinf function. It produces a diffusion network in the form of an edgelist (of class data.frame).
Cascade data can be visualized with the plot method of the cascade class (plot.cascade). Results of the estimation process can be visualized using the plotting method of the diffnet class.
If higher performance is required and for very large data sets, a faster pure C++ implementation is available in the Stanford Network Analysis Project (SNAP). The software can be downloaded at http://snap.stanford.edu/netinf/.
Allows plotting of one or multiple, labeled or unlabeled cascades.
## S3 method for class 'cascade'
plot(x, label_nodes = TRUE, selection = NULL, ...)
x |
object of class cascade to be plotted. |
label_nodes |
logical, indicating whether the nodes in each cascade should be labeled. If the cascades are very dense, setting this to FALSE is recommended. |
selection |
a vector of cascade ids to plot. |
... |
additional arguments passed to plot. |
The function returns a ggplot plot object (class gg, ggplot) which can be modified like any other ggplot. See the ggplot documentation and the examples below for more details.
A ggplot plot object.
data(cascades)
plot(cascades, selection = names(cascades$cascade_nodes)[1:5])
plot(cascades, label_nodes = FALSE, selection = sample(1:54, 20))

# Modify resulting ggplot object
library(ggplot2)
p <- plot(cascades, label_nodes = FALSE, selection = sample(1:54, 20))
## Add a title
p <- p + ggtitle('Your Title')
p
## Change Axis
p <- p + xlab("Your modified y axis label") #x and y labels are flipped here
p <- p + ylab("Your modified x axis label") #x and y labels are flipped here
p
Visualize the inferred diffusion network or the marginal gain in fit obtained by addition of each edge.
## S3 method for class 'diffnet'
plot(x, type = "network", ...)
x |
object of class diffnet to be plotted. |
type |
character, one of "network", "improvement" or "p-value". |
... |
additional arguments. |
If type = "improvement", a ggplot object is returned. It can be modified like any other ggplot. See the ggplot documentation and the examples in plot.cascade.
A ggplot plot object if type = "improvement", otherwise an igraph plot.
## Not run:
data(cascades)
res <- netinf(cascades, quiet = TRUE)
plot(res, type = "network")
plot(res, type = "improvement")
plot(res, type = "p-value")
## End(Not run)
The SPID data includes information on the year of adoption for over 700 policies in the American states.
data(policies)
The data comes in two objects of class data.frame. The first object, named policies, contains the adoption events. Each row corresponds to an adoption event. Each adoption event is described by the three columns:
statenam: Name of the adopting state.
policy: Name of the policy.
adopt_year: Year when the state adopted the policy.
The second object (policies_metadata) contains more details on each of the policies. It contains these columns:
policy: Name of the policy.
source: Original source of the data.
first_year: First year any state adopted this policy.
last_year: Last year any state adopted this policy.
adopt_count: Number of states that adopted this policy.
description: Description of the policy.
majortopic: Topic group the policy belongs to.
Both data.frame objects can be joined (merged) on the common column policy (see example code).
This is version 1.0 of the database. For each policy we document the year of first adoption for each state. Adoption dates range from 1691 to 2017 and include all fifty states. Policies are adopted by anywhere from 1 to 50 states, with an average of 24 adoptions. The data were assembled from a variety of sources, including academic publications and policy advocacy/information groups. Policies were coded according to the Policy Agendas Project major topic code. Additional information on policies is available at the source repository.
https://doi.org/10.7910/DVN/CVYSR7
Boehmke, Frederick J.; Mark Brockway; Bruce A. Desmarais; Jeffrey J. Harden; Scott LaCombe; Fridolin Linder; and Hanna Wallach. 2018. "A New Database for Inferring Public Policy Innovativeness and Diffusion Networks." Working paper.
data('policies')
# Join the adoption events with the metadata
merged_policies <- merge(policies, policies_metadata, by = 'policy')
A network from simulated data. For testing purposes.
data(sim_validation)
An object of class data.frame with 4 columns, containing:
Origin of diffusion edge.
Destination node of diffusion edge.
Improvement in score for the edge.
p-value for the Vuong test.
See code below.
Simulate diffusion cascades based on the generative model underlying netinf and a diffusion network.
simulate_cascades(diffnet, nsim = 1, max_time = Inf, start_probabilities = NULL, partial_cascade = NULL, params = NULL, model = NULL, nodes = NULL)
diffnet |
object of class diffnet. |
nsim |
integer, number of cascades to simulate. |
max_time |
numeric, the maximum time after which observations are censored |
start_probabilities |
a vector of probabilities for each node in diffnet,
to be the node with the first event. If |
partial_cascade |
object of type cascade, containing one partial cascade for which further development should be simulated. |
params |
numeric, (optional) parameters for the diffusion time distribution. See the details section of netinf. |
model |
character, diffusion model to use. One of |
nodes |
vector of node ids if different from nodes included in
|
A data frame with three columns, containing 1) the names of the nodes ("node_name") that experience an event in each cascade, 2) the event time ("event_time") of the corresponding node, 3) the cascade identifier ("cascade_id").
data(cascades)
out <- netinf(cascades, trans_mod = "exponential", n_edges = 5, params = 1)
simulated_cascades <- simulate_cascades(out, nsim = 10)
# Simulation from partial cascade
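The idea of simulating from a diffusion network can be sketched in base R. The sketch below is not the package's implementation. It assumes a simple reading of the generative model: each edge carries an independent exponential transmission delay, and a node's event time is the earliest arrival over all of its incoming edges. The edge list, the helper `simulate_one` and its arguments are hypothetical.

```r
set.seed(1)

# Hypothetical diffusion network as an edge list (origin -> destination).
edges <- data.frame(
  origin      = c("a", "a", "b", "c"),
  destination = c("b", "c", "d", "d"),
  stringsAsFactors = FALSE
)
nodes <- unique(c(edges$origin, edges$destination))

# Hypothetical helper: simulate one cascade that starts at node `seed`.
simulate_one <- function(edges, nodes, seed, lambda = 1, max_time = Inf) {
  # Draw one exponential transmission delay per edge.
  delay <- rexp(nrow(edges), rate = lambda)
  time <- setNames(rep(Inf, length(nodes)), nodes)
  time[seed] <- 0
  # Relax edges until no event time improves (Bellman-Ford style).
  repeat {
    arrival  <- time[edges$origin] + delay
    best     <- tapply(arrival, edges$destination, min)
    improved <- best < time[names(best)]
    if (!any(improved)) break
    time[names(best)[improved]] <- best[improved]
  }
  # Unreached nodes and events after max_time are censored.
  observed <- sort(time[is.finite(time) & time <= max_time])
  data.frame(node_name = names(observed), event_time = unname(observed),
             stringsAsFactors = FALSE)
}

cas <- simulate_one(edges, nodes, seed = "a")
cas
```

Because all delays are positive, every node's event time is strictly later than that of the node it was reached from, and the seed node always has time 0.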
Simulate random cascades, for testing and demonstration purposes. No actual diffusion model is underlying these cascades.
simulate_rnd_cascades(n_cascades, n_nodes)
n_cascades |
Number of cascades to generate. |
n_nodes |
Number of nodes in the system. |
A data frame containing (in order of columns) node ids, event time and cascade identifier.
df <- simulate_rnd_cascades(10, n_nodes = 20)
head(df)
Select a subset of cascades from cascade object
subset_cascade(cascade, selection)
cascade |
cascade, object to select from |
selection |
character or integer, vector of cascade_ids to select |
An object of class cascade containing just the selected cascades
data(policies)
cascades <- as_cascade_long(policies, cascade_node_name = 'statenam',
                            event_time = 'adopt_year', cascade_id = 'policy')
cascade_names <- names(cascades$cascade_times)
subset_cascade(cascades, selection = cascade_names[1:10])
Remove all events occurring outside the desired time interval for each cascade in a cascade object.
subset_cascade_time(cascade, start_time, end_time, drop = TRUE)
cascade |
cascade, object to subset. |
start_time |
numeric, start time of the subset. |
end_time |
numeric, end time of the subset. |
drop |
logical, should empty sub-cascades be dropped? |
An object of class cascade, where only events are included that have times start_time <= t < end_time.
data(cascades)
sub_cascades <- subset_cascade_time(cascades, 10, 20, drop = TRUE)
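The interval is half-open: an event exactly at start_time is kept, an event exactly at end_time is not. A minimal base R sketch on hypothetical toy vectors for a single cascade shows the filter:

```r
# Events of one toy cascade (hypothetical values).
nodes <- c("a", "b", "c", "d")
times <- c(5, 10, 15, 20)

start_time <- 10
end_time   <- 20

# Keep events with start_time <= t < end_time.
keep <- times >= start_time & times < end_time
nodes[keep]
times[keep]
```

Node "b" (time 10) is retained while node "d" (time 20) is removed, so consecutive windows such as [10, 20) and [20, 30) never share an event.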
Generates summary statistics for single cascades and across cascades in a collection, contained in a cascades object.
## S3 method for class 'cascade'
summary(object, quiet = FALSE, ...)
object |
object of class cascade to be summarized. |
quiet |
logical, if |
... |
Additional arguments passed to summary. |
Prints cascade summary information to the screen (if quiet = FALSE). '# cascades' is the number of cascades in the object, '# nodes' is the number of nodes in the system (nodes that can theoretically experience an event), '# nodes in cascades' is the number of unique nodes of the system that experienced an event and '# possible edges' is the number of edges that are possible given the cascade data (see count_possible_edges for details).
Additional summaries for each cascade are returned invisibly, including length (length of the cascade as an integer count of how many nodes experienced an event) and n_ties (number of tied event times per cascade).
data(cascades)
summary(cascades)
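The difference between '# nodes' and '# nodes in cascades', and the per-cascade length, can be reproduced with a few lines of base R on a plain list. The toy data are hypothetical and this is not the package's summary code.

```r
# Toy list with the element names of a cascade object (hypothetical data).
toy <- list(
  node_names    = c("a", "b", "c", "d"),
  cascade_nodes = list(c1 = c("a", "c"), c2 = c("c", "b", "a"))
)

n_cascades          <- length(toy$cascade_nodes)
n_nodes             <- length(toy$node_names)
n_nodes_in_cascades <- length(unique(unlist(toy$cascade_nodes)))
cascade_length      <- lengths(toy$cascade_nodes)

n_nodes_in_cascades  # node "d" is in the system but never has an event
cascade_length
```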
Contains output from the original netinf C++ implementation, executed on cascades. For testing purposes.
data(validation)
An object of class data.frame with 6 columns, containing:
Origin of diffusion edge.
Destination node of diffusion edge.
??
Marginal gain from edge.
Median time between events in origin and destination
Mean time between events in origin and destination
Output from netinf example program (https://github.com/snap-stanford/snap/tree/master/examples/netinf).