1. Introduction

Problem Description
 A company intends to select a small set of customers to distribute praises of their trial products to a larger group

Influence maximization
 Goal: Identify a small subset of seed nodes that have the best chance to influence the most number of nodes
 Competitive Influence Maximization (CIM)

Assumption
 Influence is exclusive (Once a node is influenced by one party, it will not be influenced again)
 Each round all parties choose one node and then the influence propagates before the next round starts

STORM (STrategyOriented ReinforcementLearning based influence Maximization) performs
 Data Generation
 the data, which is the experience generated through simulation by applying the current model, will become the feedbacks to refine the model for better performance
 Model Learning
 Data Generation
Difference with Others
 Known strategy → Both know and unknown
 Known or Unknown but available to compete → Train a model to learn strategy
 Unknown → Gametheoretical solution to seek the Nash equilibrium
 Singleroung → Multiround
 Model driven → learningbased, datadrivern
 Not considering different network topology → General to adapt both opponent's strategy and environment setting (e.g. underlying network topology)
2. Problem Statement
Def 1: Competitive Linear Threshold (CLT)
 CLT model is a multiparty diffusion model
 The party who has the highest influence occupied the node
Def 2: MultiRound Competitive Influence Maximization (MRCIM)
 Max its overall relative influence
4. Methodology
 NPhardness of MRCIM → looks for approxmiate solution
 Max the inflence for each round does not guarantee overall max
 Due to the fact that each round are not independent
4.1 Preliminary: Reinforcement Learning
 Learn a policy \(\pi(s)\) to determine which action to take state s (environment)
 How to estimated \(\pi\)?
 Expected Accmulated Reward of a state (V function)
 \( V^\pi(s) = E_\pi\{R_tS_t=s\}=...\)
 Expected Accmulated Reward of a stateaction pair (Q function)
 \( Q^\pi(s, a) = E_\pi\{R_tS_t=s, a_t=a\}=...\)
 Expected Accmulated Reward of a state (V function)
The optimal \(\pi\) can be obtained through Q functinon
\( \pi = \arg \min_{a\in A}Q(s,a)\)
(i.e. For all "a" in A, find the "a" such that min Q(s, a))
4.2 StrategyOriented ReinforcementLearning
Setup
 Env
 Influence propagation process
 Reward
 Delay Reward: The difference of activated nodes between parties at the last round
 After the last round, rewards are propagated to the previous states through Qfunction updating
 Slow but more accurate
 Delay Reward: The difference of activated nodes between parties at the last round
 Action
Choosing certain node to activate too many
 overfit
 Single Party IM strategies
 Namely, which strategy to choose given the current state
 The size can be reduced to strategies chosen
 Chosen Strategies
 subgreedy
 degreefirst
 block
 maxweight
 State
 Represents
 network
 environment status
record the occupation status of all nodes \(3^{V}\), too many
 overfit
 Features Designed
 Number of free nodes
 Sum of degrees of all nodes
 Sum of weight of the edges for which bot h vertices are free
 Max degree among all free nodes
 Max sum of free outedge weight of a node among nodes which are the first player's neighbors
 Second player's
 Max activated nodes of a node for the first player alter two rounds of influence propagation
 Second player's
 The feautres are quantize into
 low
 medium
 high
 Totally, \(3^9\) states
 Represents
Data For Training
 Propagation model is known (e.g. LT in the experiments)
 Strategies served as actions are predefined
In training phase, train the agent against a certain strategy and see how it performs on the given network
These data can be used to learn the value functions
Training Against Opponents
 Opponent Strategy
 Known: Simulate the strategy during training
 Unknown but available during training: Same as above
 Unknown: More General Model in 4.4
Phase
 Phase 1: Training
 The agent update its Q function from the simulation experiences throughout the training rounds
 Update \(\pi\) in the meantime
 Phase 2: Competition
 The agent would not update Qtable
 Generates \(\pi\) according to Qtable
4.3 STORM with Strategy Known
 Training the model compete against the strategy to learn \(\pi\)
 STORMQ
 Update Qfunction following the concept of Qlearning
 QLearning: \(Q(S_t, a_t) = Q(S_t, a_t) + \alpha * (r_{t+1} + \gamma * max_{a}Q(S_{t+1}, a) Q(S_t, a_t))\)
 \(\epsilon\)greedy
 Determine strategies on the current policy derived from Qtable.
 Explore the new directions to avoid local optimum
 Pure Strategy
 The most likely strategy is chosen
 Update Qfunction following the concept of Qlearning
$ Algorithm $
4.4 STORM with Strategy Unknown
Unknown but available to train
 The differece between the known case is that experience cannot be obtained through simulation
 Train against unknown opponent's strategy during competition
 It's feasible because STORMQ only needs to know the seedselection outcoms of the opponent to update the Qtable, not exact strategy it takes
Unknown
 Goal: Create a general model to compete a variety of rational strategies
 Assumption: The oppoent is rational (Wants to max influence and knows its oppoent wants so)
 STORMQQ
 Two STROMQ compete and update Qtabale at the same time
 Using current Qtable during training phase
 Pure Strategy
 Does Not guarantee that equilibrium exists in MRCIM

STORMMM
 Mix Strategy (Samples an action from the distribution of actions in each state)
 In twoplayer zerosum game
 Nash equilibrium is graranteed to exist with miexed strategies
 Use MINMAX theorem to find the equilibrium
 \(Q(s, a, o)\): The reward of first party when using strategy \(a\) against oppoent's strategy \(o\) in state \(s\)
 \(Q_{t+1}(s_t, a_t, o_t) = (1\alpha)Q_t(s_t, a_t, o_t)+\alpha[r_{t+1}+\gamma V(s_{t+1})]\)
 Operations Research

The differece between STROMQQ and STORMMM
STROMQQ  STROMMM 

Max the reward in their own Qtable  Finds equilibrium with one Qtable and determines both side's \(a\) at the same time 
Pure Strategies  Mixed Strategies 
Choose strategy by greedy  Samples from the mixed strategy \(\pi_a\) or \(\pi_o\) 
 Ideally, they should have similar result in twoparty MRCIM. In practice, the result might not due to
 STORMQQ does not guarantee equilibrium
 Although equilibrium exists in STORMMM. It does not guarantee to be found due to lack of training data or bad init or such problems.
Comments
Do you like this article? What do your tink about it? Leave you comment below