I Introduction
Recent advances in sensing, communication, and computation technologies have ushered an era of largescale networked systems, now ubiquitous in smart grids, robotics, defense installations, industrial automation, and cyberphysical systems [1, 2, 3]. In order to accomplish sophisticated tasks, these multiagent systems are expected to sense, learn, and adapt to sequentially arriving measurements while facing timevarying and nonstationary environments. Such high levels of network agility requires close coordination and cooperation among the agents, that are otherwise intermittently connected and resourceconstrained. Towards performing control and resource allocation over such dynamic graphs, it is necessary to develop innetwork optimization algorithms that can be implemented in a distributed fashion [4].
We consider the problem of tracking the minimum of a timevarying cost function that can be written as a sum of several nodespecific smooth cost functions and a nonsmooth regularizer. The framework subsumes a number of relevant problems such as modelpredictive control [5, 6], tracking timevarying parameters [7, 8, 9], path planning [10, 11], realtime magnetic resonance imaging [12], adaptive matrix completion [13], demand scheduling in smart grids [14], and so on. Of particular interest is the distributed setting, where measurements are made locally at each node and the interaction between nodes may only occur over a timevarying directed communication graph. Due to the intermittent connectivity, it is not possible for the nodes to exchange updates at every time slot, and consequently, simple variants of centralized algorithms cannot be directly utilized.
Within the context of static optimization problems, distributed algorithms running over timevarying graphs have been widely studied, with most algorithms relying on the consensus approach [15]. Consensusbased algorithms can be further categorized into those utilizing weighted averaging in the primal domain [16, 17, 18], pushsumbased approaches [19], and the alternating directions method of multipliers (ADMM) method [20]
. The weightedaveragingbased approaches utilize a doubly stochastic matrix for averaging and are amenable to gradienttracking variants that converge at geometric rates; see
[4] and references therein.Fewer distributed algorithms exist for the more challenging timevarying setting where the minimizer drifts over time. Algorithms for tracking timevarying parameters have their roots in adaptive signal processing and control theory, where the steadystate tracking error is of interest [21, 22]. Recently however, online convex optimization has emerged as a useful framework for the analysis of tracking algorithms, especially in adversarial settings [23, 24, 25, 26, 27, 28, 29, 30]. The idea here is to characterize the dynamic regret of an algorithm in terms of the cumulative variations in the problem parameters such as the path length. However, existing online algorithms for tracking timevarying objectives cannot handle timevarying graph topologies where the nodes may get disconnected. A distributed dynamic ADMM algorithm for differentiable cost functions was first proposed in [31] and the steadystate optimality gap was derived for static graphs. On the other hand, an online ADMM algorithm for nondifferentiable problems was proposed in [9] but required centralized implementation. Closer to the current work, the dynamic regret performance of a distributed mirror descent algorithm was obtained in [32] for possibly nondifferentiable functions. However, the underlying graph was required to be static and connected. In contrast, the focus here is on distributed algorithms over dynamic and sporadically connected graphs.
This works puts forth a distributed proximal online gradient descent (OGD) algorithm designed to run over a timevarying network topology and capable of tracking the minimum of a timevarying composite loss function. As the objective may contain nondifferentiable regularizers, we build upon the recently developed machinery of proximal OGD
[33]. Towards realizing a distributed implementation, we utilize the idea of weighted averaging using a doubly stochastic weight matrix, as is customary in distributed algorithms for static problems [4]. However, since the graph topology is timevarying, it is not possible for the nodes to perform a consensus update or even exchange information at each time slot. Instead a multistep consensus algorithm inspired from [34] is developed, where the objective function information is used only intermittently and the remaining time slots are utilized for consensus. It is remarked that the timevarying problem is significantly more challenging than the static version considered in [34], since the objective function continues to evolve even when the consensus steps are being carried out, resulting in the accumulation of additional regret.The proposed distributed proximal OGD algorithm is analyzed as an inexact variant of the proximal OGD algorithm [35]. Subsequently, the gradient and proximal errors arising due to the distributed operation are characterized, allowing us to obtain the required dynamic regret bounds. Interestingly, the distributed operation only results in an additional factor of as compared to the dynamic regret lower bound for the centralized case. The performance of the proposed algorithm is tested for the dynamic sparse recovery problem and compared with that of the centralized proximal OGD and ADMM algorithms.
A large number of LMS and RLSbased algorithm have been proposed to solve the dynamic sparse recovery problem, starting from [36, 37]. An online ADMM approach for timevarying LASSO problem was developed in [38] while a online proximal ADMM algorithm for solving the group LASSO problem was proposed in [39]. While analytical results have been developed, they are largely limited to the case of static parameters. In particular, these works do not generally characterize the steady state tracking performance, though it may be possible to borrow similar bounds from the matrix completion literature; see [13]. A comprehensive survey of RLSbased dynamic sparse recovery methods is provided in [8].
When the sparse parameters follow a linear statespace model, they can be tracked within the sparse Bayesian learning (SBL) formalism [40, 41]. More recently, consensus has been employed to track the sparse parameters in a distributed fashion [42]. Likewise, a robust framework for dynamic sparse recovery has been developed in [43]. Different from the system model considered here, SBLbased methods cannot handle adversarial targets and are generally not amenable to regret analysis.
Adversarial parameters are naturally handled within the dynamic or online convex optimization framework. For instance, the dynamic sparse recovery problem was formulated and solved via a ’running’ ADMM approach in [9]. Likewise, the mirror descent algorithm for solving a similar problem was proposed in [32]. Generalizing these frameworks, we consider the dynamic sparse recovery problem in a distributed setting where the number of observations per sensor are not sufficient to estimate the full parameter. Further, the underlying graph topology is timevarying and collecting the observations at a central location is not straightforward.
In summary, our key contributions include: (a) a novel multistep consensusbased proximal OGD algorithm for timevarying optimization problems over timevarying network topologies (b) performance analysis of the general inexact proximal OGD algorithm and its application to the analysis of the proposed distributed algorithm. The rest of the paper is organized as follows. We start with the problem formulation in Sec. II. The proposed distributed proximal OGD algorithm is presented in Sec. III. A bound on the dynamic regret incurred by the proposed algorithm is obtained in Sec. IV. The empirical performance of the proposed algorithm is tested on the dynamic sparse recovery problem in Sec. V and the conclusions are presented in Sec.VI.
Notation: All the scalars are denoted by regular font lower case letters, vectors by boldface lower case letters, and matrices by upper case boldface letters. Constants such as the total number of time slots and gradient bounds are denoted by regular font capital letters (such as
and ) while problem parameters such as stepsize are denoted by Greek letters (such as ). Finally, sets are denoted by calligraphic font capital letters. The th entry of the matrix is denoted either by or . The Euclidean norm of a vector is denoted by . By default, time and iteration indices (e.g. or ) will be in the subscript, while node or element indices (e.g. or ) will be in the superscript.Ii Problem Formulation
Consider a multiagent network with the set of agents or nodes interacting with each other over intermittent communication links. The network topology at time is represented by a directed graph where denotes the set of directed edges or links present at time . Specifically, agents can transmit to agent only if at time instant . The dynamic graph is arbitrary and may not necessarily be connected for all . However, is still required to be uniformly strongly connected or connected [17]; see Sec. III.
Restricted to exchanging information only over the edges in at time , the agents seek to cooperatively track the optimum of the following dynamic optimization problem
() 
where is a smooth strongly convex function, is a convex but possibly nonsmooth regularizer, and is a closed and convex set with a nonempty interior. For the sake of brevity, henceforth we denote
(1) 
and so that () may also be written as . In the multiagent setting considered here, the function is private to node while is known at all the nodes. It is remarked that the static version of () has found several applications in signal processing [44] and learning problems [45].
Algorithms to solve () are developed and analyzed within the rubric of online convex optimization, where the learning process is modeled as a sequential game between the agents and an adversary [23]. At each time , agent plays an action and in response, receives the functions from the adversary. Over a horizon of times, the goal of the agent is to minimize the cumulative dynamic regret given by [32]:
(2) 
The dynamic regret measures the loss incurred by the agent against that of a timevarying clairvoyant. Observe that the dynamic regret is different from and more stringent than the static regret where the benchmark is not timevarying [23]. The definition in (2) also includes the modification due to the distributed nature of the problem, first suggested in [32]. For instance, the righthand side of (2) includes terms of the form for , and therefore, regret minimization requires the agents to collaborate.
An algorithm is said to be noregret if grows sublinearly with . It is wellknown that the dynamic regret cannot be sublinear unless certain regularity conditions are imposed on the temporal variations of [25]. In the present case, the regret bounds will be developed in terms of the path length, defined as
(3) 
and assumed to be sublinear. For general timevarying problems with gradient feedback, the dynamic regret of any algorithm is at least [28]. Dynamic regret bounds of related algorithms capable of handling nonsmooth functions are summarized in Table I. Closely related to the proposed work, a distributed online mirror descent algorithm is discussed in [32] that is more general and can handle nondifferentiable loss functions. It is remarked that the centralized algorithm proposed in [29] is not included in Table I since it requires the loss function to be selfconcordant.
The present work develops an online and distributed algorithm for solving (), where the agents minimize the regret collaboratively and without a central controller or fusion center. While the static variant of the problem can be readily solved via consensus, such an algorithm is not directly applicable to the timevarying problem at hand. Specifically, while the spread of information is limited due to the timevarying communication graph , the problem parameters continue to change with , regardless. Towards this end, we adopt the multistep consensus idea from [34]. Sublinear regret will be obtained by carefully balancing the additional accuracy obtained from running the consensus step for multiple times against the excess regret accumulated in the meanwhile.
Iii Proposed distributed algorithm
Iiia Motivation
In order to motivate the proposed algorithm, observe that since is strongly convex, the optimality condition for () may be written as
(4) 
where recall that and the proximal operator is defined as:
(5)  
(6) 
and is the stepsize. In a centralized setting, the form of the optimality condition in (4) is suggestive of the proximal OGD algorithm [33] that takes the form
(7) 
for all . Though the algorithm in (7) is provably noregret, it is not usable in distributed settings where the timely evaluation of the average gradient is challenging. Indeed, due to the dynamic and arbitrary nature of the communication graphs , even collecting the gradients from each node is not straightforward.
The consensus algorithm, where the information transmission occurs only over the edges of a given graph , has been widely used for distributed averaging [17]. While the consensus algorithm converges only asymptotically, it is still possible for the nodes to obtain an approximate version of the average in a few iterations. The idea of running the consensus for a few iterations in order to approximately calculate the average gradient was first proposed in [34] for static problems, and will also be utilized here. A key complication that occurs in the dynamic setting is that while agents carry out the averaging via consensus routine, the problem parameters continue to change. Therefore it becomes important to carefully plan the number of consensus steps that will result in a sufficiently high update accuracy without losing track of .
IiiB The Distributed Proximal OGD Algorithm
Building upon the classical distributed proximalgradient algorithm [34], the proposed distributed proximal OGD (DPOGD) algorithm takes up multiple time slots to complete each iteration. The timevarying functions are sampled at times where
(8) 
so the th iteration starting at time takes up time slots where denotes the number of consensus steps at iteration . The two additional time slots are reserved for the gradient update and for the application of the proximal operator. In order to establish the regret bounds, we will generally choose to be a nondecreasing function of . Associate timevarying weights with each edge , while let for all . Let the matrix collect all the edge weights . The full DPOGD algorithm starts with an initial and consists of the following updates
(9a)  
(9b)  
(9c) 
where we let . Note that only one of the three updates steps in (9a)(9c) is carried out depending on the value of . These updates in time index have been effectively summarized in Algorithm 1.
In order to better understand the proposed algorithm, it may be instructive to write down the updates in (9) in terms of the iteration index :
(10a)  
(10b)  
(10c) 
For the sake of brevity, we introduce new (capped) variables and functions whose subscript indicates the iteration index instead of the time index. Specifically, for all , let
(11a)  
(11b)  
(11c) 
Likewise, the optimum at time is specified as . With the iterationindexed notation in (11), the DPOGD updates may simply be written as
(12a)  
(12b)  
(12c) 
where denotes the th entry of the matrix
(13) 
Here, the steps of consensus are all condensed into a single equation (12b). In order to better understand (12b), collect the variables , , and into supervectors , , and respectively. Recursively applying the consensus averaging, it follows that
(14) 
The full algorithm is summarized in Algorithm 2 and the updates at node are shown in Fig. 1. It is remarked that Algorithm 2 is still conceptional since we have left unspecified. An appropriate choice of is necessary to obtain a tight regret bound and a detailed discussion for the same will be provided in Sec. IV.
Iv Performance Analysis
This section develops the regret rate for the proposed algorithm. As already discussed, each iteration involves the gradient update at time , consensus steps, and a proximal update at time . As compared to the existing proximal OGD algorithm, the analysis of DPOGD algorithm is complicated due to the additional regret that is incurred from subsampling the functions at times . We begin with discussing some preliminaries before proceeding to the assumptions and the regret bounds.
Iva Preliminaries
The regret rate of Algorithm 2 will be analyzed using the network averages at each iteration , defined as
(15) 
Towards establishing the regret rates, we begin with casting the proposed algorithm as an inexact proximal gradient algorithm that can be viewed as a generalized version of [33]. Using the network averages defined in (15), it is possible to write (12a) as an inexact gradient update step:
(16)  
(17) 
where recall that and
(18) 
Along similar lines, we write the inexact proximal update as
(19)  
(20) 
where the proximal operator is as defined in [35] which implies that
(21) 
As compared to (12b)(12c), the error incurred in using the inexact proximal operation can be expressed as
(22) 
for all . At this stage, is left unspecified and appropriate bounds will be developed later. Having written the proposed DPOGD algorithm as an inexact proximal OGD variant, the rest of the analysis proceeds as follows: (a) development of bounds for updates in (17)(20) in terms of and ; (b) development of bounds on and ; and finally (c) substitution of these bounds to obtain the required regret bounds. It is remarked that such an approach is flexible and readily extendible to other variants where the sources of gradient and proximal errors may be different; e.g., quantization, asynchrony, or computational errors.
IvB Assumptions
The assumptions required for developing the regret bounds are discussed subsequently. The first three assumptions pertain to the properties of functions and .
Assumption 1.
The functions are smooth, i.e., for all , it holds that
(23) 
for all and .
Assumption 2.
Assumption 3.
The functions are convex, i.e., for all , it holds that
(26) 
It is remarked that Assumptions 13
are standard and applicable to a wide range of problems arising in machine learning, signal processing, and communications. The present analysis depends critically on these assumptions and the required regret bounds only hold when
and the parameters and are bounded. The next two assumptions pertain to the network connectivity and the choice of weights.Assumption 4.
The graph is connected, i.e., there exists some such that the graph
(27) 
is connected for all .
Assumption 5.
The weight matrices have positive entries and satisfy the following properties for each :

Double stochasticity: it holds that ;

Lower bounded nonzero entries: there exists such that for all ;

Nonzero diagonal entries: it holds that for all .
Assumptions 45 imply that the entries of the weight matrix are close to in the following sense [16, Proposition 1(b)]:
(28) 
where defining , we generally have that and . Inequality (28) holds the key to obtaining fast consensus over timevarying graphs and as we shall see later, the regret rate of the algorithm would depend critically on the value of .
For the next assumption, let the total number of time slots required to carry out updates be given by
(29) 
where recall that is the number of consensus steps at the th iteration. Inverting the relationship (29), for any , it holds that . The following assumption is required to ensure that the regret bounds are sublinear.
Assumption 6.
The number of consensus steps is a nondecreasing function of and the number of consensus steps at the last iteration is sublinear in .
An implication of Assumption 6 is that there cannot be too many consensus steps at any iteration . Having stated all the required assumptions, we are now ready to state the main results of the paper.
IvC Regret Bounds
As already discussed, we begin with developing some bounds for the generic inexact proximal gradient method (17)(20). Bounds on the error incurred from using the proposed distributed algorithm will be developed next. The final regret bounds would follow from combining these two results. The section concludes with some discussion on the nature of the bounds for various choices of .
The first lemma bounds the periteration progress of the iterate in terms of its change in distance from the current optimum .
The proof of Lemma 1 is provided in Appendix B and utilizes the strong convexity and smoothness properties of as well as the triangle inequality. The result in Lemma 1 states that the distance between and the current optimal is less than a fraction of the distance between and , but for an error term that arises due to and in the updates steps. It is remarked that the result in (30) subsumes all existing results in [30, 33].
Taking summation over and rearranging, we obtain the following corollary, whose proof is deferred to Appendix B.
The result in Corollary 1 provides an upper bound on the cumulative distance between the average current iterate and optimal over time instances. Next, Lemma 2 develops a simple bound on the iterate norm with the proof provided in Appendix C.
Lemma 2.
The bounds in Lemma 2 are not surprising given that the size of each update step is bounded implying that after steps, none of the iterates can be more than far from the starting point. Recall that the inexact proximal OGD algorithm updates in (17)(20) are really the DPOGD updates in disguise, with specific definitions of and . The next lemma provides a convenient bound on the error term that will subsequently be used to obtain the regret bounds.
The proof of Lemma 3 is provided in Appendix D. It provides a bound on the error sequence generated by the inexact algorithm in terms of a geometricpolynomial sequence. The first term of (34) results from the gradient error while the second term bounds the norm of the proximal error .
We are now ready to convert the results obtained in terms of the iteration index into bounds that depend on time index . Recall that since the th iteration incurs time slots, the last iteration incurs time slots where , as stated in Assumption 6. Also let so that the initial bounds can be developed in terms of and . Specific examples of will subsequently be provided to yield bounds as explicit functions of . As a precursor, consider a simple example when , so that and . Next, the following lemma reconciles the two definitions of the path length.
Lemma 4.
It holds that
(35) 
where is such that .
Proof:
The result follows from the use of triangle inequality:
∎
Note that in order to calculate the dynamic regret in (2), it is necessary to define for all . Towards this end, let for such that . Finally, we provide the main result of the paper in the following Theorem.
Theorem 1.
The proof of Theorem 1 is provided in Appendix E. Here, the regret bound is worse than since the algorithm updates are sporadic resulting in additional factor of . The result in (36) depends on the choice of through and . We now discuss the explicit form of the regret bound for a few choices of .
IvC1 Logarithmically increasing
Consider the case
(37) 
where is left unspecified at this stage. Given , the number of iterations are given by the largest that satisfies the inequality
(38) 
For the sake of brevity, let and likewise so that only the dominant terms may be retained. Ignoring the floor function and using Stirling’s approximation [46], we have that
(39) 
or equivalently,
(40) 
where denotes the Lambert function. It follows that . Also note that
(41)  
(42)  
(43) 
where the last step uses the approximation for large . The overall regret bound thus becomes
(44) 
For the regret to be sublinear, it is necessary but not sufficient that is also sublinear. For instance, if with , it is also required to hold that or equivalently one must choose so that the term dominates the summation. In this case, is not allowed to be arbitrarily close to 1. Instead, it is required that or equivalently, we have . In other words, for a given , it is always possible to choose an appropriate value of the parameter , though the regret bound will only be for sufficiently large .
IvC2 Constant
Taking for all , it is required that
(45)  
(46) 
for and sufficiently large. In this case, and
(47)  
(48) 
yielding the final regret bound
(49) 
In order to write the regret in explicit form, let for some . Then the regret in (49) is sublinear when or . However cannot be arbitrarily small or else the term would become too large. The minimum value of the regret is obtained for the choice of such that
(50)  
(51)  
(52) 
where we have used the result from [47]. Since is large, we make use of the approximation , which yields
Therefore the regret bound can be approximately written as
(53) 
which is sublinear as long as is not too close to 1 and . As in the previous case, the optimal choice of still requires a sufficiently large value of . In summary, the stepsize may be chosen as while may be chosen as either or in Algorithm 2.
Recall that the centralized proximal OGD algorithm achieves a regret of which is also optimal for any online algorithm; see [28]. Remarkably, the dynamic regret of the DPOGD is only worse by a factor, possibly arising out of the distributed operation over an intermittently connected graph.
V Numerical results
The performance of the proposed DPOGD algorithm is tested for the dynamic sparse signal recovery problem where the goal is to estimate a timevarying sparse parameter. Such problems have been widely studied in literature and can be broadly classified into those advocating adaptive filteringbased algorithms, those formulating the problem within the sparse Bayesian learning framework, and finally those utilizing tools from dynamic or online convex optimization.
Va The Dynamic Sparse Recovery Problem
Consider a WSN with sensors connected over a timevarying graph . The parameter of interest is a timevarying sparse signal . At time , sensor makes measurements according to the following model
(54) 
where is the observation matrix and is the noise with unknown statistics. The online learning model detailed in Sec. II is considered and the quantities are revealed in a sequential manner. Observe that given no other information, tracking the original parameter is impossible. Instead, we settle for tracking the Elastic Net estimator of given by
Comments
There are no comments yet.