Furthermore, if \(\mathcal G\) is sufficiently sparse, each node will have a small number of parents; \(X_i \mid \varPi_{X_i}\) will then have a low-dimensional parameter space, making parameter learning computationally efficient.
Computational Complexity
to provide general expressions for the (time) computational complexity of the most common class of score-based structure learning algorithms, greedy search, as a function of the number of variables N, the sample size n, and the number of parameters \(|\varTheta|\);
We will first study the (time) computational complexity of greedy search under the assumptions that are commonly used in the literature (see, for instance, Tsamardinos et al. 2006; Spirtes et al. 2001) for this purpose:
In steps 1 and 2, greedy search computes all the N local distributions for \(\mathcal G_0\). In step 3, each iteration tries all possible arc additions, deletions and reversals. Since there are \(\binom{N}{2}\) possible arcs in a DAG with N nodes, this requires \(O(N^2)\) model comparisons. If we assume \(\mathcal G_0\) is the empty DAG (that is, a DAG with no arcs), hill climbing will gradually add all the arcs in \(\mathcal G_{\mathrm{REF}}\), one in each iteration. Assuming \(\mathcal G_{\mathrm{REF}}\) is sparse, and assuming that arcs are removed or reversed a negligible number of times, the overall computational complexity of hill climbing is then \(O(cN^3)\) model comparisons. Step 4 performs \(t_0\) more iterations and is therefore \(O(t_0 N^2)\); the combined time complexity of steps 3 and 4 is thus \(O(cN^3 + t_0 N^2)\). Each random restart involves changing \(r_1\) arcs, so we can expect it to take \(r_1\) iterations of hill climbing to return to the same maximum, followed by tabu search; and this happens \(r_0\) times. Overall, the restarts add \(O(r_0 (r_1 N^2 + t_0 N^2))\) to the time complexity, resulting in an overall complexity g(N) of
Hence, it is possible to dramatically reduce the computational complexity of greedy search by keeping a cache of the score values of the N local distributions for the current \(\mathcal G_{\max}\)
where \(\varPi_{X_i}^{\max}\) and \(\varPi_{X_i}^{\mathcal G^*}\) are the parents of \(X_i\) in \(\mathcal G_{\max}\) and in the \(\mathcal G^*\) obtained by removing (if present) or adding (if not) \(X_j \rightarrow X_i\) to \(\mathcal G_{\max}\). Only N (for arc additions and deletions) or 2N (for arc reversals) elements of \(\varDelta\) need to be actually computed in each iteration: those corresponding to the variable(s) whose parent sets were changed by the local move that produced the current \(\mathcal G_{\max}\) in the previous iteration. After that, all possible arc additions, deletions and reversals can be evaluated without any further computational cost by adding or subtracting the appropriate \(\varDelta_{ij}\) from the \(B_i\). Arc reversals can be handled as a combination of arc removals and additions (e.g. reversing \(X_i \rightarrow X_j\) is equivalent to removing \(X_i \rightarrow X_j\) and adding \(X_j \rightarrow X_i\)). As a result, the overall computational complexity of greedy search reduces from \(O(cN^3)\) to \(O(cN^2)\). Finally, we briefly note that score equivalence may allow further computational savings because many local moves will produce new \(\mathcal G^*\) that are in the same equivalence class as \(\mathcal G_{\max}\); for those moves necessarily \(\varDelta_{ij} = 0\) (for arc reversals) or \(\varDelta_{ij} = \varDelta_{ji}\) (for adding or removing \(X_i \rightarrow X_j\) and \(X_j \rightarrow X_i\)).
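The caching scheme above can be sketched in a few lines. This is our own minimal illustration, not code from the text: it assumes a decomposable score, and `local_score` is a hypothetical stand-in (chosen so that it happens to prefer exactly one parent) rather than a real scoring function such as BIC.

```python
# Sketch of the score cache for greedy search, assuming a decomposable
# score: only the local score B_i of the node whose parent set changes
# needs recomputing; all other cached B_i are reused.
def local_score(node, parents):
    # hypothetical local score: highest when the node has one parent
    return -(len(parents) - 1) ** 2

def delta_add(node, parents, new_parent):
    """Score difference Delta_ij for adding new_parent -> node."""
    return local_score(node, parents | {new_parent}) - local_score(node, parents)

# current graph: parent sets, plus the cached local scores B_i
parents = {"A": set(), "B": {"A"}, "C": set()}
cache = {x: local_score(x, ps) for x, ps in parents.items()}

delta = delta_add("C", parents["C"], "A")   # evaluate the move A -> C in O(1) extra work
if delta > 0:                               # the move improves the score
    parents["C"].add("A")
    cache["C"] += delta                     # only B_C changes; B_A and B_B are untouched
```

An arc reversal would be evaluated as a removal plus an addition, exactly as the text describes, by combining two such deltas.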
If n is large, or if \(\varTheta_{X_i}\) is markedly different for different \(X_i\), different local distributions will take different times to learn, violating the O(1) assumption from the previous section. In other words, if we denote the computational complexity of learning the local distribution of \(X_i\) as \(O(f_{\varPi_{X_i}}(X_i))\), we find below that \(O(f_{\varPi_{X_i}}(X_i)) \ne O(1)\).
which can be solved efficiently by backward substitution since \(\mathbf R\) is upper-triangular. This is the de facto standard approach for fitting linear regression models because it is numerically stable even in the presence of correlated \(\varPi_{X_i}\) (see Seber 2008, for details). Afterwards, we can compute the fitted values \(\hat{x}_i = \varPi_{X_i}\hat{\boldsymbol\beta}_{X_i}\) and the residuals \(x_i - \hat{x}_i\) to estimate \(\hat\sigma^2_{X_i} \propto (x_i - \hat{x}_i)^T (x_i - \hat{x}_i)\). The overall computational complexity is
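The QR route just described can be sketched as follows. This is an illustrative example under assumed data (a synthetic parent matrix with two continuous parents plus an intercept); the backward-substitution loop is written out explicitly to mirror the text, although in practice a library triangular solver would be used.

```python
import numpy as np

# Fit X_i on its parents via QR: solve R beta = Q^T x_i by backward
# substitution, then compute fitted values, residuals and the variance.
rng = np.random.default_rng(0)
n = 200
P = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # intercept + 2 parents
x_i = P @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.1, size=n)

Q, R = np.linalg.qr(P)                     # thin QR of the n x (d+1) parent matrix
z = Q.T @ x_i
beta = np.zeros(R.shape[1])
for j in range(R.shape[1] - 1, -1, -1):    # backward substitution (R is upper-triangular)
    beta[j] = (z[j] - R[j, j + 1:] @ beta[j + 1:]) / R[j, j]

fitted = P @ beta                          # \hat{x}_i
resid = x_i - fitted                       # residuals x_i - \hat{x}_i
sigma2 = resid @ resid / (n - P.shape[1])  # residual variance estimate
```

The dominant cost is the QR factorisation itself, which is what drives the complexity expression that follows.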
As for CLGBNs, the local distributions of discrete nodes are estimated in the same way as they would be in a discrete BN. For Gaussian nodes, a regression of \(X_i\) against the continuous parents \(\varGamma_{X_i}\) is fitted from the \(n_{\delta_{X_i}}\) observations corresponding to each configuration of the discrete parents \(\varDelta_{X_i}\). Hence, the overall computational complexity is
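The per-configuration fitting for a conditional Gaussian node can be sketched as below. This is our own illustration with synthetic data (one binary discrete parent, one continuous parent); the variable names are ours, not from the text.

```python
import numpy as np

# CLGBN local distribution for a Gaussian node: one regression on the
# continuous parents per configuration of the discrete parents, fitted
# from the n_delta observations falling in that configuration.
rng = np.random.default_rng(1)
n = 300
delta = rng.integers(0, 2, size=n)     # one binary discrete parent
g = rng.normal(size=n)                 # one continuous parent
x_i = np.where(delta == 0, 1.0 + 2.0 * g, -1.0 + 0.5 * g) \
      + rng.normal(scale=0.1, size=n)

params = {}
for config in np.unique(delta):
    mask = delta == config                           # the n_delta observations
    P = np.column_stack([np.ones(mask.sum()), g[mask]])
    beta, *_ = np.linalg.lstsq(P, x_i[mask], rcond=None)
    params[int(config)] = beta                       # separate coefficients per config
```

Each configuration is fitted independently, which is why the complexity sums the regression cost over the discrete parent configurations.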
The candidate parents in the \((d_{X_i} + 1)\)th pass are evaluated but not included in \(\mathcal G\), since no further parents are accepted for a node after its parent set \(\varPi_{X_i}\) is complete. If we drop the assumption from Sect. 2 that each term in the expression above is O(1), and we substitute it with the computational complexity expressions we derived above in this section, then we can write
where \(O(f_{jk}(X_i)) = O(f_{\varPi_{X_i}^{(j-1)} \cup X_k}(X_i))\) is the computational complexity of learning the local distribution of \(X_i\) conditional on the \(j - 1\) parents \(\varPi_{X_i}^{(j-1)}\) currently in \(\mathcal G\) and a new candidate parent \(X_k\).
so the computational complexity is quadratic in N. Note that this is a stronger sparsity assumption than \(\sum_{i=1}^{N} d_{X_i} = cN\), because it bounds the individual \(d_{X_i}\) instead of their sum; and it is commonly used to make challenging learning problems feasible (e.g. Cooper and Herskovits 1992; Friedman and Koller 2003). If, on the other hand, \(\mathcal G\) is dense and \(d_{X_i} = O(N)\), then \(c = O(N)\)
and \(O(g(N, \mathbf d))\) is more than exponential in N. In between these two extremes, the distribution of the \(d_{X_i}\) determines the actual computational complexity of greedy search for specific types of structures. For instance, if \(\mathcal G\) is a scale-free DAG (Bollobás et al. 2003), the in-degree of most nodes will be small and we can expect a computational complexity closer to quadratic than exponential, provided that the probability of large in-degrees decays quickly enough compared to N.
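The claim about scale-free DAGs can be illustrated with a small simulation. This sketch is ours, not from the text: it grows a DAG by a simple preferential-attachment rule (in the spirit of, but not identical to, the model of Bollobás et al. 2003) and checks that most nodes end up with a small in-degree.

```python
import random

# Preferential attachment: each new node sends one arc to an existing
# node chosen proportionally to popularity. Arcs always point from
# newer to older nodes, so the graph is a DAG by construction.
random.seed(42)
N = 1000
indegree = [0] * N
targets = [0]                       # multiset: nodes weighted by popularity
for node in range(1, N):
    target = random.choice(targets)  # preferential choice of the arc's head
    indegree[target] += 1
    targets.extend([target, node])   # reinforce the target; add the new node

# fraction of nodes with a small parent set (in-degree <= 3)
small_fraction = sum(1 for d in indegree if d <= 3) / N
```

A heavy majority of nodes keep a small in-degree while a few hubs accumulate many parents, which is what keeps the expected complexity closer to the quadratic regime.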
Deriving the computational complexity for CLGBNs is more complicated because of the heterogeneous nature of the nodes. If we consider the leading term of (5) for a BN with \(M\)
In Sect. 3, we have shown that the computational complexity of greedy search scales linearly in n, so greedy search is efficient in the sample size and suitable for learning BNs from big data. However, we have also shown that different distributional assumptions on \(\mathbf X\) and on the \(d_{X_i}\) lead to different complexity estimates for various types of BNs. We will now build on these results to suggest two possible improvements to speed up greedy search.
Firstly, we suggest that estimating local distributions with few parents can be made more efficient; if we assume that \(\mathcal G\) is sparse, those make up the majority of the local distributions learned by greedy search and their estimation can potentially dominate the overall computational cost of Algorithm 1. As we can see from the summations in (6), the overall number of learned local distributions with j parents is
and it suggests that learning low-order local distributions in this way can be markedly faster, thus driving down the overall computational complexity of greedy search without any change in its behaviour. We also find that issues with singularities and numeric stability, which are among the reasons to use the QR decomposition to estimate the regression coefficients, are easy to diagnose using the variances and covariances of \((X_i, \varPi_{X_i})\); and they can be resolved without increasing computational complexity again.
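For the simplest case of a single parent, the closed-form route can be sketched as follows. This is an illustrative example under assumed synthetic data: the slope is the ratio of a covariance to a variance, so no QR decomposition is needed, and a near-zero parent variance directly flags the singularities mentioned above.

```python
import numpy as np

# Closed-form estimates for a local distribution with one parent:
# slope = cov(x_i, p) / var(p), intercept from the sample means.
rng = np.random.default_rng(2)
n = 500
p = rng.normal(size=n)                                  # the single parent
x_i = 0.5 + 1.5 * p + rng.normal(scale=0.2, size=n)

var_p = np.var(p, ddof=1)
if var_p < 1e-12:
    raise ValueError("singular parent: constant column")  # diagnosed from the variance
cov_xp = np.cov(x_i, p, ddof=1)[0, 1]
beta1 = cov_xp / var_p                  # slope from the covariance
beta0 = x_i.mean() - beta1 * p.mean()   # intercept from the means
```

The two moments can be computed in a single pass over the data, which is what makes this markedly cheaper than a QR fit for low-order local distributions.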
As for CLGBNs, similar improvements in speed are possible for continuous nodes. Firstly, if a continuous \(X_i\) has no discrete parents (\(\varDelta_{X_i} = \varnothing\)) then the computational complexity of learning its local distribution using QR is again given by (4) as we noted in Sect. 3.1.3; and we are in the same setting we just described for GBNs. Secondly, if \(X_i\) has discrete parents (\(D_{X_i} > 0\)) and j continuous parents (\(G_{X_i} = j\)), the closed-form expressions above can be computed for all the configurations of the discrete parents in
Interestingly, we note that (10) does not depend on \(D_{X_i}\), unlike (5); the computational complexity of learning local distributions with \(G_{X_i} \leqslant 2\) does not become exponential even if the number of discrete parents is not bounded.
that is, we learn the local distributions on \(\mathcal D^{\mathrm{train}}\) and we estimate the probability of \(\mathcal D^{\mathrm{test}}\). As is the case for many other models (e.g., deep neural networks; Goodfellow et al. 2016), prediction is computationally much cheaper than learning because it does not involve solving an optimisation problem. In the case of BNs, computing (12) is:
Hence, by learning local distributions only on \(\mathcal D^{\mathrm{train}}\), we improve the speed of structure learning because the per-observation cost of prediction is lower than that of learning; and \(\mathcal D^{\mathrm{train}}\) will still be large enough to obtain good estimates of the parameters \(\varTheta_{X_i}\). Clearly, the magnitude of the speed-up will be determined by the proportion of \(\mathcal D\) used as \(\mathcal D^{\mathrm{test}}\). Further improvements are possible by using the closed-form results from Sect. 4.1 to reduce the complexity of learning local distributions on \(\mathcal D^{\mathrm{train}}\), combining the effect of all the optimisations proposed in this section.
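The train/test scheme can be sketched for a single local distribution. This is our own minimal illustration, not the text's implementation: a univariate Gaussian node keeps the example small, the split proportion is arbitrary, and all function names are ours.

```python
import math

# Prediction-based scoring: fit parameters on D^train only, then score
# the local distribution by the log-likelihood of D^test.
def fit_gaussian(xs):
    """Learn (mean, variance) on the training fold."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / (n - 1)
    return mu, var

def loglik(xs, mu, var):
    # cheap per-observation cost: density evaluation, no optimisation
    return sum(-0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)
               for x in xs)

data = [0.1, -0.2, 0.3, 0.05, -0.1, 0.2, 0.15, -0.05, 0.0, 0.25]
train, test = data[:7], data[7:]     # e.g. a 70/30 split of D
mu, var = fit_gaussian(train)        # learned on D^train only
score = loglik(test, mu, var)        # predictive score on D^test
```

Since each test observation only requires evaluating a density, the per-observation cost of scoring is much lower than that of fitting, which is the source of the speed-up described above.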