This paper presents a two-level parallel implementation of a hierarchical symbolic analysis algorithm (SCAPP) on a 128 node nCUBE hypercube class parallel processor. The algorithm operates on a partitioned circuit and subdivides the processing into an outer-parallelism level and an inner-parallelism level. The algorithm can collapse to only one of the two levels based on program control and availability of processors. Each subcircuit is allocated a subcube of size one or larger and then the matrix operations are parallelised on that subcube. The algorithm is applicable to any general message passing multi-processor architecture.
VLSI designers are continuously seeking faster and more efficient circuit simulation techniques for verification of both, analog and digital, integrated circuits. The current state-of-the-art in circuit simulation falls short in attempting to satisfy all the needs of today's VLSI designs. Circuit size and long simulation times have a large contribution to these shortfalls. Therefore, a major concentration of efforts now-a-days is on the reduction of overall simulation times. One of the most promising options for achieving this goal is the design of parallel algorithms that exploit parallelism in the various circuit simulation algorithms [1,2,3,4].
Symbolic circuit analysis is a simulation technique which attempts to obtain expressions for circuit variables where one or more components can be represented by symbols. Most symbolic analysis methods address circuit operation in the frequency domain [5,6,7,8,9,10,11,12,13]. However, some researchers have investigated symbolic analysis in the time domain [14, 15]. All of these methods use one of two approaches to dealing with the symbolic expressions: 1) A single expression approach where the entire function is represented by one numerator over denominator equation [16,17]. 2) A sequence of expressions approach where the transfer function is an upwardly dependent hierarchical set of equations as illustrated in example 1 (section 3) [5,6].
In this paper, a parallel implementation of the symbolic circuit simulator SCAPP (Symbolic Circuit Analysis Program with Partitioning) is presented [5,18]. SCAPP generates the frequency domain network functions for large-scale circuits. The algorithm utilises the concepts of hierarchical partitioning and hierarchical sequence of expressions to obtain the symbolic expressions for circuit variables. The number of expressions generated by SCAPP is bounded by a worst case quadratic growth as opposed to the exponential growth exhibited by other simulators introduced in past [19,20,21,22,23,24]. The basic serial algorithm was introduced in [5,18], and is summarised in Figure 1
The SCAPP algorithm consists of three major steps: Hierarchical partitioning, terminal block analysis and middle block analysis.
where Yn is the nodal admittance matrix (a function of the frequency variable s) that, with B, form the KCL equations for the circuit, I is the vector of branch current variables, E is the independent voltage source values, and C and D correspond to the branch relationship equations whose currents are in I. All type of dependent power sources are allowed in the analysis. The RMNA matrix is then obtained as the final representation of the block. The RMNA equations are
where Ie is the vector of external current variables, and Ve is the vector of external node variables. The ith entry of the Je vector represents the currents entering the n-terminal block through the ith terminal, or in other words entering the ith node from another block or a power source (internal or external). Ee represents the contribution of both, external and internal, independent voltage sources to this block. The process of producing the RMNA matrix form the MNA matrix is done by successive elimination (suppression) of the rows and columns corresponding to the internal variables of the block. To reduce internal variable i (ith row and column), the following equation is used (ARhere represents the reduced MNA matrix and will be the RMNA matrix after the successive application of the equation to reduce all internal variables)
where aii is in the ith row and the ith column of AR, Acr is AR with the ith row and ith column removed, and Ri and Ci are the ith row and column, respectively, with aii removed. By using the equation to suppress all internal variables, the resulting matrix is the RMNA matrix which is a compact representation of the subcircuit in terms of only its external variables.
A detailed description of the entire algorithm can be found in [5,17].
Traditionally, serial algorithms have been independent of the architecture of the machine on which they are run. This is not the case for parallel algorithms. Finding a match between a parallel algorithm and hardware architecture is crucial to the success of parallelisation. Keeping this in mind, we attempt to identify parallelism as well as the best architecture for implementation. There are two major areas where the algorithm can be parallelised:
Both inner and outer parallelisms are coarse grained tasks requiring very little inter-processor communication. This is well suited for the message passing parallel processors. Considering these facts and machine accessibility criteria, the nCUBE parallel processor with 128 nodes was chosen to be the parallel machine for implementing the algorithms. A description of the machine follows.
The nCUBE-2 parallel processor [27] is a member of Hypercube Multiple Instruction Multiple Data (MIMD) parallel computers consisting of up to 8192 nodes connected by a hypercube network. Each node of the machine is capable of delivering 2 MIPS and is equipped with 16 Mbytes of memory. In addition to large processing power, large I/O bandwidth is also provided. The hypercube communication network and dedicated hardware reduces communication overhead by providing log order broadcast operations. A typical broadcast operation of nCUBE is shown in Figure 3 which illustrates the steps required to broadcast a message to 8 processors.
A Sun workstation serves as the front end host processor (Figure 4). The host interface supports a UNIX like operating system, Axis. Vertex is the operating system running on the node processors. Synchronisation between node tasks is achieved by message passing mechanisms. The software support includes a set of library routines which can be called from C or FORTRAN providing mechanisms for allocating processors, loading tasks on processors and message passing between processors.
One of the most frequently used strategy for designing parallel algorithms is the divide and conquer strategy. In this method, the problem is broken down so as to fit nicely in the multiple processor model with minimum boundary conditions and good load balance. The analysis is carried out on the parts of the problem independent of each other. Any synchronisation between the tasks is achieved by message passing. Finally a recombination step is performed to take care of the boundary conditions. This model of computation matches the SCAPP algorithm. The following subsections illustrate how this partition-analyse-recombine strategy is adopted in this algorithm.
3.1 Partitioning
The solution to the problem of finding an optimal partition of a network is NP- complete [28]. Therefore, it is an impractical task to obtain an optimal solution for large networks. The problem is compounded by the fact that a multiple processor machine with p processors ideally requires p partitions in order to maximise efficiency and minimise computational time. Several heuristic partitioning algorithms that produce near optimal solutions in acceptable time have been discussed in the literature [29,30,31,32]. Focusing on hypercube class of processors where p is always a power of two, a partitioning scheme that produces a power of two equal sized partitions can be considered a near-optimal solution. The partitioning algorithm described in [25] is found to be suitable for this purpose. In order to keep a good load balance, the size of partitions is chosen to be a dominant partitioning criteria. Size of partitions is measured in terms of number of branches in a partition.
An obvious way of parallelising the partitioner is to divide the network into two using one processor, then divide these partitions into four using two processors and so on till P partitions are obtained. Although the scheme appears to be simple and reducing time complexity by a log factor, this is not the case. To see this fact, time complexity of both serial and parallel algorithms are obtained. The serial binary partitioning algorithm has a worst case time complexity of O(n2) where n is the problem size (For most practical circuits, the complexity of the partitioner is found to be linear [25]). The results obtained in this section hold good for this case too). The partitioner divides the problem into two problems of size n/2 each. The expressions for serial time (Ts) and parallel time (Tp) are given in equations (4) and (5), respectively.
From above expressions, it can be seen that for large number of partitions, both serial and parallel algorithms approach the same time complexity. The reason for this is the low utilisation of processors. This analysis is valid for any binary partitioning algorithm. The point is that trying to convert a serial algorithm directly into parallel algorithm need not necessarily result in large speedups. Another important point to consider here is the fact that most circuits that will benefit from the parallelisation are already pre-partitioned by design. This is because (as will be seen in the experimental results) in order to achieve practical parallelisation speedups, the size of each partition has to be large enough to prevent processor communication time from dominating the overall analysis time and therefore only large circuits will be analysed. These large circuits will be, by design, hierarchically organised (a sound and basic design practice), and hence large circuits are always pre-partitioned. Extra partitioning can always be performed on the subcircuits in order to achieve a better balance of the size of the partitions to keep all the processors busy. To eliminate the time to communicate partitions to relevant processors from partitioning processor, it was decided that each processor perform partitioning on its own and keep the relevant partitions.
3.2 Terminal Block Analysis
Once partitioning has been performed, the next step is to schedule terminal blocks to processors and reduce them in parallel. The scheme proposed here can allocate any number of partitions to any number of processors but utilisation can be maximised if number of partitions is a power of two. In order to describe the scheme, let us define p as number of processors available and P as the number of partitions. There can be three cases: p>P, p=P and p<P. The partitioner can be constrained to produce one of the first two cases. But the last case might occur for user defined partitions.
Once scheduling of terminal blocks to processors is completed, the analysis begins with construction the MNA matrix for each terminal block. Two schemes for reducing the MNA matrix have been implemented. In the first case, there is only one processor performing reduction of the entire MNA matrix The algorithm for this scheme is exactly same as the serial algorithm used in SCAPP. The second scheme is followed in the case where a subcube with two or more processors is allocated for each terminal block analysis. This algorithm is described in following sections.
The parallel implementation of reduction of MNA matrix on a subcube consists of a controller node and worker nodes. The controller node is responsible for interfacing with the partitioning module and recombination module. This node generates a compact representation of the MNA matrix for the subcircuit assigned to the subcube. This matrix is in a form suitable for minimising communication overhead and termed as IMNA matrix. The controller node determines order of rows to be reduced and the columns to be used to reduces the rows. After completion of reduction of a row, this node writes the reduction equations to permanent storage. Figure 5 lists the steps of controller node.
The worker nodes are actually responsible for reducing the IMNA matrix. The rows of IMNA matrix are uniformly divided between these workers. The row being reduced and the column used to reduce the row are termed current row and current columns respectively. Controller node broadcasts the current row and column numbers to all workers. On receiving the row number, the worker node holding the row broadcasts the row contents to all other workers. Once current row has been received by the workers, reduction proceeds without any further communication. The reduction equations are sent back to the controller in a compact form. The controller node is responsible for decoding the compact equations and for writing them to the permanent storage unit. Figure 6 lists the steps performed by the workers.
Example 1:Consider the ladder network in Figure 7. The Symbolic MNA Matrix for the circuit is shown in Figure 8. The Initial IMNA Matrix and assignment of rows to processors is shown in Table 1. The reduction list is nodes {3,4,5}. These also refer to the rows to be suppressed. The initial symbolic sequence of expressions is shown in Table 2, and Table 3 shows the progression of the parallel algorithm.
3.3 Recombination
This step involves combining RMNA matrices for different terminal blocks. A few observations can be made here. 1) Generally only a small fraction of the total network variables will be required by the analysis. 2) The goal of partitioning is to minimise number of tearing nodes. 3) Sending short messages is expensive for hypercubes because of setup overheads. 4) Hierarchical recombination results in low utilisation. These observations suggest that timing requirements for middle block analysis are small compared to other timings and it is not critical to perform middle block analysis in parallel. Experimental results confirm these observations. So it was decided to perform the middle block analysis on a single processor. All the nodes having a RMNA matrix send the matrix to the controller node in the form of single message. Controller node analyses these matrices and produces equations to reduce the RMNA matrices.
The algorithm consists of serial portions as well as parallel portions. Speedup analysis is performed for the parallel sections. The actual execution time of the algorithm is equal to sum of serial time and longest parallel time. Several assumptions are made in computing theoretical speedup of the algorithm. They are:
4.1 Speedup For Parallel Terminal Block Analysis
Figure 9 gives a scheduling diagram for serial and parallel terminal block analysis algorithms. The serial computation time of the algorithm is of order O (n2). This results in a maximum number of nodes in each processor equal to n / p. The time complexity of the analysis on a p processor machine is O (n2 / p). The speedup is on the order of O(p) (Serial time / Parallel time) = n2 / n2/p)
4.2 Speedup Due to Partitioning
Let n, p and P denote circuit size, number of processors available and number of partitions, respectively. The serial communication time is the time required to perform P terminal block analyses and is on the order of
Speedup is computed for three cases: P=p P<p and P>p.
Therefore, speedup for this case is
The main point is that if we can partition the network into power of two partitions with equal size, linear speedup can be obtained on Hypercube machines, no matter what the cube size is. However, as the number of processors increase, the ratio of partitioning time to analysis time starts growing and at some point becomes a significant portion of the overall analysis time. Similarly the broadcast time for large cubes is considerably larger and the assumption that communication time is small becomes invalid. These observations are discussed in detail in next section.
The parallel implementation of SCAPP was run on a set of test circuits. The following examples illustrate the performance and highlights the class of circuits for which the parallelisation is most suitable.
Timing measurements were performed on the controller node (node 0, Figure 4). The results obtained here are worst case times because the controller waits for termination the algorithm on all nodes before recording the timing information. The time recorded is the clock time as opposed to processor CPU time and hence it includes most of the overhead like buffering and waiting for messages.
It can be seen from the results in Table 4 and Figures 13 and 14, that the best speedup and efficiency is obtained when the number of partitions is equal to the number of processors available. This is due to the fact that the different terminal block analyses are independent of each other and, therefore, require no communication between processors during the analysis of the individual partitions (a property of the symbolic algorithm).
The band-pass filter (Figure 10) does not show considerable speedup. This is because this circuit is too small to be analysed in parallel. The communication and set up overhead is very high compared to analysis time. The same is true for the amplifier circuit (Figure 12) because the analysis time, a function of the interconnectivity of circuit components, is small.
The ladder and mesh circuits (Figures 7 and 11, respectively) show good speedups. This is due to the size of the networks and the dense interconnection pattern. For the ladder circuit the best figure is a speedup of 7.03 with 8 processors. For the mesh circuit, the best figure is a speedup of 9.87 with 16 processors (although 32 processors give a slightly higher speedup, 10.32, the cost is 16 extra processors which makes the 16 processor case more attractive).
The implementation of SCAPP on an nCUBE parallel processor was successful. Although the results shown in Table 4 cannot be used as a complete set of information, they are a good representation of expected results for most circuits. The results show the feasibility of performing symbolic analysis on multi-processor machines. The following comments summarise the conclusions:
The above work was partially supported by a grant from the Engineering Research Institute at Iowa State University, Ames, Iowa, USA. The authors wish to thank Ames Laboratory (a Department of Energy research center) for the use of their 128 node nCUBE machine. The authors also wish to thank Dr. Gurpur Prabhu, Dr. James Davis and Dr. John Gustafson for their consultations.
| [1] | P. Cox, R. Burch, E. Dale, P. Yang, "Direct circuit simulation algorithms for parallel processing," IEEE Transactions on Circuits and systems v10 n6 Jun 1991 p 714-725. |
| [2] | G. Hung, Y. Wen , K. Gallivan, R. Saleh, "Parallel circuit simulation using hierarchical relaxation'" 27th ACM/IEEE Design Automation Conference 1990 Jun pp. 24-28. |
| [3] | J. Trotter, Agrawal Prathima, "Circuit Simulation Algorithms on a distributer memory multiprocessor system," ICCAD-1990 p 438-441. |
| [4] | I. Milovanovic, Z. Milovanovic, I. Emina, "optimal algorithm for Gaussian elimination of band matrices on an MIMD computer Parallel Computing," V 15 N 1-3 Sep 1990 p 133-145. |
| [5] | M. M. Hassoun, Symbolic Analysis of Large-Scale Networks. PhD Thesis,School of EE, Purdue University, West Lafayette, IN, August 1988. |
| [6] | Starzyk, J. A., and Konczykowska, A. , "Flowgraph Analysis of Large Electronic Networks," IEEE Trans. on CAS, Vol 33, pp. 302-315, Mar 1986. |
| [7] | S-M Chang, J. F. MacKay, and JG. M. Wierzba, "Matrix reduction and Numerical approximation during computation techniques for symbolic analog circuit analysis," Proceedings of the IEEE ISCAS, San Diego, pp. 1153-1156, May 1992. |
| [8] | F. V. Fernandez, A. Rodriguez-Vazquez, and J. L. Huertas, "An Advanced Symbolic Analyzer for the Automatic Generation of Analog Circuit Design Equations," Proceedings of the IEEE ISCAS, Singapore, pp. 810-813, June 1991. |
| [9] | G. Gielen, H. Walscharts, and W. Sansen, "ISSAC: A Symbolic Simulator for Analog Integrated Circuits," IEEE Journal of Solid-State Circuits, Vol. SC-24, pp. 1587-1597, December 1989. |
| [10] | M. Hassoun, "Hierarchical Symbolic Analysis of Large-Scale Systems Using A Mason's Signal Flow Graph Model," Proceedings of the IEEE International Symposium on Circuits and Systems, Singapore,. |
| [11] | M. Sagawa and H. Kitazawa, "Symbolic Network Analysis of Linear Networks Using Parameter Extraction Proc.," Tran of IECE of Japan, Vol.E 60, No 8, Abs, p 414 1980. |
| [12] | S. Seda, M. Degrauwe and W. Fichtner, " Lazy-Expansion Symbolic Expression Approximation in SYNAP," 1992 International Conference on Computer-Aided Design, Santa Clara, CA, pp. 310-317, 1992. |
| [13] | R. Sommer, "EASY- An Experimental Analog Design System Framework," International Workshop on Sumbolic Methods and Applications to Circuit Design, Paris, Oct. 1991. |
| [14 ] | M. Hassoun, B. Alspaugh, S. Burns, "A State-Variable Approach to Symbolic Circuit Simulation in The Time Domain," Proceedings of the IEEE International Symposium on Circuits and Systems, San Diego, CA, May 1992, pp. 682-685. |
| [15] | A. Liberatore, S. Manetti, M. Piccirilli, "A Symbolic Approach to The Time-Domain Analysis of Nonlinear or Switched Networks," Inte.Workshop on Symbolic Methods and Appl. to Ckt Design, Oct. 1991. |
| [16] | G. Gielen, and W. Sansen, Symbolic Analysis for Automated Design of Analog Integrated Circuits. Kluwer Academic, MA, 1991. |
| [17] | P. M. Lin, Symbolic Network Analysis. Elsevier Science, Amsterdam, 1991. |
| [18] | M.M. Hassoun and P.M. Lin, "A New Network Approach to Symbolic Simulation of Large-Scale Networks," Proc. of the 89 IEEE ISCAS, May 89. |
| [19] | G.E. Alderson and P.M. Lin, "Integrating Topological and Numerical Methods for Semi-Symbolic Network Analysis," Allerton Conference. On Circuit and System Theory Proc, Vol 8, pp 646-654, 1970 |
| [20] | J.K. Fidler and J.I Sewell, "Symbolic Analysis for Computer-Aided Circuit Design - The Interpolative Approach," IEEE Trans. on Circuit Theory, Vol. CT-20, No. 6, pp. 738-741, November 1973. |
| [21] | C. Pottle, CORNAP User Manual, School of Electrical Engineering, Cornell University, Ithaca, NY, 1968. |
| [22] | L. P. McNamee and H. Potash, "A User's and Programmer's Manual for NASAP," University of California at Los Angeles, Rep. 63-38, Aug. 1968. |
| [23] | P.M. Lin and G.E. Alderson, "SNAP: Symbolic-network-analysis Program," PROC IEE Vol. 119, No.2, p 160, February 1972. |
| [24] | K. Singhal and J. Vlach, "Symbolic Analysis of Analog and Digital Circuits," IEEE Tran. on Circuits and Systems, Vol CAS-24, pp.598-609, Nov. 1977. |
| [25] | M. Hassoun, P. M. Lin, "An Efficient Partitioning Algorithm for Large-Scale Circuits," Proc. of the IEEE ISCAS, New Orleans, May 90, pp. 2405-2408. |
| [26] | C. Ho, A. E. Ruehli, and Brennan, "The Modified Nodal Approach to Network Analysis," IEEE Trans. on Circuits and Systems, Vol. CAS-25, pp 504-509, June 1975 |
| [27] | nCUBE Multi-Processor System reference manual.. |
| [28] | A.V.Aho, et. al., The Design and Analysis of Computer Algorithms. Addison-Wesly,1974. |
| [29] | B.W. Kernighan, and S. Lin, "An Efficient Hueristic Procedure for Partitioning Graphs," Bell Systems Technical Journal, 1970, 49, pp. 291-307. |
| [30] | C.S. Park, and S.B. Park, "A Network Partitioning Algorithm Using the Concept of Connection Index," Proc. Int. Symp. on Circuits and Systems, Rome, 1982, pp.985-987. |
| [31] | J. Rutkowski, "Heuristic Network Partitioning Algorithm Using the Concept of Loop Index," IEE Proceedings, Vol. 131, pp. 203-208, Oct 1984. |
| [32] | A. Sangiovanni-Vincentelli, L. K. Chen, and L. O. Chua, "An efficient heuristic cluster algorithm for tearing large-Scale Networks," IEEE Tran. on Circuits and Systems, Vol CAS-24, pp. 709-717, Dec. 1977. |
| Table 1 Distribution of IMNA Matrix of example |
||
|---|---|---|
| Row# | Processor# | Column Non-zero Entries |
|
2 |
1 |
2 3 |
|
3 |
2 |
2 3 4 |
|
4 |
3 |
3 4 5 |
|
5 |
1 |
4 5 6 |
|
6 |
2 |
5 6 7 |
|
7 |
3 |
6 7 |
|
Table 2 |
|
|
Initial Assignments |
Matrix Entries |
|
Y1 = 1 / R1 |
R2C2 = Y1 |
|
Y2 = s*C2 |
R2C3 = - Y1 |
|
Y3 = 1 / R3 |
R3C2 = - Y1 |
|
Y4 = s*C4 |
R3C3 = Y1 + Y2 + Y3 |
|
Y5 = 1 / R5 |
R3C4 = - Y3 |
|
Y6 = s*C6 |
R4C3 = - Y3 |
|
Y7 = 1 / R7 |
R4C4 = Y3 + Y4 + Y5 |
|
Y8 = s*C8 |
R4C5 = - Y5 |
|
Y9 = 1 / R9 |
R5C4 = - Y5 |
|
Y10 = s*C10 |
R5C5 = Y5 + Y6 + Y7 |
|
R5C6 = - Y7 |
|
|
R6C5 = - Y7 |
|
|
R6C6 = Y7 + Y8 + Y9 |
|
|
R6C7 = - Y9 |
|
|
R7C6 = - Y9 |
|
|
R7C7 = Y9 + Y10 |
|
|
Table 3 |
|||
|
--------REDUCTION SYNCHRONISATION POINT--------- |
|||
|
Equations to reduce |
IMNA matrix after reducing row 3 |
||
|
|
|
|
Nonzero Column Entries |
|
R2C2 -= R_PIVOT * R3C2 |
2 |
1 |
2 4 |
|
R2C4 -= R_PIVOT * R3C4 |
4 |
3 |
2 4 5 |
|
|
5 |
1 |
4 5 6 |
|
R_PIVOT = R4C3 / R3C3 |
6 |
2 |
5 6 7 |
|
R4C2 -= R_PIVOT * R3C2 |
7 |
3 |
6 7 |
|
--------REDUCTION SYNCHRONISATION POINT--------- |
|||
|
Equations to reduce |
IMNA matrix after reducing row 4 |
||
|
R_PIVOT = R2C4 / R4C4 |
|
|
Nonzero Column Entries |
|
R2C5 -= R_PIVOT * R4C5 |
2 |
1 |
2 5 |
|
R2C2 -= R_PIVOT * R4C2 |
5 |
1 |
2 5 6 |
|
|
6 |
2 |
5 6 7 |
|
R_PIVOT = R5C4 / R4C4 |
7 |
3 |
6 7 |
|
R5C5 -= R_PIVOT * R4C5 |
|
|
|
--------REDUCTION SYNCHRONISATION POINT--------- |
|||
|
Equations to reduce |
IMNA matrix after reducing row 5 |
||
|
R_PIVOT = R2C5 / R5C5 |
Row# |
Processor# |
Nonzero Column Entries |
|
R2C6 -= R_PIVOT * R5C6 |
2 |
1 |
2 6 |
|
R2C2 -= R_PIVOT * R5C2 |
6 |
2 |
2 6 7 |
|
|
7 |
3 |
6 7 |
|
R_PIVOT = R6C5 / R5C5 |
|
|
|
|
R6C6 -= R_PIVOT * R5C6 |
|
|
|
|
--------REDUCTION SYNCHRONISATION POINT--------- |
|||
|
Equations to reduce |
IMNA matrix after reducing row 6 |
||
|
R_PIVOT = R2C6 / R6C6 |
|
|
Nonzero Column Entries |
|
R2C7 -= R_PIVOT * R6C7 |
2 |
1 |
2 7 |
|
R2C2 -= R_PIVOT * R6C2 |
7 |
3 |
2 7 |
|
R_PIVOT = R7C6 / R6C6 |
|
|
|
|
R7C7 -= R_PIVOT * R6C7 |
|
|
|
|
Table 4 |
|||||||||
|
Circuit |
Filter |
Ladder |
Mesh |
Amplifier |
|||||
|
Nodes |
32 |
202 |
256 |
202 |
|||||
|
Equations |
78 |
1188 |
13719 |
298 |
|||||
|
p 1 |
Parts2 |
Time 3 |
Speedup4 |
Time3 |
Speedup4 |
Time3 |
Speedup4 |
Time3 |
Speedup4 |
|
1 |
1 |
2.8 |
1.00 |
112.2 |
1.00 |
334.5 |
1.00 |
27.7 |
1.00 |
|
2 |
2.3 |
1.22 |
103.5 |
1.09 |
306.1 |
1.09 |
26.5 |
1.05 |
|
|
4 |
2.2 |
1.27 |
85.5 |
1.31 |
256.1 |
1.31 |
22.4 |
1.24 |
|
|
8 |
2.3 |
1.22 |
80.4 |
1.40 |
207.2 |
1.61 |
21.5 |
1.29 |
|
|
16 |
2.5 |
1.12 |
102.3 |
1.10 |
241.1 |
1.39 |
21.8 |
1.27 |
|
|
32 |
2.7 |
1.04 |
107.1 |
1.05 |
295.3 |
1.13 |
23.2 |
1.19 |
|
|
1 |
2 |
1.0 |
1.00 |
60.3 |
1.00 |
260.1 |
1.00 |
8.2 |
1.00 |
|
2 |
0.7 |
1.43 |
34.5 |
1.75 |
137.3 |
1.89 |
8.2 |
1.00 |
|
|
4 |
0.9 |
1.11 |
42.6 |
1.42 |
123.4 |
2.11 |
7.7 |
1.06 |
|
|
8 |
0.9 |
1.11 |
27.5 |
2.19 |
103.1 |
2.52 |
6.8 |
1.21 |
|
|
16 |
0.9 |
1.11 |
30..0 |
2.01 |
102.3 |
2.54 |
6.7 |
1.22 |
|
|
32 |
1.0 |
1.00 |
36.6 |
1.65 |
183.7 |
1.42 |
7.0 |
1.17 |
|
|
1 |
4 |
0.4 |
1.00 |
48.4 |
1.00 |
76.8 |
1.00 |
2.8 |
1.00 |
|
2 |
0.4 |
1.00 |
24.3 |
1.99 |
68.8 |
1.12 |
2.8 |
1.00 |
|
|
4 |
0.3 |
1.33 |
12.9 |
3.75 |
31.3 |
2.45 |
2.8 |
1.00 |
|
|
8 |
0.4 |
1.00 |
11.9 |
4.06 |
15.9 |
4.83 |
2.6 |
1.08 |
|
|
16 |
0.4 |
1.00 |
10.5 |
4.61 |
14.9 |
5.15 |
2.4 |
1.17 |
|
|
32 |
0.4 |
1.00 |
10.6 |
4.57 |
13.4 |
5.73 |
2.5 |
1.12 |
|
|
1 |
8 |
0.2 |
1.00 |
35.9 |
1.00 |
67.1 |
1.00 |
1.2 |
1.00 |
|
2 |
0.2 |
1.00 |
18.2 |
1.97 |
40.4 |
1.66 |
1.2 |
1.00 |
|
|
4 |
0.2 |
1.00 |
12.8 |
2.80 |
25.5 |
2.63 |
1.2 |
1.00 |
|
|
8 |
0.1 |
2.00 |
5.1 |
7.03 |
17.9 |
3.75 |
1.1 |
1.09 |
|
|
16 |
0.2 |
1.00 |
4.8 |
7.48 |
6.8 |
9.87 |
1.2 |
1.00 |
|
|
32 |
0.3 |
0.66 |
3.8 |
9.45 |
6.5 |
10.32 |
1.1 |
1.09 |
|
|
Step 1: Hierarchical partitioning of the circuit (or recognising already pre-partitioned subcircuits) |
Figure 1. SCAPP Serial Algorithm


Figure 3: Typical nCUBE Data Broadcast Operation
Step 1: Build IMNA matrix from SCAPP MNA matrix such that a non zero entry of MNA is represented as 1 and a zero entry of MNA matrix is represented by 0 in IMNA Step 2: Load all rows of IMNA onto nCUBE nodes such that row r is assigned to processor (r mod p) Step 3: For each row r to be reduced, do the following: 3.1 Broadcast row and column numbers for current row to be reduced 3.2 Receive equations from nodes 3.3 Output equations |
Figure 5. Controller algorithm for terminal block analysis
|
Step 1: Receive rows of IMNA matrix designated for processor i Step 2: Receive total rows, number of rows to be reduced. Step 3: For each row to be reduced, do the following 3.1 Receive current row and column number. 3.2 If processor i holds current row, broadcast this row else receive current row. 3.3 For rows held by processor i, Compute equations to reduce current row, and the non zero elements for next reduction step 3.4 Send equations to node 0
|
Figure 6. Worker algorithm for terminal block analysis

| col 2 | col 3 | col 4 | col 5 | col 6 | col 7 | |
| R2C2 | R2C3 | 0 | 0 | 0 | 0 | (row 2) |
| R3C2 | R3C3 | R3C4 | 0 | 0 | 0 | (row 3) |
| 0 | R2C3 | R4C4 | R4C5 | 0 | 0 | (row 4) |
| 0 | 0 | R5C4 | R5C5 | R5C6 | 0 | (row 5) |
| 0 | 0 | 0 | R6C5 | R6C6 | R6C7 | (row 6) |
| 0 | 0 | 0 | 0< | R7C6 | R7C7 | (row 7) |
Figure 8: Symbolic MNA Matrix for example 1
Figure 9: Timing diagrams for serial and parallel terminal block analysis
Figure 11: Mesh Network


Figure 10: Band-pass filter [6]


Figure 12: One stage of the cascaded amplifier circuit

Figure 13: Speedup vs. number of processors for 8 partitions