$\hbar = c = 1$ : On the Parallelism of Symmetric Sparse Matrix Vector Multiplication

Symmetric Sparse Matrix

Sparse matrices are the matrices populated primarily with zeros. Iterative linear solvers, such as the Lanczos-Arnoldi procedure, are among the most important computational tools for solving large symmetric (Hermitian) sparse linear problems. The fundamental operation in the iterative methods is the sparse matrix vector multiplication (SpMV).

Sparse matrix can be stored in several formats. The most flexible one is the coordinate storage scheme (COO). In this scheme, one use three arrays (two integral and one scalar):

row(nnz), col(nnz), mat(nnz)

to store the row indices, column indices and the values of the nonzero matrix elements. It's easy to convert a coordinate storage scheme into other schemes such as the compressed column/row scheme (CCS/CRS). For SpMV, coordinate scheme is as efficient as CCS/CRS. It demands $2 n_{nz} + n_{nz}$ memory comparing with CCS/CRS $(n_{nz} + n) + n_{nz}$, where $n$ is the dimensionality of the matrix and $n_{nz}$ is the number of nonzeros, and $n^2 \gg n_{nz} \gg n$ for most useful cases. Only half of the matrix is stored, because the matrix is symmetric (Hermitian).

Serial Algorithm

subroutine SpMV(nnz, row, col, mat, n, x, y, ierr)
implicit none
! A*x + y -> y
! n is the dimension of the vector x, y
! nnz is the number of nonzeros
integer, intent(in) :: nnz, n, row(nnz), col(nnz)
double precision, intent(in) :: mat(nnz), x(n)
double precision, intent(inout) :: y(n)
integer, intent(out) :: ierr
integer :: i

  do i = 1, nnz

    y(row(i)) = y(row(i)) + mat(i)*x(col(i))
    if(row(i) .ne. col(i)) then
      y(col(i)) = y(col(i)) + x(row(i))*conjg(mat(i))
    end if

  end do
  ierr = 0
end subroutine

Parallel Algorithm

Here we only talk about MPI parallelism. In reality, it is most efficient to use both MPI and OpenMP. The OpenMP parallelism can be used in the series part.

2D cyclic distribution

A parallel SpMV depends on the the distribution of the processors across the processors. 1D distribution has a serious load balancing problem. An simple and popular scheme is the 2D cyclic distribution.

Left: A naive 2D distribution. Center: A cyclic 2D distribution. Right: A cyclic 2D distribution with non-square block matrices.

In this scheme, the $n_p$ processors are aligned as a $n_d \times n_d$ 2D grid. For symmetric matrices, only half of the grid is needed. Then, for $n_p$ processors, $ n_d (n_d + 1)/2 \le n_p \implies n_d \le \text{floor}\left( \frac{\sqrt{1+n_p}-1}{2}\right)$. It is useful to assign each processor a 2D coordinate $(I, J)$, where $I,J = 1,2,3,\cdots, n_d$. We will discuss the processor assignment later.

The vector $x$ is stored on the diagonal and the matrix $A$ is distributed across the processor grid. More specifically, vector element $x_i$ is stored in processor $(I, I)$, where $I = \mod(i, n_d) + 1$. Matrix element $A_{ij}$ is distributed to processor $(I, J)$, $I = \mod(i, n_d)+1, J = \mod(j,n_d)+1$, forming a local block matrix $A_{IJ}$.

The local blocks need not be square matrix, except the diagonal blocks. Let $n_I$ be the dimensionality of the $I$-th diagonal blocks. It is easy to see, the row and column dimensionality of the block $A_{IJ}$ is $n_I$ and $n_J$, respectively. \[ n_I = \text{floor}\left(\frac{n-1}{n_d}\right) + \vartheta\left(n-\text{floor}\left(\frac{n-1}{n_d}\right)\cdot n_d - I \ge 0\right) \]

Processor Assignment: Diagonal Major (DM) Scheme

In the DM scheme, the processors are assigned to the matrix blocks according to their diagonal coordinate:

Matrix block $A_{IJ}$ is on processor $ p = F(I-J-1)+J-1$, where $F(x) = (x+1) (2 n_d - x)/2$. In practice, one knows the local process id $p$ and wants to determine which matrix block is stored on it: \[
\begin{split}
I &= 1+p+1+G(p+1)-F(G(p+1)-1); \\
J &= p+1-F(G(p+1)-1).
\end{split}
\] where $G(x) = \text{floor}\left(n_d - \sqrt{\left(n_d+\frac{1}{2}\right)^2-2x} - \frac{1}{2}\right)$.

To store the node information, we introduce a FORTRAN module (a struct is a similar structure in C):

module nodeinfo
implicit none

integer :: nproc, myrank, myrow, mycol, ndiag, idiag;
integer :: proc_comm, row_comm, col_comm;

contains

subroutine initnodeinfo
use mpi
implicit none
integer :: ierr

! determine ndiag, nproc and rank
  call mpi_comm_size(mpi_comm_world, nproc, ierr)
  call mpi_comm_rank(mpi_comm_world, myrank, ierr)
  ndiag = floor(0.5*sqrt(1.+8.*nproc)-0.5)
  
  color = 0
  if(ndiag*(ndiag+1)/2>rank) then
    color = 1
  endif

  call mpi_comm_split(mpi_comm_world, color, myrank, proc_comm, ierr)
  
  call mpi_comm_size(proc_comm, nproc, ierr)
  call mpi_comm_rank(proc_comm, myrank, ierr)

! determine myrow and mycol i.e. (I,J) the block ID
  mycol = 1+rank-F(G(myrank,ndiag),ndiag)
  myrow = G(myrank,ndiag)+mycol+1
  idiag = myrow - mycol

! create row_comm and col_comm, diag_comm
  call mpi_comm_split(proc_comm, myrow, abs(idiag), row_comm, ierr)
  call mpi_comm_split(proc_comm, mycol, abs(idiag), col_comm, ierr)
  call mpi_comm_split(proc_comm, idiag, myrank, diag_comm, ierr)
  
end subroutine
  
integer function F(x, nd)
implicit none
integer, intent(in) :: x, nd

  F = (x+1)*(2*nd-x)/2

end function

integer function G(x, nd)
implicit none
integer, intent(in) :: x, nd

  G = floor(nd-0.5-sqrt((nd+0.5)**2-2*x)-0.5)

end function

integer function nloc(n, nd, i)
implicit none
integer, intent(in) :: n, nd, i

  nloc = (n-1)/nd
  if(n.ge.((n-1)/nd)*nd+i) then
    nloc = nloc + 1
  end if

end function

end module

The problem of the DM scheme is that the number of processors in each row & column communicator is not the same. Indeed, it is not well balanced: some rows & columns have $n_d$ (ndiag) processors, some of them have only 1. This problem can be solved by using the so-called Balanced Column Major scheme.

Processor Assignment: Balanced Column Major (BCM) Scheme

In the BCM scheme, each row or column has $\frac{n_d+1}{2}$ blocks.

The corresponding matrix blocks distribution on processor $p$ is,
if $ p\le \text{ceiling}\left(\frac{n_d+1}{2}\right)\left[ n_d+1-\text{ceiling}\left(\frac{n_d+1}{2}\right)\right]-1$ then \[ \begin{split}
J &= \text{floor}\left(\frac{p}{\text{ceiling}\left(\frac{n_d+1}{2}\right)}\right)+1; \\ I &= \mod\left( p, \text{ceiling}\left(\frac{n_d+1}{2}\right)\right)+J;
\end{split} \]
else,
\[ \begin{split} J &= n_d+2-\text{ceiling}\left(\frac{n_d+1}{2}\right)+\text{floor}\left( \frac{q}{\text{floor}\left(\frac{n_d+1}{2}\right)}\right); \\ I &= \mod\left( \mod\left(q, \text{floor}\left(\frac{n_d+1}{2}\right)\right) +J-1, n_d\right)+1; \end{split} \] where $q = p-\text{ceiling}\left(\frac{n_d+1}{2}\right)\left[n_d+1-\text{ceiling}\left(\frac{n_d+1}{2}\right)\right]$.

The nodeinfo.mod can be modified accordingly.

The Parallel SpMV Algorithm Based on the 2D cyclic Distribution

$A = L + D + L^t$, where $L$ is a strict lower triangular matrix and $D$ is a diagonal matrix. $T = L+D$.
\[ A x + y \to y \Rightarrow T x + (x^t L)^t + y \to y \]

subroutine SpMV(nnz, row, col, mat, ncol, x, yt, nrow, y, xt, ierr)
use nodeinfo
use mpi
implicit none
! A*x + y -> y
! A = L + D + L^t, where L is a strict lower triangular matrix, 
! D is a diagonal matrix, L^t is the transpose of L.
! Let T = L+D; A = T + L^t.  
! A*x = T*x + L^t*x = T*x + (x^t*L)^t
! 
! x/y is the local vector. Only x/y on the diagonal are initialized/output
! xt/yt is the local vector of the transpose vector of x/y
! note their local dimensionalities nrow, ncol are not the same as the original vector.
integer :: nnz, ncol, nrow, ierr, row(nnz), col(nnz);
double precision :: x(ncol), yt(ncol), y(nrow), xt(nrow), mat(nnz);
intent(in) :: nnz, ncol, nrow, row, col, mat;

! (L+D)*x + y -> y
  call mpi_bcast(x, ncol, mpi_real8, 0, col_comm, ierr)
  
! local SpMV
  if(idiag.eq.0) then
    xt = y
  end if
  do i = 1, nnz
    xt( 1+(row(i)-1)/ndiag ) = xt( 1+(row(i)-1)/ndiag ) + mat(i)*x( 1+(col(i)-1)/ndiag )
  end do

  call mpi_reduce(xt, y, nrow, mpi_real8, mpi_sum, 0, row_comm, ierr)

! x^t * L -> y^t
  if(idiag.eq.0) then
    xt = x
    x = y
  end if
  call mpi_bcast(xt, nrow, mpi_real8, 0, row_comm, ierr)
  
! local SpMV
  do i = 1, nnz
    if( row(i) .eq. col(i) ) then
      x( 1+(col(i)-1)/ndiag ) = x( 1+(col(i)-1)/ndiag ) + xt( 1+(row(i)-1)/ndiag )*conjg(mat(i))
    end if
  end do

  call mpi_reduce(x, yt, ncol, mpi_real8, mpi_sum, 0, col_comm, ierr)

  if(idiag.eq.0) then
    y = y + yt
    x = xt
  endif

end subroutine

$\hbar = c = 1$

Jul 2, 2013

On the Parallelism of Symmetric Sparse Matrix Vector Multiplication