Tour: Matrix Addition

Matrix addition is a simple, embarrassingly parallel element-wise operation. With our previous knowledge, we could parallelize matrix addition by enclosing each element of the matrix in a GuardVar, but this would in all likelihood be over-synchronized and would perform poorly.

In this section, we will explore the PartList type, which allows us to express array-based parallel computations efficiently and concisely.

Formulation

Matrix addition is defined as:

\begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix} + \begin{bmatrix} b_{11} & b_{12} & \cdots & b_{1n} \\ b_{21} & b_{22} & \cdots & b_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ b_{m1} & b_{m2} & \cdots & b_{mn} \end{bmatrix} = \begin{bmatrix} a_{11} + b_{11} & a_{12} + b_{12} & \cdots & a_{1n} + b_{1n} \\ a_{21} + b_{21} & a_{22} + b_{22} & \cdots & a_{2n} + b_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} + b_{m1} & a_{m2} + b_{m2} & \cdots & a_{mn} + b_{mn} \end{bmatrix}

Put simply, we add each element of the first matrix to the corresponding element of the second matrix.

Matrix Type

Guardian currently does not have a built-in matrix or n-dimensional array type. However, we can define a simple matrix type using PartList to store a flattened array of elements. That is, for a $M \times N$ matrix, we can store the elements in a PartList of length $M \cdot N$, where the first $N$ elements comprise the first row, the next $N$ elements comprise the second row, and so on.

Formally:

\text{PartList}[i] = \text{Matrix}_{m n} \text{ where } m = \left \lfloor{\frac{i}{n}}\right \rfloor \text{ and } n = i \pmod{n}

Let's define this as a class in Guardian. Our class should have three fields:

m: The number of rows in the matrix.
n: The number of columns in the matrix.
data: The flattened PartList of elements.

The fields m and n are immutable value types (integers), because the matrix never changes dimensions. The data field is a PartList, which is a vault type, meaning that it has interior, thread-safe mutability. This means that our class must be a vault type as well.

Here's our class:

vault class Matrix { 
    m: int; // Number of rows
    n: int; // Number of columns
    data: PartList<real>; // Flattened array of elements
}

Let's add a constructor to our class that initializes the matrix. To initialize a PartList, we need to call the PartList::of function, which takes a size and an initialization function of type (int) -> T, where the int argument is an index and T is the type of the elements in the PartList.

Since our matrix type is supposed to be two-dimensional, let's say that its constructor must take an initialization function of type (int, int) -> real, where the int arguments are the row and column indices.

Note: Due to a language limitation which may be lifted in the future, we cannot attach a suffix lambda to a constructor call using the new keyword. To work around this, we create a static of method on the class that takes the same arguments as the constructor, and call the constructor from inside that method. This will create an interface similar to PartList::of.

Here's our Matrix class with the constructor added:

vault class Matrix {
    m: int;
    n: int;
    data: PartList<real>;

    // Constructor
    private new(m: int, n: int, initFn: (int, int) -> real) {
        this.m = m;
        this.n = n;
        this.data = PartList::of(m * n) |i| {
            yield initFn(i / n, i % n);
        };
    }

    // HACK: Currently can't pass suffix lambda to a constructor directly.
    static fn of(m: int, n: int, initFn: (int, int) -> real) -> Matrix {
        return new Matrix(m, n, initFn);
    }
}

Matrix Methods

First, let's define a method to print a Matrix object to the console.

To access the elements of a PartList, we must use the PartList::runParted method. This method transparently partitions the elements, then runs a task on those partitions in parallel. It takes four arguments:

A chunk specifier. This is either an int specifying the number of evenly-sized chunks to create, or a ValueList<(int, int)> specifying the exact intervals comprising each partition.
The number of ghost zones. This is not relevant to us now, but it is useful for stencil or kernel computations, where we need to read from neighboring elements. If this argument is omitted, it is implicitly zero.
At least one PartIntent, specifying which PartLists are involved in the computation, and whether they are being read to or written from.
A task to run on each partition. This task accepts a number of PartViews reflecting the number of intents. Each view corresponds to a slice of a PartList, and has reading or writing capabilities according to the intent. We can subscript and assign to these views with square brackets like a normal array type.

This is a lot of information to take in all at once, but it will hopefully become clearer when we look at some usages of runParted.

Here's our print method. Note that it has to return a future because runParted is asynchronous:

fn print() -> Fut<void> {
    return PartList::runParted(1, this.data.read()) |data|:   // Line A
    for i in 0..this.m {
        for j in 0..this.n {
            Out::print(data[i * this.n + j] + " ");
        }
        Out::println("");
    }
}

On line A, we call runParted. First, we pass in a chunk specifier of 1, which means that we want to create one partition (i.e., we are not parallelizing this operation at all). We also omit the ghost zone argument, since we are not performing a stencil computation.

Next, we pass in a PartIntent that specifies that we want to read from the data field.

Then, we attach a suffix lambda specifying what to do with each partition. In this case, we simply iterate over the elements of the partition and print them to the console.

Finally, we return the future returned by runParted to satisfy the must-consume rule.

Let's make sure everything is working by adding a main class with an entry function and executing the program. Here's our code so far:

namespace program;

import gu4::Out;

vault class Matrix {
    m: int; // num rows
    n: int; // num cols
    data: PartList<real>;

    private new(m: int, n: int, initFn: (int, int) -> real) {
        this.m = m;
        this.n = n;
        this.data = PartList::of(m * n) |i| {
            yield initFn(i / n, i % n);
        };
    }

    // HACK: Currently can't pass suffix lambda to a constructor directly.
    static fn of(m: int, n: int, initFn: (int, int) -> real) -> Matrix {
        return new Matrix(m, n, initFn);
    }

    fn print() -> Fut<void> {
        return PartList::runParted(1, this.data.read()) |data|:
        for i in 0..this.m {
            for j in 0..this.n {
                Out::print(data[i * this.n + j] + " ");
            }
            Out::println("");
        }
    }
}

value class Program {
    async entry fn main() -> Fut<void> {
        let a = Matrix::of(3, 3) |i, j| { yield 0.0 + i + j; };  // Add 0.0 to each element to coerce from int to real.

        Out::println("Matrix A:");
        a.print().then():

        async return;
    }
}

If we execute this, we should see the following output:

Matrix A:
0.0 1.0 2.0 
1.0 2.0 3.0 
2.0 3.0 4.0

Matrix Addition

Now that we have a working matrix type, let's write a method to add two matrices together. In this case, rather than modifying the first matrix in-place, we will return a new matrix containing the sum.

Here's the signature for our method:

async fn add(other: Matrix) -> Fut<Matrix> {
    /* Implement me! */
}

First, we need to check that both matrices have the same dimensions. We can do this by checking that the m and n fields of both matrices are equal:

if this.m != other.m || this.n != other.n {
    Error::die("Mismatched matrix dimensions: " + this.m + "x" + this.n + " + " + other.m + "x" + other.n);
}

The Error::die function will terminate the program with an error message. We need to import Error; to use it.

Next, we need to create a new matrix to hold the result. We can do this by calling the Matrix::of method to obtain a new Matrix, passing in the dimensions of the result and a function that initializes each element of the result to zero:

let result = Matrix::of(this.m, this.n) |i, j| {
    yield 0.0;
};

Let's also decide how many partitions we want to create. This is essentially the number of threads we want to use to compute the result. For simplicity, we will take the larger of m and n:

let threads = Math::max(this.m, this.n);

We need to import Math; to use Math::max.

Now, we can use runParted to compute the result in parallel. We need to read from the two addends and write to the result. We can do this by passing in three PartIntent objects:

PartList::runParted(threads, this.data.read(), other.data.read(), result.data.write()) |self, other, result| {
    result[..] = self[..] + other[..];
}

The [..] syntax is part of a feature in Guardian called auto-vectorization. It's used to concisely express loops over elements of subscriptable types. In this case, the .. with no indices means "all elements." The compiler will generate a loop over all elements of the views, like this:

PartList::runParted(threads, this.data.read(), other.data.read(), result.data.write()) |self, other, result| {
    for i in 0..result.writableSize() {
        result[i] = self[i] + other[i];
    }
}

Recall that runParted returns a future, and we will need to consume it. Since this future represents the completion of all tasks initiated by runParted, its completion means our result is ready. So, we will simply return the result:

PartList::runParted(threads, this.data.read(), other.data.read(), result.data.write()) |self, other, result| {
    result[..] = self[..] + other[..];
}.then():

async return result;

Here's the full program with the add function and an expanded main method:

namespace program;

import gu4::Out;
import gu4::Error;
import gu4::Math;

vault class Matrix {
    m: int; // num rows
    n: int; // num cols
    data: PartList<real>;

    private new(m: int, n: int, initFn: (int, int) -> real) {
        this.m = m;
        this.n = n;
        this.data = PartList::of(m * n) |i| {
            yield initFn(i / n, i % n);
        };
    }

    // HACK: Currently can't pass suffix lambda to a constructor directly.
    static fn of(m: int, n: int, initFn: (int, int) -> real) -> Matrix {
        return new Matrix(m, n, initFn);
    }

    async fn add(other: Matrix) -> Fut<Matrix> {
        if this.m != other.m || this.n != other.n {
            Error::die("Mismatched matrix dimensions: " + this.m + "x" + this.n + " + " + other.m + "x" + other.n);
        }

        let result = Matrix::of(this.m, this.n) |i, j| {
            yield 0.0;
        };

        let threads = Math::max(this.m, this.n);

        PartList::runParted(threads, this.data.read(), other.data.read(), result.data.write()) |self, other, result| {
            result[..] = self[..] + other[..];
        }.then():

        async return result;
    }

    fn print() -> Fut<void> {
        return PartList::runParted(1, this.data.read()) |data|:
        for i in 0..this.m {
            for j in 0..this.n {
                Out::print(data[i * this.n + j] + " ");
            }
            Out::println("");
        }
    }
}

value class Program {
    async entry fn main() -> Fut<void> {
        let a = Matrix::of(3, 3) |i, j| { yield 0.0 + i + j; };
        let b = Matrix::of(3, 3) |i, j| { yield 0.0 + i - j; };

        a.add(b).then() |c|:

        Out::println("Matrix A:");
        a.print().then():

        Out::println("Matrix B:");
        b.print().then():

        Out::println("Matrix A + B:");
        c.print().then():

        async return;
    }
}

Here's what we expect to see when we run the program:

Matrix A:
0.0 1.0 2.0 
1.0 2.0 3.0 
2.0 3.0 4.0 
Matrix B:
0.0 -1.0 -2.0 
1.0 0.0 -1.0 
2.0 1.0 0.0 
Matrix A + B:
0.0 0.0 0.0 
2.0 2.0 2.0 
4.0 4.0 4.0

All done!

In the next section, we'll look at how to use PartList to implement a more complicated parallel computation.