D. Kaya
K. Wright
University of Newcastle upon Tyne. 1994
A number of different parallel algorithms for the LU decomposition of a square matrix A are considered. One aim of the methods is to group updates to columns together as far as possible, so as to make good use of the storage hierarchy of the shared-memory multiprocessor on which the algorithms were tested. Variants in which L is unit lower triangular and variants in which U is unit upper triangular are both considered. The results presented were obtained using the C++ programming language, with parallel constructs provided by the Encore Parallel Threads package, on an Encore Multimax computer. These results indicate significant improvements over a simple parallel implementation of the standard Crout algorithm, and good speedup compared to the sequential Crout algorithm.
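For orientation, the sequential Crout algorithm used as the baseline can be sketched as follows. This is a minimal illustration and not the authors' code: it assumes the variant in which U is unit upper triangular, performs no pivoting, and stores matrices as nested vectors. The names `Matrix`, `LU` and `crout_lu` are hypothetical, and none of the parallel constructs of the Encore Parallel Threads package are shown.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical types for illustration only: a dense row-major matrix.
using Matrix = std::vector<std::vector<double>>;

struct LU {
    Matrix L;  // lower triangular, general diagonal
    Matrix U;  // unit upper triangular
};

// Sequential Crout factorisation A = L * U with U unit upper triangular.
// Assumption: no pivoting, so a zero pivot is reported as an error.
LU crout_lu(const Matrix& A) {
    const std::size_t n = A.size();
    for (const auto& row : A) {
        assert(row.size() == n);  // A must be square
    }

    Matrix L(n, std::vector<double>(n, 0.0));
    Matrix U(n, std::vector<double>(n, 0.0));

    for (std::size_t j = 0; j < n; ++j) {
        // Column j of L: all updates from earlier columns are applied
        // together, just before the column is finalised.
        for (std::size_t i = j; i < n; ++i) {
            double sum = A[i][j];
            for (std::size_t k = 0; k < j; ++k) {
                sum -= L[i][k] * U[k][j];
            }
            L[i][j] = sum;
        }

        if (std::fabs(L[j][j]) == 0.0) {
            throw std::runtime_error("zero pivot: pivoting would be required");
        }

        // Row j of U: unit diagonal, remaining entries scaled by the pivot.
        U[j][j] = 1.0;
        for (std::size_t i = j + 1; i < n; ++i) {
            double sum = A[j][i];
            for (std::size_t k = 0; k < j; ++k) {
                sum -= L[j][k] * U[k][i];
            }
            U[j][i] = sum / L[j][j];
        }
    }
    return {L, U};
}
```

Calling `crout_lu` on a small matrix and multiplying the returned factors should reproduce the input up to rounding error. The inner loops over `k` are where updates from earlier columns accumulate, which is the part of the computation that the column-grouping idea described above is concerned with.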