A Domain-Specific Architecture for Elementary Function Evaluation

by Anuroop Sharma, Christopher Kumar Anand
5 September 2014

We propose a Domain-Specific Architecture for elementary function computation to improve throughput while reducing power consumption as a model for more general applications: support fine-grained parallelism by eliminating branches, eliminate the duplication required by co-processors by decomposing computation into instructions which fit existing pipelined execution models and standard register files. Our example instruction architecture (ISA) extension supports scalar and vector/SIMD implementations of table-based methods of calculating all common special functions, with the aim of improving throughput by (1) eliminating the need for tables in memory, (2) eliminating all branches for special cases, (3) reducing the total number of instructions. Two new instructions are required, a table-lookup instruction and an extended-precision floating-point multiply-add instruction with special treatment for exceptional inputs.

To estimate the performance impact of these instructions, we implemented them in a modified Cell/B.E. SPU simulator and observed an average throughput improvement of 2.5 times for optimized loops mapping single functions over long vectors.

Read the CAS Report.

Symbolic Generation of Parallel Solvers for Inverse Imaging Problems

By Jessica L M Pavlin and Christopher Kumar Anand
5 September 2014

Medical Image Reconstruction is characterized by requirements (1) to process high volumes of data quickly so sick patients need not wait or be called back for re-imaging, and (2) provide the highest level of software quality, including both correctness and safety. Parallelization today allows very high levels of performance, at the cost of duplicating hardware resources and increasing software complexity. Software complexity makes it harder to ensure correctness and safety in a domain where errors can easily arise out of miscommunication between domain experts and software developers.

To solve these problems, we propose a development environment which transforms simple mathematical models into parallel programs via term graph transformations and code generation to an intermediate language we call Atomic Verifiable Operations (AVOps), which we have previously introduced together with a linear-time verification method. In this paper, we will describe a variation on AVOps which support multi-core systems, and, for the first time, the Coconut Expression Library, a Domain Specific Language (DSL) for domain experts to specify mathematical models in the form of objective functions, and for performance experts to provide rule-based transformations to compute gradients, algebraic simplifications and finally parallelizations expressed using AVOps.

Read the CAS Report.

Type-Safety for Inverse Imaging Problems

by Maryam Moghadas, Yuriy Toporovskyy, Christopher Kumar Anand
4 August 2014

This paper gives a partial answer to the question: ``Can type systems detect modelling errors in scientific computing, particularly for inverse problems derived from physical models?'' by considering, in detail, the major aspects of inverse problems in Magnetic Resonance Imaging (MRI). We define a type-system that can capture all correctness properties for basic MRI inverse problems, including many properties that are not captured with current type-systems, e.g., frames of reference. We implement a type-system in the Haskell language that can capture the errors arising in translating a mathematical model into a linear or nonlinear system or alternatively into an objective function. Most models are--or can be approximated by--linear transformations and we demonstrate the feasibility of capturing their correctness at the type level using a difficult case: the (discrete) Fourier transformation (DFT). By this, we mean that we are able to catch, at compile time, all errors known to us in applying the DFT. We describe the Haskell implementation of vector size, physical units, frame of reference, and so on required in the mathematical modelling of inverse problems without regularization.

Read the CAS Report, and try the code with Nat, without Nat and with fixed-point decimal numbers.

Synthesising and Verifying Multi-Core Parallelism in Categories of Nested Code Graphs

By Christopher Kumar Anand and Wolfram Kahl
January 24, 2008
Revised version to appear in Process Algebra for Parallel and Distributed Processing.

We present the Multi-Core layer of the larger Coconut project to support high-performance, high-assurance scientific computation. Programs are represented by nested code graphs, using domain specific languages.
At the Multi-Core level, the language is very restricted, in order to restrict control flow to non-branching, synchronising control flow, which allows us to treat multi-core parallelism in essentially the same way as instruction-level parallelism for pipelined multi-issue processors. The resulting schedule is then presented as a ``locally sequential program'', which, like high-quality conventional assembly code in the single-core case, is arranged for hiding latencies at execution time so that peak performance can be reached, and can also be understood by programmers. We present an efficient, incremental algorithm capable of verifying the soundness of the communication aspects of such programs.

Read the SQRL Report. Slides used in video. The first half of the talk is about Coconut in general, and the second half focuses on some of the ideas in this report and touches on ExSSP.

Optimal Fourier Transform Sampling Patterns

By Christopher Kumar Anand and Anuroop Sharma
August 10, 2007

A model for optimizing the sampling of the Fourier Transform of a function is presented, together with a sequential semi-definite trust region strategy, and an implementation using the open-source solver CSDP. Numerical performance correlates with increasing dimension and decreasing number of non-zero function values. Simulation of an application to multi-dimensional NMR (used to study to protein dynamics), predict a hundred-fold reduction in data acquisition time and increased signal to noise ratio when compared to the most common experimental method.

Keywords: NMR; MRI; Medical Imaging; Sparse k-Space Sampling; Nonuniform Fourier transform; SDP; Semi-Definite Optimization

Read AdvOL Report. Slides.

Energy-Constrained Pulse Design for MRI and NMR

By Christopher Kumar Anand and Andrew Thomas Curtis
July 26, 2007

Optimizing radio frequency (rf) pulses is of interest in both Magnetic Resonance Imaging and Nuclear Magnetic Resonance Spectroscopy. Recent research has focussed on reducing rf energy for imaging, and improving excitation profile uniformity for spectroscopy. This paper summarizes the different ap- proaches used to date, and presents an approach using continuous optimization that utilizes second derivative information of the objective function, with two examples of novel, sub-millisecond pulse designs: an energy-constrained pulse for steady state imaging, and a linear-phase, broadband 90 degree pulse for NMR. The benefits of using second derivatives of the objective function are examined, and a method is presented for efficient computation. All optimization is done using the open-source optimization package IPOPT. Computationally efficient integration of the Bloch equations and derivative calculations are performed using code symbolically generated by Maple. Keywords:radio-frequency pulse; magnetic resonance imaging; optimal control; IPOPT; SAR; specific absorption rate; slice selection; broadband pulse

Read the AdvOL Report.

Volumetric k-Space Trajectories Via Genetic Algorithms

By Andrew Thomas Curtis and Christopher Kumar Anand
June 6, 2007
Revised version published in the International Journal of Biomedical Imaging

A pseudo-random, velocity-insensitive, volumetric k-space sampling trajectory is designed for use with balanced steady state imaging. Individual arcs are designed independently and do not fit together in the way that multi-shot spiral, radial or echo-planar trajectories do. Previously, it was shown that second-order cone optimization problems can be defined for each arc independent of the others, that nulling of zeroth and higher moments can be encoded as constraints, and that individual arcs can be optimized in seconds. For use in steady state imaging, sampling duty cycles exceed 95 percent. Using such pseudo-random trajectories, aliasing caused by undersampling manifests itself as incoherent noise. In this paper, a genetic algorithm is formulated and numerically evaluated. A large set of arcs is designed using previous methods, and the genetic algorithm choses particular fit subsets of a given size, corresponding to a desired acquisition time. Numerical simulations of 1s acquisitions show good detail and acceptable noise for large-volume imaging with 32 coils.

Read the AdvOL Report.

Simulation of steady-state NMR of coupled systems using Liouville space and computer algebra methods

By Christopher Kumar Anand, Alex D. Bain and Zhenghua Nie
June 4, 2007,

A series of repeated pulses and delays applied to a spin system generates a steady state. This is relatively easy to calculate for a single spin, but coupled systems present real challenges. We have used Maple, a computer algebra program to cal- culate one- and two-spin systems symbolically, and larger systems numerically. The one-spin calculations illustrate and validate the methods and show how the steady- state free precession method converges to continuous wave NMR. For two-spin sys- tems, we have derived a general formula for the creation of double-quantum signals as a function of irradiation strength, coupling constant, and chemical shift difference. The calculations on three-spin and larger systems reproduce and extend previously published results. In this paper, we have shown that the approach works well for systems in the literature. However, the formalism is general and can be extended to more complex spin systems and pulses sequences.

Read AdvOL Report.

MultiLoop: Efficient Software Pipelining for Modern Hardware

By Christopher Kumar Anand and Wolfram Kahl
May 8, 2007

This paper is motivated by trends in processor models of which the Cell BE is an exemplar, and by the need to reliably apply multi-level code optimizations in safety-critical code to achieve high performance and small code size.

A MultiLoop is a loop specification construct designed to expose in a structured way details of instruction scheduling needed for performance-enhancing transformations. We show by example how it may be used to make better use of underlying hardware features, including software branch prediction and SIMD instructions. In each case, the use of MultiLoop transformations allows us to take full advantage of software branch prediction to completely eliminate branch misses in our scheduled code, and reduce the cost of loop overhead by using SIMD vector instructions.

Given the novelty of our representation, it is important to demonstrate feasibility (of zero branch misses) and evaluate performance (of transformations) on a wide set of representative examples from numerical computation. We include simple loops, nested loops, sequentially-composed loops, and loops containing control flow. In some cases we obtain significant benefits: halving execution time, and halving code size. As many details as possible are provided for other compiler writers wishing to adopt our innovative transformations, including instruction selection for SIMD-aware control flow.

Read the full paper.

A Domain-Specific Language for the Generation of Optimized SIMD-Parallel Assembly Code

By Christopher Kumar Anand and Wolfram Kahl
April 24, 2007,
A revised version of this paper will appear in IEEE Transactions on Computers.

We present a domain-specific language embedded into Haskell that allows mathematicians to formulate novel high-performance SIMD-parallel algorithms for the evaluation of special functions.

Developing such functions involves explorations both of mathematical properties of the functions which lead to effective (rational) polynomial approximations, and of specific properties of the binary representation of floating point numbers. Our framework includes support for estimating the effectiveness of different approximation schemes in Maple. Once a scheme is chosen, the Maple-generated component is integrated into the code generation setup. Numerical experimentation can then be performed interactively, with support functions for running standard tests and tabulating results. Once a satisfactory formulation is achieved, a codegraph representation of the algorithm can be passed to other components which produce C function bodies, or to a state-of-the-art scheduler which produces optimal or near-optimal schedules, currently targetting the ``Cell Broadband Engine'' processor.

Encapsulating a considerable amount of knowledge about specific tricks in DSL constructs allows us produce algorithm specifications that are precise, readable, and compile to optimal-quality assembly code, while formulations of the equivalent algorithms in C would be almost impossible to understand and maintain.

Read SQRL Report 43.

Robust Solvers for Inverse Imaging Problems using Dense Single-Precision Hardware

By Christopher Kumar Anand
January 25, 2007
A revised version of this paper has appeared in J. Math Image Vision

We present an iterative framework for robustly solving large inverse problems arising in imaging using only single-precision (or other reduced-precision) arithmetic, which allows the use of high-density processors (e.g. Cell BE and Graphics Processing Units). Robustness here means linear-convergence even for large problems (billions of variables), with high levels of noise (signal to noise levels less than unity). This framework handles problems formulated as quadratic and general non-linear minimization problems. Sparse and dense problems can be treated, as long as there are efficient parallelizable matrix-vector products for the transformations involved. Outer iterations correspond to approximate solutions of a quadratic minimization problem, using a single Newton step. Inner iterations correspond to the estimation of the step via truncated Neumann series or minimax polynomial approximations built from operator splittings. Given the simple convergence analysis, this approach can also be used in embedded environments with fixed computation budgets, or certification requirements, like real-time medical imaging. We describe a benchmark problem from MRI, and a series of penalty functions suited to this framework. An important family of such penalties is motivated by both Bilateral Filtering and Total Variation, and we show how they can be optimized using linear programming. We also discuss penalties designed to segment images, and use different types of a priori knowledge, and show numerically that the different penalties are effective when used in combination.

Read the entire paper here. See the slides: pdf, ppt, mov.

Durga: A Heuristically-Optimized Data Collection Strategy For Volumetric Magnetic Resonance Imaging

By Christopher Kumar Anand, Andrew Thomas Curtis, and Rakshit Kumar
August 8, 2006, Now in in Engineering Optimization.

We present a heuristic design method for rapid volumetric magnetic resonance imaging data acquisition trajectories using a series of second order cone optimization subproblems. Other researchers have considered non-raster data collection trajectories and under-sampled data patterns. This work demonstrates that much higher rates of under-sampling are possible with an asymmetric set of trajectories, with very little loss in resolution, but the addition of noise-like artifacts. The proposed data collection trajectory, Durga, further minimizes collection time by incorporating short, un-refocussed excitation pulses, resulting in above 98 percent collection efficiency for balanced steady state free precession imaging. The optimization sub-problems are novel, in that they incorporate all requirements, including data collection (coverage), physicality (device limits) and signal generation (zeroth and higher moment properties) in a single convex problem, which allows the resulting trajectories to exhibit a higher collection efficiency than any existing trajectory design.

Read the entire paper here.

Magnetic Resonance Tissue Quantification using Optimal bSSFP Pulse-Sequence Design

By Christopher Kumar Anand, Renata Sotirov, Tamás Terlaky, and Zhuo Zheng
Online version published by Optimization and Engineering.

We propose a merit function for the expected contrast to noise ratio in tissue quantifications, and formulate a nonlinear, nonconvex semidefinite optimization problem to select locally-optimal balanced steady-state free precession (bSSFP) pulse-sequence design variables. The method could be applied to other pulse sequence types, arbitrary numbers of tissues, and numbers of images. To solve the problem we use a mixture of a grid search to get good starting points, and a sequential, semidefinite, trust-region method, where the subproblems contain only linear and semidefinite constraints. We give the results of numerical experiments for the case of three tissues and three, four or six images, in which we observe a better increase in contrast to noise than would be obtained by averaging the results of repeated experiments. As an illustration, we show how the pulse sequences designed numerically could be applied to the problem of quantifying intraluminal lipid deposits in the carotid artery.

Read the entire paper here.

Optimizing Teardrop, an MRI Sampling Trajectory

By Christopher Kumar Anand, Tingting Ren, and Tamás Terlaky
September 3, 2006

Teardrop is an efficient sampling tra jectory for acquiring Magnetic Resonance (MR) Imaging Data, especially balanced steady state free precession images. In this paper we present two models for optimizing such trajectories. These are the first models to incorporate motion-insensitivity constraints into a non-raster (also called spiral) sampling trajectory. The first model is nonlinear and very specific to Teardrop. The second model uses sequential second-order cone programming, and is generalisable to other trajectories in two and three dimensions. We present a weak convergence proof for the sequential method.

Read the entire paper here. Revised version in on-line edition of Optimization Methods and Software.

CELL SPU Math Library Using Register-Level Lookup

By Christopher Kumar Anand, Wei Li, Anuroop Sharma, and Sanvesh Strivastava
October 14, 2006

This paper demonstrates techniques for increasing instruction-level parallelism via register-level lookups as an alternative to predication, using examples from library of elementary math functions targetting CELL SPUs. Comparing the performance of our library with functions released in the CELL SDK 1.1 which do not use register-level lookup to improve parallelism, we measure a performance advantage of 40 percent on average. Since some functions are too simple to benefit, the average masks much larger improvements for more complicated functions. We developed this library using a declarative assembly language, and SPU simulator written in Haskell. Through the examples, we show how this environment supports the rapid prototyping of compiler-optimization such as register-level lookups, and the incorporation of code generation from mathematical models. The example code is documented in detail, and should be a good introduction to these special features of the SPU instruction set architecture for both compiler writers and software developers.

Read the entire paper here.

Thank you for your comments. We agree that dependence on knowledge of Haskell and the SPU ISA is a problem, and (now that the DSL paper is written) will revise this paper to make it more accessible.

Explicitly Staged Software Pipelining

By Christopher Kumar Anand, Wolfram Kahl, and Wolfgang Thaller
October 14, 2006

This paper describes an alternative to modulo scheduling for loops, in which the first step is to divide instructions into stages by solving a series of min-cut problems constructed from the code graph of the unscheduled loop body. Our algorithm is formulated and implemented in terms of the code graphs of our own declarative assembly language for CELL SPUs. We have measured an average 20 percent reduction in Iteration Interval between our algorithm and modulo scheduling via XLC. Read the entire paper here.