Synthesising and Verifying Multi-Core Parallelism in Categories of Nested Code GraphsBy Christopher Kumar Anand and Wolfram Kahl January 24, 2008 Revised version to appear in Process Algebra for Parallel and Distributed Processing.
We present the Multi-Core layer of the larger Coconut project to support high-performance,
high-assurance scientific computation. Programs are represented by nested code graphs, using
domain specific languages. |
![]() |
Optimal Fourier Transform Sampling PatternsBy Christopher Kumar Anand and Anuroop Sharma August 10, 2007 A model for optimizing the sampling of the Fourier Transform of a function is presented, together with a sequential semi-definite trust region strategy, and an implementation using the open-source solver CSDP. Numerical performance correlates with increasing dimension and decreasing number of non-zero function values. Simulation of an application to multi-dimensional NMR (used to study to protein dynamics), predict a hundred-fold reduction in data acquisition time and increased signal to noise ratio when compared to the most common experimental method. Keywords: NMR; MRI; Medical Imaging; Sparse k-Space Sampling; Nonuniform Fourier transform; SDP; Semi-Definite Optimization Read AdvOL Report. Slides. |
Energy-Constrained Pulse Design for MRI and NMRBy Christopher Kumar Anand and Andrew Thomas Curtis July 26, 2007 Optimizing radio frequency (rf) pulses is of interest in both Magnetic Resonance Imaging and Nuclear Magnetic Resonance Spectroscopy. Recent research has focussed on reducing rf energy for imaging, and improving excitation profile uniformity for spectroscopy. This paper summarizes the different ap- proaches used to date, and presents an approach using continuous optimization that utilizes second derivative information of the objective function, with two examples of novel, sub-millisecond pulse designs: an energy-constrained pulse for steady state imaging, and a linear-phase, broadband 90 degree pulse for NMR. The benefits of using second derivatives of the objective function are examined, and a method is presented for efficient computation. All optimization is done using the open-source optimization package IPOPT. Computationally efficient integration of the Bloch equations and derivative calculations are performed using code symbolically generated by Maple. Keywords:radio-frequency pulse; magnetic resonance imaging; optimal control; IPOPT; SAR; specific absorption rate; slice selection; broadband pulse Read the AdvOL Report. | ![]() |
![]() |
Volumetric k-Space Trajectories Via Genetic AlgorithmsBy Andrew Thomas Curtis and Christopher Kumar Anand June 6, 2007 Revised version published in the International Journal of Biomedical Imaging A pseudo-random, velocity-insensitive, volumetric k-space sampling trajectory is designed for use with balanced steady state imaging. Individual arcs are designed independently and do not fit together in the way that multi-shot spiral, radial or echo-planar trajectories do. Previously, it was shown that second-order cone optimization problems can be defined for each arc independent of the others, that nulling of zeroth and higher moments can be encoded as constraints, and that individual arcs can be optimized in seconds. For use in steady state imaging, sampling duty cycles exceed 95 percent. Using such pseudo-random trajectories, aliasing caused by undersampling manifests itself as incoherent noise. In this paper, a genetic algorithm is formulated and numerically evaluated. A large set of arcs is designed using previous methods, and the genetic algorithm choses particular fit subsets of a given size, corresponding to a desired acquisition time. Numerical simulations of 1s acquisitions show good detail and acceptable noise for large-volume imaging with 32 coils. Read the AdvOL Report. |
Simulation of steady-state NMR of coupled systems using Liouville space and computer algebra methodsBy Christopher Kumar Anand, Alex D. Bain and Zhenghua Nie June 4, 2007, A series of repeated pulses and delays applied to a spin system generates a steady state. This is relatively easy to calculate for a single spin, but coupled systems present real challenges. We have used Maple, a computer algebra program to cal- culate one- and two-spin systems symbolically, and larger systems numerically. The one-spin calculations illustrate and validate the methods and show how the steady- state free precession method converges to continuous wave NMR. For two-spin sys- tems, we have derived a general formula for the creation of double-quantum signals as a function of irradiation strength, coupling constant, and chemical shift difference. The calculations on three-spin and larger systems reproduce and extend previously published results. In this paper, we have shown that the approach works well for systems in the literature. However, the formalism is general and can be extended to more complex spin systems and pulses sequences. Read AdvOL Report. | ![]() |
![]() |
MultiLoop: Efficient Software Pipelining for Modern HardwareBy Christopher Kumar Anand and Wolfram Kahl May 8, 2007 This paper is motivated by trends in processor models of which the Cell BE is an exemplar, and by the need to reliably apply multi-level code optimizations in safety-critical code to achieve high performance and small code size. A MultiLoop is a loop specification construct designed to expose in a structured way details of instruction scheduling needed for performance-enhancing transformations. We show by example how it may be used to make better use of underlying hardware features, including software branch prediction and SIMD instructions. In each case, the use of MultiLoop transformations allows us to take full advantage of software branch prediction to completely eliminate branch misses in our scheduled code, and reduce the cost of loop overhead by using SIMD vector instructions. Given the novelty of our representation, it is important to demonstrate feasibility (of zero branch misses) and evaluate performance (of transformations) on a wide set of representative examples from numerical computation. We include simple loops, nested loops, sequentially-composed loops, and loops containing control flow. In some cases we obtain significant benefits: halving execution time, and halving code size. As many details as possible are provided for other compiler writers wishing to adopt our innovative transformations, including instruction selection for SIMD-aware control flow. Read the full paper. |
A Domain-Specific Language for the Generation of Optimized SIMD-Parallel Assembly CodeBy Christopher Kumar Anand and Wolfram Kahl April 24, 2007, A revised version of this paper will appear in IEEE Transactions on Computers. We present a domain-specific language embedded into Haskell that allows mathematicians to formulate novel high-performance SIMD-parallel algorithms for the evaluation of special functions. Developing such functions involves explorations both of mathematical properties of the functions which lead to effective (rational) polynomial approximations, and of specific properties of the binary representation of floating point numbers. Our framework includes support for estimating the effectiveness of different approximation schemes in Maple. Once a scheme is chosen, the Maple-generated component is integrated into the code generation setup. Numerical experimentation can then be performed interactively, with support functions for running standard tests and tabulating results. Once a satisfactory formulation is achieved, a codegraph representation of the algorithm can be passed to other components which produce C function bodies, or to a state-of-the-art scheduler which produces optimal or near-optimal schedules, currently targetting the ``Cell Broadband Engine'' processor. Encapsulating a considerable amount of knowledge about specific tricks in DSL constructs allows us produce algorithm specifications that are precise, readable, and compile to optimal-quality assembly code, while formulations of the equivalent algorithms in C would be almost impossible to understand and maintain. Read SQRL Report 43. |
![]() |
Robust Solvers for Inverse Imaging Problems using Dense Single-Precision HardwareBy Christopher Kumar Anand January 25, 2007 A revised version of this paper has appeared in J. Math Image Vision We present an iterative framework for robustly solving large inverse problems arising in imaging using only single-precision (or other reduced-precision) arithmetic, which allows the use of high-density processors (e.g. Cell BE and Graphics Processing Units). Robustness here means linear-convergence even for large problems (billions of variables), with high levels of noise (signal to noise levels less than unity). This framework handles problems formulated as quadratic and general non-linear minimization problems. Sparse and dense problems can be treated, as long as there are efficient parallelizable matrix-vector products for the transformations involved. Outer iterations correspond to approximate solutions of a quadratic minimization problem, using a single Newton step. Inner iterations correspond to the estimation of the step via truncated Neumann series or minimax polynomial approximations built from operator splittings. Given the simple convergence analysis, this approach can also be used in embedded environments with fixed computation budgets, or certification requirements, like real-time medical imaging. We describe a benchmark problem from MRI, and a series of penalty functions suited to this framework. An important family of such penalties is motivated by both Bilateral Filtering and Total Variation, and we show how they can be optimized using linear programming. We also discuss penalties designed to segment images, and use different types of a priori knowledge, and show numerically that the different penalties are effective when used in combination. Read the entire paper here. See the slides: pdf, ppt, mov. |
Durga: A Heuristically-Optimized Data Collection Strategy For Volumetric Magnetic Resonance ImagingBy Christopher Kumar Anand, Andrew Thomas Curtis, and Rakshit Kumar August 8, 2006, Now in in Engineering Optimization. We present a heuristic design method for rapid volumetric magnetic resonance imaging data acquisition trajectories using a series of second order cone optimization subproblems. Other researchers have considered non-raster data collection trajectories and under-sampled data patterns. This work demonstrates that much higher rates of under-sampling are possible with an asymmetric set of trajectories, with very little loss in resolution, but the addition of noise-like artifacts. The proposed data collection trajectory, Durga, further minimizes collection time by incorporating short, un-refocussed excitation pulses, resulting in above 98 percent collection efficiency for balanced steady state free precession imaging. The optimization sub-problems are novel, in that they incorporate all requirements, including data collection (coverage), physicality (device limits) and signal generation (zeroth and higher moment properties) in a single convex problem, which allows the resulting trajectories to exhibit a higher collection efficiency than any existing trajectory design. Read the entire paper here. |
Magnetic Resonance Tissue Quantification using Optimal bSSFP Pulse-Sequence DesignBy Christopher Kumar Anand, Renata Sotirov, Tamás Terlaky, and Zhuo Zheng Online version published by Optimization and Engineering. We propose a merit function for the expected contrast to noise ratio in tissue quantifications, and formulate a nonlinear, nonconvex semidefinite optimization problem to select locally-optimal balanced steady-state free precession (bSSFP) pulse-sequence design variables. The method could be applied to other pulse sequence types, arbitrary numbers of tissues, and numbers of images. To solve the problem we use a mixture of a grid search to get good starting points, and a sequential, semidefinite, trust-region method, where the subproblems contain only linear and semidefinite constraints. We give the results of numerical experiments for the case of three tissues and three, four or six images, in which we observe a better increase in contrast to noise than would be obtained by averaging the results of repeated experiments. As an illustration, we show how the pulse sequences designed numerically could be applied to the problem of quantifying intraluminal lipid deposits in the carotid artery. Read the entire paper here. |
Optimizing Teardrop, an MRI Sampling TrajectoryBy Christopher Kumar Anand, Tingting Ren, and Tamás Terlaky September 3, 2006 Teardrop is an efficient sampling tra jectory for acquiring Magnetic Resonance (MR) Imaging Data, especially balanced steady state free precession images. In this paper we present two models for optimizing such trajectories. These are the first models to incorporate motion-insensitivity constraints into a non-raster (also called spiral) sampling trajectory. The first model is nonlinear and very specific to Teardrop. The second model uses sequential second-order cone programming, and is generalisable to other trajectories in two and three dimensions. We present a weak convergence proof for the sequential method. Read the entire paper here. Revised version in on-line edition of Optimization Methods and Software. |
CELL SPU Math Library Using Register-Level LookupBy Christopher Kumar Anand, Wei Li, Anuroop Sharma, and Sanvesh Strivastava October 14, 2006 This paper demonstrates techniques for increasing instruction-level parallelism via register-level lookups as an alternative to predication, using examples from library of elementary math functions targetting CELL SPUs. Comparing the performance of our library with functions released in the CELL SDK 1.1 which do not use register-level lookup to improve parallelism, we measure a performance advantage of 40 percent on average. Since some functions are too simple to benefit, the average masks much larger improvements for more complicated functions. We developed this library using a declarative assembly language, and SPU simulator written in Haskell. Through the examples, we show how this environment supports the rapid prototyping of compiler-optimization such as register-level lookups, and the incorporation of code generation from mathematical models. The example code is documented in detail, and should be a good introduction to these special features of the SPU instruction set architecture for both compiler writers and software developers. Read the entire paper here. |
Explicitly Staged Software PipeliningBy Christopher Kumar Anand, Wolfram Kahl, and Wolfgang Thaller October 14, 2006 This paper describes an alternative to modulo scheduling for loops, in which the first step is to divide instructions into stages by solving a series of min-cut problems constructed from the code graph of the unscheduled loop body. Our algorithm is formulated and implemented in terms of the code graphs of our own declarative assembly language for CELL SPUs. We have measured an average 20 percent reduction in Iteration Interval between our algorithm and modulo scheduling via XLC. Read the entire paper here. |
Thank you for your comments. We agree that dependence on knowledge of Haskell and the SPU ISA is a problem, and (now that the DSL paper is written) will revise this paper to make it more accessible.
Model-Based Tissue Quantification from Simulated Partial k-Space MRI DataBy Mehrdad Mozafari July 2008 Pixel values in MR images are linear combinations of contributions from multiple tissue fractions. The tissue fractions can be recovered using the Moore- Penrose pseudo-inverse if the tissue parameters are known, or can be deduced using machine learning. Acquiring sufficiently many source images may be too time consuming for some applications. In this thesis, we show how tissue fractions can be recovered from partial k-space data, collected in a fraction of the time required for a full set of experiments. The key to reaching significant sample reductions is the use of regularization. As an additional benefit, regularizing the inverse problem for tissue fractions also reduces the sensitivity to measurement noise. Numerical simulations are presented showing the effectiveness of the method, showing three tissue types. Clinically, this corresponds to liver imaging, in which normal liver, fatty liver and blood would need to be included in a model, in order to get an accurate fatty liver ratio, because all three overlap in liver pixels (via partial voluming). |
![]() |
Explicitly Staged Software PipeliningBy Wolfgang Thaller August 2006 Software Pipelining is a method of instruction scheduling where loops are scheduled more efficiently by executing operations from more than one iteration of the loop in parallel. Finding an optimal software pipelined schedule is NP-complete, but many heuristic algorithms exist. In iteration i , a software pipelined loop will execute, in parallel, "stage" 1 of iteration i , stage 2 of iteration i - 1 and so on until stage k of iteration i - k + 1. We present a new approach to software pipelining based on using a heuristic algorithm to explicitly assign each operation to its stage before the actual scheduling. This explicit assignment allows us to implement control flow mechanisms that are hard to implement with traditional methods of software pipelining, which do not give us direct control over what stages instructions are assigned to. Read the entire thesis here. |
HUSC Language and Type SystemBy Gordon J. Uszkay June 1, 2006 HUSC is a high level declarative language designed to capture the relations and properties of a complex system independently of implementation and platform. It is intended for use in Coconut (COde CONstructing User Tool), a pro ject at McMaster University to create a new development environment for safety-critical, computation- ally intensive domains such as medical imaging. The language is intended to provide an interface that is comfortable for use by scientists and engineers, while providing the benefits of strong, static type analysis found in functional programming languages. HUSC supports type inferencing using constraint handling rules, including both predefined, parameterized shapes and an arbitrary number of additional properties or constraints called attributes. Each attribute class provides its own type inferencing rules according to the HUSC attribute class definition structure. Multiple imple- mentations of an operator or function can each specify a specialized type context, including attributes, and be selected based on type inferencing. The HUSC system creates a typed code hypergraph, with terms as nodes and edges being all of the operator or functions implementations that are satisfied in the type context, which can be used as the basis for an optimizing compiler back end. We present here the HUSC language and type system, a prototype implementation and an example demonstrating how the type inferencing may be used in conjunction with the function joins to provide a good starting point for code graph optimization. It also includes a number of observations and suggestions for making the transition from prototype to useable system. The results of this work are sufficient to demonstrate that this approach is promising, while highlighting some difficulties in producing a robust, practical implementation. |
![]() |