When asked at the start of a project what Advanced Processor Research Ltd.'s tasking would be, a technical research manager familiar with APR Ltd.'s style of analysis and design replied: "That's easy. Anything that doesn't fit neatly into one of my project task boxes goes to APR."
APR Ltd. has an intuitive talent for tailoring computationally demanding Linux- and Unix-based applications to achieve optimum performance on different hardware architectures. These applications typically combine heavy floating point, integer/logical, memory and I/O demands. The hardware architecture classes well understood by APR include symmetric multiprocessors (SMP), non-uniform memory access (NUMA) architectures, Linux clusters, single instruction multiple data (SIMD) and systolic machines, long vector supercomputers and high-end superscalar processors such as the Intel® Itanium® 2, UltraSPARC-IV® and AMD Opteron™.
Many people assume that "customizing an application to a specific hardware architecture" means mapping it to a specific chip set, or in common jargon, "plating it on the iron", with the implicit expectation that even minor model changes in the target hardware will invalidate the optimizations. Since at least the mid-1990s, however, major computer hardware vendors have been building architectures rather than hardware. The vendor publishes an architectural specification for a family of "logical" machines and is then free to substitute different chip sets as chip technology and cost curves evolve. Vendors themselves sometimes blur the issues when users ask uncomfortable questions about limits or weaknesses in the architectural specification.
APR knows from deep experience how far to push the envelope, and shines at delivering cost-effective high performance computing (HPC) solutions for both architectural families and machine classes. APR has successfully contracted in support of application specialist teams in seismic exploration, digital image processing, visualization, graphics, classification, neural network analysis, computational chemistry and finite element analysis, and has tailored intrinsic function evaluation (including FFTs and convolution/correlation) and scientific matrix package functions (such as LAPACK and the older LINPACK and EISPACK) for specific applications and hardware environments.
If you suspect that your scientific projects need novel, "outside the box" or fresh thinking, then APR may be the right solution generator for you. APR is accustomed to working in multidisciplinary teams with application experts, where its orthogonal thinking and "fresh eyes" add a new creative dimension to your solution matrix. APR believes in realized solutions, not just exciting concepts, and is prepared to go the whole distance as part of your team, carrying project ideas from conception through analysis, design and implementation to production-grade robustness and reliability.
Approaching scientific applications from a computer science perspective, APR frequently finds good-fit solutions that are not apparent from an application specialist's viewpoint. A trademark of APR design is finding simple methods to deal with complex problems.
In novel or unconventional situations where there are few guideposts, APR does not suffer any sense of disorientation but cheerfully deploys its full battery of research tools and orthogonal methods to home in on viable alternatives. APR relishes providing groundbreaking solutions to new and original problems.
APR takes pride in developing application libraries and tool kits for application programmers and researchers, and APR libraries are well received by end-use developers. The secret of this success is that APR designs in concert with end-use developer needs. Rejecting the "they can do it for themselves" mentality, APR builds complete packages that include all the convenience and bookkeeping utilities end-use developers need to manage package activity.
APR typically designs packages with three levels: an application level for general end use, a low level for detailed control, and an internal or private level where the actual function of the package is realized.
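As a minimal sketch of what such a three-level layout can look like, assuming a hypothetical smoothing package (the names `moving_average`, `convolve` and `_convolve_valid` are illustrative, not taken from any APR library): the private function does the arithmetic, the low-level function exposes detailed control over edge handling, and the application-level function covers the common case with sensible defaults.

```python
"""Hypothetical three-level package layout (illustrative only)."""


# --- Internal/private level: where the actual computation happens. ---
def _convolve_valid(signal, kernel):
    """Direct convolution, keeping only outputs where the kernel fully overlaps."""
    n, k = len(signal), len(kernel)
    return [
        sum(signal[i + j] * kernel[k - 1 - j] for j in range(k))
        for i in range(n - k + 1)
    ]


# --- Low level: detailed control for experienced users. ---
def convolve(signal, kernel, edge="valid"):
    """Convolve with explicit edge handling: 'valid' or 'zero' (zero padding)."""
    if not kernel:
        raise ValueError("kernel must not be empty")
    if edge == "valid":
        return _convolve_valid(list(signal), list(kernel))
    if edge == "zero":
        pad = [0] * (len(kernel) - 1)
        return _convolve_valid(pad + list(signal) + pad, list(kernel))
    raise ValueError(f"unknown edge mode: {edge!r}")


# --- Application level: the common case with sensible defaults. ---
def moving_average(signal, width=3):
    """Smooth a signal with a box filter of the given width."""
    if width < 1:
        raise ValueError("width must be at least 1")
    return convolve(signal, [1.0 / width] * width, edge="valid")


print(moving_average([1.0, 2.0, 3.0, 4.0], width=2))  # [1.5, 2.5, 3.5]
print(convolve([1, 2, 3], [1, 1], edge="zero"))      # [1, 3, 5, 3]
```

End users call only the application level; the low level stays available when they need it, and the private level can be rewritten for new hardware without disturbing either.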
What is the value to your company of a 1.33X speedup on a million-dollar computer system over the typical HPC system lifetime of three years? An extra million dollars of availability. How will you harvest it? Healthier profit margins? A research advance? Undercutting the competition? HPC speedups engineered by an experienced HPC consultant generally carry forward to the next generation of hardware in an architectural line.
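A quick check of the arithmetic, as a minimal sketch: the figure works out if "a million dollar computer system" is read as roughly a million dollars per year of total cost of ownership, which is an assumption rather than something stated above. A speedup of S lets one machine do the work of S machines, so the extra capacity is worth about (S - 1) times the annual cost times the number of years. The function name `extra_capacity_value` is illustrative.

```python
def extra_capacity_value(speedup, annual_cost, years):
    """Dollar value of the extra work a sped-up system delivers.

    Assumes value scales linearly with throughput: a speedup S makes one
    system do the work of S systems, so the gain is (S - 1) systems' worth.
    """
    return (speedup - 1.0) * annual_cost * years


# Assumption: about $1M per year to own and operate, over a 3-year life.
value = extra_capacity_value(1.33, 1_000_000, 3)
print(f"${value:,.0f}")  # $990,000
```

If the million dollars is instead the one-off purchase price spread over the three years, the same formula gives about $330,000, so it is worth being clear which cost base is meant.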
Application speedups in the two- to five-times range are frequently realized when an HPC consultant has strong in-house support. A modern reality is that most significant achievements are multidisciplinary. As an experienced HPC consultant, Advanced Processor Research Ltd. understands that the contribution of HPC expertise to the application mix is more like adding a trace nutrient than a pivotal event. To achieve its best results, APR Ltd. draws heavily on the experience, skills and insights of the resident experts, and consequently appreciates the luxury and advantages of working with multidisciplinary in-house application teams. To paraphrase Shakespeare, "The application's the thing!"
A secondary benefit is that the numerical accuracy of calculations often improves, sometimes with startling gains in clarity. A skilled HPC consultant has a better than average understanding of the fine-structure properties of floating point arithmetic and their implications, and will instinctively tune computational sequences to increase accuracy. For critical or marginally stable computations, a capable HPC consultant will employ a full battery of analysis tools, including experience as well as interval and relative arithmetic analysis. On reflection, the frequency with which this bonus arises should not be too surprising. High performance computational speedups are often realized by reducing the number of floating point operations, and fewer operations generally mean less accumulated rounding error: a favourable start.
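A small illustration of the "fewer operations, less rounding error" point, using standard Python floats (IEEE 754 double precision) and `math.fsum` from the standard library. The toy data are chosen only to make the effect visible; they do not come from any APR project.

```python
import math

# Ten copies of 0.1: the exact sum is 1, but 0.1 is not exactly representable
# in binary floating point, and every addition can add its own rounding error.
values = [0.1] * 10

repeated_adds = 0.0
for v in values:          # 10 additions, 10 chances to round
    repeated_adds += v

one_multiply = 0.1 * 10   # a single operation, a single rounding
correctly_rounded = math.fsum(values)  # exact sum of the stored values, rounded once

print(repeated_adds)       # 0.9999999999999999
print(one_multiply)        # 1.0
print(correctly_rounded)   # 1.0
```

A single multiply is not exact in general either; the point is that each operation is a fresh opportunity for rounding, so restructuring a computation to use fewer of them tends to help accuracy as well as speed.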
On one contract, after some careful and nontrivial study, APR Ltd. asked permission to replace two approximation tables in a critical computation with a direct calculation. The R&D director expressed his wry sense of humour by replying, "So you're telling me that this will make the calculation faster and more accurate? I'll have to get back to you. I'm not sure we can handle that much good news in one day." The result was a blazing speedup with a startling improvement in focus.
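The tables in that contract are not described, so the following is a purely hypothetical illustration of the accuracy side only: a 64-entry sine table with linear interpolation, compared against direct evaluation with `math.sin`. For linear interpolation with spacing h, the error is bounded by roughly h²/8 times the maximum of |f''|, which here comes to about 8e-5. Whether direct evaluation is also faster depends on the hardware and the function; interpreted Python says nothing about speed.

```python
import math

# Hypothetical approximation table: sin(x) sampled at N points on [0, pi/2].
N = 64
STEP = (math.pi / 2) / (N - 1)
TABLE = [math.sin(i * STEP) for i in range(N)]


def sin_table(x):
    """Linear interpolation in the precomputed table (valid for 0 <= x <= pi/2)."""
    pos = x / STEP
    i = min(int(pos), N - 2)  # index of the left table entry
    frac = pos - i
    return TABLE[i] + frac * (TABLE[i + 1] - TABLE[i])


# Compare table lookup with direct calculation on a fine grid.
xs = [k * (math.pi / 2) / 1000 for k in range(1001)]
max_err = max(abs(sin_table(x) - math.sin(x)) for x in xs)
print(f"max table error: {max_err:.2e}")
```

A larger table shrinks the error, but costs memory and cache footprint, which is one reason a well-chosen direct calculation can win on both counts.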
Your organization has intelligent and highly skilled application specialists. What added value does an HPC consultant bring to the table? A partial answer is that application specialists calculate and an HPC consultant computes. In simple terms, calculation is theoretical mathematics using real or complex analysis, whereas computation is about how a computer actually arrives at a result. What makes for a tight factorization or expression in theoretical mathematics does not always translate directly into an efficient computation scheme.
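Horner's rule is a textbook example of this gap between calculation and computation (the example is a standard one, not taken from an APR project). The polynomial c0 + c1·x + c2·x² + ... is the natural way to write it mathematically, but evaluating it term by term costs about n(n+1)/2 multiplications for degree n, whereas the algebraically identical nested form c0 + x·(c1 + x·(c2 + ...)) needs only about n.

```python
def poly_naive(coeffs, x):
    """Evaluate c0 + c1*x + c2*x**2 + ... term by term, as the formula is written.

    Each term recomputes its own power of x: about n*(n+1)/2 multiplications.
    """
    total = 0.0
    for i, c in enumerate(coeffs):
        term = c
        for _ in range(i):
            term *= x
        total += term
    return total


def poly_horner(coeffs, x):
    """Same polynomial rewritten as c0 + x*(c1 + x*(c2 + ...)).

    One multiply and one add per coefficient.
    """
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result


coeffs = [1.0, 2.0, 3.0]          # 1 + 2x + 3x^2
print(poly_naive(coeffs, 2.0))    # 17.0
print(poly_horner(coeffs, 2.0))   # 17.0
```

Fewer operations here also means fewer roundings, which ties back to the accuracy benefit discussed earlier.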
Fresh eyes. Application specialists tend to view a problem "top down", through the application theory and mathematics. An HPC consultant tends (though not always) to view a problem "bottom up", through the computation. The more skilled the specialist, the more likely this generalization is to hold. It should come as no surprise that by approaching an application from a very different viewpoint, an HPC consultant may gain fresh insights.
A family doctor is pleased to refer a patient to a specialist, as is a cardiologist to refer a patient to an endocrinologist; yet managers and application experts are sometimes uncomfortable with the idea of having an HPC consultant review or evaluate their work.
For benchmarking and hardware evaluation, Advanced Processor Research Ltd. contracts to both HPC vendors and end-use customers. Some end-use customers prefer that an independent HPC contractor, rather than a vendor analyst, do the evaluation. For similar reasons, some vendors like to have an outside HPC contractor do benchmark evaluations, to show end-use customers that good results can be achieved without the direct involvement of factory magicians.
With access to little more than architecture manuals (and sometimes factory benchmark analysts), Advanced Processor Research Ltd. employs its broad experience and intuitive understanding of HPC issues and hardware to assimilate the equivalent of six months of hands-on experience in a short time frame. This is especially valuable with recently released or beta hardware, where there is little or no industry experience to draw on.
In support of Advanced Processor Research Ltd.'s claim of an intuitive understanding of HPC issues and hardware, below are brief accounts of APR's first encounters with a variety of architecture classes.
Contracted to support end-use clients at a startup HPC center, where none of the staff had previous long vector supercomputer experience. Some years earlier, APR had studied IBM vector processor unit architecture documents out of interest and, as a learning exercise, done paper-and-pencil implementations of various key scientific HPC kernels. In this startup environment, Advanced Processor Research Ltd. was the only party to "hit the deck running", to the benefit of the new center's clients.
Later, as a site user, APR noticed that the vendor-supplied system HPC library codes depended too heavily on concurrency to parallelize computations, and undertook to hand-code the BLAS (Basic Linear Algebra Subprograms) suite in assembly language with full software pipelining. Applying the same fully software-pipelined assembly approach to the vendor FFT library codes sped up FFTs shorter than 1K points by 2X or more, and longer FFTs by approximately 1.67X.
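APR's actual work was hand-written assembly for a specific vector machine, which is not reproduced here. As a hedged illustration of the software pipelining idea only, here is the classic level-1 BLAS routine DAXPY (y = a*x + y) restructured into a prologue, a steady-state kernel and an epilogue, so that the loads for element i+1 are issued alongside the arithmetic for element i. Python gains no speed from this; on real hardware the payoff comes from the processor overlapping the stages and hiding load latency.

```python
def daxpy_plain(a, x, y):
    """Reference DAXPY (BLAS level 1): y <- a*x + y, one element at a time."""
    for i in range(len(x)):
        y[i] = a * x[i] + y[i]
    return y


def daxpy_pipelined(a, x, y):
    """The same loop in software-pipelined form (shows the schedule only).

    Stage 1 loads operands; stage 2 does the multiply-add and store.
    """
    n = len(x)
    if n == 0:
        return y
    # Prologue: start the pipeline by loading element 0.
    xi, yi = x[0], y[0]
    # Kernel: load element i+1 while computing element i.
    for i in range(n - 1):
        x_next, y_next = x[i + 1], y[i + 1]  # stage 1 for element i+1
        y[i] = a * xi + yi                   # stage 2 for element i
        xi, yi = x_next, y_next
    # Epilogue: drain the pipeline by finishing the last element.
    y[n - 1] = a * xi + yi
    return y


print(daxpy_pipelined(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))  # [12.0, 24.0, 36.0]
```

Real software-pipelined kernels usually overlap more than two stages and several iterations at once, tuned to the functional unit latencies of the target machine.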
Contracting to a startup hardware vendor, APR Ltd. implemented a seismic Dip Move-Out (DMO) benchmark for an end-use customer that exceeded 90% utilization of theoretical peak floating point capacity.
Contracting to an end-use customer, APR benchmarked Kirchhoff time migration (for production use) on new SMP machines from two different vendors, achieving greater than 99.7% scalability on a wide (24-CPU) SMP. A peer review of the code kernel by a benchmark specialist from the second vendor concluded: "Under the constraints set by client, we can not improve on (Advanced Processor Research Ltd.'s) result". Both the reviewer and APR had recommended some methodological changes, which the client was unwilling to test.
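The two figures of merit in these benchmark results can be made concrete with a short sketch. Reading "99.7% scalability" as 99.7% parallel efficiency (speedup divided by CPU count) is an assumption; under it, 24 CPUs deliver roughly a 23.93x speedup, and Amdahl's law, S = 1 / (f + (1 - f) / p), then implies a serial fraction f of only about 0.013%. The function names are illustrative.

```python
def utilization(achieved_gflops, peak_gflops):
    """Fraction of theoretical peak floating point rate actually sustained."""
    return achieved_gflops / peak_gflops


def parallel_efficiency(t_serial, t_parallel, cpus):
    """Speedup on `cpus` processors divided by `cpus` (1.0 = perfect scaling)."""
    return (t_serial / t_parallel) / cpus


def amdahl_serial_fraction(speedup, cpus):
    """Serial fraction f implied by Amdahl's law, S = 1 / (f + (1 - f) / p)."""
    return (cpus / speedup - 1.0) / (cpus - 1.0)


# Assumption: "99.7% scalability" read as 99.7% parallel efficiency on 24 CPUs.
cpus = 24
speedup = 0.997 * cpus
f = amdahl_serial_fraction(speedup, cpus)
print(f"speedup {speedup:.2f}x, implied serial fraction {f:.5%}")
```

The implied serial fraction shows how little non-parallel work a kernel can afford at that scale: every part of the computation, including I/O and synchronization, has to be spread across the CPUs.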
Concurrently with this benchmarking project, APR developed an intuitive SMP driver for the client's application programmers. The SMP driver paradigm provides a model that shows application programmers how to efficiently partition and schedule application subtasks.
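The SMP driver itself is not described in detail, so the following is only a hypothetical sketch of the general partition, schedule and gather pattern, using `concurrent.futures.ThreadPoolExecutor` from the Python standard library. The names `partition` and `run_subtasks` are illustrative. Splitting the work into more blocks than workers lets the pool balance load dynamically: a worker that finishes a block early simply picks up the next one.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(n_items, n_parts):
    """Split range(n_items) into n_parts contiguous, nearly equal blocks."""
    base, extra = divmod(n_items, n_parts)
    bounds, start = [], 0
    for p in range(n_parts):
        stop = start + base + (1 if p < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def run_subtasks(work, data, n_workers=4, n_parts=16):
    """Hypothetical SMP driver: partition the data, schedule blocks, gather results."""
    blocks = partition(len(data), n_parts)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(work, data[lo:hi]) for lo, hi in blocks]
        return [fut.result() for fut in futures]  # results in partition order


# Usage: sum of squares, computed per block and then combined.
data = list(range(1000))
partials = run_subtasks(lambda block: sum(v * v for v in block), data)
print(sum(partials) == sum(v * v for v in data))  # True
```

In CPython, threads show the scheduling structure but do not speed up pure-Python arithmetic because of the global interpreter lock; real SMP gains come when each subtask runs native code or runs in a separate process.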