Workshop "compilation au CCL"

Fabrice Rastello of the Inria CORSE team organized a workshop at the CCL (Centre de Conception Logiciel, Minatec), together with a PhD defense. Here is the program:

The following topics were addressed: code characterization, performance, energy, hybrid compilation, and debugging. Invited speakers: Alexandra Jimborean (Uppsala U.), Louis-Noël Pouchet (Colorado St. U.), Ayal Zaks (Intel), Kim-Anh Tran (Uppsala U.), Fabian Gruber (Inria).

Workshop CCL "Frontière Matériel / Logiciel" (Hardware/Software Frontier). Location: Bâtiment 50C, Room C203/C206, Minatec Campus, 17 Rue des Martyrs, Grenoble

Period: December 13 - 14

Tuesday 14h - 14h45: Alexandra Jimborean: Automatic Detection of Extended Data-Race-Free Regions

Abstract: Data-race-free (DRF) parallel programming is becoming a standard, as newly adopted memory models of mainstream programming languages such as C++ or Java impose data-race-freedom as a requirement. We propose compiler techniques that automatically delineate extended data-race-free regions (xDRF), namely regions of code which provide the same guarantees as synchronization-free regions (in the context of DRF codes). xDRF regions stretch across synchronization boundaries, function calls and loop back-edges while preserving data-race-free semantics, thus increasing the optimization opportunities exposed to the compiler and to the underlying architecture. Our compiler techniques precisely analyze the threads' memory accessing behavior and data sharing in shared-memory, general-purpose parallel applications and can therefore infer the limits of xDRF code regions. We evaluate the potential of our technique by employing the xDRF region classification in a state-of-the-art, dual-mode cache coherence protocol. Larger xDRF regions reduce the coherence bookkeeping and enable optimizations for performance (6.1%) and energy efficiency (12.7%) compared to a standard directory-based coherence protocol.
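To make the notion concrete, here is a minimal, hypothetical sketch (invented for illustration, not taken from the paper) of code whose whole loop body could be classified as one xDRF region: the only shared variable is always accessed under a lock, so the region can safely stretch across the synchronization boundary and the loop back-edge.

```c
#include <pthread.h>

#define N 1000000

static double local_sum[2];   /* one slot per thread: never shared    */
static long progress;         /* shared, but only accessed under lock */
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

/* Each thread writes only its own slot of local_sum, so the code before
 * and after the critical section is race-free; `progress` is only
 * touched while holding the lock. An xDRF analysis could therefore
 * classify the whole loop body as a single region stretching across the
 * synchronization boundary and the loop back-edge. */
static void *worker(void *arg) {
    long tid = (long)arg;
    for (long i = 0; i < N; i++) {
        local_sum[tid] += i * 0.5;      /* thread-private access   */
        if (i % 1024 == 0) {
            pthread_mutex_lock(&m);
            progress++;                 /* protected shared access */
            pthread_mutex_unlock(&m);
        }
    }
    return 0;
}

int main(void) {
    pthread_t t[2];
    for (long tid = 0; tid < 2; tid++)
        pthread_create(&t[tid], 0, worker, (void *)tid);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], 0);
    return 0;
}
```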

Bio: Alexandra Jimborean is Associate Senior Lecturer at Uppsala University. Her main research interests are compile-time and run-time code analysis and optimization, and software-hardware co-design for performance and energy efficiency. She holds a PhD from the University of Strasbourg (France) for her work on automatic speculative parallelization. Alexandra has received over 25 distinctions, awards and grants, most notably the Google Anita Borg Memorial Award for excellence in academia and a Starting Grant for young researchers from the Swedish Research Council.

Tuesday 15h15 - 16h: Louis-Noel Pouchet: Source Code Analysis for Kernel Characterization and Categorization

Abstract: Polyhedral program transformations can perform highly aggressive restructuring of programs with static control flow. However, the task of finding the best transformation to optimize for speed or for energy remains a daunting challenge: to date, the state of practice is to auto-tune on the target device, running many different versions of the input program to observe which one actually performs best. In this talk we present PolyFeat, a fast static analysis tool which can characterize a program region at compile time, in under one second for affine programs of possibly thousands of lines of code. It computes numerous approximate metrics from the source code, such as data cache misses, operational intensity, and OpenMP scaling potential. As we show, these metrics can then be used, for example, to prune a space of transformations or to implement compile-time CPU frequency selection to optimize energy.
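To illustrate the kind of metric involved, here is a hedged back-of-the-envelope example (textbook reasoning, not PolyFeat output) computing the operational intensity of a classic affine kernel:

```c
/* Classic affine kernel: C[i][j] += A[i][k] * B[k][j].
 * Static reasoning of the kind such a tool can automate:
 *   - floating-point work: 2*n^3 operations (one multiply and one add
 *     per innermost iteration);
 *   - compulsory memory traffic: 3*n^2*8 bytes if each matrix is
 *     brought in from memory once (perfect reuse in cache);
 *   - operational intensity: 2*n^3 / (24*n^2) = n/12 flops per byte,
 *     so the kernel becomes compute-bound as n grows.
 * With no cache reuse, traffic grows like n^3 and intensity drops to
 * O(1): exactly the gap that loop transformations target. */
void matmul(int n, double A[n][n], double B[n][n], double C[n][n]) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[i][j] += A[i][k] * B[k][j];
}
```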

Bio: Louis-Noel Pouchet is an Assistant Professor at Colorado State University. He is working on pattern-specific languages and compilers for scientific computing, and has designed numerous approaches using optimizing compilation to effectively map applications to CPUs, FPGAs and SoCs. His work spans a variety of domains, including compiler optimization, hardware synthesis, machine learning, programming languages, and distributed computing. Pouchet is the author of the PolyOpt and PoCC compilers, and of the PolyBench benchmarking suite.

Tuesday 16h30 - 17h15: Ayal Zaks: Extending LoopVectorizer towards supporting OpenMP4.5 SIMD and outer loop auto-vectorization

Abstract: Currently, LoopVectorizer in LLVM is specialized in auto-vectorizing innermost loops. The SIMD and DECLARE SIMD constructs introduced in OpenMP 4.0 and enhanced in OpenMP 4.5 are gaining popularity among performance-hungry programmers, both because they can specify a vectorization region much larger in scope than traditional inner-loop auto-vectorization would handle, and because several advanced vectorizing compilers deliver impressive performance for such constructs. Hence, there is growing interest in the LLVM developer community in improving LoopVectorizer to adequately support OpenMP functionalities such as outer loop vectorization and whole function vectorization. In this technical talk, we discuss our approach to achieving that goal through a series of incremental steps, further extending LoopVectorizer for outer loop auto-vectorization.
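For readers unfamiliar with these constructs, here is a small illustrative example (assumed for this post, not from the talk) of the two OpenMP directives mentioned:

```c
/* `declare simd` asks the compiler to also emit a vector version of the
 * function, callable lane-wise from vectorized loops. */
#pragma omp declare simd
float scale(float x, float f) { return x * f; }

void kernel(int n, int m, float a[n][m], float f) {
    /* `omp simd` on the *outer* loop requests a vectorization region
     * much larger than what classic innermost-loop auto-vectorization
     * covers: each SIMD lane processes a whole row. */
    #pragma omp simd
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            a[i][j] = scale(a[i][j], f);
}
```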

Bio: Ayal Zaks joined the Intel OpenCL compiler team in Haifa in 2011, where he works on LLVM compilers and optimizations. Prior to that Ayal spent 15 years at the IBM Haifa Research Laboratory where he managed its Compiler Technologies group and worked on compiler optimizations. Ayal is a member of the HiPEAC network of excellence. He received B.Sc., M.Sc., and Ph.D. degrees in mathematics and operations research from Tel Aviv University, and is an adjunct lecturer at the Technion.

Wednesday 10h - 10h40: Kim-Anh Tran: Compiling for energy-efficient architectures: Hiding long latencies on limited, energy-efficient cores

Abstract: Memory latency becomes a performance bottleneck if long-latency loads cannot be overlapped with useful computation. While aggressive out-of-order processors are able to hide long latencies, limited out-of-order and in-order cores fail to find enough independent instructions to hide the delay. We propose software-only and software-hardware co-designs to overcome the performance degradation caused by long-latency loads on small cores. Energy-efficient cores can, equipped with the appropriate compile-time support, significantly improve their performance for memory-bound applications. We separate loads from their uses, and overlap their latencies with instructions from different blocks and loop iterations. Our techniques overcome the restrictions that rendered conventional compile-time techniques impractical: (i) statically unknown dependencies, (ii) insufficient independent instructions, and (iii) register pressure, and achieve an average run-time improvement of 10%, with a peak of 45%, on memory-bound applications.
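The actual compiler techniques are more involved, but the core idea of separating a load from its use can be sketched with a simple prefetch-ahead transformation (an assumed illustration, not the paper's algorithm):

```c
/* Naive reduction: the indirect load data[idx[i]] is used immediately,
 * so an in-order core stalls for the full miss latency. */
void sum_naive(long n, const int *idx, const double *data, double *out) {
    double s = 0;
    for (long i = 0; i < n; i++)
        s += data[idx[i]];
    *out = s;
}

/* Separated version: issue the next iteration's long-latency access
 * early (here via the GCC/Clang prefetch builtin) so it overlaps with
 * the use of the current element. */
void sum_separated(long n, const int *idx, const double *data, double *out) {
    double s = 0;
    for (long i = 0; i < n; i++) {
        if (i + 1 < n)
            __builtin_prefetch(&data[idx[i + 1]]);  /* "access" phase  */
        s += data[idx[i]];                          /* "execute" phase */
    }
    *out = s;
}
```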

Bio: Kim received her Bachelor's degree in Computer Science in 2011 from Bielefeld University, Germany, and continued her studies at Uppsala University, Sweden, where she completed her Master's degree in 2013. In her Master's thesis she contributed to UNISON, a constraint-based compiler back-end developed at KTH and SICS in Stockholm. After a year in industry, Kim began her ongoing PhD studies, which focus on energy-efficient software-hardware co-designs using static code analysis and code transformation with LLVM.

Wednesday 10h45 - 11h30: Fabian Gruber: Extending QEMU to Build a Bottleneck Model based Performance Debugging Tool

Abstract: QEMU, short for Quick Emulator, is a CPU emulator that is able to run applications compiled for one architecture on another (such as running an ARM binary on an x86 CPU, or vice versa). QEMU is not based on an interpreter; instead it uses binary translation to allow efficient execution of foreign instructions. Performance debugging is the process of, first, finding performance problems, that is, pinpointing code regions with suboptimal resource utilization, and then diagnosing the causes of these problems. This talk presents ongoing work the CORSE team has done in collaboration with STMicroelectronics on extending QEMU to instrument executed programs in order to collect high-level performance metrics. The goal of this presentation is not only to present our work, but also to solicit feedback on our ideas from the audience. (A toy sketch of the translate-and-cache loop at the heart of binary translation follows the bio below.)

Bio: Fabian Gruber graduated with a Master's degree in Computer Science from the Vienna University of Technology, Austria, in 2014. In his Master's thesis he worked on implementing support for dynamic languages (JSR-292, invokedynamic) for the CACAO Java Virtual Machine. He then worked as an engineer in the Inria team CORSE for two years, focusing on low-level compiler IR, disassembling, program reconstruction from assembly, and binary instrumentation. He is currently a PhD student at Inria and the Université Grenoble Alpes, working on instrumentation-based profiling and performance debugging.
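Here is the promised toy model of a dynamic-binary-translation main loop, in the spirit of QEMU's translate-and-cache approach (purely illustrative: real TCG generates host machine code, whereas here a "translated block" is just a C function pointer in a direct-mapped cache):

```c
#include <stdio.h>

#define CACHE_SIZE 16
#define HALT (-1)

typedef int (*block_fn)(void);   /* "translated" block: returns next guest pc */

/* Two guest basic blocks, "translated" by hand for the toy. */
static int block0(void) { puts("guest block 0"); return 1; }
static int block1(void) { puts("guest block 1"); return HALT; }

static block_fn cache[CACHE_SIZE];   /* guest pc -> translated code */

/* Stand-in for the translator frontend: map a guest pc to host code. */
static block_fn translate(int pc) { return pc == 0 ? block0 : block1; }

int main(void) {
    int pc = 0;
    while (pc != HALT) {
        block_fn fn = cache[pc % CACHE_SIZE];
        if (!fn) {                    /* miss: translate once, cache it */
            fn = translate(pc);
            cache[pc % CACHE_SIZE] = fn;
        }
        pc = fn();                    /* re-executions skip translation */
    }
    return 0;
}
```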


Wednesday 14h - 14h45: Diogo Sampaio: Profile Guided Hybrid Compilation (PhD defense)

Abstract: Heat-dissipation limits caused a paradigm change in how the computational capacity of chips is scaled: from increasing clock frequency to growing parallelism. To exploit this parallelism, applications must be made parallel, a hard job left to software developers. To aid in this process, many optimizing compilers and frameworks have been developed, such as polyhedral compilation tools (e.g. Pluto).

In order to apply a transformation to a piece of code, a compiler must prove that the transformation preserves the original program's semantics.

When transformation selection and validation are done solely by reasoning over the source code, the process is called static.

When transforming syntactically poor code, such as code containing memory references with multiple possible indirections, static compilers are often unable to verify applicability, and many optimization opportunities are lost. To overcome this lack of information in the source code, dynamic compilers perform the transformation at run time, when all variables of the program have assigned values. However, their analyses and optimizations are rather simplistic: time spent on validation and optimization adds overhead to the application's execution, and cheap, constant-time validations can be invalidated when the observed application behavior changes. Hybrid analyses instead collect run-time information and feed it back into a static compiler to help with transformation selection and validation.
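To illustrate the kind of code that motivates hybrid analysis: a single level of indirection already defeats purely static dependence analysis, yet becomes easy to reason about once values are known. An illustrative kernel (not from the thesis):

```c
/* Iterations i and i' conflict iff idx[i] == idx[i'], which cannot be
 * decided from the source alone. At run time, once idx holds concrete
 * values, the question is decidable: if idx is a permutation, every
 * iteration is independent and the loop could run in parallel. */
void scatter(long n, double *a, const int *idx, const double *b) {
    for (long i = 0; i < n; i++)
        a[idx[i]] += b[i];
}
```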

This work advocates the use of hybrid analyses when optimizing loops, the regions where most programs spend the majority of their execution time.

It proposes a framework that statically applies a sequence of complex loop transformations in a speculative manner. Based on memory access expressions, it generates lightweight run-time tests that ensure a given transformation does not violate data dependencies. Using information collected at run time, it discards transformations that would never be used because their validity tests are too constraining. At the heart of this technique is a powerful quantifier-elimination scheme over multivariate integer polynomials, which provides more precise results than any other known tool.
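As a hedged illustration of what such a lightweight run-time test can look like (an invented example, not the framework's actual output): before running an aggressively transformed version of a copy loop, a constant-time check that the two accessed ranges are disjoint suffices to rule out dependence violations.

```c
#include <stdint.h>

/* a and b may alias, so a static compiler cannot prove the loop safe
 * to parallelize or vectorize. */
void shift_naive(long n, double *a, const double *b) {
    for (long i = 0; i < n; i++)
        a[i] = b[i] + 1.0;
}

/* Hybrid-style guard: an O(1) disjointness test selects the
 * aggressively transformed variant, with the original as fallback. */
void shift_guarded(long n, double *a, const double *b) {
    uintptr_t pa = (uintptr_t)a, pb = (uintptr_t)b;
    if (pa + n * sizeof(double) <= pb || pb + n * sizeof(double) <= pa) {
        #pragma omp parallel for     /* ranges disjoint: valid here */
        for (long i = 0; i < n; i++)
            a[i] = b[i] + 1.0;
    } else {
        shift_naive(n, a, b);        /* conservative fallback */
    }
}
```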

The soundness of the framework is demonstrated on a modified version of the PolyBench benchmark suite in which all data structures have been linearized. Performing the same transformations that a polyhedral optimizer would apply to the original programs, our framework generates tests that correctly validate the transformations, either proving correct ones or blocking invalid ones. To further illustrate the generality of our run-time test generation scheme, we demonstrate the capacity to correctly generate tests for programs with polynomial memory accesses, such as those caused by packed triangular matrix access patterns.
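For reference, packed triangular matrices are the canonical source of such polynomial (non-affine) subscripts; an illustrative accessor:

```c
/* Row-major packed storage of a lower-triangular matrix: row i starts
 * at offset i*(i+1)/2, so the subscript is quadratic in i. Such
 * polynomial accesses fall outside the affine (polyhedral) model,
 * which is why validating transformations over them calls for
 * quantifier elimination over integer polynomials. */
double get_lower(const double *packed, long i, long j) {
    return packed[i * (i + 1) / 2 + j];   /* valid for 0 <= j <= i */
}
```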