## An Overview of the RAJA Portability Suite

Approved for public release

Arturo Vargas, Rich Hornung (LLNL) RAJA/Kokkos Project WBS 2.3.1.18

HPC Best Practices Webinar Series March 10, 2021



LLNL-PRES-819903

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC





# The RAJA Portability Suite provides complementary open-source tools for portable execution and memory management

# **RAJ∀**

## RAJA: C++ kernel execution abstractions

- Enables apps to target various programming model back-ends while maintaining single-source app code
- https://github.com/LLNL/RAJA

## CHAI: C++ array abstractions

- Automates data copies, giving the look and feel of unified memory
- https://github.com/LLNL/CHAI

## Umpire: memory API

- Provides high-performance memory operations, such as pool allocations
- Native C++, C, and Fortran APIs
- https://github.com/LLNL/Umpire

## camp: low-level C++ metaprogramming facilities

- Focuses on HPC compiler compatibility
- https://github.com/LLNL/camp



These libraries are used selectively and in various combinations in production applications today.

## The RAJA Portability Suite insulates applications from many complexities of a diverse hardware ecosystem



- Perlmutter (LBL): AMD Milan CPUs + NVIDIA Ampere GPUs
- Sierra (LLNL): IBM P9 CPUs + NVIDIA Volta GPUs
- Astra (SNL)
- Aurora (ANL): Intel Xeon CPUs + Xe GPUs





#### Most ASC applications plus others at LLNL also rely on the RAJA Portability Suite to run on a wide range of platforms

**Major LLNL ASC Program Applications**

|                           | Ares          | ALE3D  | Kull   | MARBL                | Ardra  | Mercury     | Teton                       | Hydra                        |
|---------------------------|---------------|--------|--------|----------------------|--------|-------------|-----------------------------|------------------------------|
| Language                  | C++           | C++    | C++    | C++ & Fortran        | C++    | C++         | Fortran                     | C++/C                        |
| CPU / GPU Execution Model | RAJA          | RAJA   | RAJA   | RAJA + MFEM & OpenMP | RAJA   | CUDA & RAJA | OpenMP & CUDA-C (poss. RAJA) | Exploring OpenMP, CUDA, RAJA |
| Data Transfer             | UM + Explicit | CHAI   | UM     | Explicit             | CHAI   | UM          | Explicit                    | Explicit, Exploring CHAI     |
| Memory Allocation         | Umpire        | Umpire | Umpire | Umpire               | Umpire | Umpire      | Umpire                      | Explicit, Exploring Umpire   |

- Integration of these projects into other applications and libraries is ongoing

The LLNL institutional RADIUSS effort promotes and funds integration of these tools into non-ASC applications.


### **RAJA** supports a variety of loop patterns and parallel constructs

#### Simple & complex loop patterns & execution

- Non-perfectly nested loops
- Loop tiling
- Hierarchical parallelism
- Shared and thread local memory

#### Multiple execution back-ends

- Sequential
- SIMD (via vector intrinsics, in progress)
- OpenMP (CPU & device offload)
- Intel Threading Building Blocks (partial)
- CUDA
- AMD HIP
- SYCL (in development)

#### Loop transformations (without changing app code)

- Change loop iteration patterns, permute loop nest ordering
- Multi-dimensional data views with offsets and index permutations
- Fine-grained GPU thread-block mapping control
- Hierarchical parallelism, asynchronous execution
- Portable reductions, scans, atomic operations, sorts...
- GPU kernel fusing (to reduce impact of GPU launch overhead for small kernels)
- Other work in progress
  - API to encapsulate SIMD/vectorization intrinsics
  - Dynamic plugins to enable tool integration



### A simple example shows how RAJA abstracts kernel execution



In the C-style kernel, all aspects of execution are explicit in the source code; e.g., sequential execution, iteration ordering, etc.

RAJA allows you to change how a kernel runs without changing the source code.

#### **RAJA kernel execution has four core concepts**

1. Loop **execution template** (e.g., 'forall')
2. Loop **execution policy** type (EXEC\_POL)
3. Loop **iteration space** (e.g., 'RangeSegment')
4. Loop **body** (C++ lambda expression)
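To make the four concepts concrete, here is a minimal sketch in plain C++ that mimics the pattern without using RAJA. Everything in it (`seq_policy`, `reverse_policy`, `Range`, `forall`, `update`) is a hypothetical stand-in chosen for illustration, not RAJA's types or implementation. It only shows how a policy type passed as a template argument can change the way a loop runs while the loop body stays untouched.

```cpp
#include <cassert>
#include <vector>

// Hypothetical policy tags: stand-ins for execution policy types.
struct seq_policy {};
struct reverse_policy {};

// Hypothetical iteration space: the half-open index range [begin, end).
struct Range {
  int begin;
  int end;
};

// Sequential policy: visit indices in increasing order.
template <typename Body>
void forall_impl(seq_policy, Range r, Body body) {
  for (int i = r.begin; i < r.end; ++i) body(i);
}

// A different policy: visit the same indices in decreasing order.
template <typename Body>
void forall_impl(reverse_policy, Range r, Body body) {
  for (int i = r.end - 1; i >= r.begin; --i) body(i);
}

// Execution template: the policy is a template parameter, so switching
// policies never requires editing the loop body.
template <typename ExecPolicy, typename Body>
void forall(Range r, Body body) {
  assert(r.begin <= r.end);
  forall_impl(ExecPolicy{}, r, body);
}

// A simple vector update written once; EXEC_POL decides how it executes.
template <typename EXEC_POL>
std::vector<double> update(double a, const std::vector<double>& x,
                           std::vector<double> y) {
  assert(x.size() == y.size());
  const double* xp = x.data();
  double* yp = y.data();
  forall<EXEC_POL>(Range{0, static_cast<int>(x.size())},
                   [=](int i) { yp[i] += a * xp[i]; });  // loop body
  return y;
}
```

Calling `update<seq_policy>(...)` and `update<reverse_policy>(...)` gives the same result because the iterations are independent. That independence is what lets an execution policy reorder or parallelize them.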



### We'll return to RAJA in a bit after we introduce Umpire and CHAI...





#### CHAI's "managed array" abstraction transfers data automatically at run time as needed to run kernels
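The idea can be illustrated without CHAI itself. The sketch below is a simplified stand-in, not CHAI's implementation. It keeps one buffer per "memory space" (both are ordinary host memory here), remembers which space holds current data, and copies only when a kernel runs in the other space. `ManagedArraySketch`, `Space`, and `run_kernel` are hypothetical names, and the sketch assumes every kernel may modify the data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Space { Host, Device };

// Hypothetical stand-in for a managed array: one buffer per memory space
// plus a record of which space currently holds valid data.
class ManagedArraySketch {
 public:
  explicit ManagedArraySketch(std::size_t n)
      : host_(n, 0.0), device_(n, 0.0), valid_(Space::Host) {}

  // Called before a kernel runs in `where`; copies only if data is stale there.
  double* data(Space where) {
    if (where != valid_) {
      if (where == Space::Device) {
        device_ = host_;
      } else {
        host_ = device_;
      }
      ++copies_;
      valid_ = where;  // assumption: the kernel may write, so this copy is now current
    }
    return where == Space::Device ? device_.data() : host_.data();
  }

  std::size_t size() const { return host_.size(); }
  int copies() const { return copies_; }

 private:
  std::vector<double> host_;
  std::vector<double> device_;  // host memory standing in for GPU memory
  Space valid_;
  int copies_ = 0;
};

// Hypothetical kernel launch: requests the data in the execution space,
// which triggers a transfer only when one is needed.
template <typename Body>
void run_kernel(Space where, ManagedArraySketch& a, Body body) {
  double* p = a.data(where);
  assert(a.size() == 0 || p != nullptr);
  for (std::size_t i = 0; i < a.size(); ++i) body(p, i);
}
```

Two consecutive kernels in the same space cause no transfer in this sketch. A copy happens only when execution moves between spaces, which is the behavior the slide title describes.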





## CHAI's "managed pointer" simplifies the use of virtual class hierarchies across host and device memory spaces

• managed\_ptr will make a copy of your object hierarchy in device memory

```
void overlay( Shape* shape, double* mesh_data ) {
    chai::managed_ptr< Shape > mgd_shape = shape->makeManaged();
    RAJA::forall< cuda_exec > ( ... {
        mgd_shape->processData(mesh_data[i]);
    } );
    mgd_shape.free();
}
```

• This requires a method to clone objects and host-device annotations on class constructors

```
chai::managed_ptr< Shape > Sphere::makeManaged( ) { ... }
__host__ __device__ Sphere::Sphere( ... ) { ... }
```



This mechanism allows you to use C++ virtual class hierarchy code on CPUs and GPUs without a major refactor.


### Umpire provides a unified, portable memory management API

- Allocate, deallocate, copy, move, query
- Memory pools
  - Much faster allocation & deallocation than malloc(), cudaMalloc()...
  - Easily shared between application components
- Introspection for better decision-making
  - Where does data associated with this pointer live?
  - Which allocator was used for this allocation?
  - What is the size of this allocation?
  - How much memory is being used on this resource?





## Umpire interface concepts allow application developers to reason about memory use

- A **Memory Resource** is a kind of memory, with specific performance and accessibility characteristics
- An **Allocation Strategy** decouples how and where allocations are made, allowing complex allocation mechanisms
  - Memory pools, thread-safety layers, specific algorithms for memory allocation, etc.
- An **Allocator** is a lightweight interface for making an allocation and querying it
  - One interface for all resources
- An **Operation** manipulates data in memory through one interface regardless of resource
  - Copy, move, reallocate, memset, etc.
- These concepts are coordinated by a **ResourceManager** 
  - Builds allocators based on allocation strategies and available resources, dispatches operations based on pointer locations, etc.

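```
// Get the resource manager and allocators for built-in resources
auto& rm = umpire::ResourceManager::getInstance();
auto host = rm.getAllocator("HOST");
auto device = rm.getAllocator("DEVICE");

// Build a pool allocator on top of the device resource
auto device_pool =
    rm.makeAllocator<DynamicPool>("MY_POOL", device);

// Allocate through the same interface, whatever the resource
void* host_data = host.allocate(1024);
void* dev_data = device_pool.allocate(1024);

// Operations dispatch based on where each pointer lives
rm.memset(host_data, 0);
rm.copy(dev_data, host_data);

host.deallocate(host_data);
```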
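Why a pool helps can be seen in a minimal sketch, written in plain C++ and unrelated to Umpire's actual pool strategies. `BlockPoolSketch` is a hypothetical fixed-size block pool. It makes one call to the underlying allocator up front and then serves every allocation and deallocation from a free list, so the expensive call (here `std::malloc`, standing in for something like `cudaMalloc`) happens only once.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical fixed-size block pool; a sketch, not Umpire's implementation.
class BlockPoolSketch {
 public:
  BlockPoolSketch(std::size_t block_bytes, std::size_t num_blocks)
      : block_bytes_(block_bytes),
        slab_(static_cast<char*>(std::malloc(block_bytes * num_blocks))) {
    assert(block_bytes > 0 && num_blocks > 0);
    assert(slab_ != nullptr);
    ++backend_allocations_;  // the only trip to the underlying allocator
    for (std::size_t i = 0; i < num_blocks; ++i) {
      free_list_.push_back(slab_ + i * block_bytes_);
    }
  }

  ~BlockPoolSketch() { std::free(slab_); }
  BlockPoolSketch(const BlockPoolSketch&) = delete;
  BlockPoolSketch& operator=(const BlockPoolSketch&) = delete;

  // Hand out a block from the free list; returns nullptr when exhausted.
  void* allocate() {
    if (free_list_.empty()) return nullptr;
    void* p = free_list_.back();
    free_list_.pop_back();
    return p;
  }

  // Return a block to the pool for reuse; the underlying allocator is not called.
  void deallocate(void* p) { free_list_.push_back(static_cast<char*>(p)); }

  std::size_t available() const { return free_list_.size(); }
  int backend_allocations() const { return backend_allocations_; }

 private:
  std::size_t block_bytes_;
  char* slab_;
  std::vector<char*> free_list_;
  int backend_allocations_ = 0;
};
```

A production pool would also handle variable sizes, growth, alignment, and thread safety. The sketch leaves these out to keep the free-list idea visible.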



## Sharing GPU memory pools among packages in multiphysics applications enables larger problems to be run

When each package has its own temporary state, space available for others is limited

When packages share a pool for temporary state, overall available space is increased







Umpire allocators and pools are easily shared across packages in an integrated code system.


### Umpire provides a variety of memory management capabilities

#### **Intuitive concepts**

- Resources
- Allocators
- Operations

#### Supported memory types

- Host (CPU)
- GPU global, constant, (host) pinned
- Unified memory
- Mmapped file memory
- Support for NVIDIA, AMD, and Intel GPU devices available in recent releases

#### **Features useful in HPC applications**

- Various pool allocation strategies (fixed size, dynamic, monotonic, etc.)
- NUMA support
- Memory allocation advice (preferred location, mostly read, etc.)
- Thread safe allocators
- Memory introspection
- Native interfaces for C++, C, and Fortran
- Logging, backtrace, and "replay" capabilities. These are really useful for investigating application performance, finding bugs, etc.



Returning to RAJA, we'll introduce two APIs for nested/complex loop kernels





## We will use a matrix multiplication kernel to explore some RAJA features and usage

C = A \* B, where A, B, C are N x N matrices

```
for (int row = 0; row < N; ++row) {
  for (int col = 0; col < N; ++col) {
    double dot = 0.0;
    for (int k = 0; k < N; ++k) {
      dot += A[k + N*row] * B[col + N*k];
    }
    C[col + N*row] = dot;
  }
}
```





## Nesting RAJA "forall" statements is not a good approach because loops are treated as independent entities



- Parallelize row loop?
  - Each thread runs all code in column loop sequentially
- Parallelize column loop?
  - Launch new parallel computation for each row → unwanted synchronization
- Loop interchange and other transformations require source code changes → breaks RAJA encapsulation!



Full parallelization of kernel is hard with nested RAJA foralls – we don't recommend it.

## The RAJA *kernel* API is designed to compose and transform complex parallel kernels, without changing kernel source code

```
using namespace RAJA;

// Kernel execution policy (typically lives in a header file)
using KERNEL_POL = KernelPolicy<
                     statement::For<1, row_policy,
                       statement::For<0, col_policy,
                         statement::Lambda<0>
                       >
                     >
                   >;

// Kernel implementation (application source code)
kernel< KERNEL_POL >( make_tuple(col_range, row_range),
                      [=](int col, int row) {
  double dot = 0.0;
  for (int k = 0; k < N; ++k) {
    dot += A(row, k) * B(k, col);
  }
  C(row, col) = dot;
} );
```

#### Each loop level has an execution policy and iteration space

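The outer 'For' statement (with row\_policy) corresponds to the C-style loop over rows, the inner 'For' statement (with col\_policy) corresponds to the loop over columns, and the lambda holds the row-column dot product.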



Integer parameter in each 'For' statement indicates the iteration space tuple item it applies to.



## Kernel transformations are made by altering the execution policy, not the algorithm source code





This is analogous to swapping for-loops in the C-style version.
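The effect can be sketched in plain C++ without RAJA. In the hypothetical code below, `row_outer` and `col_outer` are made-up policy tags and `kernel` is a made-up template. It is not RAJA's `kernel` API, only an illustration of how the loop nest order can live in a policy type while the matrix-multiply body is written once.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical policies naming which loop is outermost.
struct row_outer {};
struct col_outer {};

template <typename Body>
void kernel_impl(row_outer, int n_rows, int n_cols, Body body) {
  for (int row = 0; row < n_rows; ++row)
    for (int col = 0; col < n_cols; ++col) body(col, row);
}

template <typename Body>
void kernel_impl(col_outer, int n_rows, int n_cols, Body body) {
  for (int col = 0; col < n_cols; ++col)
    for (int row = 0; row < n_rows; ++row) body(col, row);
}

// The loop order is chosen by Policy; the body never changes.
template <typename Policy, typename Body>
void kernel(int n_rows, int n_cols, Body body) {
  assert(n_rows >= 0 && n_cols >= 0);
  kernel_impl(Policy{}, n_rows, n_cols, body);
}

// C = A * B for N x N row-major matrices, with the body written once.
template <typename Policy>
std::vector<double> matmul(const std::vector<double>& A,
                           const std::vector<double>& B, int N) {
  assert(N >= 0);
  std::vector<double> C(N * N, 0.0);
  const double* a = A.data();
  const double* b = B.data();
  double* c = C.data();
  kernel<Policy>(N, N, [=](int col, int row) {
    double dot = 0.0;
    for (int k = 0; k < N; ++k) dot += a[k + N * row] * b[col + N * k];
    c[col + N * row] = dot;
  });
  return C;
}

// Records the (row, col) visit order so the policy's effect is observable.
template <typename Policy>
std::vector<std::pair<int, int>> visit_order(int n) {
  std::vector<std::pair<int, int>> order;
  kernel<Policy>(n, n, [&](int col, int row) { order.emplace_back(row, col); });
  return order;
}
```

Both policies produce the same matrix. Only the traversal order differs, which changes the data access pattern without any edit to the kernel body.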


#### Lambda statements invoke lambda expressions (loop bodies)

```
using EXEC_POL = KernelPolicy<
   statement::For<1, row_policy,
     statement::For<0, col_policy,
       statement::Lambda<0>
     >
   >
 >;

kernel< EXEC_POL >( make_tuple(col_range, row_range),
                    [=](int col, int row) {   // lambda 0
   double dot = 0.0;
   for (int k=0; k < N; ++k) {
      dot += A(row, k) * B(k, col);
   }
   C(row, col) = dot;
});
```

Equivalent C-style loop nest:

```
for(int row = 0; row < N; ++row) {
   for(int col = 0; col < N; ++col) {
      double dot = 0.0;
      for (int k=0; k < N; ++k) {
         dot += A(row, k) * B(k, col);
      }
      C(row, col) = dot;
   }
}
```

## The RAJA kernel API offers numerous options to explore execution alternatives and optimization strategies

- Tiling statements to partition loops into tiles
  - Helps ensure data stays in fast memory while it is used (cache or GPU shared memory)
- Portable kernel local memory (CUDA shared memory or stack memory on a CPU)
  - Improved latency for data access, usually complements tiling policies
- Loop interchange via execution policy change
  - Simplifies exploring different data access patterns for different platforms
- Loop fission/fusion
  - Breaking loops into multiple parts or merging loops
- A variety of execution policies to map loop iterates to GPU blocks & threads in different ways



# RAJA also provides a *launch API* which creates a space for writing portable kernels using RAJA loop methods

#### Launch method

Sets up a kernel execution space for host or device. **Run-time** selected by ExecPlace value

#### Launch Context

Control flow within a kernel; e.g., thread synchronization

#### Capture types

- Launch lambda captures by value [=] to make device copies of captured variables
- Loop lambdas capture by reference [&] to enable referencing within loop hierarchies

Launch is an experimental release in the RAJA::expt namespace:

    using namespace RAJA::expt;

    launch< launch_policy >(ExecPlace,
      Resources(Teams(NTeams), Threads(NThreads)),
      [=] RAJA_HOST_DEVICE (LaunchContext ctx) {

        // Kernel execution space
        loop< row_policy >(ctx, row_range, [&] (int row) {
          loop< col_policy >(ctx, col_range, [&] (int col) {

            double dot = 0.0;
            for (int k=0; k < N; ++k) {
              dot += A(row, k) * B(k, col);
            }
            C(row, col) = dot;

          });
        });

    });



## The RAJA launch API differs from kernel by encapsulating the loop hierarchy inside an execution space





## RAJA launch GPU execution uses a thread-team model that mirrors the CUDA/HIP block-thread model

[Figure: three teams, TeamIdx (0), TeamIdx (1), and TeamIdx (2), each containing threads Thr 0, Thr 1, Thr 2]

- RAJA Teams = HIP/CUDA blocks
- RAJA Threads = HIP/CUDA threads
- Loops can be mapped to CUDA/HIP blocks and threads
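The mapping can be mimicked on a CPU with a short plain-C++ sketch. Nothing below is RAJA code: `Ctx`, `launch`, `team_loop`, and `thread_loop` are hypothetical stand-ins that run every (team, thread) pair sequentially. The team loop picks the iteration matching the team index, as a loop mapped to `blockIdx.x` would. The thread loop strides over its range by the thread count, as a loop mapped to `threadIdx.x` would. The sketch assumes there are at least as many teams as team-loop iterations.

```cpp
#include <cassert>
#include <vector>

// Hypothetical launch context: identifies the executing team and thread.
struct Ctx {
  int team;
  int thread;
  int n_threads;
};

// Host stand-in for a launch: run every (team, thread) pair sequentially.
template <typename Body>
void launch(int n_teams, int n_threads, Body body) {
  assert(n_teams > 0 && n_threads > 0);
  for (int team = 0; team < n_teams; ++team)
    for (int thread = 0; thread < n_threads; ++thread)
      body(Ctx{team, thread, n_threads});
}

// Team loop: each team executes the iteration equal to its team index.
template <typename Body>
void team_loop(Ctx ctx, int n, Body body) {
  if (ctx.team < n) body(ctx.team);
}

// Thread loop: thread t executes iterations t, t + n_threads, t + 2*n_threads, ...
template <typename Body>
void thread_loop(Ctx ctx, int n, Body body) {
  for (int i = ctx.thread; i < n; i += ctx.n_threads) body(i);
}

// Rows map to teams and columns to threads; each entry records who wrote it.
std::vector<int> fill_ids(int n_rows, int n_cols, int n_threads) {
  assert(n_rows > 0 && n_cols > 0);
  std::vector<int> out(n_rows * n_cols, -1);
  int* o = out.data();
  launch(n_rows, n_threads, [=](Ctx ctx) {      // launch lambda: by value
    team_loop(ctx, n_rows, [&](int row) {        // loop lambdas: by reference
      thread_loop(ctx, n_cols, [&](int col) {
        o[col + n_cols * row] = ctx.team * 100 + ctx.thread;
      });
    });
  });
  return out;
}
```

With 3 threads and 5 columns, thread 1 of each team handles columns 1 and 4, so every entry is written exactly once even though there are fewer threads than columns.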



## Launch and loop methods are templated on both host and device policies for run-time selection of execution back-end

    using launch_policy =
      LaunchPolicy<host_launch_t, device_launch_t>;

- Host back-ends supported
  - Sequential/SIMD
  - OpenMP
- Device back-ends supported
  - CUDA
  - HIP

The ExecPlace value (host\_or\_device) selects the back-end at run time:

    launch< launch_policy >( host_or_device,
      Resources(Teams(NTeams), Threads(NThreads)),
      [=] RAJA_HOST_DEVICE (LaunchContext ctx) {

        loop<row_policy>(ctx, row_range, [&](int row) {
          loop<col_policy>(ctx, col_range, [&](int col) {
            // row-column dot product
          });
        });

    });

## RAJA provides policies for common GPU thread striding patterns, such as CUDA block-stride loops



Runtime for N = 1e4 on NVIDIA V100: 3793 milliseconds

Runtime for N = 1e4 on NVIDIA V100: 2921 milliseconds



### **Global thread ID calculations are simplified with RAJA**





## The RAJA launch API provides portable support for device shared memory or host stack memory

CUDA version:

```
int by = blockIdx.y;
int bx = blockIdx.x;

__shared__ double Cs[BLK_SZ][BLK_SZ];
Cs[threadIdx.y][threadIdx.x] = 0;

// Load data tiles into shared memory
for(int k=0; k < (BLK_SZ+N-1)/BLK_SZ; ++k) {
  // Tiled matrix-multiply with shared memory
  // Cs[r][s] +=
  __syncthreads();
}

// Write out to global memory
```

- Runtime for N = 1e4 on NVIDIA V100: 980 milliseconds

RAJA launch version:

```
loop<block_y_pol>(ctx, block_y_range, [&](int by) {
  loop<block_x_pol>(ctx, block_x_range, [&](int bx) {

    RAJA_TEAM_SHARED double Cs[BLK_SZ][BLK_SZ];

    loop<thread_y_pol>(ctx, ty_range, [&](int ty) {
      loop<thread_x_pol>(ctx, tx_range, [&](int tx) {
        Cs[ty][tx] = 0.0;
      });
    });

    // Load data tiles into shared memory
    for(int k=0; k < (BLK_SZ+N-1)/BLK_SZ; ++k) {
      // Tiled matrix-multiply with shared memory
      // Cs[r][s] +=
      ctx.teamSync();
    }

    // Write out to global memory
  });
});
```

- Runtime for N = 1e4 on NVIDIA V100: 1026 milliseconds (within 5% of the CUDA version)

More than a basic loop abstraction layer, RAJA provides other mechanisms to improve application performance





### **RAJA** asynchronous execution integrates with CHAI and Umpire





## Fusing small GPU kernels into one kernel launch helps alleviate negative impact of launch overhead

Packing/unpacking halo (ghost) data on a GPU into MPI buffers is a key application use case




### The RAJA API for fusing kernels into one launch is simple to use

#### Typical pattern launching many packing kernels

```
for ( neighbor : neighbors ) {
  double* buf = buffers[neighbor];
  for ( f : fields[neighbor] ) {
    int len = f.ghostLen();
    double* ghost_data = f.ghostData();
    forall(Range(0, len), [=](int i) {
      buf[ i ] = ghost_data[ i ];
    });
    buf += len;
  }
  send(neighbor);
}
```

#### Fusing the kernels into one GPU launch

```
RAJA::WorkPool< ... > fuser;
for ( neighbor : neighbors ) {
  double* buf = buffers[neighbor];
  for ( f : fields[neighbor] ) {
    int len = f.ghostLen();
    double* ghost_data = f.ghostData();
    fuser.enqueue(Range(0, len), [=](int i) {
      buf[ i ] = ghost_data[ i ];
    });
    buf += len;
  }
}
auto workgroup = fuser.instantiate();
workgroup.run();
for ( neighbor : neighbors ) {
  send(neighbor);
}
```

Kernel fusion is used in production apps at LLNL and yields a 5-15% overall runtime reduction in typical problems.
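The enqueue-then-run idea behind fusion can be sketched in plain C++. This is not RAJA's WorkPool implementation. `WorkPoolSketch` and `pack` are hypothetical names, and the "launch" here is just a sequential loop over one flat index space that stands in for a single GPU kernel launch. It shows how many small loops can be recorded first and then executed together.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical work pool: collects loops now, runs them together later.
class WorkPoolSketch {
 public:
  // Record a loop of `len` iterations and its body; nothing executes yet.
  void enqueue(int len, std::function<void(int)> body) {
    assert(len >= 0);
    offsets_.push_back(total_);
    total_ += len;
    bodies_.push_back(std::move(body));
  }

  // Run all recorded loops as one flat iteration space: a single "launch".
  void run() {
    ++launches_;
    std::size_t loop = 0;
    for (int g = 0; g < total_; ++g) {
      // Advance to the loop that owns flat index g (skips empty loops).
      while (loop + 1 < offsets_.size() && g >= offsets_[loop + 1]) ++loop;
      bodies_[loop](g - offsets_[loop]);
    }
  }

  int launches() const { return launches_; }

 private:
  std::vector<int> offsets_;                      // start of each loop in the flat space
  std::vector<std::function<void(int)>> bodies_;  // one body per enqueued loop
  int total_ = 0;
  int launches_ = 0;
};

// Packs several fields into one buffer using a single fused launch.
// Returns the number of launches that were needed.
int pack(const std::vector<std::vector<double>>& fields,
         std::vector<double>& buffer) {
  WorkPoolSketch fuser;
  double* buf = buffer.data();
  for (const auto& f : fields) {
    const double* ghost_data = f.data();
    const int len = static_cast<int>(f.size());
    fuser.enqueue(len, [=](int i) { buf[i] = ghost_data[i]; });
    buf += len;
  }
  fuser.run();
  return fuser.launches();
}
```

On a GPU, each flat index would be handled by its own thread, so the per-launch overhead is paid once instead of once per field.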



## The RAJA Performance Suite is a useful co-design tool to assess compiler performance and to collaborate with vendors




# RAJA "Teams" (described earlier) was co-developed with the LLNL ATDM application (MARBL) team



Hierarchical parallelism & shared memory are key performance enablers.


- LLNL ATDM application (high-order ALE hydro simulations) : uses RAJA, Umpire
  - Node-to-node speedup:
    - 15x : Sierra (2 P9 + 4 V100) vs. CTS-1 Intel Cascade Lake (48 core CPUs)
    - 30x : Sierra vs. Astra (Cavium ThunderX2 28 core CPUs)
  - Programmatically-relevant simulations scaled to 50% of Sierra (2048 nodes) and 100% of Astra (SNL) (2048 nodes)
    - Documented in ATDM Tri-lab Level 1 milestone report (Dec. 2020)
  - Relies heavily on MFEM library (CEED co-design)
    - Provides RAJA and Umpire execution and memory back-end options
  - RAJA "Teams" capability (discussed earlier) resulted from collaboration between RAJA and LLNL ATDM application team





- SW4 application (high-resolution earthquake simulations) : uses RAJA, Umpire
  - Node-to-node speedup:
    - 16x : Sierra vs. CTS-1 Intel Cascade Lake
    - 32x : Sierra vs. CTS-1 Intel Broadwell
  - Recent paper in *Bulletin of Seismological Society of America* presents highest resolution earthquake simulation studies to date enabled by SW4-RAJA application
  - Partial application running on AMD MI-60 GPUs; additional support in HIP compiler needed
  - SW4-lite proxy app running on Intel GPUs (GEN9); additional support in DPC++ compiler needed for full SW4





- GEOSX application (subsurface solid mechanics simulations) : uses RAJA, Umpire, CHAI
  - Node-to-node speedup
    - 14x : Lassen (Sierra arch) vs. CTS-1 Intel Cascade Lake
  - Initial studies show good weak scaling up to 64 nodes on Lassen (256 V100s)
  - Team is working on scaling to 1000s of GPUs on Summit





- ExaSGD application (power grid optimization): uses RAJA, Umpire
  - Adopted RAJA & Umpire ~8 months ago
  - Key kernels using RAJA are running at near peak memory BW on Summit with little system-specific tuning
  - Parts of code running on Tulip (Frontier EA system) with good performance





Our application porting perspectives are based on production experiences and constraints:

- Large integrated code bases
- Codes must always run well on a diverse set of platforms
- Codes live for decades so must be viable across multiple platform generations
- Under continual development, while continuously in production use





# Experience shows that the RAJA Portability Suite enables a diverse set of portable, high performance applications

- Insulates applications from technology disruption
  - Does not inhibit using new or platform-specific tools
- Insulates apps from variability in programming models and architectures
- Facilitates application flexibility by promoting clean encapsulation
- RAJA-app codesign has led to desirable outcomes
  - Easy to leverage features and/or optimizations developed for another application
  - Easy to grasp for all application developers
  - Easy to integrate with existing applications
  - Easy to adopt incrementally











### Porting an application isn't free – it requires a good plan

- Develop a plan that is agreeable to all developers on the team: implementation and ownership
  - Meaning of manageable portability (tolerance for disruption) depends on size and complexity of application
  - A memory management strategy is as important as a strategy to manage execution
  - Plan for iterative, incremental development → modify code, evaluate performance & cost of change, etc. → repeat...
- Assess algorithm structures and data access patterns
  - Think about commonality across algorithms and loop patterns; focus on individual kernels only when necessary
  - Keep code and data access simple in kernels; C++ STL containers are not amenable to GPUs
  - Look for opportunities for changes that will yield benefits, and which are manageable and maintainable
- Strive to maintain a familiar look and feel of the code
  - Consider a code-specific wrapper layer (using templates, macros, etc.)
    - How much of an abstraction layer, such as RAJA, do you want to expose in the application code?
    - How do you sustain SME developer productivity and enable platform-specific optimizations?
    - Add instrumentation for performance analysis
  - Convert kernels to use new parallel patterns (e.g., scans) only when needed for desirable performance

### Establishing performance expectations is critical

- Performance expectations should be based on analysis before you start porting
  - What are the performance limitations in the current code? Memory B/W? Compute bound? ...
  - For example, if the application is B/W bound, set expectations by comparing effective node B/W between architectures
- First port code, then analyze performance, then optimize as needed
- Continuously monitor performance while making code changes
  - Best done as **part of CI process** to track on a per-commit basis
  - Performance should not degrade on current platforms and should improve over time on new systems
  - Keep data resident on devices (GPUs); avoid host-device transfers as much as possible
- Focus optimization effort on performance-critical code sections
  - Expose as much fine-grained parallelism as is reasonable: how much code disruption can the team tolerate?
  - Start with 1 MPI rank per GPU; explore more complex approaches later if there is potential benefit
- Don't get frustrated when initial results are not what's expected. It's an iterative process!
  - Production ASC apps at LLNL have been working at porting to GPUs for 5+ years; much progress has been made, but much work remains.



#### Typically, each optimization step improves performance and reveals the next problem to solve

**GPUs have performance overheads not seen on CPUs** 

- Kernel launch overhead: try to hide with asynchronous kernel launches
- Data transfer between memory spaces: avoid or overlap with other work
- Memory allocation on GPUs is much more expensive than CPUs: memory pools are a must!

**Optimization requires coordinating many parts** 

- Libraries have different porting strategies/timelines: Un-ported parts lead to costly CPU/GPU data transfers
- GPU memory is a scarce shared resource: sharing memory pools can help



### The RAJA Portability Suite is on track to be ready for the next generation of platforms, including exascale

Status by machine (RAJA, CHAI, Umpire):

- Perlmutter: CUDA support is actively used in production on Sierra. We continue to investigate and improve performance.
- Frontier & El Capitan: HIP support available in RAJA v0.11.0 (1/2020), developed with AMD. Available in Umpire v1.0.0 (8/2019), developed with AMD.
- Aurora: SYCL back-end development is a collaboration ... Currently filling feature gaps and improving ... Available in Umpire v4.0.0 (9/2020), developed with ANL.

Community:

- Our open-source efforts have seen significant contributions from code teams, vendors, and other external collaborators
  - 38+ RAJA contributors; core project team has 8 people
    - Up from 20 contributors last year
  - Leveraging vendor interactions to support new hardware (IBM, NVIDIA, AMD, Intel, Cray)
  - Tutorials at ECP meetings, ATPESC, and academic conferences

[Chart: unique monthly visitors; values shown: 234, 48, 102]

The RAJA Portability Suite is core to the LLNL ASC application GPU porting strategy. It will be supported beyond ECP.

### Why use portability solutions like RAJA, Umpire, CHAI, etc.?

- It depends. What does "performance portability" mean for your project?
  - If you can afford to develop and maintain platform-specific code, you may prefer that option and may not need a portability abstraction
  - If your application is large or if programming model, hardware architecture, and optimization expertise is sparse on your team, a portability solution can provide your team with a variety of benefits
- Portability solutions enable you to write *single-source code* that runs on a diversity of platforms
  - You may still need to write some platform-specific code to better optimize some kernels
  - Fortunately, a general abstraction approach may be good enough for most of your code
- Benefits of a portability solution include the following:
  - (Most of) the cost of developing and maintaining platform-specific code is eliminated
  - It's straightforward to get running on new hardware architectures
  - It's easier to separate software development concerns on your project: optimization work can be done by experts under the abstraction layer, while application code looks familiar to SME application developers
  - You will leverage the expertise and effort of others who contribute to the portability library (features and optimizations); improving your code performance can be as simple as using a new version of a library



## User documentation, tutorials, and other code repos associated with the RAJA Portability Suite are available

- RAJA User Guide: getting started info, details about features & usage, tutorial materials (readthedocs.org/projects/raja)
- RAJA Project Template: shows how to use RAJA in an app using CMake or make (github.com/LLNL/RAJA-project-template)
- RAJA Performance Suite: collection of kernels to assess compilers & RAJA performance. Used by us, vendors, for DOE platform procurements, etc. (github.com/LLNL/RAJAPerf)
- RAJA Proxy Apps: proxy apps using RAJA, CHAI, Umpire (github.com/LLNL/RAJAProxies)

The RAJA Performance Suite and Proxy Apps are good sources of examples for RAJA usage.

- Umpire User Guide: getting started info, details about features & usage, tutorial materials (readthedocs.org/projects/umpire)
- Umpire Interactive Tutorial: interactive user tutorial using Jupyter notebooks (github.com/LLNL/umpire-interactive-tutorial)

- CARE: Collection of CHAI And RAJA Extensions that are useful to application developers to help write portable code (github.com/LLNL/CARE)



For RAJA questions and support, please email us: raja-dev@llnl.gov

#### Disclaimer

This document was prepared as an account of work sponsored by an agency of the United States government. Neither the United States government nor Lawrence Livermore National Security, LLC, nor any of their employees makes any warranty, expressed or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States government or Lawrence Livermore National Security, LLC. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States government or Lawrence Livermore National Security, LLC, and shall not be used for advertising or product endorsement purposes.



