


# Software Design for Longevity with Performance Portability

<u>Anshu Dubey</u> Argonne National Laboratory



**HPC-BP** Webinar

December 9, 2020





See slide 2 for license details

exascaleproject.org





## License, Citation and Acknowledgements

#### **License and Citation**

- This work is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).
- The requested citation is: Anshu Dubey, Software Design for Longevity with Performance Portability, HPC-BP webinar, December 9, 2020. DOI: 10.6084/m9.figshare.13342265

#### Acknowledgements

- This work was supported by the U.S. Department of Energy Office of Science, Office of Advanced Scientific Computing Research (ASCR), and by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration.
- This work was performed in part at the Argonne National Laboratory, which is managed by UChicago Argonne, LLC for the U.S. Department
  of Energy under Contract No. DE-AC02-06CH11357.



## **HPC Computational Science Use-case**











The US Exascale Computing Project (ECP) is at the forefront of these challenges

## **The ECP Performance Portability Series**

The objective of ECP is to have participating applications, and the software technologies needed for their science, ready for the exascale platforms. For details about ECP, please visit www.exascaleproject.org

- Motivation for the series
  - Platforms differ
    - What works well on one platform may not work equally well on others
    - ECP community has experiences in a variety of approaches; there is acquired wisdom
      - This wisdom should be shared as widely as possible
  - Need was felt for in-depth discussions
    - We had been considering focused in-person workshops
    - Panel series became the best available alternative during time of social distancing
- Outcomes
  - Share lessons learned, identify gaps, discover opportunities for partnerships
  - Some basic design principles for performance portability also emerged

For more information about the panel series please view https://doi.org/10.6084/m9.figshare.13283714.v1



## **General Design Principles for HPC Scientific Software**

#### Considerations

- Multidisciplinary teams
  - Many facets of knowledge
  - To know everything is not feasible
- Two types of code components
  - Infrastructure (mesh/IO/runtime ...)
  - Science models (numerical methods)
- Codes grow
  - New ideas => new features
  - Code reuse by others

#### **Design Implications**

#### Separation of Concerns

- Shield developers from unnecessary complexities
- Work with different lifecycles
  - Long-lasting vs quick changing
  - Logically vs mathematically complex
- Extensibility built in
  - Ease of adding new capabilities
  - Customizing existing capabilities





## **General Design Principles for HPC Scientific Software**



Design first, then apply a programming model to the design, instead of taking a programming model and fitting your design to it.



## A Design Model for Separation of Concerns





## **Example: Multiphysics PDEs for Distributed Memory Parallelism**

- Virtual view of domain and functionalities
- Decomposition into components and definition of interfaces





## **Example: Multiphysics PDEs for Distributed Memory Parallelism**

- Virtual view of functionalities
- Decomposition into units and definition of interfaces





## **Example: Design for Extensibility from FLASH**

## Assumed that capabilities will be added for better models

- Assembly from components
- Decentralized maintenance of metadata
- Python tool to parse and configure
- OOP implemented through Unix directory structure and configuration tool

#### Key idea is distributed intelligence





Dubey et al. 2009: Extensible component-based architecture for FLASH, a massively parallel, multiphysics simulation code. https://doi.org/10.1016/j.parco.2009.08.001
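The idea of decentralized metadata assembled by a configuration tool can be illustrated with a minimal sketch. This is not the FLASH setup tool: it assumes a hypothetical layout in which each unit directory carries a small `Config` file with `REQUIRES <unit path>` and `VARIABLE <name>` lines, and the file name, keywords, and unit names are all illustrative. Each directory describes only itself, and the tool assembles an application by following requirements and by treating parent directories as inherited units.

```python
import os
import tempfile


def parse_config(path):
    """Read one unit's Config file into its requirements and variables."""
    requires, variables = [], []
    with open(path) as fh:
        for raw in fh:
            line = raw.split("#", 1)[0].strip()  # drop comments and blanks
            if not line:
                continue
            keyword, _, value = line.partition(" ")
            if keyword == "REQUIRES":
                requires.append(value.strip())
            elif keyword == "VARIABLE":
                variables.append(value.strip())
    return requires, variables


def assemble(source_root, top_units):
    """Collect every unit reachable from top_units and the union of variables.

    A unit also pulls in its parent directories, which is how the directory
    tree plays the role of an inheritance hierarchy in this sketch.
    """
    selected, variables, pending = [], [], list(top_units)
    while pending:
        unit = pending.pop(0)
        if unit in selected:
            continue
        selected.append(unit)
        parent = os.path.dirname(unit)
        if parent:
            pending.append(parent)
        config = os.path.join(source_root, unit, "Config")
        if os.path.exists(config):
            requires, unit_vars = parse_config(config)
            pending.extend(requires)
            variables.extend(v for v in unit_vars if v not in variables)
    return sorted(selected), variables


def demo():
    """Build a tiny throwaway source tree and assemble one application."""
    layout = {  # hypothetical units, not the real FLASH source tree
        "physics/Hydro": "REQUIRES Grid\nVARIABLE dens\nVARIABLE pres\n",
        "physics/Hydro/Split": "REQUIRES physics/Eos\n",
        "physics/Eos": "VARIABLE temp\n",
        "Grid": "# mesh management\n",
    }
    with tempfile.TemporaryDirectory() as root:
        for unit, text in layout.items():
            os.makedirs(os.path.join(root, unit))
            with open(os.path.join(root, unit, "Config"), "w") as fh:
                fh.write(text)
        return assemble(root, ["physics/Hydro/Split"])


units, variables = demo()
print(units)
print(variables)
```

No central registry lists the units in this sketch. Adding a capability means adding a directory with its own `Config`, which is the sense in which the intelligence is distributed.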



## **Dividends from Investing in Design**

|                    | astrophysics | cosmology | CFD/FSI | HEDP | solar physics | reconnection | star formation | combustion |
|--------------------|--------------|-----------|---------|------|---------------|--------------|----------------|------------|
| compressible hydro | 1998         | *         |         | *    | *             |              |                | *          |
| burn               | 1999         |           |         |      |               |              |                | *          |
| MHD                | 2002         | *         |         | *    | *             | *            | *              |            |
| elliptic solver    | *            | 2001      | *       |      |               |              | *              |            |
| particles          | *            | 2002      | *       | *    |               | *            | *              | *          |
| bittree            | *            | *         | 2012    | *    |               |              |                |            |
| HYPRE interface    |              |           | *       | 2011 |               |              |                |            |
| radiation          | *            | *         |         | 2011 |               |              |                |            |
52 person-years for infrastructure development

- Assume other communities reuse 75% of the infrastructure
- Saving of ~40 person-years per new domain (0.75 × 52 = 39)






## **Takeaways Until Now**




- Differentiate between slow changing and fast changing components of your code
- Understand the requirements of your infrastructure
- Implement separation of concerns
- Design with portability, extensibility, reproducibility and maintainability in mind
- Do not design with a specific programming model in mind



### **ANY QUESTIONS SO FAR?**




## **A New Paradigm Because of Platform Heterogeneity**




- Question: do the design principles change?
- The answer is: not really
- The details get more involved



## A Design Model for Separation of Concerns



## **Design Guidance Articulated in the Panel Series**

Design for hierarchical parallelism

Design towards several thousand threads

Design for a hierarchical memory space

Design patterns that count, allocate, and reuse memory

Avoid exposing/using non-portable vendor-specific options
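One possible reading of the "count, allocate, and reuse" guidance above is sketched here: a first pass only counts how much scratch space each kernel needs, a single allocation is made for the largest request, and every kernel then reuses that storage. The `ScratchArena` class and its method names are hypothetical, and a `bytearray` stands in for device memory.

```python
class ScratchArena:
    """Count requests first, allocate once, then hand out reusable views."""

    def __init__(self):
        self._needed = 0
        self._buffer = None
        self.allocations = 0

    def count(self, nbytes):
        """Counting pass: record a request without allocating anything."""
        if self._buffer is not None:
            raise RuntimeError("count() must be called before allocate()")
        self._needed = max(self._needed, nbytes)

    def allocate(self):
        """Allocation pass: one buffer sized for the largest request."""
        self._buffer = bytearray(self._needed)
        self.allocations += 1

    def view(self, nbytes):
        """Reuse pass: a window onto the shared buffer, no new allocation."""
        if self._buffer is None or nbytes > len(self._buffer):
            raise ValueError("request was not counted before allocate()")
        return memoryview(self._buffer)[:nbytes]


arena = ScratchArena()
kernel_sizes = [256, 1024, 512]  # scratch bytes needed by three kernels
for size in kernel_sizes:
    arena.count(size)
arena.allocate()

for step in range(3):  # every time step reuses the same storage
    for size in kernel_sizes:
        scratch = arena.view(size)
        scratch[0] = step  # stand-in for real work on the buffer

print(arena.allocations, len(arena.view(1024)))
```

Separating the counting pass from the allocation keeps allocation out of the time-stepping loop, which matters most where device allocations are expensive.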



## Features and Abstractions that must Come in





#### **Historically**

- Hand-tune the code for the target
- Some teams are still doing it

#### **Current Trend**

- Have multiple implementations
- Use third party abstraction tools

#### **Intermediate Option**

- Refactor the code exposing opportunities for use of abstractions
- Figure out the parameters for plugging in abstractions
- Design composability into infrastructure
- Make tools, or leverage community tools that let you hand tune without all the pain



## **Underlying Ideas**

#### Make the same code work on different devices

- A way to let compiler know that "this" expression can be specialized in many ways
- Definition of specializations

**Template meta-programming in abstraction layers** 

## Look at what is needed, design for commonalities, encode them

## Assigning work within the node

- "Parallel For" or directives with unified memory
- Directives or specific programming model for explicit data movement

More complex data orchestration system for asynchronous computation
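As a loose illustration of the first idea (one expression, several specializations), the sketch below registers multiple implementations of a single generic loop and selects one at call time. It is written in Python, so selection happens at run time through a dictionary rather than at compile time through templates. The names `register`, `parallel_for`, and the backend labels are hypothetical and are not the Kokkos or RAJA API.

```python
from concurrent.futures import ThreadPoolExecutor

BACKENDS = {}  # maps a backend name to its specialization of the loop


def register(name):
    """Decorator that records one specialization under a backend name."""
    def wrap(func):
        BACKENDS[name] = func
        return func
    return wrap


@register("serial")
def _serial_for(n, body):
    for i in range(n):
        body(i)


@register("threads")
def _threaded_for(n, body):
    # Iterations are independent, so they may run in any order.
    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(body, range(n)))  # list() waits and re-raises errors


def parallel_for(n, body, backend="serial"):
    """Single entry point used by science code; the backend is a parameter."""
    BACKENDS[backend](n, body)


def scale(values, factor, backend):
    """Kernel written once, unaware of how its iterations are scheduled."""
    out = [0.0] * len(values)

    def body(i):
        out[i] = factor * values[i]

    parallel_for(len(values), body, backend=backend)
    return out


data = [1.0, 2.0, 3.0, 4.0]
print(scale(data, 2.0, "serial"))
print(scale(data, 2.0, "threads"))
```

The kernel `scale` is written against the commonality (an index space and a loop body), and the device-specific details stay inside the registered specializations.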



## Features and Abstractions that must Come in



## **How Do Abstraction Layers Work**

- Infer the structure of the code
- Infer the map between algorithms and devices
- Infer the data movements
- Map computations to devices
- These are specified either through constructs or pragmas
- Performance depends upon how well the mapping is done.



## **Example from Fortran with key-dictionary**

- A computation on a 4D array
  - 1 dimension for state variables
- Copied into temporaries: uPlus, uMinus and flux

Code for CPU

```
subroutine recon(uPlus,uMinus,flux)
  real, pointer, dimension(:) :: uPlus,uMinus,flux
  if (flux(HY_MASS) > 0.) then
    flux(HY_NUM_FLUX+1:NFLUXES) = &
      uPlus(HY_NUM_VARS+1:NRECON)* &
      flux(HY_MASS)
  else
    flux(HY_NUM_FLUX+1:NFLUXES) = &
      uMinus(HY_NUM_VARS+1:NRECON)* &
      flux(HY_MASS)
  end if
end subroutine recon
```

Code for GPU

```
subroutine recon(uPlus,uMinus,flux,iLow,iHigh,jLow,jHigh,kLow,kHigh)
  real, pointer, dimension(:,:,:,:) :: uPlus,uMinus,flux
  integer :: iLow,iHigh,jLow,jHigh,kLow,kHigh
  integer :: i1,i2,i3
  do i3 = kLow, kHigh
    do i2 = jLow, jHigh
      do i1 = iLow, iHigh
        if (flux(HY_MASS,i1,i2,i3) > 0.) then
          flux(HY_NUM_FLUX+1:NFLUXES,i1,i2,i3) = &
            uPlus(HY_NUM_VARS+1:NRECON,i1,i2,i3)* &
            flux(HY_MASS,i1,i2,i3)
        else
          flux(HY_NUM_FLUX+1:NFLUXES,i1,i2,i3) = &
            uMinus(HY_NUM_VARS+1:NRECON,i1,i2,i3)* &
            flux(HY_MASS,i1,i2,i3)
        end if
      enddo
    enddo
  enddo
end subroutine recon
```

- Different dimensionalities for the temporaries
- No do loop vs explicit do loop in the kernel



#### **Step 1: temporaries and arguments**

Key definitions for CPU

- `[hy_recon_args]`: `uPlus, uMinus, flux`
- `[hy_recon_declare]`: `real, pointer, dimension(:) :: uPlus, uMinus, flux`

Key definitions for GPU

- `[hy_recon_args]`: `uPlus, uMinus, flux, iLow, iHigh, jLow, jHigh, kLow, kHigh`
- `[hy_recon_declare]`:
  - `real, pointer, dimension(:,:,:,:) :: uPlus, uMinus, flux`
  - `integer :: iLow, iHigh, jLow, jHigh, kLow, kHigh`

#### **Step 2: constructs**

Key definitions for CPU kernels (null)

- `[hy_ind3spec]`
- `[hy_inline_loop]`
- `[hy_inline_loop_end]`

Key definitions for GPU kernels

- `[hy_inline_loop]`:
  - `do i3 = kLow, kHigh`
  - `do i2 = jLow, jHigh`
  - `do i1 = iLow, iHigh`
- `[hy_inline_loop_end]`:
  - `enddo`
  - `enddo`
  - `enddo`
- `[hy_ind3spec]`: `,i1,i2,i3`



#### **Subroutine Definition**

```
subroutine recon(@hy_recon_args)
  @hy_recon_declare
  @hy_inline_loop
  if (flux(HY_MASS @hy_ind3spec) > 0.) then
    flux(HY_NUM_FLUX+1:NFLUXES @hy_ind3spec) = &
      uPlus(HY_NUM_VARS+1:NRECON @hy_ind3spec)* &
      flux(HY_MASS @hy_ind3spec)
  else
    flux(HY_NUM_FLUX+1:NFLUXES @hy_ind3spec) = &
      uMinus(HY_NUM_VARS+1:NRECON @hy_ind3spec)* &
      flux(HY_MASS @hy_ind3spec)
  end if
  @hy_inline_loop_end
end subroutine recon
```
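How a key dictionary might be applied to the subroutine definition can be sketched in a few lines of Python. This is an illustrative stand-in, not the tool used in the talk: the `expand` function and its regular expression are assumptions, while the key names and their definitions are transcribed from the slides.

```python
import re

# Key dictionaries transcribed from the slides; CPU construct keys are null.
CPU_KEYS = {
    "hy_recon_args": "uPlus, uMinus, flux",
    "hy_recon_declare": "real, pointer, dimension(:) :: uPlus, uMinus, flux",
    "hy_inline_loop": "",
    "hy_inline_loop_end": "",
    "hy_ind3spec": "",
}
# As on the slides, the GPU declare key does not list the loop indices i1,i2,i3.
GPU_KEYS = {
    "hy_recon_args": "uPlus, uMinus, flux, iLow, iHigh, jLow, jHigh, kLow, kHigh",
    "hy_recon_declare": (
        "real, pointer, dimension(:,:,:,:) :: uPlus, uMinus, flux\n"
        "integer :: iLow, iHigh, jLow, jHigh, kLow, kHigh"
    ),
    "hy_inline_loop": "do i3 = kLow, kHigh\ndo i2 = jLow, jHigh\ndo i1 = iLow, iHigh",
    "hy_inline_loop_end": "enddo\nenddo\nenddo",
    "hy_ind3spec": ",i1,i2,i3",
}

TEMPLATE = """\
subroutine recon(@hy_recon_args)
@hy_recon_declare
@hy_inline_loop
if (flux(HY_MASS @hy_ind3spec) > 0.) then
  flux(HY_NUM_FLUX+1:NFLUXES @hy_ind3spec) = &
    uPlus(HY_NUM_VARS+1:NRECON @hy_ind3spec)* &
    flux(HY_MASS @hy_ind3spec)
else
  flux(HY_NUM_FLUX+1:NFLUXES @hy_ind3spec) = &
    uMinus(HY_NUM_VARS+1:NRECON @hy_ind3spec)* &
    flux(HY_MASS @hy_ind3spec)
end if
@hy_inline_loop_end
"""

KEY = re.compile(r"@(\w+)")


def expand(template, keys):
    """Replace each @key with its definition and drop lines left empty."""
    def lookup(match):
        return keys[match.group(1)]  # KeyError flags an undefined key

    lines = KEY.sub(lookup, template).splitlines()
    return "\n".join(line.rstrip() for line in lines if line.strip())


cpu_source = expand(TEMPLATE, CPU_KEYS)
gpu_source = expand(TEMPLATE, GPU_KEYS)
print(cpu_source)
print(gpu_source)
```

The science kernel is written once in the template. Only the dictionaries differ between targets, so supporting another device means supplying another set of key definitions.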



Ideally, one would go through a similar exercise of locating good uses of abstractions to obtain good results from third-party abstraction tools.


#### **Historically**

- Hand-tune the code for the target
- Some teams are still doing it

#### **Current Trend**

- Have multiple implementations
- Use third party abstraction tools

#### **Intermediate Option**

- Refactor the code exposing opportunities for use of abstractions
- Figure out the parameters for plugging in abstractions
- Design composability into infrastructure
- Make tools, or leverage community tools that let you hand tune without all the pain

A highlight from the panel series is that users of Kokkos and RAJA derived greater benefit if they understood their code's structure and needs.

In other words, if they had thought about design.



### FINAL TAKEAWAYS

- The key to both performance portability and longevity is careful software design
- Extensibility should be built into the design
- Design should be independent of any specific programming model
- Composability and flexibility help with performance portability

#### **RESOURCES:**

https://www.exascaleproject.org/

https://doi.org/10.6084/m9.figshare.13283714.v1

https://figshare.com/articles/presentation/SC20\_Tutorial\_Better\_Scientific\_Software/12994376?file=25219346

https://bssw.io/blog\_posts/performance-portability-and-the-exascale-computing-project

https://www.exascaleproject.org/event/kokkos-class-series

