## The Development of a General Purpose Processing Unit for the Upgraded Electronics of the ATLAS TileCal

MITCHELL COX

UNIVERSITY OF THE WITWATERSRAND, JOHANNESBURG

SAIP 2014





# Overview

- ATLAS TILE CALORIMETER (TILECAL)
- THE OFFLINE DATA PROBLEM
- TILECAL READ-OUT ARCHITECTURE
  - Phase II Upgrades
- ENERGY RECONSTRUCTION WITH HIGH PILE-UP
- GENERAL PURPOSE PROCESSING UNIT

### ATLAS Detector

- 40 MHz Bunch Crossings
  - 1 GHz Interaction Rate
  - Millions of Sensor Channels
  - Petabytes per Second!
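The two rates above imply the average number of proton-proton interactions that overlap in a single bunch crossing, which is the origin of the pile-up covered later in the talk. A quick check, using only the rounded figures from the slide (the variable names are illustrative):

```python
# Rounded figures taken from the slide above.
bunch_crossing_rate_hz = 40e6   # 40 MHz bunch crossings
interaction_rate_hz = 1e9       # 1 GHz interaction rate

# Average number of interactions that overlap in one bunch crossing.
interactions_per_crossing = interaction_rate_hz / bunch_crossing_rate_hz
print(f"Interactions per bunch crossing: {interactions_per_crossing:.0f}")
```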





Photos: ATLAS Experiment © 2014 CERN

## **Tile Calorimeter**



- Steel (absorber) and Scintillator
- Wavelength shifting fibres
- 10 000 Photomultiplier Tubes





### The Offline Problem

- Storing data at PB/s is not feasible.





Reference: J Dursi. Parallel I/O doesn't have to be so hard: The ADIOS Library. 2012.

### ATLAS Triggering and Data Acquisition System

- PB/s of raw data reduced to MB/s of interesting data

| Event rate | 40 MHz | 100 kHz | 200 Hz |
|------------|--------|---------|--------|
| Data rate  | PB/s   | GB/s    | MB/s   |
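To make the scale of the reduction concrete, the rates in the table can be turned into rejection factors per stage. This is only arithmetic on the quoted figures; the list and variable names are illustrative.

```python
# Event rates at each stage, as quoted in the table above.
stage_rates_hz = [40e6, 100e3, 200.0]

# Reduction factor between each pair of consecutive stages.
reduction_factors = [
    before / after
    for before, after in zip(stage_rates_hz, stage_rates_hz[1:])
]

# Overall reduction from bunch-crossing rate to stored-event rate.
overall_reduction = stage_rates_hz[0] / stage_rates_hz[-1]

for factor in reduction_factors:
    print(f"Stage keeps 1 event in {factor:.0f}")
print(f"Overall: 1 event in {overall_reduction:.0f} is kept")
```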



#### Algorithmic Intensity

My contributions are here...

Photos: ATLAS Experiment © 2014 CERN

#### TileCal Read Out Architecture

#### Current:



#### Future:



### sROD PMT Energy Reconstruction

- PMT signal is conditioned, digitised and sent to sROD
- "Compresses" PMT Data with
  - Optimal Filtering
  - Matched Filter
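Both filters reduce a handful of digitised samples to a single amplitude by taking a weighted sum. The sketch below shows the simplest special case, where the noise is assumed white and the pedestal is assumed known, so the weights are just the normalised pulse shape. The pulse shape, pedestal and sample count are made-up illustrative values, not TileCal calibration constants, and this is not the production sROD algorithm.

```python
# Hypothetical normalised pulse shape sampled at 7 points (peak = 1.0).
PULSE_SHAPE = [0.0, 0.1, 0.6, 1.0, 0.6, 0.1, 0.0]


def matched_filter_weights(pulse_shape):
    """Weights for white noise: the pulse shape divided by its own energy."""
    norm = sum(g * g for g in pulse_shape)
    return [g / norm for g in pulse_shape]


def reconstruct_amplitude(samples, weights, pedestal):
    """Weighted sum of pedestal-subtracted samples gives one amplitude."""
    return sum(w * (s - pedestal) for w, s in zip(weights, samples))


# Build a noiseless test signal with a known amplitude (assumed values).
pedestal = 50.0
true_amplitude = 200.0
samples = [pedestal + true_amplitude * g for g in PULSE_SHAPE]

weights = matched_filter_weights(PULSE_SHAPE)
amplitude = reconstruct_amplitude(samples, weights, pedestal)
print(f"Reconstructed amplitude: {amplitude:.1f} ADC counts")
```

Seven samples go in and one number comes out, which is the sense in which the data are "compressed".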



Figures Reference: B S Peralva. The TileCal Energy Reconstruction for Collision Data using the Matched Filter. 2013.

#### sROD PMT Energy Reconstruction

- Pile-up impairs energy reconstruction performance



- Could verify and tune algorithms "online"
  - Need a general purpose processing unit
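A toy illustration of why pile-up hurts: if a second pulse from a neighbouring bunch crossing overlaps the read-out window, a filter tuned to a single in-time pulse returns a biased amplitude. The pulse shape, the two-sample delay and both amplitudes below are hypothetical, the samples are taken as already pedestal-subtracted, and the filter is the simple white-noise estimator rather than the real sROD implementation.

```python
# Hypothetical normalised pulse shape sampled at 7 points.
PULSE_SHAPE = [0.0, 0.1, 0.6, 1.0, 0.6, 0.1, 0.0]
SHIFT = 2  # the pile-up pulse arrives 2 samples late (assumed)


def shifted(shape, shift):
    """Delay a pulse by `shift` samples, padding the start with zeros."""
    return [0.0] * shift + shape[: len(shape) - shift]


def estimate_amplitude(samples, shape):
    """White-noise estimator: projection of the samples onto the shape."""
    norm = sum(g * g for g in shape)
    return sum(g * s for g, s in zip(shape, samples)) / norm


in_time_amplitude = 200.0  # the signal we want to measure (assumed)
pileup_amplitude = 100.0   # the overlapping out-of-time signal (assumed)

late_shape = shifted(PULSE_SHAPE, SHIFT)
samples = [
    in_time_amplitude * g + pileup_amplitude * h
    for g, h in zip(PULSE_SHAPE, late_shape)
]

estimate = estimate_amplitude(samples, PULSE_SHAPE)
bias = estimate - in_time_amplitude
print(f"Estimate: {estimate:.1f}, bias from pile-up: {bias:.1f} ADC counts")
```

Studying and tuning how an algorithm copes with this kind of bias on live data is the sort of job a general purpose processing unit could take on.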


#### **Processing Unit Integration**

#### Current:



#### Future:



**General purpose Processing Unit links to sROD** 

### **Processing Unit Integration**

- Not in critical data path (for now)
- 40 Gb/s Data Throughput
- General Purpose CPUs









## System on Chips

- ARM or Intel Atom SoC
  - Low Power Consumption
  - Low Cost
  - High CPU Performance per Watt
- What about I/O performance?









Cortex-A9



Cortex-A15

#### System on Chip External I/O Ports

- Ethernet: 100 Mb/s to 1 Gb/s (about 10 to 100 MB/s)
- PCI-Express: N x 5 GT/s (≥ 500 MB/s)
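The MB/s figures follow from the line rates. 5 GT/s is the per-lane rate of PCI-Express Gen 2, which uses 8b/10b encoding, so each lane carries 500 MB/s of payload. The helper names below are illustrative only.

```python
def pcie_gen2_payload_mb_per_s(lanes):
    """Payload bandwidth of a PCIe Gen 2 link with the given lane count."""
    transfers_per_s = 5e9                         # 5 GT/s per lane
    payload_bits_per_s = transfers_per_s * 8 / 10  # 8b/10b encoding overhead
    return lanes * payload_bits_per_s / 8 / 1e6


def ethernet_raw_mb_per_s(line_rate_bits_per_s):
    """Raw Ethernet line rate in MB/s, before any protocol overhead."""
    return line_rate_bits_per_s / 8 / 1e6


print(f"PCIe x1:          {pcie_gen2_payload_mb_per_s(1):.0f} MB/s")
print(f"PCIe x4:          {pcie_gen2_payload_mb_per_s(4):.0f} MB/s")
print(f"Gigabit Ethernet: {ethernet_raw_mb_per_s(1e9):.0f} MB/s raw")
```

The raw Gigabit Ethernet figure is 125 MB/s; the 100 MB/s quoted on the slide is presumably a round number that allows for protocol overhead.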





## PCI-Express Benchmark Rig

- Test PCI-Express with a pair of SoCs:
  - Wandboard is a Quad-Core Cortex-A9 at 1 GHz (i.MX6 SoC)



#### PCI-Express Test Results

- PCIe x1 Link on i.MX6 SoC:
  - 500 MB/s Theoretical

|              | CPU memcpy  | DMA (Slave) | DMA (Master) |
|--------------|-------------|-------------|--------------|
| Read (MB/s)  | 94.8 ±1.1%  | 174.1 ±0.3% | 236.4 ±0.2%  |
| Write (MB/s) | 283.3 ±0.3% | 352.2 ±0.3% | 357.9 ±0.4%  |
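Each measurement can be expressed as a fraction of the 500 MB/s theoretical limit. The numbers are copied from the table; the dictionary layout is just one convenient way to hold them.

```python
THEORETICAL_MB_PER_S = 500.0  # PCIe x1 theoretical limit from the slide

# Measured throughput in MB/s, copied from the table above.
measured_mb_per_s = {
    "read": {"CPU memcpy": 94.8, "DMA (Slave)": 174.1, "DMA (Master)": 236.4},
    "write": {"CPU memcpy": 283.3, "DMA (Slave)": 352.2, "DMA (Master)": 357.9},
}

# Percentage of the theoretical limit reached by each transfer method.
efficiency_percent = {
    direction: {
        method: 100.0 * rate / THEORETICAL_MB_PER_S
        for method, rate in methods.items()
    }
    for direction, methods in measured_mb_per_s.items()
}

for direction, methods in efficiency_percent.items():
    for method, percent in methods.items():
        print(f"{direction:5s} {method:12s} {percent:5.1f} % of theoretical")
```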

- Up to 72 % of theoretical throughput (writes) with Direct Memory Access (DMA)
  - Superior to Ethernet
  - Successful Proof of Concept
- 40 Gb/s PU needs 12 Freescale i.MX6 SoCs
  - 12 x 5 W = 60 W Power Consumption
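The slide does not show how the figure of 12 SoCs was derived. One reading that reproduces it, offered here as an assumption rather than the author's stated method, treats 40 Gb/s as a line rate with 8b/10b encoding (4000 MB/s of payload) and divides by the best measured DMA write rate. Treating 40 Gb/s as pure payload gives a larger count, shown for comparison.

```python
import math

PER_SOC_MB_PER_S = 357.9  # best measured DMA write rate from the table
POWER_PER_SOC_W = 5.0     # per-SoC power used on the slide


def socs_needed(required_mb_per_s, per_soc_mb_per_s):
    """Smallest whole number of SoCs that covers the required throughput."""
    return math.ceil(required_mb_per_s / per_soc_mb_per_s)


# Assumption 1: 40 Gb/s is a line rate carrying 8b/10b-encoded data.
payload_8b10b_mb_per_s = 40e9 * 8 / 10 / 8 / 1e6
# Assumption 2: 40 Gb/s is all payload.
payload_raw_mb_per_s = 40e9 / 8 / 1e6

count_8b10b = socs_needed(payload_8b10b_mb_per_s, PER_SOC_MB_PER_S)
count_raw = socs_needed(payload_raw_mb_per_s, PER_SOC_MB_PER_S)
total_power_w = count_8b10b * POWER_PER_SOC_W

print(f"8b/10b line rate: {count_8b10b} SoCs, {total_power_w:.0f} W")
print(f"Pure payload:     {count_raw} SoCs")
```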

## Further Prototyping

- Test 8 i.MX6 SoCs via PCI-Express Switch
- Develop Linux Driver:
  - Emulate Ethernet (RDMA)
  - Emulate File
  - "Programmer Friendly"





PCIe Development Board at Wits

## Summary

- General Purpose Processing Unit
  - Helps address the pile-up problem in TileCal energy reconstruction
  - 40 Gb/s Streaming Data Throughput
    - 12 Freescale i.MX6 Quad Cortex-A9 System on Chips
  - Programmable in C++
- Cost Effective
  - ARM SoCs are mass produced
- Power Efficient
  - 60 W

# Questions or Comments?

MITCHELL.COX@STUDENTS.WITS.AC.ZA

## Acknowledgements

- The "Massive Affordable Computing Project" team:
  - Robert Reed, Thomas Wrigley, Matthew Spoor (Physics)
  - Daniel O Kwofie, Ekow Etutu (Elec. Eng.)
  - Carlos Solans Sanchez, Alberto Valero Biot (Valencia CERN)
- MSc Supervisors: Prof. Bruce Mellado, Prof. Ivan Hofsajer
- The financial assistance of the National Research Foundation (NRF) towards this research is hereby acknowledged. Opinions expressed and conclusions arrived at are those of the authors and are not necessarily to be attributed to the NRF.
- I would also like to acknowledge the School of Physics, the Faculty of Science and the Research Office at the University of the Witwatersrand, Johannesburg.



# **Backup Slides**

#### ATLAS Triggering and Data Acquisition System



## **ARM Performance**

|                 | Cortex-A7 | Cortex-A9 | Cortex-A15 |
|-----------------|-----------|-----------|------------|
| CPU Clock (MHz) | 1008      | 996       | 1000       |
| HPL (SP GFLOPS) | 1.76      | 5.12      | 10.56      |
| HPL (DP GFLOPS) | 0.70      | 2.40      | 6.04       |
| CoreMark        | 4858      | 11327     | 14994      |
| Peak Power (W)  | 2.85      | 5.03      | 7.48       |
| DP GFLOPS/Watt  | 0.25      | 0.48      | 0.81       |
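The last row of the table is the double-precision HPL result divided by the peak power. The values are copied from the table; the dictionary keys are illustrative names.

```python
# Double-precision HPL results and peak power, copied from the table above.
arm_cores = {
    "Cortex-A7": {"dp_gflops": 0.70, "peak_power_w": 2.85},
    "Cortex-A9": {"dp_gflops": 2.40, "peak_power_w": 5.03},
    "Cortex-A15": {"dp_gflops": 6.04, "peak_power_w": 7.48},
}

# Performance per watt, rounded to two decimals as in the table.
dp_gflops_per_watt = {
    name: round(core["dp_gflops"] / core["peak_power_w"], 2)
    for name, core in arm_cores.items()
}

for name, value in dp_gflops_per_watt.items():
    print(f"{name:10s} {value:.2f} DP GFLOPS/W")
```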