New benchmark test results update this Configuration and Deployment Guide, which explores designing and building a Memcached infrastructure that is scalable, reliable, manageable, and secure. Benchmark tests included the latest Intel Atom and Intel Xeon processor-based microservers and dual-socket servers to explore differing business scenarios and tradeoffs for different Service Level Agreement (SLA) requirements and Total Cost of Ownership (TCO) objectives.
Dec. 2013 Performance Update: Memcached Configuration & Deployment
Accelerating Performance for Server-Side Java* Applications
This paper describes the key architectural advancements of the latest Intel Xeon processors and Intel Atom
processor C2000s that are beneficial to Java applications. It also describes some of the techniques and
strategies used to optimize JVM software and the benefits those optimizations bring to Java applications.
Optimizing Infrastructure for Workloads in OpenStack-Based Public Cloud Services
This paper examines how business needs translate to infrastructure considerations for infrastructure-as-a-service (IaaS) when building out or enhancing an OpenStack* cloud environment. The paper looks at these requirements and the foundational platform technologies that can support a wide range of service level agreement (SLA) requirements.
Intel® Fortran Compiler Vectorization Diagnostics
A similar catalog of vectorization diagnostics is available for the Intel® C++ Compiler.
The following table lists diagnostic messages that appear in the vectorization report produced by the Intel® Fortran Compiler. To obtain a vectorization report, use the -vec-report[n] option with Intel® Fortran for Linux* and Intel® Fortran for Mac OS* X, or the /Qvec-report[:n] option with Intel® Visual Fortran for Windows*.
Diagnostic Number | Diagnostic Description |
---|---|
Diagnostic 15043 | loop was not vectorized: nonstandard loop is not a vectorization candidate |
Diagnostic 15038 | loop was not vectorized: conditional assignment to a scalar |
Diagnostic 15015 | loop was not vectorized: unsupported data type |
Diagnostic 15011 | loop was not vectorized: statement cannot be vectorized |
Diagnostic 15046 | loop was not vectorized: existence of vector dependence |
Diagnostic 15018 | loop was not vectorized: not inner loop |
Diagnostic 15002 | LOOP WAS VECTORIZED |
Diagnostic 15003 | PARTIAL LOOP WAS VECTORIZED; OpenMP SIMD LOOP WAS VECTORIZED; REMAINDER LOOP WAS VECTORIZED; FUSED LOOP WAS VECTORIZED; REVERSED LOOP WAS VECTORIZED |
Diagnostic 15037 | loop was not vectorized: vectorization possible but seems inefficient; remainder loop was not vectorized: vectorization possible but seems inefficient |
Diagnostic 15144 | vectorization support: unroll factor set to xxxx |
Diagnostic 15133 | vectorization support: reference xxxx has aligned access |
Diagnostic 15143 | loop was not vectorized: loop was transformed to memset or memcpy |
Diagnostic 15023 | loop was not vectorized: unsupported loop structure |
Diagnostic 15134 | vectorization support: reference xxxx has unaligned access |
Diagnostic 15126 | vectorization support: unaligned access used inside loop body |
Diagnostic 15048 | vector dependence: assumed ANTI dependence between xxxx line and xxxx line |
Diagnostic 15042 | pragma supersedes previous setting |
Diagnostic 15033 | loop was not vectorized: modifying order of operation not allowed under given switches |
Diagnostic 15032 | loop was not vectorized: unsupported reduction |
Diagnostic 15040 | loop was not vectorized: no instance of vectorized math function satisfies specified math function attributes |
Diagnostic 15028 | loop was not vectorized: must preserve precise exceptions under given switches |
Diagnostic 15021 | loop was not vectorized: #pragma novector used |
Diagnostic 15123 | loop was not vectorized: the freestanding compiler flag prevents vector library calls |
Diagnostic 13398 | Default vector length for function 'xxxx' is 1 and therefore invalid. A valid vector length needs to be explicitly specified. |
Diagnostic 13399 | The 'xxxx' processor is not supported for vector functions. |
Diagnostic 13395 | Command line flag overrides the target processor to 'xxxx' for the vector function xxxx where processor clause is 'xxxx'. |
Diagnostic 15092 | FUNCTION WAS VECTORIZED. |
Diagnostic 15140 | The function 'xxxx' declared as vector function in one module does not have prototype in another module. |
Diagnostic 15142 | No suitable vector variant of function 'xxxx' found. |
Diagnostic 15127 | vectorization support: call to function xxxx cannot be vectorized |
Diagnostic 13378 | loop was not vectorized with "simd assert" |
Diagnostic 13379 | loop was not vectorized with "simd" |
Diagnostic 18015 | Switch statement inside #pragma simd loop has more than <var> case labels |
Diagnostic 18016 | Switch statement in vector function has more than <var> case labels |
Diagnostic 15125 | Invalid vectorlength clause specified |
Diagnostic 15017 | loop was not vectorized: low trip count |
Diagnostic 15167 | loop was not vectorized: memory reference is not naturally aligned |
Diagnostic 13400 | The 'xxxx' processor specified in command line is not supported as default for vector functions |
Diagnostic 15163 | vectorization support: number of FP up converts: single precision to double precision <val> |
Diagnostic 15164 | vectorization support: number of FP down converts: double precision to single precision <val> |
Diagnostic 15156 | vectorization support: conversion from float to int will be emulated |
Diagnostic 15155 | conversion from int to float will be emulated |
Diagnostic 15122 | function can't be vectorized: too many registers required to return value (big vector length) |
Diagnostic 15135 | vectorization support: vectorization of this loop under -Os has impact on code size |
Optimizing Hadoop Deployments
This paper provides guidance, based on extensive lab testing conducted at Intel, to help IT organizations plan an optimized infrastructure for deploying Apache Hadoop*. It includes:
- Best practices for establishing server hardware specifications
- Software-level guidance regarding the operating system (OS), Java Virtual Machine (JVM), and Hadoop version
- Configuration and tuning recommendations to provide optimized performance with reduced effort
Boosting Kingsoft Cloud* Image Processing with Intel® Xeon® Processors
Background
Kingsoft Cloud* is a public cloud service provider whose many services include cloud storage. A massive number of images are stored in Kingsoft Cloud storage. Kingsoft provides not only data storage but also image processing services to its public cloud customers. Customers can use these image processing services to perform functions such as image scaling, cutting, quality changing, and watermarking according to their service requirements, which helps them provide the best experience to end users.
In the next section we will see how Kingsoft optimizes the imaging processing task to run on systems equipped with Intel® Xeon® processors.
Kingsoft Image Processing and Intel® Xeon® Processors
Intel® Advanced Vector Extensions 2 (Intel® AVX2) accelerates compression and decompression when processing a JPEG file. Those tasks are usually done using libjpeg-turbo, a widely used JPEG software codec. Unfortunately, the stock libjpeg-turbo library is implemented using Intel® Streaming SIMD Extensions 2 (Intel® SSE2), not Intel AVX2.
To take advantage of Intel AVX2, Kingsoft engineers modified the libjpeg-turbo library to include Intel AVX2 support. The new library accelerates color space conversion, down/up sampling, integer sample conversion, fast integer forward discrete cosine transform (DCT), slow integer forward DCT, integer quantization, and integer inverse DCT.
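The gain from the wider registers can be illustrated with a generic sketch (this is not Kingsoft's code, and the operation is a simplified stand-in for steps such as integer sample conversion): the same per-sample work is done 8 samples at a time with 128-bit Intel SSE2 intrinsics and 16 samples at a time with 256-bit Intel AVX2 intrinsics.
// Generic illustration (not Kingsoft's code): add a constant offset to 16-bit
// image samples, 8 at a time with Intel SSE2 vs. 16 at a time with Intel AVX2.
// Tails shorter than one vector are left to scalar cleanup (omitted here).
#include <immintrin.h>
#include <cstdint>
#include <cstddef>

void add_offset_sse2(std::int16_t* samples, std::size_t n, std::int16_t offset) {
    const __m128i voff = _mm_set1_epi16(offset);
    for (std::size_t i = 0; i + 8 <= n; i += 8) {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(samples + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(samples + i), _mm_add_epi16(v, voff));
    }
}

void add_offset_avx2(std::int16_t* samples, std::size_t n, std::int16_t offset) {
    const __m256i voff = _mm256_set1_epi16(offset);
    for (std::size_t i = 0; i + 16 <= n; i += 16) {
        __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(samples + i));
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(samples + i), _mm256_add_epi16(v, voff));
    }
}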
Besides taking advantage of Intel AVX2 to reduce processing time, Kingsoft image processing tasks also gain performance when running on systems equipped with Intel® Xeon® processors E5 v4 rather than Intel® Xeon® processors E5 v3, thanks to the newer processors' additional cores and larger cache. Image processing tasks such as image cutting, scaling, and quality changing are cache-sensitive workloads, so a larger CPU cache makes them run faster, while more cores allow more images to be processed in parallel. Together, these factors increase the overall performance.
Kingsoft also makes use of the Intel® Math Kernel Library (Intel® MKL), whose functions are optimized with Intel AVX2.
The next section shows how we tested the Kingsoft image processing workload to compare the performance between the current generation of Intel Xeon processors E5 v4 and those of the previous generation of Intel Xeon processors E5 v3.
Performance Test Procedure
We performed tests on two platforms. One system was equipped with the Intel® Xeon® processor E5-2699 v3 and the other with the Intel® Xeon® processor E5-2699 v4. We wanted to see how much performance improved when comparing the previous and the current generation of Intel Xeon processors and how Intel AVX2 plays a role in reducing the image processing time.
Test Configuration
System equipped with the dual-socket Intel Xeon processor E5-2699 v4
- System: Preproduction
- Processors: Intel Xeon processor E5-2699 v4 @2.2 GHz
- Cache: 55 MB
- Cores: 22
- Memory: 128 GB DDR4-2133 MT/s
System equipped with the dual-socket Intel Xeon processor E5-2699 v3
- System: Preproduction
- Processors: Intel Xeon processor E5-2699 v3 @2.3 GHz
- Cache: 45 MB
- Cores: 18
- Memory: 128 GB DDR4-2133 MT/s
Operating System: Red Hat Enterprise Linux* 7.2- kernel 3.10.0-327
Software:
- GNU* C Compiler Collection 4.8.2
- GraphicsMagick* 1.3.22
- libjpeg-turbo 1.4.2
- Intel MKL 11.3
Application: Kingsoft image cloud workload
Test Results
The following test results show the performance improvement when running the application on systems equipped with the current and previous generations of Intel Xeon processors, and when running the application with libjpeg-turbo libraries built with and without Intel AVX2 support.
Figure 1: Comparison between the application using the Intel® Xeon® processor E5-2699 v3 and the Intel® Xeon® processor E5-2699 v4.
Figure 1 compares the application running on the Intel Xeon processor E5-2699 v3 and on the Intel Xeon processor E5-2699 v4. The performance improvement is achieved because Intel Xeon processors E5 v4 have more cores, a larger cache, and Intel AVX2.
Figure 2: Performance comparison between libjpeg-turbo without Intel® Advanced Vector Extensions 2 (Intel® AVX2) support and libjpeg-turbo with Intel AVX2 support.
Figure 2 shows that application performance improves by up to 45 percent when using the libjpeg-turbo library with Intel AVX2 implemented instead of the version with Intel SSE2 implemented. The improvement comes from Intel AVX2 instructions operating on 256-bit registers, twice the width of the 128-bit registers used by Intel SSE2. The application is running on a system equipped with the Intel Xeon processor E5 v4.
Conclusion
Kingsoft added Intel AVX2 support to the libjpeg-turbo library. This allows applications using the modified library to take advantage of the new features in Intel Xeon processors E5 v4. More cores and a larger cache also play an important role in improving the performance of applications running on these processors compared with previous generations of Intel Xeon processors.
Happy Together: Ground-Breaking Media Performance with Intel® Processors + Software - Oct. 27 Free Webinar
Now, you can get the sweetest, fastest, high density and quality results for media workloads and video streaming - with the latest Intel hardware and media software working together. Take advantage of these platforms and learn how to access hardware-accelerated codecs on Intel® Xeon® E3-1500 v5 and 6th generation Intel® Core™ processors (codenamed Skylake) in a free webinar on Oct. 27 at 9 a.m. (Pacific).
- Optimize media solutions and apps for HEVC, AVC and MPEG-2 using Intel® Media Server Studio or Intel® Media SDK
- Achieve up to real-time 4K@60fps HEVC, or up to 18 AVC HD@30fps transcoding sessions on one platform**
- Access the big performance boosts possible with Intel graphics processors (GPUs)
- Get the skinny on shortcuts to fast-track your results
Sign Up Today Oct. 27 Free Webinar: Happy Together: Ground-Breaking Media Performance with Intel® Processors + Software

Webinar Speaker
Jeff McAllister, Media Software Technical Consulting Engineer
How to Mount a Shared Directory on Intel® Xeon Phi™ Coprocessor
In order to run a native program on the Intel® Xeon Phi™ coprocessor, the program and any dependencies must be copied to the target platform. However, this approach takes memory away from the native application. To preserve memory resources (the 16 GB of GDDR5 memory on board the Intel Xeon Phi coprocessor), it is practical to mount a Network File System (NFS) shared directory on the Intel Xeon Phi coprocessor from the host server so that most of the coprocessor's memory can be used for applications. This article shows two ways to accomplish this task: the preferred method, using the micctrl utility, and a manual procedure.
Using the micctrl utility
The preferred method to mount a shared directory on an Intel Xeon Phi coprocessor is to use the micctrl utility shipped with the Intel® Manycore Platform Software Stack (Intel® MPSS). The following example shows how to share the Intel® C++ Compiler libraries using micctrl. On the host machine used for this example, Intel MPSS 3.4.8 was installed.
- On the host machine, ensure that the shared directory exists:
[host ~]# ls /opt/intel/compilers_and_libraries_2017.0.098/linux/
- Add a new descriptor to the /etc/exports configuration file on the host machine in order to export the directory /opt/intel/compilers_and_libraries_2017.0.098/linux to the coprocessor mic0, whose IP address is 172.31.1.1. Use the read-only option so that the coprocessor cannot mistakenly delete anything in the shared library:
[host ~]# cat /etc/exports
/opt/intel/compilers_and_libraries_2017.0.098/linux 172.31.1.1(ro,async,no_root_squash)
For more information on the export options, you can refer to http://nfs.sourceforge.net/nfs-howto/ar01s03.html.
- Next, update the NFS export table in the host:
[host ~]# exportfs -a
- From the host, use the micctrl utility to add an NFS entry on the coprocessors:
[host ~]# micctrl --addnfs=/opt/intel/compilers_and_libraries_2017.0.098/linux --dir=/mnt-library --options=defaults
- Restart the MPSS service:
[host ~]# service mpss restart
Shutting down Intel(R) MPSS: [ OK ]
Starting Intel(R) MPSS: [ OK ]
mic0: online (mode: linux image: /usr/share/mpss/boot/bzImage-knightscorner)
mic1: online (mode: linux image: /usr/share/mpss/boot/bzImage-knightscorner)
- Finally, from the coprocessor, verify that the remote directory is accessible:
[host ~]# ssh mic0 cat /etc/fstab
rootfs / auto defaults 1 1
proc /proc proc defaults 0 0
devpts /dev/pts devpts mode=0620,gid=5 0 0
172.31.1.254:/opt/intel/compilers_and_libraries_2017.0.098/linux /mnt-library nfs defaults 1 1
[host ~]# ssh mic0 ls /mnt-library
Mounting manually
As an example of the manual procedure, let’s assume we want to mount an NFS shared directory /mnt-mic0 on the Intel Xeon Phi coprocessor from the host machine (/var/mpss/mic0.export is the directory that the host machine exports). In this method, steps 1-3 are the same as in the previous method:
- On the host machine, ensure that the shared directory exists; if it doesn't exist, create it:
[host ~]# mkdir /var/mpss/mic0.export
- Add a descriptor to the /etc/exports configuration file on the host machine to export the directory /var/mpss/mic0.export to the coprocessor mic0, which in this case has an IP address of 172.31.1.1:
[host ~]# cat /etc/exports
/var/mpss/mic0.export 172.31.1.1(rw,async,no_root_squash)
For more information on the export options, you can refer to http://nfs.sourceforge.net/nfs-howto/ar01s03.html.
- Next, update the NFS export table:
[host ~]# exportfs -a
- Next, login on the coprocessor mic0:
[host ~]# ssh mic0
- Create the mount point /mnt-mic0 on the coprocessor:
(mic0)# mkdir /mnt-mic0
- Add the following descriptor to the /etc/fstab file of the coprocessor to specify the server, the path name of the exported directory, the local directory (mount point), the type of the file system, and the list of mount options: "172.31.1.254:/var/mpss/mic0.export /mnt-mic0 nfs defaults 1 1":
(mic0)# cat /etc/fstab
rootfs / auto defaults 1 1
proc /proc proc defaults 0 0
devpts /dev/pts devpts mode=0620,gid=5 0 0
172.31.1.254:/var/mpss/mic0.export /mnt-mic0 nfs defaults 1 1
- To mount the shared directory /var/mpss/mic0.export on the coprocessor, we can type:
(mic0)# mount -a
Notes:
- If "Connection refused" error is received, restart NFS server in the host:
[host~]# service nfs restart Shutting down NFS daemon: [ OK ] Shutting down NFS mountd: [ OK ] Shutting down NFS quotas: [ OK ] Shutting down NFS services: [ OK ] Starting NFS services: [ OK ] Starting NFS quotas: [ OK ] Starting NFS mountd: [ OK ] Stopping RPC idmapd: [ OK ] Starting RPC idmapd: [ OK ] Starting NFS daemon: [ OK ]
- If "Permission denied" error is received, review and correct the
/etc/exports
file in the host. - If the coprocessor reboots, you have to mount the directory in the coprocessor again.
- The above shared directory can be read/write. To change to read only option, use the option (
ro,async,no_root_squash
) as seen in step 2.
Conclusion
This article shows two methods to mount a shared directory on the Intel Xeon Phi coprocessor: one uses the micctrl utility, and the other is the common manual procedure. Although both methods work, using the micctrl utility is preferred because it prevents users from entering data incorrectly in the /etc/fstab table of the coprocessor.
References
- Intel® Manycore Platform Software Stack User’s Guide revision 3.7 (from https://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-mpss)
- Article, “Setting up an NFS Server”
Advanced Bitrate Control Methods in Intel® Media SDK
Introduction
In the world of media, there is a great demand to increase encoder quality, but this comes with tradeoffs between quality and bandwidth consumption. This article addresses some of those concerns by discussing advanced bitrate control methods, which provide the ability to increase quality (relative to legacy rate controls) while keeping the bitrate constant using Intel® Media SDK / Intel® Media Server Studio tools.
The Intel Media SDK encoder offers many bitrate control methods, which can be divided into legacy and advanced/special-purpose algorithms. This article is the second part of a two-part series on bitrate control methods in Intel® Media SDK. The legacy rate control algorithms are detailed in the first part, Bitrate Control Methods (BRC) in Intel® Media SDK; the advanced rate control methods (summarized in the table below) are explained in this article.
Rate Control | HRD/VBV Compliant | OS Supported | Usage |
---|---|---|---|
LA | No | Windows/Linux | Storage transcodes |
LA_HRD | Yes | Windows/Linux | Storage transcodes; Streaming solution (where low latency is not a requirement) |
ICQ | No | Windows | Storage transcodes (better quality with smaller file size) |
LA_ICQ | No | Windows | Storage transcodes |
The following tools (along with their download links) are used to explain the concepts and generate performance data for this article:
- Software- Intel® Media Server Studio and Intel® Media SDK
- Code Samples - Version 6.0.0.142
- Analysis tools - Intel® Video Pro Analyzer (VPA) and Video Quality Caliper (VQC), a component of Intel® Media Server Studio Professional Edition
- Raw input stream - Sintel 1080p
- Codec - H264/AVC
- System Used -
- CPU: Intel® Core™ i5-5300U CPU @ 2.30 GHz
- OS: Microsoft Windows 8.1 Enterprise
- Architecture: 64-bit
- Graphics Devices: Intel® HD Graphics 5500
Look Ahead (LA) Rate Control
As the name explains, this bitrate control method looks at successive frames, or the frames to be encoded next, and stores them in a look-ahead buffer. The number of frames or the length of the look ahead buffer can be specified by the LookAheadDepth parameter. This rate control is recommended for transcoding/encoding in a storage solution.
Generally, many parameters can be used to modify the quality/performance of the encoded stream. In this particular rate control, the encoding behavior can be varied by changing the size of the look ahead buffer: the LookAheadDepth parameter can be set between 10 and 100 to specify that size. The LookAheadDepth parameter specifies the number of frames that the SDK encoder analyzes before encoding. As LookAheadDepth increases, so does the number of frames that the encoder looks into; this results in higher quality of the encoded stream, but the encoding performance (frames per second) decreases. In our experiments, this performance tradeoff was negligible for smaller input streams.
Look Ahead rate control is enabled by default in sample_encode and sample_multi_transcode, part of code samples. The example below describes how to use this rate control method using the sample_encode application.
sample_encode.exe h264 -i sintel_1080p.yuv -o LA_out.264 -w 1920 -h 1080 -b 10000 –f 30 -lad 100 -la
As the value of LookAheadDepth increases, encoding quality improves, because the number of frames stored in the look ahead buffer has also increased, and the encoder has more visibility into upcoming frames.
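For applications that drive the Intel Media SDK encoder directly rather than through sample_encode, the same behavior is configured through the encoder parameters. The sketch below is a minimal illustration (the variable and function names are ours, and initialization of the remaining mfxVideoParam fields is omitted) of selecting LA rate control and setting LookAheadDepth through the mfxExtCodingOption2 extended buffer.
// Minimal sketch: select Look Ahead (LA) rate control and set the
// look-ahead depth through the mfxExtCodingOption2 extended buffer.
#include "mfxvideo++.h"
#include <vector>

static mfxVideoParam encParams = {};
static mfxExtCodingOption2 option2 = {};
static std::vector<mfxExtBuffer*> extBuffers;

void configureLookAhead() {
    encParams.mfx.CodecId           = MFX_CODEC_AVC;
    encParams.mfx.RateControlMethod = MFX_RATECONTROL_LA;  // Look Ahead BRC
    encParams.mfx.TargetKbps        = 10000;

    option2.Header.BufferId = MFX_EXTBUFF_CODING_OPTION2;
    option2.Header.BufferSz = sizeof(option2);
    option2.LookAheadDepth  = 100;  // frames analyzed before encoding (10-100)

    extBuffers.push_back(reinterpret_cast<mfxExtBuffer*>(&option2));
    encParams.ExtParam    = extBuffers.data();
    encParams.NumExtParam = static_cast<mfxU16>(extBuffers.size());
    // encParams is then passed to MFXVideoENCODE::Init().
}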
It should be noted that LA is not HRD (Hypothetical Reference Decoder) compliant. The following picture, obtained from Intel® Video Pro Analyzer, shows an HRD buffer fullness view with "Buffer" mode enabled, where the sub-mode "HRD" is greyed out. This means no HRD parameters were passed in the stream headers, which indicates that LA rate control is not HRD compliant.
Sliding Window condition:
Sliding window algorithm is a part of the Look Ahead rate control method. This algorithm is applicable for both LA and LA_HRD rate control methods by defining WinBRCMaxAvgKbps and WinBRCSize through the mfxExtCodingOption3 structure.
Sliding window condition is introduced to strictly constrain the maximum bitrate of the encoder by changing two parameters: WinBRCSize and WinBRCMaxAvgKbps. This helps in limiting the achieved bitrate which makes it a good fit in limited bandwidth scenarios such as live streaming.
- WinBRCSize parameter specifies the sliding window size in frames. A setting of zero means that sliding window condition is disabled.
- WinBRCMaxAvgKbps specifies the maximum bitrate averaged over a sliding window specified by WinBRCSize.
In this technique, the average bitrate in a sliding window of WinBRCSize must not exceed WinBRCMaxAvgKbps. The above condition becomes weaker as the sliding window size increases and becomes stronger if the sliding window size value decreases. Whenever this condition fails, the frame will be automatically re-encoded with a higher quantization parameter and performance of the encoder decreases as we keep encountering failures. To reduce the number of failures and to avoid re-encoding, frames within the look ahead buffer will be analyzed by the encoder. A peak will be detected when there is a condition failure by encountering a large frame in the look ahead buffer. Whenever a peak is predicted, the quantization parameter value will be increased, thus reducing the frame size.
Sliding window can be implemented by adding the following code to the pipeline_encode.cpp program in the sample_encode application.
m_CodingOption3.WinBRCMaxAvgKbps = 1.5 * TargetKbps;
m_CodingOption3.WinBRCSize = 90; // 3 * framerate
m_EncExtParams.push_back((mfxExtBuffer *)&m_CodingOption3);
The above values were chosen when encoding sintel_1080p.yuv of 1253 frames with H.264 codec, TargetKbps = 10000, framerate = 30fps. Sliding window parameter values (WinBRCMaxAvgKbps and WinBRCSize) are subject to change when using different input options.
If WinBRCMaxAvgKbps is close to TargetKbps and WinBRCSize almost equals 1, the sliding window will degenerate into the limitation of the maximum frame size (TargetKbps/framerate).
The sliding window condition can be evaluated by checking that, in any WinBRCSize consecutive frames, the total encoded size does not exceed the limit derived from WinBRCMaxAvgKbps. The following equation expresses the sliding window condition.
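A reconstruction of that condition (the original figure is not reproduced here; encoded frame sizes are taken in kilobits, so WinBRCSize/framerate converts WinBRCMaxAvgKbps into a per-window budget) is:
\[
\sum_{i=k}^{k+\mathrm{WinBRCSize}-1} \mathrm{FrameSize}_i \;\le\; \mathrm{WinBRCMaxAvgKbps} \times \frac{\mathrm{WinBRCSize}}{\mathrm{framerate}} \qquad \text{for every window start } k .
\]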
The condition of limiting frame size can be checked after the asynchronous encoder run and encoded data is written back to the output file in pipeline_encode.cpp.
Look Ahead with HRD Compliance (LA_HRD) Rate Control
As Look Ahead bitrate control is not HRD compliant, there is a dedicated mode to achieve HRD compliance with the LookAhead algorithm, known as LA_HRD mode (MFX_RATECONTROL_LA_HRD). With HRD compliance, the Coded Picture Buffer should neither overflow nor underflow. This rate control is recommended in storage transcoding solutions and streaming scenarios, where low latency is not a major requirement.
To use this rate control in sample_encode, code changes are required, as illustrated below.
Statements to be added in the sample_encode.cpp file within the ParseInputString() function:
else if (0 == msdk_strcmp(strInput[i], MSDK_STRING("-hrd"))) pParams->nRateControlMethod = MFX_RATECONTROL_LA_HRD;
LookAheadDepth value can be mentioned in the command line when executing the sample_encode binary. The example below describes how to use this rate control method using the sample_encode application.
sample_encode.exe h264 -i sintel_1080p.yuv -o LA_out.264 -w 1920 -h 1080 -b 10000 –f 30 -lad 100 –hrd
In the following graph, the LookAheadDepth(lad) value is 100.
Figure 2: A snapshot of Intel® Video Pro Analyzer (VPA), which verifies that LA_HRD rate control is HRD compliant. The buffer fullness view is activated by selecting "Buffer" mode with "HRD" chosen as the sub-mode.
The above figure shows HRD buffer fullness view with “Buffer” mode enabled in Intel VPA, in which the sub-mode “HRD” is selected. The horizontal red lines show the upper and lower limits of the buffer and green line shows the instantaneous buffer fullness. The buffer fullness didn’t cross the upper and lower limits of the buffer. This means neither overflow nor underflow occurred in this rate control.
Extended Look Ahead (LA_EXT) Rate Control
For 1:N transcoding scenarios (one decode and N encode sessions), there is an optimized look ahead algorithm known as the Extended Look Ahead rate control algorithm (MFX_RATECONTROL_LA_EXT), available only in Intel® Media Server Studio (not part of the Intel® Media SDK). It is recommended for broadcasting solutions.
An application should be able to load the plugin ‘mfxplugin64_h264la_hw.dll’ to support MFX_RATECONTROL_LA_EXT. This plugin can be found in the following location in the local system, where the Intel® Media Server Studio is installed.
- “\Program Installed\Software Development Kit\bin\x64\588f1185d47b42968dea377bb5d0dcb4”.
The path of this plugin needs to be mentioned explicitly because it is not part of the standard installation directory. This capability can be used in either of two ways:
- Preferred method - Register the plugin in the registry and provide all necessary attributes such as API version, plugin type, and path, so that the dispatcher, which is a part of the software, can find it through the registry and connect it to a decoding/encoding session.
- Have all binaries (Media SDK, plugin, and app) in a directory and execute from the same directory.
The LookAheadDepth parameter must be specified only once and applies to all N transcoded streams. LA_EXT rate control can be used with sample_multi_transcode; below is an example command line:
sample_multi_transcode.exe -par file_1.par
The contents of the par file are:
-lad 40 -i::h264 input.264 -join -la_ext -hw_d3d11 -async 1 -n 300 -o::sink
-h 1088 -w 1920 -o::h264 output_1.0.h264 -b 3000 -join -async 1 -hw_d3d11 -i::source -l 1 -u 1 -n 300
-h 1088 -w 1920 -o::h264 output_2.h264 -b 5000 -join -async 1 -hw_d3d11 -i::source -l 1 -u 1 -n 300
-h 1088 -w 1920 -o::h264 output_3.h264 -b 7000 -join -async 1 -hw_d3d11 -i::source -l 1 -u 1 -n 300
-h 1088 -w 1920 -o::h264 output_4.h264 -b 10000 -join -async 1 -hw_d3d11 -i::source -l 1 -u 1 -n 300
Intelligent Constant Quality (ICQ) Rate Control
The ICQ bitrate control algorithm is designed to improve subjective video quality of an encoded stream: it may or may not improve video quality objectively - depending on the content. ICQQuality is a control parameter which defines the quality factor for this method. ICQQuality parameter can be changed between 1 - 51, where 1 corresponds to the best quality. The achieved bitrate and encoder quality (PSNR) can be adjusted by increasing or decreasing ICQQuality parameter. This rate control is recommended for storage solutions, where high quality is required while maintaining a smaller file size.
To use this rate control in sample_encode, code changes are required, as explained below.
Statements to be added in sample_encode.cpp within the ParseInputString() function:
else if (0 == msdk_strcmp(strInput[i], MSDK_STRING("-icq"))) pParams->nRateControlMethod = MFX_RATECONTROL_ICQ;
ICQQuality is available in the mfxInfoMFX structure. The desired value can be entered for this variable in InitMfxEncParams() function, e.g.:
m_mfxEncParams.mfx.ICQQuality = 12;
The example below describes how to use this rate control method using the sample_encode application.
sample_encode.exe h264 -i sintel_1080p.yuv -o ICQ_out.264 -w 1920 -h 1080 -b 10000 -icq

Using about the same bitrate, ICQ shows improved Peak Signal to Noise Ratio (PSNR) in the above plot. The RD-graph data for the above plot is captured using the Video Quality Caliper, which compares two different streams encoded with ICQ and VBR.
Observation from above performance data:
- At the same achieved bitrate, ICQ shows much improved quality (PSNR) compared to VBR, while maintaining the same encoding FPS.
- The encoding bitrate and quality of the stream decreases as the ICQQuality parameter value increases.
The snapshot below shows a subjective comparison between encoded frames using VBR (on the left) and ICQ (on the right). Highlighted sections demonstrate missing details in VBR and improvements in ICQ.

Look Ahead & Intelligent Constant Quality (LA_ICQ) Rate Control
This method is a combination of ICQ and Look Ahead. This rate control is also recommended for storage solutions. ICQQuality and LookAheadDepth are the two control parameters: the quality factor is specified by mfxInfoMFX::ICQQuality, and the look ahead depth is controlled by the mfxExtCodingOption2::LookAheadDepth parameter.
To use this rate control in sample_encode, code changes are required, as explained below.
Statements to be added in sample_encode.cpp within the ParseInputString() function:
else if (0 == msdk_strcmp(strInput[i], MSDK_STRING("-laicq"))) pParams->nRateControlMethod = MFX_RATECONTROL_LA_ICQ;
ICQQuality is available in the mfxInfoMFX structure. Desired values can be entered for this variable in InitMfxEncParams() function
m_mfxEncParams.mfx.ICQQuality = 12;
LookAheadDepth can be mentioned in command line as lad.
sample_encode.exe h264 -i sintel_1080p.yuv -o LAICQ_out.264 -w 1920 -h 1080 -b 10000 –laicq -lad 100

At similar bitrate, better PSNR is observed for LA_ICQ compared to VBR as shown in the above plot. By keeping LookAheadDepth value at 100, the ICQQuality parameter value was changed between 1 - 51. The RD-graph data for this plot was captured using the Video Quality Caliper, which compares two different streams encoded with LA_ICQ and VBR.
Conclusion
There are several advanced bitrate control methods available to experiment with, to see whether higher quality encoded streams can be achieved while keeping the bandwidth requirements constant. Each rate control has its own advantages and can be used in specific industry use cases depending on the requirements. This article focuses on H264/AVC encoder rate control methods, which might not be applicable to the MPEG2 and H265/HEVC encoders. To implement these bitrate controls, also refer to the Intel® Media SDK Reference Manual, which comes with an installation of the Intel® Media SDK or Intel® Media Server Studio, and the Intel® Media Developer’s Guide from the documentation website. Visit Intel’s media support forum for further questions.
How to Emulate Persistent Memory on an Intel® Architecture Server
Introduction
This tutorial provides a method for setting up persistent memory (PMEM) emulation using regular dynamic random access memory (DRAM) on an Intel® processor using a Linux* kernel version 4.3 or higher. The article covers the hardware configuration and walks you through setting up the software. After following the steps in this article, you'll be ready to try the PMEM programming examples in the NVM Library at pmem.io.
Why do this?
If you’re a software developer who wants to get started early developing software or preparing your applications to have PMEM awareness, you can use this emulation for development before PMEM hardware is widely available.
What is persistent memory?
Traditional applications organize their data between two tiers: memory and storage. Emerging PMEM technologies introduce a third tier. This tier can be accessed like volatile memory, using processor load and store instructions, but it retains its contents across power loss like storage. Because this emulation uses DRAM, data will not be retained across power cycles.
Hardware and System Requirements
Emulation of persistent memory is based on DRAM memory that will be seen by the operating system (OS) as a Persistent Memory region. Because it is a DRAM-based emulation it is very fast, but will lose all data upon powering down the machine. The following hardware was used for this tutorial:
CPU and Chipset | Intel® Xeon® processor E5-2699 v4, 2.2 GHz |
Platform | Intel® Server System R2000WT product family (code-named Wildcat Pass) |
Memory | 256 GB (16 x 16 GB) DDR4-2133P; brand/model: Micron* MTA36ASF2G72PZ2GATESIG |
Storage | 1 TB Western Digital* (WD1002FAEX) |
Operating system | CentOS* 7.2 with kernel 4.5.3 |
Table 1 - System configuration used for the PMEM emulation.
Linux* Kernel
Linux Kernel 4.5.3 was used during development of this tutorial. Support for persistent memory devices and emulation has been present in the kernel since version 4.0; however, a kernel newer than 4.2 is recommended for easier configuration. The emulation should work with any Linux distribution able to handle an official kernel. To configure the proper driver installation, run make nconfig and enable the driver as shown below; Figures 1 to 5 show the corresponding settings for NVDIMM Support in the Kernel Configuration menu.
$ make nconfig
-> Device Drivers
   -> NVDIMM Support
      <M> PMEM; <M> BLK; <*> BTT
Figure 1: Set up device drivers.
Figure 2: Set up the NVDIMM device.
Figure 3: Set up the file system for Direct Access support.
Figure 4: Set up for Direct Access (DAX) support.
Figure 5: NVDIMM Support property.
The kernel will offer these regions to the PMEM driver so they can be used for persistent storage. Figures 6 and 7 show the correct setting for the processor type and features in the Kernel Configuration menu.
$ make nconfig
-> Processor type and features
   <*> Support non-standard NVDIMMs and ADR protected memory
Figure 6: Set up the processor to support NVDIMMs.
Figure 7: Enable NON-standard NVDIMMs and ADR protected memory.
Now you are ready to build your kernel using the instructions below.
$ make -jX    # where X is the number of cores on the machine
During the new kernel build, there is a performance benefit to compiling in parallel. An experiment going from one thread to multiple threads showed that compilation can be up to 95 percent faster than with a single thread. With the time saved by multi-threaded compilation, the whole kernel setup goes much faster. Figures 8 and 9 show the CPU utilization and the performance gain chart for compiling with different numbers of threads.
Figure 8: Compiling the kernel sources.
Figure 9: Performance gain for compiling the source in parallel.
Install the Kernel
# make modules_install install
Figure 10: Installing the kernel.
Reserve a memory region by modifying kernel command line parameters so it appears as a persistent memory location to the OS. The region of memory to be used is from ss to ss+nn. [KMG] refers to kilo, mega, giga.
memmap=nn[KMG]!ss[KMG]
For example, memmap=4G!12G reserves 4 GB of memory between 12th and 16th GB. Configuration is done within GRUB and varies between Linux distributions. Here are two examples of a GRUB configuration.
Under CentOS 7.0
# vi /etc/default/grub
GRUB_CMDLINE_LINUX="memmap=nn[KMG]!ss[KMG]"
On BIOS-based machines:
# grub2-mkconfig -o /boot/grub2/grub.cfg
Figure 11 shows the added PMEM statement in the GRUB file. Figure 12 shows the instructions to make the GRUB configuration.
Figure 11: Define PMEM regions in the /etc/default/grub file.
Figure 12: Generate the boot configuration file based on the grub template.
After the machine reboots, you should be able to see the emulated devices as /dev/pmem0…pmem3. Requesting reserved memory regions for persistent memory emulation results in split memory ranges defining persistent (type 12) regions, as shown in Figure 13. A general recommendation is to either use memory from the 4 GB+ range (memmap=nnG!4G) or to check the e820 memory map up front and fit within it. If you don't see the device, verify that the memmap setting in the grub file is correct, as shown in Figure 11, and then analyze the dmesg(1) output, as shown in Figure 13. You should be able to see the reserved ranges in the dmesg output.
Figure 13: Persistent memory regions are highlighted as (type 12).
You'll see that there can be multiple non-overlapping regions reserved as a persistent memory. Putting multiple memmap="...!..." entries will result in multiple devices exposed by the kernel and visible as /dev/pmem0, /dev/pmem1, /dev/pmem2, …
DAX - Direct Access
The DAX (direct access) extensions to the filesystem create a PMEM-aware environment. Some distros, such as Fedora* 24 and later, already have DAX/PMEM built in as a default and have NVML available as well. One quick way to check whether the kernel has DAX and PMEM built in is to grep the kernel's config file, which is usually provided by the distro under /boot. Use the command below:
# egrep '(DAX|PMEM)' /boot/config-`uname -r`
The result should be something like:
CONFIG_X86_PMEM_LEGACY_DEVICE=y
CONFIG_X86_PMEM_LEGACY=y
CONFIG_BLK_DEV_RAM_DAX=y
CONFIG_BLK_DEV_PMEM=m
CONFIG_FS_DAX=y
CONFIG_FS_DAX_PMD=y
CONFIG_ARCH_HAS_PMEM_API=y
To install a filesystem with DAX (available today for ext4 and xfs):
# mkdir /mnt/pmemdir
# mkfs.ext4 /dev/pmem3
# mount -o dax /dev/pmem3 /mnt/pmemdir
Now files can be created on the freshly mounted partition and given as input to NVML pools.
Figure 14: Persistent memory blocks.
Figure 15: Making a file system.
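As a minimal sketch of what an application can then do with such a file (using only POSIX calls and the /mnt/pmemdir path from the example above; NVML's libpmem provides higher-level equivalents), consider:
// Minimal sketch: map a file on the DAX-mounted filesystem and store data in it.
// Error handling is reduced for brevity; the path follows the example above.
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>
#include <cstring>
#include <cstdio>

int main() {
    const std::size_t len = 4096;
    int fd = open("/mnt/pmemdir/example.dat", O_CREAT | O_RDWR, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (ftruncate(fd, len) != 0) { perror("ftruncate"); return 1; }

    // With a DAX mount, this mapping reaches the emulated PMEM region directly,
    // bypassing the page cache.
    void* addr = mmap(nullptr, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) { perror("mmap"); return 1; }

    std::strcpy(static_cast<char*>(addr), "hello, persistent memory");
    msync(addr, len, MS_SYNC);  // flush the stores; libpmem's pmem_persist() serves a similar role

    munmap(addr, len);
    close(fd);
    return 0;
}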
It is additionally worth mentioning that you can emulate persistent memory with a ramdisk (for example, /dev/shm) or force PMEM-like behavior by setting the environment variable PMEM_IS_PMEM_FORCE=1. This eliminates the performance hit caused by msync(2).
Conclusion
By now, you know how to set up an environment where you can build a PMEM application without actual PMEM hardware. With the additional cores on an Intel® architecture server, you can quickly build a new kernel with PMEM support for your emulation environment.
Author(s)
Thai Le is a software engineer focusing on cloud computing and performance computing analysis at Intel Corporation.
Introduction to the Heterogeneous Streams Library
Introduction
To efficiently utilize all available resources for task-concurrent applications on heterogeneous platforms, designers need to understand the memory architecture and thread utilization on each platform, the pipeline for offloading work to the different platforms, and how to coordinate all these activities.
To relieve designers of the burden of implementing the necessary infrastructure, the Heterogeneous Streaming (hStreams) library provides a set of well-defined APIs to support a task-based parallelism model on heterogeneous platforms. hStreams uses the Intel® Coprocessor Offload Infrastructure (Intel® COI) to implement this infrastructure. That is, the host decomposes the workload into tasks, one or more tasks are executed on separate targets, and finally the host gathers the results from all of the targets. Note that the host can itself also be a target.
Intel® Manycore Platform Software Stack (Intel® MPSS) version 3.6 contains the hStreams library, documentation, and sample code. Starting with Intel MPSS 3.7, hStreams was removed from the Intel MPSS software and became an open source project. The current version, 1.0, supports the Intel® Xeon® processor and Intel® Xeon Phi™ coprocessor as targets. hStreams binaries version 1.0.0 can be downloaded from:
- https://01.org/sites/default/files/downloads/hetero-streams-library/hstreams-1.0.0.tar (for Linux*)
- https://01.org/sites/default/files/downloads/hetero-streams-library/hstreams-1.0.0.zip (for Windows*)
Users can contribute to hStreams development at https://github.com/01org/hetero-streams. The following tables summarize the tools that support hStreams in Linux and Windows:
Name of Tool (Linux*) | Supported Version |
---|---|
Intel® Manycore Platform Software Stack | 3.4, 3.5, 3.6, 3.7 |
Intel® C++ Compiler | 15.0, 16.0 |
Intel® Math Kernel Library | 11.2, 11.3 |
Name of Tool (Windows*) | Supported Version |
---|---|
Intel MPSS | 3.4, 3.5, 3.6, 3.7 |
Intel C++ Compiler | 15.0, 16.0 |
Intel Math Kernel Library | 11.2, 11.3 |
Visual Studio* | 11.0 (2012) |
This whitepaper briefly introduces hStreams and highlights its concepts. For a full description, readers are encouraged to read the tutorial included in the hStreams package mentioned above.
Execution model concepts
This section highlights some basic concepts of hStreams: source and sink, domains, streams, buffers, and actions:
- Streams are FIFO queues where actions are enqueued. Streams are associated with logical domains. Each stream has two endpoints, a source and a sink; the sink endpoint is bound to a logical domain.
- The source is where work is enqueued, and the sink is where the work is executed. In the current implementation, the source process runs on an Intel Xeon processor-based machine, and the sink process runs on a machine that can be the host itself, an Intel Xeon Phi coprocessor, or, in the future, potentially any hardware platform. The library allows the source machine to invoke user-defined functions on the target machine.
- Domains represent the resources of hetero platforms. A physical domain is the set of all resources available in a platform (memory and computing). For example, an Intel Xeon processor-based machine and an Intel Xeon Phi coprocessor are two different physical domains. A logical domain is a subset of a given physical domain; it uses any subset of available cores in a physical domain. The only restriction is that two logical domains cannot be partially overlapping.
- Buffers represent memory resources to transfer data between source and sink. In order to transfer data, the user must create a buffer by calling an appropriate API, and a corresponding physical buffer is instantiated at the sink. Buffers can have properties such as memory type (for example, DDR or HBW) and affinity (for example, sub-NUMA clustering).
- Actions are requests to execute functions at the sinks (compute actions), to transfer data from source to sink or vice versa (memory movement actions), and to synchronize tasks among streams (synchronization actions). Actions enqueued in a stream are processed with first-in, first-out (FIFO) semantics: the source places an action in the queue and the sink removes it. All actions are non-blocking (asynchronous) and have completion events. Remote invocations can be user-defined functions or optimized convenience functions (for example, dgemm). Thus, a FIFO stream queue handles dependencies within a stream, while synchronization actions handle dependencies among streams.
In a typical scenario, the source-side code allocates stream resources, allocates memory, transfers data to the sink, invokes the sink to execute a predefined function, handles synchronization, and eventually terminates streams. Note that actions such as data transferring, remote invocation, and synchronization are handled in FIFO streams. The sink-side code simply executes the function that the source requested.
For example, consider the pseudo-code of a simple hStreams application that creates two streams, the source transfers data to the sinks, performs remote invocation at the sinks, and then transfers results back to the source host:
Step 1: Initialize two streams 0 and 1
Step 2: Allocate buffers A0, B0, C0, A1, B1, C1
Step 3: Use stream i, transfer memory Ai, Bi to sink (i=0,1)
Step 4: Invoke remote computing in stream i: Ai + Bi -> Ci (i=0,1)
Step 5: Transfer memory Ci back to host (i=0,1)
Step 6: Synchronize
Step 7: Terminate streams
The following figure illustrates the actions generated at the host:
Actions are placed in the corresponding streams and removed at the sinks:
hStreams provides two levels of APIs: the app API and the core API. The app API offers simple interfaces; it is targeted at novice users so they can quickly ramp up on the hStreams library. The core API gives advanced users the full functionality of the library. The app APIs in fact call the core-layer APIs, which in turn use Intel COI and the Symmetric Communication Interface (SCIF). Note that users can mix these two levels of API when writing their applications. For more details on the hStreams API, refer to the document Programming Guide and API Reference. The following figure illustrates the relation between the hStreams app API and the core API.
Refer to the document “Hetero Streams Library 1.0 Programming Guide and API” and the tutorial included in the hStreams download package for more information.
Building and running a sample hStreams program
This section illustrates a sample code that makes use of the hStreams app API. It also demonstrates how to build and run the application. The sample code is an MPI program running on an Intel Xeon processor host with two Intel Xeon Phi coprocessors connected.
First, download the package from https://github.com/01org/hetero-streams. Then, follow the instructions to build and install the hStreams library on an Intel Xeon processor-based host machine, which runs Intel MPSS 3.7.2 in this case. This host machine has two Intel Xeon Phi coprocessors installed and connects to a remote Intel Xeon processor-based machine. The remote machine (10.23.3.32) also has two Intel Xeon Phi coprocessors.
This sample code creates two streams; each stream runs explicitly on a separate coprocessor. An MPI rank manages these two streams.
The application consists of two parts: The source-side code is shown in Appendix A and the corresponding sink-side code is shown in Appendix B. The sink-side code contains a user-defined function vector_add, which is to be invoked by the source.
This sample MPI program is designed to run with two MPI ranks. Each MPI rank runs on a different domain (Intel Xeon processor host) and initializes two streams; each stream is responsible for communicating with one coprocessor. Each MPI rank enqueues the required actions into the streams in the following order: a memory transfer action from source to sink, a remote invocation action, and a memory transfer action from sink to source. The following app APIs are called in the source-side code:
- hStreams_app_init: Initialize and create streams across all available Intel Xeon Phi coprocessors. This API assumes one logical domain per physical domain.
- hStreams_app_create_buf: Create an instantiation of buffers in all currently existing logical domains.
- hStreams_app_xfer_memory: Enqueue a memory transfer action in a stream; depending on the specified direction, memory is transferred from source to sink or sink to source.
- hStreams_app_invoke: Enqueue a user-defined function in a stream. This function is executed at the stream sink. Note that the user also needs to implement the remote target function in the sink-side program.
- hStreams_app_event_wait: This sync action blocks until the set of specified events is completed. In this example, only the last transaction in a stream is required, since all other actions should be completed.
- hStreams_app_fini: Destroy hStreams internal structures and clear the library state.
Intel MPSS 3.7.2 and Intel® Parallel Studio XE 2016 Update 3 are installed on the host machine, which is based on an Intel® Xeon® processor E5-2600 family CPU. First, bring the Intel MPSS service up and set up the compiler environment variables on the host machine:
$ sudo service mpss start
$ source /opt/intel/composerxe/bin/compilervars.sh intel64
To compile the source-side code, link it with the dynamic library hstreams_source, which provides source functionality:
$ mpiicpc hstream_sample_src.cpp -O3 -o hstream_sample -lhstreams_source \
  -I/usr/include/hStreams -qopenmp
The above command generates the executable hstream_sample. To generate the user kernel library for the coprocessor (as sink), compile with the flag -mmic:
$ mpiicpc -mmic -fPIC -O3 hstream_sample_sink.cpp -o ./mic/hstream_sample_mic.so \
  -I/usr/include/hStreams -qopenmp -shared
To follow the convention, the target library takes the form <exec_name>_mic.so for the Intel Xeon Phi coprocessor and <exec_name>_host.so for the host. The command above generates the library named hstream_sample_mic.so under the ./mic folder.
To run this application, set the environment variable SINK_LD_LIBRARY_PATH so that the hStreams runtime can find the user kernel library hstream_sample_mic.so:
$ export SINK_LD_LIBRARY_PATH=/opt/mpss/3.7.2/sysroots/k1om-mpss-linux/usr/lib64:~/work/hStreams/collateral/delivery/mic:$MIC_LD_LIBRARY_PATH
Run this program with two ranks, one rank running on this current host and one rank running on the host whose IP address is 10.23.3.32, as follows:
$ mpiexec.hydra -n 1 -host localhost ~/work/hstream_sample : -n 1 -wdir ~/work -host 10.23.3.32 ~/work/hstream_sample
Hello world! rank 0 of 2 runs on knightscorner5
Hello world! rank 1 of 2 runs on knightscorner0.jf.intel.com
Rank 0: stream 0 moves A
Rank 0: stream 0 moves B
Rank 0: stream 1 moves A
Rank 0: stream 1 moves B
Rank 0: compute on stream 0
Rank 0: compute on stream 1
Rank 0: stream 0 Xtransfer data in C back
knightscorner5-mic0
knightscorner5-mic1
Rank 1: stream 0 moves A
Rank 1: stream 0 moves B
Rank 1: stream 1 moves A
Rank 1: stream 1 moves B
Rank 1: compute on stream 0
Rank 1: compute on stream 1
Rank 1: stream 0 Xtransfer data in C back
knightscorner0-mic0.jf.intel.com
knightscorner0-mic1.jf.intel.com
Rank 0: stream 1 Xtransfer data in C back
Rank 1: stream 1 Xtransfer data in C back
sink: compute on sink in stream num: 0
sink: compute on sink in stream num: 0
sink: compute on sink in stream num: 1
sink: compute on sink in stream num: 1
C0=97.20 C1=90.20 C0=36.20 C1=157.20 PASSED!
Conclusion
hStreams provides a well-defined set of APIs allowing users to quickly design a task-based application on heterogeneous platforms. Two levels of hStreams API co-exist: the app API offers simple interfaces for novice users to quickly ramp up on the hStreams library, and the core API gives advanced users the full functionality of the library. This paper presents some basic hStreams concepts and illustrates how to build and run an MPI program that takes advantage of the hStreams interface.
About the Author
Loc Q Nguyen received an MBA from University of Dallas, a master’s degree in Electrical Engineering from McGill University, and a bachelor's degree in Electrical Engineering from École Polytechnique de Montréal. He is currently a software engineer with Intel Corporation's Software and Services Group. His areas of interest include computer networking, parallel computing, and computer graphics.
Improve Vectorization Performance using Intel® Advanced Vector Extensions 512
This article shows a simple example of a loop that was not vectorized by the Intel® C++ Compiler due to possible data dependencies, but which has now been vectorized using the Intel® Advanced Vector Extensions 512 instruction set on an Intel® Xeon Phi™ processor. We will explore why the compiler using this instruction set automatically recognizes the loop as vectorizable and will discuss some issues about the vectorization performance.
Introduction
When optimizing code, the first efforts should be focused on vectorization. The most fundamental way to efficiently utilize the resources in modern processors is to write code that can run in vector mode by taking advantage of special hardware like vector registers and SIMD (Single Instruction Multiple Data) instructions. Data parallelism in the algorithm/code is exploited in this stage of the optimization process.
Making the most of fine grain parallelism through vectorization will allow the performance of software applications to scale with the number of cores in the processor by using multithreading and multitasking. Efficient use of single-core resources will be critical in the overall performance of the multithreaded application, because of the multiplicative effect of vectorization and multithreading.
The new Intel® Xeon Phi™ processor features 512-bit wide vector registers. The new Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instruction set architecture (ISA), which is supported by the Intel Xeon Phi processor (and future Intel® processors), offers support for vector-level parallelism, which allows the software to use two vector processing units (each capable of simultaneously processing 16 single precision (32-bit) or 8 double precision (64-bit) floating point numbers) per core. Taking advantage of these hardware and software features is the key to optimal use of the Intel Xeon Phi processor.
This document describes a way to take advantage of the new Intel AVX-512 ISA in the Intel Xeon Phi processor. An example of an image processing application will be used to show how, with Intel AVX-512, the Intel C++ Compiler now automatically vectorizes a loop that was not vectorized with Intel® Advanced Vector Extensions 2 (Intel® AVX2). We will discuss performance issues arising with this vectorized code.
The full specification of the Intel AVX-512 ISA consists of several subsets. Some of those subsets are available in the Intel Xeon Phi processor. Some subsets will also be available in future Intel® Xeon® processors. A detailed description of the Intel AVX-512 subsets and their presence in different Intel processors is described in (Zhang, 2016).
In this document, the focus will be on the subsets of the Intel AVX-512 ISA, which provides vectorization functionality present both in current Intel Xeon Phi processor and future Intel Xeon processors. These subsets include the Intel AVX-512 Foundation Instructions (Intel AVX-512F) subset (which provides core functionality to take advantage of vector instructions and the new 512-bit vector registers) and the Intel AVX-512 Conflict Detection Instructions (Intel AVX-512CD) subset (which adds instructions that detect data conflicts in vectors, allowing vectorization of certain loops with data dependences).
Vectorization Techniques
There are several ways to take advantage of vectorization capabilities on an Intel Xeon Phi processor core:
- Use optimized/vectorized libraries, like the Intel® Math Kernel Library (Intel® MKL).
- Write vectorizable high-level code, so the compiler will create corresponding binary code using the vector instructions available in the hardware (this is commonly called automatic vectorization).
- Use language extensions (compiler intrinsic functions) or direct calling to vector instructions in assembly language.
Each of these methods has advantages and disadvantages, and which one to use depends on the particular case we are working with. This document focuses on writing vectorizable code, which keeps our code more portable and ready for future processors. We will explore a simple example (a histogram) for which the new Intel AVX-512 instruction set lets the compiler create executable code that runs in vector mode on the Intel Xeon Phi processor. The purpose of this example is to give insight into why, using the Intel AVX-512 ISA, the compiler can now vectorize source code containing data dependences that was not recognized as vectorizable with previous instruction sets, such as Intel AVX2. Detailed information about the Intel AVX-512 ISA can be found in (Intel, 2016).
In future documents, techniques to explicitly guide vectorization using language extensions and compiler intrinsics will be discussed. Those techniques are helpful in complex loops for which the compiler is not able to safely vectorize the code due to complex control flow or data dependences. However, the relatively simple example shown in this document will be helpful in understanding how the compiler uses the new features in the Intel AVX-512 ISA to improve the performance of some common loop structures.
Example: Histogram Computation in Images
To understand the new features offered by the AVX512F and AVX512CD subsets, we will use the example of computing an image histogram.
An image histogram is a graphical representation of the distribution of pixel values in an image (Wikipedia, n.d.). The pixel values can be single scalars representing grayscale values or vectors containing values representing colors, as in RGB images (where the color is represented using a combination of three values: red, green, and blue).
In this document, we used a 3024 x 4032 grayscale image. The total number of pixels in this image is 12,192,768. The original image and the corresponding histogram (computed using 1-pixel 256 grayscale intensity intervals) are shown in Figure 1.
Figure 1: Image used in this document (image credit: Alberto Villarreal), and its corresponding histogram.
A basic algorithm to compute the histogram is the following:
- Read image
- Get number of rows and columns in the image
- Set image array [1: rows x columns] to image pixel values
- Set histogram array [0: 255] to zero
- For every pixel in the image
{ histogram [ image [ pixel ] ] = histogram [ image [ pixel ] ] + 1 }
Notice that in this basic algorithm, the image array is used as an index into the histogram array (a type conversion to an integer is assumed). This kind of indirect referencing cannot be unconditionally parallelized, because neighboring pixels in the image might have the same intensity value, in which case processing more than one iteration of the loop simultaneously might produce wrong results.
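The following minimal scalar sketch (the image values are made up for illustration) shows why: if two vector lanes held the same pixel value, a naively vectorized update would read the same histogram bin twice, increment both private copies, and write back the same result, losing one count.

#include <cstdio>

int main() {
    int histogram[256] = {0};
    int image[8] = {10, 10, 37, 37, 37, 200, 5, 10};   // neighboring pixels may repeat

    // Correct scalar result: bin 10 gets 3 counts, bin 37 gets 3 counts.
    for (int pixel = 0; pixel < 8; ++pixel)
        histogram[image[pixel]]++;

    std::printf("bin 10 = %d, bin 37 = %d\n", histogram[10], histogram[37]);
    return 0;
}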
In the next sections, this algorithm will be implemented in C++, and it will be shown that the compiler, when using the AVX-512 ISA, will be able to safely vectorize this structure (although only in a partial way, with performance depending on the image data).
Note that this implementation of a histogram computation is used in this document for pedagogical purposes only. It does not represent an efficient way to perform the computation, for which efficient libraries are available. Our purpose is to show, using simple code, how the new Intel AVX-512 ISA adds vectorization opportunities, and to help understand the new functionality it provides.
There are other ways to implement parallelism for specific examples of histogram computations. For example, in (Colfax International, 2015) the authors describe a way to automatically vectorize a similar algorithm (a binning application) by modifying the code using a strip-mining technique.
Hardware
To test our application, the following system will be used:
Processor: Intel Xeon Phi processor, model 7250 (1.40 GHz)
Number of cores: 68
Number of threads: 272
The information above can be checked in a Linux* system using the command
cat /proc/cpuinfo.
Notice that when using the command shown above, the “flags” section in the output includes the “avx512f” and “avx512cd” processor flags, which indicate that the Intel AVX-512F and Intel AVX-512CD subsets are supported by this processor. The “avx2” flag is also present, which means the Intel AVX2 ISA is supported as well (although it does not take advantage of the 512-bit vector registers in this processor).
Vectorization Results Using the Intel® C++ Compiler
This section shows a basic vectorization analysis of a fragment of the histogram code. Specifically, two different loops in this code will be analyzed:
LOOP 1: A loop implementing a histogram computation only. This histogram is computed on the input image, stored in floating point single precision in array image1.
LOOP 2: A loop implementing a convolution filter followed by a histogram computation. The filter is applied to the original image in array image1 and then a new histogram is computed on the filtered image stored in array image2.
The following code section shows the two loops mentioned above (image and histogram data have been placed in aligned arrays):
// LOOP 1
#pragma vector aligned
for (position=cols; position<rows*cols-cols; position++) {
    hist1[ int(image1[position]) ]++;
}

(…)

// LOOP 2
#pragma vector aligned
for (position=cols; position<rows*cols-cols; position++) {
    if (position%cols != 0 || position%(cols-1) != 0) {
        image2[position] = ( 9.0f*image1[position]
                             - image1[position-1]      - image1[position+1]
                             - image1[position-cols-1] - image1[position-cols+1] - image1[position-cols]
                             - image1[position+cols-1] - image1[position+cols+1] - image1[position+cols] );
    }
    if (image2[position] >= 0 && image2[position] <= 255)
        hist2[ int(image2[position]) ]++;
}
This code was compiled using Intel C++ Compiler’s option to generate an optimization report as follows:
icpc histogram.cpp -o histogram -O3 -qopt-report=2 -qopt-report-phase=vec -xCORE-AVX2…
Note that, in this case, the -xCORE-AVX2 compiler flag has been used to ask the compiler to use the Intel AVX2 ISA to generate executable code.
The section of the optimization report that the compiler created for the loops shown above looks like this:
LOOP BEGIN at histogram.cpp(92,5)
   remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
   remark #15346: vector dependence: assumed FLOW dependence between line 94 and line 94
LOOP END

LOOP BEGIN at histogram.cpp(92,5)
<Remainder>
LOOP END

LOOP BEGIN at histogram.cpp(103,5)
   remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
   remark #15346: vector dependence: assumed FLOW dependence between line 118 and line 118
LOOP END
As can be seen in this section of the optimization report, the compiler did not vectorize either loop, due to dependences in the lines of code where the histogram updates take place (lines 94 and 118).
Now let’s compile the code using the -xMIC-AVX512 flag, which instructs the compiler to use the Intel AVX-512 ISA:
icpc histogram.cpp -o histogram -O3 -qopt-report=2 -qopt-report-phase=vec -xMIC-AVX512…
This creates the following output for the code segment in the optimization report, showing that both loops have now been vectorized:
LOOP BEGIN at histogram.cpp(92,5)
   remark #15300: LOOP WAS VECTORIZED
   LOOP BEGIN at histogram.cpp(94,8)
      remark #25460: No loop optimizations reported
   LOOP END
LOOP END

LOOP BEGIN at histogram.cpp(92,5)
<Remainder loop for vectorization>
   remark #15301: REMAINDER LOOP WAS VECTORIZED
   LOOP BEGIN at histogram.cpp(94,8)
      remark #25460: No loop optimizations reported
   LOOP END
LOOP END

LOOP BEGIN at histogram.cpp(103,5)
   remark #15300: LOOP WAS VECTORIZED
   LOOP BEGIN at histogram.cpp(118,8)
      remark #25460: No loop optimizations reported
   LOOP END
LOOP END

LOOP BEGIN at histogram.cpp(103,5)
<Remainder loop for vectorization>
   remark #15301: REMAINDER LOOP WAS VECTORIZED
The results from the compiler reports can be summarized as follows:
- LOOP 1, which implements a histogram computation, is not vectorized when using the Intel AVX2 flag because of an assumed dependency (described in section 3 of this document). However, the loop was vectorized when using the Intel AVX-512 flag, which means the compiler resolved the dependency using instructions present in the Intel AVX-512 ISA.
- LOOP 2 gets the same diagnostics as LOOP 1. The difference between the two loops is that LOOP 2 adds, on top of the histogram computation, a filter operation that has no dependencies and would otherwise be vectorizable. The presence of the histogram computation prevents the compiler from vectorizing the entire loop when using the Intel AVX2 flag.
Note: As can be seen in the section of the optimization report shown above, the compiler split each loop into two parts: the main loop and the remainder loop. The remainder loop contains the last few iterations of the loop (those that do not completely fill the vector unit). The compiler usually does this unless it knows in advance that the total number of iterations will be a multiple of the vector length.
We will ignore the remainder loop in this document. Ways to improve performance by eliminating the remainder loop are described in the literature.
Analyzing Performance of the Code
Performance of the above code segment was analyzed by adding timing instructions at the beginning and at the end of each of the two loops, so that the time spent in each loop can be compared across executables generated with different compiler options.
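The timing code itself is not listed in this document. As a rough illustration only (the helper below is hypothetical; the actual experiments may have used a different timer), the instrumentation can look like this:

#include <chrono>
#include <cstdio>

// Hypothetical timing helper: runs a callable and returns the elapsed time in seconds.
template <typename LoopBody>
double time_loop(LoopBody body) {
    auto start = std::chrono::steady_clock::now();
    body();                                            // place LOOP 1 or LOOP 2 here
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

int main() {
    double seconds = time_loop([] {
        volatile double x = 0.0;
        for (int i = 0; i < 1000000; ++i) x += i;      // stand-in for the histogram loop
    });
    std::printf("loop time: %f s\n", seconds);
    return 0;
}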
The table below shows the timing results of executing, on a single core, the vectorized and non-vectorized versions of the code (results are the average of 5 executions) using the input image without preprocessing. Baseline performance is defined here as the performance of the non-vectorized code generated by the compiler when using the Intel AVX2 compiler flag.
Test case | Loop | Baseline (Intel® Advanced Vector Extensions 2) | Speedup Factor with Vectorization (Intel® Advanced Vector Extensions 512) |
Input image | LOOP 1 | 1 | 2.2 |
| LOOP 2 | 1 | 7.0 |
To further analyze the performance of the code as a function of the input data, the input image was preprocessed using blurring and sharpening filters. Blurring filters have the effect of smoothing the image, while sharpening filters increase the contrast of the image. Blurring and sharpening filters are available in image processing or computer vision libraries. In this document, we used the OpenCV* library to preprocess the test image.
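As a rough illustration of this kind of preprocessing (the file names, kernel size, and sharpening coefficients below are assumptions, not the exact filters used for these experiments), blurring and sharpening a grayscale image with OpenCV might look like this:

#include <opencv2/opencv.hpp>

int main() {
    // "input.png" is a placeholder file name.
    cv::Mat gray = cv::imread("input.png", cv::IMREAD_GRAYSCALE);
    if (gray.empty()) return 1;

    // Blurring: smooths the image, making neighboring pixel values more similar.
    cv::Mat blurred;
    cv::GaussianBlur(gray, blurred, cv::Size(5, 5), 0);

    // Sharpening: a simple high-pass kernel that increases local contrast.
    cv::Mat kernel = (cv::Mat_<float>(3, 3) <<  0, -1,  0,
                                               -1,  5, -1,
                                                0, -1,  0);
    cv::Mat sharpened;
    cv::filter2D(gray, sharpened, -1, kernel);

    cv::imwrite("blurred.png", blurred);
    cv::imwrite("sharpened.png", sharpened);
    return 0;
}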
The table below shows the timing results for the three experiments:
Test case | Loop | Baseline (Intel® Advanced Vector Extensions 2) | Speedup Factor with Vectorization (Intel® Advanced Vector Extensions 512) |
Input image | LOOP 1 | 1 | 2.2 |
| LOOP 2 | 1 | 7.0 |
Input image sharpened | LOOP 1 | 1 | 2.6 |
| LOOP 2 | 1 | 7.4 |
Input image blurred | LOOP 1 | 1 | 1.7 |
| LOOP 2 | 1 | 5.6 |
Looking at the results above, three questions arise:
- Why does the compiler vectorize the code when using the Intel AVX-512 flag, but not when using the Intel AVX2 flag?
- If the code in LOOP 1 is indeed vectorized with the Intel AVX-512 ISA, why is the performance improvement relatively small compared to the theoretical speedup available with 512-bit vectors?
- Why does the performance gain of the vectorized code change when the image is preprocessed? Specifically, why does the performance of the vectorized code increase for a sharpened image and decrease for a blurred image?
In the next section, these questions will be answered based on a discussion of one of the subsets of the Intel AVX-512 ISA: the Intel AVX-512CD (conflict detection) subset.
The Intel AVX-512CD Subset
The Intel AVX-512CD (Conflict Detection) subset of the Intel AVX512 ISA adds functionality to detect data conflicts in the vector registers. In other words, it provides functionality to detect which elements in a vector operand are identical. The result of this detection is stored in mask vectors, which are used in the vector computations, so that the histogram operation (updating the histogram array) will be performed only on elements of the array (which represent pixel values in the image) that are different.
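Conceptually, the conflict detection step can be emulated in scalar code as follows. This is only a sketch of the idea, not the actual instruction semantics or the code the compiler generates: for each lane of a 16-element index vector, it records whether an earlier lane holds the same histogram index.

#include <cstdint>
#include <cstdio>

int main() {
    const int VLEN = 16;                 // 16 single precision elements per 512-bit register
    int idx[VLEN] = {10, 12, 10, 40, 41, 12, 12, 90, 7, 7, 3, 200, 10, 55, 56, 57};

    // Bit i of the mask is set if lane i repeats the index of an earlier lane.
    // Lanes whose bit is clear are conflict-free and can update the histogram simultaneously;
    // the remaining lanes are handled in additional passes.
    uint16_t conflict_mask = 0;
    for (int i = 0; i < VLEN; ++i)
        for (int j = 0; j < i; ++j)
            if (idx[i] == idx[j]) { conflict_mask |= (uint16_t)(1u << i); break; }

    std::printf("conflict mask = 0x%04x\n", conflict_mask);
    return 0;
}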
To further explore how the new instructions from the Intel AVX-512CD subset work, it is possible to ask the compiler to generate an assembly code file by using the Intel C++ Compiler -S option:
icpc example2.cpp -o example2.s -O3 -xMIC-AVX512 -S …
The above command will create, instead of the executable file, a text file containing the assembly code for our C++ source code. Let’s take a look at part of the section of the assembly code that implements line 94 (the histogram update) in LOOP 1 in the example source code:
vcvttps2dq  (%r9,%rax,4), %zmm5                 #94.19 c1
vpxord      %zmm2, %zmm2, %zmm2                 #94.8 c1
kmovw       %k1, %k2                            #94.8 c1
vpconflictd %zmm5, %zmm3                        #94.8 c3
vpgatherdd  (%r12,%zmm5,4), %zmm2{%k2}          #94.8 c3
vptestmd    %zmm0, %zmm3, %k0                   #94.8 c5
kmovw       %k0, %r10d                          #94.8 c9 stall 1
vpaddd      %zmm1, %zmm2, %zmm4                 #94.8 c9
testl       %r10d, %r10d                        #94.8 c11
je          ..B1.165        # Prob 30%          #94.8 c13
In the above code fragment, vpconflictd detects conflicts in the source vector register (containing the pixel values) by comparing its elements with each other and writing the result of the comparison as a bit vector to the destination. This result is then tested to define, via a mask vector, which elements in the vector register will be used simultaneously for the histogram update. (The vpconflictd instruction is part of the Intel AVX-512CD subset, and the vptestmd instruction is part of the Intel AVX-512F subset; specific information about these subsets can be found in the Intel AVX-512 ISA documentation (Intel, 2016).) This process is illustrated in Figures 2 and 3.
Figure 2: Pixel values in array (smooth image).
Figure 3: Pixel values in array (sharp image).
Figure 2 shows the case where some neighboring pixels in the image have the same value. Only the elements in the vector register that have different values in the array image1 will be used to simultaneously update the histogram. In other words, only the elements that will not create a conflict will be used to simultaneously update the histogram. The elements in conflict will still be used to update the histogram, but at a different time.
In this case, the performance will vary depending on how smooth the image is. The worst-case scenario is when all the elements in the vector register are identical, which decreases performance considerably, not only because the loop is then effectively processed in scalar mode, but also because of the overhead introduced by the conflict detection and testing instructions.
Figure 3 shows the case where the image was sharpened. In this case it is more likely that neighboring pixels in the vector register will have different values. Most or all of the elements in the vector register will be used to update the histogram, thereby increasing the performance of the loop because more elements will be processed simultaneously in the vector register.
It is clear that the best performance will be obtained when all elements in the array are different. However, the best performance will still be less than the theoretical speedup (16x in this case), because of the overhead introduced by the conflict detection and testing instructions.
The above discussion can be used to get answers to the questions that arose in section 5.
Regarding the first question, about why the compiler generates vectorized code when using the Intel AVX-512 flag: the Intel AVX-512CD and Intel AVX-512F subsets include new instructions that detect conflicts among the elements processed in each vector iteration and create conflict-free subsets of elements that can be safely vectorized. The size of these subsets is data dependent. Vectorization was not possible when using the Intel AVX2 flag because the Intel AVX2 ISA does not include conflict detection functionality.
The second question, about the reduced performance (compared to the theoretical speedup) obtained with the vectorized code, can be answered by considering the overhead introduced when the conflict detection and testing instructions are executed. This performance penalty is most noticeable in LOOP 1, where the only computation that takes place is the histogram update.
However, in LOOP 2, where extra work is performed on top of the histogram update, the performance gain relative to the baseline increases. The compiler, using the Intel AVX-512 flag, resolves the dependency created by the histogram computation, increasing the total performance of the loop. In the Intel AVX2 case, the dependency in the histogram computation prevents other computations in the loop (even those that are dependency-free) from running in vector mode. This is an important result of using the Intel AVX-512CD subset: the compiler can now generate vectorized code for more complex loops that include histogram-like dependencies, loops that previously might have required code rewriting in order to be vectorized.
Regarding the third question, note that the total performance of the vectorized loops becomes data dependent when the conflict detection mechanism is used. As shown in Figures 2 and 3, the speedup when running in vector mode depends on how many values in the vector register are conflict-free (not identical). Sharp or noisy images are less likely to have similar or identical values in neighboring pixels than smooth or blurred images.
Conclusions
This article showed a simple example of a loop that, because of possible memory conflicts, was not vectorized by the Intel C++ compiler using the Intel AVX2 (and earlier) instruction sets, but is now vectorized when using the Intel AVX-512 ISA on an Intel Xeon Phi processor. In particular, the new functionality in the Intel AVX-512CD and Intel AVX-512F subsets (currently available in the Intel Xeon Phi processor and in future Intel Xeon processors) lets the compiler automatically generate vector code for this kind of application with no changes to the code. However, the performance of vector code created this way will generally be lower than that of an application running in full vector mode, and it will also be data dependent, because the compiler vectorizes this application using mask registers whose contents vary depending on how similar neighboring data values are.
The intent of this document is to motivate the use of the new functionality in the Intel AVX-512CD and Intel AVX-512F subsets. In future documents, we will explore more possibilities for vectorizing complex loops by taking explicit control of the logic that updates the mask vectors, with the purpose of increasing vectorization efficiency.
References
Colfax International. (2015). "Optimization Techniques for the Intel MIC Architecture. Part 2 of 3: Strip-Mining for Vectorization." Retrieved from http://colfaxresearch.com/optimization-techniques-for-the-intel-mic-architecture-part-2-of-3-strip-mining-for-vectorization/
Intel. (2016, February). "Intel® Architecture Instruction Set Extensions Programming Reference." Retrieved from https://software.intel.com/sites/default/files/managed/b4/3a/319433-024.pdf
Wikipedia. (n.d.). "Image histogram." Retrieved from https://en.wikipedia.org/wiki/Image_histogram
Zhang, B. (2016). "Guide to Automatic Vectorization With Intel AVX-512 Instructions in Knights Landing Processors."
Analyzing GTC-P APEX code using Intel® Advisor on an Intel® Xeon Phi™ processor
Introduction
In this article, we describe how we achieved 35% faster performance for the GTC-P APEX code using Intel® Advisor on an Intel® Xeon Phi™ processor code-named Knights Landing (KNL). Using Intel® Advisor, we identified the five most time-consuming loops, four of which were scalar. The goal of our work was to vectorize these loops and get the best possible vectorization efficiency out of them, thereby increasing the overall performance of the application. The main and most time-consuming loop was taking 86% of the whole execution time; by vectorizing it, we improved the overall application performance by 30%. The remaining 5% of performance gain was achieved by vectorizing the other four loops that Intel® Advisor identified.
This article is divided into several sections. First, we describe Intel® Advisor in more detail. Compilation and execution of the GTC-P APEX code are discussed afterwards. Next, we briefly describe the KNL memory architecture. Analysis of the GTC-P APEX code is covered next. Finally, we conclude the article with our findings.
Intel® Advisor
Intel® Advisor is one of the analysis tools provided in the Intel® Parallel Studio XE suite. It has two main functionalities: threading design and vectorization advisor. Threading design is used when a developer is unsure where to introduce parallelism in a single-threaded application. In this case, Intel® Advisor can analyze the application, detect regions of code where parallelism would be beneficial, and estimate the potential performance speedup the application would gain if parallelism were introduced in these code regions. Furthermore, if the developer decides to make the specified code regions parallel, Intel® Advisor can analyze the code to detect data sharing issues that could occur. Thus, the biggest advantage of threading design is that, during the design stage, Intel® Advisor can suggest where to introduce parallelism, estimate the potential speedup, and detect potential data sharing issues even before the developer implements these changes.
The second functionality, and the one we will use in this article, is vectorization advisor. There are several analyses that vectorization advisor provides:
- Survey collection - provides detailed information about your application from a vectorization standpoint. It analyzes your code and reports all the loops it detected, whether they were vectorized or not, the reasons the compiler did not vectorize scalar loops, the vectorization efficiency of each vectorized loop, recommendations on how to improve the vectorization efficiency if it is low, and suggestions on how to enable vectorization for scalar loops.
- Tripcount collection - collects the trip counts and the number of times each loop is called. The trip count information is useful when you don't know whether vectorization would benefit a loop or when you are trying to understand why vectorization efficiency is low.
- Dependencies analysis - helps identify whether a loop has any data dependency issues that would prevent it from being vectorized.
- Memory Access Pattern (MAP) analysis - identifies the distribution of the different memory access patterns in a loop body. Advisor distinguishes three patterns: unit-strided, constant-strided, and variable-strided accesses, as illustrated in the sketch after this list. Vectorization works best with unit-strided accesses, and you get lower vectorization efficiency for a loop with constant- or variable-strided accesses. Thus, MAP analysis helps you understand how much efficiency to expect from vectorization, depending on the distribution of accesses that the analysis reports.
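The sketch below (illustrative loops only, not taken from GTC-P) shows the three access patterns that MAP analysis distinguishes:

#include <cstddef>

void access_patterns(float *a, const float *b, const int *idx, std::size_t n) {
    // Unit stride: consecutive addresses; the best case for vectorization.
    for (std::size_t i = 0; i < n; ++i)
        a[i] = b[i] * 2.0f;

    // Constant stride: every 4th element of b.
    for (std::size_t i = 0; i < n / 4; ++i)
        a[i] = b[4 * i];

    // Variable stride: indirect (gather-like) access through an index array.
    for (std::size_t i = 0; i < n; ++i)
        a[i] = b[idx[i]];
}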
Compilation and running the GTC-P APEX code
The GTC-P APEX code is distributed by NERSC and can be downloaded from the NERSC website.
To compile the code on a KNL machine, make a copy of the Makefile.generic file located in the ARCH subfolder of the main directory and modify it to include the following options:
- -xMIC-AVX512 - tells the compiler to create a binary suitable for KNL and to apply auto-vectorization to the loops.
- -qopt-report=5 - dumps a compiler optimization and vectorization report for each source file to a corresponding report file.
- -g - generates debug information so that symbol names for variables, functions, classes, etc. in the code can be resolved.
The following options were used in our Makefile file to compile the GTC-P APEX code for this work:
CC = mpiicc
CFLAGS = -std=gnu99 -Wall -Wextra -D_GNU_SOURCE -g -xMIC-AVX512 -qopt-report=5
CGTCOPTS += -DCACHELINE_BYTES=64
CFLAGSOMP = -qopenmp
COPTFLAGS = -O3
CDEPFLAGS = -MD
CLDFLAGS = -limf
EXEEXT = _knl_icc
The GTC-P APEX code is a hybrid MPI and OpenMP code. All tests were run on one KNL node with 1 MPI rank and 128 OpenMP threads. The executable can be run with the following command on the KNL machine:
numactl -m 1 mpiexec.hydra -n 1 ./bench_gtc_knl_icc B-1rank.txt 400 1
KNL Memory Architecture
A KNL machine has two memory subsystems. In addition to DDR4 memory, there is a high-bandwidth on-package memory called MCDRAM. MCDRAM can be configured at boot time in three modes: Cache mode, Flat mode, and Hybrid mode. In Cache mode, the 16 GB of MCDRAM is configured as a last-level cache (LLC), whereas in Flat mode it is configured as a separate Non-Uniform Memory Access (NUMA) node. Hybrid mode is a combination of partial Cache and partial Flat modes. You can find out the MCDRAM configuration on a KNL system using the numastat -m command, as seen below:
Per-node system memory usage (in MBs):
| Node 0 | Node 1 | Total |
MemTotal | 98200.81 | 16384.00 | 114584.81 |
MemFree | 88096.82 | 11403.77 | 99500.60 |
MemUsed | 10103.99 | 4980.23 | 15084.21 |
Active | 2494.45 | 4417.58 | 6912.03 |
Inactive | 3128.38 | 10.65 | 3139.02 |
Active(anon) | 2435.44 | 4412.72 | 6848.16 |
Inactive(anon) | 93.94 | 0.66 | 94.59 |
Active(file) | 59.01 | 4.86 | 63.87 |
Inactive(file) | 3034.44 | 9.99 | 3044.43 |
Unevictable | 0.00 | 0.00 | 0.00 |
Mlocked | 0.00 | 0.00 | 0.00 |
Dirty | 0.00 | 0.00 | 0.00 |
Writeback | 0.00 | 0.00 | 0.00 |
FilePages | 3093.59 | 14.86 | 3108.45 |
Mapped | 4.75 | 0.04 | 4.79 |
AnonPages | 23.64 | 0.02 | 23.66 |
Shmem | 0.12 | 0.00 | 0.12 |
KernelStack | 59.89 | 0.06 | 59.95 |
PageTables | 1.52 | 0.00 | 1.52 |
NFS_Unstable | 0.00 | 0.00 | 0.00 |
Bounce | 0.00 | 0.00 | 0.00 |
WritebackTmp | 0.00 | 0.00 | 0.00 |
Slab | 731.95 | 97.56 | 829.51 |
SReclaimable | 155.30 | 14.47 | 169.77 |
SUnreclaim | 576.65 | 83.09 | 659.75 |
AnonHugePages | 0.00 | 0.00 | 0.00 |
HugePages_Total | 0.00 | 0.00 | 0.00 |
HugePages_Free | 0.00 | 0.00 | 0.00 |
HugePages_Surp | 0.00 | 0.00 | 0.00 |
The above configuration shows that our KNL machine was configured in Flat mode, with Node 0 representing 98 GB of DDR4 and Node 1 corresponding to the 16 GB of MCDRAM. Based on the application’s data access pattern, you can decide which memory mode is more beneficial for running the application. With Flat mode, you can also choose which data objects are allocated in MCDRAM and which in DDR4. The advantage of allocating data in MCDRAM is that it provides roughly 4x more memory bandwidth than DDR4. This means that if an application is memory bandwidth bound, MCDRAM can drastically improve its performance compared to DDR4 allocation.
In this work, we specified that we wanted to allocate all data in MCDRAM using the numactl -m 1 option on the command line. This works fine as long as all the data created by the application fits into MCDRAM; if it does not, the application will crash. To avoid this kind of situation, you can use the numactl -p 1 option instead, which ensures that any leftover data that does not fit into MCDRAM is allocated in DDR4. Moreover, you can selectively place specific data objects in MCDRAM and the rest in DDR4. This can be accomplished using memkind library calls in the code in place of the traditional memory allocation routines.
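A minimal sketch of selective allocation with the memkind library's hbwmalloc interface is shown below (assuming the library is installed and the program is linked with -lmemkind; the buffer names and sizes are illustrative only):

#include <hbwmalloc.h>   // hbwmalloc interface from the memkind library
#include <cstdio>
#include <cstdlib>

int main() {
    const std::size_t n = 1 << 20;

    // Bandwidth-critical buffer: request high-bandwidth (MCDRAM) memory.
    double *hot = static_cast<double *>(hbw_malloc(n * sizeof(double)));
    // Less critical buffer: ordinary DDR4 allocation.
    double *cold = static_cast<double *>(std::malloc(n * sizeof(double)));

    if (!hot || !cold) {
        std::fprintf(stderr, "allocation failed\n");
        return 1;
    }

    // ... use the buffers ...

    hbw_free(hot);
    std::free(cold);
    return 0;
}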
Analysis of the GTC-P APEX code using Intel® Advisor
#!/bin/bash
export KMP_AFFINITY=compact,granularity=fine
export KMP_PLACE_THREADS=64c,2t
numactl -m 1 mpiexec.hydra -n 1 advixe-cl -collect survey -project-dir knl-orig -no-auto-finalize -search-dir all=../src/ -data-limit=102400 -- ../src/bench_gtc_knl_icc B-1rank.txt 400 1
We used the above script to run the survey analysis in Intel® Advisor. Notice the use of the -no-auto-finalize option on the command line. Intel® Advisor runs in two steps: collection and finalization of data. The collection step profiles the application and collects raw data from it. The finalization step converts the raw data into a database that is then read via the Intel® Advisor GUI. This step can take quite a lot of time if the raw data is large. In addition, finalizing the result at a very low CPU frequency, such as that of the KNL cores (1.3 GHz), slows down this step even further. Thus, it is advisable to use the -no-auto-finalize option when collecting the profiled data on the KNL system, and then to run the finalization step separately on another Intel Xeon processor-based system with a higher CPU frequency. The finalization step takes place automatically once you open your project directory in the Intel® Advisor GUI.
Figure 1: Survey report showing the top time-consuming loops in the code
The Survey report is the first collection that needs to be run in Intel® Advisor. This collection generates a detailed report, shown in Figure 1, about every loop in the code and whether it was vectorized or not. If a loop was vectorized by the compiler, the survey report shows the vectorization efficiency in the Efficiency column. This metric gives you an idea of how well the loop was vectorized and whether you achieved the maximum performance from vectorizing it. If the efficiency metric is low, the Vector Issues column gives an idea of why vectorization didn't reach 100% efficiency. If a loop was not vectorized by the compiler, the Why No Vectorization column gives the reason the compiler couldn't vectorize it. The Type column shows what type of loop it is, i.e., scalar, threaded, vectorized, or both threaded and vectorized. There are two time metrics in the report: Self Time and Total Time. Self Time shows the time it takes to execute the loop body excluding function calls. If there is a function call inside the loop body, the time to execute the function is included in the Total Time metric. Thus, Total Time comprises the Self Time plus the time it takes to execute function calls inside the loop.
Figure 1 shows the top 10 most time-consuming loops in the code, where orange and blue represent vectorized and non-vectorized loops, respectively. In this article, we focus on the top five loops, trying to vectorize the non-vectorized loops and improve the vectorization efficiency of the vectorized ones.
Code Change #1
The very first loop in the list is not vectorized, and it takes 86.5% of the total execution time. The reason this loop was not vectorized is that the inner loop inside it was already vectorized by the compiler, as shown in the Why No Vectorization column in Figure 1. The default compiler policy is to vectorize the innermost loops first whenever possible. Sometimes, though, outer-loop vectorization can be more beneficial for application performance, namely when the outer loop's trip count is larger than the inner loop's and the memory access pattern inside the loop body is appropriate for outer-loop vectorization. Our outer loop starts at line #306 in the push.c file. The compiler optimization report shows that the compiler skipped vectorizing this loop at line #306 because the inner loop at line #355 was vectorized instead, as seen in Figure 2.
Figure 2: Compiler optimization report for the loop at line #306 in push.c
Figures 3 and 4 show the source code for the loops at line #306 and line #355, respectively.
From the source code, we clearly see that the loop at line #355 has only 4 iterations in total and yet is being vectorized, whereas the loop at line #306, whose much larger trip count is determined by the mi variable (in the millions for our test case), was left unvectorized. The first change we make in the code is to prevent the compiler from vectorizing the inner loop using the pragma novector directive, and to remove the assert statement in the outer loop body that prevents it from being vectorized. This change is marked as change1.
Table 1 shows the execution time report of the original code compared against the modified change1 code with the pragma novector directive. The modified change1 version is almost 5% faster than the original code, which means vectorizing the inner loop actually decreased performance. Figure 5 shows that the outer loop at line #306 is still not vectorized, but the total time for the loop has decreased from 6706 to 5212 seconds.
Code Change #2
The Why No Vectorization field in Figure 5 also suggests considering the SIMD directive, which forces the compiler to vectorize a loop. So the next change in the code, marked as change2, is to apply the pragma simd directive explicitly to this loop.
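A rough sketch of what change1 and change2 look like together is shown below. The loop bounds and variable names are placeholders, not the actual push.c code:

// Placeholder sketch of the push.c loop nest after change1 and change2.
void push_sketch(int mi, const float (*wt)[4], float *wp) {
    #pragma simd                          // change2: explicitly vectorize the outer loop (line #306)
    for (int i = 0; i < mi; ++i) {
        float sum = 0.0f;
        #pragma novector                  // change1: keep the 4-iteration inner loop (line #355) scalar
        for (int j = 0; j < 4; ++j)
            sum += wt[i][j];
        wp[i] = sum;
    }
}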
Figure 5: Survey report for change1 code
Phase | Original Code | Change1 code |
Total time | 128.848269 s | 122.644956 s |
Charge | 51.835024 s (40.2295) | 54.230016 s (44.2171) |
Push | 69.949501 s (54.2883) | 60.748115 s (49.5317) |
Shift_t | 0.000677 s (0.0005) | 0.000926 s (0.0008) |
Shift_r | 0.000201 s (0.0002) | 0.000268 s (0.0002) |
Sorting | 5.833193 s (4.5272) | 6.163052 s (5.0251) |
Poisson | 0.279117 s (0.2166) | 0.370527 s (0.3021) |
Field | 0.365359 s (0.2836) | 0.405065 s (0.3303) |
Smooth | 0.566864 s (0.4399) | 0.696573 s (0.5680) |
Setup | 19.028135 s | 21.083826 s |
Poisson Init | 0.032476 s | 0.033654 s |
Table 2 shows the performance comparison between the original code, the modified code with change1, and the modified code with change2.
Phase | Original Code | Code with change1 | Code with change2 |
Total time | 128.848269 s | 122.644956 s | 88.771399 s |
Charge | 51.835024 s (40.2295) | 54.230016 s (44.2171) | 52.456334 s (59.0915) |
Push | 69.949501 s (54.2883) | 60.748115 s (49.5317) | 29.125329 s (32.8094) |
Shift_t | 0.000677 s (0.0005) | 0.000926 s (0.0008) | 0.000685 s (0.0008) |
Shift_r | 0.000201 s (0.0002) | 0.000268 s (0.0002) | 0.000254 s (0.0003) |
Sorting | 5.833193 s (4.5272) | 6.163052 s (5.0251) | 5.908818 s (6.6562) |
Poisson | 0.279117 s (0.2166) | 0.370527 s (0.3021) | 0.288442 s (0.3249) |
Field | 0.365359 s (0.2836) | 0.405065 s (0.3303) | 0.401597 s (0.4524) |
Smooth | 0.566864 s (0.4399) | 0.696573 s (0.5680) | 0.571354 s (0.6436) |
Setup | 19.028135 s | 21.083826 s | 19.136696 s |
Poisson Init | 0.032476 s | 0.033654 s | 0.031585 s |
The code modification marked as change2, which explicitly vectorizes the outer loop at line #306 in push.c using the pragma simd directive, gained us a 30% performance improvement, as seen in Table 2. The main time difference between the original and the change2 code is in the "Push" phase, which decreased from 69.9 seconds to 29.1 seconds and correctly reflects our change to the code.
Figure 6: Top time-consuming loops for the original version (top) and the change2 version (bottom)
Figure 6 shows the list of top time-consuming loops in the original code (top) versus the change2 code (bottom). The most time-consuming loop in the original code was the loop at line #306 in push.c, which we vectorized in the change2 version. On the bottom list this loop has disappeared, which means it is no longer one of the top time-consuming loops. Instead, all the other loops on the list have moved up one rank, and a new loop has been added at the bottom of the list.
Code Change #3
Next, we focus on the next two most time-consuming loops in the list. Based on the survey analysis report in Figure 1, the compiler was not able to vectorize either of the loops at lines #145 and #128 in the shifti.c file due to an assumed data dependency. However, by inspecting the code, we determined that there is no actual data dependency in these loops. Thus, we vectorized them using the pragma ivdep directive, which tells the compiler to ignore any assumed vector dependency in the loop that follows the directive. However, vectorizing both of these loops did not gain us much performance improvement, as seen in Table 3. In fact, vectorizing the loop at line #128 in shifti.c decreased performance by 6%. Moreover, Figure 7 shows that the vectorized versions of these two loops do not achieve good efficiency, especially the loop at line #128, for which Intel® Advisor suggests that vectorization seems inefficient.
Phase | Change2 version | Vectorized loop at line #145 in shifti.c | Vectorized loop at line #128 in shifti.c |
Total time | 88.771399 s | 87.671250 s | 94.257004 s |
Charge | 52.456334 s (59.0915) | 52.588174 s (59.9834) | 56.887408 s (60.3535) |
Push | 29.125329 s (32.8094) | 29.135723 s (33.2329) | 31.765860 s (33.7013) |
Shift_t | 0.000685 s (0.0008) | 0.000711 s (0.0008) | 0.000672 s (0.0007) |
Shift_r | 0.000254 s (0.0003) | 0.000298 s (0.0003) | 0.000233 s (0.0002) |
Sorting | 5.908818 s (6.6562) | 4.668660 s (5.3252) | 4.322648 s (4.5860) |
Poisson | 0.288442 s (0.3249) | 0.285867 s (0.3261) | 0.287653 s (0.3052) |
Field | 0.401597 s (0.4524) | 0.402622 s (0.4592) | 0.407710 s (0.4326) |
Smooth | 0.571354 s (0.6436) | 0.570873 s (0.6512) | 0.566107 s (0.6006) |
Setup | 19.136696 s | 19.121684 s | 19.098176 s |
Poisson Init | 0.031585 s | 0.032722 s | 0.032612 s |
Figure 7: Survey report for change3 code, where the loops at lines #128 and #145 in shifti.c were vectorized
The reason vectorization of the loop at line #128 seems inefficient is the large number of gather and scatter instructions the compiler has to generate in order to vectorize it, which is not as efficient as regular vector load/store instructions. The vectorization inefficiency of the loop at line #145 is not as bad as that of the loop at line #128, and appears to be caused by a lack of work distributed to the threads once vectorization is introduced on top of threading. Based on these findings, we decided not to vectorize the loop at line #128, to keep the loop at line #145 in shifti.c vectorized, and to move on to the next loop in the list.
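As an illustration of the directive used here (the loop below is a placeholder, not the actual shifti.c code), pragma ivdep tells the compiler to ignore the assumed dependence that otherwise blocks vectorization of an indirectly indexed loop:

void shift_sketch(int n, const int *send_index, const double *z, double *sendbuf) {
    #pragma ivdep                          // ignore the assumed dependence between z and sendbuf
    for (int i = 0; i < n; ++i)
        sendbuf[i] = z[send_index[i]];     // gather-style access that leads to gather instructions
}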
Code Change #4
The next loop in the list of top time-consuming loops is the loop at line #536 in the chargei.c file, which is the inner loop of the doubly nested loop shown in Figure 8.
Figure 8: Source code of the doubly nested loop containing the loop at line #536 in chargei.c
The compiler auto-vectorizes the loop at line #536 in chargei.c, but it achieves only 50% vectorization efficiency, as shown in Figure 7. By default, the compiler auto-vectorizes the innermost loops first and leaves the outer loops non-vectorized when the innermost loop is vectorized. In this case, however, vectorizing the outer loop might be more beneficial, since the trip count of the outer loop is much larger than that of the inner loop. Since there is no data dependency in this doubly nested loop, we can force the compiler to vectorize the outer loop using the pragma simd clause and prevent it from vectorizing the inner loop using the pragma novector clause. These changes are marked as the change4 version.
Figure 9: Survey report for the change4 code
The changes in the change4 code caused the execution time of the "Charge" phase to decrease from 52.45 to 49.36 seconds, which is about a 6% performance increase, and caused a 4% improvement in total execution time, as seen in Table 4. Moreover, the survey report in Figure 9 shows that the total time for the loop at line #536 in chargei.c has decreased from its previous 227.2 seconds to 21.9 seconds.
Phase | Change2 | Change4 |
Total time | 88.771399 s | 85.078972 s |
Charge | 52.456334 s (59.0915) | 49.366594 s (58.0244) |
Push | 29.125329 s (32.8094) | 28.602789 s (33.6191) |
Shift_t | 0.000685 s (0.0008) | 0.000662 s (0.0008) |
Shift_r | 0.000254 s (0.0003) | 0.000233 s (0.0003) |
Sorting | 5.908818 s (6.6562) | 5.852659 s (6.8791) |
Poisson | 0.288442 s (0.3249) | 0.281661 s (0.3311) |
Field | 0.401597 s (0.4524) | 0.370580 s (0.4356) |
Smooth | 0.571354 s (0.6436) | 0.585129 s (0.6877) |
Setup | 19.136696 s | 19.006850 s |
Poisson Init | 0.031585 s | 0.031122 s |
Finally, let’s focus on the last loop in the list of top time-consuming loops, which is the loop at line #77 in the shifti.c file. The compiler was not able to auto-vectorize this loop due to an assumed data dependency. After checking the code, we found that there definitely is a data dependency between iterations of the loop, so vectorizing it would cause a correctness problem. Thus, we did not vectorize this loop.
To see the final performance improvement between the original and the vectorized versions of the GTC-P APEX code, we applied all the vectorization changes to the code and obtained the results shown in Table 5.
Phase | Original version | Modified version |
Total time | 128.848269 s | 83.726301 s |
Charge | 51.835024 s (40.2295) | 49.607312 s (59.2494) |
Push | 69.949501 s (54.2883) | 28.287559 s (33.7858) |
Shift_t | 0.000677 s (0.0005) | 0.000674 s (0.0008) |
Shift_r | 0.000201 s (0.0002) | 0.000259 s (0.0003) |
Sorting | 5.833193 s (4.5272) | 4.585134 s (5.4763) |
Poisson | 0.279117 s (0.2166) | 0.280857 s (0.3354) |
Field | 0.365359 s (0.2836) | 0.370222 s (0.4422) |
Smooth | 0.566864 s (0.4399) | 0.576150 s (0.6881) |
Setup | 19.028135 s | 19.010042 s |
Poisson Init | 0.032476 s | 0.030605 s |
Conclusion
In this article, we vectorized and/or improved the vectorization efficiency of the five most time-consuming loops in the GTC-P APEX code as reported by the Intel® Advisor tool. We were able to improve the vectorization efficiency of three loops, in the push.c, chargei.c, and shifti.c files, which improved the performance of the code by 35% in total compared to the original version. Vectorizing the loop at line #128 in shifti.c did not give much performance improvement, so it was left as is. Moreover, the loop at line #77 in shifti.c was not vectorized because of its data dependency.
Driver Support Matrix for Intel® Media SDK and OpenCL™
Developers can access Intel's processor graphics GPU capabilities through the Intel® Media SDK and Intel® SDK for OpenCL™ Applications. This article provides more information on how the software, driver, and hardware layers map together.
Delivery Models
There are two different packaging/delivery models:
- For Windows* Client: all components needed to run applications written with these SDKs are distributed with the Intel graphics driver. These components are intended to be updated on a separate cadence from Media SDK/OpenCL installs. Drivers are released separately, and moving to the latest available driver is usually encouraged. Use the Intel® Driver Update Utility to keep your system up to date with the latest graphics drivers, or manually update from downloadcenter.intel.com. To verify the driver version installed on the machine, use the system analyzer tool.
- For Linux* and Windows Server*: Intel® Media Server Studio is an integrated software tools suite that includes both SDKs, plus a specific version of the driver validated with each release.
Driver Branches
Driver development uses branches covering specific hardware generations, as described in the table below. The general pattern is that each branch covers only the two latest architectures (N and N-1). This means there are two driver branches for each architecture except the newest one. Intel recommends using the most recent branch. If issues are found it is easier to get fixes for newer branches. The most recent branch has the most resources and gets the most frequent updates. Older branches/architectures get successively fewer resources and updates.
Processor Architecture | Intel® Integrated Graphics | Windows | Linux |
---|---|---|---|
3rd Generation Core, 4th Generation Core (Ivybridge/Haswell) LEGACY ONLY, downloads available but not updated | Ivybridge - Gen 7 Graphics Haswell - Gen 7.5 graphics | 15.33 Operating Systems: Client: Windows 7, 8, 8.1, 10 Server: Windows Server 2012 r2 | 16.3 (Media Server Studio 2015 R1) Gold Operating Systems: Ubuntu 12.04, SLES 11.3 |
4th Generation Core, 5th Generation Core (Haswell/Broadwell) LEGACY | Haswell - Gen 7.5 graphics Broadwell - Gen 8 graphics | 15.36 Operating Systems: Client: Windows 7, 8, 8.1, 10 Server: Windows Server 2012 r2 | 16.4 (Media Server Studio 2015/2016) Gold Operating Systems: CentOS 7.1 Generic kernel: 3.14.5 |
5th Generation Core, 6th Generation Core (Broadwell/Skylake) CURRENT RELEASE | Broadwell - Gen 8 graphics Skylake - Gen 9 graphics | 15.40 (Broadwell/Skylake Media Server Studio 2017) 15.45 (Skylake + forward, client) Operating Systems: Client: Windows 7, 8, 8.1, 10 Server: Windows Server 2012 r2 | 16.5 (Media Server Studio 2017) Gold Operating Systems: CentOS 7.2 Generic kernel: 4.4.0 |
Windows client note: Many OEMs have specialized drivers with additional validation. If you see a warning during install please check with your OEM for supported drivers for your machine.
Hardware details
Ivybridge (IVB) is the codename for the 3rd generation Intel processor, based on 22nm manufacturing technology and the Gen 7 graphics architecture.
Ivybridge Gen7 3rd Generation Core | GT2: Intel® HD Graphics 2500, GT2: Intel® HD Graphics 4000 |
Haswell (HSW) is the codename for the 4th generation Intel processor, based on 22nm manufacturing technology and the Gen 7.5 graphics architecture. Available in multiple graphics versions: GT2 (20 Execution Units), GT3 (40 Execution Units), and GT3e (40 Execution Units + eDRAM to provide a faster secondary cache).
Haswell Gen 7.5 4th Generation Core | GT2: Intel® HD Graphics 4200, GT2: Intel® HD Graphics 4400, GT2: Intel® HD Graphics 4600, GT3: Intel® Iris™ Graphics 5000, GT3: Intel® Iris™ Graphics 5100, GT3e: Intel® Iris™ Pro Graphics 5200 |
Broadwell (BDW) is the codename for the 5th generation Intel processor, based on a 14nm die shrink of the Haswell architecture and the Gen 8 graphics architecture. Available in multiple graphics versions: GT2 (24 Execution Units), GT3 (48 Execution Units), and GT3e (48 Execution Units + eDRAM to provide a faster secondary cache).
Broadwell Gen8 5th Generation Core | GT2: Intel® HD Graphics 5500, GT2: Intel® HD Graphics 5600, GT2: Intel® HD Graphics 5700, GT3: Intel® Iris™ Graphics 6100, GT3e: Intel® Iris™ Pro Graphics 6200 |
Skylake (SKL) is the codename for the 6th generation Intel processor, based on 14nm manufacturing technology and the Gen 9 graphics architecture. Available in multiple graphics versions: GT1 (12 Execution Units), GT2 (24 Execution Units), GT3 (48 Execution Units), GT3e (48 Execution Units + eDRAM), and GT4e (72 Execution Units + eDRAM to provide a faster secondary cache).
Skylake Gen9 6th Generation Core | GT1: Intel® HD Graphics 510 (12 EUs), GT2: Intel® HD Graphics 520 (24 EUs, 1050MHz), GT2: Intel® HD Graphics 530 (24 EUs, 1150MHz), GT3e: Intel® Iris™ Graphics 540 (48 EUs, 1050MHz, 64 MB eDRAM), GT3e: Intel® Iris™ Graphics 550 (48 EUs, 1100MHz, 64 MB eDRAM), GT4e: Intel® Iris™ Pro Graphics 580 (72 EUs, 1050 MHz, 128 MB eDRAM), GT4e: Intel® Iris™ Pro Graphics p580 (72 EUs, 1100 MHz, 128 MB eDRAM) |
For more details please check
OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.
Hybrid Parallelism: A MiniFE* Case Study
In my first article, Hybrid Parallelism: Parallel Distributed Memory and Shared Memory Computing, I discussed the chief forms of parallelism: shared memory parallel programming and distributed memory message passing parallel programming. That article explained the basics of threading for shared memory programming and Message Passing Interface* (MPI) message passing. It included an analysis of a hybrid hierarchical version of one of the NAS Parallel Benchmarks*. In that case study the parallelism for threading was done at a lower level than the parallelism for MPI message passing.
This case study examines the situation where the problem decomposition is the same for threading as it is for MPI; that is, the threading parallelism is elevated to the same level as the MPI parallelism. The reasons to elevate threading to the same parallel level as MPI message passing are to see whether performance improves because thread libraries have less overhead than MPI calls, and to check whether memory consumption is reduced by using threads rather than large numbers of MPI ranks.
This paper provides a brief overview of threading and MPI followed by a discussion of the changes in the miniFE*. Performance results are also shared. The performance gains are minimal, but threading consumed less memory, and data sets 15 percent larger could be solved using the threaded model. Indications are that the main benefit is memory usage. This article will be of most interest to those who want to optimize for memory footprint, especially those working on first generation Intel® Xeon Phi™ coprocessors (code-named Knights Corner), where available memory on the card is limited.
Many examples of hybrid distributed/shared memory parallel programming follow a hierarchical approach: MPI distributed memory programming is done at the top, and shared memory parallel programming is introduced in multiple regions of the software underneath (OpenMP* is a popular choice). Many software developers design good problem decomposition for MPI programming. However, when using OpenMP, some developers fall back to simply placing pragmas around do or for loops without considering overall problem decomposition and data locality. For this reason, some say that if they want good parallel threaded code, they would write the MPI code first and then port it to OpenMP, because they are confident this would force them to implement good problem decomposition with good data locality.
This leads to another question: if good performance can be obtained either way, does it matter whether a threading model or MPI is used? There are two things to consider. One, of course, is performance: is MPI or threading inherently faster than the other? The second consideration is memory consumption. When an MPI job begins, an initial number of MPI processes, or ranks, that will cooperate to complete the overall work is specified. As ever larger problem sizes or data sets are run, the number of systems in a cluster dedicated to a particular job increases, and thus the number of MPI ranks increases. As the number of MPI ranks increases, the MPI runtime libraries consume more memory in order to be ready to handle a larger number of potential messages (see Hybrid Parallelism: Parallel Distributed Memory and Shared Memory Computing).
This case study compares both performance and memory consumption. The code used in this case study is miniFE. MiniFE was developed at Sandia National Labs and is now distributed as part of the Mantevo project (see mantevo.org).
Shared Memory Threaded Parallel Programming
In the threading model, all the resources belong to the same process. Memory belongs to the process, so sharing memory between threads is easy: at least one common pointer or address must be passed into each thread so that every thread can access the shared memory regions. Each thread has its own instruction pointer and stack.
A problem with threaded software is the potential for data race conditions. A data race occurs when two or more threads access the same memory address and at least one of the threads alters the value in memory. Whether the writing thread completes its write before or after the reading thread reads the value can change the result of the computation. Mutexes, barriers, and locks were designed to control execution flow, protect memory, and prevent data races. They create other problems, however: deadlock can occur, preventing any forward progress in the code, and contention for mutexes or locks can restrict execution flow and become a bottleneck. Mutexes and locks are not a simple cure-all; if not used correctly, data races can still exist. Placing locks that protect code segments rather than memory references is the most common error.
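A minimal C++ illustration of this (using std::thread and std::mutex rather than the threading models discussed later) is shown below; note that the lock protects the shared memory update itself, not merely a section of code:

#include <cstdio>
#include <mutex>
#include <thread>

int counter = 0;               // shared memory: visible to every thread in the process
std::mutex counter_mutex;

void worker(int iterations) {
    for (int i = 0; i < iterations; ++i) {
        std::lock_guard<std::mutex> lock(counter_mutex);  // protects the memory update
        ++counter;             // without the lock, this read-modify-write would race
    }
}

int main() {
    std::thread t1(worker, 100000);
    std::thread t2(worker, 100000);
    t1.join();
    t2.join();
    std::printf("counter = %d\n", counter);  // 200000 with the lock; unpredictable without it
    return 0;
}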
Distributed Memory MPI Parallel Programming
Distributed memory parallel programming models offer a range of methods with MPI; the discussion in this case study uses the traditional explicit message passing interface, whose most commonly used elements are the message passing constructs. Any data that one MPI rank has that may be needed by another MPI rank must be explicitly sent by the first rank to the ranks that need it. In addition, the receiving MPI rank must explicitly request to receive the data before it can access and use it. The developer must define the buffers used to send and receive data, as well as pack or unpack them if necessary; if data is received into its desired location, it doesn't need to be unpacked.
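A minimal example of this explicit pairing (not taken from miniFE) is shown below: rank 0 sends a small buffer, and rank 1 must post a matching receive before it can use the data.

#include <cstdio>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buffer[4] = {0.0, 0.0, 0.0, 0.0};
    if (rank == 0) {
        for (int i = 0; i < 4; ++i) buffer[i] = i + 1.0;
        MPI_Send(buffer, 4, MPI_DOUBLE, 1, 99, MPI_COMM_WORLD);        // explicit send
    } else if (rank == 1) {
        MPI_Recv(buffer, 4, MPI_DOUBLE, 0, 99, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                                   // explicit matching receive
        std::printf("rank 1 received %f %f %f %f\n",
                    buffer[0], buffer[1], buffer[2], buffer[3]);
    }

    MPI_Finalize();
    return 0;
}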
Finite Element Analysis
In finite element analysis, a physical domain whose behavior is modeled by partial differential equations is divided into very small regions called elements. A set of basis functions (often polynomials) is defined for each element, and the parameters of these basis functions approximate the solution to the partial differential equations within each element. The solution phase is typically a step of minimizing the difference between the true value of the physical property and the value approximated by the basis functions. These operations form a linear system of equations for each finite element, known as an element stiffness matrix. Each element stiffness matrix is added into a global stiffness matrix, which is solved to determine the values of interest. The solution values represent a physical property: displacement, stress, density, velocity, and so on.
MiniFE is representative of more general finite element analysis packages. In a general finite element program the elements may be irregular, of varying size, and of different physical properties. Only one element type and one domain are used within miniFE. The domain is always a rectangular prism that is divided into an integral number of elements along the three major axes: x, y, and z. The rectangular prism is then sliced parallel to the principal planes recursively to create smaller sets of finite elements. This is illustrated in Figure 1.
The domain is divided into several regions, or subdomains, each containing a set of finite elements. Figure 1 illustrates the recursive splitting. The dark purple line near the center of the prism on the left shows where the domain may be divided into two subdomains. Further splitting may occur, as shown by the two green lines. The figure on the right shows the global domain split into four subdomains. This example shows the splitting occurring perpendicular to the z-axis; however, the splitting may be done parallel to any plane. The splitting is done recursively to obtain the desired number of subdomains. In the original miniFE code, each of these subdomains is assigned to an MPI rank: one subdomain per MPI rank. Each MPI rank determines a local numbering of its subdomain and maps that numbering to the numbering of the global domain. For example, an MPI rank may hold a 10×10×10 subdomain. Locally it would number this from 0–9 in the x, y, and z directions. Globally, however, this may belong to the region numbered 100–109 on the x-axis, 220–229 on the y-axis, and 710–719 along the z-axis. Each MPI rank determines the MPI ranks with which it shares edges or faces, and then initializes the buffers it will use to send and receive data during the conjugate gradient solver phase used to solve the linear system of equations. There are additional MPI communication costs; when a dot product is formed, each rank calculates its local contribution to the dot product, and then the value must be reduced across all MPI ranks.
The miniFE code has options to use threading models to parallelize the for loops around many activities. In this mode, the MPI ranks do not further subdivide their subdomains into multiple smaller subdomains, one per thread. Instead, for loops within the calculations for each subdomain are divided into parallel tasks using OpenMP*, Intel® Cilk™ Plus, or qthreads*. Initial performance data on the original reference code showed that running a specified problem set on 10 MPI ranks without threading was much faster than running one MPI rank with 10 threads.
So the division among threads was not as efficient as the division between MPI ranks. Software optimization should begin with performance analysis, and I used the TAU* Performance System for this. The data showed that the waxpy (vector _axpy) operations consumed much more time in the hybrid thread/MPI version than in the MPI-only version. The waxpy operation is inherently parallel: it doesn't involve a reduction like a dot product, and there are no potential data-sharing problems that would complicate threading. The only reason for the waxpy operation to consume more time is that the threading models used are all fork-join models. That is, the work for each thread is forked off at the beginning of a for loop, and all the threads join back again at the end. The effort to initiate computation at the fork and then synchronize at the end adds considerable overhead, which is not present in an MPI-only version of the code.
The original miniFE code divided the domain into a number of subdomains that matches the number of MPI ranks (this number is called numprocs). The identifier for each subdomain was the MPI rank number, called myproc. In the new hybrid code the original domain is divided into a number of subdomains that matches the total number of threads globally; this is the number of MPI ranks times the number of threads per rank (numthreads). This global count of subdomains is called idivs (idivs = numprocs * numthreads). Each thread is given a local identifier, mythread (beginning at zero, of course). The identifier of each subdomain changes from myproc to mypos (mypos = myproc * numthreads + mythread). When there is only one thread per MPI rank, mypos and myproc are equal. Code changes were implemented in each file to change references to numprocs to idivs, and myproc to mypos. A new routine was written between the main program and the routine driver. The main program forks off the number of threads indicated. Each thread begins execution of this new routine, which then calls driver, and in turn each thread calls all of the subroutines that execute the full code path below the routine driver.
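A minimal sketch of this identifier mapping follows. It is illustrative only (the rank and thread counts are assumed), but it makes clear that mypos reduces to myproc when numthreads is 1.

// Illustrative sketch (not the actual miniFE code) of the identifier mapping
// described above: idivs is the global subdomain count, mypos the per-thread
// subdomain identifier.
#include <cstdio>

int main() {
  int numprocs   = 4;   // number of MPI ranks (assumed for illustration)
  int numthreads = 8;   // threads per MPI rank (assumed for illustration)
  int idivs = numprocs * numthreads;   // total subdomains, one per thread

  for (int myproc = 0; myproc < numprocs; ++myproc) {
    for (int mythread = 0; mythread < numthreads; ++mythread) {
      int mypos = myproc * numthreads + mythread;  // subdomain identifier
      std::printf("rank %d thread %d -> subdomain %d of %d\n",
                  myproc, mythread, mypos, idivs);
    }
  }
  return 0;
}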
The principle of dividing work into compact local regions, or subdomains, remains the same. For example, when a subdomain needs to share data with an adjacent subdomain it loops through all its neighbors and shares the necessary data. The code snippets below show the loops for sharing data from one subdomain to another in the original code and the new code. In these code snippets each subdomain is sending data to its adjacent neighbors with which it shares faces or edges. In the original code each subdomain maps to an MPI rank. These code snippets come from the file exchange_externals.hpp. The original code is shown below in the first text box. Comments are added to increase clarity.
Original code showing sends for data exchanges:
// prepare data to send to neighbors by copying data into send buffers
for(size_t i=0; i<total_to_be_sent; ++i) {
  send_buffer[i] = x.coefs[elements_to_send[i]];
}

// loop over all adjacent subdomain neighbors – send data to each neighbor
Scalar* s_buffer = &send_buffer[0];
for(int i=0; i<num_neighbors; ++i) {
  int n_send = send_length[i];
  MPI_Send(s_buffer, n_send, mpi_dtype, neighbors[i], MPI_MY_TAG, MPI_COMM_WORLD);
  s_buffer += n_send;
}
New code showing sends for data exchanges:
// loop over all adjacent subdomain neighbors – communicate data to each neighbor
for(int i=0; i<num_neighbors; ++i) {
  int n_send = send_length[i];
  if (neighbors[i]/numthreads != myproc) {
    // neighbor is in a different MPI rank: pack and send the data
    for (int ij = ibuf; ij < ibuf + n_send; ++ij)
      send_buffer[ij] = x.coefs[elements_to_send[ij]];
    MPI_Send(s_buffer, n_send, mpi_dtype, neighbors[i]/numthreads,
             MPI_MY_TAG+(neighbors[i]*numthreads)+mythread, MPI_COMM_WORLD);
  } else {
    // neighbor is another thread in this MPI rank:
    // wait until the recipient flags it is safe to write, then write
    while (sg_sends[neighbors[i]%numthreads][mythread]);
    stmp = (Scalar *) (sg_recvs[neighbors[i]%numthreads][mythread]);
    for (int ij = ibuf; ij < ibuf + n_send; ++ij)
      stmp[ij-ibuf] = x.coefs[elements_to_send[ij]];
    // set flag that the write completed
    sg_sends[neighbors[i]%numthreads][mythread] = 2;
  }
  s_buffer += n_send;
  ibuf += n_send;
}
In the new code each subdomain maps to a thread. So each thread now communicates with threads responsible for neighboring subdomains. These other threads may or may not be in the same MPI rank. The setup of communicating data remains nearly the same. When communication mapping is set up, a vector of pointers is shared within each MPI rank. When communication is between threads in the same MPI rank (process), a buffer is allocated and both threads have access to the pointer to that buffer. When it is time to exchange data, a thread loops through all its neighbors. If the recipient is in another MPI rank, the thread makes a regular MPI send call. If the recipient is in the same process as the sender, the sending thread writes the data to the shared buffer and marks a flag that it completed the write.
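The receive side of that intra-rank handshake is not shown above. The following self-contained C++ sketch illustrates only the pattern (it is not the miniFE code): one thread waits for the shared buffer to be free, writes, and sets a flag; the other thread waits for the flag, reads the data, and clears the flag. All names and values here are assumptions for illustration.

// Self-contained sketch of the flag handshake between two threads that share
// a buffer inside one process. Flag convention assumed: 0 = safe to write,
// 2 = write complete. This illustrates the pattern only, not the miniFE code.
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
  std::vector<double> shared_buffer(4);   // buffer shared by the two threads
  std::atomic<int> flag{0};               // 0 = safe to write, 2 = write done

  std::thread sender([&] {
    while (flag.load() != 0) {}           // wait until the buffer is free
    for (int i = 0; i < 4; ++i) shared_buffer[i] = 10.0 * i;  // "pack" the data
    flag.store(2);                        // signal that the write completed
  });

  std::thread receiver([&] {
    while (flag.load() != 2) {}           // wait for the sender's write
    double sum = 0.0;
    for (double v : shared_buffer) sum += v;
    std::printf("received sum = %g\n", sum);
    flag.store(0);                        // mark the buffer free again
  });

  sender.join();
  receiver.join();
  return 0;
}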
Additional changes were also required. By default, MPI assumes only one thread in a process or the MPI rank sends and receives messages. In this new miniFE thread layout each thread may send or receive data from another MPI rank. This required changing MPI_Init to MPI_Init_thread with the setting MPI_THREAD_MULTIPLE. This sets up the MPI runtime library to behave in a thread-safe manner. It is important to remember that MPI message passing is between processes (MPI ranks) not threads, so by default when a thread sends a message to a remote system there is no distinction made between threads on the remote system. One method to handle this would be to create multiple MPI communicators. If there were a separate communicator for each thread in an MPI rank, a developer could control which thread received the message in the other MPI rank by its selection of the communicator. Another method would be to use different tags for each thread so that the tags identify which thread should receive a particular message. The latter was used in this implementation; MPI message tags were used to control which thread received messages. The changes in MPI message tags can be seen in the code snippets as well. In miniFE the sender fills the send buffer in the order the receiver prefers. Thus the receiver does not need to unpack data on receipt and can use the data directly from the receive buffer destination.
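The sketch below shows the MPI_Init_thread call with MPI_THREAD_MULTIPLE and a simplified per-thread tag scheme. The tag expression here is an assumption for illustration only; the actual miniFE change uses the tag expression visible in the send code above.

// Minimal sketch: request full MPI thread support and build a per-thread tag.
// The tag scheme and the BASE_TAG, numthreads, dest_thread values are
// illustrative assumptions, not the miniFE expressions.
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
  int provided = 0;
  // Request MPI_THREAD_MULTIPLE so any thread may call MPI send/receive.
  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
  if (provided < MPI_THREAD_MULTIPLE) {
    std::fprintf(stderr, "MPI library does not provide MPI_THREAD_MULTIPLE\n");
    MPI_Abort(MPI_COMM_WORLD, 1);
  }

  // Encode the destination thread in the tag so the receiving rank can route
  // the message to the correct thread.
  const int BASE_TAG = 99;            // assumed base tag for this example
  int numthreads  = 8;                // threads per rank, assumed
  int dest_thread = 3;                // destination thread, assumed
  int tag = BASE_TAG + dest_thread;   // the receiver posts MPI_Recv with this tag
  std::printf("tag for thread %d of %d: %d\n", dest_thread, numthreads, tag);

  MPI_Finalize();
  return 0;
}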
Dot products are more complicated, because they are handled in a hierarchical fashion. First, each thread contributes to a local sum within its MPI rank. One thread then makes the MPI_Allreduce call, while the other threads wait at a thread barrier until the MPI_Allreduce completes and the result is recorded in a location where all threads can read it.
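The following self-contained sketch shows this hierarchy using OpenMP for brevity (the port itself used Intel TBB threads): each thread accumulates a partial sum, the partials are combined into a rank-local sum, one thread calls MPI_Allreduce, and the other threads wait at a barrier for the global result. The vector length and contents are assumed for illustration.

// Hedged sketch of the hierarchical dot product described above (OpenMP for
// brevity; not the Intel TBB code used in the actual port).
#include <mpi.h>
#include <omp.h>
#include <cstdio>
#include <vector>

int main(int argc, char **argv) {
  int provided;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

  const long long n = 1000;                  // local vector length (assumed)
  std::vector<double> x(n, 1.0), y(n, 2.0);  // illustrative data

  double rank_sum   = 0.0;  // shared among the threads of this rank
  double global_sum = 0.0;  // filled in by the MPI_Allreduce

  #pragma omp parallel
  {
    double partial = 0.0;                    // this thread's contribution
    #pragma omp for nowait
    for (long long i = 0; i < n; ++i)
      partial += x[i] * y[i];

    #pragma omp atomic
    rank_sum += partial;                     // combine partials within the rank

    #pragma omp barrier                      // wait until rank_sum is complete
    #pragma omp master
    MPI_Allreduce(&rank_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);           // one thread reduces across ranks
    #pragma omp barrier                      // others wait for the global result
  }

  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0) std::printf("dot product = %g\n", global_sum);
  MPI_Finalize();
  return 0;
}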
All of the data collected for this initial port used Intel® Threading Building Blocks (Intel® TBB) threads. These closely match the C++ thread specification, so it will be trivial to test using standard C++ threads.
Optimizations
The initial threading port achieved the goal of matching the execution time of the vector axpy operations. Even though this metric improved, the threaded code was initially slower than the MPI version until some tuning was applied. Three principal optimization steps were applied to improve the threaded code performance.
The first step was to improve parallel operations such as dot products. The initial port accumulated results such as dot products across threads using simple locks. The first attempt replaced POSIX* mutexes with Intel TBB locks and then with atomic operations used as flags. These steps made no appreciable improvement. Although the simple lock method was adequate for reductions or gathers during quick development with four threads, it did not scale well to a couple of hundred threads. A simple tree was created to add some thread parallelism to reductions such as the dot products. Implementing this tree for the parallel reduction offered a significant performance improvement; further refinements would likely offer only small incremental gains.
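A minimal sketch of such a tree reduction is shown below, again using OpenMP for brevity rather than the Intel TBB threads used in the port. The thread count and per-thread values are assumptions; the point is the pairwise combining over log2(nthreads) levels.

// Hedged sketch of a simple tree reduction over per-thread partial sums.
// At each level, thread t adds the partial of thread t + stride; after
// log2(nt) levels the total is in partial[0].
#include <omp.h>
#include <cstdio>
#include <vector>

int main() {
  const int nthreads = 8;                      // requested thread count (assumed)
  std::vector<double> partial(nthreads, 0.0);

  #pragma omp parallel num_threads(nthreads)
  {
    int t  = omp_get_thread_num();
    int nt = omp_get_num_threads();            // actual team size
    partial[t] = t + 1.0;                      // stand-in for this thread's local sum

    for (int stride = 1; stride < nt; stride *= 2) {
      #pragma omp barrier                      // previous level must be complete
      if (t % (2 * stride) == 0 && t + stride < nt)
        partial[t] += partial[t + stride];     // combine pairs at this level
    }
  }
  // With 8 threads this prints 36 (= 1 + 2 + ... + 8).
  std::printf("tree-reduced sum = %g\n", partial[0]);
  return 0;
}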
The second optimization was to give each thread its own copy of some global arrays (these arrays came from an MPI_Allgather and were used in a read-only mode from that point on). Because none of the threads alters the array values, there is no opportunity for race conditions or cache invalidation, so the initial port shared one copy of each array among all threads. For performance purposes, this proved to be wrong; creating a private copy of the array for each thread improved performance. Even after these optimizations, performance of the code with many threads still lagged behind the case with only one thread per MPI rank.
This leads to the third and last optimization step. The slow region was in problem setup and initialization. I realized that the bottleneck was dynamic memory allocation and that a better memory allocator would resolve it. The default memory allocation libraries on Linux* do not scale for numerous threads. Several third-party scalable memory allocation libraries are available to resolve this problem, and all of them work better than the default Linux memory allocation runtime libraries. I used the Intel TBB memory allocator because I am familiar with it and it can be adopted without any code modification by simply using LD_PRELOAD. So at runtime, LD_PRELOAD was defined to use the Intel TBB memory allocator, which is designed for parallel software. This change of the memory allocation runtime libraries closed the performance gap: it substituted the Intel TBB memory allocator for all of the object and dynamic memory creation, and this single step provided the biggest performance improvement.
Performance
This new hybrid miniFE code ran on both the Intel® Xeon Phi™ coprocessor and the Intel® Xeon Phi™ processor. The data collected varied the number of MPI ranks and threads using a problem size that nearly consumed all of the system memory on each of the two platforms. For the first-generation Intel Xeon Phi coprocessor the MPI rank/thread ratio varied from 1:244 to 244:1. A problem size of 256×256×512 was used for the tests. The results are shown in Figure 2.

The results show performance variations based on the ratio of MPI ranks to threads. Each ratio ran at least twice, and the fastest time was selected for reporting. More runs were collected for the ratios with slower execution times. The differences in time proved repeatable. Figure 3 shows the same tests on the Intel Xeon Phi processor using a larger problem size.

The performance on the Intel Xeon Phi processor showed less variation than the performance of miniFE on the Intel Xeon Phi coprocessor. No explanation is offered for the runtime differences across the various ratios of MPI ranks to threads. It may be possible to close those differences by explicitly pinning threads and MPI ranks to specific cores; these tests left process and thread assignment to the OS.
There is much less variation in miniFE performance than was reported for the NAS SP-MZ* benchmark hybrid code discussed in Hybrid Parallelism: Parallel Distributed Memory and Shared Memory Computing. The NAS benchmark code, though, did not create subdomains for each thread as was done in this investigation of miniFE, and the NAS SP-MZ code did not scale as well with threads as it did with MPI. This case study shows that, when following the same decomposition, threads do as well as MPI ranks. On the Intel® Xeon Phi™ product family, miniFE performance was slightly better using the maximum number of threads and only one MPI rank than using the maximum number of MPI ranks with only one thread each. Best performance was achieved with a mixture of MPI ranks and threads.
Memory consumption proves to be the most interesting aspect. The Intel Xeon Phi coprocessor is frequently not set up with a disk to swap pages to virtual memory, which provides an ideal platform to evaluate the size of a problem that can be run with the associated runtime libraries. When running the miniFE hybrid code on the Intel Xeon Phi coprocessor, the largest problem size that ran successfully with one MPI rank for each core was 256×256×512. This is a problem size of 33,554,432 elements. The associated global stiffness matrix contained 908,921,857 nonzero entries. When running with only 1 MPI rank and creating a number of threads that match the number of cores, the same number of subdomains are created and a larger problem size—256×296×512—runs to completion. This larger problem contained 38,797,312 elements, and the corresponding global stiffness matrix had 1,050,756,217 nonzero elements. Based on the number of finite elements, the threading model allows developers to run a model 15 percent larger. Based on nonzero elements in the global stiffness matrix, the model solved a matrix that is 15.6 percent larger. The ability to run a larger problem size is a significant advantage that may appeal to some project teams.
There are further opportunities for optimizing the threaded software (for example, pinning threads and MPI ranks to specific cores and improving the parallel reductions). However, the principal tuning has been done, and further tuning would probably yield only minimal performance changes. The principal motivation to follow the same problem decomposition for threading as for MPI is the improvement in memory consumption.
Summary
The effort to write code for both threads and MPI is time consuming. Projects such as the Multi-Processor Computing (MPC) framework (see mpc.hpcframework.paratools.com) may make writing code in MPI and running via threads just as efficient in the future. The one-sided communication features of MPI-3 may allow developers to write code more like the threaded version of miniFE, where one thread writes the necessary data to the other threads' desired locations, minimizing the need for MPI runtime libraries to hold so much memory in reserve. When adding threading to MPI code, remember best practices such as using scalable runtime libraries and watching for system calls that may not be thread-friendly by default, such as memory allocation or rand().
Threaded software performs comparably with MPI when both follow the same parallel layout: one subdomain per MPI rank and one subdomain per thread. In cases like miniFE, threading consumes less memory than the MPI runtime libraries and allows larger problem sizes to be solved on the same system. For this implementation of miniFE, problem sizes roughly 15 percent larger could be run on the same platform. Those seeking to optimize for memory consumption should consider using the same parallel layout for both threading and MPI and will likely benefit from the transition.
Notes
Data collected using the Intel® C++ Compiler 16.0 and Intel® MPI Library 5.1.
Running Intel® Parallel Studio XE Analysis Tools on Clusters with Slurm* / srun
Since HPC applications target high performance, users are interested in analyzing the runtime performance of such applications. In order to get a representative picture of that performance and behavior, it can be important to gather analysis data at the same scale as regular production runs. Doing so, however, would imply that shared memory-focused analysis types are run on each individual node of the job in parallel. This might not be in the user's best interest, especially since the behavior of a well-balanced MPI application should be very similar across all nodes. Therefore, users need the ability to run individual shared memory-focused analysis types on subsets of MPI ranks or compute nodes.
There are multiple ways to achieve this, e.g. through
- Separating environments for different ranks through the MPI runtime arguments
- MPI library-specific environments for analysis tool attachment, such as “gtool” for the Intel® MPI Library
- Batch scheduler parameters that allow separating the environments for different MPI ranks
In this article, we want to focus on the third option by using the Slurm* workload manager, which allows us to stay independent of the MPI library implementation being utilized.
The Slurm batch scheduler comes with a job submission utility called srun. A very simple srun job submission could look like the following:
$ srun ./my_application
Now, attaching analysis tools such as Intel® VTune™ Amplifier XE, Intel® Inspector XE, or Intel® Advisor XE from the Intel Parallel Studio XE tool suite could look like the following:
$ srun amplxe-cl -c hotspots -r my_result_1 -- ./my_application
The downside of this approach, however, is that the analysis tool (VTune in this case) will be attached to each individual MPI rank. Therefore, the user will get at least as many result directories as there are shared memory nodes within the run.
If the user is only interested in analyzing a subset of MPI ranks or shared memory nodes, they can leverage srun's multiple program configuration. To do so, the user needs to create a separate configuration file that defines which MPI ranks will be analyzed:
$ cat > srun_config.conf << EOF
0-98 ./my_application
99 amplxe-cl -c hotspots -r my_result_2 -- ./my_application
100-255 ./my_application
EOF
As one can see from this example configuration, the user runs the target application across 256 MPI ranks, where only the 100th MPI process (i.e., rank #99) will be analyzed with VTune while all other ranks remain unaffected.
Now, the user can execute srun leveraging the created configuration file by using the following command:
$ srun --multi-prog ./srun_config.conf
This way, only one result directory for rank #99 will be created.
*Other names and brands may be claimed as the property of others.
Accelerating Your NVMe Drives with SPDK
Introduction
The Storage Performance Development Kit (SPDK) is an open source set of tools and libraries hosted on GitHub that helps developers create high-performance and scalable storage applications. This tutorial will focus on the userspace NVMe driver provided by SPDK and will show you a Hello World example running on an Intel® architecture platform.
Hardware and Software Configuration
CPU and Chipset | Intel® Xeon® processor E5-2697 v2 @ 2.7 GHz |
Memory | Memory size: 8 GB (8X8 GB) DDR3 1866 Brand/model: Samsung – M393B1G73BH0* |
Storage | |
Operating System | CentOS* 7.2.1511 with kernel 3.10.0 |
Why is There a Need for a Userspace NVMe Driver?
Historically, storage devices have been an order of magnitude slower than other parts of a computer system, such as RAM and CPU. This meant the operating system and CPU would interface with disks using interrupts like so:
- A request is made to the OS to read data from a disk.
- The driver processes the request and communicates with the hardware.
- The disk platter is spun up.
- The read/write head is moved across the platter to start reading data.
- Data is read and copied into a buffer.
- An interrupt is generated, notifying the CPU that the data is now ready.
- Finally, the data is read from the buffer.
The interrupt model does incur an overhead; however, traditionally this has been significantly smaller than the latency of disk-based storage devices, and therefore using interrupts has proved effective. Storage devices such as solid state drives (SSDs) and next-generation technology like 3D XPoint™ storage are now significantly faster than disks and the bottleneck has moved away from hardware (e.g., disks) back to software (e.g., interrupts + kernel) as Figure 1 shows:
Figure 1. Solid state drives (SSDs) and 3D XPoint™ storage are significantly faster than disks. Bottlenecks have moved away from hardware.
The userspace NVMe driver addresses the issue of using interrupts by instead polling the storage device when data is being read or written. Additionally, and importantly, the NVMe driver operates within userspace, which means the application can interface with the NVMe device directly without going through the kernel. Each system call triggers a context switch, which incurs overhead because state must be saved and then restored when crossing into and out of the kernel. The SPDK NVMe driver also uses a lockless design, so no CPU cycles are spent synchronizing data between threads, and this lockless approach supports parallel I/O command execution.
When comparing the SPDK userspace NVMe driver to an approach using the Linux Kernel, the overhead latency is up to 10x lower:
SPDK is capable of saturating 8 NVMe SSDs, delivering over 3.5 million IOPS using a single CPU core:
Prerequisites and Building SPDK
SPDK has known support for Fedora*, CentOS*, Ubuntu*, Debian*, and FreeBSD*. A full list of prerequisite packages can be found here.
Before building SPDK, you are required to first install the Data Plane Development Kit (DPDK) as SPDK relies on the memory management and queuing capabilities already found in DPDK. DPDK is a mature library typically used for network packet processing and has been highly optimized to manage memory and queue data with low latency.
The source code for SPDK can be cloned from GitHub using the following:
git clone https://github.com/spdk/spdk.git
Building DPDK (for Linux*):
cd /path/to/build/spdk
wget http://fast.dpdk.org/rel/dpdk-16.07.tar.xz
tar xf dpdk-16.07.tar.xz
cd dpdk-16.07 && make install T=x86_64-native-linuxapp-gcc DESTDIR=.
Building SPDK (for Linux):
Now that we have DPDK built inside of the SPDK folder, we need to change directory back to SPDK and build SPDK by passing the location of DPDK to make:
cd /path/to/build/spdk
make DPDK_DIR=./dpdk-16.07/x86_64-native-linuxapp-gcc
Setting Up Your System Before Running an SPDK Application
The command below sets up hugepages as well as unbinds any NVMe and I/OAT devices from the kernel drivers:
sudo scripts/setup.sh
Using hugepages is important to performance: they are 2 MiB in size compared to the default 4 KiB page size, which reduces the likelihood of a Translation Lookaside Buffer (TLB) miss. The TLB is a component inside a CPU responsible for translating virtual addresses into physical memory addresses, so using larger pages (hugepages) results in more efficient use of the TLB.
Getting Started with ‘Hello World’
SPDK includes a number of examples as well as quality documentation to quickly get started. We will go through an example of storing ‘Hello World’ to an NVMe device and then reading it back into a buffer.
Before jumping into the code, it is worth noting how NVMe devices are structured and giving a high-level view of how this example will use the NVMe driver to detect NVMe devices, write data, and then read it back.
An NVMe device (also called an NVMe controller) is structured with the following in mind:
- A system can have one or more NVMe devices.
- Each NVMe device consists of a number of namespaces (there may be only one).
- Each namespace consists of a number of Logical Block Addresses (LBAs).
This example will go through the following steps:
Setup
- Initialize the DPDK Environment Abstraction Layer (EAL). -c is a bitmask of the cores to run on, -n is the number of memory channels, and --proc-type=auto lets the EAL determine whether it runs as a primary or secondary process.
static char *ealargs[] = {
    "hello_world",
    "-c 0x1",
    "-n 4",
    "--proc-type=auto",
};

rte_eal_init(sizeof(ealargs) / sizeof(ealargs[0]), ealargs);
- Create a request buffer pool that is used internally by SPDK to store request data for each I/O request:
request_mempool = rte_mempool_create("nvme_request", 8192,
                                     spdk_nvme_request_size(), 128, 0,
                                     NULL, NULL, NULL, NULL,
                                     SOCKET_ID_ANY, 0);
- Probe the system for NVMe devices:
rc = spdk_nvme_probe(NULL, probe_cb, attach_cb, NULL);
- Enumerate the NVMe devices, returning a boolean value to SPDK as to whether the device should be attached:
static bool
probe_cb(void *cb_ctx, struct spdk_pci_device *dev, struct spdk_nvme_ctrlr_opts *opts)
{
    printf("Attaching to %04x:%02x:%02x.%02x\n",
           spdk_pci_device_get_domain(dev),
           spdk_pci_device_get_bus(dev),
           spdk_pci_device_get_dev(dev),
           spdk_pci_device_get_func(dev));

    return true;
}
- The device is attached; we can now request information about the number of namespaces:
static void
attach_cb(void *cb_ctx, struct spdk_pci_device *dev, struct spdk_nvme_ctrlr *ctrlr,
          const struct spdk_nvme_ctrlr_opts *opts)
{
    int nsid, num_ns;
    const struct spdk_nvme_ctrlr_data *cdata = spdk_nvme_ctrlr_get_data(ctrlr);

    printf("Attached to %04x:%02x:%02x.%02x\n",
           spdk_pci_device_get_domain(dev),
           spdk_pci_device_get_bus(dev),
           spdk_pci_device_get_dev(dev),
           spdk_pci_device_get_func(dev));

    snprintf(entry->name, sizeof(entry->name), "%-20.20s (%-20.20s)", cdata->mn, cdata->sn);

    num_ns = spdk_nvme_ctrlr_get_num_ns(ctrlr);
    printf("Using controller %s with %d namespaces.\n", entry->name, num_ns);

    for (nsid = 1; nsid <= num_ns; nsid++) {
        register_ns(ctrlr, spdk_nvme_ctrlr_get_ns(ctrlr, nsid));
    }
}
- Enumerate the namespaces to retrieve information such as the size:
static void
register_ns(struct spdk_nvme_ctrlr *ctrlr, struct spdk_nvme_ns *ns)
{
    printf("  Namespace ID: %d size: %juGB\n",
           spdk_nvme_ns_get_id(ns),
           spdk_nvme_ns_get_size(ns) / 1000000000);
}
- Create an I/O queue pair to submit read/write requests to a namespace:
ns_entry->qpair = spdk_nvme_ctrlr_alloc_io_qpair(ns_entry->ctrlr, 0);
Reading/writing data
- Allocate a buffer for the data that will be read/written:
sequence.buf = rte_zmalloc(NULL, 0x1000, 0x1000);
- Copy ‘Hello World’ into the buffer:
sprintf(sequence.buf, "Hello world!\n");
- Submit a write request to a specified namespace providing a queue pair, pointer to the buffer, index of the LBA, a callback for when the data is written, and a pointer to any data that should be passed to the callback:
rc = spdk_nvme_ns_cmd_write(ns_entry->ns, ns_entry->qpair, sequence.buf,
                            0, /* LBA start */
                            1, /* number of LBAs */
                            write_complete, &sequence, 0);
- The write completion callback will be called synchronously.
- Submit a read request to a specified namespace providing a queue pair, pointer to a buffer, index of the LBA, a callback for the data that has been read, and a pointer to any data that should be passed to the callback:
rc = spdk_nvme_ns_cmd_read(ns_entry->ns, ns_entry->qpair, sequence->buf,
                           0, /* LBA start */
                           1, /* number of LBAs */
                           read_complete, (void *)sequence, 0);
- The read completion callback will be called synchronously.
- Poll on a flag that marks the completion of both the read and the write of the data. If the request is still in flight, we can poll for completions on the given queue pair. Although the actual reading and writing of the data is asynchronous, the spdk_nvme_qpair_process_completions function checks for and returns the number of completed I/O requests and will also call the read/write completion callbacks described above:
while (!sequence.is_completed) {
    spdk_nvme_qpair_process_completions(ns_entry->qpair, 0);
}
- Release the queue pair and complete any cleanup before exiting:
spdk_nvme_ctrlr_free_io_qpair(ns_entry->qpair);
The complete code sample for the Hello World application described here is available on GitHub, and API documentation for the SPDK NVMe driver is available at www.spdk.io.
Running the Hello World example should give the following output:
Other Examples Included with SPDK
SPDK includes a number of examples to help you get started and quickly build an understanding of how SPDK works. Here is the output from the perf example that benchmarks the NVMe drive:
Developers who require access to NVMe drive information such as features, admin command set attributes, NVMe command set attributes, power management, and health information can use the identify example:
Other Useful Links
- SPDK mailing list
- SPDK website
- SPDK documentation
- Enabling the Storage Transformation with SPDK video– Registration is required to view
- Accelerating your Storage Algorithms using Intelligent Storage Acceleration Library (ISA-L) video
- Accelerating Data Deduplication with ISA-L blog post
Authors
Steven Briscoe is an Application Engineer who focuses on cloud computing within the Software Services Group at Intel (UK).
Thai Le is a Software Engineer who focuses on cloud computing and performance computing analysis at Intel.
Recipe: Building NAMD on Intel® Xeon® and Intel® Xeon Phi™ Processors
Purpose
This recipe describes a step-by-step process to get, build, and run the NAMD (Scalable Molecular Dynamics) code on Intel® Xeon Phi™ processors and Intel® Xeon® E5 processors for better performance.
Introduction
NAMD is a parallel molecular dynamics code designed for high-performance simulation of large biomolecule systems. Based on Charm++ parallel objects, NAMD scales to hundreds of cores for typical simulations and beyond 500,000 cores for the largest simulations. NAMD uses the popular molecular graphics program VMD for simulation setup and trajectory analysis, but is also file-compatible with AMBER, CHARMM, and X-PLOR.
NAMD is distributed free of charge with source code. You can build NAMD yourself or download binaries for a wide variety of platforms. Find the details below of how to build on Intel® Xeon Phi™ processor and Intel® Xeon® E5 processors and learn more about NAMD at http://www.ks.uiuc.edu/Research/namd/
Building NAMD on Intel® Xeon® Processor E5-2697 v4 (BDW) and Intel® Xeon Phi™ Processor 7250 (KNL)
- Download the latest NAMD source code (Nightly Build) from this site: http://www.ks.uiuc.edu/Development/Download/download.cgi?PackageName=NAMD
- Download fftw3 from this site: http://www.fftw.org/download.html
- Version 3.3.4 is used in this run
- Build fftw3:
- cd <path>/fftw3.3.4
- ./configure --prefix=$base/fftw3 --enable-single --disable-fortran CC=icc
- make CFLAGS="-O3 -xMIC-AVX512 -fp-model fast=2 -no-prec-div -qoverride-limits" clean install (use -xMIC-AVX512 for KNL or -xCORE-AVX2 for BDW)
- Download charm++* version 6.7.1
- You can get charm++ from the NAMD Version Nightly Build source code
- Or download it separately from here: http://charmplusplus.org/download/
- Build multicore version of charm++:
- cd <path>/charm-6.7.1
- ./build charm++ multicore-linux64 iccstatic --with-production "-O3 -ip"
- Build BDW:
- Modify the Linux-x86_64-icc.arch to look like the following:
NAMD_ARCH = Linux-x86_64
CHARMARCH = multicore-linux64-iccstatic
FLOATOPTS = -ip -xCORE-AVX2 -O3 -g -fp-model fast=2 -no-prec-div -qoverride-limits -DNAMD_DISABLE_SSE
CXX = icpc -std=c++11 -DNAMD_KNL
CXXOPTS = -static-intel -O2 $(FLOATOPTS)
CXXNOALIASOPTS = -O3 -fno-alias $(FLOATOPTS) -qopt-report-phase=loop,vec -qopt-report=4
CXXCOLVAROPTS = -O2 -ip
CC = icc
COPTS = -static-intel -O2 $(FLOATOPTS)
- ./config Linux-x86_64-icc --charm-base <charm_path> --charm-arch multicore-linux64-iccstatic --with-fftw3 --fftw-prefix <fftw_path> --without-tcl --charm-opts -verbose
- gmake -j
- Build KNL:
- Modify the arch/Linux-KNL-icc.arch to look like the following:
NAMD_ARCH = Linux-KNL
CHARMARCH = multicore-linux64-iccstatic
FLOATOPTS = -ip -xMIC-AVX512 -O3 -g -fp-model fast=2 -no-prec-div -qoverride-limits -DNAMD_DISABLE_SSE
CXX = icpc -std=c++11 -DNAMD_KNL
CXXOPTS = -static-intel -O2 $(FLOATOPTS)
CXXNOALIASOPTS = -O3 -fno-alias $(FLOATOPTS) -qopt-report-phase=loop,vec -qopt-report=4
CXXCOLVAROPTS = -O2 -ip
CC = icc
COPTS = -static-intel -O2 $(FLOATOPTS)
- ./config Linux-KNL-icc --charm-base <charm_path> --charm-arch multicore-linux64-iccstatic --with-fftw3 --fftw-prefix <fftw_path> --without-tcl --charm-opts -verbose
- gmake -j
- Change the kernel settings for KNL: “nmi_watchdog=0 rcu_nocbs=2-271 nohz_full=2-271”
- Download apoa and stmv workloads from here: http://www.ks.uiuc.edu/Research/namd/utilities/
- Change the following lines in the *.namd file for both workloads:
numsteps 1000
outputtiming 20
outputenergies 600
Run NAMD workloads on Intel® Xeon® Processor E5-2697 v4 and Intel® Xeon Phi™ Processor 7250
Run BDW (ppn = 72):
$BIN +p $ppn apoa1/apoa1.namd +pemap 0-($ppn-1)
Run KNL (ppn = 136, MCDRAM in flat mode, similar performance in cache mode):
numactl -m 1 $BIN +p $ppn apoa1/apoa1.namd +pemap 0-($ppn-1)
Performance results reported in Intel® Salesforce repository
(ns/day; higher is better):
Workload | Intel® Xeon® Processor E5-2697 v4 (ns/day) | Intel® Xeon Phi™ Processor 7250 (ns/day) | KNL vs. 2S BDW (speedup) |
---|---|---|---|
stmv | 0.45 | 0.55 | 1.22x |
apoa1 | 5.5 | 6.18 | 1.12x |
Systems configuration:
Processor | Intel® Xeon® Processor E5-2697 v4(BDW) | Intel® Xeon Phi™ Processor 7250 (KNL) |
---|---|---|
Stepping | 1 (B0) | 1 (B0) Bin1 |
Sockets / TDP | 2S / 290W | 1S / 215W |
Frequency / Cores / Threads | 2.3 GHz / 36 / 72 | 1.4 GHz / 68 / 272 |
DDR4 | 8x16 GB 2400 MHz (128 GB) | 6x16 GB 2400 MHz |
MCDRAM | N/A | 16 GB Flat |
Cluster/Snoop Mode/Mem Mode | Home | Quadrant/flat |
Turbo | On | On |
BIOS | GRRFSDP1.86B0271.R00.1510301446 | GVPRCRB1.86B.0010.R02.1608040407 |
Compiler | ICC-2017.0.098 | ICC-2017.0.098 |
Operating System | Red Hat* Enterprise Linux* 7.2 (3.10.0-327.el7.x86_64) | Red Hat Enterprise Linux 7.2 (3.10.0-327.22.2.el7.xppsl_1.4.1.3272.x86_64) |
Simple, Powerful HPC Clusters Drive High-Speed Design Innovation
Up to 17x Faster Simulations through Optimized Cluster Computing
Scientists and engineers across a wide range of disciplines are facing a common challenge. To be effective, they need to study more complex systems with more variables and greater resolution. Yet they also need timely results to keep their research and design efforts on track.
A key criterion for most of these groups is the ability to complete their simulations overnight, so they can be fully productive during the day. Altair and Intel help customers meet this requirement using Altair HyperWorks* running on high performance computing (HPC) appliances based on the Intel® Xeon® processor E5-2600 v4 product family.
Intel® HPC Developer Conference 2016 - Session Presentations
The 2016 Intel® HPC Developer Conference brought together developers from around the world to discuss code modernization in high-performance computing. If you could not attend, or want to catch presentations you may have missed, we have posted the top tech sessions of 2016 to the HPC Developer Conference webpage. The sessions are split out by track, including Artificial Intelligence/Machine Learning, Systems, Software Visualization, Parallel Programming, and others.
Artificial Intelligence/Machine Learning Track
- Accelerating Machine Learning Software on IA
- Data Analytics, Machine Learning and HPC in Today’s Changing Application Environment
- Scaling Deep Learning
- Optimizing Machine Learning workloads on Intel Platforms
- Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* for Xeon Phi Cluster
- Using Machine Learning to Avoid the Unwanted
- Deep Neural Network Art
- Massively Parallel K-Nearest Neighbor Computation on Distributed Architectures
Systems Track
- Latest developments in OpenFabrics Interface (OFI): The new scalable fabric SW layer for Supercomputers
- Discover, extend and modernize your current development approach for heterogeneous compute with standards-based OFI/MPI/OpenMP programming methods on Intel® Xeon Phi™ architectures
- Intel® Omni-Path Architecture Software Architecture Overview
- Exploiting HPC Technologies to Accelerate Big Data Processing (Hadoop*, Spark*, and Memcached*)
- Best Practices and Performance Study of HPC Clusters
- Challenges of Deploying your HPC Application to the Cloud
- Parallel Performance: moving MPI Applications to the Next Level
- Simplify System Software Stack Development and Maintenance
High Productivity Languages Track
- Python Scalability Story in Production Environments
- The State of High Performance Computing in the Open Source R Ecosystem (University of Tennessee)
- Data Analytics and Simulation using the MATLAB Language (Mathworks)
- Julia in Parallel and High Performance Computing
- Jupyter: Python, Julia, C, and MKL HPC Batteries included
Software Visualization Track
- Introduction to SDVis and Update on Intel Efforts (James Jeffers, Intel)
- Update on OpenSWR (Jefferson Amstutz, Intel)
- OSPRay 1.0 and Beyond (Jefferson Amstutz, Intel)
- Large-scale Distributed Rendering with the OSPRay Ray Tracing Framework (Carson Brownley, Intel)
- Realizing Multi-Hit Ray Tracing in Embree and OSPRay (Christiaan Gribble, Intel/SURVICE)
- Visualization w/ VisIt on Intel® Xeon Phi™ Processor (code name Knights Landing) KNL (Jian Huang & Hank Childs, Univ. of Oregon / Univ. of Tennessee)
- SDVis Research at the University of Utah Intel Parallel Compute Center (Aaron Knoll, Univ. of Utah)
- OSPRay Integration into Pcon-Planner (Caglar Özgür & Frank Wicht, Eastern Graphics)
- Visualization and Analysis of Biomolecular Complexes on Upcoming KNL-based HPC Systems: TACC Stampede 2 and ANL Aurora
- Paraview and VTK w/OSPRay* and OpenSWR*
- SDVIs and In-Situ Visualization on TACC's Stampede
Parallel Programming Track
- Optimizations of Bspline-based Orbital Evaluations in Quantum Monte Carlo on Multi/Many-Core Shared Memory Processors
- Utilizing Roofline Analysis in the Intel® Advisor to Deliver Optimized Performance for Applications on Intel® Xeon Phi™ Processor (Code named Knights Landing)
- Many Cores for the Masses: Lessons Learned from Application Readiness Efforts at NERSC for the Knights Landing based Cori System
- Reshaping Core Genomics Software tools for the Many-Core era
- High-Performance and Scalable MPI +X Library for Emerging HPC Clusters
- All the things you need to know about Intel MPI Library (TACC)
- Using C++ and Intel Threading Building Blocks to program across processors and co-processors
- Improving Vectorization Efficiency using Intel SIMD Data Layout Template
- SWIFT: Using Task-Based Parallelism, Fully Asynchronous Communication and Vectorization to achieve maximal HPC performance
- Case Study: Optimization of Profrager, a protein structure and function prediction tool developed at the Brazilian National Laboratory for Scientific Computing (LNCC)
- Porting Industrial Application on Intel® Xeon Phi™: Altair RADIOSS Case Study, Developer feedbacks and Outlooks