
Improving the Performance of Principal Component Analysis with Intel® Data Analytics Acceleration Library


Have you ever tried to access a website and had to wait a long time before you could access it or not been able to access it at all? If so, that website might be falling victim to what is called a Denial of Service1 (DoS) attack. DoS attacks occur when an attacker floods a network with information like spam emails, causing the network to be so busy handling that information that it is unable to handle requests from other users.

To prevent a spam-email DoS attack, a network needs to be able to identify "garbage" (spam) emails and filter them out. One way to do this is to compare an incoming email's pattern with those in a library of email spam signatures. Incoming patterns that match entries in the library are labeled as attacks. Since spam emails come in many shapes and forms, there is no way to build a library that stores all possible patterns. To increase the chance of identifying spam emails, there needs to be a method to restructure the data in a way that makes it simpler to analyze.

This article discusses an unsupervised2 machine-learning3 algorithm called principal component analysis4 (PCA) that can be used to simplify the data. It also describes how Intel® Data Analytics Acceleration Library (Intel® DAAL)5 helps optimize this algorithm to improve the performance when running it on systems equipped with Intel® Xeon® processors.

What is Principal Component Analysis?

PCA is a popular data analysis method. It is used to reduce the complexity of the data without losing its important properties, making the data easier to visualize and analyze. Reducing the complexity of the data means reducing the original dimensions to fewer dimensions while preserving the important features of the original datasets. PCA is normally used as a preprocessing step for machine-learning algorithms like K-means6, resulting in simpler models and thus better performance.

Figures 1–3 illustrate how the PCA algorithm works. To simplify the problem, let’s limit the scope to two-dimensional space.


Figure 1. Original dataset layout.

Figure 1 shows the objects of the dataset. We want to find the direction where the variance is maximal.


Figure 2. The mean and the direction with maximum variance.

Figure 2 shows the mean of the dataset and the direction with maximum variance. The direction with the maximal variance is called the first principal component.


Figure 3. Finding the next principal component.

Figure 3 shows the next principal component. The next principal component is the direction with the second-largest variance. Note that the second direction is orthogonal to the first.

Figures 4–6 show how the PCA algorithm is used to reduce the dimensions.


Figure 4. Re-orientating the graph.

Figure 4 shows the new graph after rotating it so that the axis (P1) corresponding to the first principal component becomes a horizontal axis.


Figure 5. Projecting the objects to the P1 axis.

In Figure 5 the objects have been projected onto the P1 axis.


Figure 6. Reducing from two dimensions to one dimension.

Figure 6 shows the effect of using PCA to reduce from two dimensions (P1 and P2) to one dimension (P1) based on the maximal variance. Similarly, the same concept is used on multi-dimensional datasets to reduce their dimensions while still maintaining much of their characteristics by dropping the dimensions with lower variances.

Information about the mathematical representation of PCA can be found in references 7 and 8.
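
To make the geometric picture above concrete, the following minimal NumPy sketch (an illustration, not part of the original article) computes the principal components of a small two-dimensional dataset by eigendecomposition of its covariance matrix and projects the data onto the first component, reducing it from two dimensions to one:

import numpy as np

# Toy two-dimensional dataset: 1,000 correlated points
rng = np.random.RandomState(0)
x = rng.normal(size=1000)
data = np.column_stack((x, 0.5 * x + 0.1 * rng.normal(size=1000)))

# Center the data and compute its covariance matrix
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)

# The eigenvectors of the covariance matrix are the principal directions;
# sort them by decreasing eigenvalue (the variance along each direction)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keep only the first principal component: 2-D -> 1-D projection
projected = centered.dot(eigvecs[:, :1])
print(projected.shape)   # (1000, 1)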

Applications of PCA

PCA applications include the following:

  • Detecting DoS and network probe attacks
  • Image compression
  • Pattern recognition
  • Analyzing medical imaging

Pros and Cons of PCA

The following lists some of the advantages and disadvantages of PCA.

  • Pros
    • Fast algorithm
    • Shows the maximal variance of the data
    • Reduces the dimension of the origin data
    • Removes noise
  • Cons
    • Non-linear structure is hard to model with PCA

Intel® Data Analytics Acceleration Library

Intel DAAL is a library consisting of many basic building blocks that are optimized for data analytics and machine learning. These basic building blocks are highly optimized for the latest features of the latest Intel® processors. More about Intel DAAL can be found at reference 5.

The next section shows how to use PCA with PyDAAL, the Python* API of Intel DAAL. To install PyDAAL, follow the instructions in reference 9.

Using the PCA Algorithm in Intel Data Analytics Acceleration Library

To invoke the PCA algorithm in Python10 using Intel DAAL, follow these steps:

  1. Import the necessary packages using the from and import statements:
    1. Import the necessary functions for loading the data by issuing the following command:
      from daal.data_management import HomogenNumericTable
    2. Import the PCA algorithm using the following command:
      import daal.algorithms.pca as pca
    3. Import NumPy for calculations:
      import numpy as np
  2. Import the createSparseTable function to create a numeric table to store input data read from a file.
    from utils import createSparseTable
  3. Load the input data into a numeric table:
     dataTable = createSparseTable(dataFileName)
    where dataFileName is the name of the input .csv data file.
  4. Create an algorithm object for PCA using the correlation method.
    pca_alg = pca.Batch_Float64CorrelationDense()
    Note: to use the SVD (singular value decomposition) method instead, create the algorithm object as follows:
    pca_alg = pca.Batch_Float64SvdDense()
  5. Set the input for the algorithm.
    pca_alg.input.setDataset(pca.data, dataTable)
  6. Compute the results.
    result = pca_alg.compute()
    The results can be retrieved using the following commands:
    result.get(pca.eigenvalues)
    result.get(pca.eigenvectors)
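
Putting the steps above together, a complete script might look like the following sketch. It uses only the PyDAAL names shown in the steps (HomogenNumericTable, pca.Batch_Float64CorrelationDense, and the createSparseTable helper from the utils module shipped with the Intel DAAL examples); the input file name pca_input.csv is a placeholder for your own .csv file:

from daal.data_management import HomogenNumericTable   # numeric table type used by Intel DAAL
import daal.algorithms.pca as pca
import numpy as np                                      # imported as in step 1 for any follow-up calculations
from utils import createSparseTable                     # helper from the Intel DAAL examples

dataFileName = "pca_input.csv"                          # placeholder input file

# Load the input data into a numeric table
dataTable = createSparseTable(dataFileName)

# Create the PCA algorithm object (correlation method)
pca_alg = pca.Batch_Float64CorrelationDense()

# Set the input and run the computation
pca_alg.input.setDataset(pca.data, dataTable)
result = pca_alg.compute()

# Retrieve the eigenvalues and eigenvectors of the decomposition
eigenvalues = result.get(pca.eigenvalues)
eigenvectors = result.get(pca.eigenvectors)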

Conclusion

PCA is one of the simplest unsupervised machine-learning algorithms that is used to reduce the dimensions of a dataset. Intel DAAL contains an optimized version of the PCA algorithm. With Intel DAAL, you don’t have to worry about whether your applications will run well on systems equipped with future generations of Intel Xeon processors. Intel DAAL will automatically take advantage of new features in new Intel Xeon processors. All you need to do is link your applications to the latest version of Intel DAAL.

References

1. Denial of service attacks

2. Unsupervised learning

3. Wikipedia – machine learning

4. Principal component analysis

5. Introduction to Intel DAAL

6. K-means algorithm

7. Principal component analysis for machine learning

8. Principal component analysis tutorial

9. How to install Intel’s distribution for Python

10. Python website


How AsiaInfo ADB* Improves Performance with Intel® Xeon® Processor-Based Systems


Background

Supporting high online transaction volumes in real time, especially at peak time, can be challenging for telecom and financial services. To ensure uninterrupted service and a good customer experience, telecom and financial companies are constantly looking for ways to improve their services by enhancing their applications and systems.

AsiaInfo1 ADB* is a scalable online transaction processing2 database targeted for high-performance and mission-critical businesses such as online charge service3 (OCS). AsiaInfo ADB provides high performance, high availability, and scalability by clustering multiple servers.

This article describes how AsiaInfo ADB was able to take advantage of features like Intel® Advanced Vector Extensions 2 (Intel® AVX2)4 and Intel® Transactional Synchronization Extensions (Intel® TSX)5, as well as faster Intel® Solid State Drives (SSDs), to improve its performance when running on systems equipped with the latest generation of Intel® Xeon® processors.

AsiaInfo ADB on Intel® Xeon® Processor-Based Systems

AsiaInfo engineers modified the ADB code by replacing the self-implemented spin lock with pthread_rwlock_wrlock from the GNU* C Library6 (glibc). The pthread_rwlock_wrlock function can be configured to enable or disable Intel TSX with an environment variable. With the new ADB version using the glibc lock and Intel TSX enabled, performance improves compared to the original ADB version using the self-implemented lock, as shown in Figure 1.

Customers with limited disk space that cannot be expanded can enable the compression function. The ADB data compression function saves disk space by compressing data before writing it to disk. However, compression is CPU intensive and impacts database performance. To reduce that impact, AsiaInfo engineers modified the ADB compression module to use Intel AVX2 intrinsic instructions.

New Intel Xeon processors like the Intel® Xeon® processor E7 v4 family provide more cores (24 compared to 18) and a larger cache (60 MB compared to 45 MB) than the previous-generation Intel® Xeon® processor E7 v3 family. More cores and a larger cache allow more transactions to be served within the same amount of time.

The next section shows how we tested the AsiaInfo ADB workload to compare the performance of the current-generation Intel Xeon processor E7 v4 family with that of the previous-generation Intel Xeon processor E7 v3 family.

Performance Test Procedure

We performed tests on two platforms. One system was equipped with the Intel® Xeon® processor E7-8890 v3 and the other with the Intel® Xeon® processor E7-8890 v4. We wanted to see how Intel TSX, Intel AVX2, and faster solid state drives (SSDs) affect performance.

Test Configuration

System equipped with the quad-socket Intel Xeon processor E7-8890 v4

  • System: Preproduction
  • Processors: Intel Xeon processor E7-8890 v4 @2.2 GHz
  • Cache: 60 MB
  • Cores: 24
  • Memory: 256 GB DDR4-1600 LV DIMM
  • SSD: Intel® SSD DC S3700 Series, Intel SSD DC P3700 Series

System equipped with the quad-socket Intel Xeon processor E7-8890 v3

  • System: Preproduction
  • Processors: Intel Xeon processor E7-8890 v3 @2.5 GHz
  • Cache: 45 MB
  • Cores: 18
  • Memory: 256 GB DDR4-1600 LV DIMM
  • SSD: Intel SSD DC S3700 Series, Intel SSD DC P3700 Series

Operating system:

  • Ubuntu* 15.10 - kernel 4.2

Software:

  • Glibc 2.21

Application:

  • ADB v1.1
  • AsiaInfo ADB OCS ktpmC workload

Test Results

Intel® Transactional Synchronization Extensions
Figure 1: Comparison between the application using the Intel® Xeon® processor E7-8890 v3 and the Intel® Xeon® processor E7-8890 v4 when Intel® Transactional Synchronization Extensions is enabled.

Figure 1 shows that the performance improved by 22 percent with Intel TSX enabled when running the application on systems equipped with Intel Xeon processor E7-8890 v4 compared to that of the Intel Xeon processor E7-8890 v3.

 

Intel® Advanced Vector Extensions 2
Figure 2: Performance improvement using Intel® Advanced Vector Extensions 2.

Figure 2 shows the data compression module performance improved by 34 percent when Intel AVX2 is enabled. This test was performed on the Intel® Xeon® processor E7-8890 v4.

 

Performance comparison between different Intel® SSDs
Figure 3: Performance comparison between different Intel® SSDs.

Figure 3 shows the performance improvement of the application using faster Intel SSDs. In this test case, replacing the Intel SSD DC S3700 Series with the Intel® SSD DC P3700 Series gained 58 percent in performance. Again, this test was performed on the Intel® Xeon® processor E7-8890 v4.

 

Conclusion

AsiaInfo ADB gains performance by taking advantage of Intel TSX and Intel AVX2, as well as better platform capabilities such as more cores and a larger cache, resulting in improved customer experiences.

References

  1. AsiaInfo company information
  2. Online transaction processing
  3. Online charging system
  4. Intel AVX2
  5. Intel TSX
  6. GNU Library C

A BKM for Working with libfabric* on a Cluster System when using Intel® MPI Library

$
0
0

This article discusses a best-known method (BKM) for using the libfabric1 infrastructure on a cluster computing system. The focus is on how to transition from the Open Fabrics Alliance (OFA) framework to the Open Fabrics Interfaces1 (OFI) framework, and a description of the fabric providers that support libfabric is given. In general, the OFA-to-OFI transition aims to make the Intel® MPI Library software layer lighter, with most of the network communication controls shifted to a lower level (for example, the OFI provider level). For more information, see the libfabric page at OpenFabrics.1 Note that the following information is based on the Open Fabrics Interfaces Working Group, and hence this document is heavily annotated with citations to https://github.com/ofiwg/libfabric so that the reader can obtain more detailed information when needed.

What is libfabric?

The Open Fabrics Interfaces1 (OFI) is a framework focused on exporting fabric communication services to applications.

See the OFI web site for more details. This URL includes a description and overview of the project and detailed documentation for the libfabric APIs.

Building and installing libfabric from the source

Distribution tar packages are available from the GitHub* releases tab.1 If you are building libfabric from a developer Git clone, you must first run the autogen.sh script. This will invoke the GNU* Autotools to bootstrap libfabric's configuration and build mechanisms. If you are building libfabric from an official distribution tarball, then you do not need to run autogen.sh. This means that libfabric distribution tarballs are already bootstrapped for you.

Libfabric currently supports GNU/Linux*, FreeBSD*, and OS X*. Although OS X* is mentioned here, the Intel® MPI Library does not support OS X.

Configuration options1

The configure script has many built-in command-line options. The reader should issue the command:

 ./configure --help

to view those options. Some useful configuration switches are:

--prefix=<directory>

Throughout this article, <directory> should be interpreted as a meta-symbol for the actual directory path that is to be supplied by the user. By default, make install places the files in the /usr directory tree. If the --prefix option is used it indicates that libfabric files should be installed into the directory tree specified by <directory>. The executables that are built from the configure command will be placed into <directory>/bin.

--with-valgrind=<directory>

The meta-symbol <directory> is the directory where valgrind is installed. If valgrind is found, valgrind annotations are enabled. This may incur a performance penalty.

--enable-debug

Enable debug code paths. This enables various extra checks and allows for using the highest verbosity logging output that is normally compiled out in production builds.

--enable-<provider>=[yes | no | auto | dl | <directory>]

--disable-<provider>

This enables or disables the fabric provider referenced by the meta-symbol <provider>. Valid options are:

  • auto (This is the default if the --enable-<provider> option is not specified).
    The provider will be enabled if all its requirements are satisfied. If one of the requirements cannot be satisfied, the provider is disabled.
  • yes (This is the default if the --enable-<provider> option is specified).
    The configure script will abort if the provider cannot be enabled (for example, due to some of its requirements not being available).
  • no
    Disable the provider. This is synonymous with --disable-<provider>.
  • dl
    Enable the provider and build it as a loadable library.
  • <directory>
    Enable the provider and use the installation given in <directory>.

Providers1 are gni*, mxm*, psm, psm2, sockets, udp, usnic*, and verbs.

Examples1

Consider the following example:

$ ./configure --prefix=/opt/libfabric --disable-sockets && make -j 32 && sudo make install

This tells libfabric to disable the sockets provider and install libfabric in the /opt/libfabric tree. All other providers will be enabled if possible, and all debug features will be disabled.

Alternatively:

$ ./configure --prefix=/opt/libfabric --enable-debug --enable-psm=dl && make -j 32 && sudo make install

This tells libfabric to enable the psm provider as a loadable library, enable all debug code paths, and install libfabric to the /opt/libfabric tree. All other providers will be enabled if possible.

Validate installation1

The fi_info utility can be used to validate the libfabric and provider installation, as well as provide details about provider support and available interfaces. See the fi_info(1) man page for details on using the fi_info utility. fi_info is installed as part of the libfabric package.

A more comprehensive test suite is available via the fabtests software package. In addition, fi_pingpong, a ping-pong test that transmits data between two processes, can be used for validation.
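
For example, the following commands (flags as documented in the fi_info(1) and fi_pingpong(1) man pages; verify with --help on your installation) list the available providers and run a quick ping-pong over the sockets provider:

$ fi_info -l                               # list the providers built into this installation
$ fi_info -p sockets                       # show the interfaces exported by the sockets provider
$ fi_pingpong -p sockets                   # on the first node (server side)
$ fi_pingpong -p sockets <server-address>  # on the second node (client side)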

Who are the libfabric1 providers?

gni*1

The Generic Network Interface (GNI) provider runs on Cray XC* systems utilizing the user-space Generic Network Interface (uGNI), which provides low-level access to the Aries* interconnect. Aries is the Cray custom interconnect ASIC (Application-Specific Integrated Circuit). The Aries interconnect is designed for low-latency, one-sided messaging and also includes direct hardware support for common atomic operations and optimized collectives. Note, however, that OFI does not provide an API for collectives. Some optimization of collectives is possible through the fi_trigger APIs; details can be found at https://ofiwg.github.io/libfabric/master/man/fi_trigger.3.html. However, as of this writing, the Intel MPI Library does not use fi_trigger (triggered operations).

See the fi_gni(7) man page for more details.

Dependencies1

  • The GNI provider requires GCC version 4.9 or higher.

mxm*1

The MXM provider has been deprecated and was removed after the libfabric 1.4.0 release.

psm1

The psm (Performance Scaled Messaging) provider runs over the PSM 1.x interface that is currently supported by the Intel® True Scale Fabric. PSM provides tag-matching message queue functions that are optimized for MPI implementations. PSM also has limited Active Message support, which is not officially published but is quite stable and well documented in the source code (part of the OFED release). The psm provider makes use of both the tag-matching message queue functions and the Active Message functions to support a variety of libfabric data transfer APIs, including tagged message queue, message queue, RMA (Remote Memory Access), and atomic operations.

The psm provider can work with the psm2-compat library, which exposes a PSM 1.x interface over the Intel® Omni-Path Fabric.

See the fi_psm(7) man page for more details.

psm21

The psm2 provider runs over the PSM 2.x interface that is supported by the Intel Omni-Path Fabric. PSM 2.x has all the PSM 1.x features plus a set of new functions with enhanced capabilities. Since PSM 1.x and PSM 2.x are not application binary interface (ABI) compatible, the psm2 provider only works with PSM 2.x and does not support Intel True Scale Fabric.

See the fi_psm2(7) man page for more details.

sockets1

The sockets provider is a general purpose provider that can be used on any system that supports TCP sockets. The provider is not intended to provide performance improvements over regular TCP sockets but rather to allow developers to write, test, and debug application code even on platforms that do not have high-performance fabric hardware. The sockets provider supports all libfabric provider requirements and interfaces.

See the fi_sockets(7) man page for more details.

udp1

The udp (user datagram protocol) provider is a basic provider that can be used on any system that supports User Datagram Protocol (UDP) sockets. UDP is an alternative communications protocol to Transmission Control Protocol (TCP) used primarily for establishing low-latency and loss tolerating connections between applications on the Internet. The provider is not intended to provide performance improvements over regular UDP sockets but rather to allow application and provider developers to write, test, and debug their code. The udp provider forms the foundation of a utility provider that enables the implementation of libfabric features over any hardware. Intel MPI Library does not support the udp provider.

See the fi_udp(7) man page for more details.

usnic*1

The usnic provider is designed to run over the Cisco VIC* (virtualized NIC) hardware on Cisco UCS* (Unified Computing System) servers. It utilizes the Cisco usnic (userspace NIC) capabilities of the VIC to enable ultra-low latency and other offload capabilities on Ethernet networks. Intel MPI Library does not support the usnic provider.

See the fi_usnic(7) man page for more details.

Dependencies1

The usnic provider depends on library files from either libnl version 1 (sometimes known as libnl or libnl1) or version 3 (sometimes known as libnl3). If you are compiling libfabric from source and want to enable usnic support, you will also need the matching libnl header files (for example, if you are building with libnl version 3, you need both the header and library files from version 3).

Configure options1

--with-libnl=<directory>

If specified, look for libnl support. If it is not found, the usnic provider will not be built. If <directory> is specified, check for libnl version 3 in the directory. If version 3 is not found, check for version 1. If no <directory> argument is specified, this option is redundant with --with-usnic.

verbs*1

The verbs provider enables applications using OFI to be run over any verbs hardware (InfiniBand*, iWarp*, and so on). It uses the Linux Verbs API for network transport and provides a translation of OFI calls to appropriate verbs API calls. It uses librdmacm for communication management and libibverbs for other control and data transfer operations.

See the fi_verbs(7) man page for more details.

Dependencies1

The verbs provider requires libibverbs (v1.1.8 or newer) and librdmacm (v1.0.16 or newer). If you are compiling libfabric from source and want to enable verbs support, you will also need the matching header files for the above two libraries. If the libraries and header files are not in default paths, specify them in the CFLAGS, LDFLAGS, and LD_LIBRARY_PATH environment variables.

Selecting a fabric provider within OFI when using the Intel® MPI Library

For OFI when using Intel MPI Library, the selection of a provider from the libfabric library is done through the environment variable called I_MPI_OFI_PROVIDER, which defines the name of the OFI provider to load.

Syntax

export I_MPI_OFI_PROVIDER=<name>

where <name> is the OFI provider to load. Figure 1 shows a list of OFI providers1 in the row of rectangles that are second from the bottom.


Figure 1. The libfabric* architecture under Open Fabrics Interfaces1 (OFI).

The discussion that follows provides a description of OFI providers that can be selected with the I_MPI_OFI_PROVIDER environment variable.

Using a DAPL or a DAPL UD equivalent when migrating to OFI

DAPL is an acronym for Direct Access Programming Library. For DAPL UD, the acronym UD stands for the User Datagram protocol, and this data transfer is a more memory-efficient alternative to the standard Reliable Connection (RC) transfer. UD implements a connectionless model that allows for a many-to-one connection transfer to be set up for managing communication using a fixed number of connection pairs, even as more MPI ranks are launched.

At the moment, there is no DAPL UD equivalent within OFI.

Using gni* under OFI

To use the gni provider under OFI, set the following environment variable:

export I_MPI_OFI_PROVIDER=gni

OVERVIEW1

The GNI provider runs on Cray XC systems utilizing the user-space Generic Network Interface (uGNI), which provides low-level access to the Aries interconnect. The Aries interconnect is designed for low-latency, one-sided messaging and also includes direct hardware support for common atomic operations and optimized collectives. Intel MPI Library works with the GNI provider on an “as is” basis.

REQUIREMENTS1

The GNI provider runs on Cray XC systems that run the Cray Linux Environment 5.2 UP04 or higher using gcc version 4.9 or higher.

The article by Lubin2 talks about using the gni fabric.

Using mxm* under OFI

As of this writing, the MXM provider has been deprecated and was removed after the libfabric 1.4.0 release.

Using TCP (Transmission Control Protocol) under OFI

To use the sockets provider under OFI set the following environment variable:

export I_MPI_OFI_PROVIDER=sockets

OVERVIEW1

The sockets provider is a general purpose provider that can be used on any system that supports TCP sockets. The provider is not intended to provide performance improvements over regular TCP sockets but rather to allow developers to write, test, and debug application code even on platforms that do not have high-performance fabric hardware. The sockets provider supports all libfabric provider requirements and interfaces.

SUPPORTED FEATURES1

The sockets provider supports all the features defined for the libfabric API. Key features include:

Endpoint types1

The provider supports all endpoint types: FI_EP_MSG, FI_EP_RDM, and FI_EP_DGRAM.

Endpoint capabilities1

The following data transfer interface is supported for all endpoint types: fi_msg. Additionally, these interfaces are supported for reliable endpoints (FI_EP_MSG and FI_EP_RDM): fi_tagged, fi_atomic, and fi_rma.

Modes1

The sockets provider supports all operational modes including FI_CONTEXT and FI_MSG_PREFIX.

Progress1

Sockets provider supports both FI_PROGRESS_AUTO and FI_PROGRESS_MANUAL, with a default set to auto. When progress is set to auto, a background thread runs to ensure that progress is made for asynchronous requests.

LIMITATIONS1

The sockets provider attempts to emulate the entire API set, including all defined options. In order to support development on a wide range of systems, it is implemented over TCP sockets. As a result, the performance numbers are lower compared to other providers implemented over high-speed fabrics and lower than what an application might see implementing sockets directly.

Using UDP under OFI

As of this writing, the UDP provider is not supported by Intel MPI Library because of the lack of required capabilities within the provider.

Using usnic* under OFI

As of this writing, Intel® MPI Library does not work with usnic*.

Using TMI under OFI

The Tag Matching Interface (TMI) provider was developed for Performance Scaled Messaging (PSM) and Performance Scaled Messaging 2 (PSM2). Therefore, under OFI, use the psm provider as the alternative to TMI/PSM by setting the following environment variable:

export I_MPI_OFI_PROVIDER=psm

OVERVIEW1

The psm provider runs over the PSM 1.x interface that is currently supported by the Intel True Scale Fabric. PSM provides tag-matching message queue functions that are optimized for MPI implementations. PSM also has limited Active Message support, which is not officially published, but is quite stable and is well documented in the source code (part of the OFED release). The psm provider makes use of both the tag-matching message queue functions and the Active Message functions to support a variety of libfabric data transfer APIs, including tagged message queue, message queue, RMA (Remote Memory Access), and atomic operations.

The psm provider can work with the psm2-compat library, which exposes a PSM 1.x interface over the Intel Omni-Path Fabric.

LIMITATIONS1

The psm provider does not support all the features defined in the libfabric API. Here are some of the limitations:

Endpoint types1

Only the non-connection-based types FI_DGRAM and FI_RDM are supported.

Endpoint capabilities1

Endpoints can support any combination of data transfer capabilities FI_TAGGED, FI_MSG, FI_ATOMICS, and FI_RMA. These capabilities can be further refined by FI_SEND, FI_RECV, FI_READ, FI_WRITE, FI_REMOTE_READ, and FI_REMOTE_WRITE to limit the direction of operations. The limitation is that no two endpoints can have overlapping receive or RMA (Remote Memory Access) target capabilities in any of the above categories. For example, it is fine to have two endpoints with FI_TAGGED | FI_SEND, one endpoint with FI_TAGGED | FI_RECV, one endpoint with FI_MSG, one endpoint with FI_RMA | FI_ATOMICS. But, it is not allowed to have two endpoints with FI_TAGGED, or two endpoints with FI_RMA.

FI_MULTI_RECV is supported for the non-tagged message queue only.

Other supported capabilities include FI_TRIGGER.

Modes1

FI_CONTEXT is required for the FI_TAGGED and FI_MSG capabilities. This means that any request belonging to these two categories that generates a completion must pass as the operation context a valid pointer to a data structure of type struct fi_context, and the space referenced by the pointer must remain untouched until the request has completed. If neither FI_TAGGED nor FI_MSG is requested, the FI_CONTEXT mode is not required.

Progress1

The psm provider requires manual progress. The application is expected to call the fi_cq_read or fi_cntr_read function from time to time when no other libfabric function is called, to ensure progress is made in a timely manner. The provider does support the auto progress mode; however, performance can be significantly impacted if the application depends purely on the provider to make progress automatically.

Unsupported features1

These features are unsupported: connection management, scalable endpoint, passive endpoint, shared receive context, and send/inject with immediate data.

Using PSM2 under OFI

To use the psm2 provider under OFI, set the following environment variable:

export I_MPI_OFI_PROVIDER=psm2

OVERVIEW1

The psm2 provider runs over the PSM 2.x interface that is supported by the Intel Omni-Path Fabric. PSM 2.x has all the PSM 1.x features plus a set of new functions with enhanced capabilities. Since PSM 1.x and PSM 2.x are not ABI compatible the psm2 provider only works with PSM 2.x, and does not support Intel True Scale Fabric. If you have Intel® Omni-Path Architecture, use the PSM2 provider.

LIMITATIONS1

The psm2 provider does not support all of the features defined in the libfabric API. Here are some of the limitations:

Endpoint types1

The only supported non-connection based types are FI_DGRAM and FI_RDM.

Endpoint capabilities1

Endpoints can support any combination of data transfer capabilities FI_TAGGED, FI_MSG, FI_ATOMICS, and FI_RMA. These capabilities can be further refined by FI_SEND, FI_RECV, FI_READ, FI_WRITE, FI_REMOTE_READ, and FI_REMOTE_WRITE to limit the direction of operations.

FI_MULTI_RECV is supported for non-tagged message queue only.

Other supported capabilities include FI_TRIGGER, FI_REMOTE_CQ_DATA, and FI_SOURCE.

Modes1

FI_CONTEXT is required for the FI_TAGGED and FI_MSG capabilities. This means that any request belonging to these two categories that generates a completion must pass as the operation context a valid pointer to a data structure of type struct fi_context, and the space referenced by the pointer must remain untouched until the request has completed. If neither FI_TAGGED nor FI_MSG is requested, the FI_CONTEXT mode is not required.

Progress1

The psm2 provider requires manual progress. The application is expected to call the fi_cq_read or fi_cntr_read function from time to time when no other libfabric function is called, to ensure progress is made in a timely manner. The provider does support the auto progress mode; however, performance can be significantly impacted if the application depends purely on the provider to make progress automatically.

Unsupported features1

These features are unsupported: connection management, scalable endpoint, passive endpoint, shared receive context, and send/inject with immediate data over tagged message queue.

Using verbs under OFI

To use the verbs provider under OFI, set the following environment variable:

export I_MPI_OFI_PROVIDER=verbs

OVERVIEW1

The verbs provider enables applications using OFI to be run over any verbs hardware (InfiniBand, iWarp, and so on). It uses the Linux Verbs API for network transport and provides a translation of OFI calls to appropriate verbs API calls. It uses librdmacm for communication management, and libibverbs for other control and data transfer operations.

SUPPORTED FEATURES1

The verbs provider supports a subset of OFI features.

Endpoint types1

Only FI_EP_MSG (Reliable Connection-Oriented) and FI_EP_RDM (Reliable Datagram) are supported, but the official OFI documentation declares FI_EP_RDM experimental because it is under active development, including its wire protocols. Intel MPI Library works over RDM endpoints. Note that changes in the wire protocol typically mean that all peers must work in an aligned environment; therefore, different versions of libfabric are not compatible with each other.

Endpoint capabilities1

FI_MSG, FI_RMA, FI_ATOMIC.

Modes1

The verbs provider requires applications to support the following modes: FI_LOCAL_MR for all applications, and FI_RX_CQ_DATA for applications that want to use RMA (Remote Memory Access). Applications must take responsibility for posting receives for any incoming CQ (Completion Queue) data.

Progress1

A verbs provider supports FI_PROGRESS_AUTO: Asynchronous operations make forward progress automatically.

Operation flags1

A verbs provider supports FI_INJECT, FI_COMPLETION, FI_REMOTE_CQ_DATA.

Msg Ordering1

The verbs provider supports the following message ordering on the TX side:

  • Read after Read
  • Read after Write
  • Read after Send
  • Write after Write
  • Write after Send
  • Send after Write
  • Send after Send

Is the multi-rail feature supported under OFI?

When using multi-rail under OFA, the command-line syntax for invoking “mpirun” with Intel MPI Library might look something like:

export I_MPI_FABRICS=ofa:ofa
mpirun -n 8 -env I_MPI_OFA_ADAPTER_NAME adapter1 ./program.exe : -n 8 -env I_MPI_OFA_ADAPTER_NAME adapter2 ./program.exe

For the command-line above, 8 MPI ranks use the host channel adapter (HCA) called adapter1 and the other 8 MPI ranks use the HCA named adapter2.

Another multi-rail common case under OFA is to have every MPI rank use all the available host channel adapters and all the open ports from every HCA. Suppose the cluster system has 4 nodes where each system has 2 HCAs with 2 open ports each. Then every MPI rank may use 4 hardware cables for communication. The command-line syntax for invoking “mpirun” with Intel MPI Library might look something like:

export I_MPI_FABRICS=ofa:ofa
mpirun -f <host-file> -n 16 -ppn 4 -env I_MPI_OFA_NUM_ADAPTERS 2 -env I_MPI_OFA_NUM_PORTS 2 ./program.exe

where there are 4 MPI ranks associated with each node, and <host-file> is a meta-symbol for a file name that contains the names of the 4 compute servers. The environment variable setting I_MPI_OFA_NUM_ADAPTERS=2 enables utilization of 2 HCAs, and the environment variable I_MPI_OFA_NUM_PORTS=2 enables utilization of 2 ports.

For using multi-rail under OFI, the Unified Communication X (UCX) working group has defined a framework that will support multi-rail semantics.3 UCX is a collaboration between industry, laboratories, and academia to create an open-source production grade communication framework for data centric and high-performance computing applications (Figure 2).


Figure 2. The Unified Communication X framework3.

Regarding the current status of UCX and the multi-rail fabric: as of this writing, multi-rail is not implemented yet for OFI.

References

  1. Open Fabrics Initiative Working Group
  2. M. Lubin, “Intel® Cluster Tools in a Cray* environment. Part 1.”
  3. Unified Communication X

Thread Parallelism in Cython*


Introduction

Cython* is a superset of Python* that additionally supports calling C functions and declaring C types on variables and class attributes. Cython is used for wrapping external C libraries that speed up the execution of a Python program. Cython generates C extension modules, which are used by the main Python program via the import statement.

One interesting feature of Cython is that it supports native parallelism (see the cython.parallel module). The cython.parallel.prange function can be used for parallel loops; thus one can take advantage of Intel® Many Integrated Core Architecture (Intel® MIC Architecture) using the thread parallelism in Python.

Cython in Intel® Distribution for Python* 2017

Intel® Distribution for Python* 2017 is a binary distribution of Python interpreter, which accelerates core Python packages, including NumPy, SciPy, Jupyter, matplotlib, Cython, and so on. The package integrates Intel® Math Kernel Library (Intel® MKL), Intel® Data Analytics Acceleration Library (Intel® DAAL), pyDAAL, Intel® MPI Library and Intel® Threading Building Blocks (Intel® TBB). For more information on these packages, please refer to the Release Notes.

The Intel Distribution for Python 2017 can be downloaded here. It is available for free for Python 2.7.x and 3.5.x on OS X*, Windows* 7 and later, and Linux*. The package can be installed as a standalone or with the Intel® Parallel Studio XE 2017.

Intel Distribution for Python supports both Python 2 and Python 3. There are two separate packages available in the Intel Distribution for Python: Python 2.7 and Python 3.5. In this article, the Intel® Distribution for Python 2.7 on Linux (l_python27_pu_2017.0.035.tgz) is installed on a 1.4 GHz, 68-core Intel® Xeon Phi™ processor 7250 with four hardware threads per core (a total of 272 hardware threads). To install, extract the package content, run the install script, and then follow the installer prompts:

$ tar -xvzf l_python27_pu_2017.0.035.tgz
$ cd l_python27_pu_2017.0.035
$ ./install.sh

After the installation completes, activate the root environment (see the Release Notes):

$ source /opt/intel/intelpython27/bin/activate root

Thread Parallelism in Cython

In Python, a global mutex prevents multiple native threads from executing bytecodes at the same time. Because of this, threads in Python cannot run in parallel. This section explores thread parallelism in Cython. This functionality is then imported into the Python code as an extension module, allowing the Python code to utilize all the cores and threads of the underlying hardware.

To generate an extension module, one can write Cython code (file with extension .pyx). The .pyx file is then compiled by the Cython compiler to convert it into efficient C code (file with extension .c). The .c file is in turn compiled and linked by a C/C++ compiler to generate a shared library (.so file). The shared library can be imported in Python as a module.

In the following multithreads.pyx file, the function serial_loop computes log(a)*log(b) for each entry in the A and B arrays and stores the result in the C array. The log function is imported from the C math library. The NumPy module, the high-performance scientific computation and data analysis package, is used in order to vectorize operations on A and B arrays.

Similarly, the function parallel_loop performs the same computation but uses OpenMP* threads to execute the loop body. Instead of range, prange (parallel range) is used to allow multiple threads to execute the loop in parallel. prange is a function of the cython.parallel module and can be used for parallel loops. When this function is called, OpenMP starts a thread pool and distributes the work among the threads. Note that the prange function can be used only when the Global Interpreter Lock (GIL) is released by putting the loop in a nogil context (the GIL prevents multiple threads from running concurrently). With wraparound(False), Cython never checks for negative indices; with boundscheck(False), Cython does not do bounds checking on the arrays.

$ cat multithreads.pyx

cimport cython
import numpy as np
cimport openmp
from libc.math cimport log
from cython.parallel cimport prange
from cython.parallel cimport parallel

THOUSAND = 1024
FACTOR = 100
NUM_TOTAL_ELEMENTS = FACTOR * THOUSAND * THOUSAND
X1 = -1 + 2*np.random.rand(NUM_TOTAL_ELEMENTS)
X2 = -1 + 2*np.random.rand(NUM_TOTAL_ELEMENTS)
Y = np.zeros(X1.shape)

def test_serial():
    serial_loop(X1,X2,Y)

def serial_loop(double[:] A, double[:] B, double[:] C):
    cdef int N = A.shape[0]
    cdef int i

    for i in range(N):
        C[i] = log(A[i]) * log(B[i])

def test_parallel():
    parallel_loop(X1,X2,Y)

@cython.boundscheck(False)
@cython.wraparound(False)
def parallel_loop(double[:] A, double[:] B, double[:] C):
    cdef int N = A.shape[0]
    cdef int i

    with nogil:
        for i in prange(N, schedule='static'):
            C[i] = log(A[i]) * log(B[i])

After completing the Cython code, the Cython compiler converts it to a C extension file. This can be done with a distutils setup.py file (distutils is used to distribute Python modules). To use OpenMP support, one must tell the compiler to enable OpenMP by providing the -fopenmp flag as a compile argument and a link argument in the setup.py file, as shown below. The setup.py file invokes the setuptools build process that generates the extension modules. By default, this setup.py uses GNU GCC* to compile the C code of the Python extension. In addition, we add the -O0 compile flag (disable all optimizations) to create a baseline measurement.

$ cat setup.py
from distutils.core import setup
from Cython.Build import cythonize
from distutils.extension import Extension
from Cython.Distutils import build_ext

setup(
  name = "multithreads",
  cmdclass = {"build_ext": build_ext},
  ext_modules =
  [
    Extension("multithreads",
              ["multithreads.pyx"],
              extra_compile_args = ["-O0", "-fopenmp"],
              extra_link_args=['-fopenmp']
              )
  ]
)

Use the command below to build C/C++ extensions:

$ python setup.py build_ext --inplace

Alternatively, you can also manually compile the Cython code:

$ cython multithreads.pyx

This generates the multithreads.c file, which contains the Python extension code. You can compile the extension code with the gcc compiler to generate the shared object multithreads.so file.

$ gcc -O0 -shared -pthread -fPIC -fwrapv -Wall -fno-strict-aliasing
-fopenmp multithreads.c -I/opt/intel/intelpython27/include/python2.7 -L/opt/intel/intelpython27/lib -lpython2.7 -o multithreads.so

After the shared object is generated, Python code can import this module to take advantage of thread parallelism; a quick interactive check is shown below. The following sections then show how its performance can be measured and improved.
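
As a quick check (assuming multithreads.so is in the current directory), the module can be imported and its functions called directly from an interactive session:

>>> import multithreads
>>> multithreads.test_serial()     # one thread executes the loop
>>> multithreads.test_parallel()   # OpenMP threads share the loop iterations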

You can import the timeit module to measure the execution time of a Python function. Note that by default, timeit runs the measured function 1,000,000 times; in the following examples the number of executions is set to 100 to keep the run time manageable. timeit.Timer() imports the multithreads module and measures the time spent by the function multithreads.test_serial(). The argument number=100 tells the Python interpreter to perform the run 100 times. Thus, t1.timeit(number=100) measures the time to execute the serial loop (only one thread performs the loop) 100 times.

Similarly, t2.timeit(number=100) measures the time when executing the parallel loop (multiple threads perform the computation in parallel) 100 times.

  • Measure the serial loop with the gcc compiler, compiler option -O0 (all optimizations disabled).
$ python
Python 2.7.12 |Intel Corporation| (default, Oct 20 2016, 03:10:12)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Intel(R) Distribution for Python is brought to you by Intel Corporation.
Please check out: https://software.intel.com/en-us/python-distribution

Import timeit and time t1 to measure the time spent in the serial loop. Note that you built with gcc compiler and disabled all optimizations. The result is displayed in seconds.

>>> import timeit
>>> t1 = timeit.Timer("multithreads.test_serial()","import multithreads")
>>> t1.timeit(number=100)
2874.419779062271
  • Measure the parallel loop with the gcc compiler, compiler option -O0 (all optimizations disabled).

The parallel loop is measured by t2 (again, you built with gcc compiler and disabled all optimizations).

>>> t2 = timeit.Timer("multithreads.test_parallel()","import multithreads")
>>> t2.timeit(number=100)
26.016316175460815

As you observe, the parallel loop improves the performance by roughly a factor of 110x.

  • Measure the parallel loop with the icc compiler, compiler option -O0 (all optimizations disabled).

Next, recompile using the Intel® C Compiler and compare the performance. For the Intel® C/C++ Compiler, use the -qopenmp flag instead of -fopenmp to enable OpenMP. After installing the Intel Parallel Studio XE 2017, set the proper environment variables and delete all previous builds:

$ source /opt/intel/parallel_studio_xe_2017.1.043/psxevars.sh intel64
Intel(R) Parallel Studio XE 2017 Update 1 for Linux*
Copyright (C) 2009-2016 Intel Corporation. All rights reserved.

$ rm multithreads.so multithreads.c -r build

To explicitly use the Intel icc to compile this application, execute the setup.py file with the following command:

$ LDSHARED="icc -shared" CC=icc python setup.py build_ext --inplace

The parallel loop is measured by t2 (this time, you built with Intel compiler, disabled all optimizations):

$ python
>>> import timeit
>>> t2 = timeit.Timer("multithreads.test_parallel()","import multithreads")
>>> t2.timeit(number=100)
23.89365792274475
  • Measure the parallel loop with the icc compiler, compiler option -O3.

For the third try, you may want to see whether using -O3 optimization and enabling the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) ISA on the Intel® Xeon Phi™ processor can improve the performance. To do this, in setup.py, replace -O0 with -O3 and add -xMIC-AVX512 (see the snippet below). Repeat the compilation, and then run the parallel loop as indicated in the previous step, which results in 21.027512073516846 seconds. The following graph shows the results (in seconds) when compiling with gcc, with icc without optimization, and with icc with optimization and the Intel AVX-512 ISA:

The result shows that the best result (21.03 seconds) is obtained when you compile the parallel loop with the Intel compiler, and enable auto-vectorization (-O3) combined with Intel AVX-512 ISA (-xMIC-AVX512) for the Intel Xeon Phi processor.
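
For reference, the Extension entry in setup.py for this third configuration might look like the following sketch (with -qopenmp replacing -fopenmp for the Intel compiler, as noted earlier):

    Extension("multithreads",
              ["multithreads.pyx"],
              extra_compile_args = ["-O3", "-xMIC-AVX512", "-qopenmp"],
              extra_link_args=['-qopenmp', '-xMIC-AVX512']
              )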

By default, the Intel Xeon Phi processor uses all available resources: it has 68 cores, and each core runs four hardware threads, so a total of 272 threads (four threads per core) run in a parallel region. It is possible to modify the number of cores used and the number of threads run by each core. The last section shows how to use an environment variable to accomplish this.

  • To run 68 threads on 68 cores (one thread per core) while executing the loop body 100 times, set the KMP_PLACE_THREADS environment variable as shown below:

$ export KMP_PLACE_THREADS=68c,1t

  • To run 136 threads on 68 cores (two threads per core) while running the parallel loop 100 times, set the KMP_PLACE_THREADS environment variable as shown below:

$ export KMP_PLACE_THREADS=68c,2t

  • To run 204 threads on 68 cores (three threads per core) while running the parallel loop 100 times, set the KMP_PLACE_THREADS environment variable as shown below:

$ export KMP_PLACE_THREADS=68c,3t

The following graph summarizes the result:

Conclusion

This article showed how to use Cython to build an extension module for Python in order to take advantage of multithreading on the Intel Xeon Phi processor. It showed how to use a setup script to build a shared library, and how the parallel loop performance can be improved by trying different compiler options in the setup script. The article also showed how to set a different number of threads per core.

Exploring MPI for Python* on Intel® Xeon Phi™ Processor


Introduction

Message Passing Interface (MPI) is a standardized message-passing library interface designed for distributed memory programming. MPI is widely used in the High Performance Computing (HPC) domain because it is well-suited for distributed memory architectures.

Python* is a modern, powerful interpreted language that supports modules and packages. Python also supports C/C++ extensions. While HPC applications are usually written in C or Fortran for speed, Python can be used to quickly prototype a proof of concept and for rapid application development because of its simplicity and modularity.

The MPI for Python (mpi4py) package provides Python bindings for the MPI standard. The mpi4py package translates MPI syntax and semantics and uses Python objects to communicate, so programmers can implement MPI applications in Python quickly. Note that mpi4py is object-oriented. Not all functions in the MPI standard are available in mpi4py; however, almost all the commonly used functions are. More information on mpi4py can be found here. In mpi4py, COMM_WORLD is an instance of the base class of communicators.

mpi4py supports two types of communications:

  • Communication of generic Python objects: The methods of a communicator object are lower-case (send(), recv(), bcast(), scatter(), gather(), etc.). In this type of communication, the sent object is passed as a parameter to the communication call.
  • Communication of buffer-like objects: The methods of a communicator object start with an upper-case letter (Send(), Recv(), Bcast(), Scatter(), Gather(), etc.). Buffer arguments to these calls are specified using tuples. This type of communication is much faster than communicating generic Python objects (a short sketch contrasting the two styles follows this list).
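
A minimal sketch contrasting the two communication styles (two ranks; run with mpirun -n 2 python <script>.py):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# 1) Generic Python objects: lower-case methods, the object is pickled under the hood
if rank == 0:
    comm.send({'step': 1, 'label': 'hello'}, dest=1, tag=0)
elif rank == 1:
    obj = comm.recv(source=0, tag=0)
    print("rank 1 received %s" % obj)

# 2) Buffer-like objects: upper-case methods, much faster for large arrays
if rank == 0:
    data = np.arange(10, dtype='d')
    comm.Send([data, MPI.DOUBLE], dest=1, tag=1)
elif rank == 1:
    buf = np.empty(10, dtype='d')
    comm.Recv([buf, MPI.DOUBLE], source=0, tag=1)
    print("rank 1 received an array of %d doubles" % buf.size)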

Intel® Distribution for Python* 2017

Intel® Distribution for Python* is a binary distribution of Python interpreter; it accelerates core Python packages including NumPy, SciPy, Jupyter, matplotlib, mpi4py, etc. The package integrates Intel® Math Kernel Library (Intel® MKL), Intel® Data Analytics Acceleration Library (Intel® DAAL), pyDAAL, Intel® MPI Library, and Intel® Threading Building Blocks (Intel® TBB).

The Intel Distribution for Python 2017 is available free for Python 2.7.x and 3.5.x on OS X*, Windows* 7 and later, and Linux*. The package can be installed standalone or with the Intel® Parallel Studio XE 2017.

In the Intel Distribution for Python, mpi4py is a Python wrapper around the native Intel MPI implementation (Intel MPI Library). This document shows how to write an MPI program in Python, and how to take advantage of Intel® multicore architecture using OpenMP threads and Intel® AVX-512 instructions.

Intel Distribution for Python supports both Python 2 and Python 3. There are two separate packages available in the Intel Distribution for Python: Python 2.7 and Python 3.5. In this example, the Intel Distribution for Python 2.7 on Linux (l_python27_pu_2017.0.035.tgz) is installed on an Intel® Xeon Phi™ processor 7250 @ 1.4 GHz and 68 cores with 4 hardware threads per core (a total of 272 hardware threads). To install, extract the package content, run the install script, and follow the installer prompts:

$ tar -xvzf l_python27_pu_2017.0.035.tgz
$ cd l_python27_pu_2017.0.035
$ ./install.sh

After the installation completes, activate the root Intel® Python Conda environment:

$ source /opt/intel/intelpython27/bin/activate root

Parallel Computing: OpenMP and SIMD

While multithreaded Python workloads can use Intel TBB optimized thread scheduling, another approach is to use OpenMP to take advantage of Intel multicore architecture. This section shows how to use OpenMP multithreading and the C math library in Cython*.

Cython is a superset of Python that is translated into C and compiled into native code. Cython is similar to Python, but it supports C function calls and C-style declarations of variables and class attributes. Cython is used for wrapping external C libraries that speed up the execution of a Python program. Cython generates C extension modules, which are used by the main Python program via the import statement.

For example, to generate an extension module, one can write Cython code (a .pyx file). The .pyx file is then compiled by Cython to generate a .c file, which contains the Python extension code. The .c file is in turn compiled by a C compiler to generate a shared object library (.so file).

One way to build Cython code is to write a distutils setup.py file (distutils is used to distribute Python modules). In the following multithreads.pyx file, the function vector_log_multiplication computes log(a)*log(b) for each entry in the A and B arrays and stores the result in the C array. Note that a parallel loop (prange) is used to allow multiple threads to execute in parallel. The log function is imported from the C math library. The function getnumthreads() returns the number of OpenMP threads:

$ cat multithreads.pyx

cimport cython
import numpy as np
cimport openmp
from libc.math cimport log
from cython.parallel cimport prange
from cython.parallel cimport parallel

@cython.boundscheck(False)
def vector_log_multiplication(double[:] A, double[:] B, double[:] C):
    cdef int N = A.shape[0]
    cdef int i

    with nogil, cython.boundscheck(False), cython.wraparound(False):
        for i in prange(N, schedule='static'):
            C[i] = log(A[i]) * log(B[i])

def getnumthreads():
    cdef int num_threads

    with nogil, parallel():
        num_threads = openmp.omp_get_num_threads()
        with gil:
            return num_threads

The setup.py file invokes the setuptools build process that generates the extension modules. By default, this setup.py uses GNU GCC* to compile the C code of the Python extension. In order to take advantage of AVX-512 and OpenMP multithreads in the Intel Xeon Phi processor, one can specify the options -xMIC-avx512 and -qopenmp in the compile and link flags, and use the Intel® compiler icc. For more information on how to create the setup.py file, refer to the Writing the Setup Script section of the Python documentation.

$ cat setup.py

from distutils.core import setup
from Cython.Build import cythonize
from distutils.extension import Extension
from Cython.Distutils import build_ext

setup(
  name = "multithreads",
  cmdclass = {"build_ext": build_ext},
  ext_modules = [
    Extension("multithreads",
              ["multithreads.pyx"],
              libraries=["m"],
              extra_compile_args = ["-O3", "-xMIC-avx512", "-qopenmp" ],
              extra_link_args=['-qopenmp', '-xMIC-avx512']
              )
  ]

)

In this example, the Parallel Studio XE 2017 update 1 is installed. First, set the proper environment variables for the Intel C compiler:

$ source /opt/intel/parallel_studio_xe_2017.1.043/psxevars.sh intel64
Intel(R) Parallel Studio XE 2017 Update 1 for Linux*
Copyright (C) 2009-2016 Intel Corporation. All rights reserved.

To explicitly use the Intel compiler icc to compile this application, execute the setup.py file with the following command:

$ LDSHARED="icc -shared" CC=icc python setup.py build_ext --inplace

running build_ext
cythoning multithreads.pyx to multithreads.c
building 'multithreads' extension
creating build
creating build/temp.linux-x86_64-2.7
icc -fno-strict-aliasing -Wformat -Wformat-security -D_FORTIFY_SOURCE=2 -fstack-protector -O3 -fpic -fPIC -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/opt/intel/intelpython27/include/python2.7 -c multithreads.c -o build/temp.linux-x86_64-2.7/multithreads.o -O3 -xMIC-avx512 -march=native -qopenmp
icc -shared build/temp.linux-x86_64-2.7/multithreads.o -L/opt/intel/intelpython27/lib -lm -lpython2.7 -o /home/plse/test/v7/multithreads.so -qopenmp -xMIC-avx512

As mentioned above, this process first generates the extension code multithreads.c. The Intel compiler compiles this extension code to generate the dynamic shared object library multithreads.so.

How to write a Python Application with Hybrid MPI/OpenMP

In this section, we write an MPI application in Python. This program imports mpi4py and multithreads modules. The MPI application uses a communicator object, MPI.COMM_WORLD, to identify a set of processes which can communicate within the set. The MPI functions MPI.COMM_WORLD.Get_size(), MPI.COMM_WORLD.Get_rank(), MPI.COMM_WORLD.send(), and MPI.COMM_WORLD.recv() are methods of this communicator object. Note that in mpi4py there is no need to call MPI_Init() and MPI_Finalize() as in the MPI standard because these functions are called when the module is imported and when the Python process ends, respectively.

The sample Python application first initializes two large input arrays consisting of random numbers between 1 and 2. Each MPI rank uses OpenMP threads to do the computation in parallel; each OpenMP thread in turn computes the product of two natural logarithms, c = log(a)*log(b), where a and b are random numbers in the interval [1, 2). To do that, each MPI rank calls the vector_log_multiplication function defined in the multithreads.pyx file. Execution time of this function is short, about 1.5 seconds. For illustration purposes, we use the timeit utility to invoke the function ten times, which gives us enough time to observe the number of OpenMP threads involved.

Below is the application source code mpi_sample.py:

from mpi4py import MPI
from multithreads import *
import numpy as np
import timeit

def time_vector_log_multiplication():
    vector_log_multiplication(A, B, C)

size = MPI.COMM_WORLD.Get_size()
rank = MPI.COMM_WORLD.Get_rank()
name = MPI.Get_processor_name()

THOUSAND = 1024
FACTOR = 512
NUM_TOTAL_ELEMENTS = FACTOR * THOUSAND * THOUSAND
NUM_ELEMENTS_RANK = NUM_TOTAL_ELEMENTS / size
repeat = 10
numthread = getnumthreads()

if rank == 0:
   print "Initialize arrays for %d million of elements" % FACTOR

A = 1 + np.random.rand(NUM_ELEMENTS_RANK)
B = 1 + np.random.rand(NUM_ELEMENTS_RANK)
C = np.zeros(A.shape)

if rank == 0:
    print "Start timing ..."
    print "Call vector_log_multiplication with iter = %d" % repeat
    t1 =  timeit.timeit("time_vector_log_multiplication()", setup="from __main__ import time_vector_log_multiplication",number=repeat)
    print "Rank %d of %d running on %s with %d threads in %d seconds" % (rank, size, name, numthread, t1)

    for i in xrange(1, size):
        rank, size, name, numthread, t1 = MPI.COMM_WORLD.recv(source=i, tag=1)
        print "Rank %d of %d running on %s with %d threads in %d seconds" % (rank, size, name, numthread, t1)
    print "End  timing ..."

else:
    t1 =  timeit.timeit("time_vector_log_multiplication()", setup="from __main__ import time_vector_log_multiplication",number=repeat)
    MPI.COMM_WORLD.send((rank, size, name, numthread, t1), dest=0, tag=1)

Run the following command line to launch the above Python application with two MPI ranks:

$ mpirun -host localhost -n 2 python mpi_sample.py

Initialize arrays for 512 million of elements
Start timing ...
Call vector_log_multiplication with iter = 10
Rank 0 of 2 running on knl-sb2.jf.intel.com with 136 threads in 14 seconds
Rank 1 of 2 running on knl-sb2.jf.intel.com with 136 threads in 15 seconds
End  timing ...

While the Python program is running, the top command in a new terminal displays two MPI ranks (shown as two Python processes). When the main module enters the loop (shown with the message “Start timing…”), the top command reports almost 136 threads running (~13600 %CPU). This is because by default, all 272 hardware threads on this system are utilized by two MPI ranks, thus each MPI rank has 272/2 = 136 threads.

To get detailed information about MPI at run time, we can set the I_MPI_DEBUG environment variable to a value ranging from 0 to 1000. The following command runs 4 MPI ranks and sets the I_MPI_DEBUG to the value 4. Each MPI rank has 272/4 = 68 OpenMP threads as indicated by the top command:

$ mpirun -n 4 -genv I_MPI_DEBUG 4 python mpi_sample.py

[0] MPI startup(): Multi-threaded optimized library
[0] MPI startup(): shm data transfer mode
[1] MPI startup(): shm data transfer mode
[2] MPI startup(): shm data transfer mode
[3] MPI startup(): shm data transfer mode
[0] MPI startup(): Rank    Pid      Node name             Pin cpu
[0] MPI startup(): 0       84484    knl-sb2.jf.intel.com  {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220}
[0] MPI startup(): 1       84485    knl-sb2.jf.intel.com  {17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237}
[0] MPI startup(): 2       84486    knl-sb2.jf.intel.com  {34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254}
[0] MPI startup(): 3       84487    knl-sb2.jf.intel.com  {51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,255,256,257,258,259,260,261,262,263,264,265,266,267,268,269,270,271}
Initialize arrays for 512 million of elements
Start timing ...
Call vector_log_multiplication with iter = 10
Rank 0 of 4 running on knl-sb2.jf.intel.com with 68 threads in 16 seconds
Rank 1 of 4 running on knl-sb2.jf.intel.com with 68 threads in 15 seconds
Rank 2 of 4 running on knl-sb2.jf.intel.com with 68 threads in 15 seconds
Rank 3 of 4 running on knl-sb2.jf.intel.com with 68 threads in 15 seconds
End  timing ...

We can specify the number of OpenMP threads used by each rank in the parallel region by setting the OMP_NUM_THREADS environment variable. The following command starts 4 MPI ranks with 34 threads for each MPI rank (2 threads per core):

$  mpirun -host localhost -n 4 -genv OMP_NUM_THREADS 34 python mpi_sample.py

Initialize arrays for 512 million of elements
Start timing ...
Call vector_log_multiplication with iter = 10
Rank 0 of 4 running on knl-sb2.jf.intel.com with 34 threads in 18 seconds
Rank 1 of 4 running on knl-sb2.jf.intel.com with 34 threads in 17 seconds
Rank 2 of 4 running on knl-sb2.jf.intel.com with 34 threads in 17 seconds
Rank 3 of 4 running on knl-sb2.jf.intel.com with 34 threads in 17 seconds
End  timing ...

Finally, we can force the program to allocate memory in MCDRAM (High-Bandwidth Memory on the Intel Xeon Phi processor). For example, before the execution of the program, the "numactl --hardware" command shows the system has 2 NUMA nodes: node 0 consists of CPUs and 96 GB DDR4 memory, node 1 is the on-board 16 GB MCDRAM memory:

$ numactl --hardware

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271
node 0 size: 98200 MB
node 0 free: 73585 MB
node 1 cpus:
node 1 size: 16384 MB
node 1 free: 15925 MB
node distances:
node   0   1
  0:  10  31
  1:  31  10

Run the following command, which tells the program to allocate memory in MCDRAM when possible:

$ mpirun -n 4 numactl --preferred 1 python mpi_sample.py

While the program is running, we can observe that it allocates memory in MCDRAM (NUMA node 1):

$ numactl --hardware

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271
node 0 size: 98200 MB
node 0 free: 73590 MB
node 1 cpus:
node 1 size: 16384 MB
node 1 free: 3428 MB
node distances:
node   0   1
  0:  10  31
  1:  31  10

Readers can also try the above code on an Intel® Xeon® processor system with the appropriate settings. For example, on the Intel® Xeon® processor E5-2690 v4, use -xCORE-AVX2 instead of -xMIC-AVX512 and set the number of available threads to 28 instead of 272. Also note that the E5-2690 v4 does not have high-bandwidth memory (MCDRAM).
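
As a sketch of that adaptation, a setup.py for such an Intel Xeon processor system might look like the following; it assumes the same multithreads.pyx source and module name as above, and only the architecture flag changes:

from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

setup(
  name = "multithreads",
  cmdclass = {"build_ext": build_ext},
  ext_modules = [
    Extension("multithreads",
              ["multithreads.pyx"],
              libraries=["m"],
              # -xCORE-AVX2 targets the AVX2 units of the Intel Xeon processor E5 v4 family
              extra_compile_args = ["-O3", "-xCORE-AVX2", "-qopenmp" ],
              extra_link_args=['-qopenmp', '-xCORE-AVX2']
              )
  ]
)

The same build command shown earlier (CC=icc with LDSHARED="icc -shared") can then be reused unchanged.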

Conclusion

This article introduced the MPI for Python package and demonstrated how to use it via the Intel Distribution for Python. Furthermore, it showed how to combine OpenMP and Intel AVX-512 instructions to fully take advantage of the Intel Xeon Phi processor architecture: a simple example showed how to write a parallel Cython function with OpenMP, compile it with the Intel compiler with the AVX-512 option enabled, and integrate it with an MPI Python program.

About the Author

Loc Q Nguyen received an MBA from University of Dallas, a master’s degree in Electrical Engineering from McGill University, and a bachelor's degree in Electrical Engineering from École Polytechnique de Montréal. He is currently a software engineer with Intel Corporation's Software and Services Group. His areas of interest include computer networking, parallel computing, and computer graphics.

Intel® Clear Containers 1: The Container Landscape


Download PDF

Introduction

This article introduces the concept of Intel® Clear Containers technology and how it fits into the overall landscape of current container-based technologies. Many introductory articles have already been written about containers and virtual machines (VMs) over the last several years. Here's a good overview: A Beginner-Friendly Introduction To Containers, VMs, and Docker.

This article briefly summarizes the most salient features of existing VMs and containers.

Intel Clear Containers offer advantages to data center managers and developers because of their security and shared memory efficiencies, while their performance remains sufficient for running containerized applications.

What's In A Word: Container

The word “container” gets thrown around a lot in this article and many others, and it's important to understand its meaning.

If you’re an application developer, you probably think of a container as some kind of useful application running inside a private chunk of a computer's resources. For example, you would discuss downloading a container image of a web server from Docker* Hub*, and then running that container on a host.

If you’re a data center manager, your perception of a container might be slightly different. You know that application developers are running all kinds of things in your data center, but you are more concerned with the composition and management of the bounded resources the applications run in than the applications themselves. You might discuss running “an LXC container using Docker” for the developer's web server. You would refer to the application image and application itself as the containerized workload.

This is a subtle, but important, difference. Since we'll be looking directly at container technology, independent of the workloads run using it, we'll use the data center manager's definition of container: the technology that creates an instance of bounded resources for a containerized workload to use.

Virtual Machines versus Containers

It is technically a misstatement to suggest that containers came after VMs. In fact, containers are more-or-less a form of virtualized resources themselves, and both technologies are descendants in a long line of hardware abstractions that stretch back to early computing.

However, in the marketplace of modern IT, it's relatively clear that the era from roughly the mid-2000s to around 2014 (or so) was dominated by the rise of paravirtual machine usage, both in the traditional data center and in cloud computing environments. The increasing power of servers, combined with advancements in VM-friendly hardware platforms such as Intel® Virtualization Technology for IA-32, Intel® 64 and Intel® Architecture and Intel® Virtualization Technology for Directed I/O, allowed data center managers to assign workloads more flexibly.

Early concerns about VMs included performance and security. As the platforms grew more and more robust, these concerns became less and less relevant. Eventually commodity hypervisors were capable of delivering performance with less than 2–3 percent falloff from direct physical access. From a security standpoint, VMs became more isolated, allowing them to run in user space, meaning a single ill-behaved VM that became compromised did not automatically allow an attacker access outside the VM itself.

Within the last few years, containers began exploding upon the scene. A container, whether provided by Linux* Containers (LXC), libcontainer*, or other types, offers direct access to hardware, so there's no performance penalty. They can be instantiated far more quickly than a regular VM since they don't have to go through a bootup process. They don't require the heavyweight overhead of an entire OS installation to run. Most importantly, the powerful trio of a container, a container management system (Docker), and a robust library of containerized applications (DockerHub) gives application developers access to rapid deployment and scaling of their applications that could not be equaled by traditional VMs.

While offering these huge rewards, containers reintroduced a security problem: they represent direct access to the server hardware that underpins them. A compromised container allows an attacker the capability to escape to the rest of the OS beneath it.

[Note: We don't mean to imply that this access is automatic or easy. There are many steps to secure containers in the current market. However, a compromise of the container itself—NOT the containerized application—results in likely elevation to the kernel level.]

That leaves us with this admittedly highly generalized statement of pros and cons for VMs versus containers:

Virtual Machine    | Container
- Slow Boot        | + Rapid Start
- Heavy Mgmt.      | + Easy Mgmt.
+ Security         | - Security
= Performance*     | = Performance

*VMs take a negligible performance deficit due to hardware abstraction.

Best of Both Worlds: Intel Clear Containers

Intel has developed a new, open source method of launching containerized workloads called Intel Clear Containers. An Intel Clear Container, running on Intel architecture with Intel® Virtualization Technology enabled, is:

  • A highly-customized version of the QEMU-KVM* hypervisor, called qemu-lite.
    • Most of the boot-time probes and early system setup associated with a full-fledged hypervisor are unnecessary and stripped away.
    • This reduces startup time to be on a par with a normal container process.
  • A mini-OS that consists of:
    • A highly-optimized Linux kernel.
    • An optimized version of systemd.
    • Just enough drivers and additional binaries to bring up an overlay filesystem, set up networking, and attach volumes.
  • The correct tooling to bring up containerized workload images exactly as a normal container process would.

Intel Clear Containers can also be integrated with Docker 1.12, allowing Docker to be used exactly as though it were operating normal OS containers via its native execution engine. This drop-in is possible because the runtime is compatible with the Open Container Initiative* (OCI*). The important point is that from the application developer perspective, where “container” means the containerized workload, an Intel Clear Container looks and behaves just like a “normal” OS container.

There are some additional but less obvious benefits. Since the mini-OS uses a 4.0+ Linux kernel, it can take advantage of the “direct access” (DAX) feature of the kernel to replace what would be overhead associated with VM memory page cache management. The result is faster performance by the mini-OS kernel and a significant reduction in the memory footprint of the base OS and filesystem; only one copy needs to be resident in memory on a host that could be running thousands of containers.

In addition, Kernel Shared Memory (KSM) allows the containerized VMs to share memory securely for static information that is not already shared by DAX via a process of de-duplication. This results in an even more efficient memory profile. The upshot of these two combined technologies is that the system's memory gets used for the actual workloads, rather than redundant copies of the same OS and library data.

Given the entry of Intel Clear Containers onto the scene, we can expand the table from above:

Virtual Machine    | Container       | Intel® Clear Container
- Slow Boot        | + Rapid Start   | + Rapid Start^
- Heavy Mgmt.      | + Easy Mgmt.    | + Easy Mgmt.
+ Security         | - Security      | + Security
= Performance*     | = Performance   | = Performance*

*VMs take a negligible performance deficit due to hardware abstraction.

Conclusion

Intel Clear Containers offer a means of combining the best features of VMs with the power and flexibility that containers bring to application developers.

You can find more information about Intel Clear Containers at the official website.

This is the first in a series of articles about Intel Clear Containers. In the second, we'll be demonstrating how to get started using Intel Clear Containers yourself.

Read the second article in the series: Intel® Clear Containers 2: Using Clear Containers with Docker

About the Author

Jim Chamings is a Sr. Software Engineer at Intel Corporation, who focuses on enabling cloud technology for Intel’s Developer Relations Division. Before that he worked in Intel’s Open Source Technology Center (OTC), on both Intel Clear Containers and the Clear Linux for Intel Architecture Project. He’d be happy to hear from you about this article at: jim.chamings@intel.com.

Using Intel® VTune™ Amplifier on Cray* XC systems


Introduction

The goal of this article is to provide a detailed description of installing VTune Amplifier and using it for application performance analysis, both of which are somewhat specific to Cray’s programming environment (PE). We will reference CLE 6.0, the Cray installation and configuration model for software on Cray XC systems [1]. The installation part of the article targets site administrators and system supporters responsible for the Cray XC programming environment, while the data collection and analysis part is applicable to Cray XC system users.

 

Installation 

Cray CLE 6.0 provides a set of compilers, performance analysis tools, and run-time libraries, including the Intel compiler and the Intel MPI library. However, VTune Amplifier is not part of it and requires additional effort to install into the programming environment.

According to the Cray CLE 6.0 documentation [2], installation of additional software into a PE image root is performed on the system's System Management Workstation (SMW). The PE image root is then pushed to the boot node so that it can be mounted by a group of Data Virtualization Service (DVS) servers and then mounted on the system's login and compute nodes.

Cray promotes the PE image root model because the installation is designed to be system and hardware agnostic, so the same PE image root can also be used for other systems, such as eLogin systems or another Cray XC. A feature of Image Management and Provisioning System (IMPS) images is that they are easily "cloned" by leveraging rpm and zypper. This allows a site to test new PE releases and also makes reverting to previous PE releases easier. However, the sampling driver part of the VTune installation is not system agnostic: the driver must be built against the Linux kernel that actually runs on the nodes used for data collection, as shown later in the example.

VTune Amplifier is installed on the SMW by using chroot to access the PE image root. You need to copy the VTune installation package to the PE image root, execute the VTune installation procedure, and create a VTune modulefile.

The craypkg-gen tool is used to generate a modulefile so that third-party software like VTune can be used in the same manner as the components of the Cray Programming Environment. Before that, you need to define the USER_INSTALL_DIR environment variable, which for VTune would be /opt/intel.

The Craypkg-gen ‘-m’ option will create the modulefile:

$ craypkg-gen -m $USER_INSTALL_DIR/vtune_amplifier_xe_2017.0.2.478468

The ‘-m’ option also creates a set_default script that will make the associated modulefile the default version that is used by the module command. For this example, the following set_default script was created:

$USER_INSTALL/admin-pe/set_default_craypkg/set_default_vtune_amplifier_xe_2017.0.2.478468

Executing the generated set_default script will result in a “module load vtune” loading the vtune_amplifier_xe/2017.0.2.478468 modulefile.

 

Example of installing VTune Amplifier 2017

With the CLE 6.0 Programming Environment software installed into a PE image root, download the Intel VTune Amplifier 2017 package and copy it to the PE image root:

smw # export PECOMPUTE=/var/opt/cray/imps/image_roots/<pe_compute_cle_6.0_imagename>
smw # cp vtune_amplifier_xe_2017_update1.tar.gz $PECOMPUTE/var/tmp

Note that this could also be the whole Intel Parallel Studio XE package (parallel_studio_xe_2016_update1.tgz) rather than a standalone VTune installation package. In that case the installation differs only in that you select the VTune component.

If you are not using a FlexLm license server, which requires a certain configuration, copy a registered license file to the PE image for interactive installation:

smw # cp l_vtune_amplifier_xe_2017_p.lic $PECOMPUTE/var/tmp

Or copy the license file to the default Intel licenses directory:

smw # cp l_vtune_amplifier_xe_2017_p.lic $PECOMPUTE/opt/intel/licenses

Perform a chroot to PE image:

smw # chroot $PECOMPUTE

Untar the VTune Amplifier package:

smw # cd /var/tmp
smw # tar xzvf vtune_amplifier_xe_2017_update1.tar.gz

By default, the VTune installer is interactive and requires the administrator to respond to prompts. You might want to consult with the Intel® VTune™ Amplifier XE Installation Guide before proceeding.

smw # cd vtune_amplifier_xe_2017_update1/
smw # ./install.sh

Follow the command prompts to install the product.

If you need a non-interactive VTune installation, refer to the Automated Installation of Intel® VTune™ Amplifier XE help article.

 

Once the installer flow reaches the sep driver installation step, you can either postpone that step or provide the path to the source directory of the Linux kernel that runs on the Cray compute nodes.

Note: the Cray SMW 8.0 is based on SLES 12, which might not be the same OS as on the compute nodes. In that case you need to provide a path to the target OS kernel headers when requested by the VTune installer.

If you postponed the driver installation, go through the following steps (assuming that the compute node Linux kernel sources are unpacked to usr/src/target_linux).

Use the GCC environment for building:

smw #  module swap PrgEnv-cray PrgEnv-gnu

Set environment variable CC, so that 'cc' is used as the compiler:

smw # export CC=cc

Build the drivers (two kernel drivers will be built):

smw # cd vtune_amplifier_xe_2017/sepdk/src
smw # ./build-driver -ni --kernel-src-dir=$PECOMPUTE/usr/src/target_linux

Install the drivers with access permitted to a user group (by default, the driver access group name is ‘vtune’ and the driver permissions are 660):

smw # ./insmod-sep3 -r -g <group>

By default, the driver will be installed in the current /sepdk/src directory. If you need to change it, use the --install-dir option with the insmod-sep3 script.

Refer to the <vtune-install-dir>/sepdk/src/README.txt document for more details on building the driver.

Create the VTune modulefile following the steps:

smw # module load craypkg-gen
smw # craypkg-gen -m $PECOMPUTE/opt/intel/vtune_amplifier_xe_2017.0.2.478468
smw # /opt/intel/vtune_amplifier_xe_2017.0.2.478468/amplxe-vars.sh

The above procedure will create the modulefile $PECOMPUTE/modulefiles/vtune_amplifier_xe/2017.0.2.478468

You might want to edit the newly created modulefile specifying path variables.

 

Collecting profile data with VTune Amplifier

To collect profiling data for further analysis, you need to run the VTune collector along with your application. There are several ways to launch an application for analysis; in general, they are described in the VTune Amplifier Help pages.

On Cray systems, applications are run by submitting batch jobs, and the same applies to VTune. It is generally recommended to use the VTune command-line tool, "amplxe-cl", to collect profiling data on compute nodes via batch jobs, and then to use the VTune GUI, "amplxe-gui", to display the results on a login node of the system.

However, the job scheduler utilities used to submit tasks, as well as the compilers and MPI libraries used to build parallel applications, may vary depending on site-specific requirements. This creates additional complexity for performance data collection with VTune or any other profiling tool. Below are some common recipes for running performance data collection with the two most frequently used job schedulers.

 

Slurm* workload manager and srun command

Here is an example of a job script for analysis of a pure MPI application:

#!/bin/bash -l
#SBATCH --partition debug
#SBATCH --vtune
#SBATCH --time 01:00:00
#SBATCH --nodes 2
#SBATCH --job-name myjob

module unload darshan
module load vtune
srun -n 64 amplxe-cl -collect advanced-hotspots -r my_res_dir --trace-mpi -- ./a.out

This script runs the advanced-hotspots analysis on the a.out program on two nodes with 64 MPI tasks in total (-n 64). The VTune options mean the following:

-collect advanced-hotspots – the type of analysis used by the VTune collector (this is a hardware event-based collection, as are general-exploration and memory-access)

--trace-mpi – allows the collectors to trace MPI code and determine the MPI rank IDs if the code is linked against a non-Intel MPI library. When using the Intel MPI library, this option should be omitted.

-r my_res_dir – the name of the results directory, which will be created in the current directory

It is highly recommended to create the results directory on the fast Lustre file system. VTune frequently flushes trace data from memory to disk, so it is not recommended to put results on a global file system that is projected to the compute nodes via the Cray DVS layer, since that layer might not fully support the mmap functionality required by the VTune collector.

In the script, unload the darshan module before profiling your code, as the VTune collector might interfere with this I/O characterization tool (your system might not have darshan installed at all).

The --vtune batch option is needed so that the driver for hardware event collection is dynamically insmod'ed during the job.

Note the length of your job. Even if '-t' is set to 1 hour, it does not mean that VTune will collect data for the whole application run time. By default, the size of the results directory is limited, and when the trace file reaches this limit, VTune stops the collection while the application continues to run. The implication is that performance data is collected only for an initial part of the application run. To overcome this limitation, consider either increasing the results directory size limit or decreasing the sampling frequency.

 

If your application uses a hybrid parallelization approach combining MPI and OpenMP, your job script for VTune analysis might look like the following:

#!/bin/bash -l
#SBATCH --partition debug
#SBATCH --vtune
#SBATCH --time 01:00:00
#SBATCH --nodes 2
#SBATCH --job-name myjob

module unload darshan
module load vtune
export OMP_NUM_THREADS=32
srun -n 2 -c 32 amplxe-cl -collect advanced-hotspots -r my_res_dir --trace-mpi -- ./a.out

As you can see, the task and thread assignment syntax remains the same for srun, and as with a pure MPI application, you specify amplxe-cl as the task to execute; it takes care of distributing the a.out tasks between compute nodes. In this case VTune creates only two per-node results directories, named my_res_dir.<nodename>. The per-OpenMP-thread results are aggregated in each per-node trace file.

One downside of this approach is that VTune analyzes every task and creates results for each MPI rank in the job. This is not a problem when a job is distributed among a few ranks, but with hundreds or thousands of tasks you might end up with an enormous amount of performance data and an extremely long finalization time. In this case you might want to collect a profile for a single MPI rank or a subset of ranks, leveraging srun's multiple program configuration. This approach is described in article [3].

For the aforementioned example, create a separate configuration file that defines which MPI ranks are analyzed.

$ cat run_config.conf
0-30 ./a.out
31 amplxe-cl -collect advanced-hotspots -r my_res_dir --trace-mpi -- ./a.out

And in the job script the srun line will look like the following:

srun -n 32 -c 32 --multi-prog ./run_config.conf

 

Application Level Placement Scheduler* (ALPS) and aprun command

With ALPS, running VTune via the aprun command is similar to the Slurm/srun experience. Just be sure to use the --trace-mpi option so that VTune keeps one collector instance on each node that runs multiple MPI ranks.

For a pure MPI application your job script for VTune analysis might look like the following [4]:

#!/bin/bash
#PBS -l mppwidth=32
#PBS -l walltime=00:10:00
#PBS -N myjob
#PBS -q debug

cd $PBS_O_WORKDIR

aprun -n 32 -N 16 amplxe-cl -collect advanced-hotspots -r my_res_dir --trace-mpi -- ./a.out

where:

-n – number of processes

-N – number of processes per node

For a hybrid parallelization approach combining MPI and OpenMP:

#!/bin/bash
#PBS -l mppwidth=32
#PBS -l walltime=00:10:00
#PBS -N myjob
#PBS -q debug

cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=8
aprun -n 32 -N 2 -d 8 amplxe-cl -collect advanced-hotspots -r my_res_dir --trace-mpi -- ./a.out

where:

-d – depth, or the number of CPUs assigned per process

If you’d like to analyze just one node, modify the script to run multiple executables:

#!/bin/bash
#PBS -l mppwidth=32
#PBS -l walltime=00:10:00
#PBS -N myjob
#PBS -q debug

cd $PBS_O_WORKDIR

aprun -n 16 ./a.out : -n 16 -N 16 amplxe-cl -collect advanced-hotspots -r my_res_dir --trace-mpi -- ./a.out

 

Known limitations on VTune Amplifier collection in Cray XC systems

1. By default, the Cray compiler produces static binaries. The general recommendation is to use dynamic linking for profiling under VTune Amplifier where possible, to avoid a set of limitations the tool has when profiling static binaries. If dynamic linking cannot be applied, take the following VTune Amplifier limitations into account:

a) Pin-based analysis types do not work with static binaries out of the box, reporting the following message:

Error: Binary file of the analysis target does not contain symbols required for profiling. See the 'Analyzing Statically Linked Binaries' help topic for more details.

This impacts the hotspots, concurrency, and locks-and-waits collections, as well as memory-access collection with memory object instrumentation. See https://software.intel.com/en-us/node/609433 for how to work around the issue.

 

b) PMU-based analysis crashes on static binaries built with the OpenMP RTL from the 2017 Gold and earlier Intel compiler versions.

To work around the issue, use a wrapper script that unsets the following variables:

unset INTEL_LIBITTNOTIFY64
unset INTEL_ITTNOTIFY_GROUPS

The issue was fixed in the Intel OpenMP RTL that is part of Intel Compiler 2017 Update 1 and later.

 

c) Information based on the User API will not be available. This includes user pauses, resumes, frames, and tasks defined in the source code; OpenMP instrumentation-based statistics such as serial versus parallel time and imbalance on barriers; and rank number capturing to enrich process names with MPI rank numbers.

 

2. If the VTune result directory is placed on a file system projected by Cray DVS, VTune emits an error that the result cannot be finalized.

To work around the issue, place the VTune result directory on a file system without Cray DVS projection (scratch, etc.) using the '-r' VTune command-line option.

 

3. Add PMI_NO_FORK=1 to the application environment to make MPI profiling work and to avoid hangs of the MPI application under profiling.

 

Analyzing data with VTune Amplifier

VTune Amplifier provides powerful visual tools for multi-process, multithreaded, and single-threaded performance analysis. In most cases it is better to use the VTune GUI to open collected results, although the command-line tool has very similar reporting functionality. To do that, log in to a login node, load the vtune module, and launch the VTune GUI:

$ module load vtune
$ amplxe-gui

In the GUI, open the .amplxe project file in the results directory created during data collection. The VTune Amplifier GUI exposes many graphical controls and objects, so for performance it is better for remote users to run an in-place X server and open a client X Window terminal using VNC* or NX* [5] software.

 

*Other names and brands may be claimed as the property of others.

 

References

[1] http://docs.cray.com/PDF/XC_Series_Software_Installation_and_Configuration_Guide_CLE60UP02_S-2559.pdf

[2] https://cug.org/proceedings/cug2016_proceedings/includes/files/pap127.pdf

[3] Running Intel® Parallel Studio XE Analysis Tools on Clusters with Slurm* / srun

[4] http://docs.cray.com/cgi-bin/craydoc.cgi?mode=Show;q=;f=man/alpsm/10/cat1/aprun.1.html

[5] https://en.wikipedia.org/wiki/NX_technology

Intel® Clear Containers 2: Using Clear Containers with Docker


Download PDF

Introduction

This article describes multiple ways to get started using Intel® Clear Containers on a variety of operating systems. It is written for an audience that is familiar with Linux* operating systems, basic command-line shell usage, and has some familiarity with Docker*. We'll do an installation walk-through that explains the steps as we take them.

This article is the second in a series of three. The first article introduces the concept of Intel Clear Containers technology and describes how it fits into the overall landscape of current container-based technologies.

Requirements

You will need a host upon which to run Docker and Intel Clear Containers. As described below, the choice of OS is up to you, but the host you choose has some prerequisites:

  • Ideally, for realistic performance, you would want to use a physical host. If you do, then the following should be true as well:
    • It must be capable of using Intel® Virtualization Technology (Intel® VT) for IA-32, Intel® 64, and Intel® Architecture (Intel® VT-x).
    • Intel VT-x must be enabled in the system BIOS.
  • You can use a kernel-based virtual machine (KVM), with nested virtualization, to try out Intel Clear Containers. Note that the physical host you are running the KVM instance on should satisfy the above conditions as well. It might work on a less-functional system, but ... no guarantees.

Two Paths: Clear Linux* for Intel® Architecture or Your Own Distribution

You have your choice of operating system to install on your host. You can either use Intel Clear Containers in Clear Linux* for Intel® Architecture, or use another common Linux distribution of your choice. Intel Clear Containers do not behave or function differently on different operating systems, although the installation instructions differ. Detailed instructions exist for installing to CentOS* 7.2, Fedora* 24, and Ubuntu* 16.04.

Using the Clear Linux Project for Intel Architecture

Intel Clear Containers were developed by the team that develops the Clear Linux Project for Intel Architecture operating system distribution.

Intel Clear Containers are installed by default with Docker in the current version of Clear Linux. So, one way to get started with using Intel Clear Containers is to download and install Clear Linux. Instructions for installation on a physical host are available at https://clearlinux.org/documentation/gs_installing_clr_as_host.html.

For installation to a virtual machine, use the instructions at https://clearlinux.org/documentation/vm-kvm.html.

Software in Clear Linux is delivered in bundles; the Intel Clear Containers and Docker installation is contained in a bundle called containers-basic. Once you've installed your OS, it's very easy to add the bundle:

swupd bundle-add containers-basic

From there you could simply begin using Docker:

systemctl start docker-cor

This will start Docker with the correct runtime executable ('cor') as the execution engine, thus enabling Intel Clear Containers:

docker run -ti ubuntu (for example).

A complete production installation document, that includes directions for setting up user-level Docker control, is here for a physical host: https://github.com/01org/cc-oci-runtime/wiki/Installing-Clear-Containers-on-Clear-Linux.

Using a Common Linux* Distribution

If you are not using the Clear Linux for Intel Architecture distribution as your base operating system, it is possible to install to many common distributions. You will find guides for installation to CentOS 7.2, Fedora 24, and Ubuntu 16.04 (either Server or Desktop versions of these distributions will work just fine) at https://github.com/01org/cc-oci-runtime/wiki/Installation.

The essence of all of these installations follows the same basic flow:

  • Obtain the Intel Clear Containers runtime executable, called cc-oci-runtime. This is the Intel Clear Containers OCI-compatible runtime binary that is responsible for launching a qemu-lite process and integrating filesystem and network pieces.
  • Handle additional dependencies and configuration details that the base OS may be lacking.
  • Upgrade (or fresh-install) the local Docker installation to 1.12, which is the version that supports OCI and replaceable execution engines.
  • Configure the Docker daemon to use the Intel Clear Containers execution engine by default.

The repository instructions given are tailored to the specific distribution setups, but given this general workflow and some knowledge of your own particular distribution, just about any common distribution could be adapted to do the same without too much additional effort.

Installation Walk-Through: Ubuntu* 16.04 and Docker

This section will walk through the installation of Intel Clear Containers for Ubuntu 16.04, in detail. It will follow the instructions linked to above, so it might help to have the Ubuntu installation guide open as you read. It’s located at https://github.com/01org/cc-oci-runtime/wiki/Installing-Clear-Containers-on-Ubuntu-16.04

I'll be giving some context and explanations to the instructions as we go along, though, which may be helpful if you are adapting to your own distribution.

Note for Proxy Users

If you require the use of a proxy server for connections to the Internet, you'll want to pay attention to specific items called out in the following discussion. For most of this, it is sufficient to set the following proxy variables in your running shell, replacing the all-caps values as needed. This should be a familiar format for most that have to use these services.

# export http_proxy=http://[USER:PASSWORD@]PROXYHOST:PORT/
# export https_proxy=http://[USER:PASSWORD@]PROXYHOST:PORT/
# export no_proxy=localhost,127.0.0.0/24,*.MY.DOMAIN

Install the Intel Clear Containers Runtime

The first step is to obtain and install the Intel Clear Containers runtime, as described above. For Ubuntu, there is a package available, which can be downloaded and installed, but we have to resolve a simple dependency first.

sudo apt-get install libpixman-1-0

[Note: This seems like an odd dependency, and it is. However, there are certain pieces of the qemu-lite executable that can't easily be removed; this is a holdover dependency from the larger QEMU-KVM parent. It's very low-overhead and should be resolved in a later release of Intel Clear Containers.]

Now we'll add a repository service that has the runtime that we're after, as well as downloading the public key for that repository so that the Ubuntu packaging system can verify the integrity of the packages we download from it:

sudo sh -c "echo 'deb http://download.opensuse.org/repositories/home:/clearlinux:/preview:/clear-containers-2.0/xUbuntu_16.04/ /'>> /etc/apt/sources.list.d/cc-oci-runtime.list"
wget http://download.opensuse.org/repositories/home:clearlinux:preview:clear-containers-2.0/xUbuntu_16.04/Release.key
sudo apt-key add Release.key
sudo apt-get update
sudo apt-get install -y cc-oci-runtime

Configure OS for Intel Clear Containers

As of this writing, Section 3 of the installation instructions suggests the installation of additional kernel drivers and a reboot of your host at this point in the procedure. This is to acquire the default storage driver, ‘aufs’, for Docker.

However, there is a more up-to-date alternative called ‘overlay2’, and therefore this step is unnecessary with the addition of one small configuration change, which is detailed below. For now, it is okay to simply skip Section 3 and the installation of the “Linux kernel extras” packages.

One more thing remains in this step. Clear Linux for Intel Architecture updates very frequently (as often as twice a day) to stay ahead of security exploits and to be as up-to-date as possible in the open source world. Due to the Ubuntu packaging system, it's pretty certain that the mini-OS that's included as part of this package is ahead of where Ubuntu thinks it is. We need to update the OS to use the current mini-OS:

cd /usr/share/clear-containers/
sudo rm clear-containers.img
sudo ln -s clear-*-containers.img clear-containers.img
sudo sed -ie 's!"image":.*$!"image": "/usr/share/clear-containers/clear-containers.img",!g' /usr/share/defaults/cc-oci-runtime/vm.json

Install Docker* 1.12

As of this writing, even though Ubuntu 16.04 makes Docker 1.12 available as default for the OS, the packaging of it assumes the use of the native runtime. Therefore, it is necessary to install separate pieces directly from dockerproject.org rather than taking the operating system packaging.

Similarly to the above installation of the cc-oci-runtime, we're going to add a repository, add the key for the repo, and then perform installation from the repository.

sudo apt-get install apt-transport-https ca-certificates
sudo apt-key adv --keyserver hkp://p80.pool.sks-keyservers.net:80 --recv-keys 58118E89F3A912897C070ADBF76221572C52609D
sudo sh -c "echo 'deb https://apt.dockerproject.org/repo ubuntu-xenial main'>> /etc/apt/sources.list.d/docker.list"
sudo apt-get update
sudo apt-get purge lxc-docker

[Note for proxy users: the second command above will not work with just the usual proxy environment variables that we discussed above. You'll need to modify the command as shown here:

sudo apt-key adv --keyserver hkp://p80.pool.sks-keyservers.net:80 --keyserver-options http-proxy=http://[USER:PASSWORD@]HOST:PORT/ --recv-keys 58118E89F3A912897C070ADBF76221572C52609D
]

Before we install, we should check the current versions available. As of this writing the most current version available is docker-engine=1.12.3-0~xenial.

Install the version specified in the instructions even if it is older! More current versions may or may not have been well-integrated with the OS distribution in question. Intel Clear Containers is under active development and improvement, so not all versions will work everywhere. (As stated earlier...the best place to ensure you've got an up-to-date and working Intel Clear Containers installation is in the Clear Linux for Intel Architecture Project distribution.)

You can look at the list of versions available with:

apt-cache policy docker-engine

There are newer versions that we shouldn't use. We'll have to specify the version the instructions tell us to:

sudo apt-get install docker-engine=1.12.1-0~xenial

Configure Docker Startup for use of Intel Clear Containers

Ubuntu 16.04 uses systemd for system initialization, therefore most of what's remaining is to make some alterations in systemd with regard to Docker startup. The following instructions will override the default startup and ensure use of the cc-oci-runtime.

sudo mkdir -p /etc/systemd/system/docker.service.d/

Edit a file in that directory (as root) called clr-containers.conf. Make it look like this:

[Service]
ExecStart=
ExecStart=/usr/bin/dockerd -D -s overlay2 --add-runtime cor=/usr/bin/cc-oci-runtime --default-runtime=cor

This is a systemd directive file for the docker service; it specifies the command-line options to the dockerd processes that will force it to use Intel Clear Containers instead of its native service.

Note also the addition of the ‘-s overlay2’ flag, which is not in the instructions (at time of this writing). This tells the Docker daemon to use the ‘overlay2’ storage driver in preference to ‘aufs’. This is the recommended storage driver to use for kernels of version 4.0 or greater.

Now we need to make sure systemd recognizes the change, and then restart the service:

sudo systemctl daemon-reload
sudo systemctl restart docker

Note that I've skipped some additional, optional configuration that's called out in the installation document. This additional configuration is to allow for large numbers of Intel Clear Containers to run on the same machine. Without performing this optional action, you will be limited on how many containers can run simultaneously. See section 6.1 of the instruction document if you want to remove this limitation.

Ready to Run

At this point you should be able to run Docker container startup normally:

sudo docker run -ti ubuntu

This will give you a command prompt on a simple Ubuntu container. You can log in separately on the host and see the qemu-lite process running alongside the container.

Summary

Now you have everything you need to take Intel Clear Containers for a test drive. For the most part, it will behave just like any other Docker installation. As shown, integration with Docker Hub and the huge library of container images present there is open to use by Intel Clear Containers.

This has been the second of a three-part series on Intel Clear Containers. In the final article, I'll dive into the technology a bit more, exploring some of the major engineering tradeoffs that have been made, and where development is likely headed in upcoming releases. I'll also discuss the use of Intel Clear Containers in various orchestration tools besides Docker.

Read the first article in the series: Intel® Clear Containers 1: The Container Landscape

About the Author

Jim Chamings is a Sr. Software Engineer at Intel Corporation, who focuses on enabling cloud technology for the Intel Developer Relations Division. Before that, he worked in the Intel Open Source Technology Center (OTC), on both Intel Clear Containers and the Clear Linux Project for Intel Architecture. He’d be happy to hear from you about this article at: jim.chamings@intel.com.


Intel® ISA-L: Cryptographic Hashes for Cloud Storage


Download Code Sample

Download PDF

Introduction

Today’s new devices generate data that requires centralized storage and access everywhere, thus increasing the demand for more and faster cloud storage. At the point where data is collected and packaged for the cloud, improvements in data processing performance are important. Intel® Intelligent Storage Acceleration Library (Intel® ISA-L), with the ability to generate cryptographic hashes extremely fast, can improve data encryption performance. In this article, a sample application that includes downloadable source code is shared to demonstrate the use of the Intel® ISA-L cryptographic hash feature. The sample application has been tested on the hardware and software configuration presented in the table below. Depending on the platform capability, Intel ISA-L can run on various Intel® processor families. Improvements are obtained by speeding up computations through the use of processor-specific SIMD instruction sets.

Hardware and Software Configuration

CPU and Chipset

Intel® Xeon® processor E5-2699 v4, 2.2 GHz

  • # of cores per chip: 22 (only used single core)
  • # of sockets: 2
  • Chipset: Intel® C610 chipset, QS (B-1 step)
  • System bus: 9.6 GT/s Intel® QuickPath Interconnect
  • Intel® Hyper-Threading Technology off
  • Intel® Speed Step Technology enabled
  • Intel® Turbo Boost Technology disabled

Platform

Platform: Intel® Server System R2000WT product family (code-named Wildcat Pass)

  • BIOS: GRRFSDP1.86B.0271.R00.1510301446 ME:V03.01.03.0018.0 BMC:1.33.8932
  • DIMM slots: 24
  • Power supply: 1x1100W

Memory

Memory size: 256 GB (16X16 GB) DDR4 2133P

Brand/model: Micron* – MTA36ASF2G72PZ2GATESIG

Storage

Brand and model: 1 TB Western Digital* (WD1002FAEX)

Plus Intel® SSD P3700 Series (SSDPEDMD400G4)

Operating System

Ubuntu* 16.04 LTS (Xenial Xerus)

Linux kernel 4.4.0-21-generic

Why Use Intel® ISA-L?

Intel ISA-L can generate cryptographic hashes very fast by utilizing Single Instruction Multiple Data (SIMD) instructions. The cryptographic functions are part of a separate collection within Intel ISA-L and can be found in the GitHub repository 01org/isa-l_crypto. To demonstrate this multithreaded hash feature, this article simulates a sample “producer-consumer” application. A variable number (from 1 to 16) of “producer” threads fill a single buffer with data chunks, while a single “consumer” thread takes data chunks from the buffer and calculates cryptographic hashes using Intel ISA-L’s implementations. For this demo, a developer can choose the number of threads (producers) submitting data (2, 4, 8, or 16) and the type of hash (MD5, SHA1, SHA256, or SHA512). The example produces output that shows the utilization of the “consumer” thread and the overall wall-clock time.
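
Before looking at the C++ sources, the short Python sketch below illustrates the same producer-consumer structure using only the standard hashlib, queue, and threading modules. It is an illustration of the threading pattern only, not the Intel ISA-L multi-buffer API; the chunk size, chunk count, and thread count are arbitrary values chosen for this example:

import hashlib
import queue
import threading

CHUNK_SIZE = 1024 * 1024      # 1 MiB per chunk (illustrative value)
NUM_CHUNKS = 64               # total number of chunks to hash
NUM_PRODUCERS = 4             # "producer" threads generating data

shared_buffer = queue.Queue(maxsize=16)   # bounded shared buffer

def producer(worker_id):
    # Each producer repeatedly fills a chunk with a single byte value
    for _ in range(NUM_CHUNKS // NUM_PRODUCERS):
        shared_buffer.put(bytes([worker_id]) * CHUNK_SIZE)

def consumer():
    # The single consumer takes chunks from the buffer and hashes them
    for _ in range(NUM_CHUNKS):
        chunk = shared_buffer.get()
        hashlib.sha256(chunk).hexdigest()
    print("hashed %d chunks of %d bytes each" % (NUM_CHUNKS, CHUNK_SIZE))

producers = [threading.Thread(target=producer, args=(i,)) for i in range(NUM_PRODUCERS)]
for t in producers:
    t.start()
consumer()
for t in producers:
    t.join()

This plain-Python version can overlap data generation and hashing only because hashlib releases the global interpreter lock for large buffers; the actual sample instead gets its throughput from Intel ISA-L's SIMD multi-buffer hash implementations.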

Prerequisites

Intel ISA-L has known support for Linux* and Microsoft Windows*. A full list of prerequisite packages can be found here.

Building the sample application (for Linux):

  1. Install the dependencies:
    • a c++14 compliant c++ compiler
    • cmake >= 3.1
    • git
    • autogen
    • autoconf
    • automake
    • yasm and/or nasm
    • libtool
    • boost's "Program Options" library and headers

    sudo apt-get update
    sudo apt-get install gcc g++ make cmake git autogen autoconf automake yasm nasm libtool libboost-all-dev

  2. Also needed is the latest version of isa-l_crypto. The get_libs.bash script can be used to get it. The script will download the library from its official GitHub repository, build it, and install it in ./libs/usr.

    bash ./libs/get_libs.bash

  3. Build from the `ex3` directory:

    mkdir <build-dir>
    cd <build-dir>
    cmake -DCMAKE_BUILD_TYPE=Release $OLDPWD
    make

Getting Started with the Sample Application

The download button for the source code is provided at the beginning of the article. The sample application contains the source files described below.

This example walks through the work flow at a high level and focuses in detail only on the consumer code found inside the “consumer.cpp” and “hash.cpp” files:

Setup

1. In the “main.cpp” file, we first parse the arguments coming from the command line and display the options that will be used.

int main(int argc, char* argv[])
{
     options options = options::parse(argc, argv);
     display_info(options);

2. In the “main.cpp” file, we then create a shared_data object from the command-line options.

shared_data data(options);

In the “shared_data.cpp” file, we create the `shared_data` object, which is the shared buffer that the producers write to and the consumer reads from, as well as the means to synchronize those reads and writes.

Parsing the command-line options

3. In the options.cpp file, the program parses the command line arguments using: `options::parse()`.

Create the Producer

4. In the “main.cpp” file, we then create the producers and then call their `producer::run()` method in a new thread (`std::async` with the `std::launch::async` launch policy is used for that).

for (uint8_t i = 0; i < options.producers; ++i)
       producers_future_results.push_back(
            std::async(std::launch::async, &producer::run, &producers[i]));

In the “producer.cpp” file, each producer is assigned one chunk 'id' (stored in m_id ) in which it will submit data.

On each iteration, we:

  • wait until our chunk is ready_write , then fill it with data.
  • sleep for the appropriate amount of time to simulate the time it could take to generate data.

The program generates only very simple data: each chunk is filled repeatedly with only one random character (returned by random_data_generator::get() ). See the “random_data_generator.cpp” file for more details.

5. In the “main.cpp” file, the program stores the `std::future` object for each producer’s thread. Each std::future object provides a way to access the result of the thread once it is done and to wait synchronously for the thread to be done. The thread does not return any data.

std::vector<std::future<void>> producers_future_results;

Create the Consumer and start the hashing for the data

6. In the “main.cpp” file, the program then creates only one consumer and calls its `consumer::run()` method.

consumer consumer(data, options);
    consumer.run();

In the “consumer.cpp” file, the consumer will repeatedly:

  • wait for some chunks of data to be ready_read ( m_data.cv().wait_for ).
  • submit each of them to be hashed ( m_hash.hash_entire ).
  • mark those chunks as ready_write ( m_data.mark_ready_write ).
  • wait for the jobs to be done ( m_hash.hash_flush ).
  • unlock the mutex and notify all waiting threads, so the producers can start filling the chunks again

When all the data has been hashed we display the results, including the thread usage. This is computed by comparing the amount of time we waited for chunks to be ready and read to the amount of time we actually spent hashing the data.

consumer::consumer(shared_data& data, options& options)
    : m_data(data), m_options(options), m_hash(m_options.function)
{
}

void consumer::run()
{
    uint64_t hashes_submitted = 0;

    auto start_work    = std::chrono::steady_clock::now();
    auto wait_duration = std::chrono::nanoseconds{0};

    while (true)
    {
        auto start_wait = std::chrono::steady_clock::now();

        std::unique_lock<std::mutex> lk(m_data.mutex());

        // We wait for at least 1 chunk to be readable
        auto ready_in_time =
            m_data.cv().wait_for(lk, std::chrono::seconds{1}, [&] { return m_data.ready_read(); });

        auto end_wait = std::chrono::steady_clock::now();
        wait_duration += (end_wait - start_wait);

        if (!ready_in_time)
        {
            continue;
        }

        while (hashes_submitted < m_options.iterations)
        {
            int idx = m_data.first_chunck_ready_read();

            if (idx < 0)
                break;

            // We submit each readable chunk to the hash function, then mark that chunk as writable
            m_hash.hash_entire(m_data.get_chunk(idx), m_options.chunk_size);
            m_data.mark_ready_write(idx);
            ++hashes_submitted;
        }

        // We unlock the mutex and notify all waiting thread, so the producers can start filling the
        // chunks again
        lk.unlock();
        m_data.cv().notify_all();

        // We wait until all hash jobs are done
        for (int i = 0; i < m_options.producers; ++i)
            m_hash.hash_flush();

        display_progress(m_hash.generated_hashes(), m_options.iterations);

        if (hashes_submitted == m_options.iterations)
        {
            auto end_work      = std::chrono::steady_clock::now();
            auto work_duration = (end_work - start_work);

            std::cout << "[Info   ] Elapsed time:          ";
            display_time(work_duration.count());
            std::cout << "\n";
            std::cout << "[Info   ] Consumer thread usage: "
                      << std::fixed << std::setprecision(1)
                      << (double)(work_duration - wait_duration).count() /
                             work_duration.count() * 100
                      << " %\n";

            uint64_t total_size = m_options.chunk_size * m_options.iterations;
            uint64_t throughput = total_size /
                                  std::chrono::duration_cast<std::chrono::duration<double>>(
                                      work_duration - wait_duration)
                                      .count();

            std::cout << "[Info   ] Hash speed:            " << size::to_string(throughput)
                      << "/s (" << size::to_string(throughput, false) << "/s)\n";

            break;
        }
    }
}

The “hash.cpp” file provides a simple common interface to the md5/sha1/sha256/sha512 hash routines.

hash::hash(hash_function function) : m_function(function), m_generated_hashes(0)
{
    switch (m_function)
    {
        case hash_function::md5:
            m_hash_impl = md5(&md5_ctx_mgr_init, &md5_ctx_mgr_submit, &md5_ctx_mgr_flush);
            break;
        case hash_function::sha1:
            m_hash_impl = sha1(&sha1_ctx_mgr_init, &sha1_ctx_mgr_submit, &sha1_ctx_mgr_flush);
            break;
        case hash_function::sha256:
            m_hash_impl =
                sha256(&sha256_ctx_mgr_init, &sha256_ctx_mgr_submit, &sha256_ctx_mgr_flush);
            break;
        case hash_function::sha512:
            m_hash_impl =
                sha512(&sha512_ctx_mgr_init, &sha512_ctx_mgr_submit, &sha512_ctx_mgr_flush);
            break;
    }
}


void hash::hash_entire(const uint8_t* chunk, uint len)
{
    submit_visitor visitor(chunk, len);
    if (boost::apply_visitor(visitor, m_hash_impl))
        ++m_generated_hashes;
}


void hash::hash_flush()
{
    flush_visitor visitor;
    if (boost::apply_visitor(visitor, m_hash_impl))
        ++m_generated_hashes;
}


uint64_t hash::generated_hashes() const
{
    return m_generated_hashes;
}
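For reference, a direct call sequence against the isa-l_crypto multi-buffer MD5 manager (the md5_ctx_mgr_init/submit/flush routines that the hash class above wraps) might look roughly like the following sketch. This is illustrative only and is not part of the sample; it assumes the md5_mb.h header installed by isa-l_crypto is on the include path and that the program is linked with -lisal_crypto, and the buffer names are invented for the example.

#include <md5_mb.h>
#include <cstdint>
#include <cstdio>
#include <cstring>

int main()
{
    alignas(16) static MD5_HASH_CTX_MGR mgr;   // multi-buffer job manager
    static MD5_HASH_CTX ctxs[2];               // one context per outstanding buffer

    md5_ctx_mgr_init(&mgr);

    std::uint8_t buf_a[4096], buf_b[4096];
    std::memset(buf_a, 'a', sizeof(buf_a));
    std::memset(buf_b, 'b', sizeof(buf_b));

    hash_ctx_init(&ctxs[0]);
    hash_ctx_init(&ctxs[1]);

    // HASH_ENTIRE matches what hash::hash_entire() does: one whole buffer per job.
    // A submit call may return NULL if that lane has not completed yet.
    md5_ctx_mgr_submit(&mgr, &ctxs[0], buf_a, sizeof(buf_a), HASH_ENTIRE);
    md5_ctx_mgr_submit(&mgr, &ctxs[1], buf_b, sizeof(buf_b), HASH_ENTIRE);

    // Flushing repeatedly (as hash::hash_flush() does) drains the remaining lanes.
    while (md5_ctx_mgr_flush(&mgr) != nullptr)
        ;

    std::printf("first digest word of buffer A: %08x\n", ctxs[0].job.result_digest[0]);
    return 0;
}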

7. Once `consumer::run` is done and has returned to the main program, the program waits for each producer to be done by calling `std::future::wait()` on each `std::future` object.

for (const auto& producer_future_result : producers_future_results)
        producer_future_result.wait();

Execute the Sample Application

In this example, the program generates data in N producer threads and hashes the data using a single consumer thread. The program shows whether the consumer thread can keep up with the N producer threads.

Configuring the tests

Speed of data generation

Since this is not a real-world application, the data generation can be almost as fast or as slow as we want. The “--speed” argument is used to choose how fast each producer generates data.

If “--speed 50MB” is given, each producer thread would take 1 second to generate a 50MB chunk.

The faster the speed, the less time the consumer thread will have to hash the data before new chunks are available. This means the consumer thread usage will be higher.
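As a rough illustration of how a producer could turn the configured speed into a per-chunk delay (the names below are assumptions, not the sample’s actual variables), the relationship is simply the chunk size divided by the speed:

#include <chrono>
#include <cstdint>

// Time to simulate generating one chunk: chunk_size / speed.
// e.g. a 50MB chunk at --speed 50MB -> 1.0 second of sleep per iteration.
std::chrono::duration<double> chunk_generation_time(std::uint64_t chunk_size_bytes,
                                                    std::uint64_t speed_bytes_per_sec)
{
    return std::chrono::duration<double>(
        static_cast<double>(chunk_size_bytes) / static_cast<double>(speed_bytes_per_sec));
}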

Number of producers

The “--producers” argument is used to choose the number of producer threads to concurrently generate and submit data chunks.

Important note: On each iteration, the consumer thread will submit at most that number of chunks of data to be hashed. So, the higher the number, the more opportunity there is for “isa-l_crypto” to run more hash jobs at the same time. This is because of the way the program measures the consumer thread usage.

Chunk size

The size of the data chunks is the amount of data each producer generates per iteration; the consumer then submits each chunk of that size to the hash function.

The “--chunk-size” argument is used to choose that value.

This is a very important value, as it directly affects how long each hash job will take.

Total size

This is the total amount of data to be generated and hashed. Knowing this and the other parameters, the program knows how many times chunks will need to be generated in total, and how many hash jobs will be submitted in total.

When using the “--total-size” argument, it is important to pick a value large enough (compared to the chunk size) that a large number of jobs will be submitted, in order to cancel out some of the noise in measuring the time taken by those jobs.
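For example, the total number of chunks (and therefore hash jobs) follows directly from these two options; a minimal sketch, with names assumed for illustration:

#include <cstdint>

// Each iteration generates and hashes one chunk, so the number of hash jobs
// submitted overall is the total size divided by the chunk size.
std::uint64_t total_iterations(std::uint64_t total_size_bytes, std::uint64_t chunk_size_bytes)
{
    return total_size_bytes / chunk_size_bytes;
}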

The results

[Info   ] Elapsed time:          2.603 s
[Info   ] Consumer thread usage: 42.0 %
[Info   ] Hash speed:            981.7 MB/s (936.2 MiB/s)

Elapsed time

This is the total time taken by the whole process

Consumer thread usage

We compare how long we spent waiting for chunks of data to be available to how long the consumer thread has been running in total.

Any value lower than 100% shows that the consumer thread was able to keep up with the producers and had to wait for new chunks of data.

A value very close to 100% shows that the consumer thread was consistently busy and was not able to outrun the producers.

Hash speed

This is the effective speed at which the isa-l_crypto functions hashed the data. The clock for this starts running as soon as at least one data chunk is available, and stops when all these chunks have been hashed.

Running the example

Running this example (“ex3”) with the taskset command pinned to cores 3 and 4 should give the following output:

The program runs as a single thread on core number 3. ~55% of its time is waiting for the producer to submit the data.

Running the program with the taskset command on cores 3 to 20 for the 16 producer threads should give the following output:

The program runs as sixteen threads on core numbers 3 to 19. Only ~2% of its time is spent waiting for the producers to submit the data.

Notes: 2x Intel® Xeon® processor E5-2699v4 (HT off), Intel® Speed Step enabled, Intel® Turbo Boost Technology disabled, 16x16GB DDR4 2133 MT/s, 1 DIMM per channel, Ubuntu* 16.04 LTS, Linux kernel 4.4.0-21-generic, 1 TB Western Digital* (WD1002FAEX), 1 Intel® SSD P3700 Series (SSDPEDMD400G4), 22x per CPU socket. Performance measured by the written sample application in this article.

Conclusion

As demonstrated in this quick tutorial, the hash function feature can be applied to any storage application. The source code for the sample application is also provided for your reference. Intel ISA-L provides the library so that storage developers can quickly adopt it in applications running on Intel® architecture.

Other Useful Links

Authors

Thai Le is a Software Engineer who focuses on cloud computing and performance computing analysis at Intel.

Steven Briscoe is an Application Engineer focusing on Cloud Computing within the Software Services Group at Intel Corporation (UK).

Notices

System configurations, SSD configurations and performance tests conducted are discussed in detail within the body of this paper. For more information go to http://www.intel.com/content/www/us/en/benchmarks/intel-product-performance.html.

This sample source code is released under the Intel Sample Source Code License Agreement.

Simple, Powerful HPC Clusters Drive High-Speed Design Innovation


Up to 17x Faster Simulations through Optimized Cluster Computing

Scientists and engineers across a wide range of disciplines are facing a common challenge. To be effective, they need to study more complex systems with more variables and greater resolution. Yet they also need timely results to keep their research and design efforts on track.

A key criterion for most of these groups is the ability to complete their simulations overnight, so they can be fully productive during the day. Altair and Intel help customers meet this requirement using Altair HyperWorks* running on high performance computing (HPC) appliances based on the Intel® Xeon® processor E5-2600 v4 product family.


 

Download Complete Solution Brief (PDF)

Intel® Xeon Phi™ Processor 7200 Family Memory Management Optimizations


This paper examines software performance optimization for an implementation of a non-library version of DGEMM executing on the Intel® Xeon Phi™ processor (code-named Knights Landing, with acronym KNL) running the Linux* Operating System (OS). The performance optimizations will incorporate the use of C/C++ High Bandwidth Memory (HBM) application programming interfaces (APIs) for doing dynamic storage allocation from Multi-Channel DRAM (MCDRAM), _mm_malloc dynamic storage allocation calls into Double Data Rate (DDR) memory, high-level abstract vector register management, and data prefetching. The dynamic storage allocations will be used to manage tiled data structure objects that will accommodate the memory hierarchy of the Intel Xeon Phi processor architecture. The focus in terms of optimizing application performance execution based on storage allocation is to:

  • Align the starting addresses of data objects during storage allocation so that vector operations on the Intel Xeon Phi processor will not require additional special vector alignment when using a vector processing unit associated with each hardware thread.
  • Select data tile sizes and do abstract vector register management that will allow for cache reuse and data locality.
  • Place select data structures into MCDRAM, through the HBM software package.
  • Use data prefetching to improve timely referencing of the tiled data structures into the Intel Xeon Phi processor cache hierarchy.

These methodologies are intended to provide you with insight when applying code modernization to legacy software applications and when developing new software for the Intel Xeon Phi processor architecture.

Contents

  1. Introduction
           ∘     What strategies are used to improve application performance?
           ∘     How is this article organized?
  2. The Intel® Xeon Phi™ Processor Architecture
  3. Why Does the Intel Xeon Phi Processor Need HBM?
           ∘     How does a software application distinguish between data assigned to DDR versus MCDRAM in flat mode?
           ∘     How does a software application interface to MCDRAM?
  4. Prefetch Tuning
  5. Matrix Multiply Background and Programming Example
  6. Performance Results for a C/C++ Implementation of DGEMM Based on Intel® Math Kernel Library/DGEMM
  7. Conclusions
  8. References

Introduction

The information in this article might help you achieve better execution performance if you are optimizing software applications for the Intel® Xeon Phi™ processor architecture (code-named Knights Landing 1) that is running the Linux* OS. The scenario is that optimization opportunities are exposed from using profiling analysis software tools such as Intel® VTune™ Amplifier XE 2, and/or Intel® Trace Analyzer and Collector 3, and/or MPI Performance Snapshot 4 where these software tools reveal possible memory management bottlenecks.

What strategies are used to improve application performance?

This article examines memory management, which involves tiling of data structures using the following strategies:

  • Aligned data storage allocation. This paper examines the use of the _mm_malloc intrinsic for dynamic storage allocation of data objects that reside in Double Data Rate (DDR) memory.
  • Use of Multi-Channel Dynamic Random-Access Memory (MCDRAM). This article discusses the use of a 16-gigabyte MCDRAM, which is High-Bandwidth Memory (HBM) 1. MCDRAM on the Intel Xeon Phi processor comprises eight devices (2 gigabytes each). This HBM is integrated on the Intel® Xeon Phi™ processor package and is connected to the Knights Landing die via a proprietary on-package I/O. All eight MCDRAM devices collectively provide an aggregate Stream triad benchmark bandwidth of more than 450 gigabytes per second 1.
  • Vector register management. An attempt will be made to manage the vector registers on the Intel Xeon Phi processor by using explicit compiler semantics including C/C++ Extensions for Array Notation (CEAN) 5.
  • Compiler prefetching controls. Compiler prefetching control will be applied to application data objects to manage data look-ahead into the Intel Xeon Phi processor’s L2 and L1 cache hierarchy.

Developers of applications for the Intel Xeon Phi processor architecture may find these methodologies useful for optimizing applications that exploit, at the core level, hybrid parallel programming consisting of a combination of threading and vectorization technologies.

How is this article organized?

Section 2 provides insight on the Intel Xeon Phi processor architecture and what software developers may want to think about when doing code modernization for existing applications or when developing new software applications. Section 3 examines storage allocations for HBM (MCDRAM); in this article and for the experiments, data objects that are not allocated in MCDRAM reside in DDR. Section 4 examines prefetch tuning capabilities. Section 5 provides background material for the matrix multiply algorithm. Section 6 applies the outlined memory management techniques to a double-precision floating-point matrix multiply algorithm (DGEMM), works through restructuring transformations to improve execution performance, and presents the performance results. Section 7 draws conclusions.

The Intel® Xeon Phi™ Processor Architecture

A Knights Landing processor socket has at most 36 active tiles, where a tile is defined as consisting of two cores (Figure 1) 1. This means that the Knights Landing socket can have at most 72 cores. The two cores within each tile communicate with each other via a 2D mesh on-die interconnect architecture that is based on a Ring architecture (Figure 1) 1. The communication mesh consists of four parallel networks, each of which delivers different types of packet information (for example, commands, data, and responses) and is highly optimized for the Knights Landing traffic flows and protocols. The mesh can deliver greater than 700 gigabytes per second of total aggregate bandwidth. 

Figure 1. Intel® Xeon Phi™ processor block diagram showing tiles. (DDR MC = DDR memory controller, DMI = Direct Media Interface, EDC = MCDRAM controllers, MCDRAM = Multi-Channel DRAM) 1.

Each core has two Vector Processing Units (VPUs) and 1 megabyte of level-2 (L2) cache that is shared by the two cores within a tile (Figure 2) 1. Each core within a tile has 32 kilobytes of L1 instruction cache and 32 kilobytes of L1 data cache. The cache lines are 512-bits wide, implying that a cache line can contain 64 bytes of data. 


Figure 2. Intel® Xeon Phi™ processor illustration of a tile from Figure 1 that contains two cores (CHA = Caching/Home Agent, VPU = Vector Processing Unit) 1.

In terms of single precision and double-precision floating-point data, the 64-byte cache lines can hold 16 single-precision floating-point objects or 8 double-precision floating-point objects.

Looking at the details of a core in Figure 2, there are four hardware threads (hardware contexts) per core 1, where each hardware thread acts as a logical processor 6. A hardware thread has 32 512-bit-wide vector registers (Figure 3) to provide Single Instruction Multiple Data (SIMD) support 6. To manage the 512-bit wide SIMD registers (ZMM0-ZMM31), the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instruction set is used 7. For completeness in regard to Figure 3, the lower 256-bits of the ZMM registers are aliased to the respective 256-bit YMM registers, and the lower 128-bits are aliased to the respective 128-bit XMM registers.


Figure 3. 512-bit-wide vectors and SIMD register set 6.

The rest of this article focuses on the instructions that support the 512-bit wide SIMD registers (ZMM0-ZMM31). Regarding the Intel AVX-512 instruction set extensions, a 512-bit VPU also supports Fused Multiply-Add (FMA) instructions 6, where each of the three registers acts as a source and one of them also functions as a destination to store the result. The FMA instructions in conjunction with the 512-bit wide SIMD registers can do 32 single-precision floating-point computations or 16 double-precision floating-point operations per clock cycle for computational semantics such as:

Cij = Cij + Aip × Bpj

where subscripts “i”, “j”, and “p” serve as respective row and column indices for matrices A, B, and C.
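As a concrete illustration of this building block (a sketch, not code from this article), one Intel AVX-512 fused multiply-add instruction updates eight packed double-precision values at once; the function below assumes unaligned loads for simplicity and requires an AVX-512F capable compiler and processor:

#include <immintrin.h>

// c[0..7] = a[0..7] * b[0..7] + c[0..7] in a single FMA on 512-bit registers.
// Compile with, for example, -xMIC-AVX512 (Intel compiler) or -mavx512f.
void fma8(const double *a, const double *b, double *c)
{
    __m512d va = _mm512_loadu_pd(a);
    __m512d vb = _mm512_loadu_pd(b);
    __m512d vc = _mm512_loadu_pd(c);
    vc = _mm512_fmadd_pd(va, vb, vc);
    _mm512_storeu_pd(c, vc);
}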

Why Does the Intel Xeon Phi Processor Need HBM?

Conventional Dynamic Random-Access Memory (DRAM) and Dual-Inline Memory Modules (DIMMs) cannot meet the data-bandwidth consumption capabilities of the Intel Xeon Phi processor 8. To address this “processor to memory bandwidth” issue there are two memory technologies that can be used that place the physical memory closer to the Knights Landing processor socket, namely 8:

  • MCDRAM: This is a proprietary HBM that physically sits atop the family of Intel Xeon Phi processors.
  • HBM: This memory architecture is compatible with the Joint Electron Device Engineering Council (JEDEC) standards 9, and is a high-bandwidth memory designed for a generation of Intel Xeon Phi processors, code named Knights Hill.

From a performance point of view, there is no conceptual difference between MCDRAM and HBM.

For the Intel Xeon Phi processor, MCDRAM as shown in Figure 4 has three memory modes 1: cache mode, flat mode, and hybrid mode. When doing code modernization for existing applications or performing new application development on Intel Xeon Phi processor architecture, you may want to experiment with the three configurations to find the one that provides the best performance optimization for your applications. Below are some details about the three modes that may help you make informed decisions about which configuration may provide the best execution performance for software applications. 


Figure 4. The three MCDRAM memory modes—cache, flat, and hybrid—in the Intel® Xeon Phi™ processor. These modes are selectable through the BIOS at boot time 1.

The cache mode does not require any software change and works well for many applications 1. For those applications that do not show a good hit rate in MCDRAM, the other two memory modes provide more user control to better utilize MCDRAM.

In flat mode, both the MCDRAM memory and the DDR memory act as regular memory and are mapped into the same system address space 1. The flat mode configuration is ideal for applications that can separate their data into a larger, lower-bandwidth region and a smaller, higher bandwidth region. Accesses to MCDRAM in flat mode see guaranteed high bandwidth compared to cache mode, where it depends on the hit rates. Unless the data structures for the application workload can fit entirely within MCDRAM, the flat mode configuration requires software support to enable the application to take advantage of this mode.

For the hybrid mode, the MCDRAM is partitioned such that either a half or a quarter of the MCDRAM is used as cache, and the rest is used as flat memory 1. The cache portion will serve all of the DDR memory. This is ideal for a mixture of software applications that have data structures that benefit from general caching, but also can take advantage by storing critical or frequently accessed data in the flat memory partition. As with the flat mode, software enabling is required to access the flat mode section of the MCDRAM when software does not entirely fit into it. Again as mentioned above, the cache mode section does not require any software support 1.

How does a software application distinguish between data assigned to DDR versus MCDRAM in flat mode?

When MCDRAM is configured in flat mode, the application software is required to explicitly allocate memory into MCDRAM 1. In a flat mode configuration, the MCDRAM is accessed as memory by relying on mechanisms that are already supported in the existing Linux* OS software stack. This minimizes any major enabling effort and ensures that the applications written for flat MCDRAM mode remain portable to systems that do not have a flat MCDRAM configuration. This software architecture is based on the Non-Uniform Memory Access (NUMA) memory support model 10 that exists in current operating systems and is widely used to optimize software for current multi-socket systems. The same mechanism is used to expose the two types of memory on Knights Landing as two separate NUMA nodes (DDR and MCDRAM). This provides software with a way to address the two types of memory using NUMA mechanisms. By default, the BIOS sets the Knights Landing cores to have a higher affinity to DDR than MCDRAM. This affinity helps direct all default and noncritical memory allocations to DDR and thus keeps them out of MCDRAM.

On a Knights Landing system one can type the NUMA command:

numactl -H

or

numactl --hardware

and you will see attributes about the DDR memory (node 0) and MCDRAM (node 1). The attributes might look something like the following:

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255
node 0 size: 32664 MB
node 0 free: 30414 MB
node 1 cpus:
node 1 size: 16384 MB
node 1 free: 15958 MB
node distances:
node 0 1
  0: 10 31
  1: 31 10

There is an environment variable called MEMKIND_HBW_NODES which controls the binding of high bandwidth memory allocations to one of the two NUMA nodes listed above. For example, if this environment variable is set to 0, it will bind high bandwidth memory allocations to NUMA node 0. Alternatively, setting this environment variable to 1 will bind high bandwidth allocations to NUMA node 1.

How does a software application interface to MCDRAM?

To allocate critical memory in MCDRAM in flat mode, a high-bandwidth (HBW) malloc library is available that can be downloaded at reference 11 or by clicking here. This memkind library 11 has functions that can align data objects on, say, 64-byte boundaries. Such alignments can lead to efficient use of cache lines, the L2 and L1 caches, and the SIMD vector registers. Once the memkind library is installed, the LD_LIBRARY_PATH environment variable will need to be updated to include the directory path to the memkind library.
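A minimal sketch of what an allocation through this interface might look like follows (illustrative only; it assumes the hbwmalloc.h header from the memkind package is installed and that the program is linked with -lmemkind):

#include <hbwmalloc.h>
#include <cstdio>

int main()
{
    // hbw_check_available() returns 0 when high-bandwidth memory can be allocated.
    if (hbw_check_available() != 0) {
        std::fprintf(stderr, "No high-bandwidth memory available; falling back to DDR.\n");
        return 1;
    }

    double *tile = nullptr;
    // 64-byte alignment so the buffer starts on a cache-line boundary.
    if (hbw_posix_memalign(reinterpret_cast<void **>(&tile), 64,
                           1024 * 1024 * sizeof(double)) != 0) {
        std::perror("hbw_posix_memalign");
        return 1;
    }

    tile[0] = 1.0;      // usable like any other heap allocation
    hbw_free(tile);
    return 0;
}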

One other topic should be noted regarding huge pages. On Knights Landing, huge pages are managed by the kernel a bit differently when you try to perform memory allocation using them (memory pages of size 2 MB instead of the standard 4 KB) 12. In those cases, huge pages need to be enabled prior to use. The content of the file called:

/proc/sys/vm/nr_hugepages

contains the current number of preallocated huge pages of the default size. If for example, you issue the Linux command on the Knights Landing system:

cat /proc/sys/vm/nr_hugepages

and the file contains a 0, the system administrator can issue the Linux OS command:

echo 20 > /proc/sys/vm/nr_hugepages

to dynamically allocate and deallocate default sized persistent huge pages, and thus adjust the number of default sized huge pages in the huge page pool to 20. Therefore, the system will allocate or free huge pages, as required. Note that one does not need to explicitly set the number of huge pages by echoing to the file /proc/sys/vm/nr_hugepages as long as the content of /sys/kernel/mm/transparent_hugepage/enabled is set to “always”.

A detailed review for setting the environment variables MEMKIND_HBW_NODES and LD_LIBRARY_PATH in regard to the memkind library and adjusting the content of the file nr_hugepages is discussed in Performance Results for a C/C++ Implementation of DGEMM Based on Intel® Math Kernel Library/DGEMM.

Prefetch Tuning

Compiler prefetching is disabled by default for the Intel Xeon Phi processor 13. To enable compiler prefetching for Knights Landing architecture use the compiler options:

-O3 –xmic-avx512 –qopt-prefetch=<n>

where the values of meta-symbol <n> are explained in Table 1

Table 1. Intel® C/C++ compiler switch settings for -qopt-prefetch 

How does the -qopt-prefetch=<n> compiler switch work for the Intel® Xeon Phi™ processor architecture?

Value of meta-symbol “<n>” and the corresponding prefetch semantic actions:

  • 0: This is the default when the -qopt-prefetch option is omitted; no auto-prefetching is done by the compiler.
  • 2: This is the default if you use -qopt-prefetch with no explicit “<n>” argument. Prefetches are inserted for direct references where the compiler thinks the hardware prefetcher may not be able to handle them.
  • 3: Prefetching is turned on for all direct memory references without regard to the hardware prefetcher.
  • 4: Same as n=3 (currently).
  • 5: Additional prefetching for all indirect references (Intel® Advanced Vector Extensions 512 (Intel® AVX-512) and above):
           ∘     Indirect prefetches (hint 1) are done using AVX512-PF gatherpf instructions on Knights Landing (not all cases, but a subset).
           ∘     Extra prefetches are issued for strided vector accesses (hint 0) to cover all cache lines.

The prefetch distance is the number of iterations of look-ahead when a prefetch is issued. Prefetching is done after the vectorization phase, and therefore the distance is in terms of vectorized iterations if an entire serial loop or part of a serial loop is vectorized. The Intel Xeon Phi processor also has a hardware L2 prefetcher that is enabled by default. In general, if the software prefetching algorithm is performing well for an executing application, the hardware prefetcher will not join in with the software prefetcher.

For this article the Intel C/C++ Compiler option:

-qopt-prefetch-distance=n1[,n2]

is explored. The arguments n1 and n2 have the following semantic actions in regard to the -qopt-prefetch-distance compiler switch:

  • The distance n1 (number of future loop iterations) for first-level prefetching into the Intel Xeon Phi processor L2 cache.
  • The distance n2 for second-level prefetching from the L2 cache into the L1 cache, where n2 ≤ n1. The exception is that n1 can be 0 for values of n2 (no first-level prefetches will be issued by the compiler).

Some useful values to try for n1 are 0, 4, 8, 16, 32, and 64 14. Similarly, useful values to try for n2 are 0, 1, 2, 4, and 8. These L2 prefetching values signified by n1 can be permuted with prefetching values n2 that control data movement from the L2 cache into the L1 cache. This permutation process can reveal the best combination of n1 and n2 values. For example, a setting might be:

-qopt-prefetch-distance=0,1

where the value 0 tells the compiler to disable compiler prefetching into the L2 cache, and the n2 value of 1 indicates that 1 iteration of compiler prefetching should be done from the L2 cache into the L1 cache.

The optimization report output from the compiler (enabled using -qopt-report=<m>) will provide details on the number of prefetch instructions inserted by the compiler for each loop.

In summary, Section 2 discussed the Intel Xeon Phi processor many-core architecture, including the on-die interconnection network for the cores, hardware threads, VPUs, the L1 and L2 caches, 512-bit wide vector registers, and 512-bit wide cache lines. Section 3 examined MCDRAM and the memkind library, which helps establish efficient data alignment of data structures (memory objects). The present section discussed prefetching of these memory objects into the cache hierarchy. In the next section, these techniques will be applied to optimize an algorithm 15 such as a double-precision version of matrix multiply. The transformation techniques will use a high-level programming language in an attempt to maintain portability from one processor generation to the next 16.

Matrix Multiply Background and Programming Example

Matrix multiply has the core computational assignment:

Cij = Cij + Aip × Bpj

A basic matrix multiply loop structure implemented in a high-level programming language might look something like the following pseudo-code:

integer i, j, p;

for p = 1:K
    for j = 1:N
        for i = 1:M
            Cij = Cij + Aip × Bpj
        endfor
    endfor
endfor

where matrix A has dimensions M × K, matrix B has dimensions K × N, and matrix C has dimensions M × N. For the memory offset computation for matrices A, B, and C we will assume column-major-order data organization.
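For reference, a plain (unoptimized) C++ rendering of this pseudo-code with column-major offsets is shown below; it only expresses the reference semantics that the tiled versions later in this article reproduce, and the function name is invented for illustration:

#include <cstddef>

// C (M x N) += A (M x K) * B (K x N), all stored column-major:
// element (i, j) of an M-row matrix lives at index i + j * M.
void naive_dgemm(std::size_t M, std::size_t N, std::size_t K,
                 const double *A, const double *B, double *C)
{
    for (std::size_t p = 0; p < K; ++p)
        for (std::size_t j = 0; j < N; ++j)
            for (std::size_t i = 0; i < M; ++i)
                C[i + j * M] += A[i + p * M] * B[p + j * K];
}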

For various processor architectures, software vendor libraries are available for performing matrix multiply in a highly efficient manner. For example, matrix multiplication for the above can be computed using DGEMM which calculates the product for a matrix C using double precision matrix elements 17. Note that a DGEMM core solver for implementing the above algorithm may be implemented in assembly language (e.g., DGEMM for the Intel® Math Kernel Library 17), where an assembly language solution may not be necessarily portable from one processor architecture to the next.

In regard to this article, the focus is to do code restructuring transformations to achieve code modernization performance improvements using a high-level programming language. The reason for using matrix multiply as an example in applying the high-level memory-allocation-optimization techniques is that the basic algorithm is roughly four lines long and is easily understandable. Additionally, it is hoped that after you see a before and after of the applied restructuring transformations using a high-level programming language, you will think about associating restructuring transformations of a similar nature to the applications that you have written in a high-level programming language which are targeted for code modernization techniques.

Goto et al. 15 have looked at restructuring transformations for the basic matrix multiply loop structure shown above in order to optimize it for various processor architectures. This has required organizing the A, B, and C matrices from the pseudo-code example above into sub-matrix tiles. Figure 5 shows a tile rendering abstraction, but note that the access patterns required in Figure 5 are different from those described in reference 15. For the Ã and B̃ tiles in Figure 5, data packing is done to promote efficient matrix-element memory referencing.


Figure 5. Partitioning of DGEMM for the Intel® Xeon Phi™ processor where buffer Ã is shared by all cores, and buffer B̃ and sections of matrix C are not shared by all cores. The data partitioning is based on an Intel® Xeon Phi™ processor/DGEMM implementation from the Intel® Math Kernel Library 17.

Regarding the matrix partitioning in Figure 5 for the Intel Xeon Phi processor, Ã is shared by all the cores, and matrices B̃ and C are not shared by all the cores. This is for a multi-threaded DGEMM solution. Parallelism for the Intel Xeon Phi processor can be demonstrated as threading at the core level (Figure 1), and then as shown in Figure 2, the VPUs can exploit vectorization with 512-bit vector registers and SIMD semantics.

For the sub-matrices in Figure 5 that are either shared by all of the Intel Xeon Phi processor cores (for example, sub-matrix Ã), or for the sub-matrices that are not shared (for example, sub-matrix B̃ and partitions for matrix C), the next question is: what memory configurations should you use (for example, DDR or MCDRAM)?


Figure 6. DGEMM kernel solver for the Intel® Xeon Phi™ processor with partitions for Ã, B̃, and C 17.

Recall that there are 16 gigabytes of multi-channel DRAM; therefore, since sub-matrix Ã is shared by all the cores, it will be placed into MCDRAM using the flat mode configuration.

In Section 3, we examined HBM, where for MCDRAM there were three configurations: cache mode, flat mode, and hybrid mode (Figure 4). It was mentioned that the flat mode configuration is ideal for applications that can separate their data into a larger, lower-bandwidth region and a smaller, higher bandwidth region. Following this rule for flat mode, we will place Ã (Figure 6) into MCDRAM using the following “memkind” library prototype:

int hbw_posix_memalign(void **memptr, size_t alignment, size_t size);

where the alignment argument “size_t alignment” might have a value of 64, which is a power of 2 and allows the starting address of Ã to align on the beginning of a cache line.

In Figure 6, note that matrix C consists of a core partition that has 8 rows and 28 columns. From an abstraction point of view, the 8 double-precision matrix elements (64 bytes total) can fit into a 512-bit (64 byte) vector register. Also, recall that there are 32 512-bit vector registers per hardware thread. To reduce register pressure on a hardware thread, 28 of the vector registers will be used for the core solver on the right side of Figure 6.

Similarly, for the other tiling objects in Figure 6, the _mm_malloc intrinsic will be used to allocate storage in DDR memory on Knights Landing. The _mm_malloc function prototype looks as follows:

void *_mm_malloc (size_t size, size_t align);

The _mm_malloc prototype also has a “size_t align” argument, which again is an alignment constraint. Using a value of 64 allows data objects that are dynamically allocated to have their starting address aligned on the beginning of a cache line.

For Figure 6, matrix B̃ will be migrated into the L2 cache.
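Putting the two prototypes together, a hedged sketch of how the tiles from Figure 6 could be allocated is shown below (the tile dimensions follow Figures 5 and 6; the function and variable names are assumptions, not the article’s actual code):

#include <hbwmalloc.h>
#include <xmmintrin.h>   // _mm_malloc / _mm_free
#include <cstddef>

int allocate_tiles(std::size_t m_rows, double **a_tile, double **b_tile)
{
    // A-tilde: m_rows x 336 doubles, shared by all cores, allocated in MCDRAM
    // with 64-byte alignment so each 8-element column lands on a cache line.
    if (hbw_posix_memalign(reinterpret_cast<void **>(a_tile), 64,
                           m_rows * 336 * sizeof(double)) != 0)
        return -1;

    // B-tilde: 336 x 112 doubles per core, allocated in DDR and later staged
    // through the L2 cache, also aligned on a 64-byte boundary.
    *b_tile = static_cast<double *>(_mm_malloc(336 * 112 * sizeof(double), 64));
    if (*b_tile == nullptr) {
        hbw_free(*a_tile);
        return -1;
    }
    return 0;
}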

To summarize, we have discussed the partitioning of the A, B, and C matrices into sub-matrix data tiles, and we have utilized 28 of the 32 512-bit vector registers. We looked at data storage prototypes for placing data into MCDRAM or DDR. Next, we want to explore how the data elements within the sub-matrices will be organized (packed) to provide efficient access and reuse.


Figure 7. Data element packing for 8 rows by 336 columns of matrix segment Ã using column-major-order memory offsets 17.

Recall from Figure 5 and Figure 6 that matrix segment Ã has a large number of rows and 336 columns. The data is packed into strips that have 8 row elements for each of the 336 columns (Figure 7) using column-major-order memory offsets. The number of strips for matrix segment Ã is equal to:

Large-number-of-rows / 8

Note that 8 double-precision row elements for each of the 336 columns can provide efficient use of the 512-bit wide cache lines for the L2 and L1 caches.
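An illustrative packing routine for this layout might look like the sketch below (this is not the article’s code; it assumes the A panel is column-major with leading dimension M and that M is divisible by 8):

#include <cstddef>

// Copy an M x 336 column-major panel of A into (M / 8) consecutive strips,
// each strip holding 8 row elements for every one of the 336 columns, so that
// 8 contiguous doubles fill exactly one 64-byte cache line.
void pack_a_strips(std::size_t M, const double *A, double *A_packed)
{
    const std::size_t K = 336;
    std::size_t out = 0;
    for (std::size_t strip = 0; strip < M / 8; ++strip)
        for (std::size_t col = 0; col < K; ++col)
            for (std::size_t r = 0; r < 8; ++r)
                A_packed[out++] = A[(strip * 8 + r) + col * M];
}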

For matrix segment B̃ in Figure 5 and Figure 6, the 336-row by 112-column tiles are sub-partitioned into 336-row by 28-column strips (Figure 8). In Figure 8, the matrix segment B̃ has strips that use row-major-order memory offsets, and therefore the 28 elements in a row are contiguous. The 28 elements of a row for matrix segment B̃ correspond with the 28 elements for matrix C that are used in the core solver computation illustrated in the right portion of Figure 6.


Figure 8. Data element packing for 336 rows by 28 columns of matrix segment B̃ using row-major-order memory offsets 17.

Figure 9 shows a column-major-order partition of matrix C that has 8 array elements within a column and there are 28 columns (in green). As mentioned earlier, the 8 contiguous double-precision array elements within a column will fit into a 512-bit (64 byte) vector register, and the 28 columns contain 8 row-elements each that can map onto 28 of the 32 possible vector registers associated with a hardware thread. In Figure 9, note that when the next 8 row by 28 column segment for matrix C (in white) is processed, the first element in each column is adjacent to the last element in each column with respect to the green partition. Thus, this column major ordering can allow the 8 row by 28 column segment (in white) to be efficiently prefetched for anticipated FMA computation.


Figure 9. Data element organization for 8 rows by 28 columns of matrix segment C using column-major-order memory offsets 17.

In regard to Figures 7, 8, and 9, we have completed the data referencing analysis for the Ã, B̃, and C matrices, which are color coded to make it easy to associate the rectangular data tiling abstractions with the three matrix multiply data objects. Putting this all together into a core matrix multiply implementation, a possible pseudo-code representation for the core solver in Figure 6 that also reflects the data referencing patterns for Figures 7, 8, and 9 might look something like the following:

Code snippet

C/C++ Extensions for Array Notation (CEAN) 5 are used in the pseudo-code as indicated by the colon notation “…:8” within the subscripts. This language extension is used to describe computing 8 double-precision partial results for matrix C by using 8 double-precision elements of Ã and replicating a single element of B̃ eight times, placing the same 8 values into a 512-bit vector register for the B̃ operand, to make it possible to take advantage of FMA computation. For matrix B̃ the subscript “l” (the character l is the letter L) is used to reference the level (an entire strip in Figure 8), where there are 4 levels in the B̃ matrix, each containing 336 rows and 28 columns. This accounts for the value 112 (28 × 4) in Figure 6.
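As a plain C++ rendering of what this CEAN pseudo-code expresses (illustrative only; the array layouts follow Figures 7, 8, and 9, and the names are assumptions modeled on the snippets in the next section), the core tile update could be written as:

#include <cstddef>

// One 8-row by 28-column tile of C: each step of the p loop consumes 8
// contiguous doubles of the packed A-tilde strip and one element of the packed
// B-tilde strip, which the FMA hardware effectively replicates across a
// 512-bit register. A vectorizing compiler maps the inner i loop onto one
// ZMM register per j, i.e. 28 "abstract vector registers".
void core_tile(const double *a_strip,   // 8 x 336 A-tilde strip, column-major
               const double *b_strip,   // 336 x 28 B-tilde strip, row-major
               double *c_tile,          // 8 x 28 tile of C, column-major (ldc = 8)
               std::size_t kc)          // depth, 336 in this article
{
    for (std::size_t p = 0; p < kc; ++p)
        for (std::size_t j = 0; j < 28; ++j) {
            const double b = b_strip[p * 28 + j];
            for (std::size_t i = 0; i < 8; ++i)
                c_tile[i + j * 8] += a_strip[i + p * 8] * b;
        }
}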

Performance Results for a C/C++ Implementation of DGEMM Based on Intel® Math Kernel Library/DGEMM

This section describes three DGEMM experiments that were run on a single Intel Xeon Phi processor socket that had 64 cores and 16 gigabytes of MCDRAM. The first experiment establishes a baseline for floating-point-operations-per second performance. Experiments 2 and 3 attempt to demonstrate increasing floating-point-operations-per-second performance. All three executables do storage allocation into MCDRAM or DDR using the respective function prototypes:

int hbw_posix_memalign(void **memptr, size_t alignment, size_t size);

and

void *_mm_malloc (size_t size, size_t align);

MCDRAM for these experiments was configured in flat mode. The executables were built with the Intel C/C++ Compiler. Cache management was used for the A and B matrices by transferring data into tile data structures that would fit into the L2 and L1 caches. MCDRAM was used for the Ã-tile. The other data structures that were allocated for this Intel® Math Kernel Library (Intel® MKL)/DGEMM implementation used DDR memory.

Please note that on your system, the floating-point-operations-per-second results will vary from those shown in Figures 10, 11, and 12. Results will be influenced, for example, by factors such as the version of the OS, the software stack component versions, the processor stepping, the number of cores on a socket, and the storage capacity of MCDRAM.

A shell script for running the experiment that resulted in the data for Figure 10 had the following arguments:

64 1 336 112 43008 43008 dynamic 2 <path-to-memkind-library>

  • 64 defines the number of core threads.
  • 1 defines the number of hardware threads per core that are to be used.
  • 336 defines the number of columns for the Ã and the number of rows for the B̃ tiling data structures. See Figure 5 and Figure 6.
  • 112 defines the number of columns for the B̃ data structure tile. Also see Figure 5 and Figure 6.
  • 43008 is the matrix order.
  • The second value 43008 refers to the number of rows for Ã.
  • The values dynamic and 2 are used to control the OpenMP* scheduling 18,19 (see Table 2 below).
  • The meta-symbol <path-to-memkind-library> refers to the directory path to the memkind library that is installed on the user’s Knights Landing system.

The first experiment is based on the data tiling storage diagrams from the section 5. Recall that each 512-bit vector register for Intel Xeon Phi processor can reference eight double-precision floating-point operations, and there is also an opportunity to use the FMA vector instruction for the core computation:

for ( … )
     C[ir+iir:8,jc+jjc] += …
     C[ir+iir:8,jc+jjc+1] += …

     …

     C[ir+iir:8,jc+jjc+26] += …
     C[ir+iir:8,jc+jjc+27] += …
endfor

For the compilation of orig_knl_dgemm.c into an executable for running on the Intel Xeon Phi processor, the floating-point-operations-per-second results might look something like the following:


Figure 10. Intel® Xeon Phi™ processor result for the executable orig_knl_dgemm.exe using 64 core threads and 1 OpenMP* thread per core. The matrix order was 43,008. The tile data structure for the A-matrix had 336 columns and the M-rows value was 43,008. No abstract vector registers were used for the matrix-multiply core solver.

In the next experiment, the source file called opt_knl_dgemm.c is used to build the executable called reg_opt_knl_dgemm.exe. In this file, references to the 28 columns of matrix C in the core solver are replaced with the following:

t0[0:8]  = C[ir+iir:8,jc+jjc];
t1[0:8]  = C[ir+iir:8,jc+jjc+1];

…

t26[0:8] = C[ir+iir:8,jc+jjc+26];
t27[0:8] = C[ir+iir:8,jc+jjc+27];

for ( … )
    t0[0:8] += …
    t1[0:8] += …

    …

    t27[0:8] += …
endfor

C[ir+iir:8,jc+jjc]    = t0[0:8];
C[ir+iir:8,jc+jjc+1]  = t1[0:8];

…

C[ir+iir:8,jc+jjc+26] = t26[0:8];
C[ir+iir:8,jc+jjc+27] = t27[0:8];

The notion of using the array temporaries t0 through t27 can be thought of as assigning abstract vector registers in the computation of partial results for the core matrix multiply algorithm. For this experiment on the Intel Xeon Phi processor, the floating-point-operations-per-second results might look something like:


Figure 11. Intel® Xeon Phi™ processor performance comparison between the executable, orig_knl_dgemm.exe and the executable, reg_opt_knl_dgemm.exe using 64 core threads and 1 OpenMP* thread per core. The matrix order was 43,008. The tile data structure for the A-matrix had 336 columns and the M-rows value was 43,008. The executable reg_opt_knl_dgemm.exe used 28 abstract vector registers for the matrix-multiply core solver

Note that in Figure 11, the result for the executable orig_knl_dgemm.exe is compared with the result for the executable reg_opt_knl_dgemm.exe (where 28 abstract vector registers were used). As mentioned previously, from an abstract vector register perspective, the intent was to explicitly manage 28 of the thirty-two 512-bit vector registers for a hardware thread within a Knights Landing core.

The last experiment (experiment 3) builds the Intel MKL/DGEMM executable called pref_32_0_reg_opt_knl_dgemm.exe using the Intel C/C++ compiler options -qopt-prefetch=2 and -qopt-prefetch-distance=n1,n2 where n1 and n2 are replaced with integer constants. The -qopt-prefetch-distance switch is used to control the number of iterations of data prefetching that take place for the L2 and L1 caches on Knights Landing. The L2, L1 combination that is reported here is (32,0). Figure 12 shows a comparison of experiments 1, 2, and 3.


Figure 12. Intel® Xeon Phi™ processor performance comparisons for executables, orig_knl_dgemm.exe, reg_opt_knl_dgem.exe, and pref_32_0_reg_opt_knl_dgemm.exe. Each executable used 64 core threads and 1 OpenMP* thread per core. The matrix order was 43,008. The tile data structure for the A-matrix had 336 columns and the M-rows value was 43,008. The executables reg_opt_knl_dgemm.exe and pref_32_0_reg_opt_knl_dgemm.exe used 28 abstract vector registers for the matrix-multiply core solver. The executable pref_32_0_reg_opt_knl_dgemm.exe was also built with the Intel® C/C++ Compiler prefetch switches -qopt-prefetch=2 and -qopt-prefetch-distance=32,0

For the three experiments discussed above, the user can download the shell scripts, makefiles, C/C++ source files, and a README.TXT file at the following URL:

Knights Landing/DGEMM Download Package

After downloading and untarring the package, note the following checklist:

  1. Make sure that the HBM software package called memkind is installed on your Knights Landing system. Click here to retrieve the package, if it is not already installed.
  2. Set the following environment variables:
    export MEMKIND_HBW_NODES=1
    export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:<path-to-memkind-library>/lib
    where <path-to-memkind-library> is a meta-symbol and represents the directory path to the memkind library where the user has done their installation of this library.
  3. Issue the command:
    cat /proc/sys/vm/nr_hugepages
    If it does not have a value of 20, then ask your system administrator to change this value on your Knights Landing system using root privileges. With a system administrator account, this can be done by issuing the command:
    echo 20 > /proc/sys/vm/nr_hugepages
    followed by the verification command:
    cat /proc/sys/vm/nr_hugepages
    As mentioned earlier in a subsection of Section 3, one does not need to explicitly set the number of huge pages by echoing to /proc/sys/vm/nr_hugepages as long as the content of /sys/kernel/mm/transparent_hugepage/enabled is set to “always”.
  4. Review the content of the README.TXT file that is within the directory opt_knl_dgemm_v1 on your Knights Landing system. This read-me file contains information about how to build and run the executables from a host Knights Landing system. The README.TXT file should be used as a guide for doing your own experiments.

Once you complete the checklist on the Intel Xeon Phi processor system, you can source an Intel Parallel Studio XE Cluster Edition script called psxevars.sh by doing the following:

. <path-to-Intel-Parallel-Studio-XE-Cluster-Edition>/psxevars.sh intel64

This script is sourced in particular to set up the Intel C/C++ compilation environment.

For experiment 1, issue a command sequence that looks something like the following within the directory opt_knl_dgemm_v1:

$ cd ./scripts
$ ./orig_knl_dgemm.sh <path-to-memkind-library>

The output report for a run with respect to the scripts sub-directory will be placed in the sibling directory called reports, and the report file should have a name something like:

orig_knl_dgemm_report.64.1.336.112.43008.43008.dynamic.2

where the suffix notation for the report file name has the following meaning:

  • 64 defines the number of core threads.
  • 1 defines the number of hardware threads per core that are to be used.
  • 336 defines the number of columns for the Ã and the number of rows for the B̃ tiling data structures. See Figures 5 and 6.
  • 112 defines the number of columns for the B̃ data structure tile. Also see Figures 5 and 6.
  • 43008 is the matrix order.
  • The second value 43008 refers to the number of rows for Ã.
  • The values dynamic and 2 are used to control the OpenMP scheduling (see below).

As mentioned earlier, OpenMP is used to manage threaded parallelism. In so doing, the OpenMP Standard 18 provides a scheduling option for work-sharing loops:

schedule(kind[, chunk_size])

This scheduling option is part of the C/C++ directive: #pragma omp parallel for, or #pragma omp for, and the Fortran* directive: !$omp parallel do, or !$omp do. The schedule clause specifies how iterations of the associated loops are divided into contiguous non-empty subsets, called chunks, and how these chunks are distributed among threads of a team. Each thread executes its assigned chunk or chunks in the context of its implicit task. The chunk_size expression is evaluated using the original list items of any variables that are made private in the loop construct.
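A minimal example of the clause in use is shown below (“dynamic” with a chunk size of 2 mirrors the “dynamic 2” arguments passed to the DGEMM scripts later in this section; the loop itself is just a placeholder):

#include <cstddef>

// Compile with -qopenmp (Intel compiler) or -fopenmp (GCC/Clang).
void scale(double *data, std::size_t n, double factor)
{
    #pragma omp parallel for schedule(dynamic, 2)
    for (long i = 0; i < static_cast<long>(n); ++i)
        data[i] *= factor;   // threads pull 2 iterations at a time from the work queue
}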

Table 2 provides a summary of the possible settings for the “kind” component for the “schedule” option.

Table 2. “kind” scheduling values for the OpenMP* schedule(kind[, chunk_size]) directive component for OpenMP work-sharing loops 18,19.

Kind

Description

Static

Divide the loop into equal-sized chunks or as equal as possible in the case where the number of loop iterations is not evenly divisible by the number of threads multiplied by the chunk size. By default, chunk size is the loop-count/number-of-threads. Set chunk to 1 to interleave the iterations.

Dynamic

Use the internal work queue to give a chunk-sized block of loop iterations to each thread. When a thread is finished, it retrieves the next block of loop iterations from the top of the work queue. By default, the chunk size is 1. Be careful when using this scheduling type because of the extra overhead involved.

Guided

Similar to dynamic scheduling, but the chunk size starts off large and decreases to better handle load imbalance between iterations. The optional chunk parameter specifies the minimum size chunk to use. By default, the chunk size is approximately loop-count/number-of-threads.

Auto

When schedule (auto) is specified, the decision regarding scheduling is delegated to the compiler. The programmer gives the compiler the freedom to choose any possible mapping of iterations to threads in the team.

Runtime

Uses the OMP_SCHEDULE environment variable to specify which one of the loop-scheduling types above should be used. OMP_SCHEDULE is a string formatted exactly the same as it would appear on the parallel construct.

An alternative to using the scheduling option of the C/C++ directives:

#pragma omp parallel for or #pragma omp for

or the Fortran directives:

!$omp parallel do or !$omp do

is to use the OpenMP environment variable OMP_SCHEDULE, which has the options:

type[,chunk]

where:

  • type is one of static, dynamic, guided, or auto.
  • chunk is an optional positive integer that specifies the chunk size.

Finally, experiments 2 and 3 can be launched from the scripts directory by using the commands:

./reg_opt_knl_dgemm.sh <path-to-memkind-library>

and

./pref_reg_opt_knl_dgemm.sh <path-to-memkind-library>

In a similar manner, the output report for each run with respect to the scripts sub-directory will be placed in the sibling directory called reports.

Conclusions 

The experiments on an Intel Xeon Phi processor architecture using HBM library storage allocation along with MCDRAM for a non-library C/C++ implementation of KNL/DGEMM indicate that data alignment, data placement, and management of the vector registers can help provide good performance on the Intel Xeon Phi processor. Management of the Intel Xeon Phi processor vector registers at the program-language-application-level was done with abstract vector registers. In general, you may want to use conditional compilation macros within your applications to control the selection of the high-bandwidth libraries for managing dynamic storage allocations into MCDRAM versus DDR. In this way, you can experiment with the application to see which storage allocation methodology provides the best execution performance for your application running on Intel Xeon Phi processor architectures. Finally, compiler prefetching controls were used for the L1 and L2 data caches. The experiments showed that making adjustments to prefetching further improved execution performance.

As mentioned earlier, the core solver for MKL DGEMM is written in assembly language, and when a user finds a need to use DGEMM as part of a software application programming solution, Intel® MKL DGEMM should be used. For completeness, the following URL provides performance charts for Intel® MKL DGEMM on Knights Landing:

https://software.intel.com/en-us/intel-mkl/benchmarks#DGEMM3

References 

  1. A. Sodani, R. Gramunt, J. Corbal, H. Kim, K. Vinod, S. Chinthamani, S. Hutsell, R. Agarwal, Y. Liu, “KNIGHTS LANDING: SECOND GENERATION INTEL® XEON PHI PRODUCT,” IEEE MICRO, March/April 2016, pp. 34-46. 
  2. Intel® VTune™ Amplifier 2017. 
  3. Intel® Trace Analyzer and Collector. 
  4. Getting Started with the MPI Performance Snapshot. 
  5. C/C++ Extensions for Array Notations Programming Model 
  6. Intel® 64 and IA-32 Architectures Software Developer Manuals. 
  7. Intel® Architecture Instruction Set Extensions Programming Reference.(PDF 74 KB) 
  8. B. Brett, Multi-Channel DRAM (MCDRAM) and High-Bandwidth Memory (HBM). 
  9. https://www.jedec.org/ 
  10. http://man7.org/linux/man-pages/man7/numa.7.html 
  11. https://github.com/memkind/memkind 
  12. https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt 
  13. Intel® C++ Compiler 17.0 Developer Guide and Reference.
  14. R. Krishnaiyer, Compiler Prefetching for the Intel® Xeon Phi™ coprocessor (PDF 336 KB). 
  15. K. Goto and R. van de Geijn, “Anatomy of High-Performance Matrix Multiplication,” ACM Transactions on Mathematical Software, Vol. 34, No. 3, May 2008, pp. 1-25. 
  16. Guide to Automatic Vectorization with Intel AVX-512 Instructions in Knights Landing Processors, May 2016. 
  17. Intel® Math Kernel Library (Intel® MKL). 
  18. The OpenMP API Specification for Parallel Programming. 
  19. R. Green, OpenMP Loop Scheduling.

Intel® ISA-L: Cryptographic Hashes for Cloud Storage


Download Code Sample

Download PDF

Introduction

Today’s new devices generate data that requires centralized storage and access everywhere, thus increasing the demand for more and faster cloud storage. At the point where data is collected and packaged for the cloud, improvements in data processing performance are important. Intel® Intelligent Storage Acceleration Library (Intel® ISA-L), with the ability to generate cryptographic hashes extremely fast, can improve data encryption performance. In this article, a sample application that includes downloadable source code will be shared to demonstrate the utilization of the Intel® ISA-L cryptographic hash feature. The sample application has been tested on the hardware and software configuration presented in the table below. Depending on the platform capability, Intel ISA-L can run on various Intel® processor families. Improvements are obtained by speeding up computations through the use of the following instruction sets:

Hardware and Software Configuration

CPU and Chipset

Intel® Xeon® processor E5-2699 v4, 2.2 GHz

  • # of cores per chip: 22 (only used single core)
  • # of sockets: 2
  • Chipset: Intel® C610 chipset, QS (B-1 step)
  • System bus: 9.6 GT/s Intel® QuickPath Interconnect
  • Intel® Hyper-Threading Technology off
  • Intel® Speed Step Technology enabled
  • Intel® Turbo Boost Technology disabled

Platform

Platform: Intel® Server System R2000WT product family (code-named Wildcat Pass)

  • BIOS: GRRFSDP1.86B.0271.R00.1510301446 ME:V03.01.03.0018.0 BMC:1.33.8932
  • DIMM slots: 24
  • Power supply: 1x1100W

Memory

Memory size: 256 GB (16X16 GB) DDR4 2133P

Brand/model: Micron* – MTA36ASF2G72PZ2GATESIG

Storage

Brand and model: 1 TB Western Digital* (WD1002FAEX)

Plus Intel® SSD P3700 Series (SSDPEDMD400G4)

Operating System

Ubuntu* 16.04 LTS (Xenial Xerus)

Linux kernel 4.4.0-21-generic

Why Use Intel® ISA-L?

Intel ISA-L has the capability to generate cryptographic hashes fast by utilizing Single Instruction Multiple Data (SIMD) instructions. The cryptographic functions are part of a separate collection within Intel ISA-L and can be found in the GitHub repository 01org/isa-l_crypto. To demonstrate this multithreading hash feature, this article simulates a sample “producer-consumer” application. A variable number (from 1-16) of “producer” threads will fill a single buffer with data chunks, while a single “consumer” thread will take data chunks from the buffer and calculate cryptographic hashes using Intel ISA-L’s implementations. For this demo, a developer can choose the number of threads (producers) submitting data (2, 4, 8, or 16) and the type of hash (MD5, SHA1, SHA256, or SHA512). The example will produce output that shows the utilization of the “consumer” thread and the overall wall-clock time.

Prerequisites

Intel ISA-L has known support for Linux* and Microsoft Windows*. A full list of prerequisite packages can be found here.

Building the sample application (for Linux):

  1. Install the dependencies:
    • a c++14 compliant c++ compiler
    • cmake >= 3.1
    • git
    • autogen
    • autoconf
    • automake
    • yasm and/or nasm
    • libtool
    • boost's "Program Options" library and headers

    sudo apt-get update
    sudo apt-get install gcc g++ make cmake git autogen autoconf automake yasm nasm libtool libboost-all-dev

  2. You also need the latest version of isa-l_crypto. The get_libs.bash script can be used to get it; the script downloads the library from its official GitHub repository, builds it, and installs it in ./libs/usr.

    bash ./libs/get_libs.bash

  3. Build from the `ex3` directory:

    mkdir <build-dir>
    cd <build-dir>
    cmake -DCMAKE_BUILD_TYPE=Release $OLDPWD
    make

Getting Started with the Sample Application

The download button for the source code is provided at the beginning of the article. The sample application contains the following:

This example goes through the work flow at a high level and focuses in detail only on the consumer code found in the “consumer.cpp” and “hash.cpp” files:

Setup

1. In the “main.cpp” file, we first parse the arguments coming from the command line and display the options that are going to be performed.

int main(int argc, char* argv[])
{
     options options = options::parse(argc, argv);
     display_info(options);

2. From the “main.cpp” file, we construct a shared_data object from the command-line options.

shared_data data(options);

In “shared_data.cpp”, the `shared_data` object provides the shared buffer that is written to by the producers and read by the consumer, as well as the means to synchronize those reads and writes.
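The shared_data class itself is not reproduced in this article. A minimal sketch of the kind of structure it describes, assuming one mutex, one condition variable, and a per-chunk readable flag (the names below are illustrative, not the sample's actual members), might look like this:

#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

// Illustrative shared buffer: producers flip a chunk to "readable", the consumer
// flips it back to "writable"; one mutex/condition variable pair synchronizes both sides.
class shared_buffer_sketch
{
public:
    shared_buffer_sketch(std::size_t chunks, std::size_t chunk_size)
        : m_readable(chunks, false), m_chunks(chunks, std::vector<uint8_t>(chunk_size)) {}

    std::mutex&              mutex() { return m_mutex; }
    std::condition_variable& cv()    { return m_cv; }

    bool ready_read() const                        // is any chunk ready to be hashed?
    {
        for (bool r : m_readable) if (r) return true;
        return false;
    }
    uint8_t* get_chunk(std::size_t idx)        { return m_chunks[idx].data(); }
    void     mark_ready_read(std::size_t idx)  { m_readable[idx] = true;  }
    void     mark_ready_write(std::size_t idx) { m_readable[idx] = false; }

private:
    std::mutex                        m_mutex;
    std::condition_variable           m_cv;
    std::vector<bool>                 m_readable;
    std::vector<std::vector<uint8_t>> m_chunks;
};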

Parsing the option of the command line

3. In the options.cpp file, the program parses the command line arguments using: `options::parse()`.

Create the Producer

4. In the “main.cpp” file, we create the producers and call each one's `producer::run()` method in a new thread (`std::async` with the `std::launch::async` launch policy is used for that).

for (uint8_t i = 0; i < options.producers; ++i)
       producers_future_results.push_back(
            std::async(std::launch::async, &producer::run, &producers[i]));

In the “producer.cpp” file, each producer is assigned one chunk id (stored in m_id) into which it will submit data.

On each iteration, we:

  • wait until our chunk is ready_write, then fill it with data.
  • sleep for the appropriate amount of time to simulate the time it could take to generate data.

The program generates very simple data: each chunk is filled repeatedly with a single random character (returned by random_data_generator::get()). See the “random_data_generator.cpp” file for more details.
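The producer loop itself is only described in prose above; the following sketch illustrates it, with the three std::function callbacks standing in for the shared_data synchronization calls (they are placeholders, not the actual sample code):

#include <chrono>
#include <functional>
#include <thread>

// Illustrative producer loop: wait until our chunk is writable, fill it, mark it
// readable for the consumer, then sleep to simulate the time needed to generate data.
void producer_run_sketch(int m_id, int iterations, std::chrono::milliseconds generation_time,
                         const std::function<void(int)>& wait_ready_write,
                         const std::function<void(int)>& fill_chunk,
                         const std::function<void(int)>& mark_ready_read)
{
    for (int it = 0; it < iterations; ++it)
    {
        wait_ready_write(m_id);
        fill_chunk(m_id);
        mark_ready_read(m_id);
        std::this_thread::sleep_for(generation_time);
    }
}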

5. In the “main.cpp” file, the program stores a `std::future` object for each producer's thread. Each `std::future` object provides a way to wait synchronously for the thread to be done; the thread itself does not return any data.

std::vector<std::future<void>> producers_future_results;

Create the Consumer and start the hashing for the data

6. In the “main.cpp” file, the program then creates only one consumer and calls its `consumer::run()` method.

consumer consumer(data, options);
    consumer.run();

In the “consumer.cpp” file, the consumer will repeatedly:

  • wait for some chunks of data to be ready_read (m_data.cv().wait_for).
  • submit each of them to be hashed (m_hash.hash_entire).
  • mark those chunks as ready_write (m_data.mark_ready_write).
  • wait for the jobs to be done (m_hash.hash_flush).
  • unlock the mutex and notify all waiting threads, so the producers can start filling the chunks again.

When all the data has been hashed, we display the results, including the thread usage. This is computed by comparing the time spent waiting for chunks to become readable with the total time the consumer thread was running.

consumer::consumer(shared_data& data, options& options)
    : m_data(data), m_options(options), m_hash(m_options.function)
{
}

void consumer::run()
{
    uint64_t hashes_submitted = 0;

    auto start_work    = std::chrono::steady_clock::now();
    auto wait_duration = std::chrono::nanoseconds{0};

    while (true)
    {
        auto start_wait = std::chrono::steady_clock::now();

        std::unique_lock<std::mutex> lk(m_data.mutex());

        // We wait for at least 1 chunk to be readable
        auto ready_in_time =
            m_data.cv().wait_for(lk, std::chrono::seconds{1}, [&] { return m_data.ready_read(); });

        auto end_wait = std::chrono::steady_clock::now();
        wait_duration += (end_wait - start_wait);

        if (!ready_in_time)
        {
            continue;
        }

        while (hashes_submitted < m_options.iterations)
        {
            int idx = m_data.first_chunck_ready_read();

            if (idx < 0)
                break;

            // We submit each readable chunk to the hash function, then mark that chunk as writable
            m_hash.hash_entire(m_data.get_chunk(idx), m_options.chunk_size);
            m_data.mark_ready_write(idx);
            ++hashes_submitted;
        }

        // We unlock the mutex and notify all waiting thread, so the producers can start filling the
        // chunks again
        lk.unlock();
        m_data.cv().notify_all();

        // We wait until all hash jobs are done
        for (int i = 0; i < m_options.producers; ++i)
            m_hash.hash_flush();

        display_progress(m_hash.generated_hashes(), m_options.iterations);

        if (hashes_submitted == m_options.iterations)
        {
            auto end_work      = std::chrono::steady_clock::now();
            auto work_duration = (end_work - start_work);

            std::cout << "[Info   ] Elasped time:          ";
            display_time(work_duration.count());
            std::cout << "\n";
            std::cout << "[Info   ] Consumer thread usage: "<< std::fixed << std::setprecision(1)<< (double)(work_duration - wait_duration).count() / work_duration.count() *
                             100<< " %\n";

            uint64_t total_size = m_options.chunk_size * m_options.iterations;
            uint64_t throughput = total_size /
                                  std::chrono::duration_cast<std::chrono::duration<double>>(
                                      work_duration - wait_duration)
                                      .count();

            std::cout << "[Info   ] Hash speed:            "<< size::to_string(throughput)<< "/s ("<< size::to_string(throughput, false) << "/s)\n";

            break;
        }
    }
}

The “hash.cpp” file provides a simple common interface to the md5/sha1/sha256/sha512 hash routines.

hash::hash(hash_function function) : m_function(function), m_generated_hashes(0)
{
    switch (m_function)
    {
        case hash_function::md5:
            m_hash_impl = md5(&md5_ctx_mgr_init, &md5_ctx_mgr_submit, &md5_ctx_mgr_flush);
            break;
        case hash_function::sha1:
            m_hash_impl = sha1(&sha1_ctx_mgr_init, &sha1_ctx_mgr_submit, &sha1_ctx_mgr_flush);
            break;
        case hash_function::sha256:
            m_hash_impl =
                sha256(&sha256_ctx_mgr_init, &sha256_ctx_mgr_submit, &sha256_ctx_mgr_flush);
            break;
        case hash_function::sha512:
            m_hash_impl =
                sha512(&sha512_ctx_mgr_init, &sha512_ctx_mgr_submit, &sha512_ctx_mgr_flush);
            break;
    }
}


void hash::hash_entire(const uint8_t* chunk, uint len)
{
    submit_visitor visitor(chunk, len);
    if (boost::apply_visitor(visitor, m_hash_impl))
        ++m_generated_hashes;
}


void hash::hash_flush()
{
    flush_visitor visitor;
    if (boost::apply_visitor(visitor, m_hash_impl))
        ++m_generated_hashes;
}


uint64_t hash::generated_hashes() const
{
    return m_generated_hashes;
}

7. Once `consumer::run()` is done and control returns to the main program, the program waits for each producer to finish by calling `std::future::wait()` on each `std::future` object.

for (const auto& producer_future_result : producers_future_results)
        producer_future_result.wait();

Execute the Sample Application

In this example, the program generates data in N producer threads and hashes the data using a single consumer thread. The output shows whether the consumer thread can keep up with the N producer threads.

Configuring the tests

Speed of data generation

Since this is not a real-world application, the data generation can be almost as fast or slow as we want. The “--speed” argument is used to choose how fast each producer is generating data.

With “--speed 50MB”, each producer thread takes 1 second to generate a 50 MB chunk.

The faster the speed, the less time the consumer thread has to hash the data before new chunks are available, so the consumer thread usage will be higher. For example, with 16 producers at “--speed 50MB”, the consumer must hash roughly 800 MB of data per second to keep up.

Number of producers

The “--producers” argument is used to choose the number of producer threads to concurrently generate and submit data chunks.

Important note: On each iteration, the consumer thread submits at most that number of data chunks to be hashed. So the higher the number, the more opportunity there is for “isa-l_crypto” to run hash jobs at the same time, which in turn affects the way the program measures the consumer thread usage.

Chunk size

Each producer fills a chunk of this size on every iteration, and the consumer submits chunks of this size to the hash function.

The “--chunk-size” argument is used to choose that value.

This is a very important value, as it directly affects how long each hash job will take.

Total size

This is the total amount of data to be generated and hashed. Knowing this and the other parameters, the program knows how many times chunks will need to be generated in total, and how many hash jobs will be submitted in total.

When choosing the “--total-size” argument, it is important to pick a value large enough (compared to the chunk size) that a large number of jobs is submitted, in order to cancel some of the noise in measuring the time taken by those jobs.

The results

taskset -c 3,4 ./ex3 --producers 1
...
[Info ] Elapsed time: 2.603 s
[Info ] Consumer thread usage: 42.0 %
[Info ] Hash speed: 981.7 MB/s (936.2 MiB/s)

Elapsed time

This is the total time taken by the whole process

Consumer thread usage

We compare how long we spent waiting for chunks of data to be available to how long the consumer thread has been running in total.

Any value lower than 100% shows that the consumer thread was able to keep up with the producers and had to wait for new chunks of data.

A value very close to 100% shows that the consumer thread was consistently busy and was not able to outrun the producers.

Hash speed

This is the effective speed at which the isa-l_crypto functions hashed the data. The clock for this starts running as soon as at least one data chunk is available, and stops when all these chunks have been hashed.

Running the example

Running the “ex3” example with the taskset command pinned to cores 3 and 4 should give the following output:


Output for the command taskset -c 3,4 ./ex3 --producers 1

The program runs with a single producer thread on cores 3 and 4; the consumer thread spends ~55% of its time waiting for the producer to submit data.

Running the program with the taskset command pinned to cores 3 to 20, with 16 producer threads, should give the following output:


Output for the command taskset -c 3-20 ./ex3 --producers 16

The program runs with sixteen producer threads on cores 3 to 20; the consumer thread spends only ~2% of its time waiting for producers to submit data.

Notes: 2x Intel® Xeon® processor E5-2699v4 (HT off), Intel® Speed Step enabled, Intel® Turbo Boost Technology disabled, 16x16GB DDR4 2133 MT/s, 1 DIMM per channel, Ubuntu* 16.04 LTS, Linux kernel 4.4.0-21-generic, 1 TB Western Digital* (WD1002FAEX), 1 Intel® SSD P3700 Series (SSDPEDMD400G4), 22x per CPU socket. Performance measured by the written sample application in this article.

Conclusion

As demonstrated in this quick tutorial, the hash function feature can be applied to any storage application. The source code for the sample application is also provided for your reference. Intel ISA-L gives storage developers a library they can quickly adopt in applications running on Intel® architecture.

Other Useful Links

Authors

Thai Le is a Software Engineer who focuses on cloud computing and performance computing analysis at Intel.

Steven Briscoe is an Application Engineer focusing on Cloud Computing within the Software Services Group at Intel Corporation (UK).

Notices

System configurations, SSD configurations and performance tests conducted are discussed in detail within the body of this paper. For more information go to http://www.intel.com/content/www/us/en/benchmarks/intel-product-performance.html.

This sample source code is released under the Intel Sample Source Code License Agreement.

Intel® ISA-L: Semi-Dynamic Compression Algorithms


Download Code Sample

Download PDF

Introduction

Compression algorithms traditionally use either a dynamic or static compression table. Those who want the best compression results use a dynamic table at the cost of more processing time, while the algorithms focused on throughput will use static tables. The Intel® Intelligent Storage Acceleration Library (Intel® ISA-L) semi-dynamic compression comes close to getting the best of both worlds. Testing shows the usage of semi-dynamic compression and decompression is only slightly slower than using a static table and almost as efficient as algorithms that use dynamic tables. This article's goal is to help you incorporate Intel ISA-L’s semi-dynamic compression and decompression algorithms into your storage application. It describes prerequisites for using Intel ISA-L, and includes a downloadable code sample, with full build instructions. The code sample is a compression tool that can be used to compare the compression ratio and performance of Intel ISA-L’s semi-dynamic compression algorithm on a public data set with the output of its open source equivalent, zlib*.

Hardware and Software Configuration

CPU and Chipset

Intel® Xeon® processor E5-2699 v4, 2.2 GHz

  • Number of cores per chip: 22 (only used single core)
  • Number of sockets: 2
  • Chipset: Intel® C610 series chipset, QS (B-1 step)
  • System bus: 9.6 GT/s Intel® QuickPath Interconnect
  • Intel® Hyper-Threading Technology off
  • Intel SpeedStep® technology enabled
  • Intel® Turbo Boost Technology disabled
Platform

Platform: Intel® Server System R2000WT product family (code-named Wildcat Pass)

  • BIOS: GRRFSDP1.86B.0271.R00.1510301446 ME:V03.01.03.0018.0 BMC:1.33.8932
  • DIMM slots: 24
  • Power supply: 1x1100W
Memory

Memory size: 256 GB (16X16 GB) DDR4 2133P

Brand/model: Micron – MTA36ASF2G72PZ2GATESIG

Storage

Brand and model: 1 TB Western Digital (WD1002FAEX)

Plus Intel® SSD Data Center P3700 Series (SSDPEDMD400G4)

Operating System

Ubuntu* 16.04 LTS (Xenial Xerus)

Linux* kernel 4.4.0-21-generic

Note: Depending on the platform capability, Intel ISA-L can run on various Intel® processor families. Improvements are obtained by speeding up the computations through the use of the following instruction sets:

Why Use Intel® Intelligent Storage Library (Intel® ISA-L)?

Intel ISA-L can compress and decompress faster than zlib* with only a small sacrifice in the compression ratio. This capability is well suited for high-throughput storage applications. This article includes a sample application that runs a compression and decompression scenario and reports the resulting efficiency. Click on the button at the top of this article to download it.

Prerequisites

Intel ISA-L supports Linux and Microsoft Windows*. A full list of prerequisite packages can be found here.

Building the sample application (for Linux):

  1. Install the dependencies:
    • a c++14 compliant c++ compiler
    • cmake >= 3.1
    • git
    • autogen
    • autoconf
    • automake
    • yasm and/or nasm
    • libtool
    • boost's "Filesystem" library and headers
    • boost's "Program Options" library and headers
    • boost's "String Algo" headers

      >sudo apt-get update
      >sudo apt-get install gcc g++ make cmake git zlib1g-dev autogen autoconf automake yasm nasm libtool libboost-all-dev

  2. You also need the latest versions of isa-l and zlib. The get_libs.bash script can be used to get them. The script will download the two libraries from their official GitHub* repositories, build them, and then install them in the `./libs/usr` directory.

    >`bash ./libs/get_libs.bash`

  3. Build from the `ex1` directory:
    • `mkdir <build-dir>`
    • `cd <build-dir>`
    • `cmake -DCMAKE_BUILD_TYPE=Release $OLDPWD`
    • `make`

Getting Started with the Sample Application 

The sample application contains the following files:

Sample App

This example goes through the work flow at a high level and focuses on the “main.cpp” and “bm_isal.cpp” files:

Setup

1. In the “main.cpp” file, the program parses the command line and displays the options that are going to be performed.

int main(int argc, char* argv[])
{
     options options = options::parse(argc, argv);

Parsing the option of the command line

2. In the options.cpp file, the program parses the command line arguments using `options::parse()`.

Create the benchmarks object

3. In the “main.cpp” file, the program registers a benchmark for each raw file and library/level combination through the benchmarks::add_benchmark() function. Since the benchmarks do not run concurrently, only one file “pointer” is created.

benchmarks benchmarks;

// adding the benchmark for each file and library/level combination
for (const auto& path : options.files)
{
	auto compression   = benchmark_info::Method::Compression;
	auto decompression = benchmark_info::Method::Decompression;
	auto isal          = benchmark_info::Library::ISAL;
	auto zlib          = benchmark_info::Library::ZLIB;

	benchmarks.add_benchmark({compression, isal, 0, path});
	benchmarks.add_benchmark({decompression, isal, 0, path});

	for (auto level : options.zlib_levels)
	{
		if (level >= 1 && level <= 9)
		{
			benchmarks.add_benchmark({compression, zlib, level, path});
			benchmarks.add_benchmark({decompression, zlib, level, path});
		}
		else
		{
			std::cout << "[Warning] zlib compression level "<< level << "will be ignored\n";
		}
	}
}

Intel® ISA-L compression and decompression

4. In the “bm_isal.cpp” file, the program performs the compression and decompression on the raw file using a single thread. The key functions to note are isal_deflate and isal_inflate. Both functions accept a stream as an argument; this data structure holds the input buffer, the length in bytes of the input buffer, the output buffer, and the size of the output buffer. The end_of_stream field indicates whether this is the last iteration.

std::string bm_isal::version()
{
    return std::to_string(ISAL_MAJOR_VERSION) + "." + std::to_string(ISAL_MINOR_VERSION) + "." +
           std::to_string(ISAL_PATCH_VERSION);
}

bm::raw_duration bm_isal::iter_deflate(file_wrapper* in_file, file_wrapper* out_file, int /*level*/)
{
    raw_duration duration{};

    struct isal_zstream stream;

    uint8_t input_buffer[BUF_SIZE];
    uint8_t output_buffer[BUF_SIZE];

    isal_deflate_init(&stream);
    stream.end_of_stream = 0;
    stream.flush         = NO_FLUSH;

    do
    {
        stream.avail_in      = static_cast<uint32_t>(in_file->read(input_buffer, BUF_SIZE));
        stream.end_of_stream = static_cast<uint32_t>(in_file->eof());
        stream.next_in       = input_buffer;
        do
        {
            stream.avail_out = BUF_SIZE;
            stream.next_out  = output_buffer;

            auto begin = std::chrono::steady_clock::now();
            isal_deflate(&stream);
            auto end = std::chrono::steady_clock::now();
            duration += (end - begin);

            out_file->write(output_buffer, BUF_SIZE - stream.avail_out);
        } while (stream.avail_out == 0);
    } while (stream.internal_state.state != ZSTATE_END);

    return duration;
}

bm::raw_duration bm_isal::iter_inflate(file_wrapper* in_file, file_wrapper* out_file)
{
    raw_duration duration{};

    int                  ret;
    int                  eof;
    struct inflate_state stream;

    uint8_t input_buffer[BUF_SIZE];
    uint8_t output_buffer[BUF_SIZE];

    isal_inflate_init(&stream);

    stream.avail_in = 0;
    stream.next_in  = nullptr;

    do
    {
        stream.avail_in = static_cast<uint32_t>(in_file->read(input_buffer, BUF_SIZE));
        eof             = in_file->eof();
        stream.next_in  = input_buffer;
        do
        {
            stream.avail_out = BUF_SIZE;
            stream.next_out  = output_buffer;

            auto begin = std::chrono::steady_clock::now();
            ret        = isal_inflate(&stream);
            auto end   = std::chrono::steady_clock::now();
            duration += (end - begin);

            out_file->write(output_buffer, BUF_SIZE - stream.avail_out);
        } while (stream.avail_out == 0);
    } while (ret != ISAL_END_INPUT && eof == 0);

    return duration;
}

5. When all of the compression and decompression tasks run by benchmarks.run() are complete, the program displays the results on the screen and deletes all temporary files.

Execute the sample application

In this example, the program will run as a single thread through the compression and decompression functions of the Intel ISA-L and zlib.

Run

From the ex1 directory:

cd <build-dir>/ex1

./ex1 --help

Usage

Usage: ./ex1 [--help] [--folder <path>]... [--file <path>]... :
  --help                display this message
  --file path           use the file at 'path'
  --folder path         use all the files in 'path'
  --zlib-levels n,...   comma-separated list of compression levels [1-9]

  • --file and --folder can be used multiple times to add more files to the benchmark
  • --folder will look for files recursively
  • the default --zlib-level is 6

Test corpora are public data sets designed to exercise compression and decompression algorithms and are available online (for example, the Calgary and Silesia corpora). The --folder option can be used to easily benchmark them: ./ex1 --folder /path/to/corpus/folder.

Running the example

As Intel CPUs have integrated PCIe* onto the package, it is possible to optimize access to solid-state drives (SSDs) and avoid a potential performance penalty for accesses that cross an Intel® QuickPath Interconnect (Intel® QPI)/Intel® Ultra Path Interconnect (Intel® UPI) link. For example, in a two-socket (two CPU) system with a PCIe SSD, the SSD is attached to one of the sockets. If the SSD is attached to socket 1 and the program accessing it runs on socket 2, the requests and the data have to cross the Intel QPI/Intel UPI link that connects the sockets. To avoid this, find out which socket the PCIe SSD is attached to and then set thread affinity so that the program runs on the same socket as the SSD. The following commands list the PCIe devices attached to the system and filter for ‘ssd’ in the output. For example:

lspci -vvv | grep -i ssd
cd /sys/class/pci_bus

PCI Identifier

05:00.0 is the PCI* identifier and can be used to get more details from within Linux.

cd /sys/class/pci_bus/0000:05/device

This directory includes a number of files that give additional information about the PCIe device, such as make, model, power settings, and so on. To determine which socket this PCIe device is connected to, use:

cat local_cpulist

The output returned looks like the following:

Output Return

Now we can use this information to set thread affinity, using taskset:

taskset -c 10 ./ex1

For the `-c 10` option, the number can be anything from 0 to 21, as those are the core IDs of the socket this PCIe SSD is attached to.

Running the application with the taskset command pinned to core 10 should give the output below. If the system does not have a PCIe SSD, the application can simply be run without the taskset command.
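If you prefer to pin the benchmark from inside the application rather than with taskset, a minimal Linux-only sketch using sched_setaffinity() is shown below (core 10 is simply the example value used above; it is not part of the sample application):

#include <sched.h>    // sched_setaffinity, CPU_ZERO, CPU_SET (may require _GNU_SOURCE)
#include <cstdio>

// Pin the calling process to a single core, the programmatic equivalent of
// "taskset -c 10 ./ex1" for a core that belongs to the SSD's socket.
static bool pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    if (sched_setaffinity(0 /* current process */, sizeof(set), &set) != 0)
    {
        std::perror("sched_setaffinity");
        return false;
    }
    return true;
}

int main()
{
    if (pin_to_core(10))
        std::printf("Pinned to core 10\n");
    // ... the benchmark would run here ...
    return 0;
}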

Compression Library

Program output displays a column for the compression library, either ‘isa-l’ or ‘zlib’. The table shows the compression ratio (compressed file/raw file), and the system and processor time that it takes to perform the operation. For decompression, it just measures the elapsed time for the decompression operation. All the data was produced on the same system.

Notes: 2x Intel® Xeon® processor E5-2699v4 (HT off), Intel® Speed Step enabled, Intel® Turbo Boost Technology disabled, 16x16GB DDR4 2133 MT/s, 1 DIMM per channel, Ubuntu* 16.04 LTS, Linux kernel 4.4.0-21-generic, 1 TB Western Digital* (WD1002FAEX), 1 Intel® SSD P3700 Series (SSDPEDMD400G4), 22x per CPU socket. Performance measured by the written sample application in this article.

Conclusion

This tutorial and its sample application demonstrate one method for incorporating the Intel ISA-L compression and decompression features into your storage application. The sample application's output data shows that there is a balancing act between processing time (CPU time) and disk space. It can help you determine which compression and decompression algorithm best suits your requirements, and then quickly adapt your application to take advantage of Intel® architecture with Intel ISA-L.

Other Useful Links

Authors

Thai Le is a software engineer who focuses on cloud computing and performance computing analysis at Intel.

Steven Briscoe is an application engineer focusing on cloud computing within the Software Services Group at Intel Corporation (UK).

Notices

System configurations, SSD configurations and performance tests conducted are discussed in detail within the body of this paper. For more information go to http://www.intel.com/content/www/us/en/benchmarks/intel-product-performance.html.

This sample source code is released under the Intel Sample Source Code License Agreement.

3D Isotropic Acoustic Finite-Difference Wave Equation Code: A Many-Core Processor Implementation and Analysis


Finite difference is a simple and efficient mathematical tool that helps solve differential equations. In this paper, we solve an isotropic acoustic 3D wave equation using explicit, time domain finite differences.

Propagating seismic waves remains a compute-intensive task even when considering the simplest expression of the wave equation. In this paper, we explain how to implement and optimize a three-dimension isotropic kernel with finite differences to run on the Intel® Xeon® processor v4 Family and the Intel® Xeon Phi™ processor.

We also give a brief overview of the new memory hierarchy introduced with the Intel® Xeon Phi™ processor and of the settings and source-code modifications needed to use the C/C++ high-bandwidth memory (HBM) application programming interfaces (APIs) for dynamic allocation from Multi-Channel DRAM (MCDRAM).
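The HBM APIs mentioned above are provided by the memkind library's hbwmalloc interface. As an illustration only (this is not code from the paper), allocating a wavefield array from MCDRAM with a fallback to regular DDR might look like the following:

#include <hbwmalloc.h>
#include <cstdlib>
#include <cstdio>

int main()
{
    const std::size_t n = std::size_t(1024) * 1024 * 64;   // 64M floats (~256 MB)
    float* field    = nullptr;
    bool   in_mcdram = false;

    if (hbw_check_available() == 0)                // 0 means HBM/MCDRAM is usable
    {
        field = static_cast<float*>(hbw_malloc(n * sizeof(float)));
        in_mcdram = (field != nullptr);
    }
    if (field == nullptr)                          // fall back to a normal DDR allocation
        field = static_cast<float*>(std::malloc(n * sizeof(float)));

    std::printf("allocated from %s\n", in_mcdram ? "MCDRAM" : "DDR");

    // ... finite-difference time stepping would use 'field' here ...

    if (in_mcdram) hbw_free(field);
    else           std::free(field);
    return 0;
}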

What to Do When Auto-Vectorization Fails?


Introduction

The following article is a follow-up to, and a detailed analysis of, a problem reported on the Intel® Developer Zone (Intel® DZ) forum1 dedicated to the Intel® C++ Compiler2.

An Intel DZ user implemented a simple program as part of a code modernization workshop, and a problem with an inner for-loop was detected. Here is a piece of the code related to the problem:

	...
	for (std::size_t i = 0; i < nb_cluster; ++i) {
		float x = point[k].red - centroid[i].red;
		float y = point[k].green - centroid[i].green;
		float z = point[k].blue - centroid[i].blue;
		float distance = std::pow(x, 2) + std::pow(y, 2) + std::pow(z, 2);
		if (distance < best_distance) {
			best_distance = distance;
			best_centroid = i;
		}
	...

Note: This is the inner for-loop from KmcTestAppV1.cpp that is not auto-vectorized.

The Intel DZ user suspected that the inner for-loop was not auto-vectorized because the variable 'i' was declared with the 'std::size_t' data type, that is, as an unsigned integer type.

Unmodified source code6 is attached. See KmcTestAppV1.cpp for more details.

Note that this article is not a tutorial on vectorization or parallelization techniques. However, a brief overview of these techniques will be given in the next part of this article.

A brief overview of vectorization and parallelization techniques

Modern software is complex and in order to achieve peak performance, especially when doing data-intensive processing, the vectorization and parallelization capabilities of modern CPUs, which could have many cores with several Logical Processing Units (LPUs) and Vector Processing Units (VPUs), need to be fully used.

VPUs allow different operations to be performed on multiple values of a data set simultaneously, and this technique, called vectorization, increases the performance of the processing when compared to the same processing implemented in a scalar, or sequential, way.

Parallelization is another technique, which allows different parts of a data set to be processed at the same time by different LPUs.

When vectorization and parallelization are combined, the performance of the processing can be boosted significantly.

Generic vectorization rules

You need to take into account the following generic rules related to vectorization of source codes:

  • A modern C/C++ compiler needs to be used with vectorization support.
  • Two types of vectorization techniques can be used: auto-vectorization (AV) and explicit vectorization (EV).
  • Only relatively simple inner for-loops can be vectorized.
  • Some inner for-loops cannot be vectorized with AV or EV techniques, because complex C or C++ constructions are used, for example, Standard Template Library classes or C++ operators.
  • It is recommended to review and analyze all cases when a modern C/C++ compiler cannot vectorize inner for-loops.

How an inner for-loop counter variable should be declared

The AV technique is considered the most effective for simple inner for-loops, because no code modifications are required, and AV of modern C/C++ compilers is enabled by default when optimization options 'O2' or 'O3' are used.

In more complex cases EV can be used to force vectorization using intrinsic functions, or vectorization #pragma directives, but it requires some modifications of inner for-loops.
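As an illustration of the EV approach (this sketch is not part of the test program discussed below), a simple loop can be explicitly vectorized with an OpenMP SIMD directive, which the Intel C++ Compiler accepts with the -qopenmp-simd option (GCC uses -fopenmp-simd):

// Explicit vectorization sketch: the directive asserts that the iterations are
// independent, so the compiler vectorizes the loop even when its own analysis
// is inconclusive.
void add_arrays(float* a, const float* b, int n)
{
#pragma omp simd
    for (int i = 0; i < n; i += 1)
        a[i] = a[i] + b[i];
}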

A question can be asked: How should an inner for-loop counter variable be declared?

Two possible declarations can be considered:

Case A - Variable 'i' is declared as 'int'

	...
	for( int i = 0; i < n; i += 1 )
	{
		A[i] = A[i] + B[i];
	}
	...

and

Case B - Variable 'i' is declared as 'unsigned int'

	...
	for( unsigned int i = 0; i < n; i += 1 )
	{
		A[i] = A[i] + B[i];
	}
	...

In Case A the variable 'i' is declared as a signed data type 'int'.

In Case B the variable 'i' is declared as an unsigned data type 'unsigned int'.

Cases A and B could be combined in a simple test program 3 to evaluate vectorization capabilities of a C/C++ compiler:

////////////////////////////////////////////////////////////////////////////////////////////////////
// TestApp.cpp - To generate assembly listings an option '-S' needs to be used.
// Linux:
//		icpc -O3 -xAVX -qopt-report=1 TestApp.cpp -o TestApp.out
//		g++ -O3 -mavx -ftree-vectorizer-verbose=1 TestApp.cpp -o TestApp.out
// Windows:
//		icl   -O3 /QxAVX /Qvec-report=1 TestApp.cpp TestApp.exe
//		g++ -O3 -mavx -ftree-vectorizer-verbose=1 TestApp.cpp -o TestApp.exe

#include <stdio.h>
#include <stdlib.h>
//

////////////////////////////////////////////////////////////////////////////////////////////////////

	typedef float			RTfnumber;

	typedef int				RTiterator;			// Uncomment for Test A
	typedef int				RTinumber;
//	typedef unsigned int	RTiterator;			// Uncomment for Test B
//	typedef unsigned int	RTinumber;

////////////////////////////////////////////////////////////////////////////////////////////////////

	const RTinumber iDsSize = 1024;

////////////////////////////////////////////////////////////////////////////////////////////////////

int main( void )
{
	RTfnumber fDsA[ iDsSize ];
	RTfnumber fDsB[ iDsSize ];

	RTiterator i;

	for( i = 0; i < iDsSize; i += 1 )
		fDsA[i] = ( RTfnumber )( i );
	for( i = 0; i < iDsSize; i += 1 )
		fDsB[i] = ( RTfnumber )( i );

	for( i = 0; i < 16; i += 1 )
		printf( "%4.1f ", fDsA[i] );
	printf( "\n" );
	for( i = 0; i < 16; i += 1 )
		printf( "%4.1f ", fDsB[i] );
	printf( "\n" );

	for( i = 0; i < iDsSize; i += 1 )
		fDsA[i] = fDsA[i] + fDsB[i];			// Line 49

	for( i = 0; i < 16; i += 1 )
		printf( "%4.1f ", fDsA[i] );
	printf( "\n" );

	return ( int )1;
}

It turns out that the for-loop at Line 49 is easily vectorizable4 in both cases (instructions with the 'v' prefix are used, such as vmovups, vaddps, and so on), and the Intel C++ Compiler generates identical vectorization reports regardless of how the variable 'i' is declared:

Vectorization report for cases A and B

...
	Begin optimization report for: main()
	Report from: Interprocedural optimizations [ipo]
	INLINE REPORT: (main())
	Report from: Loop nest, Vector & Auto-parallelization optimizations [loop, vec, par]
	LOOP BEGIN at TestApp.cpp(37,2)
		remark #25045: Fused Loops: ( 37 39 )
		remark #15301: FUSED LOOP WAS VECTORIZED
	LOOP END
	LOOP BEGIN at TestApp.cpp(39,2)
	LOOP END
	LOOP BEGIN at TestApp.cpp(42,2)
		remark #25460: No loop optimizations reported
	LOOP END
	LOOP BEGIN at TestApp.cpp(45,2)
		remark #25460: No loop optimizations reported
	LOOP END
	LOOP BEGIN at TestApp.cpp(49,2)
		remark #15300: LOOP WAS VECTORIZED
	LOOP END
	LOOP BEGIN at TestApp.cpp(52,2)
		remark #25460: No loop optimizations reported
	LOOP END
...

The vectorization reports4 show that a for-loop at Line 493 was vectorized:

	...
	LOOP BEGIN at TestApp.cpp(49,2)
		remark #15300: LOOP WAS VECTORIZED
	LOOP END
	...

However, the Intel C++ Compiler considers these two for-loops as different C language constructions and generates different vectorized binary codes.

Here are the two core pieces of assembler listings, related to the for-loop at Line 493, for both cases:

Case A - Assembler listing (option '-S' needs to be used when compiling TestApp.cpp)

...
..B1.12:								# Preds ..B1.12 ..B1.11
	vmovups		(%rsp,%rax,4), %ymm0					#50.13
	vmovups		32(%rsp,%rax,4), %ymm2					#50.13
	vmovups		64(%rsp,%rax,4), %ymm4					#50.13
	vmovups		96(%rsp,%rax,4), %ymm6					#50.13
	vaddps		4128(%rsp,%rax,4), %ymm2, %ymm3			#50.23
	vaddps		4096(%rsp,%rax,4), %ymm0, %ymm1			#50.23
	vaddps		4160(%rsp,%rax,4), %ymm4, %ymm5			#50.23
	vaddps		4192(%rsp,%rax,4), %ymm6, %ymm7			#50.23
	vmovups		%ymm1, (%rsp,%rax,4)					#50.3
	vmovups		%ymm3, 32(%rsp,%rax,4)					#50.3
	vmovups		%ymm5, 64(%rsp,%rax,4)					#50.3
	vmovups		%ymm7, 96(%rsp,%rax,4)					#50.3
	addq		$32, %rax#49.2
	cmpq		$1024, %rax								#49.2
	jb			..B1.12						# Prob 99%	#49.2
...

Note: See TestApp.icc.itype.s5.1 for a complete assembler listing.

Case B - Assembler listing (option '-S' needs to be used when compiling TestApp.cpp)

...
..B1.12:								# Preds ..B1.12 ..B1.11
	lea			8(%rax), %edx							#50.13
	lea			16(%rax), %ecx							#50.13
	lea			24(%rax), %esi							#50.13
	vmovups		(%rsp,%rax,4), %ymm0					#50.13
	vaddps		4096(%rsp,%rax,4), %ymm0, %ymm1			#50.23
	vmovups		%ymm1, (%rsp,%rax,4)					#50.3
	addl		$32, %eax								#49.2
	vmovups		(%rsp,%rdx,4), %ymm2					#50.13
	cmpl		$1024, %eax								#49.2
	vaddps		4096(%rsp,%rdx,4), %ymm2, %ymm3			#50.23
	vmovups		%ymm3, (%rsp,%rdx,4)					#50.3
	vmovups		(%rsp,%rcx,4), %ymm4					#50.13
	vaddps		4096(%rsp,%rcx,4), %ymm4, %ymm5			#50.23
	vmovups		%ymm5, (%rsp,%rcx,4)					#50.3
	vmovups		(%rsp,%rsi,4), %ymm6					#50.13
	vaddps		4096(%rsp,%rsi,4), %ymm6, %ymm7			#50.23
	vmovups		%ymm7, (%rsp,%rsi,4)					#50.3
	jb			..B1.12						# Prob 99%	#49.2
...

Note: See TestApp.icc.utype.s5.2 for a complete assembler listing.

It is finally clear that the problem where the inner for-loop is not auto-vectorized (see the beginning of the forum posting1) is not related to how the variable 'i' is declared, and that something else is affecting the vectorization engine of the Intel C++ Compiler.

In order to pinpoint the root cause of the vectorization problem, a question needs to be asked: What compiler messages are generated when AV or EV techniques cannot be applied?

A small list of some “loop was not vectorized” messages of the Intel C++ Compiler when AV or EV techniques can't be applied is as follows:

...loop was not vectorized: not inner loop.
...loop was not vectorized: existence of vector dependence.
...loop was not vectorized: statement cannot be vectorized.
...loop was not vectorized: unsupported reduction.
...loop was not vectorized: unsupported loop structure.
...loop was not vectorized: vectorization possible but seems inefficient.
...loop was not vectorized: statement cannot be vectorized.
...loop was not vectorized: nonstandard loop is not a vectorization candidate.
...loop was not vectorized: dereference too complex.
...loop was not vectorized: statement cannot be vectorized.
...loop was not vectorized: conditional assignment to a scalar.
...warning #13379: loop was not vectorized with "simd".
...loop skipped: multiversioned.

One message deserves special attention:

...loop was not vectorized: unsupported loop structure.

It is seen in KmcTestAppV1.cpp6 that the inner for-loop has three parts:

Part 1 - Initialization of x, y, and z variables

...
float x = point[k].red - centroid[i].red;
float y = point[k].green - centroid[i].green;
float z = point[k].blue - centroid[i].blue;
...

Part 2 - Calculation of a distance between points x, y, and z

...
float distance = std::pow(x, 2) + std::pow(y, 2) + std::pow(z, 2);
...

Part 3 - Update of a 'best_distance' variable

...
if (distance < best_distance) {
best_distance = distance;
best_centroid = i;
}
...

Because all these parts are in the same inner for-loop, the Intel C++ Compiler cannot match its structure to a predefined vectorization template. In particular, Part 3, with its conditional if-statement, is the root cause of the vectorization problem.

A possible solution of the vectorization problem is to split the inner for-loop into three parts as follows:

...														// Calculate Distance
for( i = 0; i < nb_cluster; i += 1 )
{
	float x = point[k].red - centroid[i].red;
	float y = point[k].green - centroid[i].green;
	float z = point[k].blue - centroid[i].blue;			// Performance improvement: ( x * x ) is
	distance[i] = ( x * x ) + ( y * y ) + ( z * z );	// used instead of std::pow(x, 2), etc
}
														// Best Distance
for( i = 0; i < nb_cluster; i += 1 )
{
	best_distance = ( distance[i] < best_distance ) ? ( float )distance[i] : best_distance;
}
														// Best Centroid
for( i = 0; i < nb_cluster; i += 1 )
{
	cluster[k] = ( distance[i] < best_distance ) ? ( float )i : best_centroid;
}
...

The two most important modifications are related to the conditional if-statement in the for-loop. It was modified from the generic form:

...
if( A < B )
{
	D = val1
	C = val3
}
...

to a form that uses two conditional operators ( ? : ):

...
D = ( A < B ) ? ( val1 ) : ( val2 )
...
C = ( A < B ) ? ( val3 ) : ( val4 )
...

also known as ternary operators. Now a modern C/C++ compiler can match this C language construction to a pre-defined vectorization template.

Performance evaluation of unmodified and modified source code

A performance evaluation of both versions of the program for 1,000,000 points, 1,000 clusters, and 10 iterations was completed and the results are as follows:

...>KmcTestAppV1.exe
		Time: 111.50

Note: Original version6.

...>KmcTestAppV2.exe
	Time:  20.48

Note: Optimized and vectorized version7.

The optimized and vectorized version7 is about 5.5x faster than the original version of the program (see 1 or 6). Times are in seconds.

Conclusion

If a modern C/C++ compiler fails to vectorize a for-loop, it is important to evaluate the loop's complexity. In the case of the Intel C++ Compiler, the '-qopt-report=n' option needs to be used (with n greater than 3).

In most cases, a C/C++ compiler cannot vectorize the for-loop because it cannot match its structure to a predefined vectorization template. For example, in the case of the Intel C++ Compiler, the following vectorization messages would be reported:

...loop was not vectorized: unsupported reduction.

or

...loop was not vectorized: unsupported loop structure.

If this is the case, you need to modify the for-loop to simplify its structure, consider EV techniques using #pragma directives, like #pragma simd, or consider reimplementation of the required functionality using intrinsic functions.
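As an example of the last option, the conditional best-distance search from Part 3 could be re-expressed directly with AVX compare and blend intrinsics. The sketch below is illustrative only and is not taken from KmcTestAppV2.cpp; it assumes the distances have already been computed into an array.

#include <immintrin.h>

void best_distance_avx(const float* distance, int n, float& best_distance, int& best_centroid)
{
    __m256 vbest = _mm256_set1_ps(best_distance);          // per-lane best distance
    __m256 vidx  = _mm256_set1_ps(-1.0f);                  // per-lane best index (as float)
    __m256 vi    = _mm256_setr_ps(0, 1, 2, 3, 4, 5, 6, 7); // current indices
    const __m256 vstep = _mm256_set1_ps(8.0f);

    int i = 0;
    for (; i + 8 <= n; i += 8)
    {
        __m256 vd   = _mm256_loadu_ps(distance + i);
        __m256 mask = _mm256_cmp_ps(vd, vbest, _CMP_LT_OQ); // vd < vbest per lane?
        vbest = _mm256_blendv_ps(vbest, vd, mask);          // keep the smaller distance
        vidx  = _mm256_blendv_ps(vidx,  vi, mask);          // and remember its index
        vi    = _mm256_add_ps(vi, vstep);
    }

    float bd[8], bi[8];
    _mm256_storeu_ps(bd, vbest);
    _mm256_storeu_ps(bi, vidx);
    for (int lane = 0; lane < 8; ++lane)                    // combine the 8 lanes
        if (bd[lane] < best_distance) { best_distance = bd[lane]; best_centroid = (int)bi[lane]; }

    for (; i < n; ++i)                                      // scalar remainder
        if (distance[i] < best_distance) { best_distance = distance[i]; best_centroid = i; }
}

Indices are stored as floats here for brevity, which is exact for arrays of up to 2^24 elements.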

About the author

Sergey Kostrov is a highly experienced C/C++ software engineer and Intel® Black Belt. He is an expert in design and implementation of highly portable C/C++ software for embedded and desktop platforms, scientific algorithms and high-performance computing of big data sets.

Downloads

WhatToDoWhenAVFails.zip

List of all files (sources, assembly listings and vectorization reports):

KmcTestAppV1.cpp
KmcTestAppV2.cpp
TestApp.cpp
TestApp.icc.itype.rpt
TestApp.icc.utype.rpt
TestApp.icc.itype.s
TestApp.icc.utype.s

See also

1. Vectorization failed because of unsigned integer?

https://software.intel.com/en-us/forums/intel-c-compiler/topic/698664

2. Intel C++ Compiler forum on Intel DZ:

https://software.intel.com/en-us/forums/intel-c-compiler

3. Test program to demonstrate vectorization of a simple for-loop:

TestApp.cpp

4. Intel C++ Compiler vectorization reports for TestApp.cpp program:

TestApp.icc.itype.rpt

TestApp.icc.utype.rpt

5.1. Complete assembler listing for Case A of TestApp.cpp program:

TestApp.icc.itype.s

5.2. Complete assembler listing for Case B of TestApp.cpp program:

TestApp.icc.utype.s

6. Unmodified source codes (KmcTestAppV1.cpp original)

7. Modified source codes (KmcTestAppV2.cpp optimized and vectorized)


Implementing a masked SVML-like function explicitly in user defined way


The Intel Compiler provides SIMD intrinsics APIs for the Short Vector Math Library (SVML), and starting with the AVX-512 generation it also exposes masked versions of the SVML functions to users; for example, see zmmintrin.h:

extern __m512d __ICL_INTRINCC _mm512_mask_exp_pd(__m512d, __mmask8, __m512d);

Masked SIMD functions are handy, just like masked instructions: one can use the mask as a vector predicate to avoid computations on certain elements of a vector register, for example because of unwanted floating-point, memory, or performance side effects. The Intel Compiler auto-vectorizer has always been able to optimize a loop with a condition, such as the one below, into a masked SVML function call.

   for (int32_t i=0; i<LEN; i++)
      if (x[i] > 0.0)
        y[i] = exp(x[i]);
      else
        y[i] = 0.0;

AVX-512 (-xCORE-AVX512) code generation (disassembly) snippet for the above code:

 ..B1.24:                        # Preds ..B1.59 ..B1.23
                                # Execution count [8.48e-01]
        vpcmpud   $1, %ymm16, %ymm18, %k6                       #54.17
        vmovupd   (%rbx,%r12,8), %zmm2{%k6}{z}                  #55.9
        vcmppd    $6, %zmm17, %zmm2, %k5                        #55.16
        kandw     %k5, %k6, %k4                                 #55.16
        vmovupd   (%rbx,%r12,8), %zmm1{%k4}{z}                  #56.18
        vmovaps   %zmm17, %zmm0                                 #56.14
        kmovw     %k4, %k1                                      #56.14
        call      __svml_exp8_mask                              #56.14
                                # LOE rbx rsi r12 r13 r14 edi r15d ymm16 ymm18 ymm19 zmm0 zmm17 k4 k5 k6
..B1.59:                        # Preds ..B1.24
                                # Execution count [8.48e-01]
        vpaddd    %ymm19, %ymm18, %ymm18                        #54.17
        kandnw    %k6, %k5, %k1                                 #58.7
        vmovupd   %zmm0, (%r13,%r12,8){%k4}                     #56.7
        vmovupd   %zmm17, (%r13,%r12,8){%k1}                    #58.7
        addq      $8, %r12                                      #54.17
        cmpq      %rsi, %r12                                    #54.17
        jb        ..B1.24       # Prob 82%                      #54.17

Before AVX-512, the x86 vector instruction set did not provide architectural support for vector masks, but the desired behavior could easily be emulated. For example, here is the AVX2 (-xCORE-AVX2) disassembly for the above conditional code:

..B1.11:                        # Preds ..B1.14 ..B1.10
                                # Execution count [0.00e+00]
        vmovupd   (%rbx,%r14,8), %ymm0                          #55.9
        vcmpgtpd  %ymm10, %ymm0, %ymm11                         #55.16
        vptest    %ymm8, %ymm11                                 #55.16
        je        ..B1.13       # Prob 20%                      #55.16
                                # LOE rbx r12 r13 r14 r15d ymm0 ymm8 ymm9 ymm10 ymm11
..B1.12:                        # Preds ..B1.11
                                # Execution count [8.48e-01]
        vmovdqa   %ymm11, %ymm1                                 #56.14
        call      __svml_exp4_mask                              #56.14
                                # LOE rbx r12 r13 r14 r15d ymm0 ymm8 ymm9 ymm10 ymm11
..B1.39:                        # Preds ..B1.12
                                # Execution count [8.48e-01]
        vmovdqa   %ymm0, %ymm2                                  #56.14
        vmovupd   (%r12,%r14,8), %ymm0                          #56.7
        vblendvpd %ymm11, %ymm2, %ymm0, %ymm2                   #56.7
        jmp       ..B1.14       # Prob 100%                     #56.7
                                # LOE rbx r12 r13 r14 r15d ymm2 ymm8 ymm9 ymm10 ymm11
..B1.13:                        # Preds ..B1.11
                                # Execution count [0.00e+00]
        vmovupd   (%r12,%r14,8), %ymm2                          #58.7
                                # LOE rbx r12 r13 r14 r15d ymm2 ymm8 ymm9 ymm10 ymm11
..B1.14:                        # Preds ..B1.39 ..B1.13
                                # Execution count [8.48e-01]
        vxorpd    %ymm11, %ymm9, %ymm0                          #55.16
        vandnpd   %ymm2, %ymm0, %ymm1                           #58.7
        vmovupd   %ymm1, (%r12,%r14,8)                          #58.7
        addq      $4, %r14                                      #54.17
        cmpq      $8388608, %r14                                #54.17
        jb        ..B1.11       # Prob 82%                      #54.17

So users benefited from masked functions in SVML even before the architecture added support for vector masks. The recipe below addresses users who do not rely on the auto-vectorizer and choose to call SVML through intrinsics on pre-AVX-512 platforms. We are not exposing pre-AVX-512 masked APIs through intrinsics at this time; instead, we show how users can implement their own masked vector math functions if needed. Here's an example:

static __forceinline __m256d _mm256_mask_exp_pd(__m256d old_dst, __m256d mask, __m256d src)
{
    // Need to patch masked off inputs with good values
    // that do not cause side-effects like over/underflow/nans/denormals, etc.
    // 0.5 is a good value for EXP and most other functions.
    // acosh is not defined in 0.5, so it can rather use 2.0
    // 0.0 and 1.0 are often bad points, e.g. think log()
   __m256d patchValue = _mm256_set1_pd(0.5);
    __m256d patchedSrc = _mm256_blendv_pd(patchValue, src, mask);
    // compute SVML function on a full register
    // NOTE: one may choose to totally skip expensive call to exp
    // if the mask was all-zeros, this is left as an exercise to
    // the reader.
    __m256d res = _mm256_exp_pd(patchedSrc);
    // discard masked off results, restore values from old_dst
    old_dst = _mm256_blendv_pd(old_dst, res, mask);
    return old_dst;
}
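As the comment in the function hints, the relatively expensive SVML call can be skipped entirely when the mask has no lanes set. One way to do that, sketched here under the same conventions (the _v2 name is ours, not an SVML API), uses the AVX movemask instruction:

static __forceinline __m256d _mm256_mask_exp_pd_v2(__m256d old_dst, __m256d mask, __m256d src)
{
    // If no lane is selected there is nothing to compute: return the old values as-is.
    if (_mm256_movemask_pd(mask) == 0)
        return old_dst;
    __m256d patchedSrc = _mm256_blendv_pd(_mm256_set1_pd(0.5), src, mask);
    __m256d res        = _mm256_exp_pd(patchedSrc);
    return _mm256_blendv_pd(old_dst, res, mask);
}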

One would probably achieve better performance if the masked function were inlined, thus we use static __forceinline in the declaration. And here is how one would use this function if the original loop was written with intrinsics:

void vfoo(int n4, double * a, double *r)
{
    int i;
    for (i = 0; i < n4; i+=4)
    {
        __m256d src, dst, mask;
        src = _mm256_load_pd(a + i);


        // fill mask based on desired condition
        mask = _mm256_cmp_pd(src, _mm256_setzero_pd(), _CMP_GT_OQ);
        // do something useful for the else path
        dst = _mm256_setzero_pd();
        // compute masked exp that will preserve above useful values
        dst = _mm256_mask_exp_pd(dst, mask, src);


        _mm256_store_pd(r + i, dst);
    }
}

 Here’s the assembly listing for the above loop:

..B1.3:                         # Preds ..B1.8 ..B1.2
                                # Execution count [5.00e+00]
        vmovupd   (%rdi,%r12,8), %ymm1                          #25.30
        vcmpgt_oqpd %ymm9, %ymm1, %ymm10                        #28.16
        vblendvpd %ymm10, %ymm1, %ymm8, %ymm0                   #32.15
        call      __svml_exp4                                   #32.15
                                # LOE rbx rsi rdi r12 r13 r14 r15 ymm0 ymm8 ymm9 ymm10
..B1.8:                         # Preds ..B1.3
                                # Execution count [5.00e+00]
        vblendvpd %ymm10, %ymm0, %ymm9, %ymm1                   #32.15
        vmovupd   %ymm1, (%rsi,%r12,8)                          #34.25
        addq      $4, %r12                                      #22.25
        cmpq      %r13, %r12                                    #22.21
        jl        ..B1.3        # Prob 82%                      #22.21

Note: Similarly, we can develop our own masked versions of the intrinsics for other functions like log, sqrt, cos, and sin by a trivial change of “exp” to “log”, “cos”, “sin”, and so on in the sample code above. Mind the note on the patch value, though.
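To make the patch-value point concrete, here is the same wrapper adapted for log as a sketch (the name is ours): log must not see 0.0 or negative inputs in the masked-off lanes, so a strictly positive patch value such as 2.0 is used.

static __forceinline __m256d _mm256_mask_log_pd_sketch(__m256d old_dst, __m256d mask, __m256d src)
{
    __m256d patchValue = _mm256_set1_pd(2.0);               // safe input for log (and acosh)
    __m256d patchedSrc = _mm256_blendv_pd(patchValue, src, mask);
    __m256d res        = _mm256_log_pd(patchedSrc);         // SVML intrinsic (Intel Compiler)
    return _mm256_blendv_pd(old_dst, res, mask);
}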

 

Accessing Intel® Media Server Studio for Linux* codecs with FFmpeg


Intel hardware accelerated codecs are now accessible via FFmpeg* on Linux* systems where Intel® Media Server Studio is installed. The same OS and platform requirements apply, as the new additions to FFmpeg are simple wrappers to bridge the APIs.  For more information on Intel Media Server Studio requirements please see the Release Notes and Getting Started Guide.  

To get started:

  1. Install Intel Media Server Studio for Linux.  A free community edition is available from https://software.intel.com/en-us/intel-media-server-studio.
  2. Get the latest FFmpeg source from https://www.ffmpeg.org/download.html.  Intel Quick Sync Video support is available in FFmpeg 2.8 and newer for those who prefer a stable release.  Development is active so anyone needing latest updates and fixes should check the git repository tip.
  3. Configure FFmpeg with "--enable-libmfx --enable-nonfree", build, and install.  This requires copying include files to /opt/intel/mediasdk/include/mfx and adding a libmfx.pc file.  More details below.
  4. Transcode with an accelerated codec such as "-vcodec h264_qsv" on the ffmpeg command line.  Performance boost increases with resolution.
ffmpeg -i in.mp4  -vcodec h264_qsv out_qsv.mp4

 

Additional configure info:

The *_qsv codecs are enabled with "configure --enable-libmfx --enable-nonfree".  

A few additional steps are required for configure to work with Intel Media Server Studio codecs.

1. copy the mediasdk header files to include/mfx

# mkdir /opt/intel/mediasdk/include/mfx
# cp /opt/intel/mediasdk/include/*.h /opt/intel/mediasdk/include/mfx

2. Provide a libmfx.pc file. The same search rules apply as for other pkg-config configurations. A good place to start is the same directory as an already findable config such as libdrm.pc, but the PKG_CONFIG_PATH environment variable can be used to customize the search path.

example libmfx.pc file

prefix=/opt/intel/mediasdk
exec_prefix=${prefix}
libdir=${prefix}/lib/lin_x64
includedir=${prefix}/include

Name: libmfx
Description: Intel Media SDK
Version: 16.4.2
Libs: -L${libdir} -lmfx -lva -lstdc++ -ldl -lva-drm -ldrm
Cflags: -I${includedir} -I/usr/include/libdrm

 

Validation notes:

While this solution is known to work on a wide variety of systems, here are the most tested configurations:

Hardware

Intel® Xeon® Processors and Intel® Core™ Processors with support for Intel® Quick Sync Video

OS

CentOS* 7.2 (3.10.0-327.36.3.el7.x86_64)

Software

Gold CentOS install of Intel® Media Server Studio 2017 R2

 

Implementation of Classic Gram-Schmidt in a Reservoir Simulator


Introduction

Reservoir simulators typically use Krylov methods for solving the systems of linear equations that appear at every Newton iteration. One key step in most Krylov methods, executed at every linear iteration, involves orthogonalizing a given vector against a set of (already) orthogonal vectors. The linear solver used in the reservoir simulator we worked on implements the Orthomin method and utilizes the Modified Gram-Schmidt algorithm to execute this operation. This process has, for some simulations, a high contribution to the total computation time of the linear solver. Therefore, performance optimizations on this portion of the code may provide performance improvements for the overall simulation.

Figure 1 shows the percentage of total time spent in the linear solver and in the Gram-Schmidt kernel for several parallel simulations. The time spent in the Modified Gram-Schmidt method inside the linear solver makes it a hotspot in the simulator, ranging from 6% for Case 1 up to 51% for Case 2. This result demonstrates the importance of trying to optimize this computation kernel.


Figure 1. Gram-Schmidt and Linear solver total time percentage for parallel simulations. The number of MPI processes used in the run is indicated between parentheses.

This work describes an implementation of the Classic Gram-Schmidt method on the linear solver of a reservoir simulator under development. The implementation is based on the work developed by João Zanardi, a Graduate Student from Rio de Janeiro State University (UERJ) during his internship at Intel Labs (Santa Clara, USA). The proposed implementation of the Classic Gram-Schmidt method provides performance benefits by improving data locality in cache during the linear algebra operations and by reducing the number of collective MPI calls.

We achieved up to a 1.7x speedup in total simulation time with our optimized Classic Gram-Schmidt compared to the current implementation of the Modified Gram-Schmidt. However, the Classic Gram-Schmidt implementation does not outperform the current implementation in all cases. It appears that the portion of the vectors assigned to each thread needs to have a minimum size for the Classic Gram-Schmidt to be advantageous.

In Section 1 we describe the two versions of the Gram-Schmidt method, leaving for Section 2 a more detailed explanation of certain aspects of how the Classic version was implemented in our reservoir simulator. Section 3 presents the results of the tests we made, starting with a microbenchmark program specifically written to test the Gram-Schmidt kernel in isolation, then comparing the performance of the two approaches for the Linear Solver alone on matrices dumped from the reservoir simulator and finally testing the methods on actual simulation runs. Section 4 provides the conclusions of this study.

1. Description of the Classic and Modified Gram-Schmidt Methods

Listing 1 shows the current implementation of the Modified Gram-Schmidt method used in the reservoir simulator linear solver. Vector qj is orthogonalized with respect to vectors qi, i = 0 to j − 1, while vector dj is updated as a linear combination of vectors di, i = 0 to j − 1. For applications of interest in this work, qj and dj are vectors whose size can reach several million, while j, the number of vectors in the base, is a small number, typically a few dozen. For each pair of vectors qi and qj it is necessary to compute an inner product (line 9) through the brackets operator of the object ip. Then the scalar alpha is stored and vectors qj and dj are updated in lines 10 and 11.

V* q = &qv[0];                 // basis vectors q_0, ..., q_j
V* d = &dv[0];                 // associated vectors d_0, ..., d_j
V& qj = q[j];                  // vector being orthogonalized against the basis
V& dj = d[j];
for( int i = 0; i < j; i++ )
{
        const V& qi = q[i];
        const V& di = d[i];
        const double alpha = ip( qj, qi );   // inner product (one reduction per i when running in parallel)
        qj -= alpha * qi;                    // axpy update of qj, reused by the inner product of the next pass
        dj -= alpha * di;                    // axpy update of dj
}    

Listing 1. Source code for the Modified Gram-Schmidt method used in the current linear solver.

We can see in Listing 1 that the inner product and the updates (axpy operations) are BLAS level-one operations (vector operations) and, therefore, have low arithmetic intensity, with their performance limited by memory bandwidth. The Modified Gram-Schmidt method also has a loop-carried dependence, since every pass of the loop updates the qj vector that is used in the inner product of the next pass, which leaves no room for data reuse to improve data locality.

To overcome this limitation, one possibility is to use the Classic Gram-Schmidt method, as discussed in several references in the literature (e.g., [1] and [2]). The Classic GS method is also the default option in the well-known PETSc library, as well as in other solver libraries, and it is implemented in other reservoir simulators. Listing 2 presents the algorithm. The inner products needed to calculate the alpha values are performed before the qj vector is updated, removing the recurrence observed in the Modified version. All three loops in the algorithm can be recast as matrix-vector multiplications and, therefore, BLAS2 operations can be used, with better memory access. More specifically, let Q be the matrix whose columns are the vectors qi, i = 0 to j − 1; then the first loop of Listing 2 calculates

alpha_vec = Qᵀ qj,    (1.1)

where alpha_vec is the vector containing the alpha values alpha[i], i = 0 to j − 1, while the second loop of Listing 2 calculates

qj = qj − Q alpha_vec.    (1.2)

Similarly, if D denotes the matrix whose columns are the di vectors, the third loop of Listing 2 calculates

dj = dj − D alpha_vec.     (1.3)
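
For illustration only, here is a minimal sketch of how (1.1), (1.2) and (1.3) map onto standard BLAS2 calls. It assumes the j basis vectors are packed row-wise into contiguous j x n matrices Q_rows and D_rows and that a generic CBLAS implementation (for example, Intel® MKL) is available; this is not the simulator's actual data layout, which keeps the vectors as separate objects.

#include <cblas.h>   // any CBLAS implementation, e.g. Intel MKL

// n: local vector length; j: number of basis vectors already in the base.
// Q_rows, D_rows: row-major j x n matrices whose rows are the vectors q_i and d_i.
void classic_gs_blas2(int n, int j,
                      const double *Q_rows, const double *D_rows,
                      double *qj, double *dj, double *alpha)
{
    // (1.1) alpha = Qᵀ qj: with the q_i stored as rows, this is Q_rows * qj.
    cblas_dgemv(CblasRowMajor, CblasNoTrans, j, n,
                1.0, Q_rows, n, qj, 1, 0.0, alpha, 1);

    // In a parallel run, a single MPI_Allreduce on alpha would be placed here
    // (see Section 2) before the two update operations below.

    // (1.2) qj = qj − Q alpha: with row storage, this is qj − Q_rowsᵀ * alpha.
    cblas_dgemv(CblasRowMajor, CblasTrans, j, n,
                -1.0, Q_rows, n, alpha, 1, 1.0, qj, 1);

    // (1.3) dj = dj − D alpha.
    cblas_dgemv(CblasRowMajor, CblasTrans, j, n,
                -1.0, D_rows, n, alpha, 1, 1.0, dj, 1);
}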

In order to realize the potential performance benefits of interpreting the orthogonalization and update as the BLAS2 operations given by (1.1), (1.2) and (1.3), blocking has to be used. The objective of blocking techniques is to organize memory accesses: the idea is to load a small subset of a large dataset into cache and work on that block of data, so it does not have to be fetched from memory again. By reusing the data already in cache, we reduce the trips to main memory and, therefore, the pressure on memory bandwidth [6].

Our implementation of the blocking technique is shown in the next section, where the implementation in the simulator is detailed. It will also become clear that switching to BLAS2 results in less communication when running in parallel.

V* q = &qv[0];
V* d = &dv[0];
V& qj = q[j];
V& dj = d[j];
// First loop: all inner products are computed before qj is modified,
// corresponding to alpha_vec = Qᵀ qj in (1.1).
for( int i = 0; i < j; i++ )
{
   const V& qi = q[i];
   alpha[i] = ip( qj, qi );
}

// Second loop: qj = qj − Q alpha_vec, see (1.2).
for( int i = 0; i < j; i++ )
{
   const V& qi = q[i];
   qj -= alpha[i] * qi;
}

// Third loop: dj = dj − D alpha_vec, see (1.3).
for( int i = 0; i < j; i++ )
{
   const V& di = d[i];
   dj -= alpha[i] * di;
}  

Listing 2. Source code for the Classic Gram-Schmidt method without blocking.

Note that the Modified version could be rewritten so that the update of dj is recast as a matrix-vector calculation. In fact, the update of dj in Listing 1 is independent of the other calculations in the same loop and could be isolated in a separate loop, identical to the third loop of Listing 2. We tested this alternative implementation of the Modified version, but our preliminary results indicated that the speedup obtained is always very close to, or lower than, what can be obtained with the Classic version, so we decided not to pursue that line of investigation any further.

It is very important to note that the Classic and Modified versions are not numerically equivalent: it is a well-known fact that, in the presence of round-off errors, Classic GS is less stable than Modified GS [5], and is therefore more prone to loss of orthogonality in the resulting vector basis. To what extent this is an issue for its application to Krylov methods has been discussed in the literature [1, 2, 3, 4], but it does not seem to be particularly serious, considering that, as noted above, it is successfully applied in several solvers. This is corroborated by our experience with the implementation in the simulator, as shown in Section 3.

2. Implementation of Classic Gram-Schmidt on the Simulator

In order to exploit the data reuse made possible by recasting the calculations as BLAS2 operations, it is necessary to block the matrix-vector multiplications. Figure 2 depicts the blocking strategy for the matrix-vector multiplication (1.1): the contribution of each chunk of qj is calculated for all qi's, allowing reuse of qj and improving memory access. Listing 3 shows the corresponding code for calculating the alpha vector using this blocking strategy. The Intel Compiler provides a set of pragmas to ensure vectorization [7]. The reservoir simulator already made use of such pragmas in several computational kernels, and we also added pragmas here to ensure vectorization.


Figure 2. Representation of the blocking strategy used to improve data traffic for calculation (1.1).

const int chunkSize = 2048;
// alpha[] must be zero-initialized before this loop; size is assumed here to be a multiple of chunkSize.
for(int k = 0; k < size; k += chunkSize)
{
    for( int i = 0; i < j; i++ )
    {
        double v = 0.0;
#pragma simd reduction(+:v)
       for(int kk = 0; kk < chunkSize; kk++)
       {
            v += qj[k + kk] * Q[i][k + kk];   // partial dot product of qi and qj on the current chunk
       }
       alpha[i] += v;                         // accumulate the chunk contribution into alpha[i]
    }
}  

Listing 3. Source code for local computation of the alpha factor using a matrix-vector blocked operation.

Careful examination of Figure 2 and Listing 3 reveals that the implementation of the Classic method has another opportunity to improve performance, in addition to blocking. The alpha factors are calculated without using the inner product operator, which reduces the number of MPI communication calls: only a single MPI_Allreduce call over the entire alpha vector is required. Note that in the Modified version an inner product has to be done for each alpha because of the loop recurrence and, consequently, a reduction is triggered for every iteration of the loop in Listing 1.
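
As a rough sketch of this difference in communication pattern (not the simulator's actual code), assume alpha[] already holds the local partial inner products computed as in Listing 3; the Classic method then needs one collective per orthogonalization, while the Modified method needs one per basis vector:

#include <mpi.h>

// Classic: a single collective over the whole alpha vector per orthogonalization.
void reduce_alpha_classic(double *alpha, int j, MPI_Comm comm)
{
    MPI_Allreduce(MPI_IN_PLACE, alpha, j, MPI_DOUBLE, MPI_SUM, comm);
}

// Modified: the loop recurrence forces one collective per coefficient, because
// alpha[i] is needed to update qj before the inner product of iteration i+1.
double reduce_alpha_modified_step(double local_dot, MPI_Comm comm)
{
    double alpha_i = 0.0;
    MPI_Allreduce(&local_dot, &alpha_i, 1, MPI_DOUBLE, MPI_SUM, comm);
    return alpha_i;
}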

Similarly, blocking is also required to reduce data traffic for calculations (1.2) and (1.3). Figure 3 and Listing 4 are the counterparts of Figure 2 and Listing 3 for the update operations, showing how the blocking strategy is implemented in that case. The update of each chunk of qj is accumulated over all qi's, allowing reuse of qj and improving memory access.


Figure 3. Representation of the blocking strategy used to improve data traffic for calculation (1.2). The same strategy can also be applied to (1.3).

const int chunkSize = 2048;
for(int k = 0; k < size; k += chunkSize)
{
    double temp[chunkSize] = { 0.0 };   // chunk of Q * alpha; must start at zero before accumulation
    for( int i = begin; i < j; i++ )    // begin is the index of the first basis vector (0 in Listing 3)
    {
#pragma simd vectorlength(chunkSize)
        for(int kk = 0; kk < chunkSize; kk++)
        {
            temp[kk] += alpha[i] * Q[i][k + kk];
        }
    }
#pragma simd vectorlength(chunkSize)
    for(int kk = 0; kk < chunkSize; kk++)
    {
        qj[k + kk] -= temp[kk];          // subtract the accumulated contribution from the chunk of qj
    }
}

Listing 4. Source code for the update of the vectors qj and dj using a modified matrix-vector blocked operation.

The optimizations in this work target the hardware we use in production (Intel® Xeon® processors based on the Sandy Bridge microarchitecture). This kernel may show better performance on newer Intel® processors (e.g., Broadwell), which support FMA (Fused Multiply-Add) instructions and have improved vectorization capabilities.

3. Tests

3.1 System Configuration

We performed all experiments on a workstation with two Intel® Xeon® processors and 128 GB of DDR3-1600 memory. Table 1 shows the configuration of the processors used. All experiments were executed on Red Hat* Linux* 6 with kernel 2.6.32, using Intel® MPI Library 5.0.1 and Intel® C++ Compiler XE 14.0.4 with the compilation flags -O3, -fp-model fast, and -vec.

Table 1. Description of the hardware used in the experiments.

3.2 Performance Comparison with a Microbenchmark

To evaluate how performance depends on the size of the vectors and the number of processes, we developed a microbenchmark that initializes a set of vectors in parallel and executes only the orthogonalization process, so that we can look directly at the Gram-Schmidt performance without any influence from the rest of the linear solver.
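
For reference, a condensed single-process sketch of this kind of driver is shown below; the actual microbenchmark runs under MPI, uses the simulator's vector classes, and times both the Modified and the blocked Classic kernels over a range of vector sizes.

#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

// Naive Modified Gram-Schmidt sweep: orthogonalize q[j] against q[0..j-1].
static void modified_gs(std::vector<std::vector<double> > &q, int j)
{
    std::vector<double> &qj = q[j];
    for (int i = 0; i < j; i++) {
        double alpha = 0.0;
        for (size_t k = 0; k < qj.size(); k++) alpha += qj[k] * q[i][k];
        for (size_t k = 0; k < qj.size(); k++) qj[k] -= alpha * q[i][k];
    }
}

int main()
{
    const int nvec = 32;            // number of basis vectors, as used for Table 2
    const size_t n = 256 * 1024;    // vector size (swept over a range in the real benchmark)

    // Fill the basis with pseudo-random data.
    std::mt19937_64 gen(42);
    std::uniform_real_distribution<double> dist(-1.0, 1.0);
    std::vector<std::vector<double> > q(nvec, std::vector<double>(n));
    for (size_t i = 0; i < q.size(); i++)
        for (size_t k = 0; k < n; k++) q[i][k] = dist(gen);

    // Time only the orthogonalization of the last vector against the others.
    auto t0 = std::chrono::steady_clock::now();
    modified_gs(q, nvec - 1);
    auto t1 = std::chrono::steady_clock::now();
    std::printf("Modified GS sweep: %.4f s\n",
                std::chrono::duration<double>(t1 - t0).count());
    return 0;
}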

Table 2. Performance improvement of the Classic Gram-Schmidt over the Modified version in the microbenchmark for several vector sizes and numbers of processes. Greater than one means Classic is faster than Modified, the higher the better.

Table 2 shows the performance improvement of the Classic over the Modified version of the Gram-Schmidt method. The table rows vary the qj and dj vector sizes, and the columns vary the number of MPI processes used for parallel processing. Vector sizes were generated based on the number of cells of a regular N x N x N grid. The total number of vectors was 32 and the chunk size was 2048. The value for the chunk size was obtained from a series of experiments with the microbenchmark program. Figure 4 shows the results of one such experiment, performed in an early phase of this study, showing the performance benefit of increasing the chunk size up to 1024; further experiments determined that 2048 was slightly better. This is the chunk size used in all results presented in the next sections. The ideal chunk size is expected to depend on machine features such as cache size.


Figure 4. Ratio between the times of the Modified and Classic Gram-Schmidt implementations as a function of chunk size. Greater than one means Classic is faster than Modified, the higher the better. Vector size is 256 x 1024 and the number of vectors is 32.

From Table 2 one can notice that Classic can be more than twice as fast as Modified for the largest vectors. On the other hand, for vectors of intermediate size, Modified is faster when eight or more processes are used. For all numbers of processes, Classic loses performance relative to Modified for intermediate-size vectors. So far, we have not reached a conclusion about the reason for this performance loss; apparently, the part of the vector assigned to each process needs to have a minimum size for Classic Gram-Schmidt to be advantageous.

3.3 Microbenchmark Profiling with Intel® VTune™ Amplifier and Intel® Trace Analyzer

To evaluate the reduction in communication time between the current implementation of the Modified method and our blocked Classic method, we used the microbenchmark together with the Intel Trace Analyzer tool on Linux. To do so, we recompiled the microbenchmark, linking it against the appropriate Intel Trace Analyzer libraries.

We ran the microbenchmark varying the size of the vectors from 4,096 to 2,097,152 in the same execution, with 16 MPI processes. Figure 5 shows the percentage of time spent in MPI communication relative to the total time for the current implementation of the Modified method, and Figure 6 shows the same information for our implementation of the Classic method. Comparing the two figures, we notice a reduction of the percentage of time spent in MPI from 15.3% to 1.7%, which corresponds to a reduction in communication time on the order of 15x for the Classic method.

Figure 5. Ratio of all MPI calls to the rest of the code in the application for the Modified method.

 

Figure 6. Ratio of all MPI calls to the rest of the code in the application for the Classic method.

In addition, we used the Intel VTune Amplifier tool to check the vectorization of the Modified method and of our implementation of the Classic method. For this, we executed the microbenchmark with a single process and with vectors of size 2,097,152, using the General Exploration analysis type in Intel VTune Amplifier.

Figure 7 and Figure 8 show views from Intel VTune Amplifier focusing on the code generated for the vector update operations (axpy) in the Modified and Classic methods, respectively. These figures show that, in both versions of the method, the compiler was able to generate vectorized code.


Figure 7. Code generated for the vector update section, shown in Intel VTune Amplifier, for the Modified method.


Figure 8. Code generated for the vector update section, shown in Intel VTune Amplifier, for the Classic method.

Figure 9 and Figure 10 show the initial section of the Bottom-Up view of the General Exploration analysis for the Modified and Classic methods, respectively. In Figure 9, the code section responsible for updating the vectors (axpy) shows a high incidence of LLC hits. According to the VTune documentation: "The LLC (last level cache) is the last, and highest latency, level in the memory hierarchy before the main memory (DRAM). While LLC hits are met much faster than DRAM hits, they can still incur a significant performance penalty. This metric also includes consistency penalties for shared data." Figure 10 shows that our Classic implementation does not present a high incidence of LLC hits, indicating that the blocking scheme was effective in keeping the data in the L1 and L2 cache levels.


Figure 9. Initial section of the General Exploration Analysis Bottom-Up view from VTune Amplifier executing the Modified method.


Figure 10. Initial section of the General Exploration Analysis Bottom-Up view from VTune Amplifier executing the Classic method.

3.4 Experiments with Extracted Matrices

In order to understand how the two methods compare within the overall linear solver, we used two matrices extracted from simulations and compared the linear solver performance of the Classic and Modified versions with 1, 2, 4, 8 and 16 processes. In the first case (Case 3) the vector size is 2,479,544 and in the second (Case 2) it is 4,385,381. The number of iterations obtained by both methods was the same in all cases.


Figure 11. Time ratio of the Classic method over the Modified for matrices extracted from Case 3. Greater than one means Classic was faster, the higher the better.

Figure 11 and Figure 12 show the performance improvements for the two cases and the different numbers of processes. In all configurations, the Classic version yields substantial gains in the Gram-Schmidt kernel, ranging from 1.5x to 2.5x when compared to the Modified one. The corresponding benefit for the overall linear solution is also significant, ranging from 1.2x to 1.5x for Case 2 and from 1.1x to 1.3x for Case 3.


Figure 12. Time ratio of the Classic method over the Modified for matrices extracted from Case 2. Greater than one means Classic was faster, the higher the better.

For the Case 3 matrix with 16 MPI processes, we used the Intel® VTune™ Amplifier XE 2015 tool in Hotspot analysis mode to evaluate the communication reduction. Figure 13 and Figure 14 show the profiles of the Modified and Classic methods, respectively. The MPI_Allreduce calls within the inner product method of the Modified version take 36 seconds of CPU time (6% of the orthogonalization time), while the Classic profile shows 6.28 seconds spent on MPI_Allreduce calls (2% of the orthogonalization time), a large reduction in communication time. However, with this small number of processes communication does not represent a major portion of the orthogonalization time and, therefore, it does not have a big impact on overall performance. This is likely to change when increasing the number of processes and running in cluster environments, where communication takes place over the network.


Figure 13. Modified Gram-Schmidt benchmark profile from the Intel® VTune™ Amplifier XE 2015 for the Ostra matrix with 16 MPI processes.


Figure 14. Classic Gram-Schmidt benchmark profile from the Intel VTune Amplifier XE 2015 for the Ostra matrix with 16 MPI processes.

3.5 Performance Comparison with Full Simulations

In order to assess the impact of replacing the Modified Gram-Schmidt method with the Classic one on the performance of actual simulation runs, seven test cases were executed. Table 3 contains the main features of the cases. Note that the vector sizes for most of the cases are in the intermediate range where Table 2 shows the least benefit for the Classic when compared with the Modified, the exceptions being Case 2, which is beyond the largest sizes in Table 2, and Case 1, which is in the range of the smallest sizes.

Table 3. Main features for the seven test cases.

The number of time steps, cuts, and linear and nonlinear iterations taken for each case with the two Gram-Schmidt implementations is shown in Table 4. In five out of the seven cases, the convergence behavior of the Modified and Classic versions is very close. For Case 2 and Case 4, Classic performs clearly better, particularly in Case 2, where the number of linear iterations decreases by 16%.

Table 4. Numerical data for the seven test cases. Inc is the relative increment from Modified to Classic (negative when Classic took fewer time steps, cuts, and iterations).

Figure 15 shows the performance gains provided by the Classic Gram-Schmidt method for the three serial runs. The performance of both methods is very close for Case 5 and Case 4, while there is around a 10% improvement in the Gram-Schmidt kernel for Case 1. These results seem to be in line with the findings from the microbenchmark program, as the Case 1 vector size is in the small range where Table 2 shows benefits for the Classic version. The improvement in Gram-Schmidt does not extend to the Linear Solver, whose performance is almost the same with both methods.


Figure 15. Time ratio of Classic Gram-Schmidt over Modified for the serial runs.

Figure 16 is the counterpart of Figure 15 for the parallel runs. Case 7 shows a slight improvement with the Classic, while Case 4 and Case 1 show a degradation in performance, particularly the latter, where the time for the Modified is almost 20% smaller. The impact of those differences on Linear Solver and Asset time is minor. For Case 2, there is a substantial improvement in the performance of the orthogonalization, with Classic being 2.8x faster. For this case, the benefits in Gram-Schmidt translate into a noteworthy improvement in both Linear Solver and Asset time, making the full simulation almost 1.7x faster. This is due both to the fact that the vectors are very large and, therefore, Classic is expected to outperform Modified by a large amount (see Table 2), and to the improvement in linear and nonlinear iterations resulting from changing the orthogonalization algorithm. It is also important to note that Gram-Schmidt contributes around half of the total simulation time for Case 2 (see Figure 1), which makes any benefit in this kernel result in a much clearer improvement in total simulation time.


Figure 16. Time ratio of Classic Gram-Schmidt over Modified for the parallel runs.

4. Conclusions

The Classic Gram-Schmidt method was implemented in the reservoir simulator linear solver, using a blocking strategy to achieve better memory access. From the tests we performed, the following conclusions can be drawn:

  • The new implementation provides a substantial performance improvement over the current one, based on the Modified Gram-Schmidt algorithm, for large problems where the orthogonalization step takes a considerable share of total simulation time. Typically, this will be the case when the number of linear iterations to solve each system is big, making the Krylov basis large. Outside of this class of problems, the performance is either close to or slightly worse than the current implementation.
  • Using a performance analysis tool, we observed a substantial reduction in communication time when using the Classic version. For the hardware configuration we used, this does not translate into a large benefit for the overall procedure, as the tests were executed on a workstation and parallelization was limited to at most 16 processes. It is expected that, for parallel runs in cluster environments with a large number of processes, the reduction in communication cost will become important to ensure good performance and parallel scalability.
  • Despite being known to be less stable than the Modified version, the Classic version did not cause any noticeable degradation in convergence of the Krylov method in our test cases. In fact, convergence for the Classic was even better than for the Modified in two out of the seven actual simulation models we ran.
  • The blocking strategy adopted in the implementation depends on a parameter, the chunk size, which is hardware dependent. This study does not allow us to say to what extent tuning this parameter for a specific machine is crucial to obtain adequate levels of performance, as a single machine configuration was used in all tests.
  • Experiments with a microbenchmark program focused on the Gram-Schmidt kernel showed a decrease in the performance of the Classic relative to the Modified for intermediate vector sizes. The results obtained for full simulations seem to corroborate those findings. At the moment, we have not found a consistent explanation for this behavior, although it seems to be related to the amount of work assigned to each thread or process. It is also still unclear whether the performance degradation of Classic Gram-Schmidt (relative to the Modified) can be avoided by tuning the implementation.

References

  1. Frank, J. & Vuik, C., Parallel Implementation of a Multiblock Method with Approximate Subdomain Solution, Applied Numerical Mathematics, 30, pages 403-423, 1999.
  2. Frayssé, V., Giraud, L., Gratton, S. & Langou, J., Algorithm 842: A Set of GMRES Routines for Real and Complex Arithmetics on High Performance Computers, ACM Transactions on Mathematical Software, 31, pages 228-238, 2005.
  3. Giraud, L., Langou, J. & Rozloznik, M., The Loss of Orthogonality in the Gram-Schmidt Orthogonalization Process, Computers and Mathematics with Applications, 50, pages 1069-1075, 2005.
  4. Greenbaum, A., Rozloznik, M. & Strakos, Z., Numerical Behaviour of the Modified Gram-Schmidt GMRES Implementation, BIT, 37, pages 706-719, 1997.
  5. Golub, G.H. & Van Loan, C.F., Matrix Computations, Third Edition, The Johns Hopkins University Press, Baltimore and London, 1996.
  6. Cache Blocking Techniques, https://software.intel.com/en-us/articles/cache-blocking-techniques accessed on 16/12/2016.
  7. Improve Performance with Vectorization, https://software.intel.com/en-us/articles/improve-performance-with-vectorization.

What’s new with Intel® Cluster Checker 2017 update 2


Intel released update 2 to Intel® Cluster Checker 2017. This comprehensive diagnostic tool is distributed with Intel® Parallel Studio XE 2017 Cluster Edition Update 2 and supports Intel® Xeon® processors, Intel® Xeon Phi™ processors, Intel® Omni-Path, and Intel® Enterprise Edition for Lustre* software. Here is a quick summary of what’s new:

Additional support for Intel® Xeon Phi™ Product Family x200 processors, covering configuration parameters that may have a noticeable impact on system stability or performance:

  • Identifying the cluster and memory mode of Intel® Xeon Phi™ processors through a backup method using the ‘numactl’ command if the hwloc-dump-data service is missing or not enabled.
  • Taskset support for DGEMM/STREAM can be configured via the XML configuration file.
  • Check for the tickless kernel boot options ‘nohz_full’, ‘rcu_nocbs’, and ‘isolcpus’:
  • Uniformity check
  • Cross-reference check to ensure ‘nohz_full’ is a subset of ‘rcu_nocbs’ and ‘isolcpus’ (if set)
  • Desired kernel boot option checks can be specified via the XML configuration file.

Additional support for Intel® Omni-Path Architecture:

  • Two Intel® Omni-Path Host Fabric Interface (Intel® OP HFI) adapters are now supported on one node.

Core improvements and bug fixes

Tencent Ultra-Cold Storage System Optimization with Intel® ISA-L – A Case Study


Download PDF  [823KB]

In this era of data explosion, the cumulative amount of obsolete data is becoming extremely large. For storage cost considerations, many independent Internet service providers are developing their own cold storage system. This paper discusses one such collaboration between Tencent and Intel to optimize the ultra-cold storage project in Tencent File System* (TFS). The XOR functions in Intel® Intelligent Storage Acceleration Library (Intel® ISA-L) successfully help TFS meet the performance requirement.

Introduction to Tencent and TFS

Tencent is one of the largest Internet companies in the world, whose services include social networks, web portals, e-commerce, and multiplayer online games. Its offerings in China include the well-known instant messenger Tencent QQ*, one of the largest web portals, QQ.com, and the mobile chat service WeChat. These offerings have helped bolster Tencent's continuous expansion.

Behind these offerings, TFS serves as the core file service for many of Tencent's businesses. With hundreds of millions of users, TFS faces performance and capacity challenges. Since the Tencent Data Center is mainly based on Intel® architecture, Tencent has been working with Intel to optimize TFS performance.

Challenge of ultra-cold storage project in TFS

Unlike for Online Systems, processor procurement for TFS’s ultra-cold storage project is not a budget priority, so existing processors recycled from outdated systems are used. These processors do not deliver powerful compute performance, and calculation speed is easily the biggest bottleneck of the system.

Previously, in order to save disk capacity while maintaining high reliability, the project adopted the erasure code 9+3 solution (see Figure 1).

Figure 1: Original Erasure Code 9+3 solution.

Tencent has reconsidered erasure coding for several reasons:

  • Much of the data stored in this ultra-cold storage system is outdated pictures, so occasional data corruption is acceptable.
  • The redundancy rate of erasure code 9+3 may be too much of a luxury for this kind of data.
  • Even when optimized with the Intel ISA-L erasure code, it is still a heavy workload for the outdated, low-performance servers assigned to the ultra-cold storage system.

In order to reduce the redundancy rate and relieve the performance bottleneck, a solution that applies XOR operations on 10 stripes to generate 2 parities was adopted (see Figure 2). The first parity is computed by horizontal processing and the second parity by vertical processing.

Figure 2: New XOR 10+2 solution

This new solution still had one obvious hotspot: the XOR operation itself limits system performance. Despite the simpler data-protection algorithm, this cost-optimized solution could not meet the performance requirements of Tencent Online Systems.

Tencent was looking for an effective and convenient way to reduce the computational cost of the XOR operation: an efficient, optimized XOR implementation that would alleviate the performance bottleneck and meet the design requirements of the ultra-cold storage solution.

About Intel® Intelligent Storage Acceleration Library

Intel ISA-L is a collection of optimized, low-level functions used primarily in storage applications. The general library for Intel ISA-L contains an expanded set of functions used for erasure code, data protection and integrity, compression, hashing, and encryption. It is written primarily in hand-coded ASM but with bindings for the C/C++ programming languages. Intel ISA-L contains highly optimized algorithms behind an API, automatically choosing an appropriate binary implementation for the detected processor architecture, allowing ISA-L to run on past, current, and next-generation CPUs without interface changes.

The library includes an XOR generation function, gen_xor_avx, as part of the Intel ISA-L data-protection functions. Intel ISA-L is heavily performance-optimized using Intel’s Single Instruction Multiple Data (SIMD) instructions.

Collaboration between Tencent and Intel

Tencent and Intel worked together, using Intel ISA-L, to optimize the ultra-cold storage project in TFS.

The XOR function originally used in the ultra-cold storage project was written in C and based on Galois-field arithmetic, named galois_xor. The first optimization proposal was to replace galois_xor directly with the Intel ISA-L gen_xor_avx function. The test results from this single change showed a performance gain of about 50 percent.

After analyzing the parity-generation method of the ultra-cold storage system, we suggested calling gen_xor_avx in pointer-array form. This second optimization improved efficiency further by avoiding unnecessary memory operations.
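
A minimal sketch of the pointer-array call style is shown below. It assumes the generic xor_gen entry point declared in ISA-L's raid.h (the library dispatches to the best implementation for the CPU; the article refers to the AVX variant as gen_xor_avx); the buffer length and alignment here are illustrative only, and the exact requirements of the optimized variants should be checked against the ISA-L documentation.

#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <isa-l.h>   // umbrella header for Intel ISA-L (the XOR interface lives in raid.h)

int main()
{
    const int kSources = 10;           // 10 data stripes, as in the TFS 10+2 layout
    const int kVects   = kSources + 1; // sources plus one XOR parity destination
    const int kLen     = 64 * 1024;    // stripe length in bytes (a multiple of 32 here)

    void *array[kVects];
    for (int i = 0; i < kVects; i++) {
        // Aligned buffers are recommended for the SIMD-optimized code paths.
        if (posix_memalign(&array[i], 32, kLen) != 0)
            return 1;
        memset(array[i], i < kSources ? i + 1 : 0, kLen);   // dummy stripe data, zeroed parity
    }

    // One call over the whole pointer array: the last pointer receives the XOR parity.
    int ret = xor_gen(kVects, kLen, array);
    std::printf("xor_gen returned %d\n", ret);

    for (int i = 0; i < kVects; i++)
        free(array[i]);
    return ret;
}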

Results

The performance optimization scheme based on the Intel ISA-L XOR function helped solve the practical problems encountered in building the ultra-cold storage system. The test results from Tencent showed a 250-percent performance improvement compared with the previous method.

Method      | Galois xor | Intel ISA-L gen_xor_avx (non-array form) | Intel ISA-L gen_xor_avx (array form)
Performance | 800 MB/s   | 1.2 GB/s                                 | 2 GB/s

This distinct performance gain successfully met the requirements from Online Systems. Even better, since Intel ISA-L is open-source (BSD-licensed) code, the Tencent team obtained this large improvement in system performance at no additional cost.

Acknowledgement

As a result of this successful collaboration with Intel, Sands Zhou, principal of the Tencent ultra-cold storage system, said: “TFS ultra-cold storage project, based on entire cabinet program, CPU became a performance bottleneck. In the meantime, the project got strong supports from Intel based on ISA-L XOR program. Thanks again, wish more collaborations with Intel in the following work.”
