
Demo: Software Defined Visualization Using Intel® Xeon Phi™ Processor


In this demo we showcase the use of the Intel® Xeon Phi™ processor to render a 3D visualization of a tumor in a human brain. Visualizations like this can help advance medical research by enabling precise detection and removal of abnormalities such as brain tumors.

More information

The tool used for visualization is Paraview, with OSPRay as the rendering library.

Pre-requisites

Intel® Xeon Phi™ processor system with CentOS 7.2 Linux* (internet enabled)

Open a terminal in your work area directory and follow the steps below (a consolidated listing of the command-line steps appears after the list):

  1. Create directory for the demo

    mkdir Intel_brain_demo

  2. Change directory

    cd Intel_brain_demo

  3. Create two directories under this

    mkdir paraview
    mkdir ospray

  4. Access the files from Dropbox:

    https://www.dropbox.com/s/wj0qp1clxv5xssv/SC_2016_BrainDemo.tar.gz?dl=0

  5. Move the Paraview and OSPRay tar files into the respective directories you created in the steps above

    mv SC_2016_BrainDemo/paraview_sc_demo.tgz paraview/
    mv SC_2016_BrainDemo/ospray.tgz ospray/

  6. Untar each of the *.tgz archives in its respective directory

    tar -xzvf *.tgz

  7. Point the library path at the OSPRay libraries

    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<…../Intel_brain_demo/ospray/install/lib64>

  8. Optional step: set the Qt graphics system environment variable, only if Paraview doesn’t load normally

    export QT_GRAPHICSSYSTEM=gtk

  9. Change directory to paraview/install where the binaries are

    cd paraview/install

  10. Run Paraview

    ./bin/paraview

  11. Once Paraview loads

    Select File/Load State

  12. Then load the brain_demo.psvm state file from the SC_2016_BrainDemo archive that you downloaded in the step above

  13. Paraview will then ask you to load the VTK files. Click the “...” button to select the appropriate *tumor1.vtk, *tumor2.vtk, and *Tumor1.vtk files, in that order, on your local machine, and then click OK.

  14. An Output Messages pop-up window will appear with warnings. Ignore the warnings and click Close, and you should see something like the following:

  15. Now you can go to File/Save State and save this state. From then on you can load this state file directly, skipping the previous step of locating the data files.
  16. Then, on the Properties tab on the left side, enable OSPRay for every view (RenderView1/2/3) by selecting each view and clicking Enable OSPRay

  17. Once you do that you should see the images for all three views look as below:

  18. You can also rotate the views and see how they look.
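For reference, the command-line portion of steps 1 through 10 can be collected into a single listing. This is only a sketch; adjust the paths to your own work area and keep the library path from step 7 pointing at your actual OSPRay install directory.

mkdir Intel_brain_demo
cd Intel_brain_demo
mkdir paraview ospray
# download and extract SC_2016_BrainDemo.tar.gz from the Dropbox link here
mv SC_2016_BrainDemo/paraview_sc_demo.tgz paraview/
mv SC_2016_BrainDemo/ospray.tgz ospray/
(cd paraview && tar -xzvf paraview_sc_demo.tgz)
(cd ospray && tar -xzvf ospray.tgz)
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<…../Intel_brain_demo/ospray/install/lib64>
cd paraview/install
./bin/paraview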

A few issues and how to resolve them

Missing OpenGL: install Mesa for OpenGL

sudo yum -y install mesa-libGL
sudo yum -y install mesa-libGL-devel

libQtGui.so.4 error: install the qt-x11 package

yum -y install qt-x11

Acknowledgements

Special thanks to Carson Brownlee and James Jeffers from Intel Corporation for all their contributions and support. Without their efforts, it wouldn’t have been possible to get this demo running.

References

  1. http://www.intel.com/content/www/us/en/processors/xeon/xeon-phi-detail.html
  2. https://software.intel.com/en-us/blogs/Intel-Parallel-Studio-XE-2016
  3. https://gitlab.kitware.com/carson/paraview
  4. https://gitlab.kitware.com/carson/vtk
  5. http://www.ospray.org
  6. http://www.ospray.org/getting_ospray.html
  7. http://dap.xeonphi.com
  8. https://ispc.github.io/downloads.html
  9. https://www.threadingbuildingblocks.org
  10. https://en.wikipedia.org/wiki/Software_rendering

How to Install DevStack* in Linux*


Introduction

OpenStack* is a set of software tools for building and managing cloud computing platforms for public and private clouds.

According to the OpenStack web site, “DevStack is a series of extensible scripts used to quickly bring up a complete OpenStack environment based on the latest versions of everything from git master. It is used interactively as a development environment and as the basis for much of the OpenStack project’s functional testing.”

It is tricky to install DevStack* using the instructions found on the OpenStack web site, since they assume that you have certain knowledge about networking, setting user access rights, and so on.

This article shows step-by-step how to install DevStack in CentOS* 7.

Installation

The following steps show you how to install DevStack:

  1. Install CentOS 7.
    1. Go to the CentOS web site and download the latest 64-bit version of CentOS.
    2. The best way to install CentOS is to use the entire hard drive. During the installation, make sure to:
      1. Set up the network in advance; otherwise, you will have to set it up manually later on in the process, in order to download DevStack.
      2. Ensure that the user has administrative rights by checking the Make this user administrator option, as seen in Figure 1.

         

        Figure 1

        Figure 1: Set user account.

      3. Do not select any additional packages; just use the default installation option, which is the minimum installation recommended by the instructions at devstack.org.
    Now we are ready to download DevStack.
  2. Download DevStack.
    1. Use the following command to download:
      git clone https://git.openstack.org/openstack-dev/devstack

      You will see the error command not found as shown in Figure 2:

      Figure 2

      Figure 2: Error when executing the git command.

      The command git is not found because it is not included in the OS when installing the OS with the minimum installation option.

    2. Use the following command to install the git package:
      sudo yum install git
    3. After git is installed correctly, rerun this command to download DevStack:

      git clone https://git.openstack.org/openstack-dev/devstack

      If the DevStack download is successful, you will see a message like the one in Figure 3:

      Figure 3

      Figure 3: Downloading DevStack.

      After the download, the directory devstack will be created in the current directory.

  3. Change the directory to devstack. Use the following command to change the directory:
    cd devstack
  4. Create a local configuration file named local.conf. Create the file local.conf in the directory devstack with the following content:
    	[[local|localrc]]
    	ADMIN_PASSWORD=secret
    	DATABASE_PASSWORD=$ADMIN_PASSWORD
    	RABBIT_PASSWORD=$ADMIN_PASSWORD
    	SERVICE_PASSWORD=$ADMIN_PASSWORD
  5. Install DevStack. Use the following command to begin the installation:
    ./stack.sh

    During the installation, the script prompts the user to enter a password. Alternatively, to continue without a password, just hit the Enter key, as shown in Figure 4:

    Figure 4

    Figure 4: The script prompts, giving users the option to enter a password.

  6. Check that DevStack is installed correctly. If it is, you should see a message like the one shown in Figure 5.

     

    Figure 5

    Figure 5: Messages shown after DevStack finishes installing.

    After DevStack is installed, check to make sure that services like Keystone*, Glance*, Nova*, and Cinder* exist by typing the name of those services at the prompt (a slightly fuller check is sketched after this list). For example, type:

    nova
    If the system returns the error command not found, then we know that DevStack was not installed successfully.
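A slightly fuller check is sketched below. This is only an illustration: it assumes the default DevStack layout, that stack.sh generated an openrc file in the devstack directory, and that the usual OpenStack client commands are on the path (exact commands vary by release).

cd ~/devstack
source openrc admin admin     # load the admin credentials created by stack.sh
nova list                     # should print an (empty) table of servers, not "command not found"
glance image-list             # should list the images DevStack registered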

Conclusion

This article describes a straightforward and easy way to install DevStack in CentOS.

MariaDB* Performance with Intel® Xeon® Processor E5 v4 Family


MariaDB* increases database throughput by 51% and cuts response times by 15%

As businesses become more and more data intensive, the cost per transaction becomes an important metric. There are two ways to lower cost per transaction. The first is to lower the cost of data infrastructure; and the second is to increase hardware efficiency. With the MariaDB Enterprise and Intel® Xeon® processor-based solution, organizations can do both by using an enterprise subscription to reduce database costs and multi-core processors to increase performance with existing servers.

“By adopting the Intel® Xeon® Processor E5-2600 v4, our users and customers will not only get faster response times, they’ll also reduce total cost of ownership, getting even more value out of MariaDB.”
— Bruno Šimić, Solutions Engineer, MariaDB
 

Using MPI-3 Shared Memory in Intel® Xeon Phi™ Processors


This whitepaper introduces the MPI-3 shared memory feature, the corresponding APIs, and a sample program to illustrate the use of MPI-3 shared memory in the Intel® Xeon Phi™ processor.

Introduction to MPI-3 Shared Memory

MPI-3 shared memory is a feature introduced in version 3.0 of the message passing interface (MPI) standard. It is implemented in Intel® MPI Library version 5.0.2 and beyond. MPI-3 shared memory allows multiple MPI processes to allocate and have access to the shared memory in a compute node. For applications that require multiple MPI processes to exchange huge local data, this feature reduces the memory footprint and can improve performance significantly.

In the MPI standard, each MPI process has its own address space. With MPI-3 shared memory, each MPI process exposes its own memory to other processes. The following figure illustrates the concept of shared memory: Each MPI process allocates and maintains its own local memory, and exposes a portion of its memory to the shared memory region. All processes then can have access to the shared memory region. Using the shared memory feature, users can reduce the data exchange among the processes.

Figure 1

By default, the memory created by an MPI process is private. It is best to use MPI-3 shared memory when only memory needs to be shared and all other resources remain private. As each process has access to the shared memory region, users need to pay attention to process synchronization when using shared memory.

Sample Code

In this section, sample code is provided to illustrate the use of MPI-3 shared memory.

A total of eight MPI processes are created on the node. Each process maintains a long array of 32 million elements. For each element j in the array, the process updates the element value based on its current value and the values of element j in the corresponding arrays of the two nearest processes, and this procedure is applied across the whole array. The following pseudo-code shows the computation when the program runs with eight MPI processes for 64 iterations:

Repeat the following procedure 64 times:
    for each MPI process n from 0 to 7:
        for each element j in the array An:
            An[j] ← 0.5*An[j] + 0.25*Aprevious[j] + 0.25*Anext[j]

where An is the long array belonging to process n, and An[j] is the value of element j in the array belonging to process n. In this program, since each process exposes its local memory, all processes can access all arrays, although each process needs only the two neighbor arrays (for example, process 0 needs data from processes 1 and 7, process 1 needs data from processes 0 and 2, and so on).

Figure 2

Besides the basic APIs used for MPI programming, the following MPI-3 shared memory APIs are introduced in this example:

  • MPI_Comm_split_type: Used to create a new communicator where all processes share a common property. In this case, we pass MPI_COMM_TYPE_SHARED as an argument in order to create a shared memory from a parent communicator such as MPI_COMM_WORLD, and decompose the communicator into a shared memory communicator shmcomm.
  • MPI_Win_allocate_shared: Used to create a shared memory that is accessible by all processes in the shared memory communicator. Each process exposes its local memory to all other processes, and the size of the local memory allocated by each process can be different. By default, the total shared memory is allocated contiguously. The user can pass an info hint “alloc_shared_noncontig” to specify that the shared memory does not have to be contiguous, which can cause performance improvement, depending on the underlying hardware architecture. 
  • MPI_Win_free: Used to release the memory.
  • MPI_Win_shared_query: Used to query the address of the shared memory of an MPI process.
  • MPI_Win_lock_all and MPI_Win_unlock_all: Used to start an access epoch to all processes in the window. Only shared epochs are needed. The calling process can access the shared memory on all processes.
  • MPI_Win_sync: Used to ensure the completion of copying the local memory to the shared memory.
  • MPI_Barrier: Used to block the caller process on the node until all processes reach a barrier. The barrier synchronization API works across all processes.
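The calls above fit together roughly as in the following minimal sketch. This is for illustration only (it is not the downloadable sample): error checking is omitted, the array size is arbitrary, and each rank simply fills its own segment and then reads its neighbor’s first element directly through shared memory.

/* Minimal MPI-3 shared memory sketch; compile with an MPI compiler, e.g. mpiicc. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Split MPI_COMM_WORLD into a communicator of ranks that can share memory */
    MPI_Comm shmcomm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &shmcomm);

    int rank, size;
    MPI_Comm_rank(shmcomm, &rank);
    MPI_Comm_size(shmcomm, &size);

    /* Each rank exposes a local array segment through one shared-memory window */
    const MPI_Aint nelems = 1024;                 /* illustrative size only */
    double *my_base;
    MPI_Win win;
    MPI_Win_allocate_shared(nelems * sizeof(double), sizeof(double),
                            MPI_INFO_NULL, shmcomm, &my_base, &win);

    /* Query the base address of the next rank's segment */
    int next = (rank + 1) % size;
    MPI_Aint next_size;
    int next_disp;
    double *next_base;
    MPI_Win_shared_query(win, next, &next_size, &next_disp, &next_base);

    /* Start a shared access epoch covering all ranks in the window */
    MPI_Win_lock_all(MPI_MODE_NOCHECK, win);

    for (MPI_Aint j = 0; j < nelems; j++)
        my_base[j] = rank;                        /* write the local segment */

    MPI_Win_sync(win);                            /* make local writes visible */
    MPI_Barrier(shmcomm);                         /* wait until every rank has written */

    double neighbor_value = next_base[0];         /* read directly from the neighbor */
    (void)neighbor_value;

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    MPI_Comm_free(&shmcomm);
    MPI_Finalize();
    return 0;
}

Once the window is created, every rank in shmcomm can read its neighbors’ segments through the pointers returned by MPI_Win_shared_query instead of exchanging messages, which is exactly how the sample program shares its arrays.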

Basic Performance Tuning for Intel® Xeon Phi™ Processor

This test is run on an Intel Xeon Phi processor 7250 at 1.40 GHz with 68 cores, installed with Red Hat Enterprise Linux* 7.2 and Intel® Xeon Phi™ Processor Software 1.5.1, and Intel® Parallel Studio 2017 update 2. By default, the Intel compiler will try to vectorize the code, and each MPI process has a single thread of execution. OpenMP* pragma is added at loop level for later use. To compile the code, run the following command line to generate the binary mpishared.out:

$ mpiicc mpishared.c -qopenmp -o mpishared.out
$ mpirun -n 8 ./mpishared.out
Elapsed time in msec: 5699 (after 64 iterations)
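With eight single-threaded ranks, this is the baseline. The per-element update each rank performs is a simple stencil-like loop; with the loop-level OpenMP pragma mentioned above it looks roughly like the following sketch (variable names are illustrative, not taken from the downloadable sample):

/* Illustrative only: 'len' is the array length; 'my', 'prev' and 'next' point to
   this rank's segment and its two neighbors' segments in the shared window. */
#pragma omp parallel for
for (long j = 0; j < len; j++)
    my[j] = 0.5 * my[j] + 0.25 * prev[j] + 0.25 * next[j];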

To explore the thread parallelism, run four threads per core, and re-compile with -xMIC-AVX512 to take advantage of Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instructions:

$ mpiicc mpishared.c -qopenmp -xMIC-AVX512 -o mpishared.out
$ export OMP_NUM_THREADS=4
$ mpirun -n 8 ./mpishared.out
Elapsed time in msec: 4535 (after 64 iterations)

As the MCDRAM in this system is currently configured as flat, the Intel Xeon Phi processor appears as two NUMA nodes. Node 0 contains all the CPUs and the on-platform DDR4 memory, while node 1 contains the on-package MCDRAM memory:

$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271
node 0 size: 98200 MB
node 0 free: 92775 MB
node 1 cpus:
node 1 size: 16384 MB
node 1 free: 15925 MB
node distances:
node   0   1
  0:  10  31
  1:  31  10

To allocate the memory in the MCDRAM (node 1), pass the argument -m 1 to the command numactl as follows:

$ numactl -m 1 mpirun -n 8 ./mpishared.out
Elapsed time in msec: 3070 (after 64 iterations)

This simple optimization technique greatly improves performance.

Summary

This whitepaper introduced the MPI-3 shared memory feature, followed by sample code that uses the MPI-3 shared memory APIs. The pseudo-code explained what the program does, along with an explanation of the shared memory APIs. The program ran on an Intel Xeon Phi processor and was further optimized with simple techniques.

Reference

  1. MPI Forum, MPI 3.0
  2. Message Passing Interface Forum, MPI: A Message-Passing Interface Standard Version 3.0
  3. The MIT Press, Using Advanced MPI
  4. James Reinders, Jim Jeffers, Publisher: Morgan Kaufmann, Chapter 16 - MPI-3 Shared Memory Programming Introduction, High Performance Parallelism Pearls Volume Two

Appendix

The code of the sample MPI program is available for download.

Intel® ISA-L: Erasure Code and Recovery


Introduction

Download Code Sample

The Intel® Intelligent Storage Acceleration Library (Intel® ISA-L) includes support for erasure coding. This article describes how the use of erasure coding can benefit a storage application and explains the Intel ISA-L implementation, which uses Reed-Solomon error correction (RS EC). The attached sample application demonstrates the efficiency of RS EC on Intel architecture CPUs, showing that with Intel ISA-L erasure coding, error correction can be performed on the CPU fast enough not to induce a bottleneck or add substantial latency to storage devices. To download the sample application, click on the button at the top of this article.

Historically, RAID storage arrays protected against data loss through mirroring (RAID 1) or calculating parity (RAID 5), which required additional storage and resulted in slower throughput. To address the problems with RAID 5, I/O Host Bus Adapters (HBAs) were given dedicated circuitry to offload parity computation instead of performing the computations on the CPU. However, with the advent of low-latency NVMe storage devices connected directly to the PCIe bus, this offload of parity calculations can no longer be done in the data path, which eliminates any advantage of offloading parity calculations. In addition, the continuing march of Moore’s Law has reduced the computational cost of parity calculations to fractions of a CPU cycle per byte, meaning a single Xeon core can compute tens of gigabytes per second.

The largest change to storage architecture has been driven by the emergence of datacenter-scale storage solutions. As applications have scaled from server to rack to datacenter scale, so have the demands on the storage infrastructure that supports them. Without access to data, a datacenter is nothing more than an elaborate space heater, so storage availability and fault tolerance have become the most crucial challenges in the data center. For this reason, datacenter-scale systems are designed for continuous operation and at least 99.999% availability (about five minutes of downtime per year). To achieve this goal, two techniques are pervasive: multi-replication (typically triple replication) and Reed-Solomon erasure codes (RS EC), both of which ensure there are always copies of the data available despite single or dual failures.

Triple replication has the advantage of simplicity. It is computationally easy to maintain three copies of data, and this technique is used in scalable file systems like Hadoop HDFS, but it does have a drawback: it requires that a system’s raw storage capacity is (at least) 3X the design capacity. By contrast, RS EC has historically been computationally intensive, but is far more flexible and space efficient, enabling raw capacity to be only 1.5X the design capacity; for example, storing 1 PB of usable data requires 3 PB of raw capacity under triple replication, but only about 1.5 PB under an RS EC layout that adds one parity block for every two data blocks. For storage applications requiring large data sets, this difference in the underlying availability algorithm can translate into huge differences in capital and operating expenditures.

In this article, we demonstrate that RS EC efficiency on Intel architecture CPUs has improved to the point that it is a viable choice even for directly PCIe-connected, high-throughput, low-latency media like NVMe devices. The Intel ISA-L implementation of RS will be used to demonstrate that it is now possible to keep up with current storage hardware throughput requirements by doing RS error correction in software instead of offloading it to dedicated hardware or using drive-based parity computations.

Hardware and Software Configuration

CPU and Chipset

Intel® Xeon® processor E5-2699 v4, 2.2 GHz

  • Number of cores per chip: 22 (only used single core)
  • Number of sockets: 2
  • Chipset: Intel® C610 Series Chipset, QS (B-1 step)
  • System bus: 9.6 GT/s Intel® QuickPath Interconnect
  • Intel® Hyper-Threading Technology off
  • Intel SpeedStep® technology enabled
  • Intel® Turbo Boost Technology disabled

Platform

Platform: Intel® Server System R2000WT product family (code-named Wildcat Pass)

  • BIOS: GRRFSDP1.86B.0271.R00.1510301446 ME:V03.01.03.0018.0 BMC:1.33.8932
  • DIMM slots: 24
  • Power supply: 1x1100W

Memory

Memory size: 256 GB (16X16 GB) DDR4 2133P

Brand/model: Micron – MTA36ASF2G72PZ2GATESIG

Storage

Brand and model: 1 TB Western Digital (WD1002FAEX)

Intel® SSD Data Center P3700 Series (SSDPEDMD400G4)

Operating System

Ubuntu* 16.04 LTS (Xenial Xerus)

Linux* kernel 4.4.0-21-generic

Note: Depending on the platform capability, Intel ISA-L can run on various Intel® processor families. Improvements are obtained by speeding up the computations through the use of the processor’s SIMD instruction sets.

Why Use Intel® Intelligent Storage Acceleration Library?

Intel ISA-L has the capability to perform RS EC in software instead of offloading to dedicated hardware or using drive-based parity computations. This capability is well suited for high-throughput storage applications. This article includes a sample application that simulates an RS EC scenario using a set of memory buffers instead of discrete I/O devices. Memory buffers are used for two reasons: first, to put the focus on the software overhead rather than the media latency involved in I/O; and second, to allow the example to be run on any system configuration.

The sample application creates up to 24 different buffers in memory to store data; a data set of up to 256 MB is distributed across these buffers. The Intel ISA-L RS algorithm implementation is used to calculate error correction codes that are stored and distributed with the data across the buffers. Storage times, i.e., the time to calculate the RS EC, are measured and logged. To simulate failure, up to two memory buffers are made inaccessible, and the program demonstrates that the data can be completely and correctly recovered from the remaining buffers. Recovery time is also measured and logged. The approach is compared to published results from parity-based (RAID 5) systems.

Prerequisites

Intel ISA-L supports Linux and Microsoft Windows*. A full list of prerequisite packages can be found here.

Building the sample application (for Linux):

  1. Install the dependencies:
    • a c++14 compliant c++ compiler
    • cmake >= 3.1
    • git
    • autogen
    • autoconf
    • automake
    • yasm and/or nasm
    • libtool
    • boost's "Program Options" library and headers
    >sudo apt-get update
    >sudo apt-get install gcc g++ make cmake git autogen autoconf automake yasm nasm libtool libboost-all-dev
  2. Also needed is the latest version of Intel ISA-L. The get_libs.bash script can be used to get it. The script will download the two libraries from their official GitHub* repositories, build them, and then install them in the `./libs/usr` directory.
    >`bash ./libs/get_libs.bash`   
  3. Build from the `ex2` directory:
    • `mkdir <build-dir>`
    • `cd <build-dir>`
    • `cmake -DCMAKE_BUILD_TYPE=Release $OLDPWD`
    • `make`

Getting Started with the Sample Application

The sample application contains the following:

Figure 1

This example goes through the following steps as a high-level workflow, focusing only on the routines that demonstrate full data recovery using the Intel® ISA-L RS implementation.

Setup

  1. In the “main.cpp” file, the program parses the command line and displays the options that are going to be performed.
     int main(int argc, char* argv[])
    {
         options options = options::parse(argc, argv);

    Parsing the command-line options

  2. In the options.cpp file, the program parses the command line arguments using `options::parse()`.
     

    Performing storage and recovery

  3. In the “main.cpp” file, the program performs as many storage/recovery cycles (up to 10,000) as it can in a while loop, until the aggregate storage or recovery time exceeds 1 second.
  4. Then it calls encode_data() to create the error correction codes for the data.
    // Create the error correction codes, and store them alongside the data.
    std::vector<uint8_t> encode_data(int m, int k, uint8_t** sources, int len, prealloc_encode prealloc)
    {
        // Generate encode matrix
        gf_gen_cauchy1_matrix(prealloc.encode_matrix.data(), m, k);
    
        // Generates the expanded tables needed for fast encoding
        ec_init_tables(k, m - k, &prealloc.encode_matrix[k * k], prealloc.table.data());
    
        // Actually generate the error correction codes
        ec_encode_data(
            len, k, m - k, prealloc.table.data(), (uint8_t**)sources, (uint8_t**)&sources[k]);
    
        return prealloc.encode_matrix;
    }
    
  5. Out of the 24 buffers, the iteration() function randomly picks up to 2 buffers to “lose”; these become the erroneous_data.
    // We randomly choose up to 2 buffers to "lose", and return the indexes of those buffers.
    // Note that we can lose both part of the data or part of the error correction codes indifferently.
    std::vector<int> generate_errors(int m, int errors_count)
    {
        random_number_generator<int> idx_generator(0, m - 1);
        std::vector<int>             errors{idx_generator.get(), 0};
        if (errors_count == 2)
        {
            do
            {
                errors[1] = idx_generator.get();
            } while (errors[1] == errors[0]);
            std::sort(errors.begin(), errors.end());
        }
        return errors;
    }
    // We arrange a new array of buffers that skip the ones we "lost"
    uint8_t** create_erroneous_data(int k, uint8_t** source_data, std::vector<int> errors)
    {
        uint8_t** erroneous_data;
        erroneous_data = (uint8_t**)malloc(k * sizeof(uint8_t*));
    
        for (int i = 0, r = 0; i < k; ++i, ++r)
        {
            while (std::find(errors.cbegin(), errors.cend(), r) != errors.cend())
                ++r;
            for (int j = 0; j < k; j++)
            {
                erroneous_data[i] = source_data[r];
            }
        }
        return erroneous_data;
    }
  6. Finally, the iteration() function calls the recover_data() function to recover the data and then compare with the original data.
    // Recover the contents of the "lost" buffers
    // - m              : the total number of buffers, containing both the source data and the error
    //                    correction codes
    // - k              : the number of buffers that contain the source data
    // - erroneous_data : the original buffers without the ones we "lost"
    // - errors         : the indexes of the buffers we "lost"
    // - encode_matrix  : the matrix used to generate the error correction codes
    // - len            : the length (in bytes) of each buffer
    // Return the recovered "lost" buffers
    uint8_t** recover_data(
        int                         m,
        int                         k,
        uint8_t**                   erroneous_data,
        const std::vector<int>&     errors,
        const std::vector<uint8_t>& encode_matrix,
        int                         len,
        prealloc_recover            prealloc)
    {
        for (int i = 0, r = 0; i < k; ++i, ++r)
        {
            while (std::find(errors.cbegin(), errors.cend(), r) != errors.cend())
                ++r;
            for (int j = 0; j < k; j++)
            {
                prealloc.errors_matrix[k * i + j] = encode_matrix[k * r + j];
            }
        }
    
        gf_invert_matrix(prealloc.errors_matrix.data(), prealloc.invert_matrix.data(), k);
    
        for (int e = 0; e < errors.size(); ++e)
        {
            int idx = errors[e];
            if (idx < k) // We lost one of the buffers containing the data
            {
                for (int j = 0; j < k; j++)
                {
                    prealloc.decode_matrix[k * e + j] = prealloc.invert_matrix[k * idx + j];
                }
            }
        else // We lost one of the buffers containing the error correction codes
            {
                for (int i = 0; i < k; i++)
                {
                    uint8_t s = 0;
                    for (int j = 0; j < k; j++)
                        s ^= gf_mul(prealloc.invert_matrix[j * k + i], encode_matrix[k * idx + j]);
                    prealloc.decode_matrix[k * e + i] = s;
                }
            }
        }
    
        ec_init_tables(k, m - k, prealloc.decode_matrix.data(), prealloc.table.data());
        ec_encode_data(len, k, (m - k), prealloc.table.data(), erroneous_data, prealloc.decoding);
    
        return prealloc.decoding;
    }
    
  7. Once the recovery is completed, the program displays the average storage and recovery time.

Execute the sample application

In this example, the program runs using 24 buffers, and up to 2 of those buffers can be lost. Therefore, 2 buffers are used to hold the error correction codes, and the other 22 hold the data.

Run

From the ex2 directory:

cd <build-dir>/ex2

./ex2 --help

Usage

Usage: ./ex2 [--help] [--data-size <size>] [--buffer-count <n>] [--lost-buffers <n>]:
  --help                     display this message
  --data-size size (=256MiB) the amount of data to be distributed among the buffers (size <=256 MiB)
  --buffer-count n (=24) the number of buffers to be distributed data across (n <= 24)
  --lost-buffers n (=2) the number of buffers that will get inaccessible (n <= 2)

Sizes must be formatted the following way:

  • a number
  • zero or more white space
  • zero or one prefix (k, M)
  • zero or one i character (note: 1kiB == 1024B, 1kB == 1000B)
  • the character B

For example,

  • 1KB
  • "128 MiB"
  • 42B
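Putting the options together, a smaller run might look like the following (the values shown are arbitrary examples, not recommendations):

./ex2 --data-size "128 MiB" --buffer-count 20 --lost-buffers 2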

Running the example

>./ex2

Figure 2

From the output, the program displays the storage and recovery times obtained by applying the Intel ISA-L RS method. With the Intel® Xeon® processor-based server, the RS computation can be done efficiently and quickly, as demonstrated in the example runs.

Notes:  2x Intel Xeon processor E5-2699 v4 (HT off), Intel Speed Step technology enabled, Intel Turbo Boost Technology disabled, 16x16GB DDR4 2133 MT/s, 1 DIMM per channel, Ubuntu 16.04 LTS, Linux kernel 4.4.0-21-generic, 1 TB Western Digital (WD1002FAEX), 1 Intel SSD P3700 Series (SSDPEDMD400G4),  22x per CPU socket.  Performance measured by the written sample application in this article.

Conclusion

As demonstrated in this quick tutorial, developers have a how-to guide for incorporating the Intel ISA-L RS EC feature into their storage applications, along with full build instructions. The example shows how to prepare the data and recover it, which can help developers adopt the technology. Intel ISA-L provides a library that storage developers can quickly adopt for their specific applications running on Intel® architecture.

Other Useful Links

Algorithms Built for Speed - A Brief Introduction to Intel® Intelligent Storage Acceleration Library (ISA-L) - BrightTALK webinar

Additional Intel ISA-L Code Samples

Authors

Thai Le is a software engineer who focuses on cloud computing and performance computing analysis at Intel.

Steven Briscoe is an application engineer focusing on cloud computing within the Software Services Group at Intel Corporation (UK).

Jonathan Stern is an applications engineer and solutions architect who works to support storage acceleration software at Intel.

 

Fast memcpy Using SPDK and the Intel® I/OAT DMA Engine


Download Code Sample

Introduction

Memcpy is an important and often-used function of the standard C library. Its purpose is to move data in memory from one virtual or physical address to another, consuming CPU cycles to perform the data movement. Intel® I/O Acceleration Technology (Intel® I/OAT) allows offloading of data movement to dedicated hardware within the platform, reclaiming CPU cycles that would otherwise be spent on tasks like memcpy. This article describes a usage of the Storage Performance Development Kit (SPDK) with the Intel® I/OAT DMA engine, which is implemented through the Intel® QuickData Technology driver. The SPDK provides an interface to Intel I/OAT hardware directly from user space, which greatly reduces the software overhead involved in using the hardware. Intel I/OAT can take advantage of PCI Express non-transparent bridging, which allows movement of memory blocks between two different PCIe-connected motherboards, thus effectively allowing the movement of data between two different computers at nearly the same speed as moving data within the memory of a single computer. We include a sample application that contrasts the performance of memcpy and the equivalent Intel I/OAT functionality when moving a series of different-sized chunks of data in memory. The benchmarks are logged and the results compared. To download the sample application, click on the button at the top of this article.

Figure 1
Figure 1: The SPDK is an end-to-end reference architecture for storage.

Hardware and Software Configuration

See below for information about the hardware and software configuration of the system used to create and validate the technical content of this article and sample application.

CPU and Chipset

Intel® Xeon® processor E5-2699 v4, 2.2 GHz

  • Number of cores per chip: 22 (only used single core)
  • Number of sockets: 2
  • Chipset: Intel® C610 series, QS (B-1 step)
  • System bus: 9.6 GT/s Intel® QuickPath Interconnect
  • Intel® Hyper-Threading Technology off
  • Intel SpeedStep® technology enabled
  • Intel® Turbo Boost Technology disabled

Platform

Platform: Intel® Server System R2000WT product family

  • BIOS: GRRFSDP1.86B.0271.R00.1510301446 ME:V03.01.03.0018.0 BMC:1.33.8932
  • DIMM slots: 24
  • Power supply: 1x1100W

Memory

Memory size: 256 GB (16X16 GB) DDR4 2133P

Brand/model: Micron* – MTA36ASF2G72PZ2GATESIG

Storage

Brand and model: 1 TB Western Digital* (WD1002FAEX)

Plus Intel® SSD Data Center P3700 Series (SSDPEDMD400G4)

Operating System

Ubuntu* 16.04 LTS (Xenial Xerus*)

Linux* kernel 4.4.0-21-generic

Note: SPDK can run on various Intel® processor families with platform support for Intel I/OAT.

Why Use Intel® Storage Performance Development Kit?

Solid-state storage media is becoming a part of the storage infrastructure in the data center. Current-generation flash storage enjoys significant advantages in performance, power consumption, and rack density over rotational media. These advantages will continue to grow as next-generation media enter the marketplace.

The SPDK is all about efficiency and scalable performance. The development kit reduces both processing and development overhead, and ensures the software layer is optimized to take advantage of the performance potential of next-generation storage media, like Non-Volatile Memory Express* (NVMe) devices. The SPDK team has open-sourced the user mode NVMe driver and Intel I/OAT DMA engine to the community under a permissive BSD license. The code is available directly through the SPDK GitHub* page.

Prerequisites

SPDK runs on Linux with a number of prerequisite libraries installed, which are listed below.

Building the sample application (for Linux):

  1. Install the dependencies:
    • a c++14 compliant c++ compiler
    • cmake >= 3.1
    • git
    • make
    • CUnit library
    • AIO library
    • OpenSSL library
    >sudo apt-get update
    >sudo apt-get install gcc g++ make cmake git libcunit1-dev libaio-dev libssl-dev
  2. Get the latest version of the SPDK, using the get_spdk.bash script included with the sample application. The script will download the SPDK from the official GitHub* repository, build it, and then install it in the “./spdk” directory.
    >bash ./libs/get_spdk.bash
  3. Build from the “ex4” directory:
    • mkdir <build-dir>
    • cd <build-dir>
    • cmake -DCMAKE_BUILD_TYPE=Release $OLDPWD
    • make
  4. Getting the system ready for SPDK:
    The following command needs to be run once before running any SPDK application. It should be run as a privileged user.
    • (cd ./spdk && sudo scripts/setup.sh)

Getting Started with the Sample Application

The sample application contains the following:

Figure 2

Figure 2: List of files that are part of the sample application

This example goes through the following steps to show the usage of the Intel I/OAT driver:

Program Setup

  1. In the “main.cpp” file, the program probes the system for Intel I/OAT devices and calls a callback function for each device found. If the probe callback returns true, then SPDK will go ahead and attach the Intel I/OAT device and, on success, call the attach callback function.
    spdk_ioat_chan* init_spdk()
    {
        char* args[] = {(char*)("")};
        rte_eal_init(1, args);
    
        spdk_ioat_chan* chan = nullptr;
    
        // Probe available devices.
        // - 'probe_cb' is called for each device found.
        // - 'attach_cb' is then called if 'probe_cb' returns true
        auto ret = spdk_ioat_probe((void*)(&chan), probe_cb, attach_cb);
    
        if (ret != 0)
            return nullptr;
    
        return chan;
    }
  2. Then, the main program defines each test and sets up the buffers.
      // Each test is defined by 2 number {a, b}.
        // We're copying chunks of 2^a bytes inside a 2^b bytes buffer.
        const std::vector<std::pair<uint8_t, uint8_t>> tests = {
            {1, 5},   {3, 5},   {3, 9},   {5, 9},   {7, 9},   {7, 13},  {9, 13},
            {11, 13}, {11, 17}, {13, 17}, {15, 17}, {15, 21}, {17, 21}, {19, 21},
            {19, 25}, {21, 25}, {23, 25}, {23, 29}, {25, 29}, {27, 29}};
  3. After setting up the buffers, the main runs through three different memcpy routines in a for-loop. The first routine is using the regular memcpy from the standard C library.
    benchmark seq_memcpy(uint64_t chunk_size, uint64_t buffer_size)
    {
        using namespace std::chrono_literals;
    
        // Allocate the whole buffer, 8 bytes-aligned
        uint64_t* buffer64 = new uint64_t[buffer_size / sizeof(uint64_t)];
        uint8_t*  buffer8  = reinterpret_cast<uint8_t*>(buffer64);
    
        // Trick the optimizer into not optimizing any copies away
        utils::escape(buffer8);
    
        // Fill the buffer with random data
        random_number_generator<uint64_t> rnd;
        for (uint i = 0 / sizeof(uint64_t); i < buffer_size / sizeof(uint64_t); ++i)
        {
            buffer64[i] = rnd.get();
        }
    
        uint64_t                 iterations = 0;
        std::chrono::nanoseconds time       = 0s;
    
        uint64_t nb_chunks = buffer_size / chunk_size;
    
        random_number_generator<uint64_t> chunk_idx_gen(0, nb_chunks / 2 - 1);
    
        do
        {
            // pick a random even-indexed buffer as source
            uint64_t src_chunk_idx = chunk_idx_gen.get() * 2;
            // pick a random odd-index buffer as destination
            uint64_t dst_chunk_idx = chunk_idx_gen.get() * 2 + 1;
    
            auto start_cpy = std::chrono::steady_clock::now();
    
            // performs the copy
            memcpy(
                buffer8 + (dst_chunk_idx * chunk_size),
                buffer8 + (src_chunk_idx * chunk_size),
                chunk_size);
    
            time += (std::chrono::steady_clock::now() - start_cpy);
    
            // Trick the optimizer into not optimizing any copies away
            utils::clobber();
    
            ++iterations;
        } while (time < 1s);
    
        delete[] buffer64;
    
        return {chunk_size, buffer_size, time, iterations};
    }
  4. The second routine uses the Intel I/OAT driver to perform the sequential memory copy using the Intel I/OAT channels.
    benchmark seq_spdk(uint64_t chunk_size, uint64_t buffer_size, spdk_ioat_chan* chan)
    {
        using namespace std::chrono_literals;
    
        // Allocate the whole buffer, 8 bytes-aligned
        uint64_t* buffer64 = (uint64_t*)spdk_malloc(buffer_size, sizeof(uint64_t), nullptr);
        uint8_t*  buffer8  = reinterpret_cast<uint8_t*>(buffer64);
    
        // Trick the optimizer into not optimizing any copies away
        utils::escape(buffer8);
    
        // Fill the buffer with random data
        random_number_generator<uint64_t> rnd;
        for (uint i = 0 / sizeof(uint64_t); i < buffer_size / sizeof(uint64_t); ++i)
        {
            buffer64[i] = rnd.get();
        }
    
        uint64_t                 iterations = 0;
        std::chrono::nanoseconds time       = 0s;
    
        uint64_t nb_chunks = buffer_size / chunk_size;
    
        random_number_generator<uint64_t> chunk_idx_gen(0, nb_chunks / 2 - 1);
    
        bool copy_done = false;
    
        do
        {
            // pick a random even-indexed buffer as source
            uint64_t src_chunk_idx = chunk_idx_gen.get() * 2;
            // pick a random odd-index buffer as destination
            uint64_t dst_chunk_idx = chunk_idx_gen.get() * 2 + 1;
    
            auto start_cpy = std::chrono::steady_clock::now();
    
            copy_done = false;
            // Submit the copy. req_cb is called when the copy is done, and will set 'copy_done' to true
            spdk_ioat_submit_copy(
                chan,&copy_done,
                req_cb,
                buffer8 + (dst_chunk_idx * chunk_size),
                buffer8 + (src_chunk_idx * chunk_size),
                chunk_size);
    
            // We wait for 'copy_done' to have been set to true by 'req_cb'
            do
            {
                spdk_ioat_process_events(chan);
            } while (!copy_done);
    
            time += (std::chrono::steady_clock::now() - start_cpy);
    
            // Trick the optimizer into not optimizing any copies away
            utils::clobber();
    
            ++iterations;
        } while (time < 1s);
    
        spdk_free(buffer64);
    
        return {chunk_size, buffer_size, time, iterations};
    }
  5. The third routine uses the Intel I/OAT driver to perform the parallel memory copy using the Intel I/OAT channels.
    benchmark par_spdk(uint64_t chunk_size, uint64_t buffer_size, spdk_ioat_chan* chan)
    {
        using namespace std::chrono_literals;
    
        // Allocate the whole buffer, 8 bytes-aligned
        uint64_t* buffer64 = (uint64_t*)spdk_malloc(buffer_size, sizeof(uint64_t), nullptr);
        uint8_t*  buffer8  = reinterpret_cast<uint8_t*>(buffer64);
    
        // Trick the optimizer into not optimizing any copies away
        utils::escape(buffer8);
    
        // Fill the buffer with random data
        random_number_generator<uint64_t> rnd;
        for (uint i = 0 / sizeof(uint64_t); i < buffer_size / sizeof(uint64_t); ++i)
        {
            buffer64[i] = rnd.get();
        }
    
        uint64_t                 iterations = 0;
        std::chrono::nanoseconds time       = 0s;
    
        uint64_t nb_chunks = buffer_size / chunk_size;
    
        std::mt19937 random_engine(std::random_device{}());
    
        do
        {
            // We want to match each source chunk with a random destination chunks,
            // while making sure we're not copying several chunk to the same destination
            std::vector<int> src_pool(nb_chunks / 2);
            std::vector<int> dst_pool(nb_chunks / 2);
            for (int i = 0; i < nb_chunks / 2; ++i)
            {
                src_pool.push_back(i);
                dst_pool.push_back(i);
            }
            std::shuffle(src_pool.begin(), src_pool.end(), random_engine);
            std::shuffle(dst_pool.begin(), dst_pool.end(), random_engine);
    
            // For each parallel copy, we need a flag telling us if the copy is done
            std::vector<int> copy_done(nb_chunks / 2, 0);
    
            auto start_cpy = std::chrono::steady_clock::now();
    
            for (int i = 0; i < nb_chunks / 2; ++i)
            {
                // Even-indexed chunk used as source
                uint64_t src_chunk_idx = src_pool[i] * 2;
                // Odd-indexed chunk used as destination
                uint64_t dst_chunk_idx = dst_pool[i] * 2 + 1;
    
                // Submit 1 copy
                spdk_ioat_submit_copy(
                    chan,
                    &copy_done[i],
                    req_cb,
                    buffer8 + (dst_chunk_idx * chunk_size),
                    buffer8 + (src_chunk_idx * chunk_size),
                    chunk_size);
            }
    
            // We wait for all copies to be done
            do
            {
                spdk_ioat_process_events(chan);
            } while (
                std::any_of(copy_done.cbegin(), copy_done.cend(), [](int done) { return done == 0; }));
    
            time += (std::chrono::steady_clock::now() - start_cpy);
    
            // Trick the optimizer into not optimizing any copies away
            utils::clobber();
    
            iterations += (nb_chunks / 2);
        } while (time < 1s);
    
        spdk_free(buffer64);
    
        return {chunk_size, buffer_size, time, iterations};
    }
  6. Once the three routines are complete, the main program displays the results for each for-loop iteration.
  7. Finally, after completing the for-loop, the main program releases the buffers.
    void uninit_spdk(spdk_ioat_chan* chan)
    {
        spdk_ioat_detach(chan);
    }

BIOS Setup

Before running the application, the Intel I/OAT feature must be enabled in the BIOS for each CPU socket; otherwise, the sample program will not run.

Figure 3

Figure 3: BIOS setting for Intel I/OAT function

SPDK Setup

After the BIOS setup is done, SPDK needs to be initialized for the application to recognize all of the Intel I/OAT channels.

cd /spdk/scripts
sudo ./setup.sh

Figure 4

Figure 4: Setting up the Intel I/OAT channels

Run the Example

sudo ./ex4 

Figure 5

Figure 5: Results of the memcpy and Intel I/OAT equivalent function

From the output, storage developers can use the results as a guide to determine the best combination of chunk size and buffer size for offloading memcpy work from the CPU to the Intel I/OAT channels in their storage application. By offloading the memcpy to the Intel I/OAT channels, the CPU can perform other tasks in parallel with the copy.

Notes:  2x Intel® Xeon® processor E5-2699v4 (HT off), Intel Speed Step® enabled, Intel® Turbo Boost Technology disabled, 16x16GB DDR4 2133 MT/s, 1 DIMM per channel, Ubuntu* 16.04 LTS, Linux kernel 4.4.0-21-generic, 1 TB Western Digital* (WD1002FAEX), 1 Intel® SSD P3700 Series (SSDPEDMD400G4),  22x per CPU socket. Performance measured by the written sample application in this article.   

Conclusion

This tutorial and sample application show one way to incorporate SPDK and the Intel I/OAT feature into your storage application. The example shows how to prepare the buffers and perform the memory copy, along with the hardware configuration and full build instructions. SPDK provides the Intel QuickData Technology driver and helps you quickly adapt your application to run on Intel® architecture with Intel I/OAT.

Other Useful Links

Authors

Thai Le is a software engineer who focuses on cloud computing and performance computing analysis at Intel.

Steven Briscoe is an application engineer focusing on cloud computing within the Software Services Group at Intel Corporation (UK). 

Jonathan Stern is an applications engineer and solutions architect who works to support storage acceleration software at Intel.

Configure SNAP* Telemetry Framework to Monitor Your Data Center


Figure 1

Figure 1. Snap* logo.

Introduction

Would you believe that you can get useful, insightful information on your data center's operations AND provide a cool interface to it that your boss will love—all in the space of an afternoon, and entirely with open source tooling? It's true. Free up a little time to read this, and then maybe free up your afternoon to reap the benefits!

This article shows you how to use Snap* to rapidly select and begin collecting useful measurements, from basic system information to metrics, on sophisticated cloud orchestration.

We'll also show you how to publish that information in ways that are useful to you, as someone who needs true insight into their data center's operations (and, possibly, ways to trigger automation on the basis of it). Finally, we'll show how to publish that information in ways that your management will like, making a useful dashboard with Grafana*.

After that, you'll discover a way to do all of that even faster. Let's get started!

What Is Snap*?

Snap is an open-source telemetry framework. Telemetry is simply information about your data center systems. It covers anything and everything you can collect, from basic descriptive information, to performance counters and statistics, to log file entries.

In the past, this huge stew of information has been difficult to synthesize and analyze together. There were collectors for log files that were separate from collectors for performance data, and so on. Snap unifies the collection of common telemetry as a set of community-provided plugins. It does quite a bit more than that too, but for now let's drill-down on collectors, and introduce our demonstration environment.

A Sample Data Center

To keep things simple we'll be working with a small Kubernetes* cluster, consisting of only two hosts. Why Kubernetes? We're aiming at a typical cloud data center. We could have chosen Mesos* or OpenStack* just as easily; Snap has plugins for all of them. Kubernetes just gives us an example to work with. It's important to realize that even if you're running another framework or even something proprietary in your data center, you can still benefit from the system-level plugins. The principles are the same, regardless.

The test environment has two nodes. One is a control node from which we will control and launch Kubernetes Pods; the other is the sole Kubernetes worker node.

We'll be collecting telemetry from both hosts, and the way that Snap is structured makes it easy to extrapolate how to do much larger installations from this smallish example. The nodes are running CentOS* 7.2. This decision was arbitrary; Snap supports most Linux* distributions by distributing binary releases in both RPM* and Debian* packaging formats.

Installation and Initial Setup

We'll need to install Snap, which is quite simple, using the packaging. Complete instructions are available at the home of the Snap repository. We won't repeat all those instructions here, but for simplicity, these are the steps that we took on both of our CentOS 7.2 hosts:

curl -s https://packagecloud.io/install/repositories/intelsdi-x/snap/script.rpm.sh | sudo bash

The point of this step is to set up the appropriate package repository from which to install Snap.

Note: It is not a good idea to run code straight off the Internet as root. This is done here for ease-of-use in an isolated lab environment. You can and should download the script separately, and examine it to be satisfied with exactly what it will do. Alternatively, you can build Snap from source and install it using your own cluster management tools.

After that, we can install the package:

     sudo yum install -y snap-telemetry

This step installs the Snap binaries and startup scripts. We’ll make sure the Snap daemon runs on system startup, and that we’re running it now:

     systemctl enable snap-telemetry
     systemctl start snap-telemetry

Now we have the snap daemon (snapteld) up and running. Below is some sample output from systemctl status snap-telemetry. You can also validate that you are able to communicate with the daemon with a command like ‘snaptel plugin list’. For now, you'll just get a message that says ‘No plugins found. Have you loaded a plugin?’ This is fine, and it means you're all set.

Figure 2

Figure 2. Screen capture showing the snap daemon ‘snapteld’ running.

Now we're ready to get some plugins!

The Plugin Catalog

The first glimpse of the power of Snap is in taking a look at the Snap plugin catalog. For now, we'll concentrate on the first set of plugins, labeled 'COLLECTORS'. Have a quick browse through that list and you'll see that Snap's capabilities are quite extensive. There are low-level system information collectors like PSUtil* and Meminfo*. There are Intel® processor-specific collectors such as CPU states and Intel® Performance Counter Monitor (Intel® PCM). There are service and application-level collectors such as Apache* and Docker*. There are cluster-level collectors such as the OpenStack services (Keystone*, and so on) and Ceph*.

The first major challenge a Snap installer faces is what NOT to collect! There's so much available that it can quickly add up to a mountain of data. We'll leave those decisions to you, but for our examples, we're going to select three:

  • PSUtil—basic system information, common to any Linux data center.
  • Docker—Information about Docker containers and the Docker system.
  • Kubestate*—Information about Kubernetes clusters.

This selection will give us a good spread of different types of collectors to examine.

Installing Plugins

Installing a plugin is a relatively straightforward process. We'll start with PSUtil. First, look at the plugin catalog and click the release link on the far right of the PSUtil entry, shown here:

Figure 3

Figure 3. The line for PSUtil in the plugin catalog.

On the next page, the most recent release is at the top of the page. We'll copy the link to the binary for the current release of PSUtil for Linux x86_64.

Figure 4

Figure 4. Copying the binary link from the plugin release page.

Now we’ll download that plugin and load it. Paste the URL that you copied above into a download-and-load command pair like the one sketched below; you should receive some output indicating that the plugin loaded, which you can confirm with ‘snaptel plugin list’.
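The pair follows the same pattern as the Docker example below; the placeholder here stands for the PSUtil release URL copied in Figure 4 (the exact link depends on the release you chose):

curl -sfL <URL-copied-from-the-release-page> -o snap-plugin-collector-psutil

snaptel plugin load snap-plugin-collector-psutil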

Following the exact same procedure with the Docker plugin will work great:

curl -sfL https://github.com/intelsdi-x/snap-plugin-collector-docker/releases/download/7/snap-plugin-collector-docker_linux_x86_64 -o snap-plugin-collector-docker

snaptel plugin load snap-plugin-collector-docker

The final plugin we are interested in is Kubestate*. Note that the maintainer of Kubestate is not Intel, but Grafana. That means the releases are not maintained in the same GitHub* repository, so the procedure has to change a bit. Fortunately, by examining the documentation of the Kubestate repository, you can easily find the Kubestate release repository.

From there the procedure is exactly the same:

curl -sfL https://github.com/grafana/snap-plugin-collector-kubestate/releases/download/1/snap-plugin-collector-kubestate_linux_x86_64 -o snap-plugin-collector-kubestate

snaptel plugin load snap-plugin-collector-kubestate

If you want to track to current code updates, you are more than welcome to build your own binaries and load them, instead of the precompiled releases. Most of the plugin repository home pages provide instructions for doing it this way.

Publishers

We aren't quite ready to collect information just yet with Snap. Let’s take a look at the overall flow of Snap:

Figure 5

Figure 5. Snap workflow.

You can find the collectors we’ve been dealing with easily on the left-hand side.

In the middle are processors, which is another set of plugins available in the plugin catalog. We won't be installing any of these as part of this demonstration, but they are very useful plugins for making your telemetry data genuinely useful to you. Statistical transformations and filtering can help you get a handle on your operational environment, or even trigger automatic responses to loading or other events. Tagging allows you to usher data through to appropriate endpoints; for example, separating data out by customer, in the case of a cloud service provider data center.

Finally, on the right you can find publishers. These plugins allow you to take the collected, processed telemetry and output it to something useful. Note that Snap itself doesn't USE the telemetry. It is all about collecting, processing, and publishing the telemetry data as simply and flexibly as possible, but what is done with it from that point is up to you.

In the simplest case, Snap can publish to a file on the local file system. It can publish to many different databases such as MySQL* or Cassandra*. It can publish to message queues like RabbitMQ*, to feed into automation systems.

For our examples, we're going to use the Graphite* publisher, for three reasons. One is that Graphite itself is a flexible, well-known, and useful package for dealing with lots of metrics. A data center operation can use the information straight out of Graphite to get all kinds of useful insight into their data center operations. The second reason we're using Graphite is that it feeds naturally into Grafana, which will give us a pretty, manager-friendly dashboard. Finally, most of the example tasks (which we'll discuss shortly) that are provided in the plugin repositories are based on a simple file publisher. Using Graphite will involve a bit more complexity and is a more likely real-world use of publisher plugins.

Loading the publisher plugin works exactly the same as the previous plugins: Find the latest binary release, download it, and load it:

curl -sfL https://github.com/intelsdi-x/snap-plugin-publisher-graphite/releases/download/6/snap-plugin-publisher-graphite_linux_x86_64 -o snap-plugin-publisher-graphite

snaptel plugin load snap-plugin-publisher-graphite

Now we have all the plugins we need for our installation:

[plse@cspnuc03 ~]$ snaptel plugin list
NAME             VERSION         TYPE            SIGNED          STATUS          LOADED TIME
psutil           9               collector       false           loaded          Mon, 27 Mar 2017 20:36:05 EDT
docker           7               collector       false           loaded          Mon, 27 Mar 2017 20:39:45 EDT
kubestate        1               collector       false           loaded          Mon, 27 Mar 2017 20:51:17 EDT
graphite         6               publisher       false           loaded          Mon, 27 Mar 2017 21:03:31 EDT
[plse@cspnuc03 ~]$

We're almost ready to tie all this together. First, we need to pick out the metrics we're interested in collecting.

Metrics

Now that you've got all these plugins installed, it's time to select some metrics that you're interested in collecting. Most plugins offer a large listing of metrics; you may not want to take all of them.

To see what's available, you can view the master list of metrics that are available from the plugins you have installed with a single command:

snaptel metric list

The output from this is quite long—234 lines for the three collectors we have loaded, at the time of this writing. Rather than paste all of the output here, we'll just look at a few from each namespace, generated by our three collector plugins.
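
If you want to narrow the listing down yourself, ordinary shell filtering works fine; for example, to see only the Docker namespace:

snaptel metric list | grep /intel/docker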

PSUtil

There are many metrics available from this package that should be readily identifiable to anyone who runs Linux. Here's the selection we'll go with for our examples:

     /intel/psutil/load/load1
     /intel/psutil/load/load15
     /intel/psutil/load/load5
     /intel/psutil/vm/available
     /intel/psutil/vm/free
     /intel/psutil/vm/used

These look like filesystem paths, but they are not. They are Snap namespace paths. The first element, 'intel', indicates the maintainer of the plugin. The second is the name of the plugin, after which come the namespaces of the various metrics provided.

The metrics themselves are familiar: the typical three-value load-average numbers for the 1-minute, 5-minute, and 15-minute averages, plus some simple memory usage values.

Docker

For the Docker plugin, there are 150 metrics available at the time of this writing. They run from simple container information to details about network and filesystem usage per container. For this, we'll take a few of the broader values, as recommended in the task examples provided in the plugin repository.

     /intel/docker/*/spec/*
     /intel/docker/*/stats/cgroups/cpu_stats/*
     /intel/docker/*/stats/cgroups/memory_stats/*

Kubestate

The Kubestate plugin (at /grafanalabs/kubestate in the namespace) provides 34 metrics for tracking Kubernetes information. Since Kubernetes is the top-level application platform, it's worth ensuring that we collect all of them. The full list will show up in the task definition file, below.

All of the metrics lists and documentation (available from the plugin repos) are worth a closer look to get the insight you need for your workloads.

Now that we've selected some metrics to track, let's actually get some data flowing!

Our First Task

Putting collector, processor, and publisher plugins together into an end-to-end workflow is done by specifying task manifests. These are either JSON or YAML definition files that are loaded into the running Snap daemon to tell it what data to collect, what to do with it, and where to send it. A single Snap daemon can be in charge of many tasks at once, meaning you can run multiple collectors and publish to multiple endpoints. This makes it very easy and flexible to direct telemetry where you need it, when you need it.

All of the plugin repositories generally include some sample manifests to help you get started. We're going to take some of those and extend them just a bit to tie in the Graphite publisher.

For Graphite itself, we've run a simple graphite container image from Docker Hub* on the 'control' node in our setup. Of course, you can use any Graphite service you wish or, of course, try another publisher. (For example, you may be using Elasticsearch*, and there's a publisher plugin for that!)

sudo docker run -d --name graphite --restart=always -p 80:80 -p 2003-2004:2003-2004 -p 2023-2024:2023-2024 -p 8125:8125/udp -p 8126:8126 hopsoft/graphite-statsd

The first task we define will be to collect data from the PSUtil plugin and publish it to our Graphite instance. We'll start with this YAML file. It's saved in our lab to psutil-graphite.yaml.

---
version: 1
schedule:
  type: simple
  interval: 1s
max-failures: 10
workflow:
  collect:
    metrics:"/intel/psutil/load/load1": {}"/intel/psutil/load/load15": {}"/intel/psutil/load/load5": {}"/intel/psutil/vm/available": {}"/intel/psutil/vm/free": {}"/intel/psutil/vm/used": {}
    publish:
    - plugin_name: graphite
      config:
        server: cspnuc02
        port: 2003

The best part about this is that it is straightforward and quite readable. To begin with, we've defined a collection interval of one second in the schedule section. The simple schedule type just means to run at every interval. There are two other types: windowed, which means you can define exactly when the task will start and stop, and cron, which allows you to set a crontab-like entry for when the task will run. There's lots more information on the flexibility of task scheduling in the task documentation.
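
For illustration only, a cron-style schedule section might look roughly like the following sketch; the exact field names and cron format are assumptions to verify against the task documentation for your Snap version:

schedule:
  type: cron
  interval: "0 */5 * * * *"   # intended to mean: run at second 0 of every fifth minute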

The ‘max-failures’ value indicates just what you would expect: After 10 consecutive failures of the task, the Snap daemon will disable the task and stop trying to run it.

Finally, the ‘workflow’ section defines our collector and our publisher, what metrics to collect, and defines the Graphite service to connect to as cspnuc02 on port 2003. This is the control node where the Graphite container we're using is running. Starting this task on both machines will ensure that both report into the same Graphite server.

To create the task, we'll run the task creation command:

     snaptel task create -t psutil-graphite.yaml

Assuming all is well, you should see output similar to this:

     [plse@cspnuc03 ~]$ snaptel task create -t psutil-graphite.yaml
     Using task manifest to create task
     Task created
     ID: d067a659-0576-44eb-95fe-0f01f7e33fbf
     Name: Task-d067a659-0576-44eb-95fe-0f01f7e33fbf
     State: Running

The task is running! For now, anyway. There are a couple of ways to keep tabs on your running tasks. One is just to list the installed tasks:

snaptel task list

You'll get a listing like the following that shows how many times the task has completed successfully (HIT), come up empty with no values (MISS), or failed altogether (FAIL). It will also indicate if the task is Running, Stopped, or Disabled.

ID                                       NAME                                            STATE           HIT     MISS    FAIL
CREATED                  LAST FAILURE
d067a659-0576-44eb-95fe-0f01f7e33fbf     Task-d067a659-0576-44eb-95fe-0f01f7e33fbf       Running         7K      15      3
3:55PM 3-28-2017         rpc error: code = 2 desc = Error Connecting to graphite at cspnuc02:2003. Error: dial tcp: i/o timeout

Even though you can see a failure here (from the last time it failed), you can also see that the task is still running: it's got 7,000 hits, 15 misses, and only 3 failures. Unless it reaches 10 consecutive failures, the task will keep trying to run.

Another way to see your tasks working is to watch them. Take the task ID field from the above output and paste it into this command:

snaptel task watch d067a659-0576-44eb-95fe-0f01f7e33fbf

You'll get a continuously updating text-mode output of the values as they stream by:

Figure 6

Figure 6. Output from snaptel task watch.

Now that we've created our first simple task, let's get the other collectors collecting!

Remaining Tasks

For Docker, our YAML looks like this, saved as docker-graphite.yaml:

---
max-failures: 10
schedule:
  interval: 5s
  type: simple
version: 1
workflow:
  collect:
    config:
      /intel/docker:
        endpoint: unix:///var/run/docker.sock
    metrics:
      /intel/docker/*/spec/*: {}
      /intel/docker/*/stats/cgroups/cpu_stats/*: {}
      /intel/docker/*/stats/cgroups/memory_stats/*: {}
    publish:
      -
        config:
          server: cspnuc02
          port: 2003
        plugin_name: graphite

By now you can probably see the general outline of how this works. Note the additional configuration of the collector to be able to communicate with the local Docker daemon. In this case we're using a socket; other examples work with a network connection to Docker on a specific port.
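
For example, a task pointed at a Docker daemon exposed over the network might swap the endpoint value out roughly as follows; the address, port, and exact endpoint syntax here are assumptions to check against the plugin's documentation:

    config:
      /intel/docker:
        endpoint: tcp://127.0.0.1:2375   # hypothetical TCP endpoint instead of the local socket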

We'll enable this one the same way on both nodes:

snaptel task create -t docker-graphite.yaml

Finally, we'll enable Kubestate. This one is a bit different than the other two. We don't want to enable it on both nodes, since we're interested in the overall state of the Kubernetes cluster, rather than the values on a specific node. Let's enable it on the control node only.

The example task manifest for Kubestate is a JSON file, so our modified version stays in JSON as well. This one is saved as kubestate-graphite.json:

{"version": 1,"schedule": {"type": "simple","interval": "10s"
  },"workflow": {"collect": {"metrics": {"/grafanalabs/kubestate/container/*/*/*/*/limits/cpu/cores": {},"/grafanalabs/kubestate/container/*/*/*/*/limits/memory/bytes": {},"/grafanalabs/kubestate/container/*/*/*/*/requested/cpu/cores": {},"/grafanalabs/kubestate/container/*/*/*/*/requested/memory/bytes": {},"/grafanalabs/kubestate/container/*/*/*/*/status/ready": {},"/grafanalabs/kubestate/container/*/*/*/*/status/restarts": {},"/grafanalabs/kubestate/container/*/*/*/*/status/running": {},"/grafanalabs/kubestate/container/*/*/*/*/status/terminated": {},"/grafanalabs/kubestate/container/*/*/*/*/status/waiting": {},"/grafanalabs/kubestate/deployment/*/*/metadata/generation": {},"/grafanalabs/kubestate/deployment/*/*/spec/desiredreplicas": {},"/grafanalabs/kubestate/deployment/*/*/spec/paused": {},"/grafanalabs/kubestate/deployment/*/*/status/availablereplicas": {},"/grafanalabs/kubestate/deployment/*/*/status/deploynotfinished": {},"/grafanalabs/kubestate/deployment/*/*/status/observedgeneration": {},"/grafanalabs/kubestate/deployment/*/*/status/targetedreplicas": {},"/grafanalabs/kubestate/deployment/*/*/status/unavailablereplicas": {},"/grafanalabs/kubestate/deployment/*/*/status/updatedreplicas": {},"/grafanalabs/kubestate/node/*/spec/unschedulable": {},"/grafanalabs/kubestate/node/*/status/allocatable/cpu/cores": {},"/grafanalabs/kubestate/node/*/status/allocatable/memory/bytes": {},"/grafanalabs/kubestate/node/*/status/allocatable/pods": {},"/grafanalabs/kubestate/node/*/status/capacity/cpu/cores": {},"/grafanalabs/kubestate/node/*/status/capacity/memory/bytes": {},"/grafanalabs/kubestate/node/*/status/capacity/pods": {},"/grafanalabs/kubestate/node/*/status/outofdisk": {},"/grafanalabs/kubestate/pod/*/*/*/status/condition/ready": {},"/grafanalabs/kubestate/pod/*/*/*/status/condition/scheduled": {},"/grafanalabs/kubestate/pod/*/*/*/status/phase/Failed": {},"/grafanalabs/kubestate/pod/*/*/*/status/phase/Pending": {},"/grafanalabs/kubestate/pod/*/*/*/status/phase/Running": {},"/grafanalabs/kubestate/pod/*/*/*/status/phase/Succeeded": {},"/grafanalabs/kubestate/pod/*/*/*/status/phase/Unknown": {}
      },"config": {"/grafanalabs/kubestate": {"incluster": false,"kubeconfigpath": "/home/plse/.kube/config"
        }
      },"process": null,"publish": [
        {"plugin_name": "graphite","config": {"server": "localhost","port": 2003
          }
        }
      ]
    }
  }
}

Again, most of this is quite straightforward. As noted above, we're collecting all metrics available in the namespace, straight from what we would have gotten from snaptel metric list. To configure the collector itself, we tell it that we're not running from within the cluster ("incluster": false), and where to look for information on how to connect to the cluster management ("kubeconfigpath": "/home/plse/.kube/config").

The config file for Kubernetes that's referenced there contains the server name and port to connect to and the cluster and context to use when conducting queries. So, clearly, multiple tasks could be set up to query different clusters and contexts, and route them as desired. We could even add in a tagging processor plugin to tag the data by cluster and deliver it with the tags; that would allow us to split cluster data out by customer, for example.
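
For instance, a second task manifest could keep the same metrics block but point at a different kubeconfig file (the path below is hypothetical), letting each task report on its own cluster:

      "config": {
        "/grafanalabs/kubestate": {
          "incluster": false,
          "kubeconfigpath": "/home/plse/.kube/config-cluster-b"
        }
      }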

Also note that here the server for Graphite is ‘localhost’ since only this one node needs to access the service. We could have used the hostname here as well; it works either way.

Enabling the service is the same as the others:

snaptel task create -t kubestate-graphite.json

Once we're satisfied that our tasks are up and running, we can go take a look at them with Graphite's native tools.

Real-World Deployments

We'll take a moment here to pause and have a quick look at how to make these kinds of settings permanent, as well as how to integrate Snap's tooling into a real-world environment.

The Snap daemon has several methods of configuration. We've been doing everything via the command-line interface, but none of it is persistent. If we rebooted our nodes right now, the Snap daemon would start up, but it wouldn't load the plugins and tasks we've defined for it.

To make that happen, you would want to use the guidelines at Snap Daemon Configuration.

We won't get into the specifics here, but suffice it to say that /etc/snap/snapteld.conf can be set up as either a YAML or JSON file that contains more or less the same information that our task manifests did. This file will suffice to install plugins and run tasks at boot time. It also defines many defaults for the way that Snap runs, so that you can tune the daemon to collect properly without imposing too much of its own overhead on your servers.
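
As a minimal, hedged sketch of such a file: plugin_trust_level is the only key referenced elsewhere in this article, and the other key names are assumptions to check against the Snap daemon configuration documentation:

---
log_level: 3
control:
  plugin_trust_level: 0
  # assumed option for auto-loading plugins and task manifests at startup
  auto_discover_path: /opt/snap/plugins:/opt/snap/tasks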

Likewise, it's likely you've been wondering about loading plugins and how secure that process is. The default installation method that we've used here sets the value plugin_trust_level to 0 in the /etc/snap/snapteld.conf configuration file. This means that the plugins that we've been downloading and installing haven't been checked for integrity by the daemon.

Snap uses GPG* keys and signing methods to allow you to sign and trust binaries in your deployment. The instructions for doing this are at Snap Plugin Signing. Again, it is beyond the scope of this article to examine this system deeply, but we strongly advise that any production deployments integrate with the plugin signing system, and that signatures are distributed independently of the plugins. This should not be an unusual model for deploying software in most data centers (although the work will probably run past our single afternoon).
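
Independent of Snap's own trust settings, you can always verify a downloaded binary against its detached signature with standard GPG tooling before loading it (the file names here are illustrative):

gpg --verify snap-plugin-collector-psutil.asc snap-plugin-collector-psutil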

Examining the Data

The Graphite container that we ran earlier exposes a useful web interface from the server. If we connect to it, we'll find that the available metrics are listed on the left-hand side of the resulting page.

The headings by hostname are the ones that we're interested in here (the others are defaults provided by the Graphite and StatsD* container system we've started). In the screenshot below I've expanded some of the items so you can get a feel for where the metrics come out in Graphite.

Figure 7

Figure 7. Metrics in the Graphite homepage.

From here it would be quite simple to construct Graphite-based graphs that offer interesting and useful data about your servers. For example, Figure 8 shows a graph that we constructed that looks at the Kubernetes node, and combines information about system load, Docker containers, and Kubernetes pods over a 30-minute period.

You can see from here when a new pod was launched and then removed. The purple line that shoots to 1.0 and drops again is for one of the containers in the pod; it didn't exist before spawning, and ceased to exist afterwards.

The green line is one-minute load average on the worker node, and the red line is Docker memory utilization on the same node.

Figure 8

Figure 8. A simple Graphite chart.

This is a simple example, just to give an idea of what can be generated, and the point here is that it took very little time to construct a graph with viable information. A little bit of experimentation with your specific environment and workloads will almost certainly reveal more useful data collection and uses in a very short period of time!

Making a Nice Dashboard

From here, it's a relatively simple matter to make some nice boss-friendly charts out of our pile of Graphite data. First, we'll load up another couple of quickie containers on the control host, to store our Grafana info, and to run the tool itself:

sudo docker run -d -v /var/lib/grafana --name grafana-storage busybox:latest
sudo docker run   -d   -p 3000:3000   --name=grafana   --volumes-from grafana-storage   grafana/grafana

Now you've got a Grafana installation running on port 3000 of the node these commands were run on. We'll bring it up in a browser and log in with the default credentials of admin for both Username and Password. You will arrive at the Home dashboard, and the next task on the list is to add a data source. Naturally, we'll add our Graphite installation:

Figure 9

Figure 9. Adding a data source in Grafana.

Once that's set up, we can return to the Home dashboard using the pull-down menu from the Grafana logo in the upper left-hand corner. Select Dashboards and Home. From there, click New Dashboard. You'll get a default graph that's empty on that page. Click the Panel Title, then Edit, and you can add metrics, change the title, and so on.

Adding metrics works the same here as it did on the Graphite screen, more or less. A little time spent exploring can give you something reasonably nice, as shown below. In an afternoon you could generate a very cool dashboard for your data sources!

Here's a quick dashboard we did with our one-minute load average chart and a chart of memory utilization of all Kubernetes containers in the cluster. The containers come into being as the Pod is deployed, which is why there is no data for them on the first part of the graph. We can also see that the initial deployment of the Pod was quickly halted and re-spawned between 17:35 and 17:40.

Figure 10

Figure 10. A simple dashboard in Grafana.

This may or may not be useful information for you; the point is that generating useful information for yourself is very simple and quick.

Once Again, But Faster!

So by now we've explored a lot of Snap's potential, but one area we haven't covered much is its extensibility. The plugin framework and open source tooling allow it to be extended quite easily by anyone interested.

For the example we used here, a Kubernetes setup, it turns out that there is a nice extension designed to plug directly into Snap with a full set of metrics and dashboards already available. It's called the Grafana Kubernetes app, and it runs all the components directly in your Kubernetes cluster instead of outside of it, the way we've done in this article.

You can find that in the Grafana Kubernetes app.

Besides prepackaged plugin chains like this one, other areas of extension are possible for Snap as well. For example, more advanced schedulers than the basic three (simple, windowed, cron) can be slotted in with relative ease. And of course, new collector and publisher plugins are always welcome!

Summary

In this article we've introduced you to Snap, an extensible and easy-to-use framework for collecting data center telemetry. We've talked about the kinds of system, application, and cluster-level data that Snap can collect (including Intel® architecture-specific counters like CPU states and Intel PCM). We've demonstrated some common ways to put plugins together to produce usable data for data center operators and, furthermore, create good-looking graphs with Grafana. We've shown you how to install, configure, add plugins, schedule and create tasks, and check the health of the running tasks.

We've also had a short discussion on how to take it to the next level and deploy Snap for real with signed plugins and persistent state. Finally, we've shown that Snap is easily extended to deliver the telemetry you need, to where you need it.

We hope you take an afternoon to try it out and see what it can do for you!

About the Author

Jim Chamings is a Sr. Software Engineer at Intel Corporation, who focuses on enabling cloud technology for Intel’s Developer Relations Division. He’d be happy to hear from you about this article at: jim.chamings@intel.com.

Intel(R) Media SDK GStreamer* Getting Started Guide


Intel(R) Media SDK GStreamer* Installation Process

1 Overview

This document provides the system requirements, installation instructions, and known issues and limitations. System requirements:

  • Intel(R) Core(TM) Processor: SkyLake, Broadwell.
  • Fedora* 24 / 25
  • Intel(R) Media Server Studio 2017 R2.

2 Installing Fedora* 24 / 25

2.1 Downloading Fedora*

Go to the Fedora* download site and download Workstation OS image:

Fedora* 24: http://mirror.nodesdirect.com/fedora/releases/24/Workstation/x86_64/iso/Fedora-Workstation-Live-x86_64-24-1.2.iso
Fedora* 25: http://mirror.nodesdirect.com/fedora/releases/25/Workstation/x86_64/iso/Fedora-Workstation-Live-x86_64-25-1.3.iso

2.2 Creating the installation USB

Use an imaging tool such as Rufus to create a bootable USB image.

2.3 Installing Fedora* 24 / 25 on the system

For Fedora* 25, you may need to log on to the system with the "GNOME on Xorg" option in the GNOME login manager. This is because the default desktop for Fedora* 25 uses Wayland, and the native Wayland backend of the renderer plugin (mfxsink) is not well supported by the Fedora* Wayland desktop. If you stay on the Wayland desktop, use the Wayland EGL backend in mfxsink for native Wayland rendering in Fedora* 25.

2.4 Configuring the Fedora* system (optional)

If the system is behind a proxy (for example, on a VPN), you can set up the network proxy for dnf as follows:

vi /etc/dnf/dnf.conf
# Add the following lines:
proxy=http://<proxy address>:<port>

Enable sudo privileges:

$ su
Password:
# vi /etc/sudoers
Find the line:
  root    ALL=(ALL)    ALL
Then add a similar line for the normal user who needs sudo, e.g. for the user "user":
  user    ALL=(ALL)    ALL

2.5 Installing rpm fusion

Fedora* 24:

wget http://download1.rpmfusion.org/free/fedora/rpmfusion-free-release-24.noarch.rpm -e use_proxy=yes -e http_proxy=<proxy_address>:<port>
sudo rpm -ivh rpmfusion-free-release-24.noarch.rpm

Fedora* 25:

wget http://download1.rpmfusion.org/free/fedora/rpmfusion-free-release-25.noarch.rpm -e use_proxy=yes -e http_proxy=<proxy_address>:<port>
sudo rpm -ivh rpmfusion-free-release-25.noarch.rpm

2.6 Updating system

sudo dnf update

3 Installing Intel(R) Media Server Studio 2017

3.1 Downloading Intel(R) Media Server Studio (MSS) 2017 R2 Community Edition

Go to https://software.intel.com/en-us/intel-media-server-studio and download the tar.gz file

3.2 Installing the user-space modules for MSS

Note: Before starting the following command sequence, be aware that the last cp command may reset the system; the system may freeze for a while and log out automatically. This is expected; log back in and resume the installation procedure. Create a folder for the installation, for example “development”, and download the tar file MediaServerStudioEssentials2017R2.tar.gz to this folder.

# cd ~
# mkdir development
# cd development
# tar -vxf MediaServerStudioEssentials2017R2.tar.gz
# cd MediaServerStudioEssentials2017R2/
# tar -vxf SDK2017Production16.5.1.tar.gz
# cd SDK2017Production16.5.1/Generic/
# tar -vxf intel-linux-media_generic_16.5.1-59511_64bit.tar.gz
# sudo cp -rdf etc/* /etc
# sudo cp -rdf opt/* /opt
# sudo cp -rdf lib/* /lib
# sudo cp -rdf usr/* /usr

3.3 Installing the custom kernel module package

3.3.1 Install the build tools

# sudo dnf install kernel-headers kernel-devel bc wget bison ncurses-devel hmaccalc zlib-devel binutils-devel elfutils-libelf-devel rpm-build redhat-rpm-config asciidoc hmaccalc perl-ExtUtils-Embed pesign xmlto audit-libs-devel binutils-devel elfutils-devel elfutils-libelf-devel newt-devel numactl-devel pciutils-devel python-devel zlib-devel mesa-dri-drivers openssl-devel

3.3.2 Download and build the kernel

# cd ~/development
# wget  https://www.kernel.org/pub/linux/kernel/v4.x/linux-4.4.tar.xz -e use_proxy=yes -e  https_proxy= https://<proxy address>:<port>
# tar -xvf linux-4.4.tar.xz
# cp /opt/intel/mediasdk/opensource/patches/kmd/4.4/intel-kernel-patches.tar.bz2 .
# tar -xjf intel-kernel-patches.tar.bz2
# cd linux-4.4/
# vi patch.sh
(Add the following line: for i in ../intel-kernel-patches/*.patch; do patch -p1 < $i; done)
# chmod +x patch.sh
# ./patch.sh
# make olddefconfig
# echo "CONFIG_NVM=y">> .config
# echo "CONFIG_NVM_DEBUG=n">> .config
# echo "CONFIG_NVM_GENNVM=n">> .config
# echo "CONFIG_NVM_RRPC=n">> .config
# make -j 8
# sudo make modules_install
# sudo make install

3.3.3 Validate the kernel change

Reboot the system with kernel 4.4 and check the kernel version

# uname -r
4.4.0

3.3.4 Validate the MSDK installation

The vainfo utility should show the Media SDK iHD driver details (installed in /opt/intel/mediasdk) and several codec entry points that indicate the system support for various codec formats.

$ vainfo
libva info: VA-API version 0.99.0
libva info: va_getDriverName() returns 0
libva info: User requested driver 'iHD'
libva info: Trying to open /opt/intel/mediasdk/lib64/iHD_drv_video.so
libva info: Found init function __vaDriverInit_0_32
libva info: va_openDriver() returns 0
vainfo: VA-API version: 0.99 (libva 1.67.0.pre1)
vainfo: Driver version: 16.5.1.59511-ubit
vainfo: Supported profile and entrypoints
 VAProfileH264ConstrainedBaseline: VAEntrypointVLD
 VAProfileH264ConstrainedBaseline: VAEntrypointEncSlice
 VAProfileH264Main : VAEntrypointVLD
 VAProfileH264Main : VAEntrypointEncSlice
 VAProfileH264High : VAEntrypointVLD
 VAProfileH264High : VAEntrypointEncSlice

Prebuilt samples are available for installation smoke testing in MediaSamples_Linux_*.tar.gz

# cd ~/development/MediaServerStudioEssentials2017R2/
# tar -vxf MediaSamples_Linux_2017R2.tar.gz
# cd MediaSamples_Linux_2017R2_b634/samples/_bin/x64/
# ./sample_multi_transcode -i::h264 ../content/test_stream.264 -o::h264 out.264

This test should pass on successful installation.

4 Installing GStreamer*

4.1 Install GStreamer* and corresponding plugins packages

# sudo dnf install gstreamer1 gstreamer1-devel gstreamer1-plugins-base gstreamer1-plugins-base-devel gstreamer1-plugins-good gstreamer1-plugins-ugly gstreamer1-plugins-bad-free gstreamer1-plugins-bad-freeworld gstreamer1-plugins-bad-free-extras gstreamer1-libav gstreamer1-plugins-bad-free-devel gstreamer1-plugins-base-tools

4.2 Validate the installation

# gst-launch-1.0 --version
# gst-launch-1.0 -v fakesrc num_buffers=5 ! fakesink
# gst-play-1.0 sample.mkv

5 Building the GStreamer* MSDK plugin

5.1 Install the GStreamer* MSDK plugin dependencies

# sudo dnf install gcc-c++ glib2-devel libudev-devel libwayland-client-devel libwayland-cursor-devel mesa-libEGL-devel mesa-libGL-devel mesa-libwayland-egl-devel mesa-libGLES-devel libstdc++-devel cmake libXrandr-devel

5.2 Download the GStreamer* MSDK plugin

Go to https://github.com/01org/gstreamer-media-SDK and download the package to a "development" folder.

5.3 Build and install the plugin

# cd development/gstreamer-media-SDK-master/
# mkdir build
# cd build
# cmake .. -DCMAKE_INSTALL_PREFIX=/usr/lib64/gstreamer-1.0/plugins
# make
# sudo make install

5.4 Validate the installation

# gst-inspect-1.0 mfxvpp
# gst-inspect-1.0 mfxdecode
# gst-play-1.0 sample.mkv
# gst-launch-1.0 filesrc location=/path/to/BigBuckBunny_320x180.mp4 ! qtdemux ! h264parse ! mfxdecode ! fpsdisplaysink video-sink=mfxsink

You can go to the following site to download the clip: http://download.blender.org/peach/bigbuckbunny_movies/


Benefits of Intel® Optimized Caffe* in comparison with BVLC Caffe*


Overview

This article introduces Berkeley Vision and Learning Center (BVLC) Caffe* and a customized version of Caffe*, Intel® Optimized Caffe*. We explain why and how Intel® Optimized Caffe* performs efficiently on Intel® architecture, using Intel® VTune™ Amplifier and the time profiling option of Caffe* itself.

 

Introduction to BVLC Caffe* and Intel® Optimized Caffe*

Caffe* is a well-known and widely used machine-vision-based deep learning framework developed by the Berkeley Vision and Learning Center (BVLC). It is an open-source framework that is still evolving. It allows users to control a variety of options, such as the BLAS library, CPU- or GPU-focused computation, CUDA, OpenCV*, MATLAB*, and Python*, before building Caffe* through 'Makefile.config'. You can easily change the options in the configuration file, and BVLC provides intuitive instructions on their project web page for developers.

Intel® Optimized Caffe* is an Intel-distributed, customized version of Caffe* for Intel architectures. Intel® Optimized Caffe* offers all the goodness of mainline Caffe* with the addition of functionality optimized for Intel architectures and multi-node distributed training and scoring. Intel® Optimized Caffe* makes it possible to utilize CPU resources more efficiently.

To see in detail how Intel® Optimized Caffe* has been changed to optimize it for Intel architectures, please refer to this page: https://software.intel.com/en-us/articles/caffe-optimized-for-intel-architecture-applying-modern-code-techniques

In this article, we will first profile the performance of BVLC Caffe* with the Cifar 10 example and then profile the performance of Intel® Optimized Caffe* with the same example. Performance profiling is conducted using two different methods.

Tested platform: Xeon Phi™ 7210 (1.3 GHz, 64 cores) with 96 GB RAM, CentOS 7.2

1. Caffe* provides its own timing option, for example:

./build/tools/caffe time \
    --model=examples/cifar10/cifar10_full_sigmoid_train_test_bn.prototxt \
    -iterations 1000

2. Intel® VTune™ Amplifier :  Intel® VTune™ Amplifier is a powerful profiling tool that provides advanced CPU profiling features with a modern analysis interface.  https://software.intel.com/en-us/intel-vtune-amplifier-xe

 

 

How to Install BVLC Caffe*

Please refer to the BVLC Caffe* project web page for installation instructions: http://caffe.berkeleyvision.org/installation.html

If you have Intel® MKL installed on your system, it is better to use MKL as the BLAS library.

In your Makefile.config, choose BLAS := mkl and specify the MKL include and library locations. (The default setting is BLAS := atlas.)
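
For example, the relevant Makefile.config lines might look like the following; the paths shown are the typical defaults for an Intel® MKL installation and may differ on your system:

BLAS := mkl
BLAS_INCLUDE := /opt/intel/mkl/include
BLAS_LIB := /opt/intel/mkl/lib/intel64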

In our test, we kept all configuration options at their defaults except for the CPU-only option.

 

Test example

In this article, we will use the 'Cifar 10' example included in the Caffe* package.

You can refer to the BVLC Caffe* project page for detailed information about this example: http://caffe.berkeleyvision.org/gathered/examples/cifar10.html

You can simply run the Cifar 10 training example as follows:

cd $CAFFE_ROOT
./data/cifar10/get_cifar10.sh
./examples/cifar10/create_cifar10.sh
./examples/cifar10/train_full_sigmoid_bn.sh

First, we will try Caffe*'s own benchmarking method to obtain its performance results:

./build/tools/caffe time \
    --model=examples/cifar10/cifar10_full_sigmoid_train_test_bn.prototxt \
    -iterations 1000

As a result, we get the layer-by-layer forward and backward propagation times. The command above measures the time of each forward and backward pass over a batch of images. At the end it shows the average execution time per iteration for 1,000 iterations, per layer and for the entire calculation.

This test was run on a Xeon Phi™ 7210 (1.3 GHz, 64 cores) with 96 GB of DDR4 RAM, running CentOS 7.2.

The numbers in the above results will be compared later with the results of Intel® Optimized Caffe*. 

Before that, let's take a look at the VTune™ results to observe the behavior of Caffe* in detail.

 

VTune Profiling

Intel® VTune™ Amplifier is a modern processor performance profiler that is capable of analyzing top hotspots quickly and helping you tune your target application. You can find the details of Intel® VTune™ Amplifier at the following link:

Intel® VTune™ Amplifier : https://software.intel.com/en-us/intel-vtune-amplifier-xe

We used Intel® VTune™ Amplifier in this article to find the function with the highest total CPU utilization time, and to observe how the OpenMP* threads are working.

 

VTune result analysis

 

What we can see here is a list of functions on the left side of the screen that are taking most of the CPU time. They are called 'hotspots' and can be the target functions for performance optimization.

In this case, we will focus on the 'caffe::im2col_cpu<float>' function as an optimization candidate.

'im2col_cpu<float>' is one of the steps in performing direct convolution as a GEMM operation so that highly optimized BLAS libraries can be used. This function consumed the most CPU time in our test of training the Cifar 10 model using BVLC Caffe*.

Let's take a look at the thread behavior of this function. In VTune™, you can choose a function and filter other workloads out to observe only the workloads of the specified function.

In the above result, we can see that the CPI (Cycles Per Instruction) of the function is 0.907 and that the function utilizes only a single thread for the entire calculation.

VTune™ provides one more intuitive piece of data here.

This 'CPU Usage Histogram' shows the number of CPUs that were running simultaneously. The number of CPUs the training process utilized appears to be about 25. The platform has 64 physical cores with Intel® Hyper-Threading Technology, so it has 256 logical CPUs. The CPU usage histogram here might imply that the process is not efficiently threaded.

However, we cannot simply label these results 'bad', because we did not set any performance standard or desired performance level to judge against. We will compare these results with the results of Intel® Optimized Caffe* later.

 

Let's move on to Intel® Optimized Caffe* now.

 

How to Install Intel® Optimized Caffe*

The basic installation procedure for Intel® Optimized Caffe* is the same as for BVLC Caffe*.

To clone Intel® Optimized Caffe* from Git, you can use the following:

git clone https://github.com/intel/caffe

 

Additionally, installing Intel® MKL is required to bring out the best performance of Intel® Optimized Caffe*.

Please download and install Intel® MKL. Intel offers MKL for free without technical support, or for a license fee with one-on-one private support. The default BLAS library of Intel® Optimized Caffe* is set to MKL.

 Intel® MKL : https://software.intel.com/en-us/intel-mkl

After downloading Intel® Optimized Caffe* and installing MKL, in your Makefile.config make sure you choose MKL as your BLAS library and point BLAS_INCLUDE and BLAS_LIB at the MKL include and lib folders:

BLAS := mkl

BLAS_INCLUDE := /opt/intel/mkl/include
BLAS_LIB := /opt/intel/mkl/lib/intel64

 

If you encounter a 'libstdc++'-related error during the compilation of Intel® Optimized Caffe*, please install 'libstdc++-static'. For example:

sudo yum install libstdc++-static

 

 

 

Optimization factors and tunes

Before we run and test the performance of examples, there are some options we need to change or adjust to optimize performance.

  • Use 'mkl' as the BLAS library: specify 'BLAS := mkl' in Makefile.config and also configure the locations of your MKL include and lib directories.
  • Set the CPU utilization limits:
    echo "100" | sudo tee /sys/devices/system/cpu/intel_pstate/min_perf_pct
    echo "0" | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
  • Put 'engine:"MKL2017"' at the top of your train_val.prototxt or solver.prototxt file, or use this option with the caffe tool: -engine "MKL2017"
  • The current implementation uses OpenMP* threads. By default the number of OpenMP* threads is set to the number of CPU cores, and each thread is bound to a single core to achieve the best performance results. It is, however, possible to provide your own configuration through OpenMP* environment variables such as KMP_AFFINITY, OMP_NUM_THREADS, or GOMP_CPU_AFFINITY. For the example run below, 'OMP_NUM_THREADS = 64' has been used (see the sketch after this list).
  • Intel® Optimized Caffe* has changed many parts of the original BVLC Caffe* code to achieve better code parallelization with OpenMP*. Depending on other processes running in the background, it is often useful to adjust the number of threads utilized by OpenMP*. For the Intel Xeon Phi™ product family on a single node, we recommend using OMP_NUM_THREADS = number_of_cores - 2.
  • Please also refer here: Intel Recommendation to Achieve the best performance
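
For example, on the 64-core system used in this article the environment for the example run could be set up as follows (the value is illustrative; adjust it to your core count and background load):

export OMP_NUM_THREADS=64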

If you observe too much overhead because the OS moves threads around too frequently, you can try adjusting the OpenMP* affinity environment variable:

KMP_AFFINITY=compact,granularity=fine

 

Test example

For Intel® Optimized Caffe*, we run the same example to compare the results with the previous ones.

cd $CAFFE_ROOT
./data/cifar10/get_cifar10.sh
./examples/cifar10/create_cifar10.sh
./build/tools/caffe time \
    --model=examples/cifar10/cifar10_full_sigmoid_train_test_bn.prototxt \
    -iterations 1000

 

Comparison

The results for the above example are as follows:

Again, the platform used for the test is: Xeon Phi™ 7210 (1.3 GHz, 64 cores) with 96 GB RAM, CentOS 7.2

First, let's look at the BVLC Caffe* and Intel® Optimized Caffe* results together.


To make it easy to compare, please see the table below. The duration each layer took is listed in milliseconds, and in the fifth column we state how many times faster Intel® Optimized Caffe* is than BVLC Caffe* at each layer. You can observe significant performance improvements, except for the bn layers. Bn stands for "Batch Normalization", which requires fairly simple calculations with small optimization potential. The bn forward layers show better results, while the bn backward layers show 2~3% slower results than the original; the worse performance here can be a result of threading overhead. Overall, Intel® Optimized Caffe* achieved about 28 times faster performance in this case.

Layer      Direction           BVLC (ms)    Intel (ms)    Performance Benefit (x)
conv1      Forward             40.2966      1.65063       24.413
conv1      Backward            54.5911      2.24787       24.286
pool1      Forward             162.288      1.97146       82.319
pool1      Backward            21.7133      0.459767      47.227
bn1        Forward             1.60717      0.812487      1.978
bn1        Backward            1.22236      1.24449       0.982
Sigmoid1   Forward             132.515      2.24764       58.957
Sigmoid1   Backward            17.9085      0.262797      68.146
conv2      Forward             125.811      3.8915        32.330
conv2      Backward            239.459      8.45695       28.315
bn2        Forward             1.58582      0.854936      1.855
bn2        Backward            1.2253       1.25895       0.973
Sigmoid2   Forward             132.443      2.2247        59.533
Sigmoid2   Backward            17.9186      0.234701      76.347
pool2      Forward             17.2868      0.38456       44.952
pool2      Backward            27.0168      0.661755      40.826
conv3      Forward             40.6405      1.74722       23.260
conv3      Backward            79.0186      4.95822       15.937
bn3        Forward             0.918853     0.779927      1.178
bn3        Backward            1.18006      1.18185       0.998
Sigmoid3   Forward             66.2918      1.1543        57.430
Sigmoid3   Backward            8.98023      0.121766      73.750
pool3      Forward             12.5598      0.220369      56.994
pool3      Backward            17.3557      0.333837      51.989
ipl        Forward             0.301847     0.186466      1.619
ipl        Backward            0.301837     0.184209      1.639
loss       Forward             0.802242     0.641221      1.251
loss       Backward            0.013722     0.013825      0.993
Ave.       Forward             735.534      21.6799       33.927
Ave.       Backward            488.049      21.7214       22.469
Ave.       Forward-Backward    1223.86      43.636        28.047
Total      Forward-Backward    1223860      43636         28.047

 

Some of the many reasons this optimization was possible are:

  • Code vectorization for SIMD 
  • Finding hotspot functions and reducing function complexity and the amount of calculations
  • CPU / system specific optimizations
  • Reducing thread movements
  • Efficient OpenMP* utilization

 

Additionally, let's compare the VTune results of this example between BVLC Caffe and Intel® Optimized Caffe*. 

We will simply look at how efficiently the im2col_cpu function has been utilized.

BVLC Caffe*'s im2col_cpu function had a CPI of 0.907 and was single-threaded.

In the case of Intel® Optimized Caffe*, im2col_cpu has a CPI of 2.747 and is multithreaded by OpenMP* workers.

The CPI rate increased here because of vectorization, which brings a higher CPI rate due to longer latency for each instruction, and because of multi-threading, which can introduce spinning while waiting for other threads to finish their jobs. However, in this example, the benefits from vectorization and multi-threading exceed the latency and overhead and bring performance improvements after all.

VTune™ suggests that a CPI rate close to 2.0 is theoretically ideal, and in our case we achieved about the right CPI for the function. The training workload for the Cifar 10 example handles 32 x 32 pixel images for each iteration, so when those workloads are split across many threads, each of them can be a very small task, which may cause transition overhead for multi-threading. With larger images we would see lower spinning time and a smaller CPI rate.

CPU Usage Histogram for the whole process also shows better threading results in this case. 

 

 

 

Useful links

BVLC Caffe* Project : http://caffe.berkeleyvision.org/ 
 
Intel® Optimized Caffe* Git : https://github.com/intel/caffe
Intel® Optimized Caffe* Recommendations for the best performance : https://github.com/intel/caffe/wiki/Recommendations-to-achieve-best-performance 
 

 

Summary

Intel® Optimized Caffe* is a customized Caffe* version for Intel Architectures with modern code techniques.

In Intel® Optimized Caffe*, Intel leverages optimization tools and Intel® performance libraries, performs scalar and serial optimizations, and implements vectorization and parallelization.

 

 

Intel Solutions and Technologies for the Evolving Data Center


 

One Stop for Optimizing Your Data Center

From AI to Big Data to HPC: End-to-end Solutions

Whether your data center is data- or compute-intensive and whether it serves cloud, high-performance computing, enterprise, storage, networking, or big data analytics, we have solutions and technologies to make your life easier. 

Explore

 

Data center managers, integrators, and developers can now optimize the entire stack to run faster and more efficiently on Intel® architecture. The Intel® Xeon® and Intel® Xeon Phi™ product family paired with Intel® Solid State Drives and NVMe* storage provide a strong foundation. Intel is committed to a standardized, shared platform for virtualization including SDN/NFV (networking), while providing hardware-based security and manageability for now and in the future.

But Intel is more than a hardware innovator. Regardless of your challenges, Intel provides optimized industry SDKs, libraries, and tuning tools. And these tools are supplemented by expert-provided training plus documentation including code samples, configuration guides, walk-throughs, use cases, and support forums.
 

 

AI: MACHINE LEARNING AND DEEP LEARNING

Intel supports rapid innovation in artificial intelligence, focusing on community, tools, and training. Starting with the Intel® Nervana™ AI Academy, this section of the Intel® Software Developer Zone drills down into computational machine learning and deep learning, with extensive Intel-optimized libraries and frameworks along with documentation and tutorials.

The Deep Learning Training Tool Beta helps you easily develop and train deep learning solutions using your own hardware. It can ease your data preparation, as well as design and train models using automated experiments and advanced visualizations.

Tools available include:
BigDL open source distributed library for Apache Spark*
Intel® Distribution for Python*
Deep Learning Webinar

 

MODERN CODE

You’ve no doubt heard of recent hardware innovations of the Intel® Many Integrated Core Architecture (Intel® MIC) including the multilevel extreme parallelism, vectorization and threading of the Intel® Xeon® and Intel® Xeon Phi™ product family. Plus, there are larger caches, new SIMD extensions, new memory and file architectures and hardware enforced security of select data and application code via Intel® Software Guard Extensions (Intel® SGX).

But they all require code and tool changes to get the most from the data center. To address this, Intel provides training and tools to quickly and easily optimize code for new technologies.

Extensive free training on code improvements and parallel programming is available online and by workshops and events.

Tools available include:
Intel® Parallel Studio XE (vectorization advisor and MPI profiling)
Intel® Advisor (vectorization optimization and threading design tool)
Intel® C/C++ Compilers and Intel® Fortran Compilers
Intel® VTune™ Amplifier XE (performance analysis of multiple CPUs and FPUs)
Application Performance Snapshot Tool

 

BIG DATA ANALYTICS

When handling huge volumes of data, Intel can help you provide faster, easier, and more insightful big data analytics using open software platforms, libraries, developer kits, and tools that take advantage of the Intel Xeon and Intel Xeon Phi product family’s extreme parallelism and vectorization. Fully integrated with popular platforms (Apache* Hadoop*, Spark*, R, MATLAB*, Java*, and NoSQL), Intel optimizations have been well tested and benchmarked.

Extensive documentation is available on how real-life developers are using Intel hardware, software, and tools to effectively store, manage, process, and analyze data.

The Intel® Data Analytics Acceleration Library (Intel® DAAL) provides highly-optimized algorithmic building blocks and can be paired with the Intel® Math Kernel Library (Intel® MKL) containing optimized threaded and vectorized functions. In fact, the TAP Analytics Toolkit (TAP ATK) provides both Intel® DAAL and Intel® MKL already integrated with Spark.

 

HIGH-PERFORMANCE STORAGE

Intel is at the cutting edge of Storage not only with Intel® SSDs and NVMe but by working with the open source community to optimize and secure the infrastructure. Training is available at Intel® Storage Builders University.


Major tools available include:
Intel® Intelligent Storage Acceleration Library (Intel® ISA-L)
Storage Performance Development Kit (SPDK)
Intel® QuickAssist Technology
Intel® VTune™ Amplifier
Storage Performance Snapshot
Intel® Cache Acceleration Software (Intel® CAS)

 

SDN/NFV NETWORKING

Besides providing a standardized open platform ideal for SDN/NFV (virtualized networking) and the unique hardware capabilities in Intel’s network controllers, Intel has provided extensive additions to, and testing of, the Data Plane Development Kit (DPDK) and training through Intel® Network Builders University. Check out the thriving community of developers and subscribe to the 'Out of the Box' Network Developers Newsletter.

   

HPC AND CLUSTER

If you run visualization or other massively parallel applications, you know the advantages of using the Intel Xeon and Intel Xeon Phi product family with MCDRAM and the associated NUMA/memory/cache modes, wide vector units, and up to 68 cores. While the Intel® Scalable System Framework (Intel® SSF) and Intel® Omni-Path Architecture (Intel® OPA) focus on performance, balance, and scalability, Intel is working with research and production HPC sites and clusters to support integration with all the major stacks, as well as developing code and tools to optimize and simplify the work.

The Intel® HPC Orchestrator provides a modular integrated validated stack including the Lustre* parallel file system. It is supplemented by critical tools for cluster optimization:

Intel® Trace Analyzer and Collector which quickly finds MPI bottlenecks
Intel® MPI Library and docs to improve implementation of MPI 3.1 on multiple fabrics
MPI Performance Snapshot to help with performance tuning.
Intel® VTune™ Amplifier XE for performance analysis of multiple CPUs, FPUs and NUMA

 

 

Conclusion

Regardless of your job title and data center activities, Intel helps streamline and optimize your work to gain a competitive edge with end-to-end solutions, from high-performance hardware to new technologies, optimizations, tools and training. See what resources Intel provides to optimize and speed up your development now and remain competitive in the industry.

Explore

Intel® Xeon Phi™ Coprocessor x200 Quick Start Guide


Introduction

This document introduces the basic concept of the Intel® Xeon Phi™ coprocessor x200 product family, tells how to install the coprocessor software stack, discusses the build environment, and points to important documents so that you can write code and run applications.

The Intel Xeon Phi coprocessor x200 is the second generation of the Intel Xeon Phi product family. Unlike the first generation running on an embedded Linux* uOS, this second generation supports the standard Linux kernel. The Intel Xeon Phi coprocessor x200 is designed for installation in a third-generation PCI Express* (PCIe*) slot of an Intel® Xeon® processor host. The following figure shows a typical configuration:

 Intel Xeon Phi coprocessor x200 architecture

Benefits of the Intel Xeon Phi coprocessor:

  • System flexibility: Build a system that can support a wide range of applications, from serial to highly parallel, while leveraging code optimized for Intel Xeon processors or Intel Xeon Phi processors.
  • Maximize density: Gain significant performance improvements with limited acquisition cost by maximizing system density.
  • Upgrade path: Improve performance by adding to an Intel Xeon processor system or upgrading from the first generation of the Intel Xeon Phi product family with minimum code changes.

For workloads that fit within 16 GB coprocessor memory, adding a coprocessor to a host server allows customers to avoid costly networking. For workloads that have a significant portion of highly parallel phases, offload can offer significant performance with minimal code optimization investment.

Additional Documentation

Basic System Architecture

The Intel Xeon Phi coprocessor x200 is based on a modern Intel® Atom™ microarchitecture with considerable high performance computing (HPC)-focused performance improvements. It has up to 72 cores with four threads per core, giving a total of 288 CPUs as viewed by the operating system, and has up to 16 GB of high-bandwidth on-package MCDRAM memory that provides over 500 GB/s effective bandwidth. The coprocessor has an x16 PCI Express Gen3 interface (8 GT/s) to connect to the host system.

The cores are laid out in units called tiles. Each tile contains a pair of cores, a shared 1 MB L2 cache, and a hub connecting the tile to a mesh interface. Each core contains two 512-bit wide vector processing units. The coprocessor supports Intel® AVX-512F (foundation), Intel AVX-512CD (conflict detection), Intel AVX-512PF (prefetching), and Intel AVX-512ER (exponential reciprocal) ISA.


Intel® Manycore Platform Software Stack

Intel® Manycore Platform Software Stack (Intel® MPSS) is the user and system software that allows programs to run on and communicate with the Intel Xeon Phi coprocessor. Intel MPSS version 4.x.x is used for the Intel Xeon Phi coprocessor x200 and can be downloaded from here (https://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-for-intel-xeon-phi-coprocessor-x200). (Note that the older Intel MPSS version 3.x.x is used for the Intel Xeon Phi coprocessor x100.) A standard Linux kernel runs on the coprocessor.

You can download the Intel MPSS stack at https://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-for-intel-xeon-phi-coprocessor-x200. The following host operating systems are supported: Red Hat* Enterprise Linux Server, SUSE* Linux Enterprise Server and Microsoft Windows*. For detailed information on requirements and on installation, please consult the README file for Intel MPSS. The figure below shows the high-level representation of the Intel MPSS. The host software stack is on the left and the coprocessor software stack is on the right.

 High-level representation of the Intel MPSS.

Install the Software Stack and Start the Coprocessor

Installation Guide for Linux* Host:

  1. From the “Intel Manycore Platform Software Stack for Intel Xeon Phi Coprocessor x200” page (https://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-for-intel-xeon-phi-coprocessor-x200), navigate to the latest version of the Intel MPSS release for Linux and download “Readme for Linux (English)” (README.txt). Also download the release notes (releasenotes-linux.txt) and the User’s Guide for Intel MPSS.
  2. Install one of the following supported operating systems in the host:
    • Red Hat Enterprise Linux Server 7.2 64-bit kernel 3.10.0-327
    • Red Hat Enterprise Linux Server 7.3 64-bit kernel 3.10.0-514
    • SUSE Linux Enterprise Server SLES 12 kernel 3.12.28-4-default
    • SUSE Linux Enterprise Server SLES 12 SP1 kernel 3.12.49-11-default
    • SUSE Linux Enterprise Server SLES 12 SP2 kernel 4.4.21-69-default

    Be sure to install ssh, which is used to log in to the card.

    WARNING: On installing Red Hat, it may automatically update you to a new version of the Linux kernel. If this happens, you will not be able to use the prebuilt host driver, but will need to rebuild it manually for the new kernel version. Please see Section 5 in the readme.txt for instructions on building an Intel MPSS host driver for a specific Linux kernel.

  3. Log in as root.
  4. Download the release driver appropriate for your operating system from the page in Step 1 (<mpss-version>-linux.tar), where <mpss-version> is mpss-4.3.3 at the time this document was written.
  5. Install the host driver RPMs as detailed in Section 6 of readme.txt. Don’t skip the creation of configuration files for your coprocessor.
  6. Update the flash on your coprocessor(s) as detailed in Section 8 of readme.txt.
  7. Reboot the system.
  8. Start the Intel Xeon Phi coprocessor (you can set up the card to start with the host system; it will not do so by default), and then run micinfo to verify that it is set up properly:
    # systemctl start mpss
    # micctrl -w
    # /usr/bin/micinfo
    micinfo Utility Log
    Created On Mon Apr 10 12:14:08 2017
    
    System Info:
        Host OS                        : Linux
        OS Version                     : 3.10.0-327.el7.x86_64
        MPSS Version                   : 4.3.2.5151
        Host Physical Memory           : 128529 MB
    
    Device No: 0, Device Name: mic0 [x200]
    
    Version:
        SMC Firmware Version           : 121.27.10198
        Coprocessor OS Version         : 4.1.36-mpss_4.3.2.5151 GNU/Linux
        Device Serial Number           : QSKL64000441
        BIOS Version                   : GVPRCRB8.86B.0012.R02.1701111545
        BIOS Build date                : 01/11/2017
        ME Version                     : 3.2.2.4
    
    Board:
        Vendor ID                      : 0x8086
        Device ID                      : 0x2260
        Subsystem ID                   : 0x7494
        Coprocessor Stepping ID        : 0x01
        UUID                           : A03BAF9B-5690-E611-8D4F-001E67FC19A4
        PCIe Width                     : x16
        PCIe Speed                     : 8.00 GT/s
        PCIe Ext Tag Field             : Disabled
        PCIe No Snoop                  : Enabled
        PCIe Relaxed Ordering          : Enabled
        PCIe Max payload size          : 256 bytes
        PCIe Max read request size     : 128 bytes
        Coprocessor Model              : 0x57
        Coprocessor Type               : 0x00
        Coprocessor Family             : 0x06
        Coprocessor Stepping           : B0
        Board SKU                      : B0 SKU _NA_A
        ECC Mode                       : Enabled
        PCIe Bus Information           : 0000:03:00.0
        Coprocessor SMBus Address      : 0x00000030
        Coprocessor Brand              : Intel(R) Corporation
        Coprocessor Board Type         : 0x0a
        Coprocessor TDP                : 300.00 W
    
    Core:
        Total No. of Active Cores      : 68
        Threads per Core               : 4
        Voltage                        : 900.00 mV
        Frequency                      : 1.20 GHz
    
    Thermal:
        Thermal Dissipation            : Active
        Fan RPM                        : 6000
        Fan PWM                        : 100 %
        Die Temp                       : 38 C
    
    Memory:
        Vendor                         : INTEL
        Size                           : 16384.00 MB
        Technology                     : MCDRAM
        Speed                          : 6.40 GT/s
        Frequency                      : 6.40 GHz
        Voltage                        : Not Available

Installation Guide for Windows* Host:

  1. From the “Intel Manycore Platform Software Stack for Intel Xeon Phi Coprocessor x200” page (https://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-for-intel-xeon-phi-coprocessor-x200), navigate to the latest version of the Intel MPSS release for Microsoft Windows. Download the “Readme file for Microsoft Windows” (readme-windows.pdf). Also download the “Release notes” (releaseNotes-windows.txt) and the “Intel MPSS User’s Guide” (MPSS_Users_Guide-windows.pdf).
  2. Install one of the following supported operating systems in the host:
    • Microsoft Windows 8.1 (64-bit)
    • Microsoft Windows® 10 (64-bit)
    • Microsoft Windows Server 2012 R2 (64-bit)
    • Microsoft Windows Server 2016 (64-bit)
  3. Log in as “administrator”.
  4. Install the .NET Framework* 4.5 or higher (http://www.microsoft.com/net/download), Python* 2.7.5 x86-64 or higher (Python 3.x is not supported), and a recent Pywin32 build (https://sourceforge.net/projects/pywin32) on the system.
  5. Be sure to install PuTTY* and PuTTYgen*, which are used to log in to the card’s OS.
  6. Follow the preliminary steps as instructed in Section 2.2.1 of the Readme file.
  7. Restart the system.
  8. Download the drivers package mpss-4.*-windows.zip for your Windows operating system from the page described in Step 1.
  9. Unzip the zip file to get the Windows exec files (“mpss-4.*.exe” and “mpss-essentials-4*.exe”).
  10. Install the Windows Installer file “mpss-4.*.exe” as detailed in Section 3.2 of the User’s Guide. Note that if a previous version of the Intel Xeon Phi coprocessor stack is already installed, use Windows Control Panel to uninstall it prior to installing the current version. By default, Intel MPSS is installed in “c:\Program Files\Intel\MPSS”. Also, install “mpss-essentials-4*.exe”, the native binary utilities for the Intel Xeon Phi coprocessor. These are required when using offload programming or cross compilers.
  11. Confirm that the new Intel MPSS stack is successfully installed by looking at Control Panel > Programs > Programs and Features: Intel Xeon Phi (see the following illustrations).

    [Screenshot: Control Panel > Programs > Programs and Features, showing the installed Intel Xeon Phi coprocessor stack]

  12. Update the flash according to Section 2.2.3 of the readme-windows.pdf file.
  13. Reboot the system.
  14. Log in to the host and verify that the Intel Xeon Phi x200 coprocessors are detected by the Device Manager (Control Panel > Hardware > Device Manager, and click “System devices”):

    [Screenshot: Control Panel > Hardware > Device Manager, “System devices”, showing the Intel Xeon Phi x200 coprocessors]
  15. Start the Intel Xeon Phi coprocessor (you can set up the card to start with the host system; it will not do so by default). Launch a command-prompt window and start the Intel MPSS stack:
        prompt> micctrl --start
  16. Run the command “micinfo” to verify that it is set up properly:
        prompt> micinfo.exe

Intel® Parallel Studio XE

After starting the Intel MPSS stack, users can write applications running on the coprocessor using Intel Parallel Studio XE.

Intel Parallel Studio XE is a software development suite that helps boost application performance by taking advantage of the ever-increasing processor core count and vector register width available in Intel Xeon processors, Intel Xeon Phi processors and coprocessors, and other compatible processors. Starting with the Intel Parallel Studio 2018 beta, the following Intel® products support program development on the Intel Xeon Phi coprocessor x200:

  • Intel® C Compiler/Intel® C++ Compiler/Intel® Fortran Compiler
  • Intel® Math Kernel Library (Intel® MKL)
  • Intel® Data Analytics Acceleration Library (Intel® DAAL)
  • Intel® Integrated Performance Primitives (Intel® IPP)
  • Intel® Cilk™ Plus
  • Intel® Threading Building Blocks (Intel® TBB)
  • Intel® VTune™ Amplifier XE
  • Intel® Advisor XE
  • Intel® Inspector XE
  • Intel® MPI Library
  • Intel® Trace Analyzer and Collector
  • Intel® Cluster Ready
  • Intel® Cluster Checker

To get started writing programs running on the coprocessor, you can get the code samples at https://software.intel.com/en-us/product-code-samples. The packages “Intel Parallel Studio XE for Linux - Sample Bundle”, and “Intel Parallel Studio XE for Windows - Sample Bundle” contain code samples for Linux and Windows, respectively.

Programming Models on Coprocessor

There are three programming models that can be used with the Intel Xeon Phi coprocessor x200: the offload programming model, the symmetric programming model, and the native programming model.

  • Offload programming: The main application runs on the host and offloads selected, highly parallel portions of the program to the coprocessor(s) to take advantage of the many-core architecture. The serial portions of the program still run on the host to take advantage of the big-core architecture (a minimal offload sketch follows this list).
  • Symmetric programming: The coprocessors and the host are treated as separate nodes. This model is suitable for distributed computing.
  • Native programming: The coprocessors are used as independent nodes, just like a host. Users compile the binary for the coprocessor on the host, transfer the binary, and log in to the coprocessor to run it.
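
To make the offload model concrete, here is a minimal sketch (our own addition, not taken from the Intel MPSS or Intel Parallel Studio documentation) of a C program that offloads a parallel loop with OpenMP 4.x target directives, which the Intel compilers support for the coprocessor; the array size and the use of the default offload device are illustrative assumptions.

/* Offload sketch: serial setup and reporting stay on the host, while the
 * parallel loop runs on the coprocessor (the default offload device).
 * Build with an OpenMP-4.x-capable Intel compiler; flags depend on your
 * toolchain. */
#include <stdio.h>

#define N 1000000

int main(void)
{
    static float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    #pragma omp target map(to: a, b) map(from: c)   /* offloaded region */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[42] = %f\n", c[42]);   /* serial reporting back on the host */
    return 0;
}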

The figure below summarizes different programming models used for the Intel Xeon Phi coprocessor:

How to Use Cache Monitoring Technology in OpenStack*


Introduction

With an increasing number of workloads running simultaneously on a system, there is more pressure on shared resources such as the CPU, cache, network bandwidth, and memory. This pressure reduces workload performance, and if one or more of the workloads is bursty in nature, it also reduces performance determinism. An interfering workload is called a noisy neighbor, and for the purposes of this discussion a workload could be any software application, a container, or even a virtual machine (VM).

Intel® Resource Director Technology (Intel® RDT) provides hardware support to monitor and manage shared resources, such as the last level cache (LLC) (also called the L3 cache), and memory bandwidth. In conjunction with software support, starting with the operating system and going up the solution stack, this functionality is being made available to monitor and manage shared resources to isolate workloads and improve determinism. In particular, the cache monitoring technology (CMT) aspect of Intel RDT provides last-level cache usage information for a workload.

OpenStack* is an open source cloud operating system that controls datacenter resources, namely compute, storage, and networking. Users and administrators can access the resources through a web interface or RESTful API calls. For the purposes of this document, we assume that the reader has some knowledge of OpenStack, either as an operator/deployer, or as a developer.

Let us explore how to enable and use CMT, in the context of an OpenStack cloud, to detect cache-related workload interference and take remedial action(s).

Note 1: Readers of this article should have basic understanding of OpenStack and its deployment and configuration.

Note 2: All of the configurations and examples are based on the OpenStack Newton* release version (released in October 2016) and the Gnocchi* v3.0 release.

Enabling CMT in OpenStack*

To leverage CMT in OpenStack requires touching the Nova*, Ceilometer*, and optionally the Gnocchi and Aodh* projects. The Nova project concerns itself with scheduling and managing workloads on the compute hosts. Ceilometer and Gnocchi pertain to telemetry. The Ceilometer agent runs on the compute hosts, gathers configured items of telemetry, and pushes them out for storage and future retrieval. The actual telemetry data could be saved in Ceilometer’s own database or the Gnocchi time series database with indices. The latter is superior, in both storage efficiency and retrieval speed. OpenStack Aodh supports defining rule-action pairs, such as whether some telemetry crosses a threshold and, if so, whether to emit an alarm. Alarms in turn could trigger some kind of operator intervention.

Enabling CMT in Nova*

OpenStack Nova provides access to the compute resources via a RESTful API and a web dashboard. To enable the CMT feature in Nova, the following preconditions have to be met:

  • The compute node hardware must support the CMT feature. CMT is supported by (but not limited to) the Intel® Xeon® processor E5 v3 and E5 v4 families; verify in the CPU specification that your processor supports CMT (a user-space capability check is sketched after this list).
  • The libvirt version installed on the Nova compute nodes must be 2.0.0 or greater.
  • The hypervisor running on the Nova compute host must be KVM (Kernel-based Virtual Machine).
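
To quickly check the first precondition from user space, here is a small illustrative C sketch (our own addition, not part of the official CMT enabling flow). It reads the CPUID bits documented in the Intel Software Developer's Manual (CPUID.07H:EBX.PQM[bit 12] for Intel RDT monitoring and CPUID.0FH.0H:EDX[bit 1] for L3 occupancy monitoring) and only confirms CPU capability, not kernel or libvirt support.

/* Check whether the CPU advertises Intel RDT monitoring (CMT).
 * Build with: gcc -o cmt_check cmt_check.c (uses GCC's <cpuid.h>). */
#include <stdio.h>
#include <cpuid.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid_count(0x7, 0, &eax, &ebx, &ecx, &edx)) {
        puts("CPUID leaf 7 not supported");
        return 1;
    }
    if (!(ebx & (1u << 12))) {              /* PQM: Platform QoS Monitoring */
        puts("Intel RDT monitoring (CMT) not supported");
        return 1;
    }
    __get_cpuid_count(0xF, 0, &eax, &ebx, &ecx, &edx);
    printf("L3 cache occupancy monitoring: %s\n",
           (edx & (1u << 1)) ? "supported" : "not supported");
    return 0;
}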

If all of the above preconditions are satisfied, and Nova is currently running, edit the libvirt section of the Nova configuration file (by default it is /etc/nova/nova.conf):

[libvirt]
virt_type = kvm
enabled_perf_events = cmt

After saving the above modifications, restart the Nova compute service.

The openstack-nova-compute service runs on each compute host.

On Ubuntu* and CentOS* 6.5 hosts, run the following commands to restart the Nova compute service:

# service openstack-nova-compute restart
# service openstack-nova-compute status

On CentOS 7 and Fedora* 20 hosts, run the following commands instead to restart the Nova compute service:

# systemctl restart openstack-nova-compute
# systemctl status openstack-nova-compute

Once Nova is restarted, any new VMs launched by Nova will have the CMT feature enabled.

If devstack is being used instead to install a fresh OpenStack environment, add the following to the devstack local.conf file:

[[post-config|$NOVA_CONF]]
[libvirt]
virt_type = kvm
enabled_perf_events = cmt, mbml, mbmt

After saving the above configuration, run devstack to start the installation.

Enabling CMT in Ceilometer*

Ceilometer is part of the OpenStack Telemetry project whose mission is to:

  • Reliably collect utilization data from each host and for the VMs running on those hosts.
  • Persist the data for subsequent retrieval and analysis.
  • Trigger actions when defined criteria are met.

To get the last-level cache usage of a running VM, Ceilometer must be installed, configured to collect the cpu_l3_cache metric, and running. Ceilometer collects this metric by default. The cpu_l3_cache metric is gathered by the Ceilometer agent running on the compute host, which periodically polls for VM utilization metrics.

If devstack is being used to install Ceilometer along with other OpenStack services and components, add the following in the devstack local.conf file:

[[local|localrc]]
enable_plugin ceilometer git://git.openstack.org/openstack/ceilometer
enable_plugin aodh git://git.openstack.org/openstack/aodh

After saving the above configuration, run devstack to start the installation. This will install Ceilometer as well as Aodh (OpenStack alarming service) in addition to other OpenStack services and components.

Storing the CMT Metrics

There are two options for saving telemetry data: Ceilometer’s own backend database or Gnocchi’s (Gnocchi is also a member of the OpenStack Telemetry project). Gnocchi provides a time-series database with a resource indexing service, which is vastly superior to Ceilometer’s native storage in performance at scale, disk utilization, and data retrieval speed. We recommend installing Gnocchi and configuring storage with it. To do so using devstack, modify the devstack local.conf file as follows:

[[local|localrc]]
enable_plugin ceilometer git://git.openstack.org/openstack/ceilometer
CEILOMETER_BACKEND=gnocchi
enable_plugin aodh git://git.openstack.org/openstack/aodh
enable_plugin gnocchi git://git.openstack.org/openstack/gnocchi

After saving the above configuration, run devstack to start the installation.

Refer to Gnocchi documentation for information on other Gnocchi installation methods.

After installing Gnocchi and Ceilometer, confirm that the following configuration settings are in place:

In the Ceilometer configuration file (by default it is /etc/ceilometer/ceilometer.conf), make sure the options are listed as follows:

[DEFAULT]
meter_dispatchers = gnocchi
[dispatcher_gnocchi]
filter_service_activity = False
archive_policy = low
url = <url to the Gnocchi API endpoint>

In the Gnocchi dispatcher configuration file (by default it is /etc/ceilometer/gnocchi_resources.yaml), make sure that the cpu_l3_cache metric is added into the resource type instance’s metrics list:

… …
  - resource_type: instance
    metrics:
      - 'instance'
      - 'memory'
      - 'memory.usage'
      - 'memory.resident'
      - 'vcpus'
      - 'cpu'
      - 'cpu_l3_cache'
… …

If any modifications are made to the above configuration files, you must restart the Ceilometer collector so that the new configurations take effect.

Verify Things are Working

To verify that all of the above are working, test as follows:

  1. Create a new VM.

    $ openstack server create --flavor m1.tiny --image cirros-0.3.4-x86_64-uec abc

  2. Confirm that the VM has been created successfully.

    $ openstack server list

     ID                                    Name  Status  Networks          Image Name
     7e38a89b-c829-4fb9-b44a-35090fbc0866  abc   ACTIVE  private=10.0.0.3  cirros-0.3.4-x86_64-uec

  3. Wait for some time to allow the Ceilometer agent to collect the cpu_l3_cache metrics. The wait time is determined by the related pipeline defined in the /etc/ceilometer/pipeline.yaml file.
  4. Check to see if the related metrics are collected and stored.
    1. If the metric is stored in Ceilometer’s own database backend, use the following command:
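
      (The exact CLI command is not reproduced in this text. With the Newton-era python-ceilometerclient, listing the collected samples for the meter configured above would typically look like the following; treat it as an assumed equivalent rather than the article's literal command.)

      $ ceilometer sample-list -m cpu_l3_cache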

      ID                                    Resource ID                           Name          Type   Volume    Unit  Timestamp
      f42e275a-b36a-11e6-96b2-525400e9f0eb  7e38a89b-c829-4fb9-b44a-35090fbc0866  cpu_l3_cache  gauge  270336.0  B     2016-12-08T23:57:37.535615
      8e872286-b369-11e6-96b2-525400e9f0eb  7e38a89b-c829-4fb9-b44a-35090fbc0866  cpu_l3_cache  gauge  450560.0  B     2016-12-08T23:47:37.505369
      28e57758-b368-11e6-96b2-525400e9f0eb  7e38a89b-c829-4fb9-b44a-35090fbc0866  cpu_l3_cache  gauge  270336.0  B     2016-12-08T23:37:37.536424
      ...

    2. However, if the metric is stored in Gnocchi, access it as follows:

      $ gnocchi measures show --resource-id 9184470a-594e-4a46-a124-fa3aaaf412dc cpu_l3_cache --aggregation mean

      Timestamp                  Granularity  Value
      2016-12-09T00:00:00+00:00  86400.0      282350.933333
      2016-12-09T01:00:00+00:00  3600.0       216268.8
      2016-12-09T01:45:00+00:00  300.0        180224.0
      2016-12-09T01:55:00+00:00  300.0        180224.0
      ...

Using CMT in OpenStack

A noisy neighbor in the OpenStack environment could be a VM consuming resources in a manner that adversely affects one or more other VMs on the same compute node. Whether because of a lack of knowledge of workload characteristics, missing information during Nova scheduling, or a change in workload characteristics (a spike in usage, a virus, or something else), a noisy-neighbor situation may occur on a host. The cloud admin might want to detect it and take some action, such as live migrating the greedy workload or terminating it. The OpenStack Aodh project enables detecting such scenarios and alerting on them using condition-action pairs. An Aodh rule that monitors VM cache usage crossing some threshold automates detection of noisy-neighbor scenarios.

Below, we illustrate setting up an Aodh rule to detect noisy neighbors. The actual rule depends upon where the CMT telemetry data is stored. We first cover storage in the Ceilometer database and then in the Gnocchi time-series database.

Metrics Stored in Ceilometer Database

Below, we define, using the Aodh command-line utility, a threshold CMT metrics rule:

$ aodh --debug alarm create --name cpu_l3_cache -t threshold --alarm-action "log://" --repeat-actions True --comparison-operator "gt" --threshold 180224 --meter-name cpu_l3_cache --period 600 --statistic avg

Field                      Value
alarm_actions              [u'log://']
alarm_id                   e3673d39-90ed-4455-80f1-fd7e06e1f2b8
comparison_operator        gt
description                Alarm when cpu_l3_cache is gt a avg of 180224 over 600 seconds
enabled                    True
evaluation_periods         1
exclude_outliers           False
insufficient_data_actions  []
meter_name                 cpu_l3_cache
name                       cpu_l3_cache
ok_actions                 []
period                     600
project_id                 f1730972dd484b94b3b943d93f3ee856
repeat_actions             True
query
severity                   low
state                      insufficient data
state_timestamp            2016-12-08T23:59:05.712994
statistic                  avg
threshold                  180224
time_constraints           []
timestamp                  2016-12-08T23:59:05.712994
type                       threshold
user_id                    cfcd1ea48a1046b192dbd3f5af11290e

This creates an alarm rule named cpu_l3_cache that is triggered if, and only if, within a sliding window of 10 minutes (600 seconds), the VM’s average cpu_l3_cache metric is greater than 180224. If the alarm is triggered, it is logged in the Aodh alarm notifier agent’s log. Alternatively, instead of just logging the alarm event, a notifier may be used to push a notification to one or more configured endpoints. For example, we could use the http notifier by providing "http://<endpoint ip>:<endpoint port>" as the alarm-action parameter.

Metrics Stored in Gnocchi*

If the metrics are stored in Gnocchi, an Aodh alarm could be created through a gnocchi_resources_threshold rule such as the following, using the Aodh command-line utility:

$ aodh --debug alarm create -t gnocchi_resources_threshold --name test1 --alarm-action "log://alarm" --repeat-actions True --metric cpu_l3_cache --threshold 100000 --resource-id 9184470a-594e-4a46-a124-fa3aaaf412dc --aggregation-method mean --resource-type instance --granularity 300 --comparison-operator 'gt'

Field                      Value
aggregation_method         mean
alarm_actions              [u'log://alarm']
alarm_id                   71f48ee1-b92f-4982-92e4-4c520649a8e0
comparison_operator        gt
description                gnocchi_resources_threshold alarm rule
enabled                    True
evaluation_periods         1
granularity                300
insufficient_data_actions  []
metric                     cpu_l3_cache
name                       test1
ok_actions                 []
period                     600
project_id                 543aa2e8e17449149d5c101c55675005
repeat_actions             True
resource_id                9184470a-594e-4a46-a124-fa3aaaf412dc
resource_type              instance
state                      insufficient data
state_timestamp            2016-12-09T05:57:07.089530
threshold                  100000
time_constraints           []
timestamp                  2016-12-09T05:57:07.089530
type                       gnocchi_resources_threshold
user_id                    ca859810b379425085756faf6fd04ded

This creates an alarm named test1 that is triggered if, and only if, the average cpu_l3_cache metric for VM 9184470a-594e-4a46-a124-fa3aaaf412dc, computed at 300-second granularity, is greater than the 100000 threshold given above. If triggered, an alarm is logged to the Aodh alarm notifier agent’s log output. Instead of the command-line utility, the Aodh RESTful API could be used to define alarms; refer to http://docs.openstack.org/developer/aodh/webapi/v2.html for details.

Gnocchi v3.0 is limited in its resource-querying capabilities with respect to comprehending metric types and thresholds; such enhancements are expected in future releases.

More About Intel® Resource Director Technology (Intel® RDT)

The Intel RDT family comprises, beyond CMT, other monitoring and resource allocation technologies. Those that will soon be available are:

  • Cache Allocation Technology (CAT) enables allocation of cache to workloads, either in exclusive or shared mode, to ensure performance despite co-resident (running on the same host) workloads. For instance, more cache can be allocated to a high-priority task that has a larger working set; conversely, cache usage can be restricted for a lower-priority streaming application so that it does not interfere with higher-priority tasks.
  • Memory Bandwidth Monitoring (MBM), along the lines of CMT, provides memory usage information for workloads.
  • Code Data Prioritization (CDP) enables separate control over code and data placement in the last-level cache.

To learn more visit http://www.intel.com/content/www/us/en/architecture-and-technology/resource-director-technology.html.

In conclusion, we hope the above provides you with adequate information to start using CMT in an OpenStack cloud to gain deeper insights into workload characteristics to positively influence performance.

Intel® VTune™ Amplifier Disk I/O analysis with Intel® Optane Memory


This article discusses Intel® VTune™ Amplifier disk I/O analysis with Intel® Optane™ memory. Benchmark tools such as CrystalDiskMark*, Iometer*, SYSmark*, or PCMark* evaluate system I/O efficiency and usually report a score. Power users and PC-gaming enthusiasts may be satisfied with those numbers for performance validation. But what about deeper technical information, such as identifying slow I/O activities, visualizing I/O queue depth over time, examining I/O API call stacks, and correlating I/O with other system metrics for debugging or profiling? Software developers need these clues to understand how I/O-efficient their programs are. VTune Amplifier provides such insights with its new Disk I/O analysis type.

A bit about I/O Performance metrics

First, there are some basics you might need to know. I/O queue depth, read/write latency, and I/O bandwidth are the metrics used to track I/O efficiency. I/O queue depth is the number of I/O commands waiting in a queue to be served. This queue depth (size) depends on the application, driver, and OS implementation, as well as on the host controller interface specification, such as AHCI or NVMe. Compared to AHCI with its single-queue design, NVMe’s multiple-queue design supports parallel operations.

Imagine that a software program issues multiple I/O requests that pass through frameworks, software libraries, a VM or container, runtimes, the OS I/O scheduler, and the driver down to the host controller of the I/O device. These requests can be temporarily delayed in any of these components due to different queue implementations and other reasons. Observing the change in the system’s queue depth helps you understand how busy the system’s I/O is and its overall I/O access patterns. From the OS perspective, a high queue depth represents a state in which the system is working to consume pending I/O requests, while a queue depth of zero means the I/O scheduler is idle. From the storage device perspective, a design with a high queue depth indicates that the storage media or controller can serve a large batch of I/O requests at higher speed than a lower-queue-depth design. Read/write latency shows how quickly the storage device completes or responds to an I/O request; for a given queue depth, its inverse is related to IOPS (I/O operations per second). As for I/O bandwidth, it is bounded by the capability of the host controller interface. For example, SATA 3.0 can achieve 600 MB/s of theoretical bandwidth, while NVMe over PCIe 3.0 x2 lanes can do ~1.87 GB/s.
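
As a rough back-of-the-envelope illustration of how these metrics relate (added here, not part of the original article), Little's law ties outstanding queue depth, average latency, and IOPS together: IOPS ≈ queue depth / latency. The small C sketch below uses made-up numbers and a 4 KiB request size.

/* Relating queue depth, latency, IOPS, and bandwidth via Little's law.
 * All numbers are illustrative assumptions. */
#include <stdio.h>

int main(void)
{
    double queue_depth = 16.0;          /* average outstanding I/O requests */
    double latency_s   = 100e-6;        /* 100 microseconds per request     */
    double iops        = queue_depth / latency_s;
    double bandwidth   = iops * 4096.0; /* 4 KiB per request, in bytes/s    */

    printf("IOPS      : %.0f\n", iops);
    printf("Bandwidth : %.1f MB/s\n", bandwidth / 1e6);
    return 0;
}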

 

Optane+NAND SSD

 

We expect system I/O performance to increase after adopting Intel® Optane™ memory with Intel® Rapid Storage Technology.

Insight from VTune for a workload running on an Optane-enabled setup

[Figure 1: I/O API effective time, SSD vs. SSD + Optane]

Figure 1 shows two VTune results based on a benchmark program, PCMark*, running on a single SATA NAND SSD versus a SATA NAND SSD plus an additional 16 GB NVMe Optane module in Intel® Rapid Storage Technology (IRST) RAID 0 mode. Beyond the basics covered in VTune’s online help for Disk I/O analysis, you can also observe the effective time of the I/O APIs by applying the “Task Domain” grouping view. As VTune indicates, the CPU time of the I/O APIs also improves with Optane’s acceleration. This makes sense because most of the I/O API calls are synchronous in this case, and I/O media accelerated by Optane respond quickly.

[Figure 2: Single I/O operation latency, SSD vs. SSD + Optane]

Figure 2 shows how VTune measures the latency of a single I/O operation. We compare the third FileRead operation of test #3 (importing pictures to Windows Photo Gallery) of the benchmark workload in both cases. It shows that Optane + SSD yields nearly a 5x gain in the speed of this read operation: roughly 60 us versus 300 us.

On a Linux* target, VTune also provides the page-fault metric. A page-fault event usually triggers disk I/O to handle page swapping. To avoid frequent disk I/O caused by page faults, the typical approach is to keep more pages in memory instead of swapping pages back to disk. Intel® Memory Drive Technology provides a solution to expand memory capacity, and Optane offers the closest proximity to memory speed. Because it is transparent to the application and the OS, it also mitigates the disk I/O penalty and further increases performance. One common misconception is that asynchronous I/O always improves an application’s I/O performance. In fact, asynchronous I/O adds responsiveness back to the application because it does not make the CPU wait; making the CPU wait is what happens when a synchronous I/O API is used and the I/O operation has not yet finished.
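
To make the synchronous-versus-asynchronous point concrete, here is a minimal C sketch (our addition, not from the original article) that submits a read with POSIX AIO so the calling thread is free to do other work while the request is in flight. The file path is a placeholder, error handling is trimmed, and on Linux the program is linked with -lrt.

/* Submit an asynchronous read and poll for completion instead of blocking. */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    static char buf[4096];
    int fd = open("/tmp/testfile", O_RDONLY);   /* placeholder path */
    if (fd < 0) { perror("open"); return 1; }

    struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof(buf);
    cb.aio_offset = 0;

    if (aio_read(&cb) != 0) { perror("aio_read"); return 1; }

    while (aio_error(&cb) == EINPROGRESS) {
        /* The application could do useful work here instead of waiting. */
    }
    printf("read %zd bytes\n", aio_return(&cb));
    close(fd);
    return 0;
}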

Beyond the software design suggestions above, the remaining performance lever is to upgrade your hardware to faster media. Intel® Optane™ is Intel’s leading-edge non-volatile memory technology, enabling memory-like performance at storage-like capacity and cost. VTune can then help squeeze out additional software performance by providing insightful analysis.

See also

Intel® Optane™ Technology

Intel® Rapid Storage Technology

Check Intel® VTune™ Amplifier in Intel® System Studio

Intel® VTune™ Amplifier online help - Disk Input and Output Analysis

How to use Disk I/O analysis in Intel® VTune™ Amplifier for systems

Memory Performance in a Nutshell

Use Intel® Optane™ Technology and Intel® 3D NAND SSDs to Build High-Performance Cloud Storage Solutions


Download Ceph configuration file  [1.9KB]

Introduction

As solid-state drives (SSDs) become more affordable, cloud providers are working to provide high-performance, highly reliable SSD-based storage for their customers. As one of the most popular open source scale-out storage solutions, Ceph faces increasing demand from customers who wish to use SSDs with Ceph to build high-performance storage solutions for their clouds.

The disruptive Intel® Optane™ Solid State Drive based on 3D XPoint™ technology fills the performance gap between DRAM and NAND-based SSDs. At the same time, Intel® 3D NAND TLC is reducing the cost gap between SSDs and traditional spindle hard drives, making all-flash storage an affordable option.

This article presents three Ceph all-flash storage system reference designs, and provides Ceph performance test results on the first Intel Optane and P4500 TLC NAND based all-flash cluster. This cluster delivers multi-million IOPS with extremely low latency as well as increased storage density with competitive dollar-per-gigabyte costs. Click on the link above for a Ceph configuration file with Ceph BlueStore tuning and optimization guidelines, including tuning for rocksdb to mitigate the impact of compaction.

What Motivates Red Hat Ceph* Storage All-Flash Array Development

Several motivations are driving the development of Ceph-based all-flash storage systems. Cloud storage providers (CSPs) are struggling to deliver performance at increasingly massive scale. A common scenario is to build an Amazon EBS-like service for an OpenStack*-based public/private cloud, leading many CSPs to adopt Ceph-based all-flash storage systems. Meanwhile, there is strong demand to run enterprise applications in the cloud. For example, customers are adapting OLTP workloads to run on Ceph when they migrate from traditional enterprise storage solutions. In addition to the major goal of leveraging the multi-purpose Ceph all-flash storage cluster to reduce TCO, performance is an important factor for these OLTP workloads. Moreover, with the steadily declining price of SSDs and efficiency-boosting technologies like deduplication and compression, an all-flash array is becoming increasingly acceptable.

Intel® Optane™ and 3D NAND Technology

Intel Optane technology provides an unparalleled combination of high throughput, low latency, high quality of service, and high endurance. It is a unique combination of 3D XPoint™ Memory Media, Intel Memory and Storage Controllers, Intel Interconnect IP, and Intel® software [1]. Together these building blocks deliver a revolutionary leap forward in decreasing latency and accelerating systems for workloads demanding large capacity and fast storage.

Intel 3D NAND technology improves regular two-dimensional storage by stacking storage cells, increasing capacity through higher density and lower cost per gigabyte, and it offers the reliability, speed, and performance expected of solid-state memory [3]. It offers a cost-effective replacement for traditional hard-disk drives (HDDs) to help customers accelerate user experiences, improve the performance of apps and services across segments, and reduce IT costs.

Intel Ceph Storage Reference Architectures

Based on different use cases and application characteristics, Intel has proposed three reference architectures (RAs) for Ceph-based all-flash arrays.

Standard configuration

The standard configuration is ideally suited for throughput-optimized workloads that need high-capacity storage with good performance. We recommend using an NVMe*/PCIe* SSD for journal and caching to achieve the best performance while balancing cost. Table 1 describes the RA using 1x Intel® SSD DC P4600 Series as a journal or BlueStore* RocksDB write-ahead log (WAL) device, 12x HDDs of up to 4 TB for data, an Intel® Xeon® processor, and an Intel® Network Interface Card.

Example: 1x 1.6 TB Intel SSD DC P4600 as a journal, Intel® Cache Acceleration Software, 12 HDDs, and an Intel® Xeon® processor E5-2650 v4.

Table 1. Standard configuration.

Ceph Storage Node Configuration – Standard

CPU               Intel® Xeon® processor E5-2650 v4
Memory            64 GB
NIC               Single 10GbE, Intel® 82599 10 Gigabit Ethernet Controller or Intel® Ethernet Controller X550
Storage           Data: 12x 4 TB HDD
                  Journal or WAL: 1x Intel® SSD DC P4600 1.6 TB
                  Caching: P4600
Caching Software  Intel® Cache Acceleration Software 3.0; option: Intel® Rapid Storage Technology enterprise/MD4.3; open source cache-like bcache/flashcache

TCO-Optimized Configuration

This configuration provides the best possible performance for workloads that need higher performance, especially for throughput, IOPS, and SLAs, with medium storage capacity requirements, leveraging a mix of NVMe and SATA SSDs.

Table 2. TCO-optimized configuration

Ceph Storage Node – TCO Optimized

CPU      Intel® Xeon® processor E5-2690 v4
Memory   128 GB
NIC      Dual 10GbE (20 Gb), Intel® 82599 10 Gigabit Ethernet Controller
Storage  Data: 4x Intel® SSD DC P4500 4, 8, or 16 TB, or Intel® DC SATA SSDs
         Journal or WAL: 1x Intel® SSD DC P4600 Series 1.6 TB

IOPS-Optimized Configuration

The IOPS-optimized configuration provides the best performance (throughput and latency), using Intel Optane SSDs as the journal (FileStore) or WAL device (BlueStore) for a standalone Ceph cluster.

  • All NVMe/PCIe SSD Ceph system
  • Intel Optane Solid State Drive for FileStore Journal or BlueStore WAL
  • NVMe/PCIe SSD data, Intel Xeon processor, Intel® NICs
  • Example: 4x Intel SSD P4500 4, 8, or 16 TB for data, 1x Intel® Optane™ SSD DC P4800X 375 GB as journal (or WAL and database), Intel Xeon processor, Intel® NICs.

Table 3. IOPS optimized configuration

Ceph* Storage Node – IOPS Optimized

CPU      Intel® Xeon® processor E5-2699 v4
Memory   >= 128 GB
NIC      2x 40GbE (80 Gb) or 4x dual 10GbE (80 Gb), Intel® Ethernet Converged Network Adapter X710 family
Storage  Data: 4x Intel® SSD DC P4500 4, 8, or 16 TB
         Journal or WAL: 1x Intel® Optane™ SSD DC P4800X 375 GB

Notes

  • Journal: Ceph supports multiple storage back ends. The most popular is FileStore, which stores its data on a file system (for example, XFS*). In FileStore, Ceph OSDs use a journal for speed and consistency; using an SSD as the journal device significantly improves Ceph cluster performance.
  • WAL: BlueStore is a new storage back end designed to replace FileStore in the near future. It overcomes several limitations of XFS and POSIX* that exist in FileStore. BlueStore stores data directly on raw partitions, while the OSD metadata is kept in RocksDB; RocksDB uses a write-ahead log (WAL) to ensure data consistency.
  • The RA is not a fixed configuration. We will continue to refresh it with the latest Intel® products.

Ceph All-Flash Array performance

This section presents a performance evaluation of the IOPS-optimized configuration based on Ceph BlueStore.

System configuration

The test system described in Table 4 consisted of five Ceph storage servers, each fitted with two Intel® Xeon® processor E5-2699 v4 CPUs, 128 GB of memory, 1x Intel® SSD DC P3700 2 TB as the BlueStore WAL device, and 4x Intel® SSD DC P3520 2 TB as data drives. Each server also had one Intel® Ethernet Converged Network Adapter X710 40 Gb NIC with its two ports bonded in mode 6, used for the separate Ceph cluster and public networks; the resulting topology is shown in Figure 1. The test system also included five client nodes, each fitted with two Intel Xeon processor E5-2699 v4 CPUs, 64 GB of memory, and one Intel Ethernet Converged Network Adapter X710 40 Gb NIC, again with two ports bonded in mode 6.

Ceph 12.0.0 (Luminous dev) was used, and each Intel SSD DC P3520 drive ran four OSD daemons. The rbd pool used for the testing was configured with two replicas.

Table 4. System configuration.

Ceph Storage Node – IOPS Optimized

CPU     Intel® Xeon® processor E5-2699 v4, 2.20 GHz
Memory  128 GB
NIC     1x 40Gb Intel® Ethernet Converged Network Adapter X710, two ports in bonding mode 6
Disks   1x Intel® SSD DC P3700 (2 TB) + 4x Intel® SSD DC P3520 2 TB

Software configuration

Ubuntu* 14.04, Ceph 12.0.0

Figure 1. Cluster topology.

Testing methodology

To simulate a typical usage scenario, four test patterns were selected using fio with librbd: 4K random read, 4K random write, 64K sequential read, and 64K sequential write. For each pattern, throughput (IOPS or bandwidth) was measured as the performance metric while scaling the number of volumes; the volume size was 30 GB. To get stable performance, the volumes were pre-allocated to bypass the performance impact of thin provisioning. The OSD page cache was dropped before each run to eliminate page-cache effects. For each test case, fio was configured with a 100-second warm-up and 300 seconds of data collection. Detailed fio testing parameters are included as part of the software configuration.

Performance overview

Table 5 shows promising performance after tuning on this five-node cluster. 64K sequential read and write throughput is 5630 MB/s and 4200 MB/s respectively (the maximums achievable with the Intel Ethernet Converged Network Adapter X710 NICs in bonding mode 6). 4K random read throughput is 1312K IOPS with 1.2 ms average latency, while 4K random write throughput is 331K IOPS with 4.8 ms average latency. The performance measured in the testing was roughly within expectations, except for a regression in the 64K sequential write tests compared with previous Ceph releases, which requires further investigation and optimization.

Table 5. Performance overview.

Pattern                Throughput  Average Latency
64KB Sequential Write  4200 MB/s   18.9 ms
64KB Sequential Read   5630 MB/s   17.7 ms
4KB Random Write       331K IOPS   4.8 ms
4KB Random Read        1312K IOPS  1.2 ms

Scalability tests

Figures 2 to 5 show the graph of throughput for 4K random and 64K sequential workloads with different number of volumes, where each fio was running in the volume with a queue depth of 16.

Ceph demonstrated excellent 4K random read performance on the all-flash array reference architecture. As the total number of volumes increased from 1 to 100, the total 4K random read IOPS peaked at around 1310K, with an average latency around 1.2 ms. The total 4K random write IOPS peaked at around 330K, with an average latency around 4.8 ms.

Figure 2. 4K Random read performance.

Figure 3. 4K random write performance load line.

For 64K sequential read and write, as the total number of volumes increased from 1 to 100, the sequential read throughput peaked at around 5630 MB/s, while sequential write peaked at around 4200 MB/s. The sequential write throughput was lower than in the previous Ceph release (11.0.2); it requires further investigation and optimization, so stay tuned for further updates.

Figure 4. 64K sequential read throughput

Figure 5. 64K sequential write throughput

Latency Improvement with Intel® Optane™ SSD

Figure 6 shows the latency comparison for 4K random write workloads with a 1x Intel® SSD DC P3700 Series 2.0 TB drive versus a 1x Intel® Optane™ SSD DC P4800X Series 375 GB drive used as the RocksDB and WAL device. The results show that with the Intel Optane SSD DC P4800X 375 GB as the RocksDB and WAL drive in Ceph BlueStore, latency was significantly reduced, including a 226 percent improvement in 99.99th-percentile latency.

Figure 6. 4K random read and 4K random write latency comparison

Summary

Ceph is one of the most popular open source scale-out storage solutions, and there is growing interest among cloud providers in building Ceph-based high-performance all-flash storage solutions. We proposed three reference architecture configurations targeting different usage scenarios. The test results, which simulated different workload patterns, demonstrated that a Ceph all-flash system can deliver very high performance with excellent latency.

Software configuration

Fio configuration used for the testing

Take 4K random read as an example.

[global]
    direct=1
    time_based
[fiorbd-randread-4k-qd16-30g-100-300-rbd]
    rw=randread
    bs=4k
    iodepth=16
    ramp_time=100
    runtime=300
    ioengine=rbd
    clientname=${RBDNAME}
    pool=${POOLNAME}
    rbdname=${RBDNAME}
    iodepth_batch_submit=1
    iodepth_batch_complete=1
    norandommap

References

  1. http://www.intel.com/content/www/us/en/architecture-and-technology/intel-optane-technology.html
  2. http://ceph.com
  3. http://www.intel.com/content/www/us/en/solid-state-drives/3d-nand-technology-animation.html

This sample source code is released under the Intel Sample Source Code License Agreement.

New Features in Intel® Xeon® Processor Scalable Family


Based on the Intel® Core™ microarchitecture (code-named Skylake) and manufactured on 14-nanometer process technology, these processors provide significant performance improvements over the previous-generation Intel® Xeon® processor v4 product family. This Intel® Xeon® processor family introduces many new technologies and new instructions that benefit integer operations and enhance security, as well as a new feature to allocate memory bandwidth.

A more in-depth discussion of the key features and architecture of the Intel® Xeon® processor Scalable family can be found in the technical overview document.

Key supported features

Intel® Memory Protection Extensions (Intel® MPX)

Intel® Memory Protection Extensions (Intel® MPX) checks for buffer overflows in software applications and ensures that memory references intended at compile time do not become unsafe at runtime. Details on how to implement this feature under Windows® 10 can be found in How to Protect Apps from Buffer Overflow Attacks. For developers working in Linux* environments, the article Intel® Memory Protection Extensions (Intel® MPX) to Linux* will guide you through using Intel MPX.
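
As a hypothetical illustration (our own, not taken from the linked articles) of the class of defect Intel MPX is designed to catch, the tiny C program below performs an out-of-bounds store. Built with an MPX-enabled compiler and runtime (for example, GCC's -mmpx -fcheck-pointer-bounds on a supporting kernel), the bad store raises a bound-range (#BR) exception instead of silently corrupting memory.

/* Off-by-one overflow: legal C syntax, undefined behavior at runtime. */
#include <stdio.h>

int main(void)
{
    int buf[8];
    int *p = buf;

    for (int i = 0; i <= 8; i++)    /* i == 8 writes past the end of buf */
        p[i] = i;

    printf("%d\n", buf[0]);
    return 0;
}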

More information about Intel MPX can be found in this enabling guide.

Intel® QuickAssist Technology (Intel® QAT)

Intel® QuickAssist Technology (Intel® QAT)  helps accelerate compression and cryptographic tasks by offloading the data to hardware capable of optimizing those functions. Intel QuickAssist Technology can be used for:

  • Bulk cryptography: Symmetric encryption and authentication, and cipher operations.
  • Public key cryptography: Asymmetric encryption, digital signatures, and key exchange.
  • Compression: Lossless data compression for data in flight and at rest.

An introduction to how Intel QuickAssist Technology benefits network functions can be found here. More about how this technology helps improve compression tasks is discussed in this collateral, which describes Intel QuickAssist Technology’s compression services and shows which Intel QuickAssist Technology APIs to use and their execution flow. These videos show the performance improvement from using Intel QuickAssist Technology and explain how to find performance issues and troubleshoot installation problems.

Intel® Advanced Vector Extensions 512 (Intel® AVX-512 )

Intel® Advanced Vector Extensions 512 (Intel® AVX-512) is a new SIMD instruction set operating on 512-bit registers. It is a set of new instructions that can accelerate performance for applications in areas such as scientific simulations, financial analytics, and artificial intelligence. Intel AVX-512 can do this because it packs 8 double-precision or 16 single-precision floating-point numbers, or 8 64-bit or 16 32-bit integers, into 512-bit vectors. To understand where to apply these instructions, read the article that discusses using Intel AVX-512 instructions to implement math functions. To use Intel AVX-512 from a high-level language such as C/C++, read this document that lists all intrinsic functions for Intel AVX-512.

The following example shows how to use intrinsic functions to process arrays of bits.
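
The original sample is not reproduced in this text; as an illustrative sketch of the idea, the following C function uses Intel AVX-512F intrinsics to AND two bit arrays 512 bits (64 bytes) at a time. The function name and the assumption that the length is a multiple of 64 bytes are ours; compile with an AVX-512-capable compiler (for example, gcc -mavx512f or the Intel compiler with -xCORE-AVX512).

/* AND two bit arrays using 512-bit vector loads, AND, and stores. */
#include <immintrin.h>
#include <stddef.h>

void bitwise_and_512(const unsigned char *a, const unsigned char *b,
                     unsigned char *out, size_t nbytes)
{
    for (size_t i = 0; i < nbytes; i += 64) {
        __m512i va = _mm512_loadu_si512((const void *)(a + i));
        __m512i vb = _mm512_loadu_si512((const void *)(b + i));
        _mm512_storeu_si512((void *)(out + i), _mm512_and_si512(va, vb));
    }
}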

More detail about Intel AVX-512 can be found in Intel® Architecture Instruction Set Extensions Programming Reference.

Intel® Omni-Path Architecture (Intel® OPA)

Intel® Omni-Path Architecture (Intel® OPA) offers low latency, low power consumption, and high throughput. A good overview of Intel OPA can be found in this document. Customers who want to implement Intel OPA will find this quick start guide useful; it outlines the basic steps for getting an Intel OPA cluster up and running. For more detail on installing software for Intel OPA, customers can refer to the Intel® Omni-Path Fabric (Intel® OP Fabric) Software installation guide, which shows how to install Intel OPA software and configure the Intel OPA chassis, switches, and so on. After the software has been installed, the next step is to set up and administer the fabric, and then fine-tune it to run efficiently on the customer system.

Reliability Accessibility Serviceability (RAS)

The Intel Xeon processor Scalable family introduces two new RAS features: Advanced Error Detection and Correction (AEDC) and Local Machine Check Exception (LMCE).

AEDC improves fault detection within the core execution engine using residue checking and parity protection techniques. AEDC tries to correct the fault by retrying the instruction; if the retry fails, a fatal MCERR is triggered.

LMCE improves the Machine Check Architecture (MCA) recovery-execution path and increases the possibility of recovery. It does so by letting the thread that consumed the uncorrected data handle the error without broadcasting fatal MCERR events to the rest of the threads in the system, a broadcast that would prevent system recovery.

More information can be found in the Intel Xeon processor scalable family new reliability features article.

Note: LMCE is an advanced RAS feature that only exists in four-socket systems or above.


New Reliability, Availability, and Serviceability (RAS) Features in the Intel® Xeon® Processor Family


Introduction

The Intel® Xeon® processor Scalable family introduces several new Reliability, Availability, and Serviceability (RAS) features across the product lineup (SKUs designated as Bronze, Silver, Gold, and Platinum). The newly added features can enhance the end-user experience through the platform’s ability to recover from bad-data consumption and its capability to detect bad instructions and retry the transaction in an attempt to recover. The processor also offers a new, innovative approach to mapping out failing DRAM devices to help prolong the usable life of DIMMs.

Adaptive Double DRAM Device Correction (ADDDC), Advanced Error Detection and Correction (AEDC), and Local Machine Check Exception (LMCE) are the features this collateral explores.

Adaptive Double DRAM Device Correction (ADDDC)

The Intel® Xeon® processor Scalable family introduces an innovative approach to managing errors that a DDR4 DRAM DIMM may develop over the life of the product. ADDDC is deployed at runtime to dynamically map out the failing DRAM device while continuing to provide SDDC ECC coverage on the DIMM, extending DIMM longevity. The operation occurs at the fine granularity of a DRAM bank and/or rank to minimize the impact on overall system performance.

With the advent of ADDDC, the memory subsystem is always configured to operate in performance mode. When the number of corrections on a DRAM device reaches the targeted threshold, the identified failing DRAM region is, with help from the UEFI runtime code, adaptively placed in lockstep mode, and the failing region of the DRAM device is mapped out of ECC. Once in ADDDC, cache-line ECC continues to cover single DRAM (x4) error detection and applies a correction algorithm to the nibble.

Depending on the processor SKU, each DDR4 channel supports one or two regions that can manage one or two faulty DRAMs, at bank and/or full-rank granularity. Because the operation is dynamic, the performance implications of lockstep operation become material only after a DRAM device is detected to be failing. The overall lockstep impact on system performance is thus a function of the number of bad DRAM devices on the channel, with the worst-case scenario being two bad ranks on every DDR4 channel.

The Silver/Bronze SKUs offer Adaptive Data Correction (ADC [SR]), at Bank granularity, and the Platinum/Gold SKUs offer Adaptive Double DRAM Device Correction (ADDDC [MR]), at Bank and Rank granularity, with additional hardware facilities for device map-out.

Advanced Error Detection and Correction (AEDC)

AEDC improves fault coverage within the core execution engine by utilizing proprietary residue-code fault-detection checking to identify and correct errors that the processor may encounter within the internal pipelines of the execution engine (arrays and logic). AEDC attempts to correct the fault by retrying the instruction. A successfully corrected retry is counted as a corrected event; otherwise, a fatal MCERR is logged and signaled.

AEDC technology in the processor is self-contained. It uses the existing error signaling and logs to flag errors, and needs no special assistance from the operating system to become operational. AEDC is offered across all product SKUs.

Local Machine Check Exception (LMCE)

LMCE is a new RAS operation that localizes the handling of bad-data consumption to the core that executed on the bad data. By localizing error handling in this manner, the system can prevent multiple machine-check conditions from occurring and improves the performance of the MCA Recovery — Execution Path.

By localizing error signaling, each remote core that encounters bad data can also invoke its own LMCE, each attempting recovery without interfering with the operation of the other cores. LMCE can thus enable successful recovery in a number of corner cases and improve recovery flows.

MCA Recovery — Execution Path

The MCA Recovery — Execution Path feature offers the capability for a system to continue to operate even when the processor is unable to correct data errors within the memory sub-system and allows software layers (operating system, VMM, DBMS, and applications) to participate in system recovery.

Recovery can occur for SRAR error types, and the machine-check architecture protocol requires the machine check error (MCERR) to be broadcast to all threads and a rendezvous point to be established. In cases where multiple cores consume the bad data in close proximity to one another, each thread signaling MCERR creates multiple MCERR conditions, resulting in an undesirable system shutdown.

LMCE can help overcome such conditions by localizing MCERR signaling to the consuming thread only, allowing each thread to recover from the bad data it consumed. This change in the protocol requires the operating system to be aware of the LMCE-ready platform and then opt in to support LMCE flows.

How LMCE is enabled

LMCE support requires processor, UEFI code, and operating system support for the operation. By default, the operation is disabled and can be enabled only if the ingredients are available in each of the stacks. The following steps need to be taken before LMCE can be used:

  1. The hardware indicates to the UEFI code that LMCE support is available in the SKU.
  2. In a firmware-first model, the UEFI code must comprehend the LMCE flow and signal to the operating system that the platform is ready to support it.
  3. The operating system needs to comprehend LMCE flows and check the platform’s readiness to support LMCE. If the operating system is not aware of this feature, then LMCE remains OFF.

More information about LMCE can be found in Intel® 64 and IA-32 Architecture Software Developer Manuals.

Conclusion

Intel Xeon processors continue to enhance system RAS feature offerings across all segments of the computing industry. Intel® Xeon® platforms using any of the processor SKUs, Bronze, Silver, Gold, or Platinum, can benefit from the enhancements. The new capabilities translate to higher system reliability and availability, achieved through innovative error detection and retry mechanisms, improvements to recovery methodology, and a performance-optimized memory subsystem capable of prolonging the useful life of the installed DDR4 DIMMs.

References

  1. Intel® Run Sure Technology
  2. Application of Residue Code for Error Detection
  3. Error Code Detection and Correction

Intel® Xeon® Processor Scalable Family Technical Overview


Executive Summary

Intel uses a tick-tock model for its processor generations. The new generation, the Intel® Xeon® processor Scalable family (formerly code-named Skylake-SP), is a “tock” based on 14 nm process technology. Major architecture changes take place on a “tock,” while minor architecture changes and a die shrink occur on a “tick.”

Figure 1. Tick-Tock model.

Intel Xeon processor Scalable family on the Purley platform is a new microarchitecture with many additional features compared to the previous-generation Intel® Xeon® processor E5-2600 v4 product family (formerly Broadwell microarchitecture). These features include increased processor cores, increased memory bandwidth, non-inclusive cache, Intel® Advanced Vector Extensions 512 (Intel® AVX-512), Intel® Memory Protection Extensions (Intel® MPX), Intel® Ultra Path Interconnect (Intel® UPI), and sub-NUMA clusters.

In previous generations, two- and four-socket processor families were segregated into different product lines. One of the big changes with the Intel Xeon processor Scalable family is that it includes all the processor models associated with this new generation. The processors in the Intel Xeon processor Scalable family scale from a two-socket configuration to an eight-socket configuration. They are Intel’s platform of choice for the most scalable and reliable performance, with the greatest variety of features and integrations designed to meet the needs of the widest variety of workloads.

Figure 2. New branding for processor models.

A two-socket Intel Xeon processor Scalable family configuration can be found at all levels from Bronze through Platinum, while a four-socket configuration is only found at the Gold through Platinum levels, and the eight-socket configuration only at the Platinum level. The Bronze level has the fewest features, and more features are added as you move toward Platinum. All available features are offered across the entire range of processor socket counts (two through eight) at the Platinum level.

Introduction

This paper discusses the new features and enhancements available in Intel Xeon processor Scalable family and what developers need to do to take advantage of them.

Intel® Xeon® processor Scalable family Microarchitecture Overview

Figure 3. Block Diagram of the Intel® Xeon® processor Scalable family microarchitecture.

The Intel Xeon processor Scalable family on the Purley platform provides up to 28 cores, which bring additional computing power to the table compared to the 22 cores of its predecessor. Additional improvements include a non-inclusive last-level cache, a larger 1MB L2 cache, faster 2666 MHz DDR4 memory, an increase to six memory channels per CPU, new memory protection features, Intel® Speed Shift Technology, on-die PMAX detection, integrated Fabric via Intel® Omni-Path Architecture (Intel® OPA), Internet Wide Area RDMA Protocol (iWARP)*, memory bandwidth allocation, Intel® Virtual RAID on CPU (Intel® VROC), and more.

Table 1. Generational comparison of the Intel Xeon processor Scalable family to the Intel® Xeon® processor E5-2600 and E7-4600 product families.


Intel Xeon processor Scalable family feature overview

The rest of this paper discusses the performance improvements, new capabilities, security enhancements, and virtualization enhancements in the Intel Xeon processor Scalable family.

Table 2. New features and technologies of the Intel Xeon processor Scalable family.


Skylake Mesh Architecture

In previous generations of Intel® Xeon® processor families (formerly Haswell and Broadwell) on the Grantley platform, the cores, last-level cache (LLC), memory controller, I/O controller, and inter-socket Intel® QuickPath Interconnect (Intel® QPI) ports are connected using a ring architecture, which has been in place for the last several generations of Intel® multi-core CPUs. As the number of cores on the CPU increased with each generation, access latency increased and the available bandwidth per core diminished. This trend was mitigated by dividing the chip into two halves and introducing a second ring to reduce distances and add bandwidth.

Figure 4. Intel® Xeon® processor E5-2600 product family (formerly Broadwell-EP) on Grantley platform ring architecture.

With additional cores per processor and much higher memory and I/O bandwidth in the Intel® Xeon® processor Scalable family, the additional demands on the on-chip interconnect could become a performance limiter with the ring-based architecture. Therefore, the Intel Xeon processor Scalable family introduces a mesh architecture to mitigate the increased latencies and bandwidth constraints associated with previous ring-based architecture. The Intel Xeon processor Scalable family also integrates the caching agent, the home agent, and the IO subsystem on the mesh interconnect in a modular and distributed way to remove bottlenecks in accessing these functions. Each core and LLC slice has a combined Caching and Home Agent (CHA), which provides scalability of resources across the mesh for Intel® Ultra Path Interconnect (Intel® UPI) cache coherency functionality without any hotspots.

The Intel Xeon processor Scalable family mesh architecture encompasses an array of vertical and horizontal communication paths allowing traversal from one core to another through a shortest path (hop on vertical path to correct row, and hop across horizontal path to correct column). The CHA located at each of the LLC slices maps addresses being accessed to specific LLC bank, memory controller, or IO subsystem, and provides the routing information required to reach its destination using the mesh interconnect.

Intel Xeon processor Scalable family mesh architecture
Figure 5. Intel Xeon processor Scalable family mesh architecture.

In addition to the improvements expected in the overall core-to-cache and core-to-memory latency, we also expect to see improvements in latency for IO-initiated accesses. In the previous generation of processors, in order to access data in LLC, memory, or IO, a core or IO would need to go around the ring and arbitrate through the switch between the rings if the source and target were not on the same ring. In the Intel Xeon processor Scalable family, a core or IO can access data in LLC, memory, or IO through the shortest path over the mesh.

Intel® Ultra Path Interconnect (Intel® UPI)

The previous generation of Intel® Xeon® processors utilized Intel QPI, which has been replaced on the Intel Xeon processor Scalable family with Intel UPI. Intel UPI is a coherent interconnect for scalable systems containing multiple processors in a single shared address space. Intel Xeon processors that support Intel UPI provide either two or three Intel UPI links for connecting to other Intel Xeon processors over a high-speed, low-latency path to the other CPU sockets. Intel UPI uses a directory-based home snoop coherency protocol, operates at speeds of up to 10.4 GT/s, improves power efficiency through an L0p low-power link state, improves data transfer efficiency over the link using a new packetization format, and includes protocol-layer improvements such as the removal of preallocation, which limited scalability with Intel QPI.

Typical two-socket configuration
Figure 6. Typical two-socket configuration.

Typical four-socket ring configuration
Figure 7. Typical four-socket ring configuration.

Typical four-socket crossbar configuration
Figure 8. Typical four-socket crossbar configuration.

Typical eight-socket configuration
Figure 9. Typical eight-socket configuration.

Intel® Ultra Path Interconnect Caching and Home Agent

Previous implementations of Intel Xeon processors provided a distributed Intel QPI caching agent located with each core and a centralized Intel QPI home agent located with each memory controller. Intel Xeon processor Scalable family processors implement a combined CHA that is distributed and located with each core and LLC bank, and thus provides resources that scale with the number of cores and LLC banks. The CHA is responsible for tracking requests from the core, responding to snoops from local and remote agents, and resolving coherency across multiple processors.

Intel UPI removes the requirement on preallocation of resources at the home agent, which allows the home agent to be implemented in a distributed manner. The distributed home agents are still logically a single Intel UPI agent that is address-interleaved across different CHAs, so the number of visible Intel UPI nodes is always one, irrespective of the number of cores, memory controllers used, or the sub-NUMA clustering mode. Each CHA implements a slice of the aggregated CHA functionality responsible for a portion of the address space mapped to that slice.

Sub-NUMA Clustering

A sub-NUMA cluster (SNC) is similar to the cluster-on-die (COD) feature that was introduced with Haswell, though there are some differences between the two. An SNC creates two localization domains within a processor by mapping addresses from one of the local memory controllers into the LLC slices in one half of the die, closer to that memory controller, and addresses mapped to the other memory controller into the LLC slices in the other half. Through this address-mapping mechanism, processes running on cores in one SNC domain that use memory from the memory controller in the same SNC domain observe lower LLC and memory latency compared with accesses mapped to locations outside of that SNC domain.

Unlike a COD mechanism where a cache line could have copies in the LLC of each cluster, SNC has a unique location for every address in the LLC, and it is never duplicated within the LLC banks. Also, localization of addresses within the LLC for each SNC domain applies only to addresses mapped to the memory controllers in the same socket. All addresses mapped to memory on remote sockets are uniformly distributed across all LLC banks independent of the SNC mode. Therefore even in the SNC mode, the entire LLC capacity on the socket is available to each core, and the LLC capacity reported through the CPUID is not affected by the SNC mode.

Figure 10 represents a two-cluster configuration consisting of SNC Domains 0 and 1 along with their associated cores, LLC banks, and memory controllers. Each SNC domain contains half of the cores on the socket, half of the LLC banks, and one of the memory controllers with three DDR4 channels. The affinity of cores, LLC, and memory within a domain is expressed using the usual NUMA affinity parameters to the OS, which can take SNC domains into account in scheduling tasks and allocating memory to a process for optimal performance.

SNC requires that memory is not interleaved in a fine-grain manner across memory controllers. In addition, SNC mode has to be enabled by BIOS to expose two SNC domains per socket and set up resource affinity and latency parameters for use with NUMA primitives.

Sub-NUMA cluster domains
Figure 10. Sub-NUMA cluster domains.
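To take advantage of SNC from application code, the usual NUMA APIs apply. Below is a minimal sketch (assuming a Linux* system with libnuma installed, compiled with -lnuma, and SNC enabled so that each SNC domain is exposed as its own NUMA node) that pins the calling thread to one NUMA node and allocates a working buffer from that node's local memory; node 0 is an illustrative assumption and the actual node numbering depends on the platform configuration.

#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }
    /* With SNC enabled, a two-socket system reports four NUMA nodes. */
    printf("Configured NUMA nodes: %d\n", numa_num_configured_nodes());

    numa_run_on_node(0);      /* run only on CPUs of node 0 (one SNC domain) */
    numa_set_preferred(0);    /* prefer memory from the same domain */

    size_t bytes = 64UL * 1024 * 1024;
    void *buf = numa_alloc_onnode(bytes, 0);   /* memory local to node 0 */
    if (buf == NULL) {
        perror("numa_alloc_onnode");
        return 1;
    }
    /* ... latency-sensitive work on buf ... */
    numa_free(buf, bytes);
    return 0;
}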

Directory-Based Coherency

Unlike the prior generation of Intel Xeon processors that supported four different snoop modes (no-snoop, early snoop, home snoop, and directory), the Intel Xeon processor Scalable family of processors only supports the directory mode. With the change in cache hierarchy to a non-inclusive LLC, the snoop resolution latency can be longer depending on where in the cache hierarchy a cache line is located. Also, with much higher memory bandwidth, the inter-socket Intel UPI bandwidth is a much more precious resource and could become a bottleneck in system performance if unnecessary snoops are sent to remote sockets. As a result, the optimization trade-offs for various snoop modes are different in Intel Xeon processor Scalable family compared to previous Intel Xeon processors, and therefore the complexity of supporting multiple snoop modes is not beneficial.

The Intel Xeon processor Scalable family carries forward some of the coherency optimizations from prior generations and introduces some new ones to reduce the effective memory latency. For example, some of the directory caching optimizations such as IO directory cache and HitME cache are still supported and further enhanced on the Intel Xeon processor Scalable family. The opportunistic broadcast feature is also supported, but it is used only with writes to local memory to avoid memory access due to directory lookup.

For the IO directory cache (IODC), the Intel Xeon processor Scalable family provides an eight-entry directory cache per CHA to cache the directory state of IO writes from remote sockets. IO writes usually require multiple transactions to invalidate a cache line from all caching agents followed by a writeback to put updated data in memory or in the home socket's LLC. With the directory information stored in memory, multiple accesses may be required to retrieve and update the directory state. IODC reduces accesses to memory to complete IO writes by keeping the directory information cached in the IODC structure.

HitME cache is another capability in the CHA that caches directory information for speeding up cache-to-cache transfer. With the distributed home agent architecture of the CHA, the HitME cache resources scale with number of CHAs.

Opportunistic Snoop Broadcast (OSB) is another feature carried over from previous generations into the Intel Xeon processor Scalable family. OSB broadcasts snoops when the Intel UPI link is lightly loaded, thus avoiding a directory lookup from memory and reducing memory bandwidth. In the Intel Xeon processor Scalable family, OSB is used only for local InvItoE (generated due to full-line writes from the core or IO) requests since data read is not required for this operation. Avoiding directory lookup has a direct impact on saving memory bandwidth.

Cache Hierarchy Changes

Generational cache comparison
Figure 11. Generational cache comparison.

In the previous generation the mid-level cache was 256 KB per core and the last level cache was a shared inclusive cache with 2.5 MB per core. In the Intel Xeon processor Scalable family, the cache hierarchy has changed to provide a larger MLC of 1 MB per core and a smaller shared non-inclusive 1.375 MB LLC per core. A larger MLC increases the hit rate into the MLC resulting in lower effective memory latency and also lowers demand on the mesh interconnect and LLC. The shift to a non-inclusive cache for the LLC allows for more effective utilization of the overall cache on the chip versus an inclusive cache.

If the core on the Intel Xeon processor Scalable family has a miss on all the levels of the cache, it fetches the line from memory and puts it directly into MLC of the requesting core, rather than putting a copy into both the MLC and LLC as was done on the previous generation. When the cache line is evicted from the MLC, it is placed into the LLC if it is expected to be reused.

Due to the non-inclusive nature of LLC, the absence of a cache line in LLC does not indicate that the line is not present in private caches of any of the cores. Therefore, a snoop filter is used to keep track of the location of cache lines in the L1 or MLC of cores when it is not allocated in the LLC. On the previous-generation CPUs, the shared LLC itself took care of this task.

Even with the changed cache hierarchy in Intel Xeon processor Scalable family, the effective cache available per core is roughly the same as the previous generation for a usage scenario where different applications are running on different cores. Because of the non-inclusive nature of LLC, the effective cache capacity for an application running on a single core is a combination of MLC cache size and a portion of LLC cache size. For other usage scenarios, such as multithreaded applications running across multiple cores with some shared code and data, or a scenario where only a subset of the cores on the socket are used, the effective cache capacity seen by the applications may seem different than previous-generation CPUs. In some cases, application developers may need to adapt their code to optimize it with the changed cache hierarchy on the Intel Xeon processor Scalable family of processors.

Page Protection Keys

Memory corruption caused by stray writes is an issue with complex multithreaded applications. For example, not every part of the code in a database application needs to have the same level of privilege. The log writer should have write privileges to the log buffer, but it should have only read privileges on other pages. Similarly, in an application with producer and consumer threads for some critical data structures, producer threads can be given additional rights over consumer threads on specific pages.

The page-based memory protection mechanism can be used to harden applications. However, page table changes are costly for performance since these changes require Translation Lookaside Buffer (TLB) shoot downs and subsequent TLB misses. Protection keys provide a user-level, page-granular way to grant and revoke access permission without changing page tables.

Protection keys provide 16 domains for user pages and use bits 62:59 of the page table leaf nodes (for example, PTE) to identify the protection domain (PKEY). Each protection domain has two permission bits in a new thread-private register called PKRU. On a memory access, the page table lookup is used to determine the protection domain (PKEY) of the access, and the corresponding protection domain-specific permission is determined from PKRU register content to see if access and write permission is granted. An access is allowed only if both protection keys and legacy page permissions allow the access. Protection keys violations are reported as page faults with a new page fault error code bit. Protection keys have no effect on supervisor pages, but supervisor accesses to user pages are subject to the same checks as user accesses.

Diagram of memory data access with protection key
Figure 12. Diagram of memory data access with protection key.

In order to benefit from protection keys, support is required from the virtual machine manager, OS, and compiler. Utilizing this feature does not cause a performance impact because it is an extension of the memory management architecture.
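As a minimal sketch of how an application can exercise protection keys on Linux* (assuming a kernel and CPU with protection key support and glibc 2.27 or later, which provides the pkey_alloc, pkey_mprotect, pkey_set, and pkey_free wrappers), the following code tags one page with a new key and revokes write access for the calling thread without touching the page tables:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    int pkey = pkey_alloc(0, 0);             /* allocate a new protection domain */
    if (pkey < 0) { perror("pkey_alloc"); return 1; }

    pkey_mprotect(buf, len, PROT_READ | PROT_WRITE, pkey);  /* tag the page */

    pkey_set(pkey, PKEY_DISABLE_WRITE);      /* this thread: reads only */
    printf("first byte: %d\n", buf[0]);      /* reads are still allowed */
    /* buf[0] = 1; would now fault with a protection-key page fault */

    pkey_set(pkey, 0);                       /* restore write access */
    buf[0] = 1;

    pkey_free(pkey);
    munmap(buf, len);
    return 0;
}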

Intel® Memory Protection Extensions (Intel® MPX)

C/C++ pointer arithmetic is a convenient language construct often used to step through an array of data structures. If an iterative write operation does not take into consideration the bounds of the destination, adjacent memory locations may get corrupted. Such unintended modification of adjacent data is referred to as a buffer overflow. Buffer overflows have been known to be exploited, causing denial-of-service (DoS) attacks and system crashes. Similarly, uncontrolled reads could reveal cryptographic keys and passwords. More sinister attacks, which do not immediately draw the attention of the user or system administrator, alter the code execution path, such as modifying the return address in the stack frame to execute malicious code or script.

Intel’s Execute Disable Bit and similar hardware features from other vendors have blocked buffer overflow attacks that redirected the execution to malicious code stored as data. Intel® MPX technology consists of new Intel® architecture instructions and registers that compilers can use to check the bounds of a pointer at runtime before it is used. This new hardware technology is supported by the compiler.

New Intel® Memory Protection Extensions instructions and example of their effect on memory
Figure 13. New Intel® Memory Protection Extensions instructions and example of their effect on memory.

For additional information see Intel® Memory Protection Extensions Enabling Guide.

Mode-Based Execute (MBE) Control

MBE provides finer grain control on execute permissions to help protect the integrity of the system code from malicious changes. It provides additional refinement within the Extended Page Tables (EPT) by turning the Execute Enable (X) permission bit into two options:

  • XU for user pages
  • XS for supervisor pages

The CPU selects one or the other based on permission of the guest page and maintains an invariant for every page that does not allow it to be writable and supervisor-executable at the same time. A benefit of this feature is that a hypervisor can more reliably verify and enforce the integrity of kernel-level code. The value of the XU/XS bits is delivered through the hypervisor, so hypervisor support is necessary.

Intel® Advanced Vector Extensions 512 (Intel® AVX-512)

Generational overview of Intel® Advanced Vector Extensions technology
Figure 14. Generational overview of Intel® Advanced Vector Extensions technology.

Intel® Advanced Vector Extensions 512 (Intel® AVX-512) was originally introduced with the Intel® Xeon Phi™ processor product line (formerly Knights Landing). Certain Intel AVX-512 instruction groups (AVX512CD and AVX512F) are common to the Intel® Xeon Phi™ processor product line and the Intel Xeon processor Scalable family. However, the Intel Xeon processor Scalable family introduces new Intel AVX-512 instruction groups (AVX512BW and AVX512DQ) as well as a new capability (AVX512VL) to expand the benefits of the technology. The AVX512DQ instruction group is focused on new additions that benefit high-performance computing (HPC) workloads such as oil and gas, seismic modeling, the financial services industry, molecular dynamics, ray tracing, double-precision matrix multiplication, fast Fourier transform and convolutions, and RSA cryptography. The AVX512BW instruction group supports byte/word operations, which can benefit some enterprise applications, media applications, as well as HPC. AVX512VL is not an instruction group but a feature that provides vector length orthogonality.

Broadwell, the previous processor generation, has up to two floating point FMAs (Fused Multiply Add units) per core, and this has not changed with the Intel Xeon processor Scalable family. However, the Intel Xeon processor Scalable family doubles the number of elements that can be processed compared to Broadwell, as the FMAs on the Intel Xeon processor Scalable family of processors have been expanded from 256 bits to 512 bits.

Generation feature comparison of Intel® Advanced Vector Extensions technology
Figure 15. Generation feature comparison of Intel® Advanced Vector Extensions technology.

Intel AVX-512 instructions offer the highest degree of support to software developers by including an unprecedented level of richness in the design of the instructions. This includes 512-bit operations on packed floating-point data or packed integer data, embedded rounding controls (override global settings), embedded broadcast, embedded floating-point fault suppression, embedded memory fault suppression, additional gather/scatter support, high-speed math instructions, and compact representation of large displacement value. The following sections cover some of the details of the new features of Intel AVX-512.

AVX512DQ

The doubleword and quadword instructions, indicated by the AVX512DQ CPUID flag, enhance integer and floating-point operations. They consist of additional instructions that operate on 512-bit vectors containing 16 32-bit elements or 8 64-bit elements. Some of these instructions provide new functionality, such as the conversion of floating point numbers to 64-bit integers. Other instructions promote existing instructions, such as vxorps, to use 512-bit registers.
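As a small illustration (a sketch using compiler intrinsics, assuming a compiler and processor with AVX512DQ support), converting eight packed double-precision values directly to 64-bit integers is now a single instruction:

#include <immintrin.h>

/* vcvtpd2qq: convert 8 packed doubles to 8 packed 64-bit integers (AVX512DQ). */
__m512i round_doubles_to_int64(__m512d x) {
    return _mm512_cvtpd_epi64(x);
}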

AVX512BW

The byte and word instructions, indicated by the AVX512BW CPUID flag, enhance integer operations, extending write-masking and zero-masking to support smaller element sizes. The original Intel AVX-512 Foundation instructions supported such masking with vector element sizes of 32 or 64 bits, because a 512-bit vector register could hold at most 16 32-bit elements, so a write mask size of 16 bits was sufficient.

An instruction indicated by the AVX512BW CPUID flag requires a write mask size of up to 64 bits because a 512-bit vector register can hold 64 8-bit elements or 32 16-bit elements. Two new mask types (__mmask32 and __mmask64) along with additional maskable intrinsics have been introduced to support this operation.
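For example, the sketch below (intrinsics, assuming AVX512BW support) compares 64 byte lanes at once, producing a 64-bit mask of the new __mmask64 type:

#include <immintrin.h>

/* Returns a 64-bit mask with one bit per (signed) byte lane that exceeds the threshold. */
__mmask64 bytes_above_threshold(__m512i bytes, char threshold) {
    return _mm512_cmpgt_epi8_mask(bytes, _mm512_set1_epi8(threshold));
}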

AVX512VL

An additional orthogonal capability known as Vector Length Extensions provides for most Intel AVX-512 instructions to operate on 128 or 256 bits instead of only 512. Vector Length Extensions can currently be applied to most Foundation Instructions and the Conflict Detection Instructions, as well as the new byte, word, doubleword, and quadword instructions. These Intel AVX-512 Vector Length Extensions are indicated by the AVX512VL CPUID flag. The use of Vector Length Extensions extends most Intel AVX-512 operations to also operate on XMM (128-bit, SSE) registers and YMM (256-bit, AVX) registers, and allows the capabilities of EVEX encodings, including the use of mask registers and access to registers 16..31, to be applied to XMM and YMM registers instead of only to ZMM registers.

Mask Registers

In previous generations of Intel® Advanced Vector Extensions and Intel® Advanced Vector Extensions 2, the ability to mask bits was limited to load and store operations. In Intel AVX-512 this capability has been greatly expanded with eight new opmask registers used for conditional execution and efficient merging of destination operands. The width of each opmask register is 64 bits, and they are identified as k0–k7. Seven of the eight opmask registers (k1–k7) can be used in conjunction with EVEX-encoded Intel AVX-512 Foundation Instructions to provide conditional processing, such as with vectorized remainders that only partially fill the register, while the opmask register k0 is typically treated as a "no mask" value when unconditional processing of all data elements is desired. Additionally, the opmask registers are also used as vector flags/element-level vector sources to introduce novel SIMD functionality, as seen in new instructions such as VCOMPRESSPS. Support for the 512-bit SIMD registers and the opmask registers is managed by the operating system using XSAVE/XRSTOR/XSAVEOPT instructions (see Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2B, and Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A).

Example of opmask register k1
Figure 16. Example of opmask register k1.
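A minimal sketch of the vectorized-remainder use case is shown below (intrinsics, assuming AVX512F support); the tail mask enables only the first n mod 16 lanes, so the final partial vector is processed without reading or writing past the end of the arrays:

#include <immintrin.h>
#include <stddef.h>

void add_arrays(float *dst, const float *a, const float *b, size_t n) {
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m512 va = _mm512_loadu_ps(&a[i]);
        __m512 vb = _mm512_loadu_ps(&b[i]);
        _mm512_storeu_ps(&dst[i], _mm512_add_ps(va, vb));
    }
    if (i < n) {
        /* Mask with one bit set per remaining element (n - i < 16). */
        __mmask16 tail = (__mmask16)((1u << (n - i)) - 1u);
        __m512 va = _mm512_maskz_loadu_ps(tail, &a[i]);
        __m512 vb = _mm512_maskz_loadu_ps(tail, &b[i]);
        _mm512_mask_storeu_ps(&dst[i], tail, _mm512_add_ps(va, vb));
    }
}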

Embedded Rounding

Embedded rounding provides additional support for math calculations by allowing the floating point rounding mode to be explicitly specified for an individual operation, without having to modify the rounding controls in the MXCSR control register. In previous SIMD instruction extensions, rounding control is generally specified in the MXCSR control register, with a handful of instructions providing per-instruction rounding override via encoding fields within the imm8 operand. Intel AVX-512 offers a more flexible encoding attribute to override MXCSR-based rounding control for floating-point instructions with rounding semantics. This rounding attribute embedded in the EVEX prefix is called Static (per instruction) Rounding Mode or Rounding Mode override. Static rounding also implies suppress-all-exceptions (SAE) behavior, as if all floating point exceptions are disabled and no status flags are set. Static rounding enables better accuracy control in intermediate steps for division and square root operations for extra precision, while the default MXCSR rounding mode is used in the last step. It can also help in cases where precision is needed in the least significant bit, such as in range reduction for trigonometric functions.
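As a short sketch (intrinsics, assuming AVX512F support), the rounding mode of a single addition can be overridden per instruction, with exceptions suppressed, while the global MXCSR setting is left untouched:

#include <immintrin.h>

/* Add with round-toward-negative-infinity and suppress-all-exceptions (SAE),
 * without changing the MXCSR rounding control. */
__m512d add_round_down(__m512d a, __m512d b) {
    return _mm512_add_round_pd(a, b, _MM_FROUND_TO_NEG_INF | _MM_FROUND_NO_EXC);
}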

Embedded Broadcast

Embedded broadcast provides a bit-field to encode data broadcast for some load-op instructions such as instructions that load data from memory and perform some computational or data movement operation. A source element from memory can be broadcasted (repeated) across all elements of the effective source operand, without requiring an extra instruction. This is useful when we want to reuse the same scalar operand for all operations in a vector instruction. Embedded broadcast is only enabled on instructions with an element size of 32 or 64 bits and not on byte and word instructions.
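The sketch below shows a typical load-op candidate (intrinsics, assuming AVX512F support); when compiling for Intel AVX-512, the compiler can usually fold the broadcast scalar into the multiply as an embedded-broadcast memory operand instead of materializing it with a separate instruction, although the exact code generation is compiler dependent:

#include <immintrin.h>
#include <stddef.h>

void scale_array(float *x, float scale, size_t n) {
    __m512 vscale = _mm512_set1_ps(scale);   /* broadcast candidate */
    for (size_t i = 0; i + 16 <= n; i += 16) {
        __m512 v = _mm512_loadu_ps(&x[i]);
        _mm512_storeu_ps(&x[i], _mm512_mul_ps(v, vscale));
    }
}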

Quadword Integer Arithmetic

Quadword integer arithmetic removes the need for expensive software emulation sequences. These instructions include gather/scatter with D/Qword indices, and instructions that can partially execute, where k-reg mask is used as a completion mask.

Table 3. Quadword integer arithmetic instructions.

Table 3 Quadword integer arithmetic instructions

Math Support

Math support is designed to aid with math library writing and to benefit financial applications. Data types that are available include PS, PD, SS, and SD. IEEE division/square root formats, DP transcendental primitives, and new transcendental support instructions are also included.

Table 4. Math support instructions.

Table 4 Math support instructions

New Permutation Primitives

Intel AVX-512 introduces new permutation primitives such as 2-source shuffles with 16/32-entry table lookups with transcendental support, matrix transpose, and a variable VALIGN emulation.

Table 5. 2-Source shuffles instructions.

Table 5 2-Source shuffles instructions

Example of a 2-source shuffles operation
Figure 17. Example of a 2-source shuffles operation.
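A minimal sketch of a 2-source shuffle (intrinsics, assuming AVX512F support) is shown below; each destination lane is selected from one of the two source vectors through a 32-entry index table, where indices 0..15 refer to the first source and 16..31 to the second:

#include <immintrin.h>

/* Interleave the even-numbered lanes of a and b using a single vpermt2ps. */
__m512 interleave_even_lanes(__m512 a, __m512 b) {
    const __m512i idx = _mm512_setr_epi32(0, 16, 2, 18, 4, 20, 6, 22,
                                          8, 24, 10, 26, 12, 28, 14, 30);
    return _mm512_permutex2var_ps(a, idx, b);
}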

Expand and Compress

Expand and compress allow vectorization of conditional loops. Similar to the FORTRAN pack/unpack intrinsics, they also provide memory fault suppression, can be faster than using gather/scatter, and compress provides the opposite operation to expand. The figure below shows an example of an expand operation.

Expand instruction and diagram
Figure 18. Expand instruction and diagram.
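As a minimal sketch (intrinsics, assuming AVX512F and POPCNT support), the compress form packs only the mask-selected elements into contiguous memory, which is a common pattern when vectorizing a conditional loop:

#include <immintrin.h>

/* Store only the positive elements of one 16-float vector contiguously to dst;
 * returns the number of elements written. */
int compress_positive(const float *src, float *dst) {
    __m512 v = _mm512_loadu_ps(src);
    __mmask16 m = _mm512_cmp_ps_mask(v, _mm512_setzero_ps(), _CMP_GT_OQ);
    _mm512_mask_compressstoreu_ps(dst, m, v);
    return _mm_popcnt_u32((unsigned)m);
}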

Bit Manipulation

Intel AVX-512 provides support for bit manipulation operations on mask and vector operands, including vector rotate. These operations can be used to manipulate mask registers, and they have applications in cryptography algorithms.

Table 6. Bit manipulation instructions.

Bit manipulation instructions

Universal Ternary Logical Operation

A universal ternary logical operation is another feature of Intel AVX-512 that provides a way to mimic an FPGA cell. The VPTERNLOGD and VPTERNLOGQ instructions operate on dword and qword elements and take three bit-vectors of the respective input data elements to form a set of 32/64 indices, where each 3-bit value provides an index into an 8-bit lookup table represented by the imm8 byte of the instruction. The 256 possible values of the imm8 byte can be viewed as a 16x16 Boolean logic table, which can be filled with simple or compound Boolean logic expressions.
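For example, the following sketch (intrinsics, assuming AVX512F support) computes a three-input XOR in one instruction per 512-bit vector; the immediate 0x96 is the 8-bit truth table for a XOR b XOR c:

#include <immintrin.h>

/* vpternlogd with imm8 = 0x96 implements a ^ b ^ c per 32-bit element. */
__m512i xor3(__m512i a, __m512i b, __m512i c) {
    return _mm512_ternarylogic_epi32(a, b, c, 0x96);
}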

Conflict Detection Instructions

Intel AVX-512 introduces new conflict detection instructions. This includes the VPCONFLICT instruction along with a subset of supporting instructions. The VPCONFLICT instruction allows for detection of elements with previous conflicts in a vector of indexes. It can generate a mask with a subset of elements that are guaranteed to be conflict free. The computation loop can be re-executed with the remaining elements until all the indexes have been operated on.

Table 7. Conflict detection instructions.

Conflict detection instructions

VPCONFLICT{D,Q} zmm1{k1}{z}, zmm2/B(mV): for every element in ZMM2, compare it against every other element and generate a mask identifying the matches, ignoring elements to the left of the current one, that is, "newer" elements.

Diagram of mask generation for VPCONFLICT
Figure 19. Diagram of mask generation for VPCONFLICT.
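A minimal sketch using intrinsics (assuming AVX512CD support) is shown below; lanes whose conflict word is zero have no duplicate among the earlier lanes and can therefore be processed safely in the current vector iteration:

#include <immintrin.h>

/* Returns a mask of lanes in 'indices' that do not repeat an earlier lane. */
__mmask16 conflict_free_lanes(__m512i indices) {
    __m512i conflicts = _mm512_conflict_epi32(indices);
    return _mm512_cmpeq_epi32_mask(conflicts, _mm512_setzero_si512());
}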

In order to benefit from CDI, use Intel compilers version 16.0 in Intel® C++ Composer XE 2016, which will recognize potential run-time conflicts and generate VPCONFLICT loops automatically.

Transcendental Support

Additional 512-bit instruction extensions have been provided to accelerate certain transcendental mathematic computations and can be found in the instructions VEXP2PD, VEXP2PS, VRCP28xx, and VRSQRT28xx, also known as Intel AVX-512 Exponential and Reciprocal instructions. These can benefit some finance applications.

Compiler Support

Intel AVX-512 optimizations are included in Intel compilers version 16.0 in Intel C++ Composer XE 2016 and the GNU* Compiler Collection (GCC) 5.0 (NASM 2.11.08 and binutils 2.25). Table 8 summarizes compiler arguments for optimization on the Intel Xeon processor Scalable family microarchitecture with Intel AVX-512.

Table 8. Summary of Intel Xeon processor Scalable family compiler optimizations.

Table 8 Summary of Intel Xeon processor Scalable family compiler optimizations

For more information see the Intel® Architecture Instruction Set Extensions Programming Reference.

Time Stamp Counter (TSC) Enhancement for Virtualization

The Intel Xeon processor Scalable family introduces a new TSC scaling feature to assist with migration of a virtual machine across different systems. In previous Intel Xeon processors, the TSC of a VM cannot automatically adjust itself to compensate for a processor frequency difference as it migrates from one platform to another. The Intel Xeon processor Scalable family enhances TSC virtualization support by adding a scaling feature in addition to the offsetting feature available in prior-generation CPUs. For more details on this feature see Intel® 64 SDM (search for “TSC Scaling”, e.g., Vol 3A – Sec 24.6.5, Sec 25.3, Sec 36.5.2.6).

Intel® Resource Director Technology (Intel® RDT)

Intel® Resource Director Technology (Intel® RDT) is a set of technologies designed to help monitor and manage shared resources. See Optimize Resource Utilization with Intel® Resource Director Technology for an animation illustrating the key principles behind Intel RDT. Intel RDT already has several existing features that provide benefit such as Cache Monitoring Technology (CMT), Cache Allocation Technology (CAT), Memory Bandwidth Monitoring (MBM), and Code Data Prioritization (CDP). The Intel Xeon processor Scalable family on the Purley platform introduces a new feature called Memory Bandwidth Allocation (MBA) which has been added to provide a per-thread memory bandwidth control. Through software the amount of memory bandwidth consumption of a thread or core can be limited. This feature can be used in conjunction with MBM to isolate a noisy neighbor. Chapter 17.16 in volume 3 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual (SDM) covers programming details on Intel RDT features. Using this feature requires enabling at the OS or VMM level, and the Intel® Virtualization Technology (Intel® VT) for IA-32, Intel® 64 and Intel® Architecture (Intel® VT-x) feature must be enabled at the BIOS level. For instructions on setting Intel VT-x, refer to your OEM BIOS guide.
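On Linux, the Intel RDT features, including MBA, are exposed through the kernel resctrl filesystem. The sketch below assumes resctrl is mounted at /sys/fs/resctrl with MBA support enabled; the group name, domain ID, bandwidth percentage, and PID are illustrative assumptions only, and the schemata format follows the kernel resctrl documentation.

#include <stdio.h>
#include <sys/stat.h>

int main(void) {
    /* Create a resource group for lower-priority tasks. */
    mkdir("/sys/fs/resctrl/lowprio", 0755);

    /* Cap the group's memory bandwidth to roughly 50 percent on domain 0. */
    FILE *f = fopen("/sys/fs/resctrl/lowprio/schemata", "w");
    if (f == NULL) { perror("schemata"); return 1; }
    fprintf(f, "MB:0=50\n");
    fclose(f);

    /* Move the noisy task (example PID) into the throttled group. */
    f = fopen("/sys/fs/resctrl/lowprio/tasks", "w");
    if (f == NULL) { perror("tasks"); return 1; }
    fprintf(f, "%d\n", 12345);
    fclose(f);
    return 0;
}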

Memory Bandwidth Monitoring (MBM) and Memory Bandwidth Allocation (MBA)

Conceptual diagram of using Memory Bandwidth Monitoring and Memory Bandwidth Allocation
Figure 20. Conceptual diagram of using Memory Bandwidth Monitoring to identify noisy neighbor (core 0) and then using Memory Bandwidth Allocation to prioritize memory bandwidth.

Intel® Speed Shift Technology

Broadwell introduced Hardware Power Management (HWPM), an optional processor power management feature in the hardware that liberates the OS from making decisions about processor frequency. HWPM allows the platform to provide information on all available constraints, allowing the hardware to choose the optimal operating point. Operating independently, the hardware uses information that is not available to software and is able to make a more optimized decision with regard to the P-states and C-states. The Intel Xeon processor Scalable family on the Purley platform expands on this feature by providing a broader range of states that it can affect as well as a finer level of granularity and microarchitecture observability via the Package Control Unit (PCU). On Broadwell, HWPM was autonomous, also known as Out-of-Band (OOB) mode, and oblivious to the operating system. The Intel Xeon processor Scalable family allows for this as well, but also offers the option of collaboration between the HWPM and the operating system, known as native mode. The operating system can directly control the tuning of the performance and power profile when and where it is desired, while elsewhere the PCU can take autonomous control in the absence of constraints placed by the operating system. In native mode the Intel Xeon processor Scalable family is able to optimize frequency control for legacy operating systems, while providing new usage models for modern operating systems. The end user can set these options within the BIOS; see your OEM BIOS guide for more information. Modern operating systems that provide full integration with native mode include Linux* starting with kernel 4.10 and Windows Server* 2016.

PMax Detection

A processor-implemented detection circuit provides faster detection of and response to PMax-level load events. Previously, PMax detection circuits resided in either the power supply unit (PSU) or on the system board, while the new detection circuit on the Intel Xeon processor Scalable family resides primarily on the processor side. In general, the PMax detection circuit provided with the Intel Xeon processor Scalable family allows for faster PMax detection and response time compared to prior-generation PMax detection methods. PMax detection allows the processor to be throttled back when it detects that power limits are being hit. This can assist with PMax spikes associated with power-virus-like applications while in turbo mode, before the PSU reacts. A faster response time to PMax load events potentially allows for power cost savings. The end user can set PMax detection within the BIOS; see your OEM BIOS guide for more information.

Intel® Omni-Path Architecture (Intel® OPA)

Intel® Omni-Path Architecture (Intel® OPA), an element of Intel® Scalable System Framework, delivers the performance for tomorrow’s high performance computing (HPC) workloads and the ability to scale to tens of thousands of nodes—and eventually more—at a price competitive with today’s fabrics. The Intel OPA 100 Series product line is an end-to-end solution of PCIe* adapters, silicon, switches, cables, and management software. As the successor to Intel® True Scale Fabric, this optimized HPC fabric is built upon a combination of enhanced IP and Intel® technology.

For software applications, Intel OPA will maintain consistency and compatibility with existing Intel True Scale Fabric and InfiniBand* APIs by working through the open source OpenFabrics Alliance (OFA) software stack on leading Linux distribution releases. Intel True Scale Fabric customers will be able to migrate to Intel OPA through an upgrade program.

The Intel Xeon processor Scalable family on the Purley platform supports Intel OPA in one of two forms: through the use of an Intel® Omni-Path Host Fabric Interface 100 Series add-in card or through a specific processor model line (SKL-F) found within the Intel Xeon processor Scalable family that has a Host Fabric Interface integrated into the processor. The fabric integration on the processor has its own dedicated pathways on the motherboard and doesn’t impact the PCIe lanes available for add-in cards. The architecture is able to provide up to 100 Gb/s per processor socket.

Intel is working with the open source community to provide all host components, with changes being pushed upstream in conjunction with Delta Package releases. OSVs are working with Intel to incorporate support into future OS distributions. Existing Message Passing Interface (MPI) programs and MPI libraries for Intel True Scale Fabric that use PSM will work as-is with the Intel Omni-Path Host Fabric Interface without recompiling, although recompiling can expose additional benefit.

Software support can be found at the Intel Download Center, and compiler support can be found in Intel® Parallel Studio XE 2017.

Intel QuickAssist Technology

Intel® QuickAssist Technology (Intel® QAT) accelerates and compresses cryptographic workloads by offloading the data to hardware capable of optimizing those functions. This makes it easier for developers to integrate built-in cryptographic accelerators into network, storage, and security applications. In the case of the Intel Xeon processor Scalable family on the Purley platform, Intel QAT is integrated into the hardware of the Intel® C620 series chipset (formerly Lewisburg) and offers capabilities including up to 100 Gb/s cryptography, 100 Gb/s compression, and 100K RSA 2K decrypt operations per second. Segments that can benefit from the technology include the following:

  • Server: secure browsing, email, search, big-data analytics (Hadoop), secure multi-tenancy, IPsec, SSL/TLS, OpenSSL
  • Networking: firewall, IDS/IPS, VPN, secure routing, Web proxy, WAN optimization (IP Comp), 3G/4G authentication
  • Storage: real-time data compression, static data compression, secure storage.

Supported Algorithms include the following:

  • Cipher Algorithms: Null, ARC4, AES (key sizes 128, 192, 256), DES, 3DES, Kasumi, Snow3G, and ZUC
  • Hash/Authentication Algorithms Supported: MD5, SHA1, SHA-2 (output sizes 224, 256, 384, 512), SHA-3 (output size 256 only), Advanced Encryption Standard (key sizes 128, 192, 256), Kasumi, Snow 3G, and ZUC
  • Authentication Encryption (AEAD) Algorithm: AES (key sizes 128, 192, 256)
  • Public Key Cryptography Algorithms: RSA, DSA, Diffie-Hellman (DH), Large Number Arithmetic, ECDSA, ECDH, EC, SM2 and EC25519

ZUC and SHA-3 are new algorithms that have been included in the third generation of Intel QuickAssist Technology found on the Intel® C620 series chipset.

Intel® Key Protection Technology (Intel® KPT) is a new supplemental feature of Intel QAT that can be found on the Intel Xeon processor Scalable family on the Purley platform with the Intel® C620 series chipset. Intel KPT has been developed to help secure cryptographic keys from platform level software and hardware attacks when the key is stored and used on the platform. This new feature focuses on protecting keys during runtime usage and is embodied within tools, techniques, and the API framework.

For a more detailed overview see Intel® QuickAssist Technology for Storage, Server, Networking and Cloud-Based Deployments. Programming and optimization guides can be found on the 01 Intel Open Source website.

Internet Wide Area RDMA Protocol (iWARP)

iWARP is a technology that allows network traffic managed by the NIC to bypass the kernel, which reduces the impact on the processor due to the absence of network-related interrupts. This is accomplished by the NICs communicating with each other via queue pairs to deliver traffic directly into the application user space. Large storage blocks and virtual machine migration tend to place more burden on the CPU due to network traffic, and this is where iWARP can be of benefit. Through the use of the queue pairs, it is already known where the data needs to go, so it can be placed directly into the application user space. This eliminates the extra data copies between kernel space and user space that would normally occur without iWARP.

For more information see the video Accelerating Ethernet with iWARP Technology.

iWARP comparison block diagram
Figure 21. iWARP comparison block diagram.

The Purley platform has an Integrated Intel Ethernet Connection X722 with up to 4x10 GbE/1 Gb connections that provide iWARP support. This new feature can benefit various segments including network function virtualization and software-defined infrastructure. It can also be combined with the Data Plane Development Kit to provide additional benefits with packet forwarding.

iWARP endpoints communicate using verbs APIs instead of traditional sockets. On Linux*, the OFA OFED stack provides the verbs APIs, while Windows* uses Network Direct APIs. Check with your Linux distribution to see whether it supports OFED verbs; on Windows, support is provided starting with Windows Server 2012 R2.
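As a minimal sketch of programming against the verbs API on Linux (assuming libibverbs from the OFED stack is installed, the code is linked with -libverbs, and the iWARP NIC driver is loaded), the following enumerates the RDMA-capable devices that a verbs application would then open to create its queue pairs:

#include <infiniband/verbs.h>
#include <stdio.h>

int main(void) {
    int num = 0;
    struct ibv_device **devices = ibv_get_device_list(&num);
    if (devices == NULL) {
        perror("ibv_get_device_list");
        return 1;
    }
    for (int i = 0; i < num; i++)
        printf("RDMA device: %s\n", ibv_get_device_name(devices[i]));
    ibv_free_device_list(devices);
    return 0;
}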

New and Enhanced RAS Features

The Intel Xeon processor Scalable family on the Purley platform provides several new features as well as enhancements of some existing features associated with the RAS (Reliability, Availability, and Serviceability) and Intel® Run Sure Technology. Two levels of support are provided with the Intel Xeon processor Scalable family: Standard RAS and Advanced RAS. Advanced RAS includes all of the Standard RAS features along with additional features.

In previous generations there could be limitations in RAS features based on the processor socket count (2–8). This has changed, and all of the RAS features are available on two-socket and larger versions of the platform, depending on the level (Bronze through Platinum) of the processors. Listed below is a summary of the new and enhanced RAS features compared to the previous generation.

Table 9. RAS feature summary table.

Table 9 RAS feature summary table

Intel® Virtual RAID on CPU (Intel® VROC)

Intel® VROC replaces third-party raid cards
Figure 22. Intel® VROC replaces third-party raid cards.

Intel VROC is a software solution that integrates with a new hardware technology called Intel® Volume Management Device (Intel® VMD) to provide a compelling hybrid RAID solution for NVMe* (Non-Volatile Memory Express*) solid-state drives (SSDs). The CPU has onboard capabilities that work more closely with the chipset to provide quick access to the directly attached NVMe SSDs on the PCIe lanes of the platform. The major features that help to make this possible are Intel® Rapid Storage Technology enterprise (Intel® RSTe) version 5.0, Intel VMD, and the Intel provided NVMe driver.

Intel RSTe is a driver and application package that allows for administration of the RAID features. It has been updated (version 5.0) on the Purley platform to take advantage of all of the new features. The NVMe driver allows restrictions that might have been placed on it by an operating system to be bypassed. This means that features like hot insert could be available even if the OS doesn’t provide it, and the driver can also provide support for third-party vendor NVMe non-Intel SSDs.

Intel VMD is a new technology introduced with the Intel Xeon processor Scalable family primarily to improve the management of high-speed SSDs. Previously, SSDs were attached to SATA or other interface types, and managing them through software was acceptable. When the SSDs are instead attached directly to a PCIe interface to improve bandwidth, software management of those SSDs adds more delay. Intel VMD uses hardware to mitigate these management issues rather than relying completely on software.

Some of the major RAID features provided by Intel VROC include the protected write-back cache, isolated storage devices from the OS (error handling), and protection of RAID 5 data from a RAID write hole issue through the use of software logging, which can eliminate the need for a battery backup unit. Direct attached NVMe RAID volumes are RAID bootable, have Hot Insert and Surprise Removal capability, provide LED management options, 4K native NVMe SSD support, and multiple management options including remote access from a webpage, interaction at the UEFI level for pre-OS tasks, and a GUI interface at the OS level.

Boot Guard

Boot Guard adds another level of protection to the Purley platform by performing a cryptographic Root of Trust for Measurement (RTM) of the early firmware into a platform storage device such as the trusted platform module or Intel® Platform Trust Technology (Intel® PTT). It can also cryptographically verify early firmware using OEM-provided policies. Unlike Intel® Trusted Execution Technology (Intel® TXT), Boot Guard doesn't have any software requirements; it is enabled at the factory, and it cannot be disabled. Boot Guard operates independently of Intel® TXT but is also compatible with it. Boot Guard reduces the chance of malware exploiting the hardware or software components.

Boot Guard secure boot options
Figure 23. Boot Guard secure boot options.

BIOS Guard 2.0

BIOS Guard is an augmentation of existing chipset-based BIOS flash protection capabilities. The Purley platform adds a fault-tolerant boot block update capability. The BIOS flash is segregated into protected and unprotected regions. Purley bypasses the top-swap feature and flash range register locks/protections, for explicitly enabled signed scripts, to facilitate the fault-tolerant boot block update. This feature protects the BIOS flash from modification without the platform manufacturer's authorization, as well as during BIOS updates. It can also help defend the platform from low-level denial-of-service (DoS) attacks.

For more details see Intel® Hardware-based Security Technologies for Intelligent Retail Devices.

BIOS Guard 2.0 block diagram
Figure 24. BIOS Guard 2.0 block diagram.

Intel® Processor Trace

Intel® Processor Trace (Intel® PT) is an exciting feature with improved support on the Intel Xeon processor Scalable family that can be enormously helpful in debugging, because it exposes an accurate and detailed trace of activity with triggering and filtering capabilities to help with isolating the tracing that matters.

Intel PT provides the context around all kinds of events. Performance profilers can use Intel PT to discover the root causes of “response-time” issues—performance issues that affect the quality of execution, if not the overall runtime.

Further, the complete tracing provided by Intel PT enables a much deeper view into execution than has previously been commonly available; for example, loop behavior, from entry and exit down to specific back-edges and loop tripcounts, is easy to extract and report.

Debuggers can use Intel PT to reconstruct the code flow that led to the current location, whether this is a crash site, a breakpoint, a watchpoint, or simply the instruction following a function call we just stepped over. They may even allow navigating in the recorded execution history via reverse stepping commands.

Another important use case is debugging stack corruptions. When the call stack has been corrupted, normal frame unwinding usually fails or may not produce reliable results. Intel PT can be used to reconstruct the stack back trace based on actual CALL and RET instructions.

Operating systems could include Intel PT into core files. This would allow debuggers to not only inspect the program state at the time of the crash, but also to reconstruct the control flow that led to the crash. It is also possible to extend this to the whole system to debug kernel panics and other system hangs. Intel PT can trace globally so that when an OS crash occurs, the trace can be saved as part of an OS crash dump mechanism and then used later to reconstruct the failure.

Intel PT can also help to narrow down data races in multi-threaded operating systems and user program code. It can log the execution of all threads with a rough time indication. While it is not precise enough to detect data races automatically, it can give enough information to aid in the analysis.

To utilize Intel PT you need Intel® VTune™ Amplifier version 2017.

For more information see Debug and fine-grain profiling with Intel processor trace, given by Beeman Strong, Senior, and Processor tracing by James Reinders.

Intel® Node Manager

Intel® Node Manager (Intel® NM) is a core set of power management features that provide a smart way to optimize and manage power, cooling, and compute resources in the data center. This server management technology extends component instrumentation to the platform level and can be used to make the most of every watt consumed in the data center. First, Intel NM reports vital platform information, such as power, temperature, and resource utilization, using standards-based, out-of-band communications. Second, it provides fine-grained controls, such as helping with reduction of overall power consumption or maximizing rack loading, to limit platform power in compliance with IT policy. This feature can be found across Intel's product segments, including the Intel Xeon processor Scalable family, providing consistency within the data center.

The Intel Xeon processor Scalable family on the Purley platform includes the fourth generation of Intel NM, which extends control and reporting to a finer level of granularity than on the previous generation. To use this feature you must enable the BMC LAN and the associated BMC user configuration at the BIOS level, which should be available under the server management menu. The Programmer’s Reference Kit is simple to use and requires no additional external libraries to compile or run. All that is needed is a C/C++ compiler and to then run the configuration and compilation scripts.

Table 10. Intel Node Manager fourth-generation features

Table 10 Intel® Node Manager fourth-generation features

The Author: David Mulnix is a software engineer and has been with Intel Corporation for over 20 years. His areas of focus have included software automation, server power, and performance analysis, and he has contributed to the development and support of the Server Efficiency Rating Tool™.

Contributors: Akhilesh Kumar and Elmoustapha Ould-ahmed-vall

Resources

Intel® 64 and IA-32 Architectures Software Developer’s Manual (SDM)

Intel® Architecture Instruction Set Extensions Programming Reference

Intel® Resource Director Technology (Intel® RDT)

Optimize Resource Utilization with Intel® Resource Director Technology

Intel® Memory Protection Extensions Enabling Guide

Intel® Scalable System Framework

Intel® Run Sure Technology

Intel® Hardware-based Security Technologies for Intelligent Retail Devices

Processor tracing by James Reinders

Debug and fine-grain profiling with Intel processor trace given by Beeman Strong, Senior

Intel® Node Manager Website

Intel® Node Manager Programmer’s Reference Kit

Open Source Reference Kit for Intel® Node Manager

How to set up Intel® Node Manager

Intel® Performance Counter Monitor (Intel® PCM): a better way to measure CPU utilization

Intel® Memory Latency Checker (Intel® MLC): a Linux* tool available for measuring the DRAM latency on your system

Intel® VTune™ Amplifier 2017: a rich set of performance insights into hotspots, threading, locks and waits, OpenCL bandwidth, and more, with a profiler to visualize results

The Intel® Xeon® processor-based server refresh savings estimator

Python mpi4py on Intel® True Scale and Omni-Path Clusters


Python users of the mpi4py package who leverage its capabilities for distributed computing on supercomputers with Intel® True Scale or Intel® Omni-Path interconnects might run into issues with the default configuration of mpi4py.

By default, the mpi4py package uses matching probes (MPI_Mprobe) for the receiving function recv() instead of regular MPI_Recv operations. These matching probes, introduced in the MPI 3.0 standard, are not supported on all fabrics, which may lead to a hang in the receiving function.

Therefore, users are advised to use the OFI fabric instead of TMI on Omni-Path systems. For Intel® MPI, the configuration could look like the following environment variable setting:

I_MPI_FABRICS=ofi

Users utilizing True Scale or Omni-Path systems via the TMI fabric might alternatively switch off the use of matching probe operations within the mpi4py recv() function.

This can be established via

mpi4py.rc.recv_mprobe = False

right after importing the mpi4py package.

 

 

Optimizing Computer Applications for Latency: Part 2: Tuning Applications


For applications such as high frequency trading (HFT), search engines, and telecommunications, it is essential that latency be minimized. My previous article, Optimizing Computer Applications for Latency, looked at the architecture choices that support a low latency application. This article builds on that to show how latency can be measured and tuned within the application software.

Using Intel® VTune™ Amplifier

Intel® VTune™ Amplifier XE can collect and display a lot of useful data about an application’s performance. You can run a number of pre-defined collections (such as parallelism and memory analysis) and see thread synchronization on a timeline. You can break down activity by process, thread, module, function, or core, and break it down by bottleneck too (memory bandwidth, cache misses, and front-end stalls).

Intel VTune can be used to identify many important performance issues, but it struggles with analyzing intervals measured in microseconds. Intel VTune uses periodic interrupts to collect data and save it. The frequency of those interrupts is limited to roughly one collection point per 100 microseconds per core. While you can filter the data to observe some of the outliers, the data on any single outlier will be limited, and some might be missed by the sampling frequency.

You can download a free trial of Intel VTune Amplifier XE. Read about the Intel VTune Amplifier XE capabilities.

Figure 1. Intel® VTune™ Amplifier XE 2017, showing hotspots (above) and concurrency (below) analyses

Using Intel® Processor Trace Technology

The introduction of Intel® Processor Trace (Intel® PT) technology in the Broadwell architecture, for example in the Intel® Xeon® processor E5-2600 v4, makes it possible to analyze outliers in low latency applications. Intel® PT is a hardware feature that logs information about software execution with minimal impact on system execution. It supports control flow tracing, so decoder software can be used to determine the exact flow of the software execution, including branches taken and not taken, based on the trace log. Intel PT can store both cycle count and timestamp information for deep performance analysis. If you can time stamp other measurements, traces, and screenshots you can synchronize the Intel PT data with them. The granularity of a capture is a basic block. Intel PT is supported by the “perf” performance analysis tool in Linux*.

Typical Low Latency Application Issues

Low latency applications can suffer from the same bottlenecks as any kind of application, including:

  • Using excessive system library calls (such as inefficient memory allocations or string operations)

  • Using outdated instruction sets, because of obsolete compilers or compiler options

  • Memory and other runtime issues leading to execution stalls

On top of those, latency-sensitive applications have their own specific issues. Unlike in high-performance computing (HPC) applications, where loop bodies are usually small, the loop body in a low latency application usually covers a packet processing instruction path. In most cases, this leads to heavy front-end stalls because the decoded instructions for the entire packet processing path do not fit into the instruction (uop) cache. That means instructions have to be decoded on the fly for each loop iteration. Between 40 and 50 per cent of CPU cycles can stall due to the lack of instructions to execute.

Another specific problem is due to inefficient thread synchronization. The impact of this usually increases with a higher packet/transaction rate. Higher latency may lead to a limited throughput as well, making the application less able to handle bursts of activity. One example I’ve seen in customer code is guarding a single-threaded queue with a lock to use it in a multithreaded environment. That’s hugely inefficient. Using a good multithreaded queue, we’ve been able to improve throughput from 4,000 to 130,000 messages per second. Another common issue is using thread synchronization primitives that go to kernel sleep mode immediately or too soon. Every wake-up from kernel sleep takes at least 1.2 microseconds.

One of the goals of a low latency application is to reduce the quantity and extent of outliers. Typical reasons for jitter (in descending order) are:

  • Thread oversubscriptions, accounting for a few milliseconds

  • Runtime garbage collector activities, accounting for a few milliseconds

  • Kernel activities, accounting for up to 100s of microseconds

  • Power-saving states:

    • CPU C-states, accounting for 10s to 100s of microseconds

    • Memory states

    • PCI-e states

  • Turbo mode frequency switches, accounting for 7 microseconds

  • Interrupts, IO, timers: responsible for up to a few microseconds

Tuning the Application

Application tuning should begin by tackling any issues found by Intel VTune. Start with the top hotspots and, where possible, eliminate or reduce excessive activities and CPU stalls. This has been widely covered by others before, so I won’t repeat their work here. If you’re new to Intel VTune, there’s a Getting Started guide.

In this article, we will focus on the specifics for low latency applications. The biggest issue arises from front-end stalls in the instruction decoding pipeline. This issue is difficult to address, and results from the loop body being too big for the uop cache. One approach that might help is to split the packet processing loop and process it by a number of threads passing execution from one another. There will be a synchronization overhead, but if the instruction sequence fits into a few uop caches (each thread bound to different cores, one cache per thread), it may well be worth the exercise.

Thread synchronization issues are somewhat difficult to monitor. Intel VTune Amplifier has a collection that captures all standard thread sync events (Windows* Thread API, pthreads* API, Intel® Threading Building Blocks, and OpenMP*). It helps to understand what is going on in the application, but deeper analysis is required to see if a thread sync model introduces any limitations. This is a non-trivial exercise requiring considerable expertise. The best advice is to use a highly performant threading solution.

An interesting topic is thread affinities. For complex systems with multiple thread synchronization patterns along the workflow, setting the best affinities may bring some benefit. A synchronization object is a variable or data structure, plus its associated lock/release functionality. Threads synchronized on a particular object should be pinned to a core of the same socket, but they don’t need to be on the same core. Generally the goal of this exercise is to keep thread synchronization on a particular object local to one of the sockets, because cross-socket thread sync is much costlier.
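A minimal sketch of pinning threads is shown below (Linux*, assuming glibc's pthread_setaffinity_np and linking with -lpthread); core IDs 0 and 1 are illustrative assumptions, and the right choice depends on your socket topology, which can be checked with tools such as lscpu:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin a thread to a single core so that threads sharing a synchronization
 * object stay on the same socket. */
static int pin_to_core(pthread_t thread, int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(thread, sizeof(set), &set);
}

/* Usage: pin_to_core(producer_thread, 0); pin_to_core(consumer_thread, 1); */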

Tackling Outliers in Virtual Machines

If the application runs in a Java* or .NET* virtual machine, the virtual machine needs to be tuned. The garbage collector settings are particularly important. For example, try tuning the tenuring threshold to avoid unnecessary moves of long-lived objects. This often helps to reduce latency and cut outliers down.

One useful technology introduced in the Intel® Xeon® processor E5-2600 v4 product family is Cache Allocation Technology. It allows a certain amount of last level cache (LLC) to be dedicated to a particular core, process, or thread, or to a group of them. For example, a low latency application might get exclusive use of part of the cache so anything else running on the system won’t be able to evict its data.

Another interesting technique is to lock the hottest data in the LLC “indefinitely”. This is a particularly useful technique for outlier reduction. The hottest data is usually considered to be the data that’s accessed most often, but for low latency applications it can instead be the data that is on a critical latency path. A cache miss costs roughly 50 to 100 nanoseconds, so a few cache misses can cause an outlier. By ensuring that critical data is locked in the cache, we can reduce the number and intensity of outliers.
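
If LLC locking is not available on your platform, a much cruder, software-only alternative is to periodically touch the critical data from a housekeeping thread so it is less likely to be evicted. This is not the locking mechanism described above, only an illustrative fallback; the data structure, its size, and the re-warm interval below are all assumptions:

/* Sketch: keep latency-critical data warm in cache from a helper thread. */
#include <stdint.h>
#include <stddef.h>
#include <unistd.h>

#define CRITICAL_BYTES (256 * 1024)        /* assumed hot working set */
static volatile uint8_t critical_data[CRITICAL_BYTES];

static void *cache_warmer(void *arg)
{
    (void)arg;
    for (;;) {
        /* Touch one byte per 64-byte cache line to keep lines resident. */
        for (size_t i = 0; i < CRITICAL_BYTES; i += 64)
            (void)critical_data[i];
        usleep(100);                       /* re-warm roughly every 100 us */
    }
    return NULL;
}

The warmer thread should itself be pinned away from the latency-critical cores so that its memory traffic does not add jitter of its own.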

For more information on Cache Allocation Technology, see Using Hardware Features in Intel® Architecture to Achieve High Performance in NFV.           

Exercise

Let’s play with a code sample implementing a lockless single-producer single-consumer queue. [Download: spsc.c (text/x-csrc)]

To start, grab the source code for the test case from the download link above. Build it like this:

gcc spsc.c -lpthread -lm -o spsc

or

icc spsc.c -lpthread -lm -o spsc

Here’s how you run the spsc test case:

./spsc 100000 10 100000

The parameters are numofPacketsToSend, bufferSize, and numofPacketsPerSecond. You can experiment with different numbers.

Let’s check how the latency is affected by CPU power-saving settings. Set everything in the BIOS to maximum performance, as described in Part 1 of this series. Specifically, CPU C-states must be set to off and the correct power mode should be used, as described in the Kernel Tuning section. Also, ensure that cpuspeed is off.

Next, set the CPU scaling governor to powersave. In the command below, the index i should cover all logical CPUs on your system, so adjust the upper bound accordingly:

for ((i=0; i<23; i++)); do echo powersave > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_governor;  done

Then set all threads to stay on the same NUMA node using taskset, and run the test case:

taskset -c 0,1 ./spsc 1000000 10 1000000

On a server based on the Intel® Xeon® Processor E5-2697 v2, running at 2.70GHz, we see the following results for average latency with and without outliers, the highest and lowest latency, the number of outliers and the standard deviation (with and without outliers). All measurements are in microseconds:

taskset -c 0,1 ./spsc 1000000 10 1000000

Avg lat = 0.274690, Avg lat w/o outliers = 0.234502, lowest lat = 0.133645, highest lat = 852.247954, outliers = 4023

Stdev = 0.001214, stdev w/o outliers = 0.001015

Now set the performance mode (overwriting the powersave mode) and run the test again:

for ((i=0; i<23; i++)); do echo performance > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_governor;  done

taskset -c 0,1 ./spsc 1000000 10 1000000

Avg lat = 0.067001, Avg lat w/o outliers = 0.051926, lowest lat = 0.045660, highest lat = 422.293023, outliers = 1461

Stdev = 0.000455, stdev w/o outliers = 0.000560

As you can see, all the performance metrics improved significantly when we enabled performance mode: the average latency, the lowest and highest latency, and the number of outliers. (Table 1 summarizes all the results from this exercise for easy comparison.)

Let’s compare how the latency is affected by NUMA locality. I’m assuming you have a machine with more than one processor. We’ve already run the test case bound to a single NUMA node.

Let’s run the test case over two nodes:

taskset -c 8,16 ./spsc 1000000 10 1000000

Avg lat = 0.248679, Avg lat w/o outliers = 0.233011, lowest lat = 0.069047, highest lat = 415.176207, outliers = 1926

Stdev = 0.000901, stdev w/o outliers = 0.001103

All of the metrics, except for the highest latency, are better on a single NUMA node. This is the cost of communicating with another node: data needs to be transferred over the Intel® QuickPath Interconnect (Intel® QPI) link and through the cache coherency mechanism.

Don’t be surprised that the highest latency is lower on two nodes. You can run the test multiple times and verify that the highest latency outliers are roughly the same for both one node and two nodes. The lower value shown here for two nodes is most likely a coincidence. The outliers are two to three orders of magnitude higher than the average latency, which shows that NUMA locality doesn’t matter for the highest latency. The outliers are caused by kernel activities that are not related to NUMA.

Test                   Avg Lat    Avg Lat w/o Outliers   Lowest Lat   Highest Lat   Outliers   Stdev      Stdev w/o Outliers
Powersave              0.274690   0.234502               0.133645     852.247954    4023       0.001214   0.001015
Performance, 1 node    0.067001   0.051926               0.045660     422.293023    1461       0.000455   0.000560
Performance, 2 nodes   0.248679   0.233011               0.069047     415.176207    1926       0.000901   0.001103

Table 1: The results of the latency tests conducted under different conditions, measured in microseconds.

I also recommend playing with Linux perf to monitor outliers. Support for Intel® Processor Trace (Intel® PT) in perf starts with kernel 4.1. You need to add timestamps (start, stop) for all latency intervals, identify a particular outlier, and then drill down into the perf data to see what was going on during that interval.

For more information, see https://github.com/torvalds/linux/blob/master/tools/perf/Documentation/intel-pt.txt.
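
A minimal timestamping sketch is shown below, assuming clock_gettime with CLOCK_MONOTONIC is precise enough for your purposes; the 10-microsecond outlier threshold and the process_one_packet() placeholder are assumptions:

/* Sketch: timestamp each latency interval so outliers can be matched
   against a perf / Intel PT recording afterwards. */
#include <stdio.h>
#include <stdint.h>
#include <time.h>

#define OUTLIER_THRESHOLD_NS 10000ULL      /* assumed: 10 us */

static inline uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

static void process_one_packet(void)
{
    /* ... the packet processing path under measurement ... */
}

int main(void)
{
    for (int i = 0; i < 1000000; i++) {
        uint64_t start = now_ns();
        process_one_packet();
        uint64_t stop = now_ns();
        if (stop - start > OUTLIER_THRESHOLD_NS)
            /* Log absolute timestamps so the interval can be located
               in the perf/Intel PT trace. */
            fprintf(stderr, "outlier: start=%llu stop=%llu dur=%llu ns\n",
                    (unsigned long long)start, (unsigned long long)stop,
                    (unsigned long long)(stop - start));
    }
    return 0;
}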

Conclusion

This two-part article has summarized some of the approaches you can take, and tools you can use, when tuning applications and hardware for low latency. Using the worked example here, you can quickly see the impact of NUMA locality and powersave mode, and you can use the test case to experiment with other settings, and quickly see the impact they can have on latency.

Intel® Media SDK GStreamer* Getting Started Guide


Intel® Media SDK GStreamer* Installation Process

1 Overview

This document provides the system requirements, installation instructions, and known issues and limitations.

System requirements:

  • Intel® Core™ processor: Skylake or Broadwell
  • Fedora* 24 / 25
  • Intel® Media Server Studio 2017 R2

2 Install Fedora* 24 / 25

2.1 Download Fedora*

Go to the Fedora* download site and download the Workstation OS image:

Fedora* 24: http://mirror.nodesdirect.com/fedora/releases/24/Workstation/x86_64/iso/Fedora-Workstation-Live-x86_64-24-1.2.iso
Fedora* 25: http://mirror.nodesdirect.com/fedora/releases/25/Workstation/x86_64/iso/Fedora-Workstation-Live-x86_64-25-1.3.iso

2.2 Create the installation USB

Use an imaging tool such as Rufus to create a bootable USB image.

2.3 Install Fedora* 24 / 25 on the system

For Fedora 25, you may need to log on to the system with the "GNOME on Xorg" option in the GNOME login manager. This is because the default Fedora 25 desktop uses Wayland, and the native Wayland backend of the renderer plugin (mfxsink) is not well supported by the Fedora Wayland desktop. If you want native Wayland rendering on the Fedora 25 Wayland desktop instead, use the Wayland EGL backend in mfxsink.

2.4 Configure the Fedora system (optional)

If the system is behind a proxy (for example, when working over a VPN), you can set up the network proxy for dnf as follows:

vi /etc/dnf/dnf.conf
# Add the following line:
proxy=http://<proxy address>:<port>

Enable sudo privileges:

$ su
Password:
# vi /etc/sudoers
Find the line:
  root    ALL=(ALL)    ALL
Then add a similar line for the normal user who needs sudo, e.g. for user "user":
  user    ALL=(ALL)    ALL

2.5 Install rpm fusion

Fedora* 24:

wget http://download1.rpmfusion.org/free/fedora/rpmfusion-free-release-24.noarch.rpm -e use_proxy=yes -e http_proxy=<proxy_address>:<port>
sudo rpm -ivh rpmfusion-free-release-24.noarch.rpm

Fedora* 25:

wget http://download1.rpmfusion.org/free/fedora/rpmfusion-free-release-25.noarch.rpm -e use_proxy=yes -e http_proxy=<proxy_address>:<port>
sudo rpm -ivh rpmfusion-free-release-25.noarch.rpm

2.6 Update system

sudo dnf update

3 Install Intel® Media Server Studio

3.1 Download Intel® Media Server Studio (2017 R2) Community Edition

Go to software.intel.com/intel-media-server-studio and download the tar.gz file

3.2 Installing the user-space modules

Note: the last cp command in the sequence below may reset the system; it may freeze for a while and log you out automatically. This is expected; log back in and resume the installation procedure. Create a folder for the installation, for example "development", and download the tar file MediaServerStudioEssentials2017R2.tar.gz into it.

# cd ~
# mkdir development
# cd development
# tar -vxf MediaServerStudioEssentials2017R2.tar.gz
# cd MediaServerStudioEssentials2017R2/
# tar -vxf SDK2017Production16.5.1.tar.gz
# cd SDK2017Production16.5.1/Generic/
# tar -vxf intel-linux-media_generic_16.5.1-59511_64bit.tar.gz
# sudo cp -rdf etc/* /etc
# sudo cp -rdf opt/* /opt
# sudo cp -rdf lib/* /lib
# sudo cp -rdf usr/* /usr

3.3 Install the custom kernel module package

3.3.1 Install the build tools

# sudo dnf install kernel-headers kernel-devel bc wget bison ncurses-devel hmaccalc zlib-devel binutils-devel elfutils-libelf-devel rpm-build redhat-rpm-config asciidoc perl-ExtUtils-Embed pesign xmlto audit-libs-devel elfutils-devel newt-devel numactl-devel pciutils-devel python-devel mesa-dri-drivers openssl-devel

3.3.2 Download and build the kernel

# cd ~/development
# wget https://www.kernel.org/pub/linux/kernel/v4.x/linux-4.4.tar.xz -e use_proxy=yes -e https_proxy=https://<proxy_address>:<port>
# tar -xvf linux-4.4.tar.xz
# cp /opt/intel/mediasdk/opensource/patches/kmd/4.4/intel-kernel-patches.tar.bz2 .
# tar -xjf intel-kernel-patches.tar.bz2
# cd linux-4.4/
# vi patch.sh
# Add the following line to patch.sh:
#   for i in ../intel-kernel-patches/*.patch; do patch -p1 < $i; done
# chmod +x patch.sh
# ./patch.sh
# make olddefconfig
# echo "CONFIG_NVM=y">> .config
# echo "CONFIG_NVM_DEBUG=n">> .config
# echo "CONFIG_NVM_GENNVM=n">> .config
# echo "CONFIG_NVM_RRPC=n">> .config
# make -j 8
# sudo make modules_install
# sudo make install

3.3.3 Validate the kernel change

Reboot the system with kernel 4.4 and check the kernel version

# uname -r
4.4.0

3.3.4 Validate the Intel® Media SDK installation

The vainfo utility should show the Media SDK iHD driver details (installed in /opt/intel/mediasdk) and several codec entry points that indicate the system support for various codec formats.

$ vainfo
libva info: VA-API version 0.99.0
libva info: va_getDriverName() returns 0
libva info: User requested driver 'iHD'
libva info: Trying to open /opt/intel/mediasdk/lib64/iHD_drv_video.so
libva info: Found init function __vaDriverInit_0_32
libva info: va_openDriver() returns 0
vainfo: VA-API version: 0.99 (libva 1.67.0.pre1)
vainfo: Driver version: 16.5.1.59511-ubit
vainfo: Supported profile and entrypoints
 VAProfileH264ConstrainedBaseline: VAEntrypointVLD
 VAProfileH264ConstrainedBaseline: VAEntrypointEncSlice
 VAProfileH264Main : VAEntrypointVLD
 VAProfileH264Main : VAEntrypointEncSlice
 VAProfileH264High : VAEntrypointVLD
 VAProfileH264High : VAEntrypointEncSlice

Prebuilt samples are available for installation smoke testing in MediaSamples_Linux_*.tar.gz

# cd ~/development/MediaServerStudioEssentials2017R2/
# tar -vxf MediaSamples_Linux_2017R2.tar.gz
# cd MediaSamples_Linux_2017R2_b634/samples/_bin/x64/
# ./sample_multi_transcode -i::h264 ../content/test_stream.264 -o::h264 out.264

This test should pass on successful installation.

4 Install GStreamer*

4.1 Install GStreamer and corresponding plugins packages

# sudo dnf install gstreamer1 gstreamer1-devel gstreamer1-plugins-base gstreamer1-plugins-base-devel gstreamer1-plugins-good gstreamer1-plugins-ugly gstreamer1-plugins-bad-free gstreamer1-plugins-bad-freeworld gstreamer1-plugins-bad-free-extras gstreamer1-libav gstreamer1-plugins-bad-free-devel gstreamer1-plugins-base-tools

4.2 Validate the installation

# gst-launch-1.0 --version
# gst-launch-1.0 -v fakesrc num_buffers=5 ! fakesink
# gst-play-1.0 sample.mkv

5 Build the GStreamer Media SDK plugin

5.1 Install the GStreamer Media SDK plugin dependencies

# sudo dnf install gcc-c++ glib2-devel libudev-devel libwayland-client-devel libwayland-cursor-devel mesa-libEGL-devel mesa-libGL-devel mesa-libwayland-egl-devel mesa-libGLES-devel libstdc++-devel cmake libXrandr-devel

5.2 Download the GStreamer Media SDK plugin

Go to github.com/01org/gstreamer-media-SDK and download the package to a "development" folder.

5.3 Build and install the plugin

# cd development/gstreamer-media-SDK-master/
# mkdir build
# cd build
# cmake .. -DCMAKE_INSTALL_PREFIX=/usr/lib64/gstreamer-1.0/plugins
# make
# sudo make install

5.4 Validate the installation

# gst-inspect-1.0 mfxvpp
# gst-inspect-1.0 mfxdecode
# gst-play-1.0 sample.mkv
# gst-launch-1.0 filesrc location=/path/to/BigBuckBunny_320x180.mp4 ! qtdemux ! h264parse ! mfxdecode ! fpsdisplaysink video-sink=mfxsink

You can go to the following site to download the clip: download.blender.org/peach/bigbuckbunny_movies/
