AVX2 Optimized Alpha Blend

Published on 26th July, 2015

AVX2 is a new SIMD instruction set introduced with Intel(R) Haswell in 2013. It extends integer SIMD operations to 256-bit vectors, so ideally you can get a ~2x performance boost by switching from SSE to AVX2.

However, a 256-bit AVX2 register is split into two 128-bit lanes, and many AVX2 instructions (including most shuffles, unpacks and packs) are applied to the high and low 128-bit halves separately. So if you just replace SSE instructions with their AVX2 versions and xmm with ymm, you will not get a correct result.
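To make the lane behaviour concrete, here is a minimal scalar sketch of what VPUNPCKLBW does on a 256-bit register. The function names are made up for this illustration and plain byte arrays stand in for the registers; it is a model of the documented behaviour, not code that runs on the vector unit.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of the 128-bit PUNPCKLBW: interleave the low 8 bytes of a and b.
 * Hypothetical helper name, used only for this illustration. */
static void unpack_lo_bytes_128(const uint8_t a[16], const uint8_t b[16], uint8_t out[16])
{
    assert(out != a && out != b);   /* in-place use would overwrite unread input */
    for (int i = 0; i < 8; ++i) {
        out[2 * i]     = a[i];
        out[2 * i + 1] = b[i];
    }
}

/* Scalar model of the 256-bit VPUNPCKLBW: the same operation is applied to
 * each 128-bit lane on its own, so bytes 8..15 of the input are skipped and
 * the high lane reads bytes 16..23 instead. */
static void unpack_lo_bytes_256(const uint8_t a[32], const uint8_t b[32], uint8_t out[32])
{
    unpack_lo_bytes_128(a, b, out);
    unpack_lo_bytes_128(a + 16, b + 16, out + 16);
}
```

If a holds the bytes 0..31 and b is all zero, the output words are 0..7 followed by 16..23. A naive port from SSE would expect 0..15, which is why the data has to be rearranged before the unpack.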

Here is a sample that uses AVX2 to accelerate alpha blending.

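// some pre-calculations; the results stay in ymm5, ymm6 and ymm7
void alpha_blend_init()
{
  // VPSHUFB mask: copy the alpha byte of each pixel (byte 6 / byte 14 of a lane)
  // into the low byte of every word of that pixel; 0xFF zeroes the high byte
  __declspec(align(32)) uint32_t shuffle[] = { 0xFF06FF06, 0xFF06FF06, 0xFF0EFF0E, 0xFF0EFF0E };
  uint32_t * p_shuffle = &shuffle[0];

  __asm {
    MOV ECX, 0x100;
    VMOVD xmm6, ECX;
    VPBROADCASTW ymm6, xmm6;         // ymm6 = 256 in every word
    MOV ECX, p_shuffle;
    VMOVDQA xmm5, [ECX];
    VPERM2I128 ymm5, ymm5, ymm5, 0;  // ymm5 = shuffle mask copied to both lanes
    VPXOR ymm7, ymm7, ymm7;          // ymm7 = 0, the zero operand for unpack and pack
  }
}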

// calculate 4 pixels at a time
void alpha_blend(uint32_t * back4px, uint32_t * fore4px, uint32_t * out4px)
{
  __asm {
    MOV EBX, back4px;
    VMOVDQA xmm0, [EBX];
    MOV ECX, fore4px;
    VMOVDQA xmm1, [ECX];
    VPERMQ ymm0, ymm0, 0xDC;      // move pixels 2-3 to the bottom of the high lane
    VPERMQ ymm1, ymm1, 0xDC;
    VPUNPCKLBW ymm0, ymm0, ymm7;  // ymm0 = back, bytes widened to words (ymm7 must be 0)
    VPUNPCKLBW ymm1, ymm1, ymm7;  // ymm1 = fore, bytes widened to words
    VPSHUFB ymm2, ymm1, ymm5;     // ymm2 = alpha(fore)
    VPSUBW ymm3, ymm6, ymm2;      // ymm3 = 256 - alpha
    VPMULLW ymm1, ymm1, ymm2;     // ymm1 = fore * alpha
    VPMULLW ymm0, ymm0, ymm3;     // ymm0 = back * (256 - alpha)
    VPADDW ymm0, ymm0, ymm1;      // ymm0 = fore * alpha + back * (256 - alpha)
    VPSRLW ymm0, ymm0, 8;         // ymm0 = sum / 256
    VPACKUSWB ymm1, ymm7, ymm0;   // words back to bytes, in the high half of each lane
    VPERMQ ymm1, ymm1, 0x0D;      // gather both halves into the low 128 bits
    MOV ECX, out4px;
    VMOVDQA [ECX], xmm1;
  }
}

The code is based on the following alpha blend formula, assuming that we are going to blend an image with an alpha channel (A) over a bitmap (B).

C_rgb = ( A_rgb * A_a + B_rgb * (256 - A_a) ) / 256
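For checking the SIMD output, a scalar version of the same formula is handy. The sketch below is a plain C reference, assuming 0xAARRGGBB pixels as in the sample data; blend_channel and blend_pixel are names invented here, not part of the original code. Like the assembly, it runs the alpha byte through the same arithmetic as the color bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Blend one 8-bit channel: (fore * alpha + back * (256 - alpha)) / 256. */
static uint8_t blend_channel(uint8_t fore, uint8_t back, uint8_t alpha)
{
    uint32_t sum = (uint32_t)fore * alpha + (uint32_t)back * (256u - alpha);
    /* The sum never exceeds 255 * 256, so it fits in the 16-bit words
     * that the SIMD version works with. */
    assert(sum <= 255u * 256u);
    return (uint8_t)(sum >> 8);
}

/* Blend one pixel of fore over back. All four bytes, including the alpha
 * byte, are treated the same way, mirroring the SIMD code. */
static uint32_t blend_pixel(uint32_t back, uint32_t fore)
{
    uint8_t alpha = (uint8_t)(fore >> 24);
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint8_t f = (uint8_t)(fore >> shift);
        uint8_t b = (uint8_t)(back >> shift);
        out |= (uint32_t)blend_channel(f, b, alpha) << shift;
    }
    return out;
}
```

For the first sample pixel, blend_pixel(0x00112233, 0x80FFFFFF) gives 0x40889099. Note that the alpha byte itself is blended too (0x80 becomes 0x40), and that a fully opaque 0xFF channel comes out as 0xFE because of the division by 256.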


__declspec(align(32)) uint32_t back[] = { 0x00112233, 0x00000000, 0x00FFFFFF, 0x00888888 };
__declspec(align(32)) uint32_t fore[] = { 0x80FFFFFF, 0x00FFFFFF, 0x80000000, 0xFF112233 };
__declspec(align(32)) uint32_t result[] = { 0,0,0,0 };

// loop begin
alpha_blend(back, fore, result);
// loop end


  1. Call alpha_blend_init() once before the main loop. It pre-calculates constants that alpha_blend() later reads from ymm registers, so nothing else in the loop may overwrite those registers.

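  2. Pointers passed to alpha_blend() must be aligned. VMOVDQA with an xmm register faults on an address that is not on a 16-byte boundary; the 32-byte alignment used in the sample also satisfies this.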

  3. In the original formula, the last step divides by 255. To calculate faster, I shift right by 8 here, which divides by 256. So the composited color channels will sometimes be 1 larger or smaller than the correct ones. If you really care about correctness, you need to correct the formula and modify the code. However, if you are using it to render GUIs, it will mostly be fine!

  4. Again, my assumption is that it is used to render a GUI onto a bitmap (maybe in an operating system). So the calculation doesn't take care of the composited alpha channel. You may need to add extra instructions to fix the alpha channel if you think it matters.
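If the off-by-one error from note 3 matters, one commonly used option is to weight the background by 255 - alpha and replace the shift with an exact rounded division by 255 built from shifts and adds. The sketch below shows that idea in scalar C; div255_round and blend_channel_exact are hypothetical names, and this is not the code the listing above implements.

```c
#include <assert.h>
#include <stdint.h>

/* Rounded division by 255 without a divide instruction.
 * Exact for 0 <= x <= 255 * 255, the largest sum the blend can produce. */
static uint8_t div255_round(uint32_t x)
{
    assert(x <= 255u * 255u);
    uint32_t t = x + 128u;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* Blend one channel with the exact formula:
 * round((fore * alpha + back * (255 - alpha)) / 255). */
static uint8_t blend_channel_exact(uint8_t fore, uint8_t back, uint8_t alpha)
{
    uint32_t sum = (uint32_t)fore * alpha + (uint32_t)back * (255u - alpha);
    return div255_round(sum);
}
```

With this version a fully opaque 0xFF channel stays 0xFF. Every intermediate value stays below 65536, so the same steps should map onto 16-bit word instructions (an add of 128, a shift, an add and another shift), at the cost of a few extra instructions per 4 pixels.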

Permutations Visualization
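The two VPERMQ immediates used in alpha_blend() can also be traced with a small scalar model. This is a sketch of the documented behaviour (each pair of bits in the immediate selects one source qword); permute_qwords is a name made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of VPERMQ ymm, ymm, imm8:
 * output qword i = source qword ((imm8 >> (2 * i)) & 3). */
static void permute_qwords(const uint64_t src[4], uint8_t imm8, uint64_t out[4])
{
    assert(out != src);   /* keep the source intact while it is being read */
    for (int i = 0; i < 4; ++i) {
        out[i] = src[(imm8 >> (2 * i)) & 3];
    }
}
```

With 0xDC the selected qwords are 0, 3, 1, 3. After the 128-bit load the upper two qwords are zero, so pixels 0-1 stay at the bottom of the low lane and pixels 2-3 move to the bottom of the high lane, exactly where VPUNPCKLBW picks them up. With 0x0D the selected qwords are 1, 3, 0, 0, which gathers the packed results from the high half of each lane into the low 128 bits for the final store.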


Read more:

Programming using AVX2. Permutations

Detect hardware support of AVX2: Intel(R) Manual Vol.1 14-32