
AVX2 Optimized Alpha Blend

AVX2 is a SIMD instruction set introduced with Intel's Haswell processors in 2013. It operates on 256-bit vectors, so ideally you can get roughly a 2x performance boost by switching from SSE to AVX2.

However, a 256-bit ymm register is split into two 128-bit lanes, and many AVX2 integer instructions (shuffles, unpacks, packs) operate on the high and low 128-bit halves independently. So if you simply replace SSE instructions with their AVX2 counterparts and xmm with ymm, you will not get a correct result.
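
For example, the AVX2 version of PUNPCKLBW never mixes bytes across the 128-bit lane boundary. Here is a minimal intrinsics sketch (not part of the original code) that makes the lane split visible:

// lane_demo.c -- compile with /arch:AVX2 (MSVC) or -mavx2 (GCC/Clang).
// Shows that _mm256_unpacklo_epi8 (VPUNPCKLBW) works per 128-bit lane.
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
  uint8_t src[32];
  for (int i = 0; i < 32; ++i) src[i] = (uint8_t)i;      // bytes 0..31

  __m256i v    = _mm256_loadu_si256((const __m256i *)src);
  __m256i zero = _mm256_setzero_si256();
  __m256i r    = _mm256_unpacklo_epi8(v, zero);           // per-lane unpack

  uint8_t out[32];
  _mm256_storeu_si256((__m256i *)out, r);

  // Prints 0..7 followed by 16..23: the high lane starts over at byte 16,
  // not at byte 8 as a "wide SSE" mental model would suggest.
  for (int i = 0; i < 32; i += 2) printf("%d ", out[i]);
  printf("\n");
  return 0;
}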

Here is a sample I wrote that uses AVX2 to accelerate alpha blending.

#include <stdint.h>

// Pre-compute constants: ymm5 = alpha shuffle mask, ymm6 = 0x0100 in every
// 16-bit word, ymm7 = all zeros. These registers are expected to survive
// until alpha_blend() is called.
void alpha_blend_init()
{
  __declspec(align(32)) uint32_t shuffle[] = { 0xFF06FF06, 0xFF06FF06, 0xFF0EFF0E, 0xFF0EFF0E };
  uint32_t * p_shuffle = &shuffle[0];

  __asm {
    VZEROALL;                       // clear all ymm registers; ymm7 stays zero
    MOV ECX, 0x100;
    VMOVD xmm6, ECX;
    VPBROADCASTW ymm6, xmm6;        // ymm6 = 256 in every 16-bit word
    MOV ECX, p_shuffle;
    VMOVDQA xmm5, [ECX];            // load the 16-byte alpha shuffle mask
    VPERM2I128 ymm5, ymm5, ymm5, 0; // duplicate the mask into both 128-bit lanes
  }
}
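
The same constants can also be set up with intrinsics instead of inline assembly (a sketch of mine, not the original code; the g_* globals and the function name are hypothetical). This avoids relying on ymm5-ymm7 surviving between calls, and it also works on x64, where MSVC does not support __asm:

// Intrinsics-based equivalent of the pre-computed constants (a sketch,
// assuming the same BGRA byte order as the assembly version).
#include <immintrin.h>

static __m256i g_shuffle;  // broadcasts each pixel's alpha byte to all 4 channel words
static __m256i g_256;      // 0x0100 in every 16-bit word
static __m256i g_zero;     // all zeros, used for unpacking/packing

static void alpha_blend_init_intrin(void)
{
  const __m128i shuffle128 = _mm_setr_epi32((int)0xFF06FF06, (int)0xFF06FF06,
                                            (int)0xFF0EFF0E, (int)0xFF0EFF0E);
  g_shuffle = _mm256_broadcastsi128_si256(shuffle128); // same effect as VPERM2I128
  g_256     = _mm256_set1_epi16(0x100);                // same effect as VPBROADCASTW
  g_zero    = _mm256_setzero_si256();                  // stands in for ymm7
}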

// Blend 4 pixels at a time. All three pointers must be 16-byte aligned, and
// alpha_blend_init() must have been called beforehand (ymm5/ymm6/ymm7 must
// still hold the pre-computed constants).
void alpha_blend(uint32_t * back4px, uint32_t * fore4px, uint32_t * out4px)
{
  __asm {
    MOV EBX, back4px;
    VMOVDQA xmm0, [EBX];          // load 4 background pixels (16 bytes)
    MOV ECX, fore4px;
    VMOVDQA xmm1, [ECX];          // load 4 foreground pixels
    VPERMQ ymm0, ymm0, 0xDC;      // cross-lane: pixels 0-1 -> low lane, pixels 2-3 -> high lane
    VPERMQ ymm1, ymm1, 0xDC;
    VPUNPCKLBW ymm0, ymm0, ymm7;  // ymm0 = back, bytes zero-extended to 16-bit words
    VPUNPCKLBW ymm1, ymm1, ymm7;  // ymm1 = fore, bytes zero-extended to 16-bit words
    VPSHUFB ymm2, ymm1, ymm5;     // ymm2 = alpha(fore) broadcast to every channel word
    VPSUBW ymm3, ymm6, ymm2;      // ymm3 = 256 - alpha
    VPMULLW ymm1, ymm1, ymm2;     // ymm1 = fore * alpha
    VPMULLW ymm0, ymm0, ymm3;     // ymm0 = back * (256 - alpha)
    VPADDW ymm0, ymm0, ymm1;      // sum the two products
    VPSRLW ymm0, ymm0, 8;         // divide by 256
    VPACKUSWB ymm1, ymm7, ymm0;   // pack words back to bytes (upper half of each lane)
    VPERMQ ymm1, ymm1, 0x0D;      // cross-lane: gather the 4 pixels into the low 128 bits
    MOV ECX, out4px;
    VMOVDQA [ECX], xmm1;          // store 4 blended pixels
  }
}
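
For completeness, a driver over a whole image could look like this (a sketch, assuming 16-byte-aligned BGRA buffers whose pixel count is a multiple of 4, and that nothing between the calls clobbers ymm5-ymm7; the function and buffer names are illustrative):

#include <stddef.h>
#include <stdint.h>

// Blend fore over back into out, 4 pixels per call.
void blend_image(uint32_t * back, uint32_t * fore, uint32_t * out, size_t npixels)
{
  alpha_blend_init();                        // set up ymm5/ymm6/ymm7 once
  for (size_t i = 0; i < npixels; i += 4)    // npixels assumed to be a multiple of 4
    alpha_blend(back + i, fore + i, out + i);
}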

The code is based on the following alpha blend formula, assuming that we are blending an image with an alpha channel (A) over an opaque bitmap (B):

C_rgb = ( A_rgb * A_a + B_rgb * (256 - A_a) ) / 256
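
For verification, here is the same formula in plain C for a single 32-bit pixel (a scalar sketch of mine, not from the original post; like the SIMD code it applies the formula to all four bytes, including the alpha byte, which the shuffle mask above takes from the most significant byte of each pixel):

#include <stdint.h>

// Scalar reference of the blend formula, handy for checking the SIMD output.
uint32_t alpha_blend_scalar(uint32_t back, uint32_t fore)
{
  uint32_t a = (fore >> 24) & 0xFF;           // A_a: foreground alpha
  uint32_t out = 0;
  for (int shift = 0; shift < 32; shift += 8) {
    uint32_t f = (fore >> shift) & 0xFF;      // A channel value
    uint32_t b = (back >> shift) & 0xFF;      // B channel value
    uint32_t c = (f * a + b * (256 - a)) >> 8;
    out |= c << shift;
  }
  return out;
}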

Published @ 26th July, 2015