SSE with GCC and unaligned addresses

I am not familiar with AT&T assembly syntax or with GCC-style inline assembly, so it took me some time to get this working. Here is the result: sample code that adds or multiplies C/C++ float vectors using SSE instructions.

The vectors do not need 16-byte aligned addresses, which is why I needed this code. As far as I could tell, the functions provided with GCC do not allow that. If you want to walk through an array moving the pointer by 1 element rather than by 4, then MOVUPS is necessary instead of MOVAPS, because MOVAPS faults on an address that is not 16-byte aligned.
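To make the alignment point concrete, here is a minimal sketch that checks whether an address sits on a 16-byte boundary. The names `is_aligned16` and `count_aligned_elements` are hypothetical helpers for illustration, and the `aligned(16)` attribute is the GCC spelling for forcing the array itself onto a 16-byte boundary.

```c
#include <stdint.h>

/* Returns 1 if p sits on a 16-byte boundary, 0 otherwise. */
int is_aligned16(const void *p) {
    return ((uintptr_t)p % 16u) == 0;
}

/* Hypothetical demo: counts how many of the 8 element addresses of a
 * 16-byte-aligned float array are themselves 16-byte aligned.
 * A float is 4 bytes, so only every 4th element qualifies. */
int count_aligned_elements(void) {
    float data[8] __attribute__((aligned(16))) = {0};
    int count = 0;
    for (int i = 0; i < 8; ++i) {
        if (is_aligned16(&data[i])) {
            ++count;
        }
    }
    return count;
}
```

Only `&data[0]` and `&data[4]` pass the check, so a 4-float window starting at any other element has to be loaded with an unaligned move.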

The code:

   /**
    * The input parameters 'a' and 'b' should be 4 element float vectors.
    * The result should point to a 4 element float vector as well.
    * The result is the SSE addition r=a+b.
    */
   inline void addSSE(float* a, float* b, float* result) {
           __asm__ __volatile__
           (
                   "movups (%[a]), %%xmm0 \n\t"
                   "movups (%[b]), %%xmm1 \n\t"
                   "addps  %%xmm1, %%xmm0 \n\t"
                   "movups %%xmm0, %[result] \n\t"
                   : [result] "=m" (*result)
                   : [a] "r" (a), [b] "r" (b)
                   /* "memory": the asm reads a[0..3] and b[0..3] and writes
                      all of result[0..3], not only the operands listed above */
                   : "%xmm0", "%xmm1", "memory"
           );
   }

   /**
    * The input parameters 'a' and 'b' should be 4 element float vectors.
    * The result should point to a 4 element float vector as well.
    * The result is the SSE multiplication r=a*b.
    */
   inline void mulSSE(float* a, float* b, float* result) {
           __asm__ __volatile__
           (
                   "movups (%[a]), %%xmm0 \n\t"
                   "movups (%[b]), %%xmm1 \n\t"
                   "mulps  %%xmm1, %%xmm0 \n\t"
                   "movups %%xmm0, %[result] \n\t"
                   : [result] "=m" (*result)
                   : [a] "r" (a), [b] "r" (b)
                   /* "memory": the asm reads a[0..3] and b[0..3] and writes
                      all of result[0..3], not only the operands listed above */
                   : "%xmm0", "%xmm1", "memory"
           );
   }
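The functions above handle exactly 4 floats per call. As a rough sketch of how the same idea might extend to a whole array of arbitrary length and alignment, the block below swaps the inline asm for the `_mm_loadu_ps`, `_mm_add_ps` and `_mm_storeu_ps` intrinsics from `<xmmintrin.h>`. The name `add_arrays` is hypothetical, and the scalar fallback for builds without SSE is an assumption added so the sketch compiles everywhere.

```c
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Computes r[i] = a[i] + b[i] for i in [0, n).
 * None of the three pointers needs to be 16-byte aligned. */
void add_arrays(const float *a, const float *b, float *r, size_t n) {
    size_t i = 0;
#if defined(__SSE__)
    /* Four floats per step, using the unaligned load/store intrinsics. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(r + i, _mm_add_ps(va, vb));
    }
#endif
    /* Scalar tail: the last n % 4 elements, or the whole array
     * when SSE is not available at compile time. */
    for (; i < n; ++i) {
        r[i] = a[i] + b[i];
    }
}
```

Calling it as `add_arrays(x + 1, y + 1, out, n - 1)` starts one element into each array, which is the "move the pointer by 1 unit" case that rules out aligned loads.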

Posted by wojtek Mon, 05 May 2008 11:54:00 GMT




Comments

  1. bob almost 2 years later:

    v4sf a = __builtin_ia32_loadups(&float_array[n]);

    will force the use of movups as opposed to movaps, so it will work even if &float_array[n] is not 16-byte aligned

    (of course, your code accomplishes the same thing, but you can get away without the inline asm)

  2. bob almost 2 years later:

    There are underscores in that function name, the comment system has hidden them, it should be:

    _ _ builtin _ ia32 _ loadups

    remove the spaces
