SSE with GCC and unaligned addresses
I am not familiar with AT&T assembly syntax or GCC-style inline assembly, so it took me some time to get this working. Here are the results: sample code that adds or multiplies C/C++ float vectors using SSE instructions.
The vectors do not have to be at 16-byte aligned addresses; that is why I needed this code. The functions provided with GCC do not allow that. If you want to operate on an array and move the pointer by one element (one float) rather than by four, then MOVUPS is necessary instead of MOVAPS, because MOVAPS faults on any address that is not 16-byte aligned.
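As a sketch of the move-the-pointer-by-one case, assuming an x86 target with SSE and the standard <xmmintrin.h> intrinsics rather than inline asm, the hypothetical helper adjacent_sums computes out[i] = x[i] + x[i + 1]. The two loads in each iteration start one float (4 bytes) apart, so whatever the alignment of x, at least one of them is misaligned and needs the unaligned form.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics: _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */

/* Hypothetical helper: out[i] = x[i] + x[i + 1] for i = 0 .. n - 2.
 * x must hold n floats and out must hold n - 1 floats; nothing is
 * written when n < 2. No pointer needs any particular alignment. */
static void adjacent_sums(const float *x, float *out, size_t n)
{
    size_t i = 0;
    /* Vector part: 4 outputs per iteration, reading x[i] .. x[i + 4]. */
    for (; i + 4 < n; i += 4) {
        __m128 lo = _mm_loadu_ps(&x[i]);      /* x[i]   .. x[i+3] */
        __m128 hi = _mm_loadu_ps(&x[i + 1]);  /* x[i+1] .. x[i+4], shifted by one float */
        _mm_storeu_ps(&out[i], _mm_add_ps(lo, hi));
    }
    /* Scalar tail for the remaining outputs. */
    for (; i + 1 < n; ++i)
        out[i] = x[i] + x[i + 1];
}
```

Calling it on x + 1 instead of x shifts every load by another 4 bytes, which would break an aligned-load (MOVAPS) version.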
The code:
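/**
* The input parameters 'a' and 'b' should point to 4-element float vectors.
* 'result' should point to a 4-element float vector as well.
* None of the pointers has to be 16-byte aligned.
* The result is the SSE addition result = a + b.
*/
static inline void addSSE(const float* a, const float* b, float* result) {
    __asm__ __volatile__
    (
        "movups %[a], %%xmm0 \n\t"        /* unaligned load of a[0..3] */
        "movups %[b], %%xmm1 \n\t"        /* unaligned load of b[0..3] */
        "addps %%xmm1, %%xmm0 \n\t"       /* xmm0 = xmm0 + xmm1, per element */
        "movups %%xmm0, %[result] \n\t"   /* unaligned store of all 4 floats */
        /* The float[4] casts tell GCC that all 16 bytes are read and written. */
        : [result] "=m" (*(float (*)[4])result)
        : [a] "m" (*(const float (*)[4])a), [b] "m" (*(const float (*)[4])b)
        : "xmm0", "xmm1"
    );
}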
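For comparison, a sketch of the same 4-float addition written with the SSE intrinsics from <xmmintrin.h> instead of inline asm; addSSE_intrin is a hypothetical name. _mm_loadu_ps and _mm_storeu_ps are the unaligned load and store intrinsics, and because the compiler sees every memory access itself, no operand constraints or clobber list are needed.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical name: same contract as addSSE (4 floats behind a, b and
 * result, no alignment requirement), written with intrinsics. */
static inline void addSSE_intrin(const float *a, const float *b, float *result)
{
    __m128 va = _mm_loadu_ps(a);                /* unaligned load, like MOVUPS */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(result, _mm_add_ps(va, vb));  /* add, then unaligned store */
}
```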
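/**
* The input parameters 'a' and 'b' should point to 4-element float vectors.
* 'result' should point to a 4-element float vector as well.
* None of the pointers has to be 16-byte aligned.
* The result is the SSE multiplication result = a * b (element-wise).
*/
static inline void mulSSE(const float* a, const float* b, float* result) {
    __asm__ __volatile__
    (
        "movups %[a], %%xmm0 \n\t"        /* unaligned load of a[0..3] */
        "movups %[b], %%xmm1 \n\t"        /* unaligned load of b[0..3] */
        "mulps %%xmm1, %%xmm0 \n\t"       /* xmm0 = xmm0 * xmm1, per element */
        "movups %%xmm0, %[result] \n\t"   /* unaligned store of all 4 floats */
        /* The float[4] casts tell GCC that all 16 bytes are read and written. */
        : [result] "=m" (*(float (*)[4])result)
        : [a] "m" (*(const float (*)[4])a), [b] "m" (*(const float (*)[4])b)
        : "xmm0", "xmm1"
    );
}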
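The functions above always process exactly four floats. A sketch of how the same idea might be extended to arrays of any length, again assuming intrinsics and using the hypothetical helper mul_arrays: whole groups of four are multiplied with vector code, and the last n % 4 elements with scalar code.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: result[i] = a[i] * b[i] for i = 0 .. n - 1.
 * All loads and stores are the unaligned variants, so a, b and result
 * may point anywhere inside a float array. */
static void mul_arrays(const float *a, const float *b, float *result, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(result + i, _mm_mul_ps(va, vb));
    }
    /* Fewer than 4 elements left: finish with scalar code. */
    for (; i < n; ++i)
        result[i] = a[i] * b[i];
}
```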
v4sf a = __builtin_ia32_loadups(&float_array[n]);
will force the use of MOVUPS as opposed to MOVAPS, so it works even if &float_array[n] is not 16-byte aligned (with v4sf declared as typedef float v4sf __attribute__((vector_size(16)));).
(Of course, your code accomplishes the same thing, but you can get away without the inline asm.)
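Another way to avoid both the inline asm and the target-specific builtin name is a sketch based on GCC's generic vector extensions, assuming GCC or Clang: the v4sf type supports +, -, * and / directly, and memcpy performs the unaligned load and store. The helpers load_v4sf and store_v4sf are hypothetical names, and whether each copy becomes a single MOVUPS depends on the compiler and optimization level.

```c
#include <string.h>  /* memcpy */

/* GCC vector extension: four floats in one 16-byte vector. */
typedef float v4sf __attribute__((vector_size(16)));

/* Hypothetical helper: unaligned load. memcpy makes no alignment
 * assumption about p. */
static inline v4sf load_v4sf(const float *p)
{
    v4sf v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Hypothetical helper: unaligned store of all four lanes. */
static inline void store_v4sf(float *p, v4sf v)
{
    memcpy(p, &v, sizeof v);
}
```

For example, store_v4sf(out, load_v4sf(&x[1]) + load_v4sf(&x[2])) adds two overlapping, one-float-apart windows of x without any explicit SSE instruction.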