OpenMP - reduction

作者：已經改掉暱稱的米奇│2021-07-30 13:04:22│巴幣：0│人氣：254

假設現在有個迴圈，迴圈中會把某個變數不停加上某個值

接著想要透過 openmp 把迴圈平行化，像是這樣

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int add(int);

int main(void) {
    int a = 0;

    #pragma omp parallel for
    for (int i = 0; i < 100000; i++) {
        a = a + add(1);
    }

    printf("a = %d\n", a);    // should be 100000

    return 0;
}

int add(int x) {
    for (int i = 0; i < 10000; i++) {
        // do nothing, just waste time
    }
    return x;
}

compile 並執行，會發現 a 的結果是錯的

$ gcc test1.c -std=c99 -fopenmp
$ time ./a.out
a = 97855

real    0m0.303s
user    0m2.371s
sys     0m0.006s

應該要輸出 100000，結果卻只有 97855

這是因為 data race 的關係

也就是不同 thread 可能會同時試圖修改 a，導致最後計算的結果出錯

要解決這個情況，最簡單的方式就是透過 critical 或是 atomic

限制一次只能有一個 thread 執行 a = a + add(1) 這行

（但這個情況下用 critical 反而會更慢）

另一個解決方法，可以把整個 i = 0 ～ 99999 的迴圈，拆給 N個 thread 獨立去處理

各 thread 把各自的部分算好之後，最後再把這 N 個 thread 的結果加起來

這樣就避免了 data race 的問題

像是這樣

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <omp.h>

// define number of threads
#define N 5

int add(int);

int main(void) {
    int a = 0;
    int tmpThreads[N];

    // initialize tmpThreads
    for (int i = 0; i < N; i++) {
        tmpThreads[i] = 0;
    }

    #pragma omp parallel for num_threads(N)
    for (int i = 0; i < 100000; i++) {
        int tid = omp_get_thread_num();
        tmpThreads[tid] += add(1);
    }

    // sum up the results of each thread
    for (int i = 0; i < N; i++) {
        a += tmpThreads[i];
    }

    printf("a = %d\n", a);

    return 0;
}

int add(int x) {
    for (int i = 0; i < 10000; i++) {
        // do nothing, just waste time
    }
    return x;
}

執行結果：

$ gcc test2.c -std=c99 -fopenmp
$ time ./a.out
a = 100000

real    0m0.456s
user    0m2.211s
sys     0m0.000s

上面這個做法（把迴圈拆給多個 thread 獨立處理，最後再全部加起來）

基本上就是 openmp 的 reduction 的精神

使用 reduction 的話會像這樣

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int add(int);

int main(void) {
    int a = 0;

    #pragma omp parallel for reduction(+:a)
    for (int i = 0; i < 100000; i++) {
        a = a + add(1);
    }

    printf("a = %d\n", a);

    return 0;
}

int add(int x) {
    for (int i = 0; i < 10000; i++) {
        // do nothing, just waste time
    }
    return x;
}

執行結果：

$ gcc test2.c -std=c99 -fopenmp
$ time ./a.out
a = 100000

real    0m0.293s
user    0m2.342s
sys     0m0.000s

可以發現結果是正確的，並且執行時間也確實有比較快（沒有 openmp 的話約 2 秒）

reduction 就是使用在類似這樣的場合

也就是類似這種情況

for (int i=0; i < N; i++) {
a = a + something;
}

直接平行迴圈會出現 data race 的問題

那就可以用 reduction 來處理

#pragma omp parallel for reduction(+:a)
for (int i=0; i < N; i++) {
a = a + something;
}

更廣泛一點

如果迴圈中是類似這樣的計算

targetVar = targetVar <operator> <expr>