从标志变量想到的

标志变量是我们写代码时，常常要用到的。

如果你对一个图片进行Z字形扫描，你可能要用一个变量direct来指明当前扫描的方向，比如direct = 0表示从左向右，direct = 1表示从右到左，所以你每扫描一行，你需要将标志变量更换一次,以使扫描来回进行，看到有人这么写：

for (... ...)
{
    ... ...
    if (direct == 1) {
        direct = 0;
    } else {
        direct = 1;
    }
}

当时非常惊讶，窃以为效率低下，且代码臃肿，就自己试了几种方法来进行测试：

方法1 direct = !direct
方法2 direct = (++direct)&1
方法3 direct = (direct == 1 ? 0 : 1)
方法4 即为上面提到的if...else方法

测试的代码如下，相当简单，我就不注解了：

#include <iostream>
using namespace std;
int main()
{
    int direct = 0;
    for (int i = 0; i != 10000000; i++)
    {
        //direct = !direct;//方法1
        // real 0m20.994s
        //user  0m1.760s
        //sys   0m0.088s
        //direct = (++direct)&1;//方法2
        // real 0m21.048s
        //user  0m1.940s
        //sys   0m0.108s
        //direct = (direct == 1 ? 0 : 1);//方法3
        // real 0m20.955s
        //user  0m1.808s
        //sys   0m0.104s
        //if (__builtin_expect(direct == 0, 1))//方法5
        //real  0m21.111s
        //user  0m1.864s
        //sys   0m0.116s
        if (direct == 0)//方法4
        // real 0m21.114s
        //user  0m1.940s
        //sys   0m0.096s
              direct = 1;
        else
              direct = 0;
        cout<<direct;
    }
    cout<<endl;
    return 0;
}

我是用linux下的time命令进行测试的，输出包含三个时间，分别为real、user、sys，我代码中用注释标识的便是各种情况下所使用的时间。

注意一点，这里请不要用real时间来比较，看了下面的式子，你就明白了：

sys = 内核态用时
user＝用户态用时
real＝内核态用时＋用户态用时＋IO操作用时＋其他

要比较程序执行的时间，当然要比较用在CPU上的时间了，说不定在某个倒霉的时刻，你的某个后台程序进行了一次无聊的操作，占用了IO，导致你的程序卡了5分钟，你的real就是5分钟，你肯定不会想到用5分钟来进行比较。所以，我们需要用sys+user来进行比较。

从实验結果很容易发现，方法1，3要比方法4好得多，你可能会想，这差别并不大啊？不过，程序员的优劣可就体现在这细节上。

你可能注意到了，我上面还注明了一个方法5，这是从一位学长那学到的，我也来介绍介绍，以下是他给我的讲解：

The idea is very simple. Execution of your program uses pipelining. An IF statement in hardware pipelining causes 1 bubble (latency), this is mainly because branch statement can only be executed after the result in the condition is evaluated. The solution is via branch prediction. That is, processor go ahead to evaluate the branch statement according to its prediction on the condition evaluation. Later if the branch prediction is correct (i.e. matching the concrete result of the condition evaluation), the prediction is correct and one bubble is saved. Otherwise, processor flushes the whole pipeline and redo the evaluation of the branch according to the concrete condition value.

So the problem becomes, how to minimize the number of wrong predictions. One famous solution is through multi-level branch predictor. e.g. 2-bit branch predictor, the intuition here is, we keep one bit as index for the prediction history, and we do prediction using the other bit, referring to history, so that we can make more precise results. As experimented with SPEC2000 Benchmarks, branch prediction precision could arrive as high as 97%. And for your case (bit flipping), the precision could be as high as 100%. That's the intuition why using if statement could speed up the program because of branch predictor.

You definitely need to understand more on branch prediction in this case, andyou could go and read your Hennessy and Patterson on that Chapter.

I'm a person whose research group cares a lot on performance, e.g. parallel and scalable high performance computation, static analysis and compiler optimization, so this is also my interests.

你可能不想看这段英文哦，呵呵，大体是说了关于分支预测的问题，要知道CPU都采用流水线工作，当步入了错误的代码分支，那么要步入正确的分支就必定刷新流水线的操作，这一操作是费时费资源的，所以为了减少这种浪费，就要有一套好的分支预测方案。现代CPU在当进入一个循环体中的条件判断語句时，会用几个bit位来记录上次的分支情况，当下次循环时，很大概率上会是同样的条件判断結果，这样就保证了CPU有了一定的机器学习能力，它常常能做出更好的选择。

然而现在情况是，我们的direct变量会一个劲的变，如果CPU按照上次的经验进行选择，那么可怜的CPU总是做出错误的选择，除非它相当聪明能够意识到这种规律的所在。所以这里我加入了方法5，使用的是gcc内置函数__built_expect()如果你看过linux源码，你就知道likely() 和unlikely()就是对于这个内置函数封装的宏，这个函数能够使程序员有机会帮助CP U做出更好的选择，这里我使用(__builtin_expect(direct == 0, 1))告诉CPU总是选择 direct==0的分支进行预先运行，这样一来，CPU从100％的错误选择进化到了50％的错误选择，所以程序运行变得快了些。

我想，如果你对CPU的工作知之甚少，你很可能不知道我正在说什么，呵呵，不过，你可以找点资料看看后，你会觉得这是很容易理解的。

发表于 2010年03月26日 06:10 评论：0 阅读：2827

上一篇：我求最大公约数的方法
下一篇：关于Young氏矩阵中存在性问题