Kernel benchmarks: [GoFlex] vs [A10 3.0.8+] vs. [A10 3.0.36+ defconfig] vs. [A10 3.0.36+ a10linux_config]

Posted by gnexus 
My A10 3.0.36+ kernel configuration and testing are mostly done now, but some benchmarking is still needed. We need to know that, at the very least, we are not losing any performance with the new kernel config. Any gains would be a nice bonus. So let's see how the three kernels compare with each other and with some other ARM platforms.

After a bit of research I decided that lmbench would be the most convenient choice for a benchmark suite. I will use it to compare the three kernels, and also add in the Seagate GoFlex to give a good picture of the performance those users might expect on the A10. I will also try to link to a few other online lmbench results for other ARM platforms.

Stay tuned for the results!



Edited 1 time(s). Last edit at 07/19/2012 06:01AM by gnexus.
Re: Kernel benchmarks: 3.0.8+ vs. 3.0.36+ defconfig vs. 3.0.36+ a10linux_config
July 18, 2012 05:32PM
Of course, lmbench first needs to compile properly. . .
That may be more difficult than compiling the kernel.

It's not exactly an armhf-friendly program.
Re: Kernel benchmarks: 3.0.8+ vs. 3.0.36+ defconfig vs. 3.0.36+ a10linux_config
July 19, 2012 05:15AM
Actually I should admit that lmbench does compile in Sid armhf. But the memsize binary does not work, so the build is useless. . .
My next step was to compile it on my GoFlex, since I want to benchmark that platform too. As an initial benchmark for that platform, I will say that compiling lmbench on the GoFlex takes about 3-4 times as long as it does on the A10. So we know before the starting gate has even opened that Kirkwood has lost the race big time. Sorry Kirkwood. . .

But we knew that already, didn't we? That's why we have A10s! I'm just hoping lmbench works on the GoFlex. Then hopefully I can use the binary on the A10. Until lmbench gets a patch to compile natively on armhf, that is the best we will get.

Lmbench is all compiled now. It is only suggesting the use of 85MB for the memory size. So it looks like the memsize binary built okay. It works!

Here goes:
It is asking me for the hdd. That will be cheating on the GoFlex, which is using a very fast 2TB 7200RPM eSATA drive. Thus the hdd benchmarks will be highly skewed toward the GoFlex. But I'm certain that the dismal showing for the processor will make up for that in spades. For some reason benchmarking is taking a VERY long time on the GoFlex. . .

3 days later (actually it only took 0.05 days): It's finished! Here are the results:
Wait. Where are the results? Duh. They are in the results folder. Why are they not also on screen? Ooh. Tons of data!
************************************************************************************************************************************
[lmbench3.0 results for Linux DS02 3.2.0-2-kirkwood #1 Sat Jun 2 13:45:52 UTC 2012 armv5tel GNU/Linux]
[LMBENCH_VER: 3.0-a9]
[BENCHMARK_HARDWARE: YES]
[BENCHMARK_OS: YES]
[ALL: 512 1k 2k 4k 8k 16k 32k 64k 128k 256k 512k 1m 2m 4m 8m 16m 32m 64m]
[DISKS: /dev/sda ]
[DISK_DESC: [/dev/sda:WDC WD2002FAEX] ]
Simple syscall: 0.2020 microseconds
Simple read: 0.8106 microseconds
Simple write: 0.7403 microseconds
Simple stat: 3.4465 microseconds
Simple fstat: 1.1711 microseconds
Simple open/close: 6.3901 microseconds
Select on 10 fd's: 1.5198 microseconds
Select on 100 fd's: 10.2851 microseconds
Select on 250 fd's: 24.5319 microseconds
Select on 500 fd's: 49.3756 microseconds
Select on 10 tcp fd's: 1.7731 microseconds
Select on 100 tcp fd's: 20.6685 microseconds
Select on 250 tcp fd's: 61.4722 microseconds
Select on 500 tcp fd's: 104.8602 microseconds
Signal handler installation: 0.8378 microseconds
Signal handler overhead: 2.5305 microseconds
Protection fault: 0.4152 microseconds
Pipe latency: 53.3706 microseconds
AF_UNIX sock stream latency: 62.1190 microseconds
Process fork+exit: 898.7273 microseconds
Process fork+execve: 2913.5000 microseconds
Process fork+/bin/sh -c: 6702.5000 microseconds
integer bit: 0.85 nanoseconds
integer add: 0.97 nanoseconds
integer mul: 2.61 nanoseconds
integer div: 149.21 nanoseconds
integer mod: 36.61 nanoseconds
int64 bit: 1.70 nanoseconds
uint64 add: 1.91 nanoseconds
int64 mul: 7.07 nanoseconds
int64 div: 439.60 nanoseconds
int64 mod: 286.19 nanoseconds
float add: 36.52 nanoseconds
float mul: 30.52 nanoseconds
float div: 163.00 nanoseconds
double add: 51.88 nanoseconds
double mul: 46.69 nanoseconds
double div: 541.32 nanoseconds
float bogomflops: 332.25 nanoseconds
double bogomflops: 805.17 nanoseconds
integer bit parallelism: 1.05
integer add parallelism: 1.33
integer mul parallelism: 2.70
integer div parallelism: 1.00
integer mod parallelism: 1.00
int64 bit parallelism: 1.00
int64 add parallelism: 1.00
int64 mul parallelism: 1.00
int64 div parallelism: 1.00
int64 mod parallelism: 1.00
float add parallelism: 1.00
float mul parallelism: 1.00
float div parallelism: 1.00
double add parallelism: 1.00
double mul parallelism: 1.00
double div parallelism: 1.00
File /var/tmp/XXX write bandwidth: 112887 KB/sec
Pagefaults on /var/tmp/XXX: 4296.4660 microseconds

"mappings
0.524288 49
1.048576 61
2.097152 82
4.194304 136
8.388608 227
16.777216 439
33.554432 811
67.108864 1717

"File system latency
0k 294 27546 56903
1k 159 13315 32737
4k 158 14189 31828
10k 95 9214 23759

"Seek times for /dev/sda
I left these out as they are very long and irrelevant for most users. They ranged between 1.05 and 11.40 ms.

"Zone bandwidth for /dev/sda
Also left out as they were very long and irrelevant. They ranged between 123.13 and 73.42.


UDP latency using localhost: 92.2983 microseconds
TCP latency using localhost: 123.6383 microseconds
TCP/IP connection cost to localhost: 325.5161 microseconds

Socket bandwidth using localhost
0.000001 0.51 MB/sec
0.000064 11.35 MB/sec
0.000128 20.51 MB/sec
0.000256 33.17 MB/sec
0.000512 52.05 MB/sec
0.001024 70.53 MB/sec
0.001437 72.70 MB/sec
10.000000 111.10 MB/sec

Avg xfer: 3.2KB, 41.8KB in 11.7690 millisecs, 3.55 MB/sec
AF_UNIX sock stream bandwidth: 215.24 MB/sec
Pipe bandwidth: 232.80 MB/sec
"read bandwidth
0.000512 156.82
0.001024 255.84
0.002048 369.83
0.004096 494.01
0.008192 455.84
0.016384 357.83
0.032768 356.78
0.065536 328.36
0.131072 255.96
0.262144 189.86
0.524288 168.16
1.05 164.69
2.10 164.43
4.19 161.10
8.39 164.25
16.78 162.88
33.55 164.73
67.11 95.02

"read open2close bandwidth
0.000512 38.92
0.001024 74.18
0.002048 131.70
0.004096 220.31
0.008192 273.31
0.016384 290.92
0.032768 309.27
0.065536 308.82
0.131072 224.85
0.262144 193.14
0.524288 171.73
1.05 170.88
2.10 161.93
4.19 164.51
8.39 162.91
16.78 164.49
33.55 164.80
67.11 95.85


"Mmap read bandwidth
0.000512 979.41
0.001024 1025.62
0.002048 1064.08
0.004096 1080.56
0.008192 1069.89
0.016384 1071.27
0.032768 694.18
0.065536 656.91
0.131072 558.94
0.262144 420.23
0.524288 301.46
1.05 290.87
2.10 285.95
4.19 288.19
8.39 279.10
16.78 282.85
33.55 283.13
67.11 101.93

"Mmap read open2close bandwidth
0.000512 14.39
0.001024 27.69
0.002048 53.33
0.004096 98.85
0.008192 148.73
0.016384 199.30
0.032768 255.88
0.065536 284.19
0.131072 293.08
0.262144 250.88
0.524288 223.10
1.05 212.61
2.10 216.47
4.19 216.65
8.39 215.34
16.78 215.66
33.55 216.19
67.11 99.68


"libc bcopy unaligned
0.000512 331.84
0.001024 305.79
0.002048 309.02
0.004096 323.35
0.008192 328.57
0.016384 335.41
0.032768 316.60
0.065536 305.83
0.131072 283.59
0.262144 221.05
0.524288 194.69
1.05 183.38
2.10 180.99
4.19 175.63
8.39 180.79
16.78 177.73
33.55 184.68

"libc bcopy aligned
0.000512 341.92
0.001024 341.04
0.002048 336.22
0.004096 329.06
0.008192 348.32
0.016384 338.04
0.032768 329.45
0.065536 317.55
0.131072 305.77
0.262144 262.91
0.524288 224.86
1.05 215.06
2.10 208.20
4.19 213.04
8.39 211.11
16.78 211.29
33.55 209.53

Memory bzero bandwidth
0.000512 670.03
0.001024 624.52
0.002048 693.83
0.004096 662.49
0.008192 662.01
0.016384 673.20
0.032768 668.04
0.065536 658.59
0.131072 657.28
0.262144 644.32
0.524288 639.08
1.05 661.74
2.10 664.39
4.19 658.29
8.39 655.26
16.78 645.28
33.55 651.73
67.11 645.07

"unrolled bcopy unaligned
0.000512 311.89
0.001024 337.81
0.002048 337.40
0.004096 329.10
0.008192 336.81
0.016384 328.81
0.032768 306.58
0.065536 307.43
0.131072 291.15
0.262144 237.79
0.524288 193.77
1.05 184.72
2.10 185.65
4.19 180.42
8.39 181.62
16.78 189.45
33.55 196.23

"unrolled partial bcopy unaligned
0.000512 1373.95
0.001024 1260.74
0.002048 1254.31
0.004096 1361.93
0.008192 1352.46
0.016384 1295.03
0.032768 1018.37
0.065536 974.21
0.131072 672.68
0.262144 446.62
0.524288 342.34
1.05 298.78
2.10 298.27
4.19 289.98
8.39 299.43
16.78 280.90
33.55 325.67

Memory read bandwidth
0.000512 1090.75
0.001024 1095.46
0.002048 1104.92
0.004096 1082.51
0.008192 1096.88
0.016384 1088.98
0.032768 714.57
0.065536 651.89
0.131072 572.83
0.262144 383.32
0.524288 311.45
1.05 288.18
2.10 285.48
4.19 286.95
8.39 282.75
16.78 283.81
33.55 283.27
67.11 283.30

Memory partial read bandwidth
0.000512 4068.67
0.001024 4171.29
0.002048 4199.84
0.004096 4304.41
0.008192 4358.81
0.016384 4134.54
0.032768 1422.88
0.065536 1178.89
0.131072 1032.84
0.262144 544.24
0.524288 409.60
1.05 363.84
2.10 371.28
4.19 361.14
8.39 367.99
16.78 354.94
33.55 354.77
67.11 349.66

Memory write bandwidth
0.000512 342.44
0.001024 316.94
0.002048 342.83
0.004096 336.25
0.008192 344.33
0.016384 336.69
0.032768 339.45
0.065536 334.91
0.131072 330.31
0.262144 334.27
0.524288 327.19
1.05 330.52
2.10 327.42
4.19 332.43
8.39 328.04
16.78 329.14
33.55 329.83
67.11 330.57

Memory partial write bandwidth
0.000512 1392.79
0.001024 1242.47
0.002048 1359.89
0.004096 1380.84
0.008192 1323.12
0.016384 1328.87
0.032768 1304.15
0.065536 1335.31
0.131072 1307.45
0.262144 1251.28
0.524288 1293.80
1.05 1267.49
2.10 1285.47
4.19 1285.41
8.39 1284.23
16.78 1297.34
33.55 1268.07
67.11 1270.47

Memory partial read/write bandwidth
0.000512 3257.94
0.001024 3322.75
0.002048 3386.92
0.004096 3367.46
0.008192 3426.73
0.016384 3354.66
0.032768 1318.17
0.065536 1093.86
0.131072 511.21
0.262144 330.23
0.524288 248.41
1.05 219.57
2.10 215.07
4.19 211.33
8.39 211.49
16.78 212.85
33.55 210.84
67.11 209.87



"size=0k ovr=4.41
2 24.08
4 24.61
8 24.56
16 25.22
24 25.39
32 25.95
64 28.03
96 28.47

"size=4k ovr=8.69
2 27.38
4 29.11
8 29.49
16 31.99
24 32.42
32 35.37
64 39.33
96 40.98

"size=8k ovr=13.11
2 30.36
4 32.28
8 33.85
16 37.78
24 42.93
32 47.12
64 50.99
96 52.60

"size=16k ovr=24.51
2 31.88
4 38.11
8 51.62
16 57.16
24 66.06
32 69.03
64 70.64
96 71.32

"size=32k ovr=59.21
2 39.64
4 51.84
8 69.45
16 92.89
24 95.26
32 96.64
64 94.69
96 94.41

"size=64k ovr=113.96
2 56.01
4 110.06
8 147.09
16 159.92
24 163.46
32 158.32
64 153.92
96 152.30

tlb: 8 pages

Memory load parallelism
0.000512 3.01
0.001024 3.50
0.002048 3.04
0.004096 3.30
0.008192 3.27
0.016384 7.72
0.032768 1.00
0.065536 1.11
0.131072 1.08
0.262144 1.00
0.524288 1.19
1.048576 1.00
2.097152 1.06
4.194304 1.00
8.388608 1.00
16.777216 1.20
33.554432 1.24
67.108864 1.40

STREAM copy latency: 35.85 nanoseconds
STREAM copy bandwidth: 446.30 MB/sec
STREAM scale latency: 80.56 nanoseconds
STREAM scale bandwidth: 198.60 MB/sec
STREAM add latency: 149.15 nanoseconds
STREAM add bandwidth: 160.91 MB/sec
STREAM triad latency: 163.24 nanoseconds
STREAM triad bandwidth: 147.03 MB/sec
STREAM2 fill latency: 12.11 nanoseconds
STREAM2 fill bandwidth: 660.84 MB/sec
STREAM2 copy latency: 35.10 nanoseconds
STREAM2 copy bandwidth: 455.90 MB/sec
STREAM2 daxpy latency: 149.38 nanoseconds
STREAM2 daxpy bandwidth: 160.67 MB/sec
STREAM2 sum latency: 72.40 nanoseconds
STREAM2 sum bandwidth: 110.49 MB/sec

Memory load latency
"stride=16
0.00049 2.825
0.00098 2.845
0.00195 2.814
0.00293 2.830
0.00391 2.780
0.00586 2.793
0.00781 2.779
0.01172 2.779
0.01562 2.837
0.02344 8.729
0.03125 10.177
0.04688 11.730
0.06250 12.360
0.09375 13.102
0.12500 15.070
0.18750 21.356
0.25000 23.123
0.37500 34.249
0.50000 38.709
0.75000 41.664
1.00000 42.372
1.50000 42.887
2.00000 43.076
3.00000 43.240
4.00000 43.538
6.00000 43.617
8.00000 43.653
12.00000 43.334
16.00000 43.450
24.00000 43.336
32.00000 43.451
48.00000 43.497
64.00000 43.490

"stride=32
0.00049 2.811
0.00098 2.785
0.00195 2.770
0.00293 2.829
0.00391 2.820
0.00586 2.800
0.00781 2.791
0.01172 2.809
0.01562 2.853
0.02344 14.596
0.03125 17.605
0.04688 20.420
0.06250 21.690
0.09375 27.235
0.12500 33.990
0.18750 33.476
0.25000 49.188
0.37500 65.351
0.50000 70.913
0.75000 80.106
1.00000 82.615
1.50000 83.097
2.00000 82.892
3.00000 80.924
4.00000 83.112
6.00000 82.907
8.00000 83.484
12.00000 83.687
16.00000 83.241
24.00000 83.908
32.00000 83.560
48.00000 83.659
64.00000 83.732

"stride=64
0.00049 2.825
0.00098 2.822
0.00195 2.823
0.00293 2.822
0.00391 2.816
0.00586 2.825
0.00781 2.801
0.01172 2.849
0.01562 2.902
0.02344 14.572
0.03125 17.315
0.04688 20.119
0.06250 21.342
0.09375 22.325
0.12500 23.428
0.18750 41.993
0.25000 48.708
0.37500 65.993
0.50000 73.926
0.75000 77.764
1.00000 80.365
1.50000 82.921
2.00000 83.470
3.00000 82.539
4.00000 83.543
6.00000 83.822
8.00000 83.276
12.00000 83.902
16.00000 81.197
24.00000 83.395
32.00000 83.393
48.00000 83.782
64.00000 83.799

"stride=128
0.00049 2.824
0.00098 2.819
0.00195 2.838
0.00293 3.700
0.00391 4.661
0.00586 2.624
0.00781 2.689
0.01172 2.711
0.01562 2.719
0.02344 13.398
0.03125 16.096
0.04688 19.162
0.06250 20.076
0.09375 21.284
0.12500 32.366
0.18750 37.205
0.25000 43.019
0.37500 56.211
0.50000 61.812
0.75000 69.510
1.00000 71.346
1.50000 75.173
2.00000 76.623
3.00000 78.574
4.00000 78.675
6.00000 79.688
8.00000 79.358
12.00000 79.959
16.00000 79.160
24.00000 79.553
32.00000 79.153
48.00000 79.792
64.00000 79.800

"stride=256
0.00049 2.687
0.00098 2.686
0.00195 2.661
0.00293 2.684
0.00391 2.687
0.00586 2.689
0.00781 2.687
0.01172 2.677
0.01562 2.739
0.02344 13.158
0.03125 16.147
0.04688 18.961
0.06250 20.036
0.09375 26.392
0.12500 31.640
0.18750 37.858
0.25000 49.387
0.37500 54.761
0.50000 65.584
0.75000 74.385
1.00000 75.854
1.50000 79.237
2.00000 80.146
3.00000 82.767
4.00000 83.511
6.00000 83.809
8.00000 84.345
12.00000 84.306
16.00000 84.992
24.00000 84.385
32.00000 84.542
48.00000 84.821
64.00000 84.452

"stride=512
0.00049 2.687
0.00098 2.692
0.00195 2.684
0.00293 2.691
0.00391 2.669
0.00586 2.681
0.00781 2.682
0.01172 2.701
0.01562 2.695
0.02344 13.231
0.03125 15.704
0.04688 19.117
0.06250 20.328
0.09375 26.261
0.12500 24.178
0.18750 34.359
0.25000 52.597
0.37500 63.936
0.50000 74.519
0.75000 81.778
1.00000 84.997
1.50000 88.436
2.00000 89.555
3.00000 91.715
4.00000 92.823
6.00000 94.080
8.00000 93.728
12.00000 94.360
16.00000 94.244
24.00000 94.463
32.00000 94.439
48.00000 93.902
64.00000 93.943

"stride=1024
0.00098 2.694
0.00195 2.691
0.00293 2.692
0.00391 2.684
0.00586 2.682
0.00781 2.692
0.01172 2.694
0.01562 2.752
0.02344 13.146
0.03125 15.750
0.04688 19.718
0.06250 21.145
0.09375 24.173
0.12500 28.140
0.18750 33.309
0.25000 66.868
0.37500 82.610
0.50000 90.736
0.75000 99.158
1.00000 103.630
1.50000 107.541
2.00000 108.902
3.00000 111.691
4.00000 112.147
6.00000 113.951
8.00000 113.417
12.00000 114.193
16.00000 113.951
24.00000 114.261
32.00000 114.271
48.00000 114.364
64.00000 115.189


Random load latency
"stride=16
0.00049 2.675
0.00098 2.663
0.00195 2.669
0.00293 2.672
0.00391 2.650
0.00586 2.638
0.00781 2.631
0.01172 2.632
0.01562 2.642
0.02344 15.224
0.03125 23.300
0.04688 28.577
0.06250 28.389
0.09375 35.162
0.12500 42.471
0.18750 49.413
0.25000 176.233
0.37500 196.322
0.50000 209.527
0.75000 219.459
1.00000 222.973
1.50000 224.033
2.00000 224.939
3.00000 223.484
4.00000 224.127
6.00000 224.397
8.00000 224.520
12.00000 224.430
16.00000 224.468
24.00000 224.508
32.00000 224.640
48.00000 224.280
64.00000 224.930
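
For anyone who wants to pull numbers out of a dump like the one above, here is a minimal Python sketch. It is not part of lmbench. The function name `parse_scalars` and the regular expression are my own assumptions, and it only handles the one-line "name: value unit" results (syscall, latency, arithmetic, STREAM); the multi-line tables are skipped.

```python
import re

# Matches one-line scalar results such as "Simple syscall: 0.2020 microseconds".
# Lines without a known unit (e.g. the parallelism figures) are ignored.
LINE_RE = re.compile(
    r'^(?P<name>[^:"\[]+):\s+(?P<value>\d+(?:\.\d+)?)\s+'
    r'(?P<unit>microseconds|nanoseconds|MB/sec|KB/sec)\s*$'
)

def parse_scalars(text):
    """Return {metric name: (value, unit)} for the scalar result lines."""
    results = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            name = match.group("name").strip()
            results[name] = (float(match.group("value")), match.group("unit"))
    return results

# A few lines copied from the GoFlex output above.
sample = """\
Simple syscall: 0.2020 microseconds
Pipe latency: 53.3706 microseconds
integer div: 149.21 nanoseconds
integer bit parallelism: 1.05
Pipe bandwidth: 232.80 MB/sec
0.000512 156.82
"""

parsed = parse_scalars(sample)
for name, (value, unit) in parsed.items():
    print(f"{name}: {value} {unit}")
```

The unit is kept next to each value because the dump mixes microseconds, nanoseconds and bandwidth figures, and they should not be compared with each other.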



Edited 1 time(s). Last edit at 07/19/2012 05:58AM by gnexus.
Re: Kernel benchmarks: 3.0.8+ vs. 3.0.36+ defconfig vs. 3.0.36+ a10linux_config
July 19, 2012 07:36AM
The next step is to copy the compiled lmbench archive over to the Mele and extract it. It would be better to use a version compiled on the Mele itself; that is the way lmbench is supposed to be run. But we can't do that yet. Maybe later, when an lmbench patch is made. For now I'll just have to try the GoFlex-compiled version and hope that it works. I have other things to do than worry about patching lmbench to make it Debian armhf compatible. Let's see if lmbench now runs on the Mele.

Copying the binaries is not a solution to the build problem. Lmbench wants to compile again to run on a different gcc. So it will likely be broken again.
../scripts/config-run aborted: Not enough memory, only 1MB available.
Yes. It is broken. Time to try changing some compiler flags. Yes. That does help! Using the following works:
export CFLAGS="-march=armv5te"
With that, it is also using the same flags as on the GoFlex. That makes for an even comparison, if one skewed a bit in favor of the GoFlex. It is running now! It reports "180MB OK", which means it is using 180MB of memory vs. 85MB on the GoFlex. The reason is that the GoFlex only has 128MB of DDR memory vs. 512MB on the Mele.

Much less than 3 days later (it was 0.0208 days) here are the results for the 3.0.36+ a10linux_config:
***************************************************************************************************************************************
[lmbench3.0 results for Linux T-01 3.0.36+ #28 PREEMPT Tue Jul 17 21:43:19 IST 2012 armv7l GNU/Linux]
[LMBENCH_VER: 3.0-a9]
[BENCHMARK_HARDWARE: YES]
[BENCHMARK_OS: YES]
[ALL: 512 1k 2k 4k 8k 16k 32k 64k 128k 256k 512k 1m 2m 4m 8m 16m 32m 64m 128m]
[DISKS: /dev/mmcblk0 ]
[DISK_DESC: [/dev/mmcblk0:Class 10 SD card] ]
Simple syscall: 0.3456 microseconds
Simple read: 0.7078 microseconds
Simple write: 0.8872 microseconds
Simple stat: 3.6667 microseconds
Simple fstat: 1.2315 microseconds
Simple open/close: 6.8858 microseconds
Select on 10 fd's: 1.4178 microseconds
Select on 100 fd's: 9.7291 microseconds
Select on 250 fd's: 22.8756 microseconds
Select on 500 fd's: 38.9378 microseconds
Select on 10 tcp fd's: 1.6138 microseconds
Select on 100 tcp fd's: 14.9423 microseconds
Select on 250 tcp fd's: 37.0928 microseconds
Select on 500 tcp fd's: 74.6780 microseconds
Signal handler installation: 0.8697 microseconds
Signal handler overhead: 4.1299 microseconds
Protection fault: 0.5323 microseconds
Pipe latency: 17.7527 microseconds
AF_UNIX sock stream latency: 33.4287 microseconds
Process fork+exit: 711.5065 microseconds
Process fork+execve: 2405.0000 microseconds
Process fork+/bin/sh -c: 5878.0000 microseconds
integer bit: 1.17 nanoseconds
integer add: 1.00 nanoseconds
integer mul: 5.96 nanoseconds
integer div: 74.49 nanoseconds
integer mod: 25.26 nanoseconds
int64 bit: 1.18 nanoseconds
uint64 add: 1.57 nanoseconds
int64 mul: 11.07 nanoseconds
int64 div: 281.13 nanoseconds
int64 mod: 189.04 nanoseconds
float add: 8.87 nanoseconds
float mul: 9.96 nanoseconds
float div: 32.86 nanoseconds
double add: 8.87 nanoseconds
double mul: 10.96 nanoseconds
double div: 56.75 nanoseconds
float bogomflops: 104.26 nanoseconds
double bogomflops: 175.48 nanoseconds
integer bit parallelism: 1.49
integer add parallelism: 1.71
integer mul parallelism: 2.90
integer div parallelism: 1.19
integer mod parallelism: 1.09
int64 bit parallelism: 1.00
int64 add parallelism: 1.00
int64 mul parallelism: 1.00
int64 div parallelism: 1.03
int64 mod parallelism: 1.00
float add parallelism: 1.00
float mul parallelism: 1.00
float div parallelism: 1.00
double add parallelism: 1.00
double mul parallelism: 1.00
double div parallelism: 1.00
File /var/tmp/XXX write bandwidth: 3101 KB/sec
Pagefaults on /var/tmp/XXX: 4.2495 microseconds

"mappings
0.524288 30
1.048576 45
2.097152 72
4.194304 140
8.388608 275
16.777216 544
33.554432 1009
67.108864 2033
134.217728 4390

"File system latency
0k 1907 17116 11183
1k 408 3723 5931
4k 880 8292 18935
10k 575 5212 2574

"Seek times for /dev/mmcblk0
2144.3 2.71
. . .
2.3 14.79

"Zone bandwidth for /dev/mmcblk0
1.1 5.67
. . .
2136.0 6.51

Cannot register service: RPC: Unable to receive; errno = Connection refused
unable to register (XACT_PROG, XACT_VERS, udp).
UDP latency using localhost: 984.8660 microseconds
TCP latency using localhost: 70.0010 microseconds
localhost: RPC: Port mapper failure - RPC: Unable to receive
localhost: RPC: Remote system error - Connection refused
(null): RPC: Port mapper failure - RPC: Unable to receive
TCP/IP connection cost to localhost: 225.4808 microseconds

Socket bandwidth using localhost
0.000001 0.24 MB/sec
0.000064 11.46 MB/sec
0.000128 13.29 MB/sec
0.000256 26.19 MB/sec
0.000512 44.90 MB/sec
0.001024 74.54 MB/sec
0.001437 72.29 MB/sec
10.000000 8.78 MB/sec

Avg xfer: 3.2KB, 41.8KB in 117 millisecs, 355.90 KB/sec
AF_UNIX sock stream bandwidth: 215.11 MB/sec
Pipe bandwidth: 338.18 MB/sec

"read bandwidth
0.000512 201.90
0.001024 345.61
0.002048 529.00
0.004096 729.85
0.008192 811.23
0.016384 810.64
0.032768 739.41
0.065536 711.84
0.131072 478.12
0.262144 336.98
0.524288 264.93
1.05 261.44
2.10 260.47
4.19 261.15
8.39 263.13
16.78 247.90
33.55 166.57
67.11 260.10
134.22 257.54

"read open2close bandwidth
0.000512 48.53
0.001024 89.00
0.002048 167.61
0.004096 307.10
0.008192 435.86
0.016384 546.34
0.032768 603.67
0.065536 625.48
0.131072 491.56
0.262144 333.51
0.524288 268.49
1.05 254.93
2.10 257.72
4.19 261.72
8.39 264.24
16.78 259.70
33.55 260.38
67.11 251.93
134.22 252.93


"Mmap read bandwidth
0.000512 1802.99
0.001024 1701.99
0.002048 1937.79
0.004096 1945.65
0.008192 1964.57
0.016384 1974.89
0.032768 1971.98
0.065536 1631.95
0.131072 1535.32
0.262144 928.45
0.524288 462.52
1.05 395.89
2.10 375.10
4.19 368.30
8.39 364.68
16.78 365.87
33.55 365.88
67.11 365.62
134.22 358.63

"Mmap read open2close bandwidth
0.000512 19.53
0.001024 38.51
0.002048 74.90
0.004096 141.94
0.008192 231.86
0.016384 346.96
0.032768 460.82
0.065536 551.00
0.131072 545.26
0.262144 391.83
0.524288 273.78
1.05 259.55
2.10 261.89
4.19 262.19
8.39 265.23
16.78 269.98
33.55 270.02
67.11 268.64
134.22 267.08


"libc bcopy unaligned
0.000512 1205.16
0.001024 1201.63
0.002048 1203.50
0.004096 1204.03
0.008192 1196.18
0.016384 1204.21
0.032768 1213.01
0.065536 1161.10
0.131072 1119.14
0.262144 641.79
0.524288 293.15
1.05 251.79
2.10 194.75
4.19 240.26
8.39 222.57
16.78 236.61
33.55 239.92
67.11 238.96

"libc bcopy aligned
0.000512 1192.24
0.001024 1199.69
0.002048 1202.54
0.004096 874.94
0.008192 1186.60
0.016384 1208.81
0.032768 1209.15
0.065536 1169.34
0.131072 1144.57
0.262144 663.10
0.524288 302.07
1.05 251.99
2.10 243.07
4.19 239.92
8.39 227.52
16.78 226.22
33.55 228.38
67.11 240.15

Memory bzero bandwidth
0.000512 1207.75
0.001024 1212.40
0.002048 1210.01
0.004096 1215.26
0.008192 1197.45
0.016384 1209.42
0.032768 1170.84
0.065536 1164.24
0.131072 1160.48
0.262144 1168.83
0.524288 1171.14
1.05 1154.79
2.10 1174.99
4.19 1185.86
8.39 1179.85
16.78 1178.29
33.55 1172.76
67.11 1172.96
134.22 1171.28

"unrolled bcopy unaligned
0.000512 1198.84
0.001024 1215.87
0.002048 1198.53
0.004096 1212.75
0.008192 1216.32
0.016384 1212.59
0.032768 1208.44
0.065536 1164.84
0.131072 1164.62
0.262144 707.72
0.524288 301.48
1.05 244.91
2.10 241.29
4.19 239.79
8.39 238.05
16.78 239.79
33.55 239.80
67.11 228.92

"unrolled partial bcopy unaligned
0.000512 1200.38
0.001024 1201.24
0.002048 1199.13
0.004096 1198.91
0.008192 1201.39
0.016384 1134.72
0.032768 1175.94
0.065536 1165.44
0.131072 1172.26
0.262144 724.60
0.524288 398.49
1.05 394.86
2.10 374.14
4.19 352.11
8.39 374.06
16.78 384.25
33.55 382.68
67.11 382.91

Memory read bandwidth
0.000512 1953.51
0.001024 1973.85
0.002048 1982.54
0.004096 1989.04
0.008192 1982.68
0.016384 1988.98
0.032768 1914.07
0.065536 1628.50
0.131072 1553.02
0.262144 1021.82
0.524288 458.14
1.05 397.13
2.10 377.87
4.19 369.47
8.39 366.10
16.78 366.84
33.55 360.41
67.11 365.82
134.22 358.98

Memory partial read bandwidth
0.000512 7242.95
0.001024 7511.36
0.002048 7643.66
0.004096 7644.15
0.008192 7652.52
0.016384 7696.98
0.032768 7559.97
0.065536 3991.40
0.131072 3498.25
0.262144 1301.20
0.524288 545.15
1.05 440.76
2.10 416.99
4.19 406.27
8.39 402.63
16.78 401.92
33.55 401.30
67.11 401.46
134.22 392.80

Memory write bandwidth
0.000512 1212.76
0.001024 1058.47
0.002048 1209.33
0.004096 1073.95
0.008192 1107.93
0.016384 1098.34
0.032768 1131.87
0.065536 1161.18
0.131072 1169.22
0.262144 1169.92
0.524288 1164.42
1.05 1171.78
2.10 1185.14
4.19 1182.85
8.39 1180.42
16.78 1178.32
33.55 1176.10
67.11 1173.46
134.22 1168.20

Memory partial write bandwidth
0.000512 1212.08
0.001024 1209.82
0.002048 1216.00
0.004096 1213.12
0.008192 1111.93
0.016384 1141.04
0.032768 1186.76
0.065536 1192.43
0.131072 1180.39
0.262144 1183.04
0.524288 1172.55
1.05 1165.10
2.10 1166.76
4.19 1182.78
8.39 1179.15
16.78 1182.18
33.55 1178.44
67.11 1172.49
134.22 1173.15

Memory partial read/write bandwidth
0.000512 4996.35
0.001024 5112.75
0.002048 5179.87
0.004096 5213.50
0.008192 5184.21
0.016384 5216.75
0.032768 5210.92
0.065536 3387.93
0.131072 2983.44
0.262144 1075.41
0.524288 379.81
1.05 318.10
2.10 313.61
4.19 313.39
8.39 312.89
16.78 311.31
33.55 311.80
67.11 313.78
134.22 308.18



"size=0k ovr=3.61
2 5.35
4 7.10
8 7.45
16 8.49
24 9.97
32 10.32
64 13.14
96 14.12

"size=4k ovr=5.80
2 5.77
4 7.75
8 9.50
16 12.06
24 15.58
32 17.04
64 23.64
96 25.89

"size=8k ovr=7.84
2 5.80
4 8.42
8 10.92
16 18.15
24 22.58
32 28.47
64 35.77
96 37.67

"size=16k ovr=12.15
2 7.30
4 10.38
8 14.88
16 34.44
24 47.29
32 52.66
64 57.74
96 58.27

"size=32k ovr=22.08
2 12.82
4 16.90
8 43.21
16 84.58
24 92.50
32 94.21
64 95.72
96 96.06

"size=64k ovr=47.29
2 11.64
4 60.16
8 141.74
16 161.50
24 163.79
32 165.72
64 165.03
96 164.81

tlb: 32 pages

Memory load parallelism
0.001024 3.33
0.002048 3.00
0.004096 3.01
0.008192 3.01
0.016384 3.00
0.032768 3.94
0.065536 1.05
0.131072 1.06
0.262144 1.00
0.524288 1.00
1.048576 1.02
2.097152 1.04
4.194304 1.00
8.388608 1.00
16.777216 1.00
33.554432 1.00
67.108864 1.00
134.217728 1.00

STREAM copy latency: 33.28 nanoseconds
STREAM copy bandwidth: 480.80 MB/sec
STREAM scale latency: 42.11 nanoseconds
STREAM scale bandwidth: 379.92 MB/sec
STREAM add latency: 48.82 nanoseconds
STREAM add bandwidth: 491.64 MB/sec
STREAM triad latency: 38.86 nanoseconds
STREAM triad bandwidth: 617.55 MB/sec
STREAM2 fill latency: 6.83 nanoseconds
STREAM2 fill bandwidth: 1170.85 MB/sec
STREAM2 copy latency: 33.23 nanoseconds
STREAM2 copy bandwidth: 481.43 MB/sec
STREAM2 daxpy latency: 66.59 nanoseconds
STREAM2 daxpy bandwidth: 360.41 MB/sec
STREAM2 sum latency: 16.97 nanoseconds
STREAM2 sum bandwidth: 471.32 MB/sec

Memory load latency
"stride=16
0.00049 3.024
0.00098 3.175
0.00195 3.004
0.00293 3.004
0.00391 2.997
0.00586 2.994
0.00781 3.924
0.01172 5.224
0.01562 2.999
0.02344 3.030
0.03125 4.543
0.04688 4.442
0.06250 4.724
0.09375 5.530
0.12500 7.665
0.18750 8.953
0.25000 15.208
0.37500 27.040
0.50000 31.524
0.75000 36.096
1.00000 37.709
1.50000 40.198
2.00000 40.269
3.00000 40.602
4.00000 41.285
6.00000 41.409
8.00000 41.627
12.00000 41.630
16.00000 41.745
24.00000 41.681
32.00000 41.684
48.00000 41.735
64.00000 41.655
96.00000 41.657
128.00000 41.712

"stride=32
0.00049 3.023
0.00098 3.023
0.00195 3.024
0.00293 3.023
0.00391 3.002
0.00586 3.002
0.00781 2.997
0.01172 2.994
0.01562 2.991
0.02344 2.990
0.03125 3.029
0.04688 7.883
0.06250 9.138
0.09375 10.267
0.12500 10.764
0.18750 13.063
0.25000 22.025
0.37500 48.095
0.50000 58.225
0.75000 67.949
1.00000 71.274
1.50000 74.310
2.00000 75.102
3.00000 76.477
4.00000 77.332
6.00000 77.958
8.00000 78.364
12.00000 78.853
16.00000 78.531
24.00000 78.544
32.00000 78.384
48.00000 78.506
64.00000 78.449
96.00000 78.473
128.00000 78.476

"stride=64
0.00049 3.022
0.00098 3.023
0.00195 3.022
0.00293 3.022
0.00391 3.023
0.00586 3.023
0.00781 3.002
0.01172 3.004
0.01562 2.998
0.02344 3.057
0.03125 3.873
0.04688 8.878
0.06250 9.792
0.09375 10.903
0.12500 11.658
0.18750 13.150
0.25000 36.591
0.37500 91.624
0.50000 112.584
0.75000 130.276
1.00000 138.831
1.50000 144.983
2.00000 146.843
3.00000 150.392
4.00000 151.529
6.00000 153.494
8.00000 154.039
12.00000 154.312
16.00000 154.411
24.00000 154.352
32.00000 154.330
48.00000 154.331
64.00000 154.130
96.00000 154.326
128.00000 154.377

"stride=128
0.00049 3.023
0.00098 3.022
0.00195 3.024
0.00293 3.022
0.00391 3.024
0.00586 3.024
0.00781 3.023
0.01172 3.024
0.01562 3.004
0.02344 3.142
0.03125 3.571
0.04688 8.800
0.06250 9.899
0.09375 10.946
0.12500 11.367
0.18750 13.869
0.25000 33.896
0.37500 89.858
0.50000 111.619
0.75000 133.526
1.00000 139.648
1.50000 145.139
2.00000 147.481
3.00000 150.220
4.00000 151.770
6.00000 153.036
8.00000 154.119
12.00000 154.487
16.00000 155.151
24.00000 155.080
32.00000 155.098
48.00000 155.009
64.00000 154.988
96.00000 155.069
128.00000 155.079

"stride=256
0.00049 3.023
0.00098 3.022
0.00195 3.024
0.00293 3.023
0.00391 3.023
0.00586 3.023
0.00781 3.024
0.01172 3.024
0.01562 3.024
0.02344 3.055
0.03125 3.712
0.04688 8.420
0.06250 9.823
0.09375 10.846
0.12500 11.476
0.18750 26.745
0.25000 47.533
0.37500 92.367
0.50000 114.121
0.75000 134.083
1.00000 140.943
1.50000 146.650
2.00000 149.459
3.00000 152.015
4.00000 153.404
6.00000 154.952
8.00000 155.910
12.00000 156.531
16.00000 156.848
24.00000 156.747
32.00000 157.038
48.00000 157.210
64.00000 156.988
96.00000 157.113
128.00000 157.464

"stride=512
0.00049 3.023
0.00098 3.024
0.00195 3.024
0.00293 3.023
0.00391 3.024
0.00586 3.024
0.00781 3.023
0.01172 3.023
0.01562 3.022
0.02344 3.099
0.03125 3.652
0.04688 8.591
0.06250 9.645
0.09375 10.796
0.12500 11.479
0.18750 16.760
0.25000 38.305
0.37500 90.680
0.50000 115.253
0.75000 138.896
1.00000 145.338
1.50000 150.930
2.00000 153.743
3.00000 156.514
4.00000 158.131
6.00000 159.337
8.00000 159.974
12.00000 160.781
16.00000 161.343
24.00000 161.262
32.00000 161.430
48.00000 161.696
64.00000 161.696
96.00000 161.746
128.00000 162.059

"stride=1024
0.00098 3.023
0.00195 3.022
0.00293 3.024
0.00391 3.023
0.00586 3.023
0.00781 3.023
0.01172 3.023
0.01562 3.023
0.02344 3.028
0.03125 3.298
0.04688 8.419
0.06250 9.400
0.09375 10.776
0.12500 11.431
0.18750 21.547
0.25000 50.601
0.37500 99.612
0.50000 119.300
0.75000 144.772
1.00000 152.790
1.50000 158.767
2.00000 161.637
3.00000 164.455
4.00000 165.760
6.00000 167.272
8.00000 168.207
12.00000 168.906
16.00000 169.029
24.00000 169.229
32.00000 169.369
48.00000 169.193
64.00000 169.297
96.00000 169.530
128.00000 170.194


Random load latency
"stride=16
0.00049 3.023
0.00098 3.023
0.00195 3.002
0.00293 3.003
0.00391 2.996
0.00586 2.994
0.00781 2.989
0.01172 2.989
0.01562 3.014
0.02344 12.944
0.03125 7.965
0.04688 9.698
0.06250 10.885
0.09375 12.725
0.12500 13.023
0.18750 69.545
0.25000 103.393
0.37500 156.970
0.50000 180.740
0.75000 200.702
1.00000 209.175
1.50000 220.140
2.00000 230.984
3.00000 246.941
4.00000 260.309
6.00000 279.835
8.00000 291.583
12.00000 299.840
16.00000 308.805
24.00000 318.752
32.00000 318.275
48.00000 316.165
64.00000 314.973
96.00000 310.587
128.00000 309.296
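
To put the two listings side by side, here is a small Python sketch that computes how many times faster the A10 (3.0.36+ a10linux_config) is than the GoFlex on a handful of latencies. The values are copied from the two dumps in this thread. The selection of metrics and the helper name `speedup` are my own choices for illustration. Keep in mind that both builds used -march=armv5te, so this is not the best the A10 can do.

```python
# (GoFlex value, A10 3.0.36+ value). All are latencies, so lower is better.
LATENCIES = {
    "Simple syscall (us)": (0.2020, 0.3456),
    "Pipe latency (us)": (53.3706, 17.7527),
    "Process fork+exit (us)": (898.7273, 711.5065),
    "integer mul (ns)": (2.61, 5.96),
    "float add (ns)": (36.52, 8.87),
    "double div (ns)": (541.32, 56.75),
}

def speedup(goflex, a10):
    """How many times faster the A10 is; below 1.0 means the GoFlex wins."""
    return goflex / a10

for name, (goflex, a10) in LATENCIES.items():
    print(f"{name:24s} {speedup(goflex, a10):6.2f}x")
```

Not every line goes the A10's way: the simple syscall and the integer multiply come out below 1.0, while the floating point results show the largest gap.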
Re: Kernel benchmarks: 3.0.8+ vs. 3.0.36+ defconfig vs. 3.0.36+ a10linux_config
July 19, 2012 08:51AM
Now that lmbench works on the Mele, the next step is to try another kernel. I don't have another 3.0.36+ kernel on the Mele, so let's try the 3.0.8+ kernel that is still on it. I'll now reboot with that kernel and see how it goes.

Yuck! I really don't want to use that kernel ever again. But hey, maybe it runs faster and we need to use it as a base instead. You never know about those things until you actually try them. There could be regressions in the later kernel code. I really hope that is not the case. If it is I may need to rebase our kernel on the 3.0.8+ code. We will find out soon enough.
Then we will know for certain.
export CFLAGS="-march=armv5te"
make rerun
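
Once both A10 runs exist, the regression question can be checked mechanically. Here is a rough Python sketch, assuming a 5% tolerance that I picked arbitrarily. The function name `find_regressions` is hypothetical, and the sample values are copied from the 3.0.8+ and 3.0.36+ listings in this thread.

```python
def find_regressions(old, new, tolerance=0.05):
    """Return {metric: fractional increase} where latency grew beyond tolerance."""
    regressions = {}
    for name, old_value in old.items():
        new_value = new.get(name)
        if new_value is None:
            continue  # metric missing from the newer run, nothing to compare
        change = (new_value - old_value) / old_value
        if change > tolerance:
            regressions[name] = change
    return regressions

# Latencies in microseconds, lower is better.
KERNEL_3_0_8 = {
    "Simple syscall": 0.7358,
    "Simple read": 0.6010,
    "Pipe latency": 22.1122,
    "Process fork+exit": 889.5439,
}
KERNEL_3_0_36 = {
    "Simple syscall": 0.3456,
    "Simple read": 0.7078,
    "Pipe latency": 17.7527,
    "Process fork+exit": 711.5065,
}

for name, change in find_regressions(KERNEL_3_0_8, KERNEL_3_0_36).items():
    print(f"{name}: {change:+.1%} slower on 3.0.36+")
```

With these four metrics only "Simple read" is flagged; the other three are faster on 3.0.36+. A single run per kernel is noisy, so treat a flag as a hint to rerun rather than as proof of a regression.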
A shorter time later (but the tests still took 0.0208 days), since it did not need to compile again, here are the results:
*****************************************************************************************************************************************
[lmbench3.0 results for Linux T-01 3.0.8+ #2 PREEMPT Fri Mar 2 14:28:08 CST 2012 armv7l GNU/Linux]
[LMBENCH_VER: 3.0-a9]
[BENCHMARK_HARDWARE: YES]
[BENCHMARK_OS: YES]
[ALL: 512 1k 2k 4k 8k 16k 32k 64k 128k 256k 512k 1m 2m 4m 8m 16m 32m 64m 128m]
[DISKS: /dev/mmcblk0 ]
[DISK_DESC: [/dev/mmcblk0:Class 10 SD card] ]
Simple syscall: 0.7358 microseconds
Simple read: 0.6010 microseconds
Simple write: 1.5025 microseconds
Simple stat: 4.5320 microseconds
Simple fstat: 1.6553 microseconds
Simple open/close: 11.8318 microseconds
Select on 10 fd's: 2.2144 microseconds
Select on 100 fd's: 9.5779 microseconds
Select on 250 fd's: 21.4944 microseconds
Select on 500 fd's: 41.5501 microseconds
Select on 10 tcp fd's: 2.3739 microseconds
Select on 100 tcp fd's: 16.3408 microseconds
Select on 250 tcp fd's: 39.3709 microseconds
Select on 500 tcp fd's: 78.2454 microseconds
Signal handler installation: 1.2251 microseconds
Signal handler overhead: 5.7499 microseconds
Protection fault: 1.6385 microseconds
Pipe latency: 22.1122 microseconds
AF_UNIX sock stream latency: 37.5378 microseconds
Process fork+exit: 889.5439 microseconds
Process fork+execve: 2806.2308 microseconds
Process fork+/bin/sh -c: 6631.6250 microseconds
integer bit: 1.00 nanoseconds
integer add: 1.00 nanoseconds
integer mul: 5.97 nanoseconds
integer div: 72.22 nanoseconds
integer mod: 22.90 nanoseconds
int64 bit: 1.01 nanoseconds
uint64 add: 1.34 nanoseconds
int64 mul: 11.00 nanoseconds
int64 div: 276.46 nanoseconds
int64 mod: 189.03 nanoseconds
float add: 8.87 nanoseconds
float mul: 9.96 nanoseconds
float div: 32.83 nanoseconds
double add: 8.87 nanoseconds
double mul: 10.95 nanoseconds
double div: 56.72 nanoseconds
float bogomflops: 104.18 nanoseconds
double bogomflops: 133.75 nanoseconds
integer bit parallelism: 1.49
integer add parallelism: 1.71
integer mul parallelism: 2.90
integer div parallelism: 1.19
integer mod parallelism: 1.09
int64 bit parallelism: 1.00
int64 add parallelism: 1.00
int64 mul parallelism: 1.00
int64 div parallelism: 1.03
int64 mod parallelism: 1.00
float add parallelism: 1.00
float mul parallelism: 1.00
float div parallelism: 1.00
double add parallelism: 1.00
double mul parallelism: 1.00
double div parallelism: 1.00
File /var/tmp/XXX write bandwidth: 4261 KB/sec
Pagefaults on /var/tmp/XXX: 6.0592 microseconds

"mappings
0.524288 49
1.048576 75
2.097152 127
4.194304 238
8.388608 460
16.777216 910
33.554432 1802
67.108864 3685
134.217728 7753

"File system latency
0k 1265 11475 14030
1k 800 4710 11240
4k 848 7725 11940
10k 585 5326 9859

"Seek times for /dev/mmcblk0
2144.3 1.22
. . .
2.3 2.41

"Zone bandwidth for /dev/mmcblk0
1.1 15.09
. . .
2136.0 13.97

Cannot register service: RPC: Unable to receive; errno = Connection refused
unable to register (XACT_PROG, XACT_VERS, udp).
UDP latency using localhost: 50.2277 microseconds
TCP latency using localhost: 76.0248 microseconds
localhost: RPC: Port mapper failure - RPC: Unable to receive
localhost: RPC: Remote system error - Connection refused
(null): RPC: Port mapper failure - RPC: Unable to receive
TCP/IP connection cost to localhost: 245.8371 microseconds

Socket bandwidth using localhost
0.000001 0.16 MB/sec
0.000064 5.49 MB/sec
0.000128 9.78 MB/sec
0.000256 19.30 MB/sec
0.000512 31.80 MB/sec
0.001024 67.65 MB/sec
0.001437 74.35 MB/sec
10.000000 77.81 MB/sec

Avg xfer: 3.2KB, 41.8KB in 9.8100 millisecs, 4.26 MB/sec
AF_UNIX sock stream bandwidth: 297.70 MB/sec
Pipe bandwidth: 328.46 MB/sec

"read bandwidth
0.000512 133.05
0.001024 230.23
0.002048 242.94
0.004096 515.17
0.008192 563.21
0.016384 576.32
0.032768 528.67
0.065536 494.98
0.131072 423.39
0.262144 306.77
0.524288 271.02
1.05 258.20
2.10 255.76
4.19 250.65
8.39 251.49
16.78 250.92
33.55 252.63
67.11 242.53
134.22 247.97

"read open2close bandwidth
0.000512 31.55
0.001024 60.83
0.002048 114.42
0.004096 200.96
0.008192 251.35
0.016384 368.43
0.032768 425.19
0.065536 459.33
0.131072 360.59
0.262144 274.17
0.524288 253.58
1.05 250.64
2.10 249.91
4.19 248.92
8.39 256.66
16.78 254.96
33.55 256.57
67.11 247.60
134.22 250.01


"Mmap read bandwidth
0.000512 1804.30
0.001024 1891.07
0.002048 1937.14
0.004096 1946.87
0.008192 1964.93
0.016384 1974.39
0.032768 1971.89
0.065536 1630.37
0.131072 1541.70
0.262144 832.69
0.524288 482.64
1.05 406.62
2.10 386.54
4.19 380.90
8.39 380.57
16.78 380.89
33.55 380.76
67.11 380.90
134.22 372.92

"Mmap read open2close bandwidth
0.000512 14.43
0.001024 28.88
0.002048 56.21
0.004096 108.08
0.008192 176.77
0.016384 263.13
0.032768 350.30
0.065536 415.54
0.131072 434.72
0.262144 322.59
0.524288 257.19
1.05 243.19
2.10 243.66
4.19 245.01
8.39 247.11
16.78 249.34
33.55 250.01
67.11 249.22
134.22 243.12


"libc bcopy unaligned
0.000512 1217.50
0.001024 1078.27
0.002048 1218.45
0.004096 1217.08
0.008192 1197.80
0.016384 1140.63
0.032768 1183.23
0.065536 1170.13
0.131072 1152.10
0.262144 624.66
0.524288 290.07
1.05 253.32
2.10 249.59
4.19 246.30
8.39 246.19
16.78 244.95
33.55 247.95
67.11 242.66

"libc bcopy aligned
0.000512 1217.66
0.001024 1217.06
0.002048 1214.14
0.004096 1105.70
0.008192 1145.08
0.016384 1118.91
0.032768 1175.38
0.065536 1170.92
0.131072 1180.96
0.262144 895.44
0.524288 287.74
1.05 254.42
2.10 245.17
4.19 248.18
8.39 246.58
16.78 246.26
33.55 248.82
67.11 239.90

Memory bzero bandwidth
0.000512 1203.99
0.001024 1202.38
0.002048 1215.42
0.004096 1201.72
0.008192 1115.84
0.016384 1145.43
0.032768 1179.47
0.065536 1181.17
0.131072 1175.49
0.262144 1175.71
0.524288 1176.93
1.05 1178.35
2.10 1167.92
4.19 1172.20
8.39 1175.73
16.78 1177.22
33.55 1175.14
67.11 1173.67
134.22 1170.82

"unrolled bcopy unaligned
0.000512 1080.11
0.001024 1211.74
0.002048 1217.45
0.004096 1215.90
0.008192 1215.66
0.016384 1215.96
0.032768 1187.33
0.065536 1186.33
0.131072 1168.98
0.262144 745.18
0.524288 304.60
1.05 258.04
2.10 250.74
4.19 249.65
8.39 248.87
16.78 249.56
33.55 248.95
67.11 246.40

"unrolled partial bcopy unaligned
0.000512 1218.28
0.001024 1217.94
0.002048 1218.08
0.004096 1219.28
0.008192 1217.04
0.016384 1217.22
0.032768 1213.35
0.065536 1192.39
0.131072 1169.77
0.262144 824.02
0.524288 308.54
1.05 261.92
2.10 267.77
4.19 263.24
8.39 262.89
16.78 257.93
33.55 260.07
67.11 252.26

Memory read bandwidth
0.000512 1936.98
0.001024 1958.73
0.002048 1982.50
0.004096 1988.19
0.008192 1979.24
0.016384 1986.95
0.032768 1937.30
0.065536 1635.02
0.131072 1543.45
0.262144 1041.17
0.524288 482.01
1.05 412.06
2.10 390.59
4.19 382.31
8.39 381.04
16.78 381.16
33.55 381.66
67.11 381.67
134.22 372.66

Memory partial read bandwidth
0.000512 7237.15
0.001024 7508.74
0.002048 7647.98
0.004096 7717.49
0.008192 7661.80
0.016384 7724.20
0.032768 7596.43
0.065536 3995.01
0.131072 3500.37
0.262144 1541.65
0.524288 556.93
1.05 457.73
2.10 431.96
4.19 420.72
8.39 419.43
16.78 420.86
33.55 420.31
67.11 420.10
134.22 411.39

Memory write bandwidth
0.000512 1212.98
0.001024 1210.56
0.002048 1080.32
0.004096 1218.27
0.008192 1213.16
0.016384 1213.78
0.032768 1214.07
0.065536 1176.63
0.131072 1188.02
0.262144 1182.72
0.524288 1186.44
1.05 1179.40
2.10 1175.52
4.19 1169.57
8.39 1175.51
16.78 1176.82
33.55 1175.50
67.11 1173.55
134.22 1171.38

Memory partial write bandwidth
0.000512 1079.89
0.001024 1216.89
0.002048 1210.95
0.004096 1217.10
0.008192 1209.28
0.016384 1215.03
0.032768 1190.38
0.065536 1175.18
0.131072 1177.92
0.262144 1178.49
0.524288 1189.85
1.05 1186.21
2.10 1178.45
4.19 1171.41
8.39 1176.46
16.78 1176.73
33.55 1175.80
67.11 1172.43
134.22 1171.88

Memory partial read/write bandwidth
0.000512 4994.88
0.001024 5117.05
0.002048 5180.67
0.004096 5217.00
0.008192 5182.35
0.016384 5212.08
0.032768 5210.92
0.065536 3318.72
0.131072 2939.17
0.262144 1329.59
0.524288 392.24
1.05 334.04
2.10 327.73
4.19 327.32
8.39 327.06
16.78 327.37
33.55 328.31
67.11 329.61
134.22 323.59



"size=0k ovr=5.86
2 4.65
4 6.31
8 6.86
16 7.76
24 9.03
32 10.21
64 12.62
96 13.94

"size=4k ovr=8.03
2 4.64
4 6.91
8 8.35
16 10.28
24 13.09
32 16.32
64 23.13
96 25.93

"size=8k ovr=10.17
2 5.22
4 7.42
8 9.27
16 14.67
24 21.14
32 27.09
64 35.00
96 36.71

"size=16k ovr=14.46
2 6.44
4 9.68
8 13.76
16 32.62
24 44.68
32 50.64
64 55.41
96 56.15

"size=32k ovr=24.17
2 10.79
4 16.11
8 44.74
16 79.24
24 87.26
32 89.19
64 91.45
96 92.15

"size=64k ovr=49.96
2 11.89
4 60.54
8 130.84
16 150.94
24 153.05
32 154.63
64 155.76
96 155.71

tlb: 32 pages

Memory load parallelism
0.001024 4.18
0.002048 3.00
0.004096 3.00
0.008192 3.01
0.016384 3.01
0.032768 3.56
0.065536 1.07
0.131072 1.07
0.262144 1.00
0.524288 1.00
1.048576 1.00
2.097152 1.00
4.194304 1.00
8.388608 1.00
16.777216 1.02
33.554432 1.00
67.108864 1.23
134.217728 1.00

STREAM copy latency: 31.81 nanoseconds
STREAM copy bandwidth: 503.05 MB/sec
STREAM scale latency: 34.69 nanoseconds
STREAM scale bandwidth: 461.18 MB/sec
STREAM add latency: 37.77 nanoseconds
STREAM add bandwidth: 635.51 MB/sec
STREAM triad latency: 36.84 nanoseconds
STREAM triad bandwidth: 651.50 MB/sec
STREAM2 fill latency: 6.81 nanoseconds
STREAM2 fill bandwidth: 1175.45 MB/sec
STREAM2 copy latency: 30.72 nanoseconds
STREAM2 copy bandwidth: 520.78 MB/sec
STREAM2 daxpy latency: 65.29 nanoseconds
STREAM2 daxpy bandwidth: 367.61 MB/sec
STREAM2 sum latency: 16.96 nanoseconds
STREAM2 sum bandwidth: 471.69 MB/sec

Memory load latency
"stride=16
0.00049 3.024
0.00098 3.023
0.00195 3.003
0.00293 3.003
0.00391 2.998
0.00586 2.994
0.00781 2.990
0.01172 2.990
0.01562 2.996
0.02344 3.015
0.03125 3.199
0.04688 4.409
0.06250 4.714
0.09375 4.956
0.12500 5.084
0.18750 5.587
0.25000 11.334
0.37500 24.036
0.50000 29.779
0.75000 34.485
1.00000 36.259
1.50000 37.496
2.00000 38.381
3.00000 38.630
4.00000 39.409
6.00000 39.481
8.00000 39.482
12.00000 39.536
16.00000 39.587
24.00000 39.458
32.00000 39.499
48.00000 39.407
64.00000 39.465
96.00000 39.492
128.00000 39.544

"stride=32
0.00049 3.022
0.00098 3.023
0.00195 3.022
0.00293 3.022
0.00391 3.003
0.00586 3.002
0.00781 2.997
0.01172 2.992
0.01562 2.989
0.02344 2.989
0.03125 3.021
0.04688 7.939
0.06250 9.104
0.09375 10.208
0.12500 11.700
0.18750 22.784
0.25000 25.551
0.37500 45.903
0.50000 56.630
0.75000 64.598
1.00000 67.822
1.50000 70.982
2.00000 72.219
3.00000 73.070
4.00000 74.155
6.00000 74.562
8.00000 74.623
12.00000 74.650
16.00000 74.606
24.00000 74.661
32.00000 74.674
48.00000 74.542
64.00000 74.610
96.00000 74.655
128.00000 74.616

"stride=64
0.00049 3.023
0.00098 3.021
0.00195 3.022
0.00293 3.024
0.00391 3.021
0.00586 3.023
0.00781 3.004
0.01172 3.002
0.01562 2.997
0.02344 3.063
0.03125 3.543
0.04688 8.567
0.06250 9.917
0.09375 10.823
0.12500 11.304
0.18750 13.218
0.25000 31.775
0.37500 85.209
0.50000 107.863
0.75000 126.181
1.00000 132.201
1.50000 138.427
2.00000 140.360
3.00000 142.804
4.00000 144.149
6.00000 146.061
8.00000 146.430
12.00000 146.492
16.00000 146.486
24.00000 146.315
32.00000 146.452
48.00000 146.259
64.00000 146.323
96.00000 146.370
128.00000 146.338

"stride=128
0.00049 3.021
0.00098 3.022
0.00195 3.022
0.00293 3.022
0.00391 3.022
0.00586 3.023
0.00781 3.023
0.01172 3.022
0.01562 3.002
0.02344 3.044
0.03125 3.427
0.04688 8.346
0.06250 9.745
0.09375 10.704
0.12500 11.350
0.18750 13.379
0.25000 36.719
0.37500 93.305
0.50000 113.724
0.75000 126.948
1.00000 132.802
1.50000 137.860
2.00000 140.154
3.00000 143.475
4.00000 143.988
6.00000 145.327
8.00000 145.993
12.00000 147.171
16.00000 147.073
24.00000 147.194
32.00000 147.138
48.00000 146.953
64.00000 147.150
96.00000 147.184
128.00000 147.146

"stride=256
0.00049 3.022
0.00098 3.022
0.00195 3.022
0.00293 3.022
0.00391 3.022
0.00586 3.022
0.00781 3.022
0.01172 3.023
0.01562 3.022
0.02344 3.086
0.03125 3.438
0.04688 8.373
0.06250 9.758
0.09375 10.797
0.12500 11.447
0.18750 31.554
0.25000 43.882
0.37500 90.125
0.50000 112.855
0.75000 128.204
1.00000 133.449
1.50000 139.338
2.00000 141.934
3.00000 144.329
4.00000 145.650
6.00000 146.825
8.00000 147.374
12.00000 147.935
16.00000 148.309
24.00000 148.563
32.00000 148.697
48.00000 148.564
64.00000 148.777
96.00000 148.813
128.00000 148.933

"stride=512
0.00049 3.022
0.00098 3.022
0.00195 3.022
0.00293 3.023
0.00391 3.022
0.00586 3.023
0.00781 3.022
0.01172 3.022
0.01562 3.022
0.02344 3.022
0.03125 3.483
0.04688 8.261
0.06250 9.437
0.09375 10.689
0.12500 11.228
0.18750 19.380
0.25000 39.420
0.37500 96.000
0.50000 114.337
0.75000 132.124
1.00000 137.861
1.50000 143.709
2.00000 146.221
3.00000 148.830
4.00000 150.007
6.00000 151.080
8.00000 151.715
12.00000 152.225
16.00000 152.341
24.00000 152.753
32.00000 152.837
48.00000 152.914
64.00000 153.068
96.00000 153.219
128.00000 153.277

"stride=1024
0.00098 3.021
0.00195 3.022
0.00293 3.021
0.00391 3.021
0.00586 3.022
0.00781 3.021
0.01172 3.021
0.01562 3.023
0.02344 3.027
0.03125 3.701
0.04688 8.100
0.06250 9.485
0.09375 10.669
0.12500 11.113
0.18750 21.339
0.25000 53.619
0.37500 102.698
0.50000 122.231
0.75000 139.555
1.00000 146.308
1.50000 151.792
2.00000 154.644
3.00000 157.573
4.00000 158.892
6.00000 160.041
8.00000 160.688
12.00000 161.229
16.00000 161.252
24.00000 161.607
32.00000 161.530
48.00000 161.654
64.00000 161.823
96.00000 162.231
128.00000 162.548


Random load latency
"stride=16
0.00049 3.022
0.00098 3.022
0.00195 3.006
0.00293 3.004
0.00391 2.995
0.00586 2.992
0.00781 2.990
0.01172 2.987
0.01562 3.022
0.02344 2.995
0.03125 6.745
0.04688 9.146
0.06250 11.152
0.09375 12.733
0.12500 13.007
0.18750 61.038
0.25000 87.262
0.37500 144.961
0.50000 173.121
0.75000 194.267
1.00000 204.653
1.50000 214.100
2.00000 229.908
3.00000 246.984
4.00000 257.683
6.00000 273.018
8.00000 283.399
12.00000 290.301
16.00000 296.817
24.00000 299.538
32.00000 298.201
48.00000 301.354
64.00000 300.946
96.00000 297.991
128.00000 297.065
There's an interesting benchmarking post on the new Wheezy armhf for the Raspi vs. the old armel:

http://www.memetic.org/raspbian-benchmarking-armel-vs-armhf/
Re: Kernel benchmarks: 3.0.8+ vs. 3.0.36+ defconfig vs. 3.0.36+ a10linux_config
July 19, 2012 09:50AM
You can see from above that there are no regressions from 3.0.8+ when using the 3.0.36+ kernel with our config.
In fact it is noticeably faster in basically every benchmark. That is great. It makes my life a bit easier!
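
For anyone who wants to check a claim like this against the raw numbers rather than by eye, here is a minimal Python sketch. It is not part of lmbench; the function names `percent_change` and `compare_tables` are made up for illustration. The sample rows are copied from the two "Random load latency" stride=16 tables above, assuming the table posted just before the 3.0.8+ run comes from the 3.0.36+ run with our config.

```python
def percent_change(baseline, candidate):
    """Relative change of candidate versus baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


def compare_tables(baseline, candidate):
    """Return {size: percent change} for every size present in both tables."""
    return {
        size: percent_change(baseline[size], candidate[size])
        for size in baseline
        if size in candidate
    }


# Rows copied from the "Random load latency" stride=16 tables above:
# array size in MB -> load latency in nanoseconds.
latency_3_0_8 = {0.01562: 3.022, 0.12500: 13.007, 1.0: 204.653, 128.0: 297.065}
latency_3_0_36 = {0.01562: 3.014, 0.12500: 13.023, 1.0: 209.175, 128.0: 309.296}

# 3.0.8+ is the baseline, 3.0.36+ is the candidate.
changes = compare_tables(latency_3_0_8, latency_3_0_36)
for size, pct in sorted(changes.items()):
    print(f"{size:>10.5f} MB  {pct:+6.2f} %")
```

These are latencies, so lower is better: a positive percentage means the 3.0.36+ run was slower on that row, a negative one means it was faster. The same helper works for the bandwidth tables, where the sign reads the other way around.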

But what if it is just that the 3.0.36+ code is better? What if my new config actually sucks? Does the defconfig run much better, or is it the same or worse? Keep in mind that the 3.0.36+ config tested above has all drivers and networking features configured. That could add a bit of overhead. If it only adds a small bit of overhead, that is fine considering all the benefits. But if it adds a lot of slower, bloated code I may need to go back and reconfigure it again. There is only one way to find out: delete the 3.0.36+ module directory and kernel, and then copy the 3.0.36+ defconfig kernel over to the Mele.

Here it goes. After all the kernel builds I have been doing, there are 14 different kernel builds that successfully built and booted residing in my kernel testing directory. The oldest one used an older defconfig which had no HDMI output. It would be useless and unfair to use that one. The next oldest is Turl's zatab_defconfig that got my HDMI working. Using that one would also be somewhat unfair as it is not the defconfig. Unfortunately I don't have an actual defconfig kernel built. Both of these even have netfilter enabled, which is not part of the defconfig. So in order to test the defconfig I will actually need to build a new kernel. That will take some time and is not something I am really looking forward to doing. In the meantime, while that kernel is compiling, let's look at a kernel with the closest diff match to the current defconfig. That one is (drumroll): none of them!

We're just going to have to wait until a defconfig kernel has been compiled. I have to compile yet another kernel for the zillionth time, before we can even begin to complete our testing . . . Figures as much . . . but at least I'll be able to see if the defconfig is getting any better. I am getting very sick of compiling kernels though. . . Forgive me if I take a break and save this post while the defconfig kernel compiles.
Testing Methodologies

While we are waiting for the kernel to finish compiling, let's talk a bit about the details of the platforms, etc. used for these tests.

All testing platforms are using the Debian Sid Linux distribution. The GoFlex is running standard Sid. The A10-based Mele A2000 is running the armhf (hard-float) version of Sid. Lmbench is being compiled natively on the GoFlex by the following GCC version:
gcc (Debian 4.6.3-7) 4.6.3
It was compiled on the Mele by this GCC version (sorry, I forgot to sync the GCC versions beforehand):
gcc (Debian 4.6.3-8) 4.6.3
There were no additional compiler flags set on the GoFlex. But on the Mele armhf the following flags had to be set for lmbench to successfully compile:
export CFLAGS="-march=armv5te"
Probably a later ARM platform version could have been specified. But that is what worked on the GoFlex and was easiest to use for an initial comparison. I will try to get some benchmarks later using a current ARM platform version. If somebody can point me to a patch to allow lmbench to successfully compile on armhf that would be greatly appreciated.

Also, unfortunately I left out a few of the platform details in the lmbench output. That was not intentional. But I don't want to be sharing my network details with everybody, and the platform details were interspersed with that. So here are the additional platform details minus the networking.

GoFlex:
[ENOUGH: 10000]
[FAST: ]
[FASTMEM: NO]
[FILE: /var/tmp/XXX]
[FSDIR: /var/tmp]
[HALF: 512 1k 2k 4k 8k 16k 32k 64k 128k 256k 512k 1m 2m 4m 8m 16m 32m]
[INFO: INFO.DS02]
[LINE_SIZE: 32]
[LOOP_O: 0.00000813]
[MB: 85]
[MHZ: 1200]
[MOTHERBOARD: ]
[NETWORKS: ]
[PROCESSORS: 0]
[REMOTE: ]
[SLOWFS: NO]
[OS: armv5tel-linux-gnu]
[SYNC_MAX: 1]
[LMBENCH_SCHED: DEFAULT]
[TIMING_O: 0]
[LMBENCH VERSION: 3.0-a9]
[SYSNAME: Linux]
[PROCESSOR: unknown]
[MACHINE: armv5tel]
[RELEASE: 3.2.0-2-kirkwood]
[VERSION: #1 Sat Jun 2 13:45:52 UTC 2012]

A10 3.0.36+
[ENOUGH: 100000]
[FAST: ]
[FASTMEM: NO]
[FILE: /var/tmp/XXX]
[FSDIR: /var/tmp]
[HALF: 512 1k 2k 4k 8k 16k 32k 64k 128k 256k 512k 1m 2m 4m 8m 16m 32m 64m]
[INFO: INFO.T-01]
[LINE_SIZE: 64]
[LOOP_O: 0.00000000]
[MB: 180]
[MHZ: 1006 MHz, 0.9940 nanosec clock]
[MOTHERBOARD: ]
[NETWORKS: ]
[PROCESSORS: 0]
[REMOTE: ]
[SLOWFS: NO]
[OS: armv7l-linux-gnu]
[SYNC_MAX: 1]
[LMBENCH_SCHED: DEFAULT]
[TIMING_O: 0]
[LMBENCH VERSION: 3.0-a9]
[SYSNAME: Linux]
[PROCESSOR: unknown]
[MACHINE: armv7l]
[RELEASE: 3.0.36+]
[VERSION: #28 PREEMPT Tue Jul 17 21:43:19 IST 2012]

A10 3.0.8+
[ENOUGH: 100000]
[FAST: ]
[FASTMEM: NO]
[FILE: /var/tmp/XXX]
[FSDIR: /var/tmp]
[HALF: 512 1k 2k 4k 8k 16k 32k 64k 128k 256k 512k 1m 2m 4m 8m 16m 32m 64m]
[INFO: INFO.T-01]
[LINE_SIZE: 64]
[LOOP_O: 0.00000000]
[MB: 180]
[MHZ: 1006 MHz, 0.9940 nanosec clock]
[MOTHERBOARD: ]
[NETWORKS: ]
[PROCESSORS: 0]
[REMOTE: ]
[SLOWFS: NO]
[OS: armv7l-linux-gnu]
[SYNC_MAX: 1]
[LMBENCH_SCHED: DEFAULT]
[TIMING_O: 0]
[LMBENCH VERSION: 3.0-a9]
[SYSNAME: Linux]
[PROCESSOR: unknown]
[MACHINE: armv7l]
[RELEASE: 3.0.8+]
[VERSION: #2 PREEMPT Fri Mar 2 14:28:08 CST 2012]

The above header information will already be in the output for the last test. I will also post the benchmark output as files you can download, minus the network information I do not wish to share. I will also post the lmbench utility on our site. But you can also get the identical version at the "official" lmbench Sourceforge site.
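
Since the header blocks are just `[KEY: value]` lines, comparing platforms by eye is tedious. Below is a small Python sketch, not part of lmbench, that parses such blocks and lists the keys whose values differ. `parse_header` and `diff_headers` are hypothetical helper names, and only a few of the lines above are pasted in to keep it short.

```python
import re

# Matches "[KEY: value]" where KEY is upper-case letters, underscores and spaces.
HEADER_LINE = re.compile(r"^\[([A-Z][A-Z_ ]*):\s*(.*)\]$")


def parse_header(text):
    """Turn "[KEY: value]" lines into a dict; lines of any other shape are skipped."""
    fields = {}
    for line in text.splitlines():
        match = HEADER_LINE.match(line.strip())
        if match:
            fields[match.group(1).strip()] = match.group(2).strip()
    return fields


def diff_headers(left, right):
    """Return {key: (left value, right value)} for every key whose values differ."""
    keys = sorted(set(left) | set(right))
    return {k: (left.get(k), right.get(k)) for k in keys if left.get(k) != right.get(k)}


# A few header lines copied from the GoFlex and A10 3.0.36+ blocks above.
goflex = parse_header("""
[LINE_SIZE: 32]
[MB: 85]
[MHZ: 1200]
[SLOWFS: NO]
[MACHINE: armv5tel]
""")
a10 = parse_header("""
[LINE_SIZE: 64]
[MB: 180]
[MHZ: 1006 MHz, 0.9940 nanosec clock]
[SLOWFS: NO]
[MACHINE: armv7l]
""")

header_diff = diff_headers(goflex, a10)
for key, (goflex_value, a10_value) in header_diff.items():
    print(f"{key}: GoFlex={goflex_value!r}  A10={a10_value!r}")
```

Pasting in the two full A10 blocks instead should leave only RELEASE and VERSION in the diff, since every other line in those two blocks is the same.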

Please keep in mind that most benchmarks are an artificial guide to system performance. Although they are relevant and offer a good base for comparisons, some benchmarks, especially the disk and network ones, may vary greatly depending on the devices. Now that the defconfig kernel has finished compiling, let us get on with the last benchmark.
Quote

There's an interesting benchmarking post on the new Wheezy armhf for the Raspi vs. the old armel:
hey. you're interrupting my thread ;) j/k

Good post! Thanks for that! It will be really fun to throw in some Raspi benchmarks for comparison! This would actually make for a good html page once the benchmarks are complete and the data summarized. I think it would be best, however, to wait to get an actual armv7 hardfloat benchmark done before we start that comparison. We want to be fair to the A10, right?
We don't want to embarrass the A10 platform by it only being 10 or 20 times faster in long divides. . .

The Raspi people had to optimize and recompile an entire Debian distro to get those benchmarks.
The least we can do is fix lmbench so it works for ours!


But first I need to finish these benchmarks and summarize them for people who don't want to look at a long thread of data.
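
As a starting point for that kind of summary, here is a rough Python sketch (not something shipped with lmbench) that pulls the single-value lines such as "Simple syscall: 0.7358 microseconds" out of a raw dump and ignores the tables and error messages. The name `parse_results` and the regular expression are assumptions for illustration; the sample text is a handful of lines from the 3.0.8+ run above.

```python
import re

# "name: number [unit]" with nothing after the optional unit.
# Bracketed header lines are excluded because the name may not contain "[" or "]".
RESULT_LINE = re.compile(
    r"^(?P<name>[^:\[\]]+):\s+(?P<value>\d+(?:\.\d+)?)(?:\s+(?P<unit>\S+))?$"
)


def parse_results(text):
    """Return {name: (value, unit)} for every single-value result line in text."""
    results = {}
    for line in text.splitlines():
        match = RESULT_LINE.match(line.strip())
        if match:
            results[match.group("name")] = (float(match.group("value")), match.group("unit"))
    return results


# Lines copied from the 3.0.8+ output above, including some that should be skipped.
sample = """
Simple syscall: 0.7358 microseconds
Process fork+exit: 889.5439 microseconds
integer div: 72.22 nanoseconds
integer bit parallelism: 1.49
Avg xfer: 3.2KB, 41.8KB in 9.8100 millisecs, 4.26 MB/sec
STREAM copy bandwidth: 503.05 MB/sec
0.000512 133.05
[MB: 180]
"""

summary = parse_results(sample)
for name, (value, unit) in summary.items():
    print(f"{name:<28} {value:>10.4f} {unit or ''}")
```

The two-column tables (bandwidth and latency by size) are deliberately left out here; they need their own handling because each row only makes sense together with the quoted title line above it.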
Quote

http://www.memetic.org/raspbian-benchmarking-armel-vs-armhf/

I just visited that page. Unfortunately it is useless crap. There is no detail to the benchmark methodology. The benchmarks are not reproducible in their current form. So they have no credibility whatsoever. The only reproducible benchmark the guy has on his page is a regression.

I'm sure that the Raspi armhf Debian distro runs slightly faster. But why even waste time running those unverifiable benchmarks? If you do something at least do it right, or don't do it at all. That is my motto.

Hopefully somebody else will use lmbench to do some real benchmarks on the Raspi. Those other benchmarks would also be nice for additional data. But only if we know the methods behind their use so we can compare them with other platforms.
. . . and now (drumroll) what you have all been waiting for. (All? There's nobody even here yet except hyena. But hey, I'm very interested and have been waiting patiently now for over an hour for this. So I'll quit typing and get on with the final results;)

Here are the results of the A10 defconfig (But first does it even boot? Yes! But the HDMI still does not work. Hey, if the HDMI isn't working then that means X and LXDE are not running and using resources. That is cheating on the benchmark! Oh well. It is not worth it to build yet another kernel with HDMI enabled. So we will let the defconfig cheat a bit. Give it a slight break;)

So here we go:
export CFLAGS="-march=armv5te"
make rerun
Another 0.0208 days later (where are my days going?):
******************************************************************************************************************************************
[lmbench3.0 results for Linux T-01 3.0.36+ #29 PREEMPT Thu Jul 19 16:15:03 IST 2012 armv7l GNU/Linux]
[LMBENCH_VER: 3.0-a9]
[BENCHMARK_HARDWARE: YES]
[BENCHMARK_OS: YES]
[ALL: 512 1k 2k 4k 8k 16k 32k 64k 128k 256k 512k 1m 2m 4m 8m 16m 32m 64m 128m]
[DISKS: /dev/mmcblk0 ]
[DISK_DESC: [/dev/mmcblk0:Class 10 SD card] ]
[ENOUGH: 100000]
[FAST: ]
[FASTMEM: NO]
[FILE: /var/tmp/XXX]
[FSDIR: /var/tmp]
[HALF: 512 1k 2k 4k 8k 16k 32k 64k 128k 256k 512k 1m 2m 4m 8m 16m 32m 64m]
[INFO: INFO.T-01]
[LINE_SIZE: 64]
[LOOP_O: 0.00000000]
[MB: 180]
[MHZ: 1006 MHz, 0.9940 nanosec clock]
[MOTHERBOARD: ]
[NETWORKS: ]
[PROCESSORS: 0]
[REMOTE: ]
[SLOWFS: NO]
[OS: armv7l-linux-gnu]
[SYNC_MAX: 1]
[LMBENCH_SCHED: DEFAULT]
[TIMING_O: 0]
[LMBENCH VERSION: 3.0-a9]
[SYSNAME: Linux]
[PROCESSOR: unknown]
[MACHINE: armv7l]
[RELEASE: 3.0.36+]
[VERSION: #29 PREEMPT Thu Jul 19 16:15:03 IST 2012]
Simple syscall: 0.7532 microseconds
Simple read: 0.6180 microseconds
Simple write: 1.4339 microseconds
Simple stat: 4.4727 microseconds
Simple fstat: 1.6094 microseconds
Simple open/close: 11.7917 microseconds
Select on 10 fd's: 2.0400 microseconds
Select on 100 fd's: 9.9370 microseconds
Select on 250 fd's: 22.6034 microseconds
Select on 500 fd's: 44.0101 microseconds
Select on 10 tcp fd's: 2.2488 microseconds
Select on 100 tcp fd's: 16.2673 microseconds
Select on 250 tcp fd's: 39.5860 microseconds
Select on 500 tcp fd's: 78.6743 microseconds
Signal handler installation: 1.1653 microseconds
Signal handler overhead: 5.7945 microseconds
Protection fault: 1.2939 microseconds
Pipe latency: 20.4761 microseconds
AF_UNIX sock stream latency: 36.7843 microseconds
Process fork+exit: 857.1562 microseconds
Process fork+execve: 2762.3250 microseconds
Process fork+/bin/sh -c: 6459.6250 microseconds
integer bit: 1.00 nanoseconds
integer add: 1.00 nanoseconds
integer mul: 5.97 nanoseconds
integer div: 72.19 nanoseconds
integer mod: 22.89 nanoseconds
int64 bit: 1.01 nanoseconds
uint64 add: 1.34 nanoseconds
int64 mul: 10.99 nanoseconds
int64 div: 276.09 nanoseconds
int64 mod: 189.04 nanoseconds
float add: 8.95 nanoseconds
float mul: 9.96 nanoseconds
float div: 32.83 nanoseconds
double add: 8.86 nanoseconds
double mul: 10.95 nanoseconds
double div: 56.70 nanoseconds
float bogomflops: 103.75 nanoseconds
double bogomflops: 133.07 nanoseconds
integer bit parallelism: 1.49
integer add parallelism: 1.71
integer mul parallelism: 2.90
integer div parallelism: 1.19
integer mod parallelism: 1.00
int64 bit parallelism: 1.00
int64 add parallelism: 1.00
int64 mul parallelism: 1.00
int64 div parallelism: 1.03
int64 mod parallelism: 1.00
float add parallelism: 1.00
float mul parallelism: 1.00
float div parallelism: 1.00
double add parallelism: 1.00
double mul parallelism: 1.00
double div parallelism: 1.00
File /var/tmp/XXX write bandwidth: 3749 KB/sec
Pagefaults on /var/tmp/XXX: 5.7158 microseconds

"mappings
0.524288 58
1.048576 88
2.097152 121
4.194304 223
8.388608 502
16.777216 993
33.554432 1720
67.108864 3440
134.217728 7234

"File system latency
0k 1219 11075 12176
1k 640 5798 11316
4k 796 7247 12598
10k 606 5336 4561

"Seek times for /dev/mmcblk0
2144.3 1.20
. . .
2.3 4.56

"Zone bandwidth for /dev/mmcblk0
1.1 7.92
. . .
2136.0 5.83

Cannot register service: RPC: Unable to receive; errno = Connection refused
unable to register (XACT_PROG, XACT_VERS, udp).
UDP latency using localhost: 843.0640 microseconds
TCP latency using localhost: 72.1819 microseconds
localhost: RPC: Port mapper failure - RPC: Unable to receive
localhost: RPC: Remote system error - Connection refused
(null): RPC: Port mapper failure - RPC: Unable to receive
TCP/IP connection cost to localhost: 237.1810 microseconds

Socket bandwidth using localhost
0.000001 0.18 MB/sec
0.000064 5.38 MB/sec
0.000128 11.73 MB/sec
0.000256 23.78 MB/sec
0.000512 32.98 MB/sec
0.001024 68.43 MB/sec
0.001437 73.65 MB/sec
10.000000 11.36 MB/sec

Avg xfer: 3.2KB, 41.8KB in 125 millisecs, 334.72 KB/sec
AF_UNIX sock stream bandwidth: 298.71 MB/sec
Pipe bandwidth: 382.30 MB/sec

"read bandwidth
0.000512 139.10
0.001024 243.50
0.002048 377.40
0.004096 528.00
0.008192 570.93
0.016384 585.82
0.032768 528.61
0.065536 517.81
0.131072 505.91
0.262144 314.93
0.524288 255.53
1.05 251.90
2.10 251.07
4.19 251.41
8.39 253.83
16.78 250.66
33.55 173.86
67.11 247.58
134.22 251.63

"read open2close bandwidth
0.000512 32.24
0.001024 62.78
0.002048 116.12
0.004096 203.52
0.008192 306.37
0.016384 394.64
0.032768 430.55
0.065536 398.57
0.131072 422.25
0.262144 288.92
0.524288 244.78
1.05 242.44
2.10 243.89
4.19 248.31
8.39 249.92
16.78 246.06
33.55 248.99
67.11 252.35
134.22 252.45


"Mmap read bandwidth
0.000512 1806.28
0.001024 1893.87
0.002048 1940.11
0.004096 1950.39
0.008192 1968.08
0.016384 1979.01
0.032768 1975.34
0.065536 1635.13
0.131072 1550.43
0.262144 873.81
0.524288 464.59
1.05 400.09
2.10 378.96
4.19 370.02
8.39 369.12
16.78 367.76
33.55 367.42
67.11 367.81
134.22 367.49

"Mmap read open2close bandwidth
0.000512 14.69
0.001024 28.96
0.002048 56.74
0.004096 109.29
0.008192 180.09
0.016384 267.14
0.032768 357.13
0.065536 426.66
0.131072 465.39
0.262144 335.84
0.524288 250.93
1.05 238.58
2.10 239.72
4.19 242.01
8.39 244.42
16.78 246.11
33.55 246.65
67.11 246.95
134.22 247.50


"libc bcopy unaligned
0.000512 1210.05
0.001024 1219.35
0.002048 1220.14
0.004096 1105.66
0.008192 1111.43
0.016384 1147.37
0.032768 1216.29
0.065536 1161.04
0.131072 1161.93
0.262144 719.34
0.524288 294.15
1.05 255.16
2.10 242.89
4.19 241.08
8.39 228.70
16.78 239.74
33.55 239.96
67.11 239.66

"libc bcopy aligned
0.000512 1067.32
0.001024 1214.50
0.002048 1209.24
0.004096 1134.18
0.008192 1225.28
0.016384 1211.34
0.032768 1186.66
0.065536 1188.70
0.131072 1185.33
0.262144 728.16
0.524288 289.13
1.05 255.89
2.10 245.34
4.19 241.12
8.39 233.28
16.78 234.36
33.55 240.74
67.11 241.09

Memory bzero bandwidth
0.000512 1208.87
0.001024 1207.50
0.002048 1207.78
0.004096 1244.95
0.008192 1201.59
0.016384 1213.30
0.032768 1220.38
0.065536 1193.96
0.131072 1199.99
0.262144 1186.55
0.524288 1179.67
1.05 1186.09
2.10 1176.58
4.19 1175.59
8.39 1179.28
16.78 1178.72
33.55 1176.84
67.11 1175.63
134.22 1172.70

"unrolled bcopy unaligned
0.000512 1390.63
0.001024 1298.37
0.002048 1258.82
0.004096 1219.77
0.008192 1112.04
0.016384 1140.01
0.032768 1236.03
0.065536 1173.69
0.131072 1162.66
0.262144 728.71
0.524288 301.12
1.05 253.49
2.10 243.12
4.19 240.50
8.39 239.32
16.78 227.47
33.55 239.80
67.11 238.53

"unrolled partial bcopy unaligned
0.000512 1219.28
0.001024 1219.24
0.002048 1077.06
0.004096 1346.03
0.008192 1238.71
0.016384 1138.14
0.032768 1232.53
0.065536 1178.53
0.131072 1172.84
0.262144 833.18
0.524288 426.31
1.05 389.04
2.10 382.50
4.19 383.14
8.39 384.65
16.78 380.51
33.55 380.91
67.11 388.26

Memory read bandwidth
0.000512 1957.73
0.001024 1976.03
0.002048 1986.07
0.004096 1991.69
0.008192 1986.06
0.016384 1992.73
0.032768 1921.35
0.065536 1637.99
0.131072 1551.28
0.262144 958.61
0.524288 465.71
1.05 402.52
2.10 380.84
4.19 370.82
8.39 369.14
16.78 368.34
33.55 368.49
67.11 368.30
134.22 368.56

Memory partial read bandwidth
0.000512 7253.59
0.001024 7518.32
0.002048 7655.38
0.004096 7732.79
0.008192 7674.63
0.016384 7734.16
0.032768 7662.80
0.065536 4006.80
0.131072 3506.36
0.262144 1358.82
0.524288 555.14
1.05 448.81
2.10 423.47
4.19 411.66
8.39 404.79
16.78 404.39
33.55 403.63
67.11 403.80
134.22 403.90

Memory write bandwidth
0.000512 1219.75
0.001024 1219.27
0.002048 1219.80
0.004096 1219.83
0.008192 1119.96
0.016384 1104.34
0.032768 1112.00
0.065536 1159.54
0.131072 1172.81
0.262144 1175.02
0.524288 1183.41
1.05 1187.41
2.10 1178.91
4.19 1175.55
8.39 1179.51
16.78 1178.63
33.55 1177.27
67.11 1175.56
134.22 1172.83

Memory partial write bandwidth
0.000512 1219.53
0.001024 1077.16
0.002048 1111.63
0.004096 1238.98
0.008192 1239.12
0.016384 1217.89
0.032768 1179.30
0.065536 1174.26
0.131072 1190.72
0.262144 1174.23
0.524288 1180.51
1.05 1188.50
2.10 1179.55
4.19 1176.40
8.39 1181.07
16.78 1180.86
33.55 1179.38
67.11 1178.11
134.22 1175.85

Memory partial read/write bandwidth
0.000512 5001.23
0.001024 5125.74
0.002048 5190.80
0.004096 5221.99
0.008192 5191.22
0.016384 5172.26
0.032768 5214.00
0.065536 3424.06
0.131072 2994.06
0.262144 1122.56
0.524288 391.59
1.05 321.80
2.10 315.64
4.19 314.97
8.39 314.07
16.78 313.19
33.55 314.90
67.11 315.54
134.22 315.44



"size=0k ovr=5.59
2 5.28
4 6.63
8 6.86
16 8.52
24 10.62
32 12.66
64 12.72
96 14.21

"size=4k ovr=7.59
2 5.05
4 7.18
8 7.75
16 10.12
24 13.32
32 16.47
64 23.55
96 26.63

"size=8k ovr=9.95
2 4.70
4 7.29
8 11.44
16 16.89
24 21.37
32 26.98
64 35.85
96 37.79

"size=16k ovr=14.24
2 6.47
4 9.14
8 13.97
16 33.23
24 46.29
32 52.73
64 57.17
96 58.09

"size=32k ovr=23.77
2 9.46
4 14.10
8 46.58
16 84.32
24 92.43
32 94.34
64 95.64
96 96.19

"size=64k ovr=49.42
2 10.28
4 63.45
8 139.68
16 159.40
24 161.90
32 163.23
64 163.90
96 163.40

tlb: 32 pages

Memory load parallelism
0.001024 3.03
0.002048 3.00
0.004096 3.00
0.008192 3.00
0.016384 3.00
0.032768 3.67
0.065536 1.06
0.131072 1.00
0.262144 1.10
0.524288 1.00
1.048576 1.02
2.097152 1.00
4.194304 1.00
8.388608 1.00
16.777216 1.00
33.554432 1.00
67.108864 1.00
134.217728 1.00

STREAM copy latency: 33.29 nanoseconds
STREAM copy bandwidth: 480.65 MB/sec
STREAM scale latency: 42.17 nanoseconds
STREAM scale bandwidth: 379.40 MB/sec
STREAM add latency: 48.61 nanoseconds
STREAM add bandwidth: 493.72 MB/sec
STREAM triad latency: 38.52 nanoseconds
STREAM triad bandwidth: 622.99 MB/sec
STREAM2 fill latency: 6.82 nanoseconds
STREAM2 fill bandwidth: 1172.85 MB/sec
STREAM2 copy latency: 33.20 nanoseconds
STREAM2 copy bandwidth: 481.88 MB/sec
STREAM2 daxpy latency: 66.37 nanoseconds
STREAM2 daxpy bandwidth: 361.60 MB/sec
STREAM2 sum latency: 16.90 nanoseconds
STREAM2 sum bandwidth: 473.41 MB/sec

Memory load latency
"stride=16
0.00049 3.020
0.00098 3.172
0.00195 3.001
0.00293 3.005
0.00391 2.993
0.00586 2.990
0.00781 2.987
0.01172 2.985
0.01562 2.994
0.02344 3.004
0.03125 3.139
0.04688 4.406
0.06250 4.685
0.09375 4.969
0.12500 5.074
0.18750 7.413
0.25000 14.734
0.37500 24.801
0.50000 30.797
0.75000 35.660
1.00000 37.250
1.50000 38.967
2.00000 39.629
3.00000 40.333
4.00000 40.888
6.00000 41.218
8.00000 41.295
12.00000 41.456
16.00000 41.485
24.00000 41.485
32.00000 41.466
48.00000 41.501
64.00000 41.506
96.00000 41.515
128.00000 41.553

"stride=32
0.00049 3.020
0.00098 3.020
0.00195 3.020
0.00293 3.020
0.00391 3.000
0.00586 3.000
0.00781 2.993
0.01172 2.990
0.01562 2.987
0.02344 2.986
0.03125 3.002
0.04688 7.866
0.06250 9.097
0.09375 10.286
0.12500 10.784
0.18750 13.843
0.25000 25.698
0.37500 48.369
0.50000 60.045
0.75000 67.451
1.00000 70.771
1.50000 73.703
2.00000 74.434
3.00000 76.287
4.00000 76.609
6.00000 77.668
8.00000 77.953
12.00000 78.181
16.00000 78.196
24.00000 78.186
32.00000 78.194
48.00000 78.177
64.00000 78.263
96.00000 78.281
128.00000 78.251

"stride=64
0.00049 3.020
0.00098 3.020
0.00195 3.019
0.00293 3.020
0.00391 3.020
0.00586 3.020
0.00781 3.000
0.01172 3.000
0.01562 2.993
0.02344 3.036
0.03125 3.326
0.04688 8.624
0.06250 9.745
0.09375 10.869
0.12500 11.364
0.18750 19.664
0.25000 32.777
0.37500 91.022
0.50000 114.715
0.75000 131.034
1.00000 137.576
1.50000 143.841
2.00000 145.732
3.00000 149.225
4.00000 150.108
6.00000 151.872
8.00000 152.557
12.00000 153.338
16.00000 153.478
24.00000 153.569
32.00000 153.573
48.00000 153.616
64.00000 153.647
96.00000 153.701
128.00000 153.601

"stride=128
0.00049 3.020
0.00098 3.020
0.00195 3.020
0.00293 3.020
0.00391 3.020
0.00586 3.020
0.00781 3.020
0.01172 3.020
0.01562 3.000
0.02344 3.042
0.03125 3.432
0.04688 8.587
0.06250 9.860
0.09375 11.140
0.12500 21.145
0.18750 31.116
0.25000 48.631
0.37500 93.143
0.50000 118.803
0.75000 133.655
1.00000 139.659
1.50000 144.707
2.00000 146.910
3.00000 150.116
4.00000 150.991
6.00000 152.522
8.00000 152.946
12.00000 153.855
16.00000 154.264
24.00000 154.351
32.00000 154.620
48.00000 154.500
64.00000 154.636
96.00000 154.573
128.00000 154.638

"stride=256
0.00049 3.020
0.00098 3.020
0.00195 3.020
0.00293 3.020
0.00391 3.020
0.00586 3.020
0.00781 3.021
0.01172 3.020
0.01562 3.020
0.02344 3.046
0.03125 3.302
0.04688 8.534
0.06250 9.856
0.09375 10.734
0.12500 11.508
0.18750 14.509
0.25000 47.722
0.37500 96.329
0.50000 116.018
0.75000 134.269
1.00000 140.638
1.50000 146.226
2.00000 148.979
3.00000 151.747
4.00000 153.054
6.00000 154.368
8.00000 154.994
12.00000 155.722
16.00000 156.117
24.00000 156.287
32.00000 156.339
48.00000 156.737
64.00000 156.684
96.00000 156.716
128.00000 156.867

"stride=512
0.00049 3.020
0.00098 3.020
0.00195 3.020
0.00293 3.020
0.00391 3.019
0.00586 3.020
0.00781 3.020
0.01172 3.020
0.01562 3.019
0.02344 3.020
0.03125 3.540
0.04688 8.462
0.06250 9.603
0.09375 10.819
0.12500 11.291
0.18750 19.852
0.25000 37.931
0.37500 97.438
0.50000 122.978
0.75000 138.626
1.00000 144.861
1.50000 150.684
2.00000 153.545
3.00000 156.289
4.00000 157.927
6.00000 159.003
8.00000 159.631
12.00000 160.248
16.00000 160.713
24.00000 160.907
32.00000 160.937
48.00000 161.161
64.00000 161.237
96.00000 161.519
128.00000 161.533

"stride=1024
0.00098 3.020
0.00195 3.020
0.00293 3.020
0.00391 3.020
0.00586 3.020
0.00781 3.020
0.01172 3.020
0.01562 3.020
0.02344 3.020
0.03125 3.773
0.04688 8.279
0.06250 9.557
0.09375 10.783
0.12500 11.276
0.18750 21.717
0.25000 58.376
0.37500 102.708
0.50000 128.780
0.75000 145.166
1.00000 152.816
1.50000 158.482
2.00000 161.555
3.00000 164.198
4.00000 165.752
6.00000 166.894
8.00000 167.470
12.00000 168.095
16.00000 168.659
24.00000 168.665
32.00000 168.863
48.00000 168.946
64.00000 168.973
96.00000 169.287
128.00000 169.481


Random load latency
"stride=16
0.00049 3.020
0.00098 3.019
0.00195 3.000
0.00293 3.000
0.00391 2.993
0.00586 2.990
0.00781 2.987
0.01172 2.985
0.01562 2.995
0.02344 2.990
0.03125 6.751
0.04688 10.821
0.06250 11.149
0.09375 12.579
0.12500 13.162
0.18750 53.589
0.25000 81.349
0.37500 151.321
0.50000 178.238
0.75000 197.915
1.00000 206.744
1.50000 221.797
2.00000 225.629
3.00000 243.867
4.00000 253.110
6.00000 268.904
8.00000 287.595
12.00000 298.591
16.00000 302.796
24.00000 306.267
32.00000 307.175
48.00000 306.927
64.00000 306.211
96.00000 303.287
128.00000 304.590
Preliminary Conclusions

Obviously the A10 is way faster than the GoFlex. There is no contest there. The 3.0.8+ A10 kernel is a bit slower than the 3.0.36+ kernel. That is also an easy conclusion.

When it comes to comparing the two 3.0.36+ kernel configs, it is harder to draw any major conclusions. There is no doubt that little to no performance has been lost by using the a10linux config. In some benchmarks, such as Simple syscall and Simple open/close, the a10linux config is actually much faster. In others, such as networking and some memory copy operations, the defconfig seems to be faster. But keep in mind that the defconfig run had no iptables, X server, or LXDE running; the only services running were bluetooth, NetworkManager, and avahi. So it is really not a fair comparison with the a10linux config. The comparison would be more reliable if it were rerun with HDMI and X enabled on the defconfig, and repeated several times.
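
For anyone who wants to line these numbers up without reading the dumps by eye, here is a minimal Python sketch that pulls the "name: value microseconds" lines out of lmbench output. The function name `parse_latencies` is made up for this example and is not part of lmbench. The sample lines are copied from the defconfig-plus-video run posted further down in this thread.

```python
import re

# Matches lines such as "Simple syscall: 0.7484 microseconds".
# Lines reported in nanoseconds or MB/sec are deliberately skipped.
LATENCY_RE = re.compile(
    r"^(?P<name>[^:]+):\s+(?P<value>\d+(?:\.\d+)?)\s+microseconds$"
)


def parse_latencies(text):
    """Return {benchmark name: latency in microseconds} for matching lines."""
    results = {}
    for line in text.splitlines():
        match = LATENCY_RE.match(line.strip())
        if match:
            results[match.group("name")] = float(match.group("value"))
    return results


# Sample lines copied from the 3.0.36+ #30 run in this thread.
SAMPLE = """\
Simple syscall: 0.7484 microseconds
Simple read: 0.6400 microseconds
Simple open/close: 12.9788 microseconds
integer bit: 1.00 nanoseconds
Pipe latency: 20.6322 microseconds
"""

latencies = parse_latencies(SAMPLE)
for name, value in latencies.items():
    print(f"{name:20s} {value:10.4f} us")
```

Running the same function over two saved result files gives two dictionaries that can be compared key by key, which is easier than scrolling between posts.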

It is easy to say, however, that our new kernel config has little to no detrimental effect. It also has a lot of benefits (like HDMI).

Since the defconfig had no video, it was not a fair comparison with the other tests. So I recompiled the defconfig with only one change: the video drivers built into the kernel. Here are the results of that benchmark:
************************************************************************************************************************************
[lmbench3.0 results for Linux T-01 3.0.36+ #30 PREEMPT Thu Jul 19 21:07:43 IST 2012 armv7l GNU/Linux]
[LMBENCH_VER: 3.0-a9]
[BENCHMARK_HARDWARE: YES]
[BENCHMARK_OS: YES]
[ALL: 512 1k 2k 4k 8k 16k 32k 64k 128k 256k 512k 1m 2m 4m 8m 16m 32m 64m 128m]
[DISKS: /dev/mmcblk0 ]
[DISK_DESC: [/dev/mmcblk0:Class 10 SD card] ]
[ENOUGH: 100000]
[FAST: ]
[FASTMEM: NO]
[FILE: /var/tmp/XXX]
[FSDIR: /var/tmp]
[HALF: 512 1k 2k 4k 8k 16k 32k 64k 128k 256k 512k 1m 2m 4m 8m 16m 32m 64m]
[INFO: INFO.T-01]
[LINE_SIZE: 64]
[LOOP_O: 0.00000000]
[MB: 180]
[MHZ: 1006 MHz, 0.9940 nanosec clock]
[MOTHERBOARD: ]
[NETWORKS: ]
[PROCESSORS: 0]
[REMOTE: ]
[SLOWFS: NO]
[OS: armv7l-linux-gnu]
[SYNC_MAX: 1]
[LMBENCH_SCHED: DEFAULT]
[TIMING_O: 0]
[LMBENCH VERSION: 3.0-a9]
[SYSNAME: Linux]
[PROCESSOR: unknown]
[MACHINE: armv7l]
[RELEASE: 3.0.36+]
[VERSION: #30 PREEMPT Thu Jul 19 21:07:43 IST 2012]
Simple syscall: 0.7484 microseconds
Simple read: 0.6400 microseconds
Simple write: 1.4195 microseconds
Simple stat: 4.5492 microseconds
Simple fstat: 1.6254 microseconds
Simple open/close: 12.9788 microseconds
Select on 10 fd's: 2.1492 microseconds
Select on 100 fd's: 9.9708 microseconds
Select on 250 fd's: 22.6328 microseconds
Select on 500 fd's: 44.0149 microseconds
Select on 10 tcp fd's: 2.2489 microseconds
Select on 100 tcp fd's: 16.3856 microseconds
Select on 250 tcp fd's: 39.7070 microseconds
Select on 500 tcp fd's: 78.7194 microseconds
Signal handler installation: 1.1680 microseconds
Signal handler overhead: 5.8662 microseconds
Protection fault: 1.3692 microseconds
Pipe latency: 20.6322 microseconds
AF_UNIX sock stream latency: 37.4667 microseconds
Process fork+exit: 888.1852 microseconds
Process fork+execve: 2909.7222 microseconds
Process fork+/bin/sh -c: 6579.2941 microseconds
integer bit: 1.00 nanoseconds
integer add: 1.00 nanoseconds
integer mul: 5.97 nanoseconds
integer div: 72.18 nanoseconds
integer mod: 22.90 nanoseconds
int64 bit: 1.06 nanoseconds
uint64 add: 1.41 nanoseconds
int64 mul: 10.98 nanoseconds
int64 div: 276.62 nanoseconds
int64 mod: 189.00 nanoseconds
float add: 8.86 nanoseconds
float mul: 9.95 nanoseconds
float div: 32.82 nanoseconds
double add: 8.87 nanoseconds
double mul: 12.04 nanoseconds
double div: 62.72 nanoseconds
float bogomflops: 104.36 nanoseconds
double bogomflops: 133.73 nanoseconds
integer bit parallelism: 1.49
integer add parallelism: 1.71
integer mul parallelism: 2.90
integer div parallelism: 1.19
integer mod parallelism: 1.09
int64 bit parallelism: 1.00
int64 add parallelism: 1.00
int64 mul parallelism: 1.00
int64 div parallelism: 1.03
int64 mod parallelism: 1.00
float add parallelism: 1.00
float mul parallelism: 1.00
float div parallelism: 1.00
double add parallelism: 1.00
double mul parallelism: 1.00
double div parallelism: 1.00
File /var/tmp/XXX write bandwidth: 3823 KB/sec
Pagefaults on /var/tmp/XXX: 5.7794 microseconds

"mappings
0.524288 46
1.048576 71
2.097152 119
4.194304 223
8.388608 429
16.777216 847
33.554432 1710
67.108864 3468
134.217728 7269

"File system latency
0k 1260 10477 12549
1k 847 7323 9496
4k 835 7562 12300
10k 565 5434 10104

"Seek times for /dev/mmcblk0
2144.3 1.19
. . .
2.3 4.63

"Zone bandwidth for /dev/mmcblk0
1.1 7.58
. . .
2136.0 5.86

Cannot register service: RPC: Unable to receive; errno = Connection refused
unable to register (XACT_PROG, XACT_VERS, udp).
UDP latency using localhost: 939.1154 microseconds
TCP latency using localhost: 1267.5260 microseconds
localhost: RPC: Port mapper failure - RPC: Unable to receive
localhost: RPC: Remote system error - Connection refused
(null): RPC: Port mapper failure - RPC: Unable to receive
TCP/IP connection cost to localhost: 243.5045 microseconds

Socket bandwidth using localhost
0.000001 0.17 MB/sec
0.000064 4.97 MB/sec
0.000128 9.42 MB/sec
0.000256 17.75 MB/sec
0.000512 25.89 MB/sec
0.001024 65.22 MB/sec
0.001437 77.08 MB/sec
10.000000 8.66 MB/sec

Avg xfer: 3.2KB, 41.8KB in 126 millisecs, 332.75 KB/sec
AF_UNIX sock stream bandwidth: 20.81 MB/sec
Pipe bandwidth: 381.11 MB/sec

"read bandwidth
0.000512 140.65
0.001024 241.87
0.002048 375.67
0.004096 527.49
0.008192 574.10
0.016384 584.08
0.032768 521.99
0.065536 516.30
0.131072 508.29
0.262144 316.92
0.524288 257.20
1.05 253.92
2.10 251.46
4.19 258.72
8.39 254.98
16.78 237.63
33.55 185.53
67.11 252.86
134.22 249.24

"read open2close bandwidth
0.000512 32.38
0.001024 62.66
0.002048 115.51
0.004096 204.35
0.008192 306.71
0.016384 334.97
0.032768 353.51
0.065536 432.61
0.131072 410.62
0.262144 289.09
0.524288 244.07
1.05 238.22
2.10 249.16
4.19 242.61
8.39 247.36
16.78 246.70
33.55 246.34
67.11 252.54
134.22 249.17


"Mmap read bandwidth
0.000512 1804.82
0.001024 1890.72
0.002048 1938.67
0.004096 1948.85
0.008192 1966.58
0.016384 1975.99
0.032768 1971.91
0.065536 1634.74
0.131072 1547.22
0.262144 991.07
0.524288 460.29
1.05 397.92
2.10 376.96
4.19 369.01
8.39 367.15
16.78 367.15
33.55 366.85
67.11 366.52
134.22 367.08

"Mmap read open2close bandwidth
0.000512 14.79
0.001024 29.13
0.002048 57.20
0.004096 109.73
0.008192 181.09
0.016384 269.73
0.032768 358.28
0.065536 429.27
0.131072 456.60
0.262144 339.03
0.524288 249.82
1.05 239.06
2.10 239.06
4.19 240.74
8.39 242.43
16.78 244.75
33.55 245.81
67.11 245.82
134.22 246.54


"libc bcopy unaligned
0.000512 1190.46
0.001024 1214.36
0.002048 1074.53
0.004096 1202.91
0.008192 1199.12
0.016384 1146.08
0.032768 1137.23
0.065536 1183.55
0.131072 1178.07
0.262144 679.75
0.524288 291.44
1.05 253.09
2.10 243.78
4.19 240.66
8.39 238.71
16.78 239.22
33.55 239.24
67.11 223.97

"libc bcopy aligned
0.000512 1068.51
0.001024 1200.07
0.002048 1239.90
0.004096 1114.12
0.008192 1152.51
0.016384 1215.03
0.032768 1152.86
0.065536 1196.60
0.131072 1066.26
0.262144 755.29
0.524288 288.33
1.05 253.54
2.10 243.73
4.19 240.40
8.39 234.24
16.78 239.29
33.55 228.53
67.11 239.09

Memory bzero bandwidth
0.000512 1198.18
0.001024 1200.26
0.002048 1200.13
0.004096 1197.64
0.008192 1165.86
0.016384 1211.14
0.032768 1211.72
0.065536 1180.37
0.131072 1175.90
0.262144 1174.98
0.524288 1165.75
1.05 1169.60
2.10 1176.41
4.19 1180.09
8.39 1173.13
16.78 1181.30
33.55 1177.49
67.11 1175.77
134.22 1170.27

"unrolled bcopy unaligned
0.000512 1213.42
0.001024 1213.79
0.002048 1214.15
0.004096 1211.62
0.008192 1210.18
0.016384 1211.84
0.032768 1188.65
0.065536 1187.56
0.131072 1178.61
0.262144 724.17
0.524288 267.61
1.05 248.84
2.10 241.97
4.19 240.52
8.39 232.11
16.78 239.66
33.55 239.09
67.11 239.24

"unrolled partial bcopy unaligned
0.000512 1078.69
0.001024 1295.28
0.002048 1215.62
0.004096 1214.38
0.008192 1210.81
0.016384 1214.51
0.032768 1189.90
0.065536 1179.10
0.131072 1180.87
0.262144 811.97
0.524288 416.53
1.05 379.42
2.10 376.64
4.19 381.59
8.39 380.70
16.78 386.63
33.55 389.80
67.11 359.13

Memory read bandwidth
0.000512 1955.99
0.001024 1974.68
0.002048 1984.87
0.004096 1989.53
0.008192 1984.60
0.016384 1989.26
0.032768 1936.58
0.065536 1634.74
0.131072 1556.04
0.262144 1145.89
0.524288 461.09
1.05 399.24
2.10 378.61
4.19 370.43
8.39 366.98
16.78 367.74
33.55 367.51
67.11 367.18
134.22 362.16

Memory partial read bandwidth
0.000512 7243.15
0.001024 7510.26
0.002048 7653.30
0.004096 7720.83
0.008192 7658.74
0.016384 7726.51
0.032768 7638.97
0.065536 4012.78
0.131072 3458.43
0.262144 1383.04
0.524288 556.60
1.05 444.65
2.10 419.82
4.19 407.11
8.39 402.79
16.78 402.44
33.55 402.47
67.11 402.68
134.22 402.98

Memory write bandwidth
0.000512 1210.85
0.001024 1214.64
0.002048 1253.07
0.004096 1213.17
0.008192 1112.41
0.016384 1137.49
0.032768 1148.95
0.065536 1167.57
0.131072 1172.37
0.262144 1178.02
0.524288 1172.87
1.05 1151.94
2.10 1168.92
4.19 1181.95
8.39 1176.04
16.78 1178.63
33.55 1177.05
67.11 1176.47
134.22 1171.53

Memory partial write bandwidth
0.000512 1197.79
0.001024 1204.67
0.002048 1240.32
0.004096 1202.52
0.008192 1200.94
0.016384 1160.63
0.032768 1185.16
0.065536 1170.03
0.131072 1184.18
0.262144 1180.17
0.524288 1190.29
1.05 1177.23
2.10 1178.76
4.19 1181.60
8.39 1176.81
16.78 1180.44
33.55 1178.45
67.11 1177.77
134.22 1173.63

Memory partial read/write bandwidth
0.000512 4993.49
0.001024 5118.28
0.002048 5181.94
0.004096 5218.56
0.008192 5183.50
0.016384 5214.61
0.032768 5214.12
0.065536 3401.48
0.131072 2998.80
0.262144 1314.67
0.524288 384.32
1.05 327.89
2.10 311.81
4.19 314.41
8.39 312.45
16.78 312.56
33.55 312.20
67.11 313.26
134.22 314.65



"size=0k ovr=5.56
2 5.39
4 6.86
8 6.86
16 7.77
24 9.17
32 10.21
64 12.93
96 14.23

"size=4k ovr=7.76
2 4.88
4 6.95
8 7.76
16 10.24
24 13.99
32 16.35
64 23.30
96 26.43

"size=8k ovr=9.90
2 5.11
4 7.21
8 8.79
16 16.18
24 22.49
32 28.67
64 35.81
96 37.98

"size=16k ovr=14.25
2 6.37
4 9.29
8 15.80
16 35.43
24 49.08
32 52.47
64 57.34
96 58.57

"size=32k ovr=23.98
2 9.68
4 25.48
8 47.65
16 84.90
24 92.45
32 94.26
64 96.33
96 96.88

"size=64k ovr=49.84
2 21.34
4 60.28
8 139.83
16 157.65
24 161.65
32 163.47
64 164.87
96 164.61

tlb: 32 pages

Memory load parallelism
0.001024 3.01
0.002048 3.01
0.004096 3.01
0.008192 3.00
0.016384 3.00
0.032768 3.88
0.065536 1.08
0.131072 1.00
0.262144 1.00
0.524288 1.00
1.048576 1.00
2.097152 1.00
4.194304 1.00
8.388608 1.00
16.777216 1.00
33.554432 1.00
67.108864 1.00
134.217728 1.00

STREAM copy latency: 33.32 nanoseconds
STREAM copy bandwidth: 480.20 MB/sec
STREAM scale latency: 42.04 nanoseconds
STREAM scale bandwidth: 380.58 MB/sec
STREAM add latency: 48.58 nanoseconds
STREAM add bandwidth: 493.98 MB/sec
STREAM triad latency: 38.75 nanoseconds
STREAM triad bandwidth: 619.38 MB/sec
STREAM2 fill latency: 6.82 nanoseconds
STREAM2 fill bandwidth: 1173.10 MB/sec
STREAM2 copy latency: 33.41 nanoseconds
STREAM2 copy bandwidth: 478.86 MB/sec
STREAM2 daxpy latency: 66.49 nanoseconds
STREAM2 daxpy bandwidth: 360.93 MB/sec
STREAM2 sum latency: 16.94 nanoseconds
STREAM2 sum bandwidth: 472.21 MB/sec

Memory load latency
"stride=16
0.00049 3.155
0.00098 3.023
0.00195 3.002
0.00293 3.003
0.00391 2.997
0.00586 2.992
0.00781 2.989
0.01172 2.987
0.01562 2.997
0.02344 3.016
0.03125 3.165
0.04688 4.424
0.06250 4.733
0.09375 4.951
0.12500 5.270
0.18750 8.466
0.25000 13.650
0.37500 24.626
0.50000 31.112
0.75000 35.934
1.00000 37.551
1.50000 39.192
2.00000 40.105
3.00000 40.520
4.00000 41.123
6.00000 41.464
8.00000 41.522
12.00000 41.599
16.00000 41.622
24.00000 41.560
32.00000 41.661
48.00000 41.607
64.00000 41.633
96.00000 41.580
128.00000 41.681

"stride=32
0.00049 3.022
0.00098 3.022
0.00195 3.022
0.00293 3.022
0.00391 3.002
0.00586 3.003
0.00781 2.995
0.01172 2.992
0.01562 2.989
0.02344 2.989
0.03125 3.020
0.04688 7.946
0.06250 9.181
0.09375 10.302
0.12500 10.762
0.18750 16.643
0.25000 38.431
0.37500 52.818
0.50000 60.112
0.75000 67.630
1.00000 71.169
1.50000 73.985
2.00000 75.182
3.00000 76.343
4.00000 77.255
6.00000 77.858
8.00000 78.148
12.00000 78.526
16.00000 78.388
24.00000 78.339
32.00000 78.349
48.00000 78.307
64.00000 78.383
96.00000 78.409
128.00000 78.437

"stride=64
0.00049 3.022
0.00098 3.022
0.00195 3.022
0.00293 3.022
0.00391 3.022
0.00586 3.022
0.00781 3.002
0.01172 3.002
0.01562 2.997
0.02344 3.057
0.03125 3.466
0.04688 8.801
0.06250 9.799
0.09375 10.833
0.12500 11.424
0.18750 24.488
0.25000 36.959
0.37500 93.599
0.50000 112.242
0.75000 132.097
1.00000 138.431
1.50000 144.713
2.00000 146.804
3.00000 149.810
4.00000 150.761
6.00000 152.777
8.00000 153.309
12.00000 154.137
16.00000 153.877
24.00000 153.794
32.00000 154.003
48.00000 154.081
64.00000 153.936
96.00000 154.003
128.00000 154.040

"stride=128
0.00049 3.022
0.00098 3.022
0.00195 3.022
0.00293 3.022
0.00391 3.022
0.00586 3.022
0.00781 3.022
0.01172 3.022
0.01562 3.003
0.02344 3.034
0.03125 3.519
0.04688 8.631
0.06250 9.850
0.09375 10.942
0.12500 11.640
0.18750 37.369
0.25000 51.451
0.37500 92.014
0.50000 113.522
0.75000 133.776
1.00000 139.379
1.50000 144.739
2.00000 147.030
3.00000 150.346
4.00000 151.456
6.00000 152.796
8.00000 154.142
12.00000 154.367
16.00000 154.592
24.00000 154.682
32.00000 154.687
48.00000 155.136
64.00000 154.865
96.00000 154.849
128.00000 154.901

"stride=256
0.00049 3.022
0.00098 3.021
0.00195 3.022
0.00293 3.023
0.00391 3.022
0.00586 3.023
0.00781 3.022
0.01172 3.022
0.01562 3.022
0.02344 3.115
0.03125 3.531
0.04688 8.608
0.06250 9.840
0.09375 10.806
0.12500 11.428
0.18750 14.492
0.25000 38.982
0.37500 93.437
0.50000 110.962
0.75000 135.646
1.00000 141.016
1.50000 146.418
2.00000 149.206
3.00000 151.950
4.00000 152.986
6.00000 154.784
8.00000 155.635
12.00000 156.180
16.00000 156.308
24.00000 156.626
32.00000 157.676
48.00000 156.950
64.00000 156.799
96.00000 157.145
128.00000 157.100

"stride=512
0.00049 3.022
0.00098 3.022
0.00195 3.022
0.00293 3.021
0.00391 3.022
0.00586 3.022
0.00781 3.022
0.01172 3.022
0.01562 3.022
0.02344 3.059
0.03125 3.639
0.04688 8.579
0.06250 9.688
0.09375 14.566
0.12500 19.360
0.18750 87.834
0.25000 110.752
0.37500 125.511
0.50000 114.009
0.75000 139.488
1.00000 145.065
1.50000 151.047
2.00000 153.741
3.00000 156.581
4.00000 157.952
6.00000 159.184
8.00000 159.672
12.00000 160.436
16.00000 160.658
24.00000 161.163
32.00000 161.411
48.00000 161.482
64.00000 161.353
96.00000 161.726
128.00000 161.926

"stride=1024
0.00098 3.022
0.00195 3.021
0.00293 3.022
0.00391 3.022
0.00586 3.021
0.00781 3.022
0.01172 3.022
0.01562 3.022
0.02344 3.027
0.03125 3.762
0.04688 8.456
0.06250 9.435
0.09375 10.802
0.12500 11.340
0.18750 21.942
0.25000 40.056
0.37500 101.295
0.50000 126.371
0.75000 145.971
1.00000 153.056
1.50000 158.879
2.00000 161.826
3.00000 165.094
4.00000 165.754
6.00000 167.220
8.00000 167.657
12.00000 168.130
16.00000 168.447
24.00000 168.984
32.00000 168.951
48.00000 169.270
64.00000 169.277
96.00000 169.849
128.00000 170.063


Random load latency
"stride=16
0.00049 3.022
0.00098 3.022
0.00195 3.002
0.00293 3.002
0.00391 2.995
0.00586 2.992
0.00781 2.989
0.01172 2.987
0.01562 2.997
0.02344 2.995
0.03125 5.476
0.04688 8.672
0.06250 11.274
0.09375 12.552
0.12500 13.084
0.18750 61.181
0.25000 93.056
0.37500 150.356
0.50000 178.082
0.75000 201.107
1.00000 213.613
1.50000 224.577
2.00000 239.510
3.00000 248.241
4.00000 265.191
6.00000 274.574
8.00000 292.453
12.00000 303.770
16.00000 310.875
24.00000 311.422
32.00000 312.976
48.00000 312.579
64.00000 310.934
96.00000 307.141
128.00000 304.943
More Preliminary Conclusions

The defconfig and our config are extremely close in performance. In some tests our config is significantly faster. In certain other tests, such as memory reads, ours is very slightly slower. But given the nature of benchmarks, any small variation could merely be statistical noise. The two are too close to call in most tests.
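
To put a number on "too close to call", here is a small Python sketch that computes the percentage change in STREAM bandwidth between two runs from this thread: the run listed directly above the first "Preliminary Conclusions" and the #30 defconfig-plus-video run. The dictionary and function names are made up for illustration. With only one run per config, this shows the size of the gap, not whether it is statistically meaningful.

```python
# STREAM bandwidth in MB/sec, copied from two runs in this thread.
# run_before: the run listed directly above the first "Preliminary Conclusions".
# run_defconfig_video: the 3.0.36+ #30 run (defconfig plus built-in video).
run_before = {"copy": 480.65, "scale": 379.40, "add": 493.72, "triad": 622.99}
run_defconfig_video = {"copy": 480.20, "scale": 380.58, "add": 493.98, "triad": 619.38}


def percent_change(old, new):
    """Signed change from old to new, as a percentage of old."""
    return (new - old) / old * 100.0


changes = {
    name: percent_change(run_before[name], run_defconfig_video[name])
    for name in run_before
}

for name, change in changes.items():
    print(f"STREAM {name:5s} {change:+.2f}%")
```

All four differences come out below 1% in either direction, which is consistent with treating them as noise until the runs are repeated.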

I will continue testing the kernels by trying to compile lmbench with compiler flags for the later ARM platforms such as armv7. I will also try compiling our kernel again with some minor config changes, and cross-compiling it with the later ARM platform compiler flags such as armv7 and armhf. Finally, once our kernel config is finalized, I will compile it natively on the A10 to see if that helps performance at all.

I tried compiling our kernel again using some minor config changes. The changes made little or no difference in performance. Any of the small deviations could have been the result of statistical variation. So the conclusion is that most kernel config changes make little to no difference in lmbench results. Certain kernel config changes, however, can make a large difference in the performance of application programs. That will need to be investigated further using application benchmarks.

My next test is to see whether changes in the cross-compiler flags make a difference. It is possible to tune the compiler to the exact platform architecture using compiler flags. Although other tests posted online have shown that using different compilers to build the kernel has little to no effect, I want to see for myself whether the compiler flags have any effect on the kernel. So for this build I used the following compiler flags:
export CCFLAGS="-O3 -march=armv7-a -mtune=cortex-a8 -mfpu=neon -ftree-vectorize -ffast-math -mfloat-abi=hard"

That will tune the compiler to use all the features of the A10 to maximum advantage. This would have a big effect on userspace programs. That is why Debian armhf uses similar compiler flags. But do the flags have an effect on the kernel?
****************************************************************************************************************************************
[lmbench3.0 results for Linux T-01 3.0.36+ #33 PREEMPT Fri Jul 20 16:11:13 IST 2012 armv7l GNU/Linux]
[LMBENCH_VER: 3.0-a9]
[BENCHMARK_HARDWARE: YES]
[BENCHMARK_OS: YES]
[ALL: 512 1k 2k 4k 8k 16k 32k 64k 128k 256k 512k 1m 2m 4m 8m 16m 32m 64m 128m]
[DISKS: /dev/mmcblk0 ]
[DISK_DESC: [/dev/mmcblk0:Class 10 SD card] ]
[ENOUGH: 100000]
[FAST: ]
[FASTMEM: NO]
[FILE: /var/tmp/XXX]
[FSDIR: /var/tmp]
[HALF: 512 1k 2k 4k 8k 16k 32k 64k 128k 256k 512k 1m 2m 4m 8m 16m 32m 64m]
[INFO: INFO.T-01]
[LINE_SIZE: 64]
[LOOP_O: 0.00000000]
[MB: 180]
[MHZ: 1006 MHz, 0.9940 nanosec clock]
[MOTHERBOARD: ]
[NETWORKS: ]
[PROCESSORS: 0]
[REMOTE: ]
[SLOWFS: NO]
[OS: armv7l-linux-gnu]
[SYNC_MAX: 1]
[LMBENCH_SCHED: DEFAULT]
[TIMING_O: 0]
[LMBENCH VERSION: 3.0-a9]
[SYSNAME: Linux]
[PROCESSOR: unknown]
[MACHINE: armv7l]
[RELEASE: 3.0.36+]
[VERSION: #33 PREEMPT Fri Jul 20 16:11:13 IST 2012]
Simple syscall: 0.3455 microseconds
Simple read: 0.6717 microseconds
Simple write: 0.8451 microseconds
Simple stat: 3.6152 microseconds
Simple fstat: 1.2281 microseconds
Simple open/close: 6.7987 microseconds
Select on 10 fd's: 1.3290 microseconds
Select on 100 fd's: 8.2676 microseconds
Select on 250 fd's: 19.5433 microseconds
Select on 500 fd's: 38.9488 microseconds
Select on 10 tcp fd's: 1.5399 microseconds
Select on 100 tcp fd's: 14.9057 microseconds
Select on 250 tcp fd's: 37.0342 microseconds
Select on 500 tcp fd's: 75.0782 microseconds
Signal handler installation: 0.8740 microseconds
Signal handler overhead: 3.9910 microseconds
Protection fault: 0.3663 microseconds
Pipe latency: 17.7345 microseconds
AF_UNIX sock stream latency: 33.2064 microseconds
Process fork+exit: 713.5449 microseconds
Process fork+execve: 2482.7381 microseconds
Process fork+/bin/sh -c: 6353.4211 microseconds
integer bit: 1.76 nanoseconds
integer add: 1.75 nanoseconds
integer mul: 10.48 nanoseconds
integer div: 126.83 nanoseconds
integer mod: 40.22 nanoseconds
int64 bit: 1.01 nanoseconds
uint64 add: 1.57 nanoseconds
int64 mul: 12.85 nanoseconds
int64 div: 276.37 nanoseconds
int64 mod: 189.12 nanoseconds
float add: 9.56 nanoseconds
float mul: 11.62 nanoseconds
float div: 32.95 nanoseconds
double add: 8.86 nanoseconds
double mul: 10.94 nanoseconds
double div: 56.74 nanoseconds
float bogomflops: 104.57 nanoseconds
double bogomflops: 133.46 nanoseconds
integer bit parallelism: 1.49
integer add parallelism: 1.71
integer mul parallelism: 2.90
integer div parallelism: 1.19
integer mod parallelism: 1.09
int64 bit parallelism: 1.00
int64 add parallelism: 1.00
int64 mul parallelism: 1.00
int64 div parallelism: 1.03
int64 mod parallelism: 1.00
float add parallelism: 1.00
float mul parallelism: 1.00
float div parallelism: 1.00
double add parallelism: 1.00
double mul parallelism: 1.00
double div parallelism: 1.00
File /var/tmp/XXX write bandwidth: 3162 KB/sec
Pagefaults on /var/tmp/XXX: 4.2778 microseconds

"mappings
0.524288 30
1.048576 45
2.097152 72
4.194304 131
8.388608 249
16.777216 496
33.554432 974
67.108864 2084
134.217728 4770

"File system latency
0k 1905 17161 24203
1k 1091 5889 17665
4k 1036 9634 12816
10k 772 6399 784

"Seek times for /dev/mmcblk0
2144.3 2.80
. . .
2.3 5.59

"Zone bandwidth for /dev/mmcblk0
1.1 7.23
. . .
2136.0 12.47

Cannot register service: RPC: Unable to receive; errno = Connection refused
unable to register (XACT_PROG, XACT_VERS, udp).
UDP latency using localhost: 1045.6234 microseconds
TCP latency using localhost: 70.3928 microseconds
localhost: RPC: Port mapper failure - RPC: Unable to receive
localhost: RPC: Remote system error - Connection refused
(null): RPC: Port mapper failure - RPC: Unable to receive
TCP/IP connection cost to localhost: 224.7416 microseconds

Socket bandwidth using localhost
0.000001 0.24 MB/sec
0.000064 11.33 MB/sec
0.000128 14.36 MB/sec
0.000256 27.27 MB/sec
0.000512 44.39 MB/sec
0.001024 75.08 MB/sec
0.001437 82.71 MB/sec
10.000000 7.93 MB/sec

Avg xfer: 3.2KB, 41.8KB in 168 millisecs, 248.01 KB/sec
AF_UNIX sock stream bandwidth: 25.62 MB/sec
Pipe bandwidth: 324.01 MB/sec

"read bandwidth
0.000512 202.50
0.001024 345.54
0.002048 389.89
0.004096 535.39
0.008192 824.30
0.016384 832.16
0.032768 726.05
0.065536 494.86
0.131072 600.60
0.262144 350.85
0.524288 276.84
1.05 262.84
2.10 255.51
4.19 251.13
8.39 261.96
16.78 253.45
33.55 189.44
67.11 261.91
134.22 247.37

"read open2close bandwidth
0.000512 49.13
0.001024 93.83
0.002048 171.63
0.004096 298.06
0.008192 455.78
0.016384 577.59
0.032768 609.71
0.065536 588.10
0.131072 423.36
0.262144 319.63
0.524288 269.60
1.05 261.84
2.10 256.13
4.19 255.52
8.39 260.81
16.78 260.91
33.55 260.71
67.11 264.76
134.22 257.71


"Mmap read bandwidth
0.000512 1805.54
0.001024 1890.60
0.002048 1937.36
0.004096 1947.73
0.008192 1965.96
0.016384 1976.66
0.032768 1969.67
0.065536 1633.04
0.131072 1531.61
0.262144 892.66
0.524288 464.23
1.05 396.16
2.10 375.30
4.19 366.67
8.39 366.16
16.78 366.26
33.55 366.33
67.11 366.22
134.22 358.04

"Mmap read open2close bandwidth
0.000512 19.78
0.001024 38.79
0.002048 75.85
0.004096 143.57
0.008192 233.14
0.016384 348.85
0.032768 463.27
0.065536 530.90
0.131072 549.16
0.262144 372.08
0.524288 275.48
1.05 261.13
2.10 260.32
4.19 261.56
8.39 263.77
16.78 268.02
33.55 270.43
67.11 263.15
134.22 267.17


"libc bcopy unaligned
0.000512 1196.84
0.001024 1216.30
0.002048 1216.40
0.004096 1216.85
0.008192 1199.58
0.016384 1212.18
0.032768 1173.40
0.065536 1156.49
0.131072 1164.21
0.262144 672.87
0.524288 289.80
1.05 253.14
2.10 242.72
4.19 240.90
8.39 237.76
16.78 237.50
33.55 239.40
67.11 231.57

"libc bcopy aligned
0.000512 1203.74
0.001024 1213.76
0.002048 1214.98
0.004096 1209.00
0.008192 1205.62
0.016384 1175.44
0.032768 1210.89
0.065536 1142.12
0.131072 1168.59
0.262144 750.03
0.524288 291.41
1.05 253.05
2.10 243.51
4.19 240.87
8.39 237.10
16.78 239.17
33.55 233.72
67.11 240.18

Memory bzero bandwidth
0.000512 1216.38
0.001024 1217.43
0.002048 1070.82
0.004096 1060.62
0.008192 1109.60
0.016384 1099.12
0.032768 1145.46
0.065536 1166.65
0.131072 1165.77
0.262144 1169.09
0.524288 1168.39
1.05 1158.47
2.10 1172.89
4.19 1178.83
8.39 1178.63
16.78 1177.56
33.55 1175.94
67.11 1171.04
134.22 1168.08

"unrolled bcopy unaligned
0.000512 1211.68
0.001024 1215.32
0.002048 1210.99
0.004096 1204.68
0.008192 1203.35
0.016384 1146.39
0.032768 1213.12
0.065536 1159.64
0.131072 1163.36
0.262144 657.56
0.524288 295.20
1.05 247.14
2.10 235.96
4.19 240.09
8.39 238.64
16.78 239.60
33.55 235.87
67.11 233.94

"unrolled partial bcopy unaligned
0.000512 1204.72
0.001024 1204.53
0.002048 1204.52
0.004096 1207.34
0.008192 1215.30
0.016384 1221.25
0.032768 1149.01
0.065536 1173.94
0.131072 1169.53
0.262144 779.48
0.524288 393.66
1.05 365.77
2.10 374.06
4.19 379.28
8.39 375.32
16.78 375.28
33.55 359.04
67.11 382.54

Memory read bandwidth
0.000512 1955.40
0.001024 1974.57
0.002048 1982.59
0.004096 1989.18
0.008192 1984.25
0.016384 1986.94
0.032768 1909.24
0.065536 1634.56
0.131072 1551.71
0.262144 981.88
0.524288 457.35
1.05 399.11
2.10 377.34
4.19 369.03
8.39 367.01
16.78 365.81
33.55 366.35
67.11 366.74
134.22 358.78

Memory partial read bandwidth
0.000512 7244.76
0.001024 7503.75
0.002048 7647.49
0.004096 7720.30
0.008192 7657.18
0.016384 7724.08
0.032768 7583.33
0.065536 4012.54
0.131072 3506.34
0.262144 1357.95
0.524288 532.15
1.05 442.29
2.10 418.24
4.19 407.45
8.39 402.52
16.78 401.69
33.55 402.45
67.11 401.38
134.22 392.77

Memory write bandwidth
0.000512 1213.69
0.001024 1215.65
0.002048 1215.79
0.004096 1219.23
0.008192 1215.82
0.016384 1207.49
0.032768 1207.86
0.065536 1173.11
0.131072 1184.81
0.262144 1177.92
0.524288 1163.20
1.05 1161.46
2.10 1168.94
4.19 1179.56
8.39 1177.78
16.78 1176.94
33.55 1173.12
67.11 1170.47
134.22 1168.20

Memory partial write bandwidth
0.000512 1212.10
0.001024 1074.10
0.002048 1211.15
0.004096 1208.47
0.008192 1212.13
0.016384 1135.44
0.032768 1185.26
0.065536 1141.40
0.131072 1186.43
0.262144 1175.30
0.524288 1179.94
1.05 1151.08
2.10 1176.03
4.19 1186.45
8.39 1176.38
16.78 1178.27
33.55 1177.92
67.11 1153.64
134.22 1169.80

Memory partial read/write bandwidth
0.000512 4992.99
0.001024 5113.29
0.002048 5183.86
0.004096 5217.33
0.008192 5187.03
0.016384 5216.85
0.032768 5208.85
0.065536 3392.40
0.131072 2933.33
0.262144 1159.94
0.524288 387.08
1.05 320.90
2.10 314.00
4.19 313.39
8.39 312.58
16.78 310.76
33.55 312.94
67.11 307.44
134.22 308.75



"size=0k ovr=3.77
2 5.67
4 6.90
8 7.23
16 8.49
24 9.71
32 10.37
64 12.87
96 14.16

"size=4k ovr=5.78
2 5.96
4 7.42
8 8.67
16 10.67
24 14.49
32 16.22
64 23.68
96 25.99

"size=8k ovr=7.83
2 5.77
4 8.32
8 11.80
16 16.04
24 23.38
32 28.24
64 36.19
96 37.69

"size=16k ovr=12.25
2 7.35
4 10.06
8 15.14
16 34.13
24 47.24
32 52.60
64 57.04
96 57.57

"size=32k ovr=22.42
2 9.56
4 17.43
8 50.47
16 84.01
24 91.61
32 93.85
64 94.62
96 94.96

"size=64k ovr=47.30
2 9.30
4 71.79
8 139.41
16 160.71
24 162.97
32 164.64
64 164.37
96 165.52

tlb: 32 pages

Memory load parallelism
0.001024 3.15
0.002048 3.12
0.004096 3.00
0.008192 3.00
0.016384 3.00
0.032768 3.83
0.065536 1.05
0.131072 1.07
0.262144 1.00
0.524288 1.02
1.048576 1.00
2.097152 1.00
4.194304 1.00
8.388608 1.00
16.777216 1.00
33.554432 1.00
67.108864 1.00
134.217728 1.00

STREAM copy latency: 33.36 nanoseconds
STREAM copy bandwidth: 479.67 MB/sec
STREAM scale latency: 42.23 nanoseconds
STREAM scale bandwidth: 378.87 MB/sec
STREAM add latency: 48.89 nanoseconds
STREAM add bandwidth: 490.93 MB/sec
STREAM triad latency: 39.17 nanoseconds
STREAM triad bandwidth: 612.64 MB/sec
STREAM2 fill latency: 6.83 nanoseconds
STREAM2 fill bandwidth: 1171.10 MB/sec
STREAM2 copy latency: 33.29 nanoseconds
STREAM2 copy bandwidth: 480.67 MB/sec
STREAM2 daxpy latency: 67.11 nanoseconds
STREAM2 daxpy bandwidth: 357.64 MB/sec
STREAM2 sum latency: 16.96 nanoseconds
STREAM2 sum bandwidth: 471.65 MB/sec

Memory load latency
"stride=16
0.00049 3.023
0.00098 3.077
0.00195 3.006
0.00293 3.003
0.00391 2.998
0.00586 2.993
0.00781 2.990
0.01172 3.141
0.01562 3.001
0.02344 3.029
0.03125 3.220
0.04688 5.840
0.06250 4.732
0.09375 4.983
0.12500 5.100
0.18750 8.657
0.25000 11.238
0.37500 25.618
0.50000 31.791
0.75000 36.285
1.00000 37.832
1.50000 39.481
2.00000 40.143
3.00000 40.815
4.00000 41.173
6.00000 41.469
8.00000 41.717
12.00000 41.710
16.00000 41.696
24.00000 41.698
32.00000 41.715
48.00000 41.708
64.00000 41.652
96.00000 41.720
128.00000 41.682

"stride=32
0.00049 3.023
0.00098 3.023
0.00195 3.023
0.00293 4.234
0.00391 3.005
0.00586 3.004
0.00781 2.997
0.01172 2.994
0.01562 2.990
0.02344 2.990
0.03125 3.016
0.04688 7.849
0.06250 9.073
0.09375 10.173
0.12500 10.860
0.18750 16.719
0.25000 28.212
0.37500 49.149
0.50000 58.877
0.75000 66.541
1.00000 71.555
1.50000 76.153
2.00000 75.398
3.00000 76.966
4.00000 77.332
6.00000 78.165
8.00000 78.561
12.00000 78.603
16.00000 78.549
24.00000 78.554
32.00000 78.494
48.00000 78.508
64.00000 78.477
96.00000 78.515
128.00000 78.506

"stride=64
0.00049 3.024
0.00098 3.023
0.00195 3.024
0.00293 3.024
0.00391 3.024
0.00586 3.024
0.00781 3.004
0.01172 3.005
0.01562 2.997
0.02344 3.083
0.03125 3.896
0.04688 8.613
0.06250 9.913
0.09375 10.932
0.12500 11.412
0.18750 12.854
0.25000 31.669
0.37500 89.753
0.50000 109.841
0.75000 130.508
1.00000 139.487
1.50000 145.218
2.00000 148.119
3.00000 150.772
4.00000 152.256
6.00000 153.351
8.00000 154.041
12.00000 154.380
16.00000 154.402
24.00000 154.460
32.00000 154.396
48.00000 154.263
64.00000 154.189
96.00000 154.251
128.00000 154.282

"stride=128
0.00049 3.023
0.00098 3.023
0.00195 3.023
0.00293 3.024
0.00391 3.024
0.00586 3.024
0.00781 3.023
0.01172 3.023
0.01562 3.004
0.02344 3.102
0.03125 3.609
0.04688 8.709
0.06250 9.874
0.09375 11.006
0.12500 11.515
0.18750 20.423
0.25000 38.799
0.37500 92.073
0.50000 111.710
0.75000 127.257
1.00000 140.377
1.50000 145.359
2.00000 147.776
3.00000 150.716
4.00000 151.851
6.00000 153.022
8.00000 153.931
12.00000 157.461
16.00000 155.227
24.00000 155.006
32.00000 155.018
48.00000 154.992
64.00000 155.051
96.00000 155.020
128.00000 155.013

"stride=256
0.00049 3.024
0.00098 3.023
0.00195 3.023
0.00293 3.023
0.00391 3.023
0.00586 3.023
0.00781 3.024
0.01172 3.024
0.01562 3.024
0.02344 3.185
0.03125 3.790
0.04688 8.817
0.06250 9.807
0.09375 10.952
0.12500 11.485
0.18750 14.425
0.25000 39.713
0.37500 96.351
0.50000 115.494
0.75000 131.771
1.00000 141.686
1.50000 146.749
2.00000 149.175
3.00000 152.067
4.00000 153.571
6.00000 154.638
8.00000 155.631
12.00000 156.241
16.00000 156.674
24.00000 156.882
32.00000 157.014
48.00000 156.936
64.00000 157.061
96.00000 157.044
128.00000 157.222

"stride=512
0.00049 3.023
0.00098 3.023
0.00195 3.023
0.00293 3.023
0.00391 3.023
0.00586 3.023
0.00781 3.023
0.01172 3.023
0.01562 3.023
0.02344 3.040
0.03125 3.833
0.04688 8.518
0.06250 9.533
0.09375 10.874
0.12500 11.308
0.18750 20.463
0.25000 46.166
0.37500 92.554
0.50000 115.098
0.75000 137.423
1.00000 145.360
1.50000 150.916
2.00000 153.656
3.00000 156.509
4.00000 158.198
6.00000 159.470
8.00000 160.225
12.00000 160.958
16.00000 161.349
24.00000 161.311
32.00000 161.590
48.00000 161.699
64.00000 161.786
96.00000 161.821
128.00000 162.245

"stride=1024
0.00098 3.024
0.00195 3.024
0.00293 3.024
0.00391 3.023
0.00586 3.024
0.00781 3.023
0.01172 3.024
0.01562 3.024
0.02344 3.027
0.03125 3.570
0.04688 8.413
0.06250 9.628
0.09375 10.757
0.12500 11.409
0.18750 29.754
0.25000 57.441
0.37500 95.616
0.50000 120.850
0.75000 143.006
1.00000 153.800
1.50000 158.921
2.00000 161.779
3.00000 164.350
4.00000 165.988
6.00000 167.088
8.00000 167.994
12.00000 168.603
16.00000 169.146
24.00000 175.540
32.00000 175.818
48.00000 175.986
64.00000 169.578
96.00000 169.971
128.00000 170.420


Random load latency
"stride=16
0.00049 3.024
0.00098 3.023
0.00195 3.004
0.00293 3.003
0.00391 2.997
0.00586 2.994
0.00781 2.990
0.01172 2.989
0.01562 2.999
0.02344 6.274
0.03125 8.590
0.04688 8.687
0.06250 11.498
0.09375 12.659
0.12500 13.105
0.18750 55.162
0.25000 92.858
0.37500 146.210
0.50000 172.587
0.75000 201.251
1.00000 215.998
1.50000 226.085
2.00000 235.885
3.00000 257.750
4.00000 269.470
6.00000 283.295
8.00000 296.239
12.00000 305.602
16.00000 311.986
24.00000 322.051
32.00000 321.009
48.00000 319.164
64.00000 320.025
96.00000 318.796
128.00000 314.291
Conclusions:

The benchmarks for the kernel built with the custom cross-compiler flags are all very slightly better. The differences are so small that they could be statistical noise. But there are obviously no drawbacks to using the flags, and there may be some advantages. Since it takes no effort to use the additional compiler flags, their use is highly recommended.
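
For anyone who wants to compare these dumps without eyeballing them, the labelled one-line results (the STREAM lines, for example) are easy to pull out with a script. Here is a minimal Python sketch. It is not part of lmbench; the name parse_scalars and the regular expression are assumptions based only on the line format seen in this thread, so other lmbench output may need adjustments.

```python
import re

# Assumed format: "<name>: <number> <unit>", e.g.
# "STREAM copy bandwidth: 479.67 MB/sec". Table rows such as
# "0.000512 1213.69" and bracketed header lines are skipped.
LINE_RE = re.compile(
    r'^(?P<name>[^:\[\]"]+):\s+(?P<value>\d+(?:\.\d+)?)\s+(?P<unit>\S.*)$'
)


def parse_scalars(text):
    """Return {name: (value, unit)} for every 'name: value unit' line."""
    results = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            name = match.group("name").strip()
            results[name] = (float(match.group("value")),
                             match.group("unit").strip())
    return results


# A few lines copied from the dump above, mixed with a table row.
sample = """\
STREAM copy latency: 33.36 nanoseconds
STREAM copy bandwidth: 479.67 MB/sec
0.000512 1213.69
tlb: 32 pages
"""

parsed = parse_scalars(sample)
for name, (value, unit) in parsed.items():
    print(f"{name}: {value} {unit}")
```

Running the same function over two saved result files gives two dictionaries that can be compared key by key. Lines without a unit (the parallelism figures) are not matched by this pattern.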
The next test is to compile the kernel natively on the A10. I am using a Mele A2000 to compile the kernel. The kernel itself is done now. Surprisingly, it is now only 1.92 MB! That is almost 128 kB smaller with no other changes! Interesting.

Edit: BTW the compiler flags used (-O3 in particular) were set to optimize for performance, not size.



Edited 1 time(s). Last edit at 07/20/2012 02:15PM by gnexus.
Here are the benchmarks for the natively compiled kernel. But first it would be best to show the GCC version:
# gcc --version
gcc (Debian 4.6.3-8) 4.6.3
# gcc -dumpmachine
arm-linux-gnueabihf

gcc was tuned to optimize for performance rather than size, and to use ARM NEON wherever possible, with the following compiler flags:
export CFLAGS="-O3 -mfpu=neon -ftree-vectorize -ffast-math"

Now let us see if any of the above had any effect.
**************************************************************************************************************************************
[lmbench3.0 results for Linux T-01 3.0.36+ #34 PREEMPT Fri Jul 20 20:03:23 IST 2012 armv7l GNU/Linux]
[LMBENCH_VER: 3.0-a9]
[BENCHMARK_HARDWARE: YES]
[BENCHMARK_OS: YES]
[ALL: 512 1k 2k 4k 8k 16k 32k 64k 128k 256k 512k 1m 2m 4m 8m 16m 32m 64m 128m]
[DISKS: /dev/mmcblk0 ]
[DISK_DESC: [/dev/mmcblk0:Class 10 SD card] ]
[ENOUGH: 100000]
[FAST: ]
[FASTMEM: NO]
[FILE: /var/tmp/XXX]
[FSDIR: /var/tmp]
[HALF: 512 1k 2k 4k 8k 16k 32k 64k 128k 256k 512k 1m 2m 4m 8m 16m 32m 64m]
[INFO: INFO.T-01]
[LINE_SIZE: 64]
[LOOP_O: 0.00000000]
[MB: 180]
[MHZ: 1006 MHz, 0.9940 nanosec clock]
[MOTHERBOARD: ]
[NETWORKS: ]
[PROCESSORS: 0]
[REMOTE: ]
[SLOWFS: NO]
[OS: armv7l-linux-gnu]
[SYNC_MAX: 1]
[LMBENCH_SCHED: DEFAULT]
[TIMING_O: 0]
[LMBENCH VERSION: 3.0-a9]
[SYSNAME: Linux]
[PROCESSOR: unknown]
[MACHINE: armv7l]
[RELEASE: 3.0.36+]
[VERSION: #34 PREEMPT Fri Jul 20 20:03:23 IST 2012]
Simple syscall: 0.3457 microseconds
Simple read: 0.7548 microseconds
Simple write: 0.7170 microseconds
Simple stat: 3.5229 microseconds
Simple fstat: 1.2631 microseconds
Simple open/close: 6.6309 microseconds
Select on 10 fd's: 1.5013 microseconds
Select on 100 fd's: 8.3810 microseconds
Select on 250 fd's: 33.8798 microseconds
Select on 500 fd's: 39.1249 microseconds
Select on 10 tcp fd's: 1.5918 microseconds
Select on 100 tcp fd's: 14.9666 microseconds
Select on 250 tcp fd's: 37.1510 microseconds
Select on 500 tcp fd's: 74.8490 microseconds
Signal handler installation: 0.8359 microseconds
Signal handler overhead: 4.1553 microseconds
Protection fault: 0.3981 microseconds
Pipe latency: 17.4849 microseconds
AF_UNIX sock stream latency: 32.7206 microseconds
Process fork+exit: 709.2065 microseconds
Process fork+execve: 2393.1429 microseconds
Process fork+/bin/sh -c: 5671.7000 microseconds
integer bit: 1.00 nanoseconds
integer add: 1.16 nanoseconds
integer mul: 7.88 nanoseconds
integer div: 94.73 nanoseconds
integer mod: 22.90 nanoseconds
int64 bit: 1.12 nanoseconds
uint64 add: 1.57 nanoseconds
int64 mul: 12.83 nanoseconds
int64 div: 276.50 nanoseconds
int64 mod: 189.20 nanoseconds
float add: 8.86 nanoseconds
float mul: 9.95 nanoseconds
float div: 32.84 nanoseconds
double add: 8.87 nanoseconds
double mul: 10.94 nanoseconds
double div: 56.69 nanoseconds
float bogomflops: 104.62 nanoseconds
double bogomflops: 133.60 nanoseconds
integer bit parallelism: 1.49
integer add parallelism: 1.71
integer mul parallelism: 2.91
integer div parallelism: 1.19
integer mod parallelism: 1.00
int64 bit parallelism: 1.00
int64 add parallelism: 1.00
int64 mul parallelism: 1.00
int64 div parallelism: 1.03
int64 mod parallelism: 1.00
float add parallelism: 1.00
float mul parallelism: 1.00
float div parallelism: 1.00
double add parallelism: 1.00
double mul parallelism: 1.00
double div parallelism: 1.00
File /var/tmp/XXX write bandwidth: 3257 KB/sec
Pagefaults on /var/tmp/XXX: 4.2432 microseconds

"mappings
0.524288 31
1.048576 46
2.097152 73
4.194304 134
8.388608 258
16.777216 509
33.554432 999
67.108864 2126
134.217728 5189

"File system latency
0k 1590 16338 21362
1k 256 2335 14480
4k 857 7844 19646
10k 419 4317 9266

"Seek times for /dev/mmcblk0
2144.3 1.35
. . .
2.3 4.95

"Zone bandwidth for /dev/mmcblk0
1.1 9.01
. . .
2136.0 5.83

Cannot register service: RPC: Unable to receive; errno = Connection refused
unable to register (XACT_PROG, XACT_VERS, udp).
UDP latency using localhost: 1057.4986 microseconds
TCP latency using localhost: 69.4050 microseconds
localhost: RPC: Port mapper failure - RPC: Unable to receive
localhost: RPC: Remote system error - Connection refused
(null): RPC: Port mapper failure - RPC: Unable to receive
TCP/IP connection cost to localhost: 216.5770 microseconds

Socket bandwidth using localhost
0.000001 0.26 MB/sec
0.000064 11.97 MB/sec
0.000128 16.58 MB/sec
0.000256 33.04 MB/sec
0.000512 38.38 MB/sec
0.001024 73.12 MB/sec
0.001437 85.37 MB/sec
10.000000 8.67 MB/sec

Avg xfer: 3.2KB, 41.8KB in 155 millisecs, 270.14 KB/sec
AF_UNIX sock stream bandwidth: 304.03 MB/sec
Pipe bandwidth: 369.56 MB/sec

"read bandwidth
0.000512 197.11
0.001024 337.27
0.002048 524.99
0.004096 728.58
0.008192 831.03
0.016384 830.50
0.032768 696.88
0.065536 712.11
0.131072 567.90
0.262144 359.47
0.524288 255.29
1.05 236.61
2.10 253.18
4.19 259.37
8.39 264.73
16.78 252.64
33.55 173.34
67.11 140.34
134.22 259.03

"read open2close bandwidth
0.000512 47.86
0.001024 96.15
0.002048 176.93
0.004096 293.12
0.008192 448.53
0.016384 554.31
0.032768 601.54
0.065536 614.26
0.131072 539.53
0.262144 324.47
0.524288 265.47
1.05 248.67
2.10 255.40
4.19 249.86
8.39 261.68
16.78 260.85
33.55 260.61
67.11 262.14
134.22 255.56


"Mmap read bandwidth
0.000512 1803.92
0.001024 1890.47
0.002048 1937.03
0.004096 1937.86
0.008192 1870.93
0.016384 1969.23
0.032768 1971.38
0.065536 1632.46
0.131072 1545.97
0.262144 882.83
0.524288 456.77
1.05 394.98
2.10 374.28
4.19 367.37
8.39 366.15
16.78 366.47
33.55 366.55
67.11 366.10
134.22 358.52

"Mmap read open2close bandwidth
0.000512 19.85
0.001024 38.86
0.002048 75.95
0.004096 143.57
0.008192 234.92
0.016384 351.22
0.032768 463.85
0.065536 555.66
0.131072 561.20
0.262144 382.24
0.524288 273.81
1.05 260.74
2.10 260.84
4.19 262.32
8.39 265.62
16.78 265.59
33.55 270.20
67.11 269.75
134.22 267.69


"libc bcopy unaligned
0.000512 1208.47
0.001024 1216.00
0.002048 1216.48
0.004096 1219.17
0.008192 1108.78
0.016384 1110.68
0.032768 1209.37
0.065536 1152.98
0.131072 1169.95
0.262144 644.89
0.524288 285.18
1.05 251.54
2.10 243.60
4.19 210.29
8.39 234.95
16.78 239.38
33.55 238.89
67.11 234.20

"libc bcopy aligned
0.000512 1198.72
0.001024 1215.73
0.002048 1215.87
0.004096 1119.55
0.008192 1216.79
0.016384 1164.08
0.032768 1148.94
0.065536 1165.36
0.131072 1161.44
0.262144 657.27
0.524288 282.26
1.05 241.80
2.10 236.90
4.19 239.80
8.39 239.33
16.78 239.19
33.55 238.70
67.11 239.50

Memory bzero bandwidth
0.000512 1215.30
0.001024 1210.84
0.002048 1213.19
0.004096 1214.38
0.008192 1215.56
0.016384 1141.30
0.032768 1140.15
0.065536 1148.29
0.131072 1158.16
0.262144 1170.84
0.524288 1169.85
1.05 1154.06
2.10 1179.47
4.19 1191.27
8.39 1191.77
16.78 1184.21
33.55 1174.70
67.11 1173.41
134.22 1171.80

"unrolled bcopy unaligned
0.000512 1212.09
0.001024 1217.96
0.002048 1214.94
0.004096 1211.18
0.008192 1210.19
0.016384 1139.76
0.032768 1003.01
0.065536 1167.30
0.131072 1127.68
0.262144 518.76
0.524288 290.92
1.05 250.65
2.10 243.55
4.19 239.58
8.39 239.22
16.78 233.47
33.55 237.50
67.11 235.66

"unrolled partial bcopy unaligned
0.000512 1218.29
0.001024 1217.50
0.002048 1077.41
0.004096 1218.37
0.008192 1215.20
0.016384 1212.73
0.032768 1188.06
0.065536 1172.12
0.131072 1190.20
0.262144 807.33
0.524288 422.61
1.05 375.06
2.10 371.34
4.19 379.00
8.39 377.88
16.78 383.81
33.55 383.17
67.11 385.89

Memory read bandwidth
0.000512 1954.61
0.001024 1975.56
0.002048 1983.42
0.004096 1972.06
0.008192 1982.08
0.016384 1986.42
0.032768 1918.76
0.065536 1627.34
0.131072 1551.74
0.262144 1001.62
0.524288 459.11
1.05 397.72
2.10 379.05
4.19 368.42
8.39 367.57
16.78 366.80
33.55 364.89
67.11 366.76
134.22 358.99

Memory partial read bandwidth
0.000512 7245.20
0.001024 7501.25
0.002048 7644.37
0.004096 7714.82
0.008192 7661.73
0.016384 7723.02
0.032768 7613.14
0.065536 3995.32
0.131072 3500.18
0.262144 1579.60
0.524288 533.63
1.05 441.12
2.10 419.25
4.19 406.44
8.39 402.79
16.78 402.57
33.55 401.65
67.11 402.32
134.22 392.76

Memory write bandwidth
0.000512 1204.49
0.001024 1203.88
0.002048 1201.26
0.004096 1202.22
0.008192 1201.89
0.016384 1209.38
0.032768 1183.91
0.065536 1159.31
0.131072 1177.25
0.262144 1174.08
0.524288 1175.29
1.05 1166.35
2.10 1181.79
4.19 1175.04
8.39 1181.22
16.78 1180.07
33.55 1175.70
67.11 1171.59
134.22 1167.53

Memory partial write bandwidth
0.000512 1078.46
0.001024 1214.49
0.002048 1211.08
0.004096 1078.17
0.008192 1049.16
0.016384 1099.68
0.032768 1132.76
0.065536 1177.12
0.131072 1179.15
0.262144 1172.82
0.524288 1171.93
1.05 1179.60
2.10 1193.13
4.19 1179.01
8.39 1179.50
16.78 1178.75
33.55 1177.70
67.11 1174.70
134.22 1172.18

Memory partial read/write bandwidth
0.000512 4991.27
0.001024 5114.28
0.002048 5178.92
0.004096 5213.50
0.008192 5181.46
0.016384 5210.28
0.032768 5205.31
0.065536 3414.55
0.131072 2981.87
0.262144 1184.88
0.524288 375.94
1.05 313.48
2.10 314.71
4.19 310.04
8.39 308.83
16.78 309.74
33.55 311.78
67.11 313.19
134.22 308.24



"size=0k ovr=3.55
2 5.92
4 6.90
8 7.48
16 9.11
24 9.61
32 10.48
64 13.05
96 14.24

"size=4k ovr=5.67
2 5.53
4 7.43
8 9.98
16 15.07
24 14.57
32 16.70
64 23.23
96 26.39

"size=8k ovr=7.82
2 5.62
4 8.26
8 10.34
16 16.21
24 22.66
32 27.59
64 35.57
96 37.53

"size=16k ovr=12.04
2 7.37
4 10.44
8 16.45
16 37.12
24 46.74
32 51.92
64 57.50
96 58.06

"size=32k ovr=21.82
2 9.74
4 13.74
8 43.67
16 82.06
24 92.57
32 94.23
64 95.75
96 96.01

"size=64k ovr=48.15
2 14.96
4 83.50
8 138.50
16 160.10
24 162.26
32 163.59
64 164.33
96 164.02

tlb: 32 pages

Memory load parallelism
0.001024 3.00
0.002048 3.00
0.004096 3.01
0.008192 3.00
0.016384 3.04
0.032768 4.01
0.065536 1.09
0.131072 1.07
0.262144 1.02
0.524288 1.72
1.048576 1.00
2.097152 1.00
4.194304 1.00
8.388608 1.02
16.777216 1.00
33.554432 1.00
67.108864 1.00
134.217728 1.00

STREAM copy latency: 33.42 nanoseconds
STREAM copy bandwidth: 478.77 MB/sec
STREAM scale latency: 42.23 nanoseconds
STREAM scale bandwidth: 378.84 MB/sec
STREAM add latency: 48.88 nanoseconds
STREAM add bandwidth: 491.00 MB/sec
STREAM triad latency: 38.73 nanoseconds
STREAM triad bandwidth: 619.75 MB/sec
STREAM2 fill latency: 6.84 nanoseconds
STREAM2 fill bandwidth: 1170.19 MB/sec
STREAM2 copy latency: 33.33 nanoseconds
STREAM2 copy bandwidth: 480.00 MB/sec
STREAM2 daxpy latency: 67.76 nanoseconds
STREAM2 daxpy bandwidth: 354.18 MB/sec
STREAM2 sum latency: 16.96 nanoseconds
STREAM2 sum bandwidth: 471.72 MB/sec

Memory load latency
"stride=16
0.00049 3.024
0.00098 3.175
0.00195 3.003
0.00293 3.003
0.00391 2.997
0.00586 2.994
0.00781 2.990
0.01172 2.989
0.01562 2.999
0.02344 3.023
0.03125 3.229
0.04688 4.418
0.06250 4.692
0.09375 4.964
0.12500 5.108
0.18750 6.591
0.25000 14.692
0.37500 25.784
0.50000 31.570
0.75000 36.029
1.00000 37.756
1.50000 39.524
2.00000 40.304
3.00000 40.811
4.00000 41.312
6.00000 41.488
8.00000 41.672
12.00000 41.699
16.00000 41.710
24.00000 41.697
32.00000 41.706
48.00000 41.615
64.00000 41.669
96.00000 41.656
128.00000 41.674

"stride=32
0.00049 3.023
0.00098 3.024
0.00195 3.023
0.00293 3.023
0.00391 3.003
0.00586 3.003
0.00781 2.997
0.01172 2.994
0.01562 2.991
0.02344 2.991
0.03125 3.033
0.04688 7.889
0.06250 9.114
0.09375 10.295
0.12500 10.799
0.18750 13.849
0.25000 23.067
0.37500 49.790
0.50000 56.821
0.75000 68.312
1.00000 71.185
1.50000 74.263
2.00000 75.346
3.00000 76.920
4.00000 77.627
6.00000 78.104
8.00000 78.564
12.00000 78.570
16.00000 78.535
24.00000 78.524
32.00000 78.544
48.00000 78.488
64.00000 78.468
96.00000 78.597
128.00000 78.520

"stride=64
0.00049 3.023
0.00098 3.024
0.00195 3.023
0.00293 3.023
0.00391 3.023
0.00586 3.023
0.00781 3.004
0.01172 3.004
0.01562 3.000
0.02344 3.075
0.03125 3.725
0.04688 8.710
0.06250 9.926
0.09375 10.866
0.12500 11.545
0.18750 13.372
0.25000 39.181
0.37500 90.596
0.50000 109.622
0.75000 131.091
1.00000 139.290
1.50000 145.086
2.00000 147.916
3.00000 150.799
4.00000 152.040
6.00000 153.120
8.00000 154.231
12.00000 154.349
16.00000 154.301
24.00000 154.448
32.00000 154.348
48.00000 154.206
64.00000 154.291
96.00000 154.157
128.00000 154.207

"stride=128
0.00049 3.023
0.00098 3.023
0.00195 3.023
0.00293 3.023
0.00391 3.023
0.00586 3.023
0.00781 3.024
0.01172 3.025
0.01562 3.004
0.02344 3.139
0.03125 3.784
0.04688 8.788
0.06250 9.974
0.09375 10.928
0.12500 11.476
0.18750 25.838
0.25000 38.329
0.37500 85.830
0.50000 108.469
0.75000 131.726
1.00000 140.101
1.50000 145.025
2.00000 147.661
3.00000 150.628
4.00000 151.998
6.00000 153.651
8.00000 154.041
12.00000 154.916
16.00000 154.956
24.00000 154.983
32.00000 155.103
48.00000 155.031
64.00000 155.093
96.00000 154.952
128.00000 155.028

"stride=256
0.00049 3.023
0.00098 3.023
0.00195 3.022
0.00293 3.024
0.00391 3.023
0.00586 3.026
0.00781 3.024
0.01172 3.023
0.01562 3.023
0.02344 3.133
0.03125 3.757
0.04688 8.623
0.06250 9.805
0.09375 10.907
0.12500 11.368
0.18750 14.462
0.25000 41.525
0.37500 94.066
0.50000 109.664
0.75000 134.073
1.00000 141.325
1.50000 146.474
2.00000 149.159
3.00000 152.074
4.00000 153.564
6.00000 154.800
8.00000 155.802
12.00000 156.530
16.00000 156.401
24.00000 156.960
32.00000 156.908
48.00000 157.001
64.00000 156.978
96.00000 157.065
128.00000 157.143

"stride=512
0.00049 3.024
0.00098 3.023
0.00195 3.023
0.00293 3.023
0.00391 3.024
0.00586 3.023
0.00781 3.023
0.01172 3.024
0.01562 3.023
0.02344 3.101
0.03125 3.990
0.04688 8.721
0.06250 9.628
0.09375 10.944
0.12500 11.294
0.18750 16.808
0.25000 33.836
0.37500 99.375
0.50000 122.662
0.75000 137.526
1.00000 145.406
1.50000 150.884
2.00000 153.721
3.00000 156.595
4.00000 158.055
6.00000 159.434
8.00000 160.402
12.00000 160.767
16.00000 161.182
24.00000 161.471
32.00000 161.570
48.00000 161.562
64.00000 161.698
96.00000 161.819
128.00000 162.010

"stride=1024
0.00098 3.023
0.00195 3.023
0.00293 3.023
0.00391 3.023
0.00586 3.023
0.00781 3.023
0.01172 3.023
0.01562 3.023
0.02344 3.115
0.03125 4.075
0.04688 8.594
0.06250 9.884
0.09375 10.873
0.12500 11.377
0.18750 21.466
0.25000 42.814
0.37500 98.172
0.50000 116.844
0.75000 141.514
1.00000 153.104
1.50000 158.757
2.00000 161.460
3.00000 164.232
4.00000 166.135
6.00000 167.370
8.00000 168.093
12.00000 171.293
16.00000 169.163
24.00000 169.333
32.00000 169.140
48.00000 169.292
64.00000 169.401
96.00000 169.537
128.00000 170.023


Random load latency
"stride=16
0.00049 3.023
0.00098 3.023
0.00195 3.004
0.00293 3.003
0.00391 2.997
0.00586 2.993
0.00781 2.990
0.01172 2.989
0.01562 3.026
0.02344 2.995
0.03125 8.430
0.04688 8.879
0.06250 10.751
0.09375 12.734
0.12500 13.100
0.18750 62.772
0.25000 94.214
0.37500 140.556
0.50000 168.011
0.75000 200.843
1.00000 209.404
1.50000 227.843
2.00000 230.147
3.00000 247.134
4.00000 258.778
6.00000 274.302
8.00000 291.598
12.00000 304.598
16.00000 307.462
24.00000 313.865
32.00000 315.778
48.00000 314.083
64.00000 315.226
96.00000 312.031
128.00000 309.501
Conclusion:

It seems that compiling the kernel natively has little to no effect on performance. The variations are very small, small enough to be attributable to ordinary run-to-run variance.
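
To put a number on "very small", here is a quick Python sketch that compares the STREAM bandwidth figures from the two dumps above: the run posted just before this one and the natively compiled run. The values are copied from those dumps. The dictionary names and the percent_change helper are mine, and a single run per kernel cannot really separate noise from a true difference, so treat this as a rough check only.

```python
# STREAM / STREAM2 bandwidth in MB/sec, copied from the two dumps above.
previous_run = {
    "STREAM copy": 479.67, "STREAM scale": 378.87,
    "STREAM add": 490.93, "STREAM triad": 612.64,
    "STREAM2 fill": 1171.10, "STREAM2 copy": 480.67,
    "STREAM2 daxpy": 357.64, "STREAM2 sum": 471.65,
}
native_run = {
    "STREAM copy": 478.77, "STREAM scale": 378.84,
    "STREAM add": 491.00, "STREAM triad": 619.75,
    "STREAM2 fill": 1170.19, "STREAM2 copy": 480.00,
    "STREAM2 daxpy": 354.18, "STREAM2 sum": 471.72,
}


def percent_change(before, after):
    """Relative change of each metric in percent (positive = higher)."""
    return {key: 100.0 * (after[key] - before[key]) / before[key]
            for key in before}


changes = percent_change(previous_run, native_run)
worst = max(changes, key=lambda key: abs(changes[key]))

for name, pct in changes.items():
    print(f"{name:14s} {pct:+.2f}%")
print("largest swing:", worst)
```

On these numbers the largest swing is the triad result, at a little over 1%, and the changes go in both directions, which fits the reading that the differences are run-to-run noise.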

That is a positive verification of this XDA MythBuster. There seems to be no advantage for a particular compiler, whether cross-compiling or native, when building the kernel.

For userspace programs it is a different story. Optimized compiler flags should always be used for userspace programs.

It seems there is one advantage to a native build that I must note, however.
As noted in a previous post above, the resulting kernel image when compiled natively was approximately 128 kB smaller. That alone is enough for me to want to compile the kernel natively for the final build after all tests have been completed.