Linus' sha1 is much faster!

classic Classic list List threaded Threaded
28 messages Options
12
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Linus' sha1 is much faster!

Pádraig Brady
I've noticed before that coreutils hashing utils
were a little behind in performance, but was prompted
to look at it again when I noticed the recently
updated sha1 implementation in git:
http://git.kernel.org/?p=git/git.git;a=history;f=block-sha1;h=d3121f7;hb=pu

Testing that with the attached program which I wrote
in a couple of mins to try and match sha1sum's system calls
shows that it's around 33% faster, as shown below:

$ gcc $(rpm -q --qf="%{OPTFLAGS}\n" coreutils) linus-sha1.c sha1.c -o linus-sha1

$ time ./linus-sha1 300MB_file
df1e19e245fee4f53087b50ef953ca2c8d1644d7  300MB_file
real    0m2.742s
user    0m2.516s
sys     0m0.206s

$ time ~/git/coreutils/src/sha1sum 300MB_file
df1e19e245fee4f53087b50ef953ca2c8d1644d7  300MB_file

real    0m4.166s
user    0m3.846s
sys     0m0.298s

So, could we use that code in coreutils?
Think of all the dead fish it would save.

I've also attached a trivial block-sha1 patch which doesn't
affect performance, but does suppress a signed unsigned
comparison warning which occurs with -Wextra for example.

cheers,
Pádraig.

/* gcc -O2 -Wall linus-sha1.c sha1.c -o linus-sha1 */
#include <stdio.h>
#include <stdlib.h>
#include "sha1.h"

int main(int argc, char** argv)
{
    if (argc != 2) return 1;
    const char* filename = argv[1];
    FILE *fp = fopen (filename, "r");
    if (!fp) return 1;

    #define BS 4096 /* match coreutils */

    blk_SHA_CTX ctx;
    blk_SHA1_Init(&ctx);
    size_t nr;
    char buf[BS];
    while ((nr=fread_unlocked(buf, 1, sizeof(buf), fp)))
        blk_SHA1_Update(&ctx, buf, nr);
    unsigned char hash[20];
    blk_SHA1_Final(hash, &ctx);
    int i;
    for (i=0; i<sizeof(hash); i++)
        printf("%02x",*(hash+i));
    printf("  %s\n", filename);

    return 0;
}

From fa75e818836f763357ff9b7bbde3327e1aabbe47 Mon Sep 17 00:00:00 2001
From: =?utf-8?q?P=C3=A1draig=20Brady?= <[hidden email]>
Date: Sat, 15 Aug 2009 00:17:30 +0100
Subject: [PATCH] block-sha1: suppress signed unsigned comparison warning

* block-sha1/sha1.c: Use unsigned ints as the values
will never go negative.
---
 block-sha1/sha1.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/block-sha1/sha1.c b/block-sha1/sha1.c
index d3121f7..be763d8 100644
--- a/block-sha1/sha1.c
+++ b/block-sha1/sha1.c
@@ -231,13 +231,13 @@ void blk_SHA1_Init(blk_SHA_CTX *ctx)
 
 void blk_SHA1_Update(blk_SHA_CTX *ctx, const void *data, unsigned long len)
 {
- int lenW = ctx->size & 63;
+ unsigned int lenW = ctx->size & 63;
 
  ctx->size += len;
 
  /* Read the data into W and process blocks as they get full */
  if (lenW) {
- int left = 64 - lenW;
+ unsigned int left = 64 - lenW;
  if (len < left)
  left = len;
  memcpy(lenW + (char *)ctx->W, data, left);
--
1.6.2.5

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Linus' sha1 is much faster!

Bryan Donlan
2009/8/14 Pádraig Brady <[hidden email]>:

> I've noticed before that coreutils hashing utils
> were a little behind in performance, but was prompted
> to look at it again when I noticed the recently
> updated sha1 implementation in git:
> http://git.kernel.org/?p=git/git.git;a=history;f=block-sha1;h=d3121f7;hb=pu
>
> Testing that with the attached program which I wrote
> in a couple of mins to try and match sha1sum's system calls
> shows that it's around 33% faster, as shown below:
>
> $ gcc $(rpm -q --qf="%{OPTFLAGS}\n" coreutils) linus-sha1.c sha1.c -o linus-sha1
>
> $ time ./linus-sha1 300MB_file
> df1e19e245fee4f53087b50ef953ca2c8d1644d7  300MB_file
> real    0m2.742s
> user    0m2.516s
> sys     0m0.206s
>
> $ time ~/git/coreutils/src/sha1sum 300MB_file
> df1e19e245fee4f53087b50ef953ca2c8d1644d7  300MB_file
>
> real    0m4.166s
> user    0m3.846s
> sys     0m0.298s
>
> So, could we use that code in coreutils?
> Think of all the dead fish it would save.

coreutils is licensed under GPLv3, and git under GPLv2 (only), so
you'd need permission from all contributors to the implementation in
order to relicense under GPLv3. A quick grep of the history suggests
these contributors to be:

Brandon Casey <[hidden email]>
Junio C Hamano <[hidden email]>
Linus Torvalds <[hidden email]>
Nicolas Pitre <[hidden email]>
(adding these people to the CC list)

Additionally, it was originally based on the code in
mozilla-sha1/sha1.c, but that contains a license grant allowing it to
be used under GPLv2 /or later/, so if GPLv3 relicensing is enough it
shouldn't be necessary to get in contact with the original author.
However if the FSF requires copyright assignment to accept the new
implementation, it will be necessary to track down contributors to the
original mozilla-sha1/sha1.c as well.

Note that I'm not a lawyer, so there might be other roadblocks etc to
this as well, etc :)
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Linus' sha1 is much faster!

John Tapsell
2009/8/15 Bryan Donlan <[hidden email]>:
> coreutils is licensed under GPLv3, and git under GPLv2 (only), so
> you'd need permission from all contributors to the implementation in
> order to relicense under GPLv3. A quick grep of the history suggests
> these contributors to be:


X11 also requires a fast SHA1 implementation.  It uses this to check
if two pixmaps are the same.  So it would be really nice to relicense
under a liberal enough license that xorg can use it.

John
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Linus' sha1 is much faster!

Linus Torvalds-3


On Sat, 15 Aug 2009, John Tapsell wrote:

> 2009/8/15 Bryan Donlan <[hidden email]>:
> > coreutils is licensed under GPLv3, and git under GPLv2 (only), so
> > you'd need permission from all contributors to the implementation in
> > order to relicense under GPLv3. A quick grep of the history suggests
> > these contributors to be:
>
> X11 also requires a fast SHA1 implementation.  It uses this to check
> if two pixmaps are the same.  So it would be really nice to relicense
> under a liberal enough license that xorg can use it.

I'm personally ok with retaining the mozilla-sha1 license.

There's not really anything _remaining_ of the mozilla code, but hey, I
started from it. In retrospect I probably should have started from the PPC
asm code that already did the blocking sanely - but that's a "20/20
hindsight" kind of thing.

Plus hey, the mozilla code being a horrid pile of crud was why I was so
convinced that I could improve on things. So that's a kind of source for
it, even if it's more about the motivational side than any actual
remaining code ;)

That said, I don't know if the MPL is ok for X11. I've not looked at
compatibility issues with MPL. For git, we could just ignore the MPL,
since the GPLv2 was acceptable regardless of it.

                        Linus
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Linus' sha1 is much faster!

Linus Torvalds-3


On Sat, 15 Aug 2009, Linus Torvalds wrote:
>
> That said, I don't know if the MPL is ok for X11. I've not looked at
> compatibility issues with MPL. For git, we could just ignore the MPL,
> since the GPLv2 was acceptable regardless of it.

If MPL isn't ok for X11, then we'd need to make sure that even the
silliest Mozilla crud has been rewritten. There really isn't much, but
hey, the _history_ is based on the mozilla code, and who knows - the
'blk_SHA_CTX' struct has things like the fields in the same order as the
Mozilla equivalent, for all those historical reasons.

(Heh. Looking at that, I probably should move the 'size' field first,
since that would have different alignment rules, and the struct would be
more tightly packed that way, and initialize better).

Afaik, none of the actual code remains (the mozilla SHA1 thing did the
wrong thing for performance even for just the final bytes, and did those a
byte at a time etc, so I rewrote even the trivial SHA1_Final parts).

Of course, maybe the Mozilla people would be interested in taking my
faster version, and say that the new-BSD license is ok, and make everybody
happy. The only listed author for the Mozilla SHA1 is Paul Kocher. I added
him to the Cc.

Paul, for your information, we're talking about a faster rewritten "mostly
portable" SHA1 routines that you can find at

        http://git.kernel.org/?p=git/git.git;a=tree;f=block-sha1;hb=pu

(follow the "blob" pointers to see sha1.c and sha1.h). I don't know if
you're active with Mozilla/Firefox or whether you even care, but you seem
to be the logical choice of person to ask.

                        Linus
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Linus' sha1 is much faster!

Ted Ts'o
In reply to this post by John Tapsell
On Sat, Aug 15, 2009 at 09:12:58PM +0100, John Tapsell wrote:
> 2009/8/15 Bryan Donlan <[hidden email]>:
> > coreutils is licensed under GPLv3, and git under GPLv2 (only), so
> > you'd need permission from all contributors to the implementation in
> > order to relicense under GPLv3. A quick grep of the history suggests
> > these contributors to be:
>
> X11 also requires a fast SHA1 implementation.  It uses this to check
> if two pixmaps are the same.  So it would be really nice to relicense
> under a liberal enough license that xorg can use it.

If the checksum isn't being exposed in the protocol (i.e., it's just
internal to the X server), one possibility for X11 is to consider to
use the SHA-3 candidate Skein instead.  After receiving a large amount
of evaluation by cryptographic experts, it was one of the 18
algorithms (our of an original 64 entries) that have made it the 2nd
round of the NIST competition.  It's also *substantially* faster than
SHA:

    One exception to this is Skein, created by several well-known
    cryptographers and noted pundit Bruce Schneier. It was designed
    specifically to exploit all three of the Core 2 execution units
    and to run at a full 64-bits. This gives it roughly four to 10
    times the logic density of competing submissions.

    This is what I meant by the Matrix quote above. They didn't bend
    the spoon; they bent the crypto algorithm. They moved the logic
    operations around in a way that wouldn't weaken the crypto, but
    would strengthen its speed on the Intel Core 2.

    In their paper (PDF), the authors of Skein express surprise that a
    custom silicon ASIC implementation is not any faster than the
    software implementation. They shouldn't be surprised. Every time
    you can redefine a problem to run optimally in software, you will
    reach the same speeds you get with optimized ASIC hardware. The
    reason software has a reputation of being slow is because people
    don't redefine the original problem.

    http://www.darkreading.com/blog/archives/2008/11/bending_skein_c.html

For more information and some optimized implementation, see:

        http://www.skein-hash.info/

                                                        - Ted
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Linus' sha1 is much faster!

giuseppe
In reply to this post by Pádraig Brady
Hi Pádraig,

I tried to reproduce your results but I wasn't able to do it.  The
biggest difference on a 300MB file I noticed was approximately 15% using
on both implementations -O2, and 5% using -O3.
My GCC version is "gcc (Debian 4.3.3-14) 4.3.3" and the CPU is: Intel(R)
Pentium(R) D CPU 3.20GHz.

I also spent some time trying to improve the gnulib SHA1 implementation
and it seems a lookup table can improve things a bit.

Can you please try the patch that I have attached and tell me which
performance difference (if any) you get?

Thanks,
Giuseppe



From b975a5e0849eaa46e5cf410c5bf6e2308f044d61 Mon Sep 17 00:00:00 2001
From: Giuseppe Scrivano <[hidden email]>
Date: Sun, 16 Aug 2009 20:53:54 +0200
Subject: [PATCH] SHA1: use a lookup table for faster hashing

* lib/sha1.c (struct sha1_pre): New member.
* lib/sha1.c (sha1_process_block): Use the lookup table to quickly find
indices to use in the current round.
---
 lib/sha1.c |  160 ++++++++++++++++++++++++++++++++++-------------------------
 1 files changed, 92 insertions(+), 68 deletions(-)

diff --git a/lib/sha1.c b/lib/sha1.c
index 9c6c7ae..ec18ba7 100644
--- a/lib/sha1.c
+++ b/lib/sha1.c
@@ -283,6 +283,32 @@ sha1_process_bytes (const void *buffer, size_t len, struct sha1_ctx *ctx)
 #define F3(B,C,D) ( ( B & C ) | ( D & ( B | C ) ) )
 #define F4(B,C,D) (B ^ C ^ D)
 
+struct lookup_t
+{
+  unsigned char l1 : 4;
+  unsigned char l2 : 4;
+  unsigned char l3 : 4;
+  unsigned char l4 : 4;
+};
+
+const static struct lookup_t
+sha1_pre[16] = {{(0 - 3) & 0x0f, (0 - 8) & 0x0f, (0 - 14) & 0x0f},
+                {(1 - 3) & 0x0f, (1 - 8) & 0x0f, (1 - 14) & 0x0f},
+                {(2 - 3) & 0x0f, (2 - 8) & 0x0f, (2 - 14) & 0x0f},
+                {(3 - 3) & 0x0f, (3 - 8) & 0x0f, (3 - 14) & 0x0f},
+                {(4 - 3) & 0x0f, (4 - 8) & 0x0f, (4 - 14) & 0x0f},
+                {(5 - 3) & 0x0f, (5 - 8) & 0x0f, (5 - 14) & 0x0f},
+                {(6 - 3) & 0x0f, (6 - 8) & 0x0f, (6 - 14) & 0x0f},
+                {(7 - 3) & 0x0f, (7 - 8) & 0x0f, (7 - 14) & 0x0f},
+                {(8 - 3) & 0x0f, (8 - 8) & 0x0f, (8 - 14) & 0x0f},
+                {(9 - 3) & 0x0f, (9 - 8) & 0x0f, (9 - 14) & 0x0f},
+                {(10 - 3) & 0x0f, (10 - 8) & 0x0f, (10 - 14) & 0x0f},
+                {(11 - 3) & 0x0f, (11 - 8) & 0x0f, (11 - 14) & 0x0f},
+                {(12 - 3) & 0x0f, (12 - 8) & 0x0f, (12 - 14) & 0x0f},
+                {(13 - 3) & 0x0f, (13 - 8) & 0x0f, (13 - 14) & 0x0f},
+                {(14 - 3) & 0x0f, (14 - 8) & 0x0f, (14 - 14) & 0x0f},
+                {(15 - 3) & 0x0f, (15 - 8) & 0x0f, (15 - 14) & 0x0f}};
+
 /* Process LEN bytes of BUFFER, accumulating context into CTX.
    It is assumed that LEN % 64 == 0.
    Most of this code comes from GnuPG's cipher/sha1.c.  */
@@ -309,9 +335,8 @@ sha1_process_block (const void *buffer, size_t len, struct sha1_ctx *ctx)
 
 #define rol(x, n) (((x) << (n)) | ((uint32_t) (x) >> (32 - (n))))
 
-#define M(I) ( tm =   x[I&0x0f] ^ x[(I-14)&0x0f] \
-    ^ x[(I-8)&0x0f] ^ x[(I-3)&0x0f] \
-       , (x[I&0x0f] = rol(tm, 1)) )
+#define M(I) (x[I] = rol (x[sha1_pre[I].l1] ^ x[sha1_pre[I].l2] \
+                          ^ x[sha1_pre[I].l3] ^ x[I], 1))
 
 #define R(A,B,C,D,E,F,K,M)  do { E += rol( A, 5 )     \
       + F( B, C, D )  \
@@ -322,7 +347,6 @@ sha1_process_block (const void *buffer, size_t len, struct sha1_ctx *ctx)
 
   while (words < endp)
     {
-      uint32_t tm;
       int t;
       for (t = 0; t < 16; t++)
  {
@@ -346,70 +370,70 @@ sha1_process_block (const void *buffer, size_t len, struct sha1_ctx *ctx)
       R( c, d, e, a, b, F1, K1, x[13] );
       R( b, c, d, e, a, F1, K1, x[14] );
       R( a, b, c, d, e, F1, K1, x[15] );
-      R( e, a, b, c, d, F1, K1, M(16) );
-      R( d, e, a, b, c, F1, K1, M(17) );
-      R( c, d, e, a, b, F1, K1, M(18) );
-      R( b, c, d, e, a, F1, K1, M(19) );
-      R( a, b, c, d, e, F2, K2, M(20) );
-      R( e, a, b, c, d, F2, K2, M(21) );
-      R( d, e, a, b, c, F2, K2, M(22) );
-      R( c, d, e, a, b, F2, K2, M(23) );
-      R( b, c, d, e, a, F2, K2, M(24) );
-      R( a, b, c, d, e, F2, K2, M(25) );
-      R( e, a, b, c, d, F2, K2, M(26) );
-      R( d, e, a, b, c, F2, K2, M(27) );
-      R( c, d, e, a, b, F2, K2, M(28) );
-      R( b, c, d, e, a, F2, K2, M(29) );
-      R( a, b, c, d, e, F2, K2, M(30) );
-      R( e, a, b, c, d, F2, K2, M(31) );
-      R( d, e, a, b, c, F2, K2, M(32) );
-      R( c, d, e, a, b, F2, K2, M(33) );
-      R( b, c, d, e, a, F2, K2, M(34) );
-      R( a, b, c, d, e, F2, K2, M(35) );
-      R( e, a, b, c, d, F2, K2, M(36) );
-      R( d, e, a, b, c, F2, K2, M(37) );
-      R( c, d, e, a, b, F2, K2, M(38) );
-      R( b, c, d, e, a, F2, K2, M(39) );
-      R( a, b, c, d, e, F3, K3, M(40) );
-      R( e, a, b, c, d, F3, K3, M(41) );
-      R( d, e, a, b, c, F3, K3, M(42) );
-      R( c, d, e, a, b, F3, K3, M(43) );
-      R( b, c, d, e, a, F3, K3, M(44) );
-      R( a, b, c, d, e, F3, K3, M(45) );
-      R( e, a, b, c, d, F3, K3, M(46) );
-      R( d, e, a, b, c, F3, K3, M(47) );
-      R( c, d, e, a, b, F3, K3, M(48) );
-      R( b, c, d, e, a, F3, K3, M(49) );
-      R( a, b, c, d, e, F3, K3, M(50) );
-      R( e, a, b, c, d, F3, K3, M(51) );
-      R( d, e, a, b, c, F3, K3, M(52) );
-      R( c, d, e, a, b, F3, K3, M(53) );
-      R( b, c, d, e, a, F3, K3, M(54) );
-      R( a, b, c, d, e, F3, K3, M(55) );
-      R( e, a, b, c, d, F3, K3, M(56) );
-      R( d, e, a, b, c, F3, K3, M(57) );
-      R( c, d, e, a, b, F3, K3, M(58) );
-      R( b, c, d, e, a, F3, K3, M(59) );
-      R( a, b, c, d, e, F4, K4, M(60) );
-      R( e, a, b, c, d, F4, K4, M(61) );
-      R( d, e, a, b, c, F4, K4, M(62) );
-      R( c, d, e, a, b, F4, K4, M(63) );
-      R( b, c, d, e, a, F4, K4, M(64) );
-      R( a, b, c, d, e, F4, K4, M(65) );
-      R( e, a, b, c, d, F4, K4, M(66) );
-      R( d, e, a, b, c, F4, K4, M(67) );
-      R( c, d, e, a, b, F4, K4, M(68) );
-      R( b, c, d, e, a, F4, K4, M(69) );
-      R( a, b, c, d, e, F4, K4, M(70) );
-      R( e, a, b, c, d, F4, K4, M(71) );
-      R( d, e, a, b, c, F4, K4, M(72) );
-      R( c, d, e, a, b, F4, K4, M(73) );
-      R( b, c, d, e, a, F4, K4, M(74) );
-      R( a, b, c, d, e, F4, K4, M(75) );
-      R( e, a, b, c, d, F4, K4, M(76) );
-      R( d, e, a, b, c, F4, K4, M(77) );
-      R( c, d, e, a, b, F4, K4, M(78) );
-      R( b, c, d, e, a, F4, K4, M(79) );
+      R( e, a, b, c, d, F1, K1, M( 0) );
+      R( d, e, a, b, c, F1, K1, M( 1) );
+      R( c, d, e, a, b, F1, K1, M( 2) );
+      R( b, c, d, e, a, F1, K1, M( 3) );
+      R( a, b, c, d, e, F2, K2, M( 4) );
+      R( e, a, b, c, d, F2, K2, M( 5) );
+      R( d, e, a, b, c, F2, K2, M( 6) );
+      R( c, d, e, a, b, F2, K2, M( 7) );
+      R( b, c, d, e, a, F2, K2, M( 8) );
+      R( a, b, c, d, e, F2, K2, M( 9) );
+      R( e, a, b, c, d, F2, K2, M(10) );
+      R( d, e, a, b, c, F2, K2, M(11) );
+      R( c, d, e, a, b, F2, K2, M(12) );
+      R( b, c, d, e, a, F2, K2, M(13) );
+      R( a, b, c, d, e, F2, K2, M(14) );
+      R( e, a, b, c, d, F2, K2, M(15) );
+      R( d, e, a, b, c, F2, K2, M( 0) );
+      R( c, d, e, a, b, F2, K2, M( 1) );
+      R( b, c, d, e, a, F2, K2, M( 2) );
+      R( a, b, c, d, e, F2, K2, M( 3) );
+      R( e, a, b, c, d, F2, K2, M( 4) );
+      R( d, e, a, b, c, F2, K2, M( 5) );
+      R( c, d, e, a, b, F2, K2, M( 6) );
+      R( b, c, d, e, a, F2, K2, M( 7) );
+      R( a, b, c, d, e, F3, K3, M( 8) );
+      R( e, a, b, c, d, F3, K3, M( 9) );
+      R( d, e, a, b, c, F3, K3, M(10) );
+      R( c, d, e, a, b, F3, K3, M(11) );
+      R( b, c, d, e, a, F3, K3, M(12) );
+      R( a, b, c, d, e, F3, K3, M(13) );
+      R( e, a, b, c, d, F3, K3, M(14) );
+      R( d, e, a, b, c, F3, K3, M(15) );
+      R( c, d, e, a, b, F3, K3, M( 0) );
+      R( b, c, d, e, a, F3, K3, M( 1) );
+      R( a, b, c, d, e, F3, K3, M( 2) );
+      R( e, a, b, c, d, F3, K3, M( 3) );
+      R( d, e, a, b, c, F3, K3, M( 4) );
+      R( c, d, e, a, b, F3, K3, M( 5) );
+      R( b, c, d, e, a, F3, K3, M( 6) );
+      R( a, b, c, d, e, F3, K3, M( 7) );
+      R( e, a, b, c, d, F3, K3, M( 8) );
+      R( d, e, a, b, c, F3, K3, M( 9) );
+      R( c, d, e, a, b, F3, K3, M(10) );
+      R( b, c, d, e, a, F3, K3, M(11) );
+      R( a, b, c, d, e, F4, K4, M(12) );
+      R( e, a, b, c, d, F4, K4, M(13) );
+      R( d, e, a, b, c, F4, K4, M(14) );
+      R( c, d, e, a, b, F4, K4, M(15) );
+      R( b, c, d, e, a, F4, K4, M( 0) );
+      R( a, b, c, d, e, F4, K4, M( 1) );
+      R( e, a, b, c, d, F4, K4, M( 2) );
+      R( d, e, a, b, c, F4, K4, M( 3) );
+      R( c, d, e, a, b, F4, K4, M( 4) );
+      R( b, c, d, e, a, F4, K4, M( 5) );
+      R( a, b, c, d, e, F4, K4, M( 6) );
+      R( e, a, b, c, d, F4, K4, M( 7) );
+      R( d, e, a, b, c, F4, K4, M( 8) );
+      R( c, d, e, a, b, F4, K4, M( 9) );
+      R( b, c, d, e, a, F4, K4, M(10) );
+      R( a, b, c, d, e, F4, K4, M(11) );
+      R( e, a, b, c, d, F4, K4, M(12) );
+      R( d, e, a, b, c, F4, K4, M(13) );
+      R( c, d, e, a, b, F4, K4, M(14) );
+      R( b, c, d, e, a, F4, K4, M(15) );
 
       a = ctx->A += a;
       b = ctx->B += b;
--
1.6.3.3

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Linus' sha1 is much faster!

Linus Torvalds-3


On Sun, 16 Aug 2009, Giuseppe Scrivano wrote:
>
> My GCC version is "gcc (Debian 4.3.3-14) 4.3.3" and the CPU is: Intel(R)
> Pentium(R) D CPU 3.20GHz.

Netburst is very sensitive to random spill effects, and you can basically
tune things by just code shuffling that just has random effects on the
generated asm code.

> I also spent some time trying to improve the gnulib SHA1 implementation
> and it seems a lookup table can improve things a bit.

I pretty much can guarantee you that it improves things only because it
makes gcc generate crap code, which then hides some of the P4 issues.

I'd also suggest you try gcc-4.4, since that apparently fixes some of the
oddest spill issues.

                        Linus
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Linus' sha1 is much faster!

giuseppe
Linus Torvalds <[hidden email]> writes:

> I pretty much can guarantee you that it improves things only because it
> makes gcc generate crap code, which then hides some of the P4 issues.
>
> I'd also suggest you try gcc-4.4, since that apparently fixes some of the
> oddest spill issues.


Thanks for the hint.  I tried gcc-4.4 and it produces slower code than
4.3 on the gnulib SHA1 implementation and my patch makes it even more!

I noticed that on my machine your implementation is ~30-40% faster using
SHA_ROT for rol/ror instructions than inline assembly, at least with the
test-case Pádraig wrote.  Am I the only one reporting it?

Cheers,
Giuseppe
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Linus' sha1 is much faster!

Linus Torvalds-3


On Mon, 17 Aug 2009, Giuseppe Scrivano wrote:
>
> Thanks for the hint.  I tried gcc-4.4 and it produces slower code than
> 4.3 on the gnulib SHA1 implementation and my patch makes it even more!

Check out the asm, see if you can see why. One of the most common problems
with P4's is literally that you end up loading from the same stack slot
that you just stored to (gcc can do some really crazy spills), and that
causes a store buffer hazard replay.

My personal opinion is that Netburst is useless for trying to optimize C
code for. It's just too random.

> I noticed that on my machine your implementation is ~30-40% faster using
> SHA_ROT for rol/ror instructions than inline assembly, at least with the
> test-case Pádraig wrote.  Am I the only one reporting it?

I bet it's the same thing. Small perturbations of the source causing small
changes to register allocation and thus spilling, and then Netburst goes
crazy one way or another. It's interestign trying to fix it, and very
frustrating.

My workstation is a Nehalem (but Core 2 will have pretty much the same
behavior), and it doesn't have the crazy netburst behavior. Shorter and
simpler code generally performs better (which is _not_ true on Netburst).

On my machine, for example, forcing gcc to do those rotates on registers
is the difference between ~381MB/s and 415MB/s. And that's mainly because
it makes gcc keep A-E in registers, rather than trying to cache the
array[] references.

                        Linus
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Linus' sha1 is much faster!

Pádraig Brady
In reply to this post by giuseppe
Giuseppe Scrivano wrote:

> Hi Pádraig,
>
> I tried to reproduce your results but I wasn't able to do it.  The
> biggest difference on a 300MB file I noticed was approximately 15% using
> on both implementations -O2, and 5% using -O3.
> My GCC version is "gcc (Debian 4.3.3-14) 4.3.3" and the CPU is: Intel(R)
> Pentium(R) D CPU 3.20GHz.
>
> I also spent some time trying to improve the gnulib SHA1 implementation
> and it seems a lookup table can improve things a bit.
>
> Can you please try the patch that I have attached and tell me which
> performance difference (if any) you get?

Thanks for looking at this Giuseppe
and sorry for not mentioning my GCC and CPU.

Note the binaries below is compiled with
$(rpm -q --qf="%{OPTFLAGS}\n" coreutils)
for consistency, which on my F11 machines is:

  -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions
  -fstack-protector --param=ssp-buffer-size=4 -m32 -march=i586
  -mtune=generic -fasynchronous-unwind-tables -D_GNU_SOURCE=1

Testing on 2 machines I have here:

$ rpm -q gcc
gcc-4.4.1-2.fc11.i586
$ grep "model name" /proc/cpuinfo | head -n1 | tr -s '[:blank:]' ' '
model name : Intel(R) Pentium(R) M processor 1.70GHz
$ truncate -s300MB sha1.test
$ time sha1sum sha1.test
real    0m3.540s
$ time linus-sha1 sha1.test
real    0m2.319s (-34%)
$ time  giuseppe-sha1sum sha1.test
real    0m3.513s (-.8%)

$ rpm -q gcc
gcc-4.4.1-2.fc11.i586
$ grep "model name" /proc/cpuinfo | head -n1 | tr -s '[:blank:]' ' '
model name : Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz
$ truncate -s300MB sha1.test
$ time sha1sum sha1.test
real    0m1.857s
$ time linus-sha1 sha1.test
real    0m1.102s (-40%)
$ time giuseppe-sha1sum sha1.test
real    0m1.932s (+ 4%)

cheers,
Pádraig.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Linus' sha1 is much faster!

Nicolas Pitre
In reply to this post by Linus Torvalds-3
On Sat, 15 Aug 2009, Linus Torvalds wrote:

> (Heh. Looking at that, I probably should move the 'size' field first,
> since that would have different alignment rules, and the struct would be
> more tightly packed that way, and initialize better).

I was about to suggest (i.e. post) a patch for that.  This is indeed a
good idea.

> Afaik, none of the actual code remains (the mozilla SHA1 thing did the
> wrong thing for performance even for just the final bytes, and did those a
> byte at a time etc, so I rewrote even the trivial SHA1_Final parts).

Maybe a patch adding a proper header with the actual license would be a
good idea too.


Nicolas
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Linus' sha1 is much faster!

George Spelvin
In reply to this post by Pádraig Brady
If it helps anyone resolve license issues, here's a from-FIPS-180-2
implementation that's placed in the public domain.  That should be
compatible with any license.

It uses Linus's and Artur's performance ideas, and some of Linus' macro
ideas (in the rotate implementation), but tries to be textually different.
Is there anything recognizable that anyone cares to clam copyright to?

It's not quite 100% finished, as I haven't benchmarked it against Linus's
code yet, but it's functionally correct.

It's also clean with -W -Wall -Wextra.

TODO: Check if an initial copy to w[] is faster on i386 (less register
pressure).

/*
 * Secure Hash Algorith SHA-1, as published in FIPS PUB 180-2.
 *
 * This implementation is in the public domain.  Copyright abandoned.
 * You may do anything you like with it, including evil things.
 *
 * This is a rewrite from scratch, based on Linus Torvalds' "block-sha1"
 * from the git mailing list (August, 2009).  Additional optimization
 * ideas cribbed from
 * - Artur Skawina (x86, particularly P4, and much benchmarking)
 * - Nicilas Pitre (ARM)
 */

/* Cut here for sha1.h */
#include <stddef.h> /* For size_t */
#include <stdint.h> /* For uint32_t, uint64_t */

typedef struct SHA_context {
        uint64_t len; /* May be shrunk to uint32_t */
        uint32_t iv[5];
        unsigned char buf[64]; /* Must be 32-bit aligned */
} SHA_CTX;

void SHA1_Init(struct SHA_context *c);
void SHA1_Update(struct SHA_context *c, void const *p, size_t n);
void SHA1_Final(unsigned char hash[20], struct SHA_context *c);
/* End of sha1.h */

#include <string.h> /* For memcpy() */
#include <arpa/inet.h> /* For ntohl() */

static void sha1_core(uint32_t iv[5], unsigned char const *p, size_t nblocks);

/* Machine specific hacks */
#if defined(__i386__) || defined(__x86_64__) || defined (__ppc__) || \
    defined(__ppc64__) || defined(__powerpc__) || defined (__powerpc64__) || \
    defined(__s390__) || defined(__s390x__)
/* Unaligned access is okay */
static inline uint32_t get_be32(unsigned char const *p)
{
        return ntohl(*(uint32_t const *)p);
}
static inline void put_be32(unsigned char const *p, uint32_t v)
{
        *(uint32_t *)p = htonl(v);
}

#else
/* Unaligned access is not okay; do conversion as byte fetches */
static inline uint32_t get_be32(unsigned char const *p)
{
        return p[0] << 24 || p[1] << 16 | p[8] << 8 | p[3];
}
static inline void put_be32(unsigned char const *p, uint32_t v)
{
        p[0] = v >> 24;
        p[1] = v >> 16;
        p[2] = v >> 8;
        p[3] = v;
}
#endif

void SHA1_Init(struct SHA_context *c)
{
        /* This is a prefix of the SHA_context structure */
        static struct {
                uint64_t len;
                uint32_t iv[5];
        } const iv = { 0,
                { 0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0 }
        };

        memcpy(c, &iv, sizeof iv);
}

void SHA1_Update(struct SHA_context *c, void const *p, size_t n)
{
        size_t pos = c->len & 63; /* Offset into current input block */

        c->len += n;

        /* Initial partial block (if any) */
        if (pos) {
                size_t space = 63 - pos;
                if (n < space)
                        goto end;
                memcpy(c->buf + pos, p, space);
                sha1_core(c->iv, c->buf, 1);
                n -= space;
                p = (char const *)p + space;
        }

        /* The large middle piece */
        if (n >> 6) {
                sha1_core(c->iv, p, n >> 6);
                p = (char const *)p + (n & -(size_t)64);
                n &= 63;
        }
        pos = 0;
end:
        /* Final partial block (may be zero size) */
        memcpy(c->buf + pos, p, n);
}

void SHA1_Final(unsigned char hash[20], struct SHA_context *c)
{
        size_t pos = c->len & 63;
        unsigned i;

        /* Append a single 1 bit */
        c->buf[pos++] = 0x80;

        /* Append 0 bits until 64 bits remain in a block */
        if (pos > 56) {
                memset(c->buf + pos, 0, 64 - pos);
                sha1_core(c->iv, c->buf, 1);
                pos = 0;
        }
        memset(c->buf + pos, 0, 56 - pos);

        /* Append total input length in bits */
        ((uint32_t *)c->buf)[14] = htonl((uint32_t)(c->len >> 29));
        ((uint32_t *)c->buf)[15] = htonl((uint32_t)c->len << 3);

        /* Final hash round */
        sha1_core(c->iv, c->buf, 1);

        /* Copy hash result out */
        for (i = 0; i < 5; i++)
                put_be32(hash + 4*i, c->iv[i]);
}


/*
 * Helper macros for sha1_core function.  To avoid clutter, these macros
 * are NOT fully parenthesized if it doesn't matter to the actual use.
 */

#if __GNUC__ && (defined(__i386__) || defined(__x86_64__))
/*
 * GCC by itself only generates left rotates.  Use right rotates if
 * possible to be kinder to dinky implementations with iterative rotate
 * instructions.
 */
#define ROT(op, x, k) \
        ({ uint32_t y; asm(op " %1,%0" : "=r" (y) : "I" (k), "0" (x)); y; })
#define ROL(x,k) ROT("roll", x, k)
#define ROR(x,k) ROT("rorl", x, k)

#else
/* Generic C equivalent */
#define ROT(x,l,r) ((x) << (l) | (x) >> (r))
#define ROL(x,k) ROT(x,k,32-(k))
#define ROR(x,k) ROT(x,32-(k),k)
#endif


/*
 * An important temporary array in SHA-1 is the working array W[],
 * which holds a scheduled key word per round.  Only the last 16 words
 * are relevant, so only use 16 words...
 */
#define W(i) w[(i) & 15]

/*
 * If you have a small register set, it helps GCC to force stores to
 * the w[] array to memory.  Given 22 or more registers (e.g. PowerPC),
 * GCC can map the entire w[] array to registers and this becomes
 * counterproductive.
 *
 * The optimal kludge seems to differ between x86 and ARM.  On the latter,
 * forcing a full memory barrier is required to stop GCC from splitting
 * live ranges for each round and generating a separate stack slot for
 * each.  (Which produces a huge stack frame and kills performance.)
 */
#if defined(__i386__) || defined(__x86_64__)
#define STORE(i, x) *(volatile uint32_t *)&W(i) = x
#elif __GNUC__ && defined(__arm__)
#define STORE(i, x) W(i) = x; __asm__("":::"memory")
#else
#define STORE(i, x) W(i) = x
#endif


/* The three round functions.  F2 is also used as F4 */
#define F1(b,c,d) (((d ^ c) & b) ^ d) /* Bitwise b ? c : d */
#define F2(b,c,d) (d ^ c ^ b) /* Even parity */
#define F3(b,c,d) (d & c) + ((d ^ c) & b) /* Majority function */

/* The four round constants */
#define K1 0x5a827999 /* 2^30 * sqrt(2) */
#define K2 0x6ed9eba1 /* 2^30 * sqrt(3) */
#define K3 0x8f1bbcdc /* 2^30 * sqrt(5) */
#define K4 0xca62c1d6 /* 2^30 * sqrt(10) */

/* Rounds 0..15 fetch a word from the input */
#define FETCH(t,i) t = get_be32(p + 4*(i)); STORE(i,t)
/* Rounds 16..79 mix previous words to get a new one */
#define MIX(t,i) t = W(i) ^ W(i+2) ^ W(i+8) ^ W(i+13); t = ROL(t, 1)
/* Rounds 16..76 have to store back the result */
#define CALC(t,i) MIX(t,i); STORE(i,t)

/* The basic SHA-1 round */
#define SHA_ROUND(a,b,c,d,e,f,k,src,t,i) \
        src(t,i); \
        e += t + f(b,c,d) + k + ROL(a,5); \
        b = ROR(b,2)

/* An aligned group of 5 rounds */
#define SHA_ROUND5(f,k,src,t,i) \
        SHA_ROUND(a,b,c,d,e, f,k,src,t,i); \
        SHA_ROUND(e,a,b,c,d, f,k,src,t,i+1); \
        SHA_ROUND(d,e,a,b,c, f,k,src,t,i+2); \
        SHA_ROUND(c,d,e,a,b, f,k,src,t,i+3); \
        SHA_ROUND(b,c,d,e,a, f,k,src,t,i+4)

/*
 * The core SHA-1 transform.
 *
 * iv[5] = Current SHA-1 hash state.
 * p = Pointer to source data.  Not necessarily aligned.
 * nblocks = Number of 64-byte blocks at p.  Guaranteed non-zero.
 */
static void
sha1_core(uint32_t iv[5], unsigned char const *p, size_t nblocks)
{
        uint32_t e = iv[4], d = iv[3], c = iv[2], b = iv[1], a = iv[0];
        uint32_t w[16];

        do {
                uint32_t t;

                SHA_ROUND5(F1, K1, FETCH, t,  0);
                SHA_ROUND5(F1, K1, FETCH, t,  5);
                SHA_ROUND5(F1, K1, FETCH, t, 10);
                SHA_ROUND(a,b,c,d,e, F1, K1, FETCH, t, 15);
                SHA_ROUND(e,a,b,c,d, F1, K1, CALC, t, 16);
                SHA_ROUND(d,e,a,b,c, F1, K1, CALC, t, 17);
                SHA_ROUND(c,d,e,a,b, F1, K1, CALC, t, 18);
                SHA_ROUND(b,c,d,e,a, F1, K1, CALC, t, 19);

                SHA_ROUND5(F2, K2, CALC, t, 20);
                SHA_ROUND5(F2, K2, CALC, t, 25);
                SHA_ROUND5(F2, K2, CALC, t, 30);
                SHA_ROUND5(F2, K2, CALC, t, 35);

                SHA_ROUND5(F3, K3, CALC, t, 40);
                SHA_ROUND5(F3, K3, CALC, t, 45);
                SHA_ROUND5(F3, K3, CALC, t, 50);
                SHA_ROUND5(F3, K3, CALC, t, 55);

                SHA_ROUND5(F2, K4, CALC, t, 60);
                SHA_ROUND5(F2, K4, CALC, t, 65);
                SHA_ROUND5(F2, K4, CALC, t, 70);

                SHA_ROUND(a,b,c,d,e, F2, K4, CALC, t, 75);
                SHA_ROUND(e,a,b,c,d, F2, K4, CALC, t, 76);
                /* Last 3 rounds don't need to store mixed value */
                SHA_ROUND(d,e,a,b,c, F2, K4, MIX, t, 77);
                SHA_ROUND(c,d,e,a,b, F2, K4, MIX, t, 78);
                SHA_ROUND(b,c,d,e,a, F2, K4, MIX, t, 79);

                iv[4] = e += iv[4];
                iv[3] = d += iv[3];
                iv[2] = c += iv[2];
                iv[1] = b += iv[1];
                iv[0] = a += iv[0];
        } while (--nblocks);
}


/* Compile with -DUNITTEST for self-tests */
#if UNITTEST

#include <stdio.h>

/* Known answer test */
static void katest(char const *text, size_t len, unsigned char const hash[20])
{
        SHA_CTX c;
        unsigned char hash2[20];
        int i;

        SHA1_Init(&c);
        SHA1_Update(&c, text, len);
        SHA1_Final(hash2, &c);

        for (i = 0; i < 20; i++)
                if (hash[i] != hash2[i])
                        goto mismatch;
        printf("%u-byte known answer test PASSED\n", len);
        return;

mismatch:
        printf("%u-byte known answer test FAILED:\n", len);
        printf("Input: \"%.*s\"\n", len, text);
        printf("Computed:");
        for (i = 0; i < 20; i++)
                printf(" %02x", hash2[i]);
        printf("\nExpected:");
        for (i = 0; i < 20; i++)
                printf(" %02x", hash[i]);
        putchar('\n');
}

int
main(void)
{
        /* FIPS PUB 180.1 example A.1 */
        static char const text1[3] = "abc";
        static unsigned char const hash1[20] = {
                0xa9, 0x99, 0x3e, 0x36, 0x47, 0x06, 0x81, 0x6a, 0xba, 0x3e,
                0x25, 0x71, 0x78, 0x50, 0xc2, 0x6c, 0x9c, 0xd0, 0xd8, 0x9d };

        /* FIPS PUB 180.1 example A.2 */
        static char const text2[56] =
                "abcdbcdecdefdefgefghfghighijhijkijkljklmklmnlmnomnopnopq";
        static unsigned char const hash2[20] = {
                0x84, 0x98, 0x3e, 0x44, 0x1c, 0x3b, 0xd2, 0x6e, 0xba, 0xae,
                0x4a, 0xa1, 0xf9, 0x51, 0x29, 0xe5, 0xe5, 0x46, 0x70, 0xf1 };

        katest(text1, sizeof text1, hash1);
        katest(text2, sizeof text2, hash2);

        return 0;
}
#endif
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Linus' sha1 is much faster!

Andreas Ericsson
In reply to this post by Linus Torvalds-3
Linus Torvalds wrote:

>
> On Sat, 15 Aug 2009, Linus Torvalds wrote:
>> That said, I don't know if the MPL is ok for X11. I've not looked at
>> compatibility issues with MPL. For git, we could just ignore the MPL,
>> since the GPLv2 was acceptable regardless of it.
>
> If MPL isn't ok for X11, then we'd need to make sure that even the
> silliest Mozilla crud has been rewritten. There really isn't much, but
> hey, the _history_ is based on the mozilla code, and who knows - the
> 'blk_SHA_CTX' struct has things like the fields in the same order as the
> Mozilla equivalent, for all those historical reasons.
>
> (Heh. Looking at that, I probably should move the 'size' field first,
> since that would have different alignment rules, and the struct would be
> more tightly packed that way, and initialize better).
>
> Afaik, none of the actual code remains (the mozilla SHA1 thing did the
> wrong thing for performance even for just the final bytes, and did those a
> byte at a time etc, so I rewrote even the trivial SHA1_Final parts).
>
> Of course, maybe the Mozilla people would be interested in taking my
> faster version, and say that the new-BSD license is ok, and make everybody
> happy. The only listed author for the Mozilla SHA1 is Paul Kocher. I added
> him to the Cc.
>
> Paul, for your information, we're talking about a faster rewritten "mostly
> portable" SHA1 routines that you can find at
>
> http://git.kernel.org/?p=git/git.git;a=tree;f=block-sha1;hb=pu
>
> (follow the "blob" pointers to see sha1.c and sha1.h). I don't know if
> you're active with Mozilla/Firefox or whether you even care, but you seem
> to be the logical choice of person to ask.
>

I contacted Paul in february this year to get permission to use the mozilla
sha1 code for libgit2. His reply then was:
"I'm not sure which version the diffs are relative to, so I haven't reviewed them.
It's fine to distribute under BSD, GPL, or LGPL, however."

I also got explicit permission to relicense it under GPLv2 with the gcc exception.

I added the mail-address I used to contact him to CC as well. Sorry if you get
this twice, Paul.

Naturally, I'd like to use the faster version for libgit2 as well. The people
who Linus listed as contributors earlier (Brandon Casy, Linus, Junio and Nicolas
Pitre) have already consented to relicense their git contributions for libgit2
use. If anyone would like to revoke that consent for this code, speak now please,
or I'll patch it into libgit2 as well.

--
Andreas Ericsson                   [hidden email]
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Linus' sha1 is much faster!

giuseppe
In reply to this post by Pádraig Brady
Pádraig Brady <[hidden email]> writes:

>   -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions
>   -fstack-protector --param=ssp-buffer-size=4 -m32 -march=i586
>   -mtune=generic -fasynchronous-unwind-tables -D_GNU_SOURCE=1

thanks.  I did again all tests on my machine using these same options.
I repeated each test 6 times and I took the median without consider the
first result.  Except the first run that it is not considered, I didn't
report a big variance on results of the same test.


gcc 4.3.3

gnulib sha1:            real 0m2.543s
gnulib sha1 lookup:     real 0m1.906s (-25%)
linus's sha1:           real 0m2.468s (-3%)
linus's sha1 no asm:    real 0m2.289s (-9%)


gcc 4.4.1

gnulib sha1:            real 0m3.386s
gnulib sha1 lookup:     real 0m3.110s (-8%)
linus's sha1:           real 0m1.701s (-49%)
linus's sha1 no asm:    real 0m1.284s (-62%)


I don't see such big differences in asm generated by gcc 4.4.1 and gcc
4.3.3 to explain this performance difference, what I noticed immediately
is that in the gcc-4.4 generated asm there are more "lea" instructions
(+30%), but I doubt this is the reason of these poor results.  Anyway, I
haven't yet looked much in details.

Cheers,
Giuseppe
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Linus' sha1 is much faster!

Nicolas Pitre
In reply to this post by George Spelvin
On Mon, 17 Aug 2009, George Spelvin wrote:

> If it helps anyone resolve license issues, here's a from-FIPS-180-2
> implementation that's placed in the public domain.  That should be
> compatible with any license.
>
> It uses Linus's and Artur's performance ideas, and some of Linus' macro
> ideas (in the rotate implementation), but tries to be textually different.
> Is there anything recognizable that anyone cares to clam copyright to?
>
> It's not quite 100% finished, as I haven't benchmarked it against Linus's
> code yet, but it's functionally correct.
>
> It's also clean with -W -Wall -Wextra.
>
> TODO: Check if an initial copy to w[] is faster on i386 (less register
> pressure).
>
> /*
>  * Secure Hash Algorith SHA-1, as published in FIPS PUB 180-2.
>  *
>  * This implementation is in the public domain.  Copyright abandoned.
>  * You may do anything you like with it, including evil things.
>  *
>  * This is a rewrite from scratch, based on Linus Torvalds' "block-sha1"
>  * from the git mailing list (August, 2009).  Additional optimization
>  * ideas cribbed from
>  * - Artur Skawina (x86, particularly P4, and much benchmarking)
>  * - Nicilas Pitre (ARM)

Please be careful to spell my name correctly.


Nicolas
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Linus' sha1 is much faster!

Steven Noonan
In reply to this post by giuseppe
On Mon, Aug 17, 2009 at 3:51 AM, Giuseppe Scrivano<[hidden email]> wrote:

> Pádraig Brady <[hidden email]> writes:
>
>>   -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions
>>   -fstack-protector --param=ssp-buffer-size=4 -m32 -march=i586
>>   -mtune=generic -fasynchronous-unwind-tables -D_GNU_SOURCE=1
>
> thanks.  I did again all tests on my machine using these same options.
> I repeated each test 6 times and I took the median without consider the
> first result.  Except the first run that it is not considered, I didn't
> report a big variance on results of the same test.
>
>
> gcc 4.3.3
>
> gnulib sha1:            real    0m2.543s
> gnulib sha1 lookup:     real    0m1.906s (-25%)
> linus's sha1:           real    0m2.468s (-3%)
> linus's sha1 no asm:    real    0m2.289s (-9%)
>
>
> gcc 4.4.1
>
> gnulib sha1:            real    0m3.386s
> gnulib sha1 lookup:     real    0m3.110s (-8%)
> linus's sha1:           real    0m1.701s (-49%)
> linus's sha1 no asm:    real    0m1.284s (-62%)
>
>
> I don't see such big differences in asm generated by gcc 4.4.1 and gcc
> 4.3.3 to explain this performance difference, what I noticed immediately
> is that in the gcc-4.4 generated asm there are more "lea" instructions
> (+30%), but I doubt this is the reason of these poor results.  Anyway, I
> haven't yet looked much in details.
>
> Cheers,
> Giuseppe

Interesting. I compared Linus' implementation to the public domain one
by Steve Reid[1], which is used in OpenLDAP and a few other projects.
Anyone with some experience testing these kinds of things in a
statistically sound manner want to try it out? In my tests, I got
this:

(average of 5 runs)
Linus' sha1: 283MB/s
Steve Reid's sha1: 305MB/s

- Steven

[1] http://gpl.nas-central.org/SYNOLOGY/x07-series/514_UNTARED/source/openldap-2.3.11/libraries/liblutil/sha1.c
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Linus' sha1 is much faster!

Linus Torvalds-3


On Mon, 17 Aug 2009, Steven Noonan wrote:
>
> Interesting. I compared Linus' implementation to the public domain one
> by Steve Reid[1]

You _really_ need to talk about what kind of environment you have.

There are three major issues:
 - Netburst vs non-netburst
 - 32-bit vs 64-bit
 - compiler version

Steve Reid's code looks great, but the way it is coded, gcc makes a mess
of it, which is exactly what my SHA1 tries to avoid.

[ In contrast, gcc does very well on just about _any_ straightforward
  unrolled SHA1 C code if the target architecture is something like PPC or
  ia64 that has enough registers to keep it all in registers.

  I haven't really tested other compilers - a less aggressive compiler
  would actually do _better_ on SHA1, because the problem with gcc is that
  it turns the whole temporary 16-entry word array into register accesses,
  and tries to do register allocation on that _array_.

  That is wonderful for the above-mentioned PPC and IA64, but it makes gcc
  create totally crazy code when there aren't enough registers, and then
  gcc starts spilling randomly (ie it starts spilling a-e etc). This is
  why the compiler and version matters so much. ]

> (average of 5 runs)
> Linus' sha1: 283MB/s
> Steve Reid's sha1: 305MB/s

So I get very different results:

        #             TIME[s] SPEED[MB/s]
        Reid            2.742       222.6
        linus           1.464         417

this is Intel Nehalem, but compiled for 32-bit mode (which is the more
challenging one because x86-32 only has 7 general-purpose registers), and
with gcc-4.4.0.

                        Linus
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Linus' sha1 is much faster!

Nicolas Pitre
In reply to this post by George Spelvin
On Mon, 17 Aug 2009, George Spelvin wrote:

> If it helps anyone resolve license issues, here's a from-FIPS-180-2
> implementation that's placed in the public domain.  That should be
> compatible with any license.
>
> It uses Linus's and Artur's performance ideas, and some of Linus' macro
> ideas (in the rotate implementation), but tries to be textually different.
> Is there anything recognizable that anyone cares to clam copyright to?

I don't think this trick of making source code textually different from
another work while still intimately mimicking the same structure entitles
you to any copyright (or non copyright) claims over that other work.  I
certainly wouldn't bet any dime for this standing up in court.  
Otherwise anyone could grab any copyrighted source code and perform a
bunch of search-and-replace ops on it, and maybe some code reordering
for good measure, to be able to claim own copyright on it. It is
probably much safer to simply ask the people involved to agree with your
relicensing.  And so far I don't see anyone with a stake in this
fiercely wanting to stick to a particular license.

> It's not quite 100% finished, as I haven't benchmarked it against Linus's
> code yet, but it's functionally correct.
>
> It's also clean with -W -Wall -Wextra.

Not if you try with the unaligned put_be32() as the destination pointer
is marked const.

As to the actual result on ARM... Well, the assembly _looks_ much worse
than Linus' version.  It uses a stack frame of 152 bytes instead of 64
bytes.  The resulting binary is also 6868 bytes large compared to 6180
bytes.  Surprisingly, the performance is not that bad (the reason for
the underlined "looks" above) albeit still a bit worse, like 5% slower.  
I was expecting much worse than that.

One possible reason for the bad assembly is probably due to the fact
that gcc is not smart enough to propagate constant address offsets
across different pointer types.  For example, my first version of
get_be32() was a macro that did this:

#define SHA_SRC(t) \
  ({ unsigned char *__d = (unsigned char *)&data[t]; \
     (__d[0] << 24) | (__d[1] << 16) | (__d[2] << 8) | (__d[3] << 0); })

With such a construct, gcc would always allocate a register to hold __d
and then dereference that with an offset from 0 to 3.  Whereas:

#define SHA_SRC(t) \
   ({   unsigned char *__d = (unsigned char *)data; \
        (__d[(t)*4 + 0] << 24) | (__d[(t)*4 + 1] << 16) | \
        (__d[(t)*4 + 2] <<  8) | (__d[(t)*4 + 3] <<  0); })

does produce optimal assembly as only the register holding the data
pointer is dereferenced with the absolute byte offset.  I suspect your
usage of inline functions has the same effect as the first SHA_SRC
definition above.

Also, wrt skipping the last 3 write back to the 16 word array...  For
all the (limited) attempts I've made so far to do that, it always ended
up making things worse.  I've yet to investigate why though.


Nicolas
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Linus' sha1 is much faster!

Paolo Bonzini-2
> my first version of
> get_be32() was a macro that did this:
>
> #define SHA_SRC(t) \
>    ({ unsigned char *__d = (unsigned char *)&data[t]; \
>       (__d[0]<<  24) | (__d[1]<<  16) | (__d[2]<<  8) | (__d[3]<<  0); })
>
> With such a construct, gcc would always allocate a register to hold __d
> and then dereference that with an offset from 0 to 3.  Whereas:
>
> #define SHA_SRC(t) \
>     ({   unsigned char *__d = (unsigned char *)data; \
>          (__d[(t)*4 + 0]<<  24) | (__d[(t)*4 + 1]<<  16) | \
>          (__d[(t)*4 + 2]<<   8) | (__d[(t)*4 + 3]<<   0); })
>
> does produce optimal assembly as only the register holding the data
> pointer is dereferenced with the absolute byte offset.  I suspect your
> usage of inline functions has the same effect as the first SHA_SRC
> definition above.

Yes, that's what happens.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
12
Loading...