[PATCH 1/2] t4034-diff-words: replace regex for diff driver

classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

[PATCH 1/2] t4034-diff-words: replace regex for diff driver

Tay Ray Chuan
The next patch uses a non-whitespace regex, similar to the regex
currently used by the 'testdriver' diff driver; replace the regex with a
distinct one so that we can continue to conclude its effects.

Signed-off-by: Tay Ray Chuan <[hidden email]>

---
Kept separate to keep the next patch clean.
---
 t/t4034-diff-words.sh |   20 +++++++++++++++++---
 1 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/t/t4034-diff-words.sh b/t/t4034-diff-words.sh
index 6f1e5a2..9ae0e1a 100755
--- a/t/t4034-diff-words.sh
+++ b/t/t4034-diff-words.sh
@@ -46,6 +46,20 @@ cat >expect.non-whitespace-is-word <<-\EOF
 
  <GREEN>aeff = aeff * ( aaa )<RESET>
 EOF
+cat >expect.everything-is-word <<-\EOF
+ <BOLD>diff --git a/pre b/post<RESET>
+ <BOLD>index 330b04f..5ed8eff 100644<RESET>
+ <BOLD>--- a/pre<RESET>
+ <BOLD>+++ b/post<RESET>
+ <CYAN>@@ -1,3 +1,7 @@<RESET>
+ <RED>h(4)<RESET><GREEN>h(4),hh[44]<RESET>
+
+ a = b + c<RESET>
+
+ <GREEN>aa = a<RESET>
+
+ <GREEN>aeff = aeff * ( aaa )<RESET>
+EOF
 
 word_diff () {
  test_must_fail git diff --no-index "$@" pre post >output &&
@@ -179,7 +193,7 @@ test_expect_success 'word diff with a regular expression' '
 '
 
 test_expect_success 'set up a diff driver' '
- git config diff.testdriver.wordRegex "[^[:space:]]" &&
+ git config diff.testdriver.wordRegex ".+" &&
  cat <<-\EOF >.gitattributes
  pre diff=testdriver
  post diff=testdriver
@@ -192,7 +206,7 @@ test_expect_success 'option overrides .gitattributes' '
 '
 
 test_expect_success 'use regex supplied by driver' '
- cp expect.non-whitespace-is-word expect &&
+ cp expect.everything-is-word expect &&
  word_diff --color-words
 '
 
@@ -224,7 +238,7 @@ test_expect_success 'command-line overrides config: --word-diff-regex' '
 '
 
 test_expect_success '.gitattributes override config' '
- cp expect.non-whitespace-is-word expect &&
+ cp expect.everything-is-word expect &&
  word_diff --color-words
 '
 
--
1.7.7.584.g16d0ea

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

[PATCH 2/2] diff --word-diff: use non-whitespace regex by default

Tay Ray Chuan
Factor out the comprehensive non-whitespace regex in use by PATTERNS and
IPATTERN and use it as the word-diff regex for the default diff driver.

As the default regex is no longer non-empty, update the word-regex
selection logic (non-default driver from pre-image, then post-image,
then the diff.wordRegex config) accordingly.

Signed-off-by: Tay Ray Chuan <[hidden email]>
---
 diff.c                |   14 ++++++++------
 t/t4034-diff-words.sh |   31 +++++++++----------------------
 userdiff.c            |    8 +++++---
 3 files changed, 22 insertions(+), 31 deletions(-)

diff --git a/diff.c b/diff.c
index 374ecf3..5f71f9f 100644
--- a/diff.c
+++ b/diff.c
@@ -1987,9 +1987,10 @@ static const struct userdiff_funcname *diff_funcname_pattern(struct diff_filespe
  return one->driver->funcname.pattern ? &one->driver->funcname : NULL;
 }
 
-static const char *userdiff_word_regex(struct diff_filespec *one)
+static const char *userdiff_word_regex(struct diff_filespec *one, int *is_default)
 {
  diff_filespec_load_driver(one);
+ *is_default = !strcmp(one->driver->name, "default");
  return one->driver->word_regex;
 }
 
@@ -2180,17 +2181,18 @@ static void builtin_diff(const char *name_a,
  else if (!prefixcmp(diffopts, "-u"))
  xecfg.ctxlen = strtoul(diffopts + 2, NULL, 10);
  if (o->word_diff) {
- int i;
+ int i, is_default;
 
  ecbdata.diff_words =
  xcalloc(1, sizeof(struct diff_words_data));
  ecbdata.diff_words->type = o->word_diff;
  ecbdata.diff_words->opt = o;
+ is_default = 0;
  if (!o->word_regex)
- o->word_regex = userdiff_word_regex(one);
- if (!o->word_regex)
- o->word_regex = userdiff_word_regex(two);
- if (!o->word_regex)
+ o->word_regex = userdiff_word_regex(one, &is_default);
+ if (is_default)
+ o->word_regex = userdiff_word_regex(two, &is_default);
+ if (is_default && diff_word_regex_cfg)
  o->word_regex = diff_word_regex_cfg;
  if (o->word_regex) {
  ecbdata.diff_words->word_regex = (regex_t *)
diff --git a/t/t4034-diff-words.sh b/t/t4034-diff-words.sh
index 9ae0e1a..e588849 100755
--- a/t/t4034-diff-words.sh
+++ b/t/t4034-diff-words.sh
@@ -84,26 +84,13 @@ test_expect_success setup '
  git config diff.color.func magenta
 '
 
-test_expect_success 'set up pre and post with runs of whitespace' '
+test_expect_success 'set up pre and post with runs of non-whitespace' '
  cp pre.simple pre &&
- cp post.simple post
+ cp post.simple post &&
+ cp expect.non-whitespace-is-word expect
 '
 
-test_expect_success 'word diff with runs of whitespace' '
- cat >expect <<-\EOF &&
- <BOLD>diff --git a/pre b/post<RESET>
- <BOLD>index 330b04f..5ed8eff 100644<RESET>
- <BOLD>--- a/pre<RESET>
- <BOLD>+++ b/post<RESET>
- <CYAN>@@ -1,3 +1,7 @@<RESET>
- <RED>h(4)<RESET><GREEN>h(4),hh[44]<RESET>
-
- a = b + c<RESET>
-
- <GREEN>aa = a<RESET>
-
- <GREEN>aeff = aeff * ( aaa )<RESET>
- EOF
+test_expect_success 'word diff defaults to runs of non-whitespace' '
  word_diff --color-words &&
  word_diff --word-diff=color &&
  word_diff --color --word-diff=color
@@ -116,8 +103,8 @@ test_expect_success '--word-diff=porcelain' '
  --- a/pre
  +++ b/post
  @@ -1,3 +1,7 @@
- -h(4)
- +h(4),hh[44]
+ h(4)
+ +,hh[44]
  ~
  # significant space
  ~
@@ -140,7 +127,7 @@ test_expect_success '--word-diff=plain' '
  --- a/pre
  +++ b/post
  @@ -1,3 +1,7 @@
- [-h(4)-]{+h(4),hh[44]+}
+ h(4){+,hh[44]+}
 
  a = b + c
 
@@ -159,7 +146,7 @@ test_expect_success '--word-diff=plain --color' '
  <BOLD>--- a/pre<RESET>
  <BOLD>+++ b/post<RESET>
  <CYAN>@@ -1,3 +1,7 @@<RESET>
- <RED>[-h(4)-]<RESET><GREEN>{+h(4),hh[44]+}<RESET>
+ h(4)<GREEN>{+,hh[44]+}<RESET>
 
  a = b + c<RESET>
 
@@ -177,7 +164,7 @@ test_expect_success 'word diff without context' '
  <BOLD>--- a/pre<RESET>
  <BOLD>+++ b/post<RESET>
  <CYAN>@@ -1 +1 @@<RESET>
- <RED>h(4)<RESET><GREEN>h(4),hh[44]<RESET>
+ h(4)<GREEN>,hh[44]<RESET>
  <CYAN>@@ -3,0 +4,4 @@<RESET> <RESET><MAGENTA>a = b + c<RESET>
 
  <GREEN>aa = a<RESET>
diff --git a/userdiff.c b/userdiff.c
index 76109da..cf38566 100644
--- a/userdiff.c
+++ b/userdiff.c
@@ -7,12 +7,14 @@ static struct userdiff_driver *drivers;
 static int ndrivers;
 static int drivers_alloc;
 
+#define NON_WHITESPACE \
+ "[^[:space:]]|[\xc0-\xff][\x80-\xbf]+"
 #define PATTERNS(name, pattern, word_regex) \
  { name, NULL, -1, { pattern, REG_EXTENDED }, \
-  word_regex "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+" }
+  word_regex "|" NON_WHITESPACE }
 #define IPATTERN(name, pattern, word_regex) \
  { name, NULL, -1, { pattern, REG_EXTENDED | REG_ICASE }, \
-  word_regex "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+" }
+  word_regex "|" NON_WHITESPACE }
 static struct userdiff_driver builtin_drivers[] = {
 IPATTERN("fortran",
  "!^([C*]|[ \t]*!)\n"
@@ -140,7 +142,7 @@ PATTERNS("csharp",
  "[a-zA-Z_][a-zA-Z0-9_]*"
  "|[-+0-9.e]+[fFlL]?|0[xXbB]?[0-9a-fA-F]+[lL]?"
  "|[-+*/<>%&^|=!]=|--|\\+\\+|<<=?|>>=?|&&|\\|\\||::|->"),
-{ "default", NULL, -1, { NULL, 0 } },
+{ "default", NULL, -1, { NULL, 0 }, NON_WHITESPACE },
 };
 #undef PATTERNS
 #undef IPATTERN
--
1.7.7.584.g16d0ea

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 2/2] diff --word-diff: use non-whitespace regex by default

Thomas Rast
Tay Ray Chuan <[hidden email]> writes:

> Factor out the comprehensive non-whitespace regex in use by PATTERNS and
> IPATTERN and use it as the word-diff regex for the default diff driver.

Why?

I seem to recall that the motivation for keeping the original code as-is
instead of just emulating its behavior with a default regex was that it
is faster.  So disabling the default mode should at least have an
advantage?

</devils-advocate>

--
Thomas Rast
trast@{inf,student}.ethz.ch
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 2/2] diff --word-diff: use non-whitespace regex by default

Tay Ray Chuan
Hi,

Thomas, first off, thanks for looking through this.

On Thu, Jan 12, 2012 at 4:05 AM, Thomas Rast <[hidden email]> wrote:

> Tay Ray Chuan <[hidden email]> writes:
>
>> Factor out the comprehensive non-whitespace regex in use by PATTERNS and
>> IPATTERN and use it as the word-diff regex for the default diff driver.
>
> Why?
>
> I seem to recall that the motivation for keeping the original code as-is
> instead of just emulating its behavior with a default regex was that it
> is faster.  So disabling the default mode should at least have an
> advantage?
>
> </devils-advocate>

If you're talking about speed, yeah, that's probably true.

But I think it's worthwhile to trade-off performance for a sensible
default. Something like

  matrix[a,b,c]
  matrix[d,b,c]

gives

  matrix[[-a-]{+d+},b,c]

and when we have

  ImagineALanguageLikeFoo
  ImagineALanguageLikeBar

we get

  ImagineALanguageLike[-Foo-]{+Bar+}

(But I cheated. Foo and Bar have no common characters in common; if
they did, the word diff would be messy.)

Both of which seem sensible. From a usability/effectiveness
standpoint, I think it's more useful than what the current word-diff
defaults to - the whole line is taken as a "word", with the pre-image
shown as deleted and the post-image as added; we don't even try to run
LCS on it.

Examples are lifted from:
[1] http://article.gmane.org/gmane.comp.version-control.git/105896
[2] http://article.gmane.org/gmane.comp.version-control.git/105237

--
Cheers,
Ray Chuan
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 2/2] diff --word-diff: use non-whitespace regex by default

Thomas Rast
Tay Ray Chuan <[hidden email]> writes:

> On Thu, Jan 12, 2012 at 4:05 AM, Thomas Rast <[hidden email]> wrote:
>> Tay Ray Chuan <[hidden email]> writes:
>>
>>> Factor out the comprehensive non-whitespace regex in use by PATTERNS and
>>> IPATTERN and use it as the word-diff regex for the default diff driver.
>>
>> Why?

Sorry for distracting you with the performance argument; it was mostly
the first thing that came to my mind that I could use to ask for the
motivation, and evaluation of tradeoffs, that both were missing from the
proposed commit message.

> But I think it's worthwhile to trade-off performance for a sensible
> default. Something like
>
>   matrix[a,b,c]
>   matrix[d,b,c]
>
> gives
>
>   matrix[[-a-]{+d+},b,c]
>
> and when we have
>
>   ImagineALanguageLikeFoo
>   ImagineALanguageLikeBar
>
> we get
>
>   ImagineALanguageLike[-Foo-]{+Bar+}

In that case (and I should have read the original patch), I am
definitely against this change.  It turns the default word-diff into
character-diff, which is something entirely different, and frequently
useless precisely for the reason you state:

> (But I cheated. Foo and Bar have no common characters in common; if
> they did, the word diff would be messy.)

Case in point, consider my patch sent out yesterday

  http://article.gmane.org/gmane.comp.version-control.git/188391

It consists of a one-hunk doc update.  word-diff is not brilliant:

  -k::
          Usually the program [-'cleans up'-]{+removes email cruft from+} the Subject:
          header line to extract the title line for the commit log
          [-message,-]
  [-      among which (1) remove 'Re:' or 're:', (2) leading-]
  [-      whitespaces, (3) '[' up to ']', typically '[PATCH]', and-]
  [-      then prepends "[PATCH] ".-]{+message.+}  This [-flag forbids-]{+option prevents+} this munging, and is most
          useful when used to read back 'git format-patch -k' output.
[snip the rest as it's only {+}]

But character-diff tries too hard to find common subsequences:

  $ g show HEAD^^ --word-diff-regex='[^[:space:]]' | xsel
  -k::
          Usually the program [-'cl-]{+remov+}e[-an-]s {+email cr+}u[-p'-]{+ft from+} the Subject:
          header line to extract the title line for the commit log
          message[-,-]
  [-      among which (1) remove 'Re:' or 're:', (2) leading-]
  [-      w-]{+.  T+}hi[-te-]s[-paces, (3) '[' up t-] o[-']', ty-]p[-ically '[PATCH]', and-]t[-he-]{+io+}n pre[-p-]{+v+}en[-ds "[PATCH] ".  This flag forbid-]{+t+}s this munging, and is most
          useful when used to read back 'git format-patch -k' output.
[snip]

Wouldn't you agree that

  w-]{+.  T+}hi[-te-]s[-paces, (3) '[' up t-] o[-']', ty-]p[

is just line noise?  The colors don't even help as most of it is removed
(red).

Regarding your examples

> [1] http://article.gmane.org/gmane.comp.version-control.git/105896
> [2] http://article.gmane.org/gmane.comp.version-control.git/105237

first please notice that both of them were written before (and actually
discussing) the introduction of the wordRegex feature.  At this point,
we were trying to make up our minds w.r.t. how powerful the feature
needs to be.  Nowadays (or in fact, starting a few days after those
emails) the user can easily achieve everything discussed here by setting
the wordRegex to taste.

That being said, I can see some arguments for changing the default to
split punctuation into a separate word.  That is, whereas the current
default is semantically equivalent to a wordRegex of

  [^[:space:]]*

(but has a faster code path) and your proposal is equivalent to

  [^[:space:]]|UTF_8_GUARD

I think there is a case to be made for a default of

  [^[:space:]]|([[:alnum:]]|UTF_8_GUARD)+

or some such.  There's a lot of bikeshedding lurking in the (non)extent
of the [[:alnum:]] here, however.

--
Thomas Rast
trast@{inf,student}.ethz.ch
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 2/2] diff --word-diff: use non-whitespace regex by default

Tay Ray Chuan
On Thu, Jan 12, 2012 at 5:22 PM, Thomas Rast <[hidden email]> wrote:

> [snip]
> Case in point, consider my patch sent out yesterday
>
>  http://article.gmane.org/gmane.comp.version-control.git/188391
>
> It consists of a one-hunk doc update.  word-diff is not brilliant:
>
>  -k::
>          Usually the program [-'cleans up'-]{+removes email cruft from+} the Subject:
>          header line to extract the title line for the commit log
>          [-message,-]
>  [-      among which (1) remove 'Re:' or 're:', (2) leading-]
>  [-      whitespaces, (3) '[' up to ']', typically '[PATCH]', and-]
>  [-      then prepends "[PATCH] ".-]{+message.+}  This [-flag forbids-]{+option prevents+} this munging, and is most
>          useful when used to read back 'git format-patch -k' output.
> [snip the rest as it's only {+}]
>
> But character-diff tries too hard to find common subsequences:
>
>  $ g show HEAD^^ --word-diff-regex='[^[:space:]]' | xsel
>[snip]
>  w-]{+.  T+}hi[-te-]s[-paces, (3) '[' up t-] o[-']', ty-]p[
>
> is just line noise?  The colors don't even help as most of it is removed
> (red).

You missed the '+' quantifier, as in

  [^[:space:]]+

Using that regex, that abomination of a word-diff that you mentioned
disappears, like this:

-k::
        Usually the program [-'cleans up'-]{+removes email cruft from+} the Subject:
        header line to extract the title line for the commit log
        [-message,-]
[- among which (1) remove 'Re:' or 're:', (2) leading-]
[- whitespaces, (3) '[' up to ']', typically '[PATCH]', and-]
[- then prepends "[PATCH] ".-]{+message.+}  This [-flag
forbids-]{+option prevents+} this munging, and is most
        useful when used to read back 'git format-patch -k' output.

> [snip]
> That being said, I can see some arguments for changing the default to
> split punctuation into a separate word.  That is, whereas the current
> default is semantically equivalent to a wordRegex of
>
>  [^[:space:]]*
>
> (but has a faster code path)

Oh right, there *is* a sensible default implemented in. Somehow I was
under the impression that there wasn't.

I wonder which is faster, using the non-whitespace regex, or the
isspace() calls...

> and your proposal is equivalent to
>
>  [^[:space:]]|UTF_8_GUARD
>
> I think there is a case to be made for a default of
>
>  [^[:space:]]|([[:alnum:]]|UTF_8_GUARD)+
>
> or some such.  There's a lot of bikeshedding lurking in the (non)extent
> of the [[:alnum:]] here, however.

Care to explain further? Not to sure what you mean here.

--
Cheers,
Ray Chuan
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 2/2] diff --word-diff: use non-whitespace regex by default

Thomas Rast
Tay Ray Chuan <[hidden email]> writes:

> On Thu, Jan 12, 2012 at 5:22 PM, Thomas Rast <[hidden email]> wrote:
>> [snip]
>> Case in point, consider my patch sent out yesterday
>>
>>  http://article.gmane.org/gmane.comp.version-control.git/188391
>>
>> It consists of a one-hunk doc update.  word-diff is not brilliant:
>>
>>  -k::
>>          Usually the program [-'cleans up'-]{+removes email cruft from+} the Subject:
>>          header line to extract the title line for the commit log
>>          [-message,-]
>>  [-      among which (1) remove 'Re:' or 're:', (2) leading-]
>>  [-      whitespaces, (3) '[' up to ']', typically '[PATCH]', and-]
>>  [-      then prepends "[PATCH] ".-]{+message.+}  This [-flag forbids-]{+option prevents+} this munging, and is most
>>          useful when used to read back 'git format-patch -k' output.
>> [snip the rest as it's only {+}]
>>
>> But character-diff tries too hard to find common subsequences:
>>
>>  $ g show HEAD^^ --word-diff-regex='[^[:space:]]' | xsel
>>[snip]
>>  w-]{+.  T+}hi[-te-]s[-paces, (3) '[' up t-] o[-']', ty-]p[
>>
>> is just line noise?  The colors don't even help as most of it is removed
>> (red).
>
> You missed the '+' quantifier, as in
>
>   [^[:space:]]+

Did I?  I was working from the example you provided earlier

}   matrix[a,b,c]
}   matrix[d,b,c]
} gives
}   matrix[[-a-]{+d+},b,c]
}
} and when we have
}
}   ImagineALanguageLikeFoo
}   ImagineALanguageLikeBar
} we get
}   ImagineALanguageLike[-Foo-]{+Bar+}

Under [^[:space:]]+ neither of the examples would work.  Actually,
[^[:space:]]+ is the same as today's default, the [^[:space:]]* I
mentioned later is (strictly speaking) broken as it allows for a
0-length match.  (It doesn't really matter because IIRC the engine
ignores 0-length words.)

>> That being said, I can see some arguments for changing the default to
>> split punctuation into a separate word.  That is, whereas the current
>> default is semantically equivalent to a wordRegex of
>>
>>  [^[:space:]]*
>>
>> (but has a faster code path)
>
> Oh right, there *is* a sensible default implemented in. Somehow I was
> under the impression that there wasn't.
>
> I wonder which is faster, using the non-whitespace regex, or the
> isspace() calls...

I tried measuring it across a few commits, but it mostly gets drowned
out by the diff effort.  For a commit with stat

  exercises/cgal/cover/cover.cpp  |    5 +-
  exercises/cgal/cover/cover.in1  |27014 +++++++++++++++-----
  exercises/cgal/cover/cover.in2  |48996 +++++++++++++++++++++++------------
  exercises/cgal/cover/cover.in3  |55041 +++++++++++++++++++++++++--------------
  exercises/cgal/cover/cover.in4  |47600 ++++++++++++++++++++--------------
  exercises/cgal/cover/cover.int  |43491 ++++++++++++++++++++++---------
  exercises/cgal/cover/cover.out1 |   53 +-
  exercises/cgal/cover/cover.out2 |   24 +-
  exercises/cgal/cover/cover.out3 |   11 +-
  exercises/cgal/cover/cover.out4 |    2 +-
  exercises/cgal/cover/cover.outt |   23 +-
  exercises/cgal/cover/gen        |   39 +-
  exercises/cgal/cover/gen-1.cpp  |    4 +-
  exercises/cgal/cover/gen-2.cpp  |    6 +-
  exercises/cgal/cover/gen-3.cpp  |    6 +-

(sorry, can't share as those testcases are secret) I get best-of-5
timings

  --word-diff-regex='[^[:space:]]+'    0:07.50real 7.40user 0.07system
  --word-diff                          0:07.47real 7.41user 0.03system

In conclusion, "meh".  I think ripping out the isspace() part would make
for a nice code reduction.

>> and your proposal is equivalent to
>>
>>  [^[:space:]]|UTF_8_GUARD
>>
>> I think there is a case to be made for a default of
>>
>>  [^[:space:]]|([[:alnum:]]|UTF_8_GUARD)+
>>
>> or some such.  There's a lot of bikeshedding lurking in the (non)extent
>> of the [[:alnum:]] here, however.
>
> Care to explain further? Not to sure what you mean here.

For natural language, it may or may not make sense to match numbers as
part of a word.

For typical use in e.g. emails, a lot of punctuation has a double role;
breaking words in

  http://article.gmane.org/gmane.comp.version-control.git/188391

may or may not make sense.

For some uses, especially source code, it would be better to match an
underscore _ as part of a complete word, too.

For some programming languages, say lisp, a dash - would also belong in
the same category.

There's no real reason other than ease of implementation why the pattern
handles ASCII non-alphanumerics separately, but non-ASCII UTF-8
non-alnums (like, say, unicode NO-BREAK SPACE which would show as \xc2
\xa0) always goes into a word.  But if you were to make UTF-8 sequences
a single word, text in (say) many European languages would become
chunked at accented letters.

I'm sure you can find more items for this list.  It's a grey area.


--
Thomas Rast
trast@{inf,student}.ethz.ch
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 2/2] diff --word-diff: use non-whitespace regex by default

Tay Ray Chuan
On Thu, Jan 19, 2012 at 11:53 PM, Thomas Rast <[hidden email]> wrote:
>[snip]
> Under [^[:space:]]+ neither of the examples would work.  Actually,
> [^[:space:]]+ is the same as today's default, the [^[:space:]]* I
> mentioned later is (strictly speaking) broken as it allows for a
> 0-length match.  (It doesn't really matter because IIRC the engine
> ignores 0-length words.)

My bad.

>[snip]
> I tried measuring it across a few commits, but it mostly gets drowned
> out by the diff effort.  For a commit with stat
>
>  exercises/cgal/cover/cover.cpp  |    5 +-
>  exercises/cgal/cover/cover.in1  |27014 +++++++++++++++-----
>  exercises/cgal/cover/cover.in2  |48996 +++++++++++++++++++++++------------
>  exercises/cgal/cover/cover.in3  |55041 +++++++++++++++++++++++++--------------
>  exercises/cgal/cover/cover.in4  |47600 ++++++++++++++++++++--------------
>  exercises/cgal/cover/cover.int  |43491 ++++++++++++++++++++++---------
>  exercises/cgal/cover/cover.out1 |   53 +-
>  exercises/cgal/cover/cover.out2 |   24 +-
>  exercises/cgal/cover/cover.out3 |   11 +-
>  exercises/cgal/cover/cover.out4 |    2 +-
>  exercises/cgal/cover/cover.outt |   23 +-
>  exercises/cgal/cover/gen        |   39 +-
>  exercises/cgal/cover/gen-1.cpp  |    4 +-
>  exercises/cgal/cover/gen-2.cpp  |    6 +-
>  exercises/cgal/cover/gen-3.cpp  |    6 +-
>
> (sorry, can't share as those testcases are secret) I get best-of-5
> timings
>
>  --word-diff-regex='[^[:space:]]+'    0:07.50real 7.40user 0.07system
>  --word-diff                          0:07.47real 7.41user 0.03system
>
> In conclusion, "meh".  I think ripping out the isspace() part would make
> for a nice code reduction.

Thanks for the numbers. Well, that agrees with the intuition that
regex is slower than isspace(), since you have run it through the
regex engine.

>>> and your proposal is equivalent to
>>>
>>>  [^[:space:]]|UTF_8_GUARD
>>>
>>> I think there is a case to be made for a default of
>>>
>>>  [^[:space:]]|([[:alnum:]]|UTF_8_GUARD)+
>>>
>>> or some such.  There's a lot of bikeshedding lurking in the (non)extent
>>> of the [[:alnum:]] here, however.
>>
>> Care to explain further? Not to sure what you mean here.
>
> For natural language, it may or may not make sense to match numbers as
> part of a word.
>
> For typical use in e.g. emails, a lot of punctuation has a double role;
> breaking words in
>
>  http://article.gmane.org/gmane.comp.version-control.git/188391
>
> may or may not make sense.
>
> For some uses, especially source code, it would be better to match an
> underscore _ as part of a complete word, too.
>
> For some programming languages, say lisp, a dash - would also belong in
> the same category.
>
> There's no real reason other than ease of implementation why the pattern
> handles ASCII non-alphanumerics separately, but non-ASCII UTF-8
> non-alnums (like, say, unicode NO-BREAK SPACE which would show as \xc2
> \xa0) always goes into a word.  But if you were to make UTF-8 sequences
> a single word, text in (say) many European languages would become
> chunked at accented letters.
>
> I'm sure you can find more items for this list.  It's a grey area.

Thanks.

--
Cheers,
Ray Chuan
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html