[PATCH 0/2] minor cat-file optimizations

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

[PATCH 0/2] minor cat-file optimizations

Jeff King
At Git Merge, Charles suggested that "--buffer" should be the default
for "--batch-all-objects". This bit me again yesterday, so I thought I'd
actually implement it. :)

That's in the second patch. The first one is another micro-optimization
that happens to help the use case I was working on (the speedup there is
"only" 40%, but the case I was looking at had a pathological number of
packs, and so was even slower).

  [1/2]: cat-file: avoid noop calls to sha1_object_info_extended
  [2/2]: cat-file: default to --buffer when --batch-all-objects is used

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 0/2] minor cat-file optimizations

Jeff King
On Wed, May 18, 2016 at 12:52:53PM -0400, Jeff King wrote:

> At Git Merge, Charles suggested that "--buffer" should be the default
> for "--batch-all-objects". This bit me again yesterday, so I thought I'd
> actually implement it. :)
>
> That's in the second patch. The first one is another micro-optimization
> that happens to help the use case I was working on (the speedup there is
> "only" 40%, but the case I was looking at had a pathological number of
> packs, and so was even slower).
>
>   [1/2]: cat-file: avoid noop calls to sha1_object_info_extended
>   [2/2]: cat-file: default to --buffer when --batch-all-objects is used

Whoops, meant to actually cc Charles.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

[PATCH 1/2] cat-file: avoid noop calls to sha1_object_info_extended

Jeff King
It is not unreasonable to ask cat-file for a batch-check
format of simply "%(objectname)". At first glance this seems
like a noop (you are generally already feeding the object
names on stdin!), but it has a few uses:

  1. With --batch-all-objects, you can generate a listing of
     the sha1s present in the repository, without any input.

  2. You do not have to feed sha1s; you can feed arbitrary
     sha1 expressions and have git resolve them en masse.

  3. You can even feed a raw sha1, with the result that git
     will tell you whether we actually have the object or
     not.

In case 3, the call to sha1_object_info is useful; it tells
us whether the object exists or not (technically we could
swap this out for has_sha1_file, but the cost is roughly the
same).

In case 2, the existence check is of debatable value. A
mass-resolution might prefer performance to safety (against
outputting a value for a corrupted ref, for example).
However, the object lookup cost is likely not as noticeable
compared to the resolution cost. And since we have provided
that safety in the past, the conservative choice is to keep
it.

In case 1, though, the object lookup is a definite noop; we
know about the object because we found it in the object
database. There is no new information gained by making the
call.

This patch detects that case and optimizes out the call.
Here are best-of-five timings for linux.git:

  [before]
  $ time git cat-file --buffer \
                      --batch-all-objects \
                      --batch-check='%(objectname)'
  real    0m2.117s
  user    0m2.044s
  sys     0m0.072s

  [after]
  $ time git cat-file --buffer \
                      --batch-all-objects \
                      --batch-check='%(objectname)'
  real    0m1.230s
  user    0m1.176s
  sys     0m0.052s

There are two implementation details to note here.

One is that we detect the noop case by seeing that "struct
object_info" does not request any information. But besides
object existence, there is one other piece of information
which sha1_object_info may fill in: whether the object is
cached, loose, or packed. We don't currently provide that
information in the output, but if we were to do so later,
we'd need to take note and disable the optimization in that
case.

And that leads to the second note. If we were to output
that information, a better implementation would be to
remember where we saw the object in --batch-all-objects in
the first place, and avoid looking it up again by sha1.

In fact, we could probably squeeze out some extra
performance for less-trivial cases, too, by remembering the
pack location where we saw the object, and going directly
there to find its information (like type, size, etc). That
would in theory make this optimization unnecessary.

I didn't pursue that path here for two reasons:

  1. It's non-trivial to implement, and has memory
     implications. Because we sort and de-dup the list of
     output sha1s, we'd have to record the pack information
     for each object, too.

  2. It doesn't save as much as you might hope. It saves the
     find_pack_entry() call, but getting the size and type
     for deltified objects requires walking down the delta
     chain (for the real type) or reading the delta data
     header (for the size). These costs tend to dominate the
     non-trivial cases.

By contrast, this optimization is easy and self-contained,
and speeds up a real-world case I've used.

Signed-off-by: Jeff King <[hidden email]>
---
 builtin/cat-file.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 54db118..5d3180e 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -154,6 +154,13 @@ struct expand_data {
  * elements above, so you can retrieve the response from there.
  */
  struct object_info info;
+
+ /*
+ * This flag will be true if the requested batch format and options
+ * don't require us to call sha1_object_info, which can then be
+ * optimized out.
+ */
+ unsigned skip_object_info : 1;
 };
 
 static int is_atom(const char *atom, const char *s, int slen)
@@ -258,7 +265,8 @@ static void batch_object_write(const char *obj_name, struct batch_options *opt,
 {
  struct strbuf buf = STRBUF_INIT;
 
- if (sha1_object_info_extended(data->sha1, &data->info, LOOKUP_REPLACE_OBJECT) < 0) {
+ if (!data->skip_object_info &&
+    sha1_object_info_extended(data->sha1, &data->info, LOOKUP_REPLACE_OBJECT) < 0) {
  printf("%s missing\n", obj_name ? obj_name : sha1_to_hex(data->sha1));
  fflush(stdout);
  return;
@@ -369,6 +377,13 @@ static int batch_objects(struct batch_options *opt)
  strbuf_expand(&buf, opt->format, expand_format, &data);
  data.mark_query = 0;
 
+ if (opt->all_objects) {
+ struct object_info empty;
+ memset(&empty, 0, sizeof(empty));
+ if (!memcmp(&data.info, &empty, sizeof(empty)))
+ data.skip_object_info = 1;
+ }
+
  /*
  * If we are printing out the object, then always fill in the type,
  * since we will want to decide whether or not to stream.
--
2.8.2.888.gecb1fe3

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

[PATCH 2/2] cat-file: default to --buffer when --batch-all-objects is used

Jeff King
In reply to this post by Jeff King
Traditionally cat-file's batch-mode does not do any output
buffering. The reason is that a caller may have pipes
connected to its input and output, and would want to use
cat-file interactively, getting output immediately for each
input it sends.

This may involve a lot of small write() calls, which can be
slow. So we introduced --buffer to improve this, but we
can't turn it on by default, as it would break the
interactive case above.

However, when --batch-all-objects is used, we do not read
stdin at all. We generate the output ourselves as quickly as
possible, and then exit. In this case buffering is a strict
win, and it is simply a hassle for the user to have to
remember to specify --buffer.

This patch makes --buffer the default when --batch-all-objects
is used. Specifying "--buffer" manually is still OK, and you
can even override it with "--no-buffer" if you're a
masochist (or debugging).

For some real numbers, running:

  git cat-file --batch-all-objects --batch-check='%(objectname)'

on torvalds/linux goes from:

  real    0m1.464s
  user    0m1.208s
  sys     0m0.252s

to:

  real    0m1.230s
  user    0m1.172s
  sys     0m0.056s

for a 16% speedup.

Suggested-by: Charles Bailey <[hidden email]>
Signed-off-by: Jeff King <[hidden email]>
---
 builtin/cat-file.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 5d3180e..618103f 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -504,6 +504,7 @@ int cmd_cat_file(int argc, const char **argv, const char *prefix)
 
  git_config(git_cat_file_config, NULL);
 
+ batch.buffer_output = -1;
  argc = parse_options(argc, argv, prefix, options, cat_file_usage, 0);
 
  if (opt) {
@@ -527,6 +528,9 @@ int cmd_cat_file(int argc, const char **argv, const char *prefix)
  usage_with_options(cat_file_usage, options);
  }
 
+ if (batch.buffer_output < 0)
+ batch.buffer_output = batch.all_objects;
+
  if (batch.enabled)
  return batch_objects(&batch);
 
--
2.8.2.888.gecb1fe3
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html