Is there a --stat or --numstat like option that'll allow me to have my cake and eat it too?

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

Is there a --stat or --numstat like option that'll allow me to have my cake and eat it too?

Ævar Arnfjörð Bjarmason
I maintain a hook for Git that allows you to block binary pushes[1],
from other implementations I've seen it's the least stupid thing out
there that does that.

Basically on-push it parses this:

    git log --pretty=format:%H -M100% --stat=9000,9001 <old_ref>..<new_ref>

The --stat=9000,9001 is there to make sure we still get the filename
if it's long[2].

It's important that this is something like "git-log" instead of
"git-show for each" for performance (think a push with hundreds of
commits). It's also important that it's not "git diff" (think a push
that adds/removes a huge binary file within one push). I also don't
want to manually parse "git log --numstat -p" or whatever for
performance reasons since every push hangs on this.

It's somewhat of a pain to parse that  --stat output, because I have
to look for /\|\s+Bin / in the output to detect binary changes.

You might be thinking "why don't you use --numstat?". Because while
that option does most of what I want it doesn't show the old/new size
of the binary file, so I can't have a policy to allow e.g. <=1KB files
without doing a second pass with --stat or "git show".

Both formats also have various parsing edge cases, e.g. with -M100% I
have to parse out renames like "foo.png => bar.png", but you can also
create a file with " => " in the filename and there's no way to
disambiguate it.

Both formats also only show lines added/deleted, but --numstat doesn't
show the size before/after for binary files, so if I want to also
prohibit huge non-binary files I can't without running both --stat and
--numstat.

What I really want is something for git-log more like
git-for-each-ref, so I could emit the following info for each file
being modified delimited by some binary marker:

    - file name before
    - file name after
    - is rename?
    - is binary?
    - size in bytes before
    - size it bytes after
    - removed lines
    - added lines

I think no combination of git-log options or any built-in machinery
comes close to giving me all of that without having to do multiple
passes with some combination of git-log and git-show, but I'd love to
be proven wrong.

1. https://github.com/avar/pre-receive-reject-binaries
2. OVER NINE THOUSAND should be enough for everyone, right?
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Is there a --stat or --numstat like option that'll allow me to have my cake and eat it too?

Junio C Hamano
Ævar Arnfjörð Bjarmason <[hidden email]> writes:

> What I really want is something for git-log more like
> git-for-each-ref, so I could emit the following info for each file
> being modified delimited by some binary marker:
>
>     - file name before
>     - file name after
>     - is rename?
>     - is binary?
>     - size in bytes before
>     - size it bytes after
>     - removed lines
>     - added lines
>
> I think no combination of git-log options or any built-in machinery
> comes close to giving me all of that without having to do multiple
> passes with some combination of git-log and git-show, but I'd love to
> be proven wrong.

I do not think such a thing exists.  From the look of the above
list, if I were implementing it, I'd think it would be the easiest
if it is built as a new output format of "diff" that sits next to
existing --stat, --patch, and --numstat formats, i.e. you would be
writing a new aevars_stat_consume() callback function and calling
xdi_diff_outf() like everybody else.  A possible output format may
look as if we are showing "log --patch" output with a bit more
extended diff header lines (e.g. in addition to "rename from", etc.,
you would have "bytes before" and other new types of headers), but
without the actual patch text.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Is there a --stat or --numstat like option that'll allow me to have my cake and eat it too?

Jeff King
In reply to this post by Ævar Arnfjörð Bjarmason
On Tue, Mar 08, 2016 at 04:08:21PM +0100, Ævar Arnfjörð Bjarmason wrote:

> What I really want is something for git-log more like
> git-for-each-ref, so I could emit the following info for each file
> being modified delimited by some binary marker:
>
>     - file name before
>     - file name after
>     - is rename?
>     - is binary?
>     - size in bytes before
>     - size it bytes after
>     - removed lines
>     - added lines

If you get the full sha1s of each object (e.g., by adding --raw), then
you can dump them all to a single cat-file invocation to efficiently get
the sizes.

I'm not quite sure I understand why you want to know about renames and
added/removed lines if you are just blocking binary files. If I were
implementing this[1], I'd probably just block based on blob size, which
you can do with:

  git rev-list --objects $old..$new |
  git cat-file --batch-check='%(objectsize) %(objectname) %(rest)' |
  perl -alne 'print if $F[0] > 1_000_000; # or whatever' |
  while read size sha1 file; do
        echo "Whoops, $file ($sha1) is too big"
        exit 1
  done

You can also use %(objectsize:disk) to get the on-disk size (which can
tell you about things that don't compress well, which tend to be the
sorts of things you are trying to keep out).

You can't ask about binary-ness, but I don't think it would unreasonable
for cat-file to have a "would git consider this content binary?"
placeholder for --batch-check.

The other things are properties of the comparison, not of individual
objects, so you'll have to get them from "git log". But with some clever
scripting, I think you could feed those sha1s (or $commit:$path
specifiers) into a single cat-file invocation to get the before/after
sizes.

-Peff

[1] GitHub has hard and soft limits for various blob sizes, and at one
    point the implementation looked very similar to what I showed here.
    The downside is that for a large push, the rev-list can actually
    take a fair bit of time (e.g., consider pushing up all of the kernel
    history to a brand new repo), and this is on top of the similar work
    already done by index-pack and check_everything_connected().

    These days I have a hacky patch to notice the too-big size directly
    in index-pack, which is essentially free. It doesn't know about the
    file path, so we pull that out later in the pre-receive hook. But we
    only have to do so in the uncommon case that there _is_ actually a
    too-big file, so normal pushes incur no penalty.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Is there a --stat or --numstat like option that'll allow me to have my cake and eat it too?

Ævar Arnfjörð Bjarmason
On Tue, Mar 8, 2016 at 9:51 PM, Jeff King <[hidden email]> wrote:

> On Tue, Mar 08, 2016 at 04:08:21PM +0100, Ævar Arnfjörð Bjarmason wrote:
>
>> What I really want is something for git-log more like
>> git-for-each-ref, so I could emit the following info for each file
>> being modified delimited by some binary marker:
>>
>>     - file name before
>>     - file name after
>>     - is rename?
>>     - is binary?
>>     - size in bytes before
>>     - size it bytes after
>>     - removed lines
>>     - added lines
>
> If you get the full sha1s of each object (e.g., by adding --raw), then
> you can dump them all to a single cat-file invocation to efficiently get
> the sizes.
>
> I'm not quite sure I understand why you want to know about renames and
> added/removed lines if you are just blocking binary files. If I were
> implementing this[1], I'd probably just block based on blob size, which
> you can do with:

I want to know about renames because if you're just moving an existing
binary file around that's fine, it's not adding a new big blob to the
repo.

The hook also has a facility to commit binary stuff if you add "yes I
know what I'm doing and want to commit N bytes to the repo" to the
commit message. Mostly when people do this it's an accident.

I wanted to know about added/removed lines because I was looking into
extending this non-binary files. Today at work someone committed 300MB
of text files to a branch, we could delete it in that case, but it
would also be nice to have limits on that sort of thing too.

>   git rev-list --objects $old..$new |
>   git cat-file --batch-check='%(objectsize) %(objectname) %(rest)' |
>   perl -alne 'print if $F[0] > 1_000_000; # or whatever' |
>   while read size sha1 file; do
>         echo "Whoops, $file ($sha1) is too big"
>         exit 1
>   done
>
> You can also use %(objectsize:disk) to get the on-disk size (which can
> tell you about things that don't compress well, which tend to be the
> sorts of things you are trying to keep out).
>
> You can't ask about binary-ness, but I don't think it would unreasonable
> for cat-file to have a "would git consider this content binary?"
> placeholder for --batch-check.
>
> The other things are properties of the comparison, not of individual
> objects, so you'll have to get them from "git log". But with some clever
> scripting, I think you could feed those sha1s (or $commit:$path
> specifiers) into a single cat-file invocation to get the before/after
> sizes.
>
> -Peff
>
> [1] GitHub has hard and soft limits for various blob sizes, and at one
>     point the implementation looked very similar to what I showed here.
>     The downside is that for a large push, the rev-list can actually
>     take a fair bit of time (e.g., consider pushing up all of the kernel
>     history to a brand new repo), and this is on top of the similar work
>     already done by index-pack and check_everything_connected().
>
>     These days I have a hacky patch to notice the too-big size directly
>     in index-pack, which is essentially free. It doesn't know about the
>     file path, so we pull that out later in the pre-receive hook. But we
>     only have to do so in the uncommon case that there _is_ actually a
>     too-big file, so normal pushes incur no penalty.

All good tips / insights. I'll definitely check some of this out.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html