proposal for extending smudge/clean filters with raw file access

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

proposal for extending smudge/clean filters with raw file access

Joey Hess-3
I'm using smudge/clean filters in git-annex now, and it's not been an
entirely smooth fit between the interface and what git-annex wants
to do.

The clean filter has to consume the whole file content on stdin;
not reading it all will make git think the clean filter failed.
But, git-annex often doesn't need to read the whole content of a
work-tree file in order to clean it.

The smudge filter has to output the whole file content to stdout. But
git-annex often has the file's content on disk already, and could just
move it into place in the working tree. This would save CPU and IO and
often disk space too. But the smudge interface doesn't let git-annex use
the efficient approach.

So I propose extending the filter driver with two more optional
commands. Call them raw-clean and raw-smudge for now.

raw-clean would be like clean, but rather than being fed the whole
content of a large file on stdin, it would be passed the filename, and
can access the file itself. Like the clean filter, it outputs the
cleaned version on stdout.

raw-smudge would be like smudge, but rather than needing to output the
whole content of a large file on stdout, it would be passed a filename,
and can create that file itself.

To keep this backwards compatible, and to handle the cases where the
object being filtered is not a file on disk, the smudge and clean
filters would be required to be configured too, in order for raw-clean
and raw-smudge to be used.

It seems fairly easy to implement raw-clean. In sha1_file.c, index_path
would use raw-clean when available, while index_fd etc keep on using
the clean filter. I have not investigated what would be needed to implement
raw-smudge yet.

--
see shy jo
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: proposal for extending smudge/clean filters with raw file access

Junio C Hamano
Joey Hess <[hidden email]> writes:

> I'm using smudge/clean filters in git-annex now, and it's not been an
> entirely smooth fit between the interface and what git-annex wants
> to do.
>
> The clean filter has to consume the whole file content on stdin;
> not reading it all will make git think the clean filter failed.
> But, git-annex often doesn't need to read the whole content of a
> work-tree file in order to clean it.
>
> The smudge filter has to output the whole file content to stdout. But
> git-annex often has the file's content on disk already, and could just
> move it into place in the working tree. This would save CPU and IO and
> often disk space too. But the smudge interface doesn't let git-annex use
> the efficient approach.

The smudge happens to be the last to run, so it is quite true that
it can say "Hey Git, I've written it out already".

I didn't check all codepaths to ensure that we won't need the
smudged result in core at all, but I am guessing you did so before
making this proposal, so in that case I would say this sounds fine.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: proposal for extending smudge/clean filters with raw file access

Junio C Hamano
In reply to this post by Joey Hess-3
Joey Hess <[hidden email]> writes:

> The clean filter has to consume the whole file content on stdin;
> not reading it all will make git think the clean filter failed.
> But, git-annex often doesn't need to read the whole content of a
> work-tree file in order to clean it.

This side, I do not think we even need a new variant.  We can just
update the code to interact with "clean" so that it the writer to
the pipe ignores SIGPIPE, detects EPIPE on write(2), says "ah, the
other end does not need the full input to produce its output".  The
reader from the pipe consumes its output without failing AS LONG AS
the "clean" filter exits with zero (we do check its exit status,
right?)

And I think we should do the same for any codepath that spawns
custom script and feeds it via a pipe from us (I am talking about
hooks here).  

What may require a new variant is when your clean filter may not
even need earlier contents of the file, in which case we are better
off not opening the file ourselves and slurping it into core, only
to pipe it and earlier part discarded by the filter.  "clean-from-fs"
filter that gets a path on the working tree and feeds us via pipe
would be appropriate to deal with such a requirement.

> The smudge filter has to output the whole file content to stdout.

We cannot do a similar "we can just unconditionally update" like I
said on the "clean" side to "smudge", so it would need a new
variant.  I do not want to call it anything "raw", as there is
nothing "raw" about it.  "smudge-to-fs" would be a better name.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: proposal for extending smudge/clean filters with raw file access

Joey Hess-3
Junio C Hamano wrote:
> This side, I do not think we even need a new variant.  We can just
> update the code to interact with "clean" so that it the writer to
> the pipe ignores SIGPIPE, detects EPIPE on write(2), says "ah, the
> other end does not need the full input to produce its output".  The
> reader from the pipe consumes its output without failing AS LONG AS
> the "clean" filter exits with zero (we do check its exit status,
> right?)

There are two problems with doing that. First, any clean filter that
relied on that would not work with older versions of git.

Secondly, and harder to get around, the filename passed to the clean
filter is not necessarily a path to the actual existing file that is
being cleaned. For example, git hash-object --stdin --path=whatever.
So the current clean filter can't really safely rely on accessing the
file to short-circuit its cleaning. (Although it seems to mostly work..
currently..)

> We cannot do a similar "we can just unconditionally update" like I
> said on the "clean" side to "smudge", so it would need a new
> variant.  I do not want to call it anything "raw", as there is
> nothing "raw" about it.  "smudge-to-fs" would be a better name.

"raw" was just a placeholder. "clean-from-fs" and "smudge-to-fs" are
pretty descriptive.

--
see shy jo

signature.asc (828 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: proposal for extending smudge/clean filters with raw file access

Junio C Hamano
Joey Hess <[hidden email]> writes:

> Junio C Hamano wrote:
>> This side, I do not think we even need a new variant.  We can just
>> update the code to interact with "clean" so that it the writer to
>> the pipe ignores SIGPIPE, detects EPIPE on write(2), says "ah, the
>> other end does not need the full input to produce its output".  The
>> reader from the pipe consumes its output without failing AS LONG AS
>> the "clean" filter exits with zero (we do check its exit status,
>> right?)
>
> There are two problems with doing that. First, any clean filter that
> relied on that would not work with older versions of git.

That's a fair point.

> Secondly, and harder to get around, the filename passed to the clean
> filter is not necessarily a path to the actual existing file that is
> being cleaned.

Either one of us is confused.  I was talking about updating the
current "clean" implementation without changing its interface,
i.e. gets fed via its standard input, expected to respond to its
standard output.  There is no filename involved.

And yes, "clean-from-fs" that is spawned by us with the name of the
file in the filesystem that has the contents as its argument, must
be prepared to see a filename that does not have any relation to the
filename in the working tree.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: proposal for extending smudge/clean filters with raw file access

Joey Hess-3
In reply to this post by Junio C Hamano
Junio C Hamano wrote:
> The smudge happens to be the last to run, so it is quite true that
> it can say "Hey Git, I've written it out already".
>
> I didn't check all codepaths to ensure that we won't need the
> smudged result in core at all, but I am guessing you did so before
> making this proposal, so in that case I would say this sounds fine.

Well, the idea is to only use smudge-to-file when the smudged content is
going to be written out to a file. Any other code paths that need to
smudge some content would use the smudge filter.

So, try_create_file would use it. Maybe some other places I have not
found yet could as well.

--
see shy jo

signature.asc (828 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: proposal for extending smudge/clean filters with raw file access

Joey Hess-3
In reply to this post by Junio C Hamano
Junio C Hamano wrote:
> > Secondly, and harder to get around, the filename passed to the clean
> > filter is not necessarily a path to the actual existing file that is
> > being cleaned.
>
> Either one of us is confused.  I was talking about updating the
> current "clean" implementation without changing its interface,
> i.e. gets fed via its standard input, expected to respond to its
> standard output.  There is no filename involved.

I'm talking about the %f that can be passed to the clean filter.
The context that I left out is that my clean filter could avoid reading
all of stdin, and quickly produce the cleaned result, but only if it
can examine the file that's being cleaned. Which is not currently
entirely safe to use the %f for.

There may be a way to make a clean filter that can do something useful
without reading all of stdin, and without examining the file that's
being cleaned. Maybe. Hard to see how. I don't feel such a hypothetical
clean filter is worth changing the current EPIPE behavior to support.

So I think it's better to add a separate clean-from-fs and keep the
current clean filter interface as it stands.

--
see shy jo

signature.asc (828 bytes) Download Attachment