Native write performance via short-circuit option

Teijo Holzer
Hi,

we are experimenting with FUSE to implement our own pass-through file system
and we have come across the performance & throughput issues previously
discussed on the FUSE mailing list:

2011:
http://fuse.996288.n3.nabble.com/quot-Passthrough-quot-file-descriptor-patch-td8002.html
2012:
http://fuse.996288.n3.nabble.com/bypassing-read-write-for-mirror-fs-td2522.html
2013:
http://fuse.996288.n3.nabble.com/Pass-through-filesystem-advice-requested-td11438.html

We have done some write tests with dd against tmpfs and the results are as follows:

File size: 2g
Fuse options: big_writes

Native (tmpfs): 1.2 GB/s ( 4k/8k/16k block size)

FUSE to tmpfs: 20.7 MB/s ( 4k block size)
FUSE to tmpfs: 39.4 MB/s ( 8k block size)
FUSE to tmpfs: 76.9 MB/s (16k block size)

Writing (and reading, if using the FUSE direct_io option on mount) via FUSE has
a significant performance impact. Using the splice_write option does not seem
to make a difference here.
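
For readers who want to repeat this kind of measurement without dd, here is a minimal sketch in C. It is not the tool used for the numbers above; the function name write_throughput and the choice of one write(2) call per block are assumptions made for illustration. Point it at a file on the native file system and at the same file seen through the FUSE mount, then compare.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: write `total` zero bytes to `path` using one
 * write(2) call per `block` bytes, and return the throughput in MB/s
 * (10^6 bytes per second). Returns a negative value on error. */
double write_throughput(const char *path, size_t total, size_t block)
{
    if (block == 0)
        return -1.0;

    char *buf = calloc(1, block);
    if (!buf)
        return -1.0;

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) {
        free(buf);
        return -1.0;
    }

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    size_t done = 0;
    while (done < total) {
        size_t want = (total - done < block) ? total - done : block;
        ssize_t n = write(fd, buf, want);
        if (n <= 0) {               /* error or no progress: give up */
            close(fd);
            free(buf);
            return -1.0;
        }
        done += (size_t)n;          /* short writes are simply continued */
    }
    assert(done == total);

    clock_gettime(CLOCK_MONOTONIC, &end);
    close(fd);
    free(buf);

    double secs = (double)(end.tv_sec - start.tv_sec)
                + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
    if (secs <= 0.0)
        secs = 1e-9;                /* guard against a zero interval */
    return (double)total / 1e6 / secs;
}
```

Calling write_throughput(path, 2UL << 30, 4096) would correspond roughly to the 2g / 4k case, assuming the target has enough free space.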

Therefore, we have made some changes to the kernel FUSE driver to short-circuit
the read/write paths against the file descriptor returned by open, so they
never hit the FUSE daemon.

Our patch to the kernel FUSE driver works as follows:

- on mount, kernel looks up pid of FUSE daemon (via current->group_leader->pid)
- on open, FUSE daemon stores file descriptor in existing fuse_file_info->fh field
- on read/write, kernel performs the following steps:
-- look up task_struct for stored FUSE daemon pid
-- look up file_struct via task->files, fcheck_files, fh
-- call vfs_read/vfs_write with file_struct from FUSE daemon

With our performance patch, we now achieve native performance (16k block size):

Patched FUSE to tmpfs: 798 MB/s ( 4k block size)
Patched FUSE to tmpfs: 1.0 GB/s ( 8k block size)
Patched FUSE to tmpfs: 1.2 GB/s (16k block size)

The code of the performance patch looks something like the following (error
checking has been omitted):

// pid_daemon contains the pid of the FUSE daemon
// fh contains the open file descriptor from the FUSE daemon
rcu_read_lock();
struct task_struct *task = pid_task(find_get_pid(pid_daemon), PIDTYPE_PID);
get_task_struct(task);
rcu_read_unlock();
task_lock(task);
struct file *file = fcheck_files(task->files, fh);
task_unlock(task);
int ret = vfs_write(file, buf, count, &pos);
put_task_struct(task);

The changes to libfuse are minimal, no existing wire data structures need
changing. We simply pass the new short-circuit option via the existing
capability flag. The pid can be obtained automatically by the kernel and the
file descriptor is already part of the current implementation.

The changes to the kernel will only take effect if the short-circuit option exists.

So this approach is fully backwards & wire compatible with any existing FUSE
installations.

We would like to add this patch to the mainline FUSE repository, as many other
people have also experienced this issue. We would welcome any suggestions and
feedback on how to achieve this.

I also have a few questions regarding this approach:

- Is there a general problem accessing file pointers like this outside the
current task (see kcmp) ?
- Is it OK to hold on to the task_struct across the vfs_write call ?
- What about fget/fput (file->f_count) ? Do we need them even when using
get_task_struct ?

Thank you for your consideration, we are looking forward to making this
short-circuit option part of the standard FUSE kernel module and library.

     Teijo Holzer

------------------------------------------------------------------------------
CenturyLink Cloud: The Leader in Enterprise Cloud Services.
Learn Why More Businesses Are Choosing CenturyLink Cloud For
Critical Workloads, Development Environments & Everything In Between.
Get a Quote or Start a Free Trial Today.
http://pubads.g.doubleclick.net/gampad/clk?id=119420431&iu=/4140/ostg.clktrk
_______________________________________________
fuse-devel mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/fuse-devel

Re: Native write performance via short-circuit option

Sven Utcke-5
Hi,

> Therefore, we have made some changes to the kernel FUSE driver to
> short-circuit the read/write paths against the file descriptor
> returned by open, so they never hit the FUSE daemon.

I wish you all the best of luck with your patch - alas, past
experience shows that patches of that sort are not exactly welcome,
even though many of us would wish they were...

Sven
--
  __ _  _ __  __ __
 / _` || '  \ \ \ /                                   http://www.svenutcke.de/
 \__, ||_|_|_|/_\_\                                    http://www.dr-utcke.de/
 |___/     Key fingerprint =  6F F8 55 1C F9 E3 A8 F7  09 DF F7 2C 25 0C 54 53


Re: Native write performance via short-circuit option

Mike Shal
On Fri, Jan 17, 2014 at 8:38 AM, Sven Utcke <[hidden email]> wrote:

> Hi,
>
> > Therefore, we have made some changes to the kernel FUSE driver to
> > short-circuit the read/write paths against the file descriptor
> > returned by open, so they never hit the FUSE daemon.
>
> I wish you all the best of luck with your patch - alas, past
> experience shows that patches of that sort are not exactly welcome,
> even though many of us would wish they were...
>
>
Me too - I would love to see support for this :). Maybe one of these years
it will get through...

-Mike

Re: Native write performance via short-circuit option

Nikolaus Rath
Mike Shal <[hidden email]> writes:

> On Fri, Jan 17, 2014 at 8:38 AM, Sven Utcke <[hidden email]> wrote:
>
>> Hi,
>>
>> > Therefore, we have made some changes to the kernel FUSE driver to
>> > short-circuit the read/write paths against the file descriptor
>> > returned by open, so they never hit the FUSE daemon.
>>
>> I wish you all the best of luck with your patch - alas, past
>> experience shows that patches of that sort are not exactly welcome,
>> even though many of us would wish they were...
>>
>>
> Me too - I would love to see support for this :). Maybe one of these years
> it will get through...

May I suggest putting this patch in some public repository/webpage if it
does not get accepted? I am sure Miklos would not mind linking to it
from the FUSE homepage, and this would at least prevent people from
reimplementing the same functionality every couple of months.

(I'm not going to do this myself because I have no interest in the
functionality).


Best,
Nikolaus

--
Encrypted emails preferred.
PGP fingerprint: 5B93 61F8 4EA2 E279 ABF6  02CF A9AD B7F8 AE4E 425C

             »Time flies like an arrow, fruit flies like a Banana.«


Re: Native write performance via short-circuit option

Teijo Holzer
In reply to this post by Sven Utcke-5
Hi,

maybe there is an issue with this particular approach of accessing a file
struct from a foreign process, which I would agree is probably not the cleanest
solution (security, stability).

Maybe we can try a different approach. There is already a standard mechanism
for passing open file descriptors between different unrelated processes via
sendmsg/recvmsg:

SCM_RIGHTS (see man 7 unix and scm_fp_copy in net/core/scm.c)

This essentially allows the dup'ing of open file descriptors across unrelated
process boundaries via a socket.
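
As a point of reference, this is roughly what the userspace side of SCM_RIGHTS looks like. It is a minimal sketch of the standard sendmsg/recvmsg mechanism described in unix(7), not part of any FUSE patch; the helper names send_fd and recv_fd are made up for illustration.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send the open descriptor `fd` over the connected UNIX socket `sock`.
 * Returns 0 on success, -1 on error. */
int send_fd(int sock, int fd)
{
    char dummy = 'F';               /* carry one byte of ordinary data */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };

    union {                         /* union forces cmsghdr alignment */
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctrl;
    memset(&ctrl, 0, sizeof(ctrl));

    struct msghdr msg;
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    assert(cmsg != NULL);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor from `sock`. Returns a new descriptor that
 * refers to the same open file as the sender's, or -1 on error. */
int recv_fd(int sock)
{
    char dummy;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };

    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctrl;
    memset(&ctrl, 0, sizeof(ctrl));

    struct msghdr msg;
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    if (recvmsg(sock, &msg, 0) <= 0)
        return -1;

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    int fd;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The receiver ends up with its own descriptor for the same open file, so the sender may close its copy afterwards without affecting the receiver.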

So another suggested approach for this patch would be:

- add a FOPEN_SCM_RIGHTS open_flag
- the FUSE daemon opens the file, sets FOPEN_SCM_RIGHTS and returns the fd
- the kernel uses a variation of scm_fp_copy to dup this fd into the client
process and returns that fd
- read/write can now operate directly on this fd (no need to access any parts
of the FUSE daemon process)
- the FUSE daemon can even close its own fd now (as the client process has a dup)

This seems to be a cleaner solution using existing approved mechanisms for fd
exchanging between processes.

Miklos, is it worth implementing this approach to be considered for a patch or
is it likely to be rejected ?

Cheers,

        T.

On 18/01/14 02:38, Sven Utcke wrote:

> Hi,
>
>> Therefore, we have made some changes to the kernel FUSE driver to
>> short-circuit the read/write paths against the file descriptor
>> returned by open, so they never hit the FUSE daemon.
>
> I wish you all the best of luck with your patch - alas, past
> experience shows that patches of that sort are not exactly welcome,
> even though many of us would wish they were...
>
> Sven
>



Re: Native write performance via short-circuit option

Jean-Pierre André
Hi,

Teijo Holzer wrote:

> Hi,
>
> maybe there is an issue with this particular approach of accessing a file
> struct from a foreign process, which I would agree is probably not the cleanest
> solution (security, stability).
>
> Maybe we can try a different approach. There is already a standard mechanism
> for passing open file descriptors between different unrelated processes via
> sendmsg/recvmsg:
>
> SCM_RIGHTS (see man 7 unix and scm_fp_copy in net/core/scm.c)
>
> This essentially allows the dup'ing of open file descriptors across unrelated
> process boundaries via a socket.
>
> So another suggested approach for this patch would be:
>
> - add a FOPEN_SCM_RIGHTS open_flag
> - the FUSE daemon opens the file, sets FOPEN_SCM_RIGHTS and returns the fd
> - the kernel uses a variation of scm_fp_copy to dup this fd into the client
> process and returns that fd
> - read/write can now operate directly on this fd (no need to access any parts
> of the FUSE daemon process)

How does the user-space file system get called
when the client issues a read or write ?

Jean-Pierre

> - the FUSE daemon can even close its own fd now (as the client process has a dup)
>
> This seems to be a cleaner solution using existing approved mechanisms for fd
> exchanging between processes.
>
> Miklos, is it worth implementing this approach to be considered for a patch or
> is it likely to be rejected ?
>
> Cheers,
>
> T.
>
> On 18/01/14 02:38, Sven Utcke wrote:
>> Hi,
>>
>>> Therefore, we have made some changes to the kernel FUSE driver to
>>> short-circuit the read/write paths against the file descriptor
>>> returned by open, so they never hit the FUSE daemon.
>> I wish you all the best of luck with your patch - alas, past
>> experience shows that patches of that sort are not exactly welcome,
>> even though many of us would wish they were...
>>
>> Sven
>>
>



Re: Native write performance via short-circuit option

Teijo Holzer
Hi Jean-Pierre,

> How does the user-space file system get called
> when the client issues a read or write ?

it doesn't. The short-circuit option is meant to provide open/close callbacks
into the FUSE daemon but bypass read/write altogether for performance reasons.

This option is useful when there is only a need for doing file path translation
on open, but no need for the read/write paths to enter the FUSE daemon.
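
To make the use case concrete, here is a minimal sketch of what "path translation on open" amounts to on the daemon side. It deliberately avoids the libfuse API so that it stays self-contained: translate_path and passthrough_open are hypothetical helpers, and the uint64_t out-parameter merely stands in for the fuse_file_info->fh field mentioned earlier in the thread.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Map a path as seen through the mount (e.g. "/a/b") onto the backing
 * directory. Returns 0 on success, -1 if the result does not fit. */
int translate_path(const char *backing_root, const char *path,
                   char *out, size_t out_size)
{
    int n = snprintf(out, out_size, "%s%s", backing_root, path);
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}

/* Hypothetical open handler: translate the path, open the backing file
 * and hand the descriptor back through *fh. With a short-circuit
 * option this would be the only point where the daemon is involved. */
int passthrough_open(const char *backing_root, const char *path,
                     int flags, uint64_t *fh)
{
    char real[4096];

    assert(fh != NULL);
    if (translate_path(backing_root, path, real, sizeof(real)) != 0)
        return -1;

    /* The mode argument is only used when flags contain O_CREAT. */
    int fd = open(real, flags, 0600);
    if (fd < 0)
        return -1;

    *fh = (uint64_t)fd;
    return 0;
}
```

All subsequent reads and writes would then go straight to the descriptor stored in *fh, without another round trip to the daemon.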

Cheers,

        T.



Re: Native write performance via short-circuit option

Miklos Szeredi
On Tue, Jan 21, 2014 at 9:40 PM, Teijo Holzer <[hidden email]> wrote:

> Hi Jean-Pierre,
>
>> How does the user-space file system get called
>> when the client issues a read or write ?
>
> it doesn't. The short-circuit option is meant to provide open/close callbacks
> into the FUSE daemon but bypass read/write altogether for performance reasons.
>
> This option is useful when there is only a need for doing file path translation
> on open, but no need for the read/write paths to enter the FUSE daemon.

Teijo,

First, it'd be nice if we could see the actual patch.

My main objection to this thing is the interface strangulation it
involves.  For example using fi->fh in the kernel as an actual file
descriptor is completely bogus.  To the kernel the "fh" is an opaque
identifier, it should only ever be used by the userspace filesystem,
never by the kernel.

We do have an interface for passing file descriptors into the library
and that is fuse_bufvec.  For example we could use
fuse_lowlevel_notify_store() to tell the kernel that (part of) a file
is equivalent to (part of) another one.

Anyway, I'm not convinced that this optimization is actually needed.
I see the write performance numbers, but they are not surprising and
Maxim Patlasov has been working on that for a long time.  The issue
here is delaying writes, like any other filesystem, so that the
kernel can coalesce small writes into big requests.   I'll try to put
that patchset in a testable form into git, and then you can see if the
numbers are good enough.
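
To illustrate the idea of coalescing (and only the idea), here is a small userspace sketch. It is not the kernel writeback patchset referred to above: the struct coalescer type, the 128 KiB limit and the assumption of purely sequential writes are all made up for the example. Small writes are buffered and a "big request" is counted each time the buffer fills.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define COALESCE_CAP (128 * 1024)   /* hypothetical request size limit */

struct coalescer {
    char   buf[COALESCE_CAP];
    size_t used;        /* bytes currently buffered */
    size_t requests;    /* big requests issued so far */
    size_t bytes_out;   /* total bytes handed on in requests */
};

/* Issue one big request for whatever is buffered. In this sketch the
 * request is only counted; a real implementation would send it on. */
void coalescer_flush(struct coalescer *c)
{
    if (c->used == 0)
        return;
    c->requests++;
    c->bytes_out += c->used;
    c->used = 0;
}

/* Accept a small sequential write and flush whenever the buffer fills. */
void coalescer_write(struct coalescer *c, const char *data, size_t len)
{
    while (len > 0) {
        size_t room = COALESCE_CAP - c->used;
        size_t n = (len < room) ? len : room;

        memcpy(c->buf + c->used, data, n);
        c->used += n;
        data += n;
        len -= n;

        assert(c->used <= COALESCE_CAP);
        if (c->used == COALESCE_CAP)
            coalescer_flush(c);
    }
}
```

With these numbers, 64 writes of 4 KiB each turn into 2 requests of 128 KiB, which is where the bandwidth gain would come from. The data also sits in the buffer until the flush, which is the latency cost raised in the next reply.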

And then there's still room for optimization in the splice
implementation, which in theory could allow zero copy reads and single
copy writes.

Thanks,
Miklos


Re: Native write performance via short-circuit option

Fox, Kevin M
Coalescing small writes into big ones helps bandwidth, but hurts latency. It's a use-case-dependent tradeoff.

Random reads/writes from a database-like access pattern would be hurt, not helped, by coalescing.

Getting as much machinery out of the way in that access pattern helps improve latency, which ultimately helps system throughput.

For an example, look at running a MySQL database on bare metal with an attached disk vs. running it in a VM with a network-attached block device. There are plenty of benchmarks out there. For example:
http://www.terena.org/activities/tf-storage/ws15/slides/20130919-Ceph_OpenStack.pdf

See slides 31, 32, and 33.

They manage to tune the network a bit and get better numbers than that, but are still a whole order of magnitude slower, even while using faster disks (SSDs).

I know this is an extreme example, but I know there are other HPC codes that are file system latency bound out there that behave similarly.

Thanks,
Kevin

________________________________________
From: Miklos Szeredi [[hidden email]]
Sent: Wednesday, January 22, 2014 8:33 AM
To: Teijo Holzer
Cc: fuse-devel
Subject: Re: [fuse-devel] Native write performance via short-circuit option

On Tue, Jan 21, 2014 at 9:40 PM, Teijo Holzer <[hidden email]> wrote:

> Hi Jean-Pierre,
>
>  > How does the user-space file system gets called
>> when the client issues a read or write ?
>
> it doesn't. The short-circuit option is meant to provide open/close callbacks
> into the FUSE daemon but bypass read/write altogether for performance reasons.
>
> This option is useful when there is only a need for doing file path translation
> on open, but no need for the read/write paths to enter the FUSE daemon.

Teijo,

First, it'd be nice if we could see the actual patch.

My main objection to this thing is the interface strangulation it
involves.  For example using fi->fh in the kernel as an actual file
descriptor is completely bogus.  To the kernel the "fh" is an opaque
identifier, it should only ever be used by the userspace filesystem,
never by the kernel.

We do have an interface for passing file descriptors into the library
and that is fuse_bufvec.  For example we could use
fuse_lowlevel_notify_store() to tell the kernel that (part of) a file
is equivalent to (part of) another one.

Anyway, I'm not convinced that this optimization is actually needed.
I see the write performance numbers, but they are not surprising and
Maxim Patlasov has been working on that for a long time.  The issue
here is delaying writes, like any other filesystem, so that the
kernel can coalesce small writes into big requests.   I'll try to put
that patchset in a testable form into git, and then you can see if the
numbers are good enough.

And then there's still room for optimization in the splice
implementation, which in theory could allow zero copy reads and single
copy writes.

Thanks,
Miklos


Re: Native write performance via short-circuit option

Mike Shal
In reply to this post by Miklos Szeredi
Hi Miklos,

On Wed, Jan 22, 2014 at 11:33 AM, Miklos Szeredi <[hidden email]> wrote:

>
> Anyway, I'm not convinced that this optimization is actually needed.
> I see the write performance numbers, but they are not surprising and
> Maxim Patlasov has been working on that for a long time.  The issue
> here is delaying writes, like any other filesystem, so that the
> kernel can coalesce small writes into big requests.   I'll try to put
> that patchset in a testable form into git, and then you can see if the
> numbers are good enough.
>

I'd be happy to test it as well. But what about read performance?


>
> And then there's still room for optimization in the splice
> implementation, which in theory could allow zero copy reads and single
> copy writes.
>

What profiling analysis are you looking at to come to this conclusion? When
I run callgrind on my filesystem or fusexmp_fh, I don't see the memory
copies as a bottleneck. I ran a large linking command in my FUSE fs through
callgrind and annotated it with --inclusive=yes. I get:

  2,329,370,357  /build/buildd/eglibc-2.17/nptl/pthread_create.c:start_thread [/lib/x86_64-linux-gnu/libpthread-2.17.so]
  2,105,567,176  /home/marf/tup/.tup/mnt/@tupjob-6788/home/marf/tup/src/tup/server/fuse_server.c:fuse_thread [/home/marf/tup/tup]
  2,105,557,243  /home/marf/fuse/lib/helper.c:fuse_main_real [/home/marf/install-fuse-2.9.2/lib/libfuse.so.2.9.2]
  2,105,557,241  /home/marf/fuse/lib/helper.c:fuse_main_common [/home/marf/install-fuse-2.9.2/lib/libfuse.so.2.9.2]
  2,100,878,866  /home/marf/fuse/lib/fuse.c:fuse_loop [/home/marf/install-fuse-2.9.2/lib/libfuse.so.2.9.2]
  2,100,878,088  /home/marf/fuse/lib/fuse_loop.c:fuse_session_loop [/home/marf/install-fuse-2.9.2/lib/libfuse.so.2.9.2]
  2,016,631,115  /home/marf/fuse/lib/fuse_session.c:fuse_session_process_buf [/home/marf/install-fuse-2.9.2/lib/libfuse.so.2.9.2]
  2,014,182,390  /home/marf/fuse/lib/fuse_lowlevel.c:fuse_ll_process_buf [/home/marf/install-fuse-2.9.2/lib/libfuse.so.2.9.2]
*   928,798,725  /home/marf/fuse/lib/fuse.c:fuse_lib_write_buf [/home/marf/install-fuse-2.9.2/lib/libfuse.so.2.9.2]
    648,151,120  /home/marf/fuse/lib/fuse_lowlevel.c:do_read [/home/marf/install-fuse-2.9.2/lib/libfuse.so.2.9.2]
*   644,343,792  /home/marf/fuse/lib/fuse.c:fuse_lib_read [/home/marf/install-fuse-2.9.2/lib/libfuse.so.2.9.2]
    568,147,557  /home/marf/fuse/lib/fuse.c:get_path_common [/home/marf/install-fuse-2.9.2/lib/libfuse.so.2.9.2]
    553,772,941  /home/marf/fuse/lib/fuse.c:get_path_nullok [/home/marf/install-fuse-2.9.2/lib/libfuse.so.2.9.2]
    526,118,927  /home/marf/fuse/lib/fuse.c:try_get_path [/home/marf/install-fuse-2.9.2/lib/libfuse.so.2.9.2]
    332,964,195  /home/marf/fuse/lib/fuse.c:add_name [/home/marf/install-fuse-2.9.2/lib/libfuse.so.2.9.2]
*   316,559,433  /home/marf/fuse/lib/fuse.c:fuse_fs_write_buf [/home/marf/install-fuse-2.9.2/lib/libfuse.so.2.9.2]

So it looks to me like the overhead is from repeated calls to fuse_lib_read
and fuse_lib_write_buf, i.e. the read() and write() calls that I'd like to
eliminate by using a FUSE passthrough implementation. These account for a
large chunk of the entire FUSE thread. Note also the large difference
between fuse_lib_write_buf() and fuse_fs_write_buf() - my actual
file-system write function hasn't even shown up yet, so even if it was
replaced with a no-op there is still a huge amount of overhead in libfuse
itself.

I'd be interested to know:

1) What filesystem you are profiling (fusexmp_fh? fusexmp? something else?)
2) What options you pass to the filesystem (-obig_writes, -osplice_read,
etc)
3) How you are driving the file-system (using dd? Or what?)
4) Why you think splice will help

Thanks,
-Mike

Re: Native write performance via short-circuit option

Teijo Holzer
In reply to this post by Miklos Szeredi
Hi Miklos,

thanks very much for taking the time to respond.

I've attached the kernel patch to this email, I didn't bother with the libfuse
patch as that is trivial (it simply adds another command-line option & puts the
bit into the capability flag).

The kernel patch adds a field "pid_daemon" to "struct fuse_conn" (fuse_i.h) and
populates it automatically at mount time with the pid of the FUSE daemon
(inode.c). The main work is done in file.c.

I've applied the short-circuit patch to fuse_send_read & fuse_send_write only,
so currently, the "short-circuit" option also requires the use of the
"direct_io" option (which is fine, as any caching is done by the next layer
down anyway, e.g. NFS).

With respect to the kernel looking up a file descriptor from user-land
(fi->fh), this is what happens with normal read/write syscalls anyway.

Only the short-circuit option attaches semantic meaning to fi->fh, the user can
pass whatever they want (just like with read/write syscalls). If fi->fh is not
a valid, open file descriptor inside the FUSE daemon, the client simply gets EIO.

I'd be more than happy to test Maxim Patlasov's write patch; I've been following
progress on that too. It is a large change set though.

With respect to using the splice options, it seems to me that the actual number
of calls into user-space is the performance killer (rather than the amount of
data that needs to be copied). This particular performance problem seems to be
CPU-bound.

This is shown by getting higher throughput when using larger block sizes. The
data that needs to be copied is the same, however the number of context
switches decreases.
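
A quick back-of-the-envelope check of this argument, using the unpatched figures from the first post in this thread (20.7, 39.4 and 76.9 MB/s at 4k, 8k and 16k). The sketch assumes that dd's MB means 10^6 bytes and that each write() becomes exactly one request to the daemon; both are assumptions, and the helper names are made up.

```c
#include <assert.h>

/* Requests per second implied by a throughput in MB/s (10^6 bytes per
 * second) and a block size in bytes, assuming one request per block. */
double requests_per_second(double mb_per_s, double block_bytes)
{
    assert(block_bytes > 0.0);
    return mb_per_s * 1e6 / block_bytes;
}

/* Average wall-clock cost of a single request, in microseconds. */
double micros_per_request(double mb_per_s, double block_bytes)
{
    return 1e6 / requests_per_second(mb_per_s, block_bytes);
}
```

Under those assumptions all three block sizes come out at roughly 4,700 to 5,100 requests per second, i.e. around 200 microseconds per request regardless of how much data the request carries. That is consistent with a fixed per-call cost dominating, rather than the copy.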

Again, thanks very much for taking the time to look at this.

        Teijo Holzer

FUSE read/write short-circuit kernel patch
==========================================

diff -uprN /usr/src/linux-source-3.8.0-33/fs/fuse/file.c fs/fuse/file.c
--- /usr/src/linux-source-3.8.0-33/fs/fuse/file.c 2013-11-11 15:58:19.000000000 +1300
+++ fs/fuse/file.c 2014-01-23 10:02:07.712889000 +1300
@@ -15,6 +15,9 @@
  #include <linux/module.h>
  #include <linux/compat.h>
  #include <linux/swap.h>
+#include <linux/pid.h>
+#include <linux/pid_namespace.h>
+#include <linux/fdtable.h>

  static const struct file_operations fuse_direct_io_file_operations;

@@ -491,8 +494,55 @@ void fuse_read_fill(struct fuse_req *req
  req->out.args[0].size = count;
  }

+static struct file *fuse_get_file_ptr(struct fuse_file *ff, struct task_struct **task)
+{
+ struct file *file = NULL;
+ *task = NULL;
+
+ if (!ff || !ff->fc || !ff->fc->pid_daemon || !ff->fh)
+ {
+ printk(KERN_INFO "fuse_get_file_ptr: no ff/ff->fc/ff->fc->pid_daemon/ff->fh\n");
+ return file;
+ }
+
+ rcu_read_lock();
+
+ *task = pid_task(find_get_pid(ff->fc->pid_daemon), PIDTYPE_PID);
+ if (!(*task))
+ {
+ printk(KERN_INFO "fuse_get_file_ptr: no task for %u\n", ff->fc->pid_daemon);
+ rcu_read_unlock();
+ return file;
+ }
+
+ get_task_struct(*task);
+ task_lock(*task);
+
+ if ((*task)->files)
+ file = fcheck_files((*task)->files, ff->fh);
+
+ rcu_read_unlock();
+ task_unlock(*task);
+
+ return file;
+}
+
+static ssize_t fuse_do_vfs_read(struct fuse_req *req, struct fuse_file *ff, loff_t pos, size_t count, const char __user *buf)
+{
+ ssize_t ret = -1;
+ struct task_struct *task = NULL;
+ struct file *file = fuse_get_file_ptr(ff, &task);
+ if (!file || !buf)
+ printk(KERN_INFO "fuse_do_vfs_read: no file %p or buf %p, please use direct_io\n", file, buf);
+ else
+ ret = vfs_read(file, (char __user *)buf, count, &pos);
+ if (task)
+ put_task_struct(task);
+ return ret;
+}
+
  static size_t fuse_send_read(struct fuse_req *req, struct file *file,
-     loff_t pos, size_t count, fl_owner_t owner)
+     loff_t pos, size_t count, fl_owner_t owner, const char __user *buf)
  {
  struct fuse_file *ff = file->private_data;
  struct fuse_conn *fc = ff->fc;
@@ -504,6 +554,10 @@ static size_t fuse_send_read(struct fuse
  inarg->read_flags |= FUSE_READ_LOCKOWNER;
  inarg->lock_owner = fuse_lock_owner_id(fc, owner);
  }
+
+ if (fc && fc->pid_daemon)
+    return fuse_do_vfs_read(req, ff, pos, count, buf);
+
  fuse_request_send(fc, req);
  return req->out.args[0].size;
  }
@@ -555,7 +609,7 @@ static int fuse_readpage(struct file *fi
  req->out.argpages = 1;
  req->num_pages = 1;
  req->pages[0] = page;
- num_read = fuse_send_read(req, file, pos, count, NULL);
+ num_read = fuse_send_read(req, file, pos, count, NULL, NULL);
  err = req->out.h.error;
  fuse_put_request(fc, req);

@@ -744,8 +798,22 @@ static void fuse_write_fill(struct fuse_
  req->out.args[0].value = outarg;
  }

+static size_t fuse_do_vfs_write(struct fuse_req *req, struct fuse_file *ff, loff_t pos, size_t count, const char __user *buf)
+{
+ ssize_t ret = -1;
+ struct task_struct *task = NULL;
+ struct file *file = fuse_get_file_ptr(ff, &task);
+ if (!file || !buf)
+ printk(KERN_INFO "fuse_do_vfs_write: no file %p or buf %p, please use direct_io\n", file, buf);
+ else
+ ret = vfs_write(file, buf, count, &pos);
+ if (task)
+ put_task_struct(task);
+ return ret;
+}
+
  static size_t fuse_send_write(struct fuse_req *req, struct file *file,
-      loff_t pos, size_t count, fl_owner_t owner)
+      loff_t pos, size_t count, fl_owner_t owner, const char __user *buf)
  {
  struct fuse_file *ff = file->private_data;
  struct fuse_conn *fc = ff->fc;
@@ -757,6 +825,10 @@ static size_t fuse_send_write(struct fus
  inarg->write_flags |= FUSE_WRITE_LOCKOWNER;
  inarg->lock_owner = fuse_lock_owner_id(fc, owner);
  }
+
+ if (fc && fc->pid_daemon)
+ return fuse_do_vfs_write(req, ff, pos, count, buf);
+
  fuse_request_send(fc, req);
  return req->misc.write.out.size;
  }
@@ -784,7 +856,7 @@ static size_t fuse_send_write_pages(stru
  for (i = 0; i < req->num_pages; i++)
  fuse_wait_on_page_writeback(inode, req->pages[i]->index);

- res = fuse_send_write(req, file, pos, count, NULL);
+ res = fuse_send_write(req, file, pos, count, NULL, NULL);

  offset = req->page_offset;
  count = res;
@@ -1087,9 +1159,9 @@ ssize_t fuse_direct_io(struct file *file
  }

  if (write)
- nres = fuse_send_write(req, file, pos, nbytes, owner);
+ nres = fuse_send_write(req, file, pos, nbytes, owner, buf);
  else
- nres = fuse_send_read(req, file, pos, nbytes, owner);
+ nres = fuse_send_read(req, file, pos, nbytes, owner, buf);

  fuse_release_user_pages(req, !write);
  if (req->out.h.error) {
diff -uprN /usr/src/linux-source-3.8.0-33/fs/fuse/fuse_i.h fs/fuse/fuse_i.h
--- /usr/src/linux-source-3.8.0-33/fs/fuse/fuse_i.h 2013-02-19 12:58:34.000000000 +1300
+++ fs/fuse/fuse_i.h 2014-01-23 09:56:51.765414000 +1300
@@ -528,6 +528,9 @@ struct fuse_conn {

  /** Read/write semaphore to hold when accessing sb. */
  struct rw_semaphore killsb;
+
+ /** FUSE daemon pid for FUSE_SHORTCIRCUIT option */
+ u32 pid_daemon;
  };

  static inline struct fuse_conn *get_fuse_conn_super(struct super_block *sb)
diff -uprN /usr/src/linux-source-3.8.0-33/fs/fuse/inode.c fs/fuse/inode.c
--- /usr/src/linux-source-3.8.0-33/fs/fuse/inode.c 2013-02-19 12:58:34.000000000 +1300
+++ fs/fuse/inode.c 2014-01-23 09:59:15.180539000 +1300
@@ -582,6 +582,7 @@ void fuse_conn_init(struct fuse_conn *fc
  fc->reqctr = 0;
  fc->blocked = 1;
  fc->attr_version = 1;
+ fc->pid_daemon = 0;
  get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
  }
  EXPORT_SYMBOL_GPL(fuse_conn_init);
@@ -846,6 +847,11 @@ static void process_init_reply(struct fu
  if (arg->minor >= 17) {
  if (!(arg->flags & FUSE_FLOCK_LOCKS))
  fc->no_flock = 1;
+ // TODO: Add the following magic number to fuse.h as FUSE_SHORTCIRCUIT
+ if (arg->flags & (1 << 31))
+ fc->pid_daemon = current->group_leader->pid;
+ else
+ fc->pid_daemon = 0;
  } else {
  if (!(arg->flags & FUSE_POSIX_LOCKS))
  fc->no_flock = 1;
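
One detail of the patch above that may deserve a second look: fuse_do_vfs_write is declared to return size_t, yet it starts from "ssize_t ret = -1" and forwards the result of vfs_write, which is negative on error. Callers such as fuse_send_write_pages treat the return value as a byte count. The userspace sketch below only illustrates that conversion. The names lossy_result and split_result are made up for this note and are not part of the patch or of the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/* Same shape as fuse_do_vfs_write: a signed result is pushed through
 * an unsigned return type, so -1 arrives at the caller as SIZE_MAX. */
static size_t lossy_result(ssize_t ret)
{
    return (size_t)ret;
}

/* Hypothetical alternative: hand back the byte count and the error
 * separately, so a failure can never be mistaken for a huge write. */
static size_t split_result(ssize_t ret, int *err)
{
    assert(err != NULL);
    if (ret < 0) {
        *err = (int)ret;   /* negative errno-style value */
        return 0;
    }
    *err = 0;
    return (size_t)ret;
}
```

With this shape, a caller checks err first and only then trusts the count. Whether the real driver would rather set req->out.h.error is a design question for the patch author, not something this sketch settles.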



On 23/01/14 05:33, Miklos Szeredi wrote:

> On Tue, Jan 21, 2014 at 9:40 PM, Teijo Holzer <[hidden email]> wrote:
>> Hi Jean-Pierre,
>>
>>> How does the user-space file system get called
>>> when the client issues a read or write?
>>
>> It doesn't. The short-circuit option is meant to provide open/close callbacks
>> into the FUSE daemon but bypass read/write altogether for performance reasons.
>>
>> This option is useful when there is only a need for doing file path translation
>> on open, but no need for the read/write paths to enter the FUSE daemon.
>
> Teijo,
>
> First, it'd be nice if we could see the actual patch.
>
> My main objection to this thing is the interface strangulation it
> involves.  For example using fi->fh in the kernel as an actual file
> descriptor is completely bogus.  To the kernel the "fh" is an opaque
> identifier, it should only ever be used by the userspace filesystem,
> never by the kernel.
>
> We do have an interface for passing file descriptors into the library
> and that is fuse_bufvec.  For example we could use
> fuse_lowlevel_notify_store() to tell the kernel that (part of) a file
> is equivalent to (part of) another one.
>
> Anyway, I'm not convinced that this optimization is actually needed.
> I see the write performance numbers, but they are not surprising and
> Maxim Patlasov has been working on that for a long time.  The issue
> here is delaying writes, like any other filesystem, so that the
> kernel can coalesce small writes into big requests.  I'll try to put
> that patchset in a testable form into git, and then you can see if the
> numbers are good enough.
>
> And then there's still room for optimization in the splice
> implementation, which in theory could allow zero copy reads and single
> copy writes.
>
> Thanks,
> Miklos
>
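
To make the coalescing argument quoted above concrete, here is a small userspace sketch of the idea: contiguous small writes are held back and merged until a request-sized buffer is full, so the number of requests drops even though the byte count stays the same. It is not the writeback patchset mentioned above, and the 128 KiB limit (MAX_REQ) and the names coalescer, co_write and co_flush are assumptions made purely for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_REQ (128 * 1024)   /* assumed request size limit, illustrative */

struct coalescer {
    char      buf[MAX_REQ];
    long long start;       /* file offset of buf[0] */
    size_t    len;         /* bytes currently pending */
    unsigned  requests;    /* requests "sent" so far */
    size_t    bytes_sent;  /* total bytes in those requests */
};

/* Send whatever is pending as one request. */
static void co_flush(struct coalescer *c)
{
    if (c->len == 0)
        return;
    c->requests++;
    c->bytes_sent += c->len;
    c->len = 0;
}

/* Queue a write; flush only when the buffer is full or the new
 * write does not directly follow the pending data. */
static void co_write(struct coalescer *c, const void *data, size_t n,
                     long long off)
{
    const char *p = data;

    while (n > 0) {
        if (c->len > 0 &&
            (off != c->start + (long long)c->len || c->len == MAX_REQ))
            co_flush(c);
        if (c->len == 0)
            c->start = off;

        size_t room = MAX_REQ - c->len;
        size_t chunk = n < room ? n : room;

        memcpy(c->buf + c->len, p, chunk);
        c->len += chunk;
        p += chunk;
        off += (long long)chunk;
        n -= chunk;
    }
}
```

Under these assumptions, 64 sequential 4k writes followed by a final co_flush go out as 2 requests instead of 64, which is the effect delayed writes are meant to have on the per-request overhead.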


------------------------------------------------------------------------------
CenturyLink Cloud: The Leader in Enterprise Cloud Services.
Learn Why More Businesses Are Choosing CenturyLink Cloud For
Critical Workloads, Development Environments & Everything In Between.
Get a Quote or Start a Free Trial Today.
http://pubads.g.doubleclick.net/gampad/clk?id=119420431&iu=/4140/ostg.clktrk
_______________________________________________
fuse-devel mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/fuse-devel

Re: Native write performance via short-circuit option

Nikolaus Rath
In reply to this post by Miklos Szeredi
Miklos Szeredi <[hidden email]> writes:
> We do have an interface for passing file descriptors into the library
> and that is fuse_bufvec.  For example we could use
> fuse_lowlevel_notify_store() to tell the kernel that (part of) a file
> is equivalent to (part of) another one.

This (in contrast to telling the kernel that the *entire* file is
equivalent to another) would be a very useful feature for me.

> Anyway, I'm not convinced that this optimization is actually needed.

For the S3QL file system, this would make a big difference. S3QL is
written in Python, so being able to bypass all the Python machinery for
the majority of writes would improve performance significantly and
increase parallelism (Python has a global interpreter lock).

Writing a shim layer in C that bypasses Python if a known mapping to an
existing fd exists has been on my plate for a while. If this
functionality were available in the kernel, it'd make my life much
easier :-).
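
A rough sketch of what such a shim could look like, assuming file handles are small integers and the mapping to a backing fd is registered at open time. This is not S3QL code: shim_map, shim_unmap, shim_write, SHIM_SLOTS and SHIM_FALLBACK are hypothetical names, and only pwrite() is a real (POSIX) call.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define SHIM_SLOTS    64
#define SHIM_FALLBACK (-2)  /* hypothetical sentinel: take the slow path */

/* Backing fd + 1 per file handle; 0 means "no known mapping". */
static int shim_fd[SHIM_SLOTS];

/* Record that writes to handle fh may go straight to fd. */
static int shim_map(uint64_t fh, int fd)
{
    if (fh >= SHIM_SLOTS || fd < 0)
        return -1;
    shim_fd[fh] = fd + 1;
    return 0;
}

/* Forget the mapping, e.g. on release. */
static void shim_unmap(uint64_t fh)
{
    if (fh < SHIM_SLOTS)
        shim_fd[fh] = 0;
}

/* Write directly if a mapping exists; otherwise tell the caller to
 * fall back to the interpreter. pwrite() reports errors as -1, so
 * the sentinel cannot be confused with a failed direct write. */
static ssize_t shim_write(uint64_t fh, const void *buf, size_t n, off_t off)
{
    if (fh >= SHIM_SLOTS || shim_fd[fh] == 0)
        return SHIM_FALLBACK;
    return pwrite(shim_fd[fh] - 1, buf, n, off);
}
```

The write handler would call shim_write first and only enter Python when it sees SHIM_FALLBACK, so mapped files never touch the interpreter lock on the write path.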


Best,
Nikolaus

--
Encrypted emails preferred.
PGP fingerprint: 5B93 61F8 4EA2 E279 ABF6  02CF A9AD B7F8 AE4E 425C

             »Time flies like an arrow, fruit flies like a Banana.«


Re: Native write performance via short-circuit option

Han-Wen Nienhuys-5
On Wed, Jan 22, 2014 at 11:16 PM, Nikolaus Rath <[hidden email]> wrote:

>> Anyway, I'm not convinced that this optimization is actually needed.
>
> For the S3QL file system, this would make a big difference. S3QL is
> written in Python, so being able to bypass all the Python machinery for
> the majority of writes would improve performance significantly and
> increase parallelism (Python has a global interpreter lock).
>
> Writing a shim layer in C that bypasses Python if a known mapping to an
> existing fd exists has been on my plate for a while. If this
> functionality were available in the kernel, it'd make my life much
> easier :-).

Python is not a very good choice for high-performance systems, and it
seems wrong to change the FUSE interface to work around Python's
woeful scalability.

(If you want decent performance with an easy-to-use programming
language, I'd recommend go-fuse instead, but I'm obviously biased :-)



--
Han-Wen Nienhuys - [hidden email] - http://www.xs4all.nl/~hanwen
