server stability problems running postgres/zfs/fuse/centos5/6/7

Justin Pryzby
Hi,

We've seen for the last 12 months, on various customers' disparate servers and
VMs, an issue which seems to involve fuse-ZFS, where many processes get "stuck"
and queue up, probably making the problem worse, or at least not allowing
the system to recover.  Typically killing individual processes (including
postgres backend children) doesn't "unstick" the rest, and we have to forcibly
stop postgres (kill -9, causing it to go into recovery mode replaying its
WAL/journal).  *But* at least once I recall stopping our application and
regaining use of the system.

strace on a stuck process doesn't output anything.  But, I did see that many
processes have ps -o wchan => "wait_a" (wait_answer_interruptible?)
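A minimal sketch of how the wait-channel check above could be automated.  It
assumes a procps-style ps that accepts a column width (wchan:32), so the name
is not cut short the way "wait_a" is; the function names count_wchan and
snapshot are made up for illustration.

```python
import subprocess
from collections import Counter


def count_wchan(ps_output):
    """Count processes per wait channel.

    Expects the text produced by `ps -eo pid,wchan:32,comm`:
    one header line, then one line per process.
    """
    counts = Counter()
    for line in ps_output.splitlines()[1:]:  # skip the header line
        fields = line.split(None, 2)  # pid, wchan, rest of the line
        if len(fields) >= 2:
            counts[fields[1]] += 1
    return counts


def snapshot():
    """Run ps once and return the per-wchan counts (assumes procps ps)."""
    result = subprocess.run(
        ["ps", "-eo", "pid,wchan:32,comm"],
        capture_output=True, text=True, check=True,
    )
    return count_wchan(result.stdout)
```

Calling snapshot().most_common(5) during an event would show whether most
blocked processes share a single wait channel.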

One consistent symptom is that df/stat don't respond.  Also many postgres
processes (but not all) get stuck, and end up queueing up until even SELECT *
FROM pg_stat_activity doesn't work (probably due to a stuck process holding a
write lock on pg_class?).
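Since df/stat hanging is the most consistent symptom, one way to detect an
event without hanging the monitoring script itself is to run the filesystem
call in a child process with a timeout.  This is only a sketch: probe_mount
and the 5-second default are assumptions, not something taken from our setup.

```python
import subprocess
import sys


def probe_mount(path, timeout=5.0):
    """Return True if statvfs(path) completes within `timeout` seconds.

    The call runs in a separate Python process so that a hung
    filesystem blocks only the child, not the caller.
    """
    child = subprocess.Popen(
        [sys.executable, "-c",
         "import os, sys; os.statvfs(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        return child.wait(timeout=timeout) == 0
    except subprocess.TimeoutExpired:
        # Best effort only: we deliberately do not wait for the child
        # again, because it may stay blocked in the kernel for a while.
        child.kill()
        return False
```

A cron job could call probe_mount() on each ZFS mount point and record the
time of the first failure, to correlate with the fuse debug logs.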

It seems to me this can't be a postgres problem, since df/stat/etc freeze..
But, we have at least two nondefault postgres tablespaces, one of which is a
temp_tablespace, used for sorts and temp tables.  I suspect that's relevant,
since one stuck query I saw was in the process of committing a transaction which
had previously done: "CREATE TEMPORARY TABLE x ON COMMIT DROP".

Also, we have (too) many postgres inheritance children (200 for now).  In the
past, I suspected this may have something to do with many postgres locks, or
with the number of open files (most of which are on ZFS/fuse), or with
effective_io_concurrency, or with transparent hugepages, but I'm not confident
about any of those...

Some of the servers where this happens are centos5 on raw hardware, and some
are centos6 VMs.  I just reproduced once using centos7
(3.10.0-514.16.1.el7.x86_64 and fuse-2.9.2-7.el7.x86_64).  In all but two,
recent cases, the "stuck"ness has been infrequent and random.  Twice, recently,
it's happened 5-10 times in a day.  In one case, I was able to avoid the issue
by combining some small child tables into a smaller number of larger tables.
In another case, that didn't work, but we worked around the issue by disabling
a cronjob, which would've been accessing a large number of children (total over
3000 children of ~30 parent tables, with the largest number of children per
parent being 225).
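For reference, the per-parent child counts quoted above can be tallied from
the pg_inherits catalog.  The sketch below is illustrative: the query string
and the helper name children_per_parent are assumptions, and the rows are
expected to be (parent, child) pairs fetched with whatever client is at hand.

```python
from collections import Counter

# Assumed query: pg_inherits holds one row per (child, parent) pair.
CHILDREN_QUERY = (
    "SELECT inhparent::regclass::text, inhrelid::regclass::text "
    "FROM pg_inherits"
)


def children_per_parent(rows):
    """Return (parent, child_count) pairs, largest count first.

    `rows` is an iterable of (parent_name, child_name) tuples, for
    example the result set of CHILDREN_QUERY.
    """
    counts = Counter(parent for parent, _child in rows)
    return counts.most_common()
```

Summing the counts gives the total number of children, and the first entry
gives the parent with the most children.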

I haven't been able to reproduce the problem on demand, but wondered if you had
seen or heard anything similar, or had any ideas what to try..

Let me know if I should report to LKML or otherwise.  I'm going to try to
gather some fuse debug logs for the next event.

Thanks in advance.
Justin

--
fuse-devel mailing list
To unsubscribe or subscribe, visit https://lists.sourceforge.net/lists/listinfo/fuse-devel

Re: server stability problems running postgres/zfs/fuse/centos5/6/7

Justin Pryzby
Here's the tail end of a debug log from zfs-fuse; I can provide more context
if needed.  Does this help?  Do I need to send this somewhere else?

Thanks

On Mon, Jul 31, 2017 at 11:38:55AM -0500, Justin Pryzby wrote:

> [...]

Attachment: zfs-fuse.debug5~x7.gz-tail.gz (7K)

Re: server stability problems running postgres/zfs/fuse/centos5/6/7

Antonio SJ Musumeci
Unless you can show that this is a general FUSE or libfuse issue, you should
probably contact the zfs-fuse project.

On Tue, Aug 1, 2017 at 6:16 PM, Justin Pryzby <[hidden email]> wrote:
> [...]
