
We are having a major malfunction with our database server that started about three days ago. For some reason, database connections are closing unexpectedly, or at least reaching EOF when they shouldn't be.
I did all the normal stuff. I downgraded PostgreSQL from 8.3 back to 8.2, wondering if that was it. That didn't fix the problem. I rebooted the server; that didn't help. I restarted everything like ten times. I posted to the PgSQL mailing list, and I opened a ticket with Rackspace. The PgSQL mailing list was, as always, fairly helpful, but Rackspace was worthless. If there were anybody else who answered customer support requests on a reliable basis and offered bigger servers, I would totally ditch Rackspace, but they are the best of everyone we've worked with, which is sad. I haven't been much of a recipient of fanatical support so far (ServerBeach has been excellent, but they don't really provide the kind of server we need).
After getting few responses that actually helped fix the problem, I decided that the best next step was to strace Tomcat to figure out what was happening on the socket. On the first strace run, the socket raised a SIGPIPE and Tomcat's thread segfaulted :( (but it didn't throw a Java exception, so from the log file it looked like nothing had happened). Unfortunately, I wasn't expecting Tomcat to crash (we've seen the DB process segfault before, but not Tomcat), so I wasn't running with a nonzero core dump limit.
The next few runs revealed little, if anything. Then I got one with clearer information in it. It looks like Tomcat is sending a database request to Postgres and the recv comes back with a return value of 0, which means PostgreSQL is closing the connection. Super weird.
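For anyone wondering why a recv that returns 0 points at the other end: on a stream socket, a zero-length read is how an orderly close by the peer shows up. Here is a minimal Python sketch of that behavior. It is not our Tomcat or PostgreSQL code; the function name is made up for illustration, and a local socketpair stands in for the real connection between the app server and the database.

```python
import socket


def recv_after_peer_close():
    """Return what recv() yields once the peer has closed its end.

    Hypothetical helper: a connected local socket pair stands in for the
    app-server-to-database connection described in the post.
    """
    client, server = socket.socketpair()
    try:
        # The "client" sends a request, as the app server would.
        client.sendall(b"SELECT 1;")

        # The "server" reads the request, then closes without replying.
        server.recv(1024)
        server.close()

        # With the peer gone, recv() returns zero bytes (EOF) instead of
        # blocking or raising. This is the "recv returned 0" case.
        return client.recv(1024)
    finally:
        client.close()


if __name__ == "__main__":
    data = recv_after_peer_close()
    print("bytes received after peer close:", len(data))
```

The empty result only says that the other side closed the connection. It says nothing about why, which is the part that still needs tracking down on the database side.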
I've posted this info to the Postgres mailing list. I'm not sure what else I can do at this point. Maybe my postgresql.conf is stupid, but I don't think I've changed it recently, so I don't know why this started happening all of a sudden.
I think my next step will be to strace PostgreSQL, but that means setting up a new instance. I can't really strace the production instance, because that would generate so much output that it would be impossible to go through it all.