Our story begins yesterday afternoon when /var filled up.
/var is a 15G LVM* volume.

Now, I haven't been doing a lot of sysadmin lately-- which is actually a really
good thing!  Our Wagn server has been running for close to 2 years now, and the
only significant downtime was a couple days this fall when one of the CPUs died.

So I want a semi-permanent fix for the disk space problem, and after checking out the
server, I found two 10G LVM volumes on the primary mirror that weren't being used.
I thought, Great! I'll use these to extend /var. I removed those two volumes and used the
space to add 20G to the /var volume. Now I just need to resize the filesystem.
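
For the record, the LVM side of that is only a couple of commands. A rough sketch-- the spare volume names here are made up, and "hooze" is the volume group:

   # free up the two unused 10G volumes (hypothetical names)
   lvremove /dev/hooze/spare1
   lvremove /dev/hooze/spare2
   # grow the /var logical volume by the reclaimed 20G
   lvextend -L +20G /dev/hooze/var
   # the filesystem inside is still 15G until it gets resized too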

But to resize a filesystem, you need to unmount it. And to unmount a filesystem, there can't
be any processes accessing it. Where does every system process keep its logs and rapidly
changing files? /var. Duh. Scratch that idea.
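
(If you want to see just how hopeless unmounting /var is on a live box, fuser or lsof will happily list everything holding it open:)

   # every process with open files on /var -- on a running server this is a long list
   fuser -vm /var
   # or
   lsof /var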

Plan B: I copied a bunch of stuff out of /var onto another filesystem on our backup mirror
(which has tons of space), then mounted that directory at the appropriate spot in /var.
Services all restarted (with some pid haggling); everything's peachy, right?
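
I don't remember the exact incantation, but it was something along these lines-- /backup standing in for wherever the backup mirror is mounted, and the postgres data directory just as an example of the kind of thing that got moved:

   # copy the bulky bits out to the roomy backup mirror, keeping permissions
   mkdir -p /backup/var-overflow
   cp -a /var/lib/postgresql /backup/var-overflow/postgresql
   mv /var/lib/postgresql /var/lib/postgresql.old
   mkdir /var/lib/postgresql
   # a bind mount puts the copy back where the services expect to find it
   mount --bind /backup/var-overflow/postgresql /var/lib/postgresql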

A few hours later it's all down again. Why? Filesystem errors on the backup mirror have
caused it to be remounted read-only. I still don't know why this happened-- sure, there was
a sudden jump in load on that filesystem, but that's not usually cause for Linux to flake, and I
have no other evidence of a hardware problem.
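
(That remount is the filesystem doing what it's configured to do when it hits errors-- assuming ext3 here, which is a guess on my part. Something like this shows the configured behavior, with /dev/sdb1 standing in for the backup mirror's real device:)

   # continue, remount read-only, or panic?
   tune2fs -l /dev/sdb1 | grep -i 'errors behavior'
   # and the kernel log says why it tripped
   dmesg | grep -i ext3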

I think to myself-- well, I didn't really like adding a system dependency on the backup mirror
anyway, and it's acting flaky, so let's get those parts of /var back onto the primary mirror.
I can use those 20G for a new partition.

I use the LVM tools to reduce the /var volume by 20G, back to its original size,
create a new volume with those 20G, add a filesystem, move the displaced parts of /var there, and mount it.
Now I unmount the backup mirror, fsck it, fix the errors, and remount. Services all restarted fine;
everything's peachy, right?
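
Roughly, with the same made-up names as before (and yes, this is the step where the trouble was planted):

   # shrink /var back down by the 20G added earlier -- the dangerous step, as it turns out
   lvreduce -L -20G /dev/hooze/var
   # fresh volume + filesystem on the reclaimed space
   lvcreate -L 20G -n opt hooze
   mkfs.ext3 /dev/hooze/opt
   mount /dev/hooze/opt /var/overflow        # hypothetical mount point
   # copy the displaced /var bits back onto the primary mirror
   cp -a /backup/var-overflow/. /var/overflow/
   mount --bind /var/overflow/postgresql /var/lib/postgresql
   # drop the bind mounts into the backup mirror, then give it a checkup
   umount /backup/var-overflow/postgresql
   umount /backup
   fsck -y /dev/sdb1
   mount /dev/sdb1 /backup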

Next morning, I have a bad feeling, and sure enough, it's all down again. Why?
   Feb 26 05:25:27 localhost kernel: attempt to access beyond end of device

Scheit! It turns out that in the world of LVM, +20G and -20G may not be exactly the same thing. Definitely my mistake-- sysadmin requires more paranoia than that-- but not the dumbest thing I've done by a long shot.
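
For future reference (mine, mostly): the paranoid way to do that kind of surgery is to work in LVM's own units-- extents-- and to double-check that the volume will still be at least as big as the filesystem sitting on it before reducing anything. A sketch, same hypothetical names:

   # how big does LVM think the volume is, and how big are its extents?
   lvdisplay /dev/hooze/var
   # how big does the filesystem think it is?  (block count x block size)
   dumpe2fs -h /dev/hooze/var | grep -Ei 'block count|block size'
   # reduce by an exact extent count instead of a 'G' figure
   lvreduce -l -5120 /dev/hooze/var       # 5120 extents x 4M = 20G, if extents are 4M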

Is there any chance that if I just add some space back it will be OK? (Seems dicey-- the filesystem really needs to be checked.) No: the LVM tools are broken-- they depend on access to /var. NOW I'm worried.

What's next? Fortunately we have backups! I extract /var from a backup only a few hours old onto the backup mirror. Mounting that directory over the existing /var mount seems to work fine. Back in business again. Amazingly, postgres seems to restart fine despite all this moving and resetting of its files-- I fully expected to have to rebuild the databases.
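
The restore itself was nothing fancy-- something like this, give or take the actual backup format (a tarball here purely for the sketch, and the paths are stand-ins):

   # pull /var out of the freshest backup onto the backup mirror
   cd /backup && tar -xzpf var-backup.tar.gz        # yields /backup/var
   # and mount the result on top of the broken /var
   mount --bind /backup/var /var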

And where are we now? I think I got the original /var partition back in order: I resized the new hooze-opt filesystem built on those spare 20G down to 18G, shrank that partition to 19G, added the freed 1G back to the original /var partition, and repaired that filesystem. But Linux won't undo or overwrite the mounting that has /var using the backup mirror. The downside of this is the additional system dependency and somewhat reduced database performance, since the backup mirror isn't nearly as fast as the primary. But I think it's the best we
can do until the next reboot, which hopefully won't be until the next time I'm in Portland in front of the machine.
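
In command terms, that cleanup was roughly the following (hypothetical names once more, and the order matters: check and shrink the filesystem before shrinking the volume under it):

   # shrink the filesystem on the new volume to 18G, then the volume to 19G,
   # leaving a 1G cushion so the filesystem can never outgrow the volume
   umount /var/overflow
   e2fsck -f /dev/hooze/opt
   resize2fs /dev/hooze/opt 18G
   lvreduce -L 19G /dev/hooze/opt
   mount /dev/hooze/opt /var/overflow
   # hand the freed 1G back to /var and repair its filesystem
   lvextend -L +1G /dev/hooze/var
   e2fsck -fy /dev/hooze/var       # once the original volume isn't in active use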




*LVM = Logical Volume Management, a virtual partitioning
system that lets you layer virtual partitions called 'logical volumes' on top of physical disk
partitions in very flexible ways.

===

That is, indeed, a saga. Thanks for getting it up in time for BeaverBarCamp! --John Abbe