I was asked to take a look at a server that was no longer responding. The first thing I did after going to the co-lo and booting into single-user repair mode on the console was to look at disk usage.
$ df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda1 70557052 70557052 0 100% /
none 508792 224 508568 1% /dev
none 513040 0 513040 0% /dev/shm
none 513040 1076 511964 1% /var/run
none 513040 0 513040 0% /var/lock
none 513040 0 513040 0% /lib/init/rw
/dev/sdb1 70557052 46954944 20018012 71% /media/Backup
As you can see, usage on the root filesystem went to 100%. Normally it sits close to the usage on the hot backup drive (71% here), so some file must have grown suddenly. When the root filesystem fills up, none of the services respond, so you can’t ssh in and the websites stop answering.
Then I started looking for files that could be causing the problem. I was told that the last time this happened the log files weren’t being erased. I check on this server from time to time and earlier this week it was at 82%, so a slow log build-up didn’t seem likely, but I ran this command anyway:
sudo find /var/log/ -size +20M | xargs du | sort -n -r | less
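For anyone reading along: find picks out everything over 20 MB under /var/log, xargs hands the paths to du, which prints each one’s size in 1K blocks, and sort -n -r puts the biggest first. If any filenames contain spaces, a null-separated variant is a little safer; this is a generic tweak rather than the exact command I ran that night:
sudo find /var/log/ -type f -size +20M -print0 | xargs -0 du -k | sort -n -r | less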
Nothing really jumped out at me, but there were lots of old logs that had been zipped, and some that were labelled ‘old’, so I ran these to get rid of them:
rm *.gz
rm *.old
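In hindsight, a find-based cleanup would have been a bit safer, since bare rm globs like the ones above only act on whatever directory you happen to be sitting in. Something along these lines, sketched after the fact rather than copied from my shell history:
sudo find /var/log/ -type f \( -name '*.gz' -o -name '*.old' \) -delete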
It made a little improvement, but not enough to get df to show less than 100%. The Used figure did drop to 68852672, though, which freed enough space to ssh into the server after a reboot. I kept looking for the source of the large files but had no luck. We have a call in to the guy who fixed it last time, hoping he remembers what he did.
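If it fills up again before we hear back, the usual next steps are a per-directory sweep of the root filesystem with du, and a check for deleted files that some process is still holding open, which is a classic reason df reports space that du can’t account for. Both of these are generic sketches, not commands from this outage:
sudo du -x -k / | sort -n -r | head -n 25    # biggest directories, staying on the root filesystem only
sudo lsof +L1                                # open files with zero links (deleted but still held open)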