Cracked by ZFS

2020-08-02

About Finding Cracked Systems

There’s a distinct feel to finding a compromised machine. It makes your stomach float. It’s like coming home and finding the door open. Did I really forget to close the firewall door? Did I leave the window of that unpatched PHP application open? Everything seems to be in its place, but is it? Was somebody here? Did they touch something? … Maybe … maybe it’s alright? Maybe there’s another explanation for all this? However this turns out, you already know it will cancel all your plans for the day. If you’re lucky.

I’ve had that feeling a few times in my career. You don’t want that. Even if it’s somebody else’s box. Even if it’s not your fault (be it personally or as a service provider). It’s a mess.

Something isn’t Right

The other morning I logged into my server and it immediately felt… not right. I’m a slave to IRC, so even when I’m dizzy from a bad night’s sleep, I start with coffee and irssi (running in screen, of course – I’m too old for tmux).

Irssi was laggy. It froze for seconds at a time. All the hallmarks of an SSH connection with network trouble. But it wasn’t: everything else looked perfectly normal, from ping to SSH logins to some of the websites hosted there. I suspected an address-family-specific problem, so I tried IPv4 only and IPv6 only. All was good either way.
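
Testing that is quick, by the way – roughly something like this (the hostname is just a placeholder):

  # force one address family at a time
  $ ping -c 3 example.org
  $ ping6 -c 3 example.org
  $ ssh -4 example.org
  $ ssh -6 example.org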

Next up: system stats. Lo and behold: weird disk activity! My main disk, da0, was doing a lot of writing. Something like 3k IOs per second, but only some 15 MByte/s – that’s roughly 5 KB per write, which was pretty odd in itself. Was something fsyncing a lot? Something writing with O_DIRECT?
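
A rough sketch of how to look at that on a FreeBSD box like this one (da0 happens to be my disk; adjust to taste):

  # per-device GEOM statistics, refreshed live
  $ gstat -p
  # or classic extended iostat, every 2 seconds
  $ iostat -x da0 2
  # let top sort processes by IO to see who is responsible
  $ top -m io -o total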

Prolonged periods of disk activity usually have a very obvious cause, especially for write IOs. Maybe it’s a process that is logging like there’s no tomorrow (for example, because some developer thought debug-level logging was a brilliant idea in production). Maybe it’s a DoS attempt or a misconfigured client that makes your application do its normal stuff, just a thousand times as often. Maybe a runaway batch job. Maybe your database ate too much and needs a really long checkpoint. Whatever the cause, it usually sticks out in the process list like a developer feeling guilty at the coffee machine.

By the way, atop is mind-blowingly good. I love it. Install it now, run atop 2, then never use any other *top again.

Anyway. No process seemed to be causing it. And at about the same time, the activity vanished. Maybe I had missed something, but it still felt odd. It could also have been ZFS itself, which operates in kernel land – e.g. when scrubbing a storage pool or replacing a disk. I checked, but nothing of the sort was going on.
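
Checking for that is quick; a scrub or resilver shows up right in the pool status (the pool name is whatever yours is called):

  # "scan:" tells you whether a scrub or resilver is running
  $ zpool status
  # pool-wide IO can be watched live, too
  $ zpool iostat 2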

Time for another coffee. I kept thinking about this all day, but there was no way this made any sense, other than having rogue activity on my box. Believe me, it’s a shitty feeling.

A Late Epiphany

Later that day, the exact same activity started again. This time I managed to catch a csh that was busy writing. Only a little at a time, but it was stuck writing.

Actually, I noticed, it was one of my cshs. The one I had just logged out of. The one that had been hanging since then. Oh!

I immediately put a ktrace on it. But all I could see was that it was writing the .history file. The whole time. Albeit very slowly. Had I fucked up my history badly enough to turn it into a multi-gigabyte file? I had not.
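
For the record, that looks roughly like this on FreeBSD (the PID is made up for illustration):

  # attach to the already-running shell
  $ ktrace -p 12345
  # ...wait a bit, then stop tracing that process again
  $ ktrace -cp 12345
  # decode the trace (written to ktrace.out by default)
  $ kdump | less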

Ok, let’s try that again: New shell, logged out… the exact same thing.

Then it dawned on me: ZFS! I checked zfs list, and there we had it: the whole storage pool had about 15 MB of space left. It’s well known that ZFS behaves pathologically when it runs out of free space. So all the write activity was just ZFS desperately shuffling blocks around, trying to make a tiny bit of room for that .history file.
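
The check itself is trivial, once you think of it; a rough sketch (pool and dataset names are placeholders):

  # dataset view: AVAIL is what's left for new writes
  $ zfs list
  # pool view: CAP near 100% is the red flag
  $ zpool list
  # and where the space went – snapshots are a classic culprit
  $ zfs list -o space -r tank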

I deleted a few gig of old snapshots. Suddenly, everything was back to normal.
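
Roughly what that looks like, assuming a pool called tank (names are placeholders – double-check before destroying anything):

  # list snapshots, biggest first
  $ zfs list -t snapshot -o name,used -S used
  # dry run first: -n only shows what would be destroyed
  $ zfs destroy -nv tank/home@2019-01-01
  # then for real
  $ zfs destroy tank/home@2019-01-01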

Words of Advice

Back in the day, double-checking disk space was a routine thing to do whenever I wasn’t sure what was going on. When an issue’s cause wasn’t obvious, going through a list of standard items proved very valuable: it was often something mundane like a full filesystem, even if all the error messages made you believe it had to be something within the application’s realm. I should have checked that much earlier here, and I do feel quite stupid about it.
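
That list boils down to a handful of commands; a rough version of mine for a FreeBSD/ZFS box:

  $ df -h          # any filesystem full?
  $ zpool list     # pool close to capacity?
  $ swapinfo -h    # swapping like crazy?
  $ dmesg | tail   # kernel or hardware complaints?
  $ top            # anything hogging CPU or memory?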

In conclusion: That’s why you want proper monitoring when you’re running any server. Even if it’s just your own personal trashcan server. Down the road it will save you a boatload of time. And these days, monitoring actually looks really good!
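
Even a crude self-made check beats nothing. A minimal sketch – the threshold, pool name and recipient are made up, and a real monitoring system does this better:

  #!/bin/sh
  # crude free-space alert, e.g. run from a daily cron job
  CAP=$(zpool list -H -o capacity tank | tr -d '%')
  if [ "$CAP" -ge 80 ]; then
      echo "pool tank is ${CAP}% full" | mail -s "disk space warning" root
  fi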

Hackers and Crackers

P.S.: History lesson!

cracker: n. One who breaks security on a system. Coined ca. 1985 by hackers in defense against journalistic misuse of hacker […]

I still think it’s unfortunate that hacker today is generally understood to be something shady, at best.
