Senior UNIX/Linux Systems Administrator
rlpowell@digitalkingdom.org
After leaving my job at LookSmart, I realized that I've reached the
point in my career where a normal resume doesn't make a whole lot of
sense, at least once it's in the hands of a hiring manager rather
than a recruiter.
That deserves some explaining, but first I note that from here on
out, I'm assuming that I'm talking to a hiring manager. Here's the
thing: I've been a sysadmin for so long that a lot of the signs
you'd look for in a junior or intermediate sysadmin no longer apply.
If you are wondering "How many years of DNS experience does he
have?", or "Has he ever configured sendmail?", or "What flavour of
Linux is he best with?", you don't want me. If you're looking
through my resume for buzzwords, you don't want me. If you're
adding up my various jobs to try to figure out how many years of
experience I have with Solaris, you don't want me.
If you're looking for someone like me, you care about very different
things. You want to know, can you hand me a broken computer, or a
hundred of them, or a thousand, explain the problem, and simply
walk away, knowing that the problem will be fixed, and that no
further steps on your part are required to get it fixed just about
as fast as it possibly can? You want to know, can you give me a
specification for the performance requirements of your giant server
farm and expect back a coherent, well documented series of steps
required to go from where you are to where you want to be? You want
to know, after three months on the job will you be able to come to
me and ask me which part of the overall infrastructure is
bottlenecking performance?
My answer to all those questions is yes.
Now, obviously, this isn't going to be true for every possible
system; I'm a highly skilled sysadmin, not a god. If you're still reading,
you've decided to give me the benefit of the doubt, and now you want
to know which sorts of systems I can be relied on to work magic
with.
The rest of this resume answers that in the best way I know how: by
telling you about the things I've done that I'm particularly proud
of.
As an aside, though, I do ask that you don't take my word for any of
this. Talk to my former co-workers and bosses if you wish, but
better than that, if you're considering me for a position, please
bring me in, sit me down in front of a machine you've broken for
that purpose, and see what I can do. Regular interviews (asking me
about my greatest weakness, or whatever) aren't going to show you my
skills in any useful way. Why not actually test me properly?
Things I'm Proud Of
Not a complete list, obviously, but these are the things that come
to mind as exceptional. In approximately reverse date order (i.e.
newest first).
FAI Doesn't Like Our Environment
At LookSmart we had been using a couple of different internally
developed imaging applications, generally without benefit of
programmers assigned to them. This is about as fun as it sounds.
I decided to leave LookSmart (and let people know I would be
leaving) a fair bit before I actually did, and I wanted everybody to
have fond memories, so I spent several weeks (including weekends,
for the most part) making FAI work for installing Debian Linux, both
etch and lenny, and both 32-bit and 64-bit.
This shouldn't have been a serious undertaking, except for a few
details:
Presumably because of the unusual things we were doing, I hit quite
a number of bugs in the various packages that FAI uses. I ended up
having to insert a sort of patching system into FAI so that after it
created the client image, it would overwrite fixed versions of
certain files.
FAI expects every machine's interface information to be managed by
DHCP both before and after installation. We used DHCP only during
installations. This meant writing scripts to modify the network
information on the host at installation time.
FAI is designed for having one single client image. Making it deal
with 4 different images was quite a task all by itself, especially
since each test takes 30+ minutes (client image generation +
installation).
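To make the DHCP point concrete, here is a minimal sketch of the kind of rewrite such scripting has to do: take the address a host was given during installation and render it as a static stanza in the format of Debian's /etc/network/interfaces. This is not the scripting used at LookSmart; the function name, interface name, and addresses are all hypothetical, and Python is used purely for illustration.

```python
import ipaddress


def static_interfaces_stanza(iface, cidr, gateway):
    """Render a static stanza for Debian's /etc/network/interfaces.

    iface, cidr and gateway are illustrative inputs; a real install
    hook would take them from the lease the installer obtained.
    """
    net = ipaddress.ip_interface(cidr)  # e.g. "10.0.0.5/24"
    lines = [
        f"auto {iface}",
        f"iface {iface} inet static",
        f"    address {net.ip}",
        f"    netmask {net.netmask}",
        f"    gateway {gateway}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical values standing in for whatever DHCP handed out.
    print(static_interfaces_stanza("eth0", "10.0.0.5/24", "10.0.0.1"), end="")
```

A real hook would also need to write the result into the freshly installed filesystem rather than print it, and to cope with hosts that have more than one interface.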
On top of all this, I did everything in cfengine, so that simply by
creating a host with the same naming convention as the FAI server I
built, you would end up with a working FAI server to our
specifications. I proved this by re-imaging the server before final
testing.
Furl Actually Worked
Furl was one of the products we ran at
LookSmart. I say "worked" because it's been sold to Diigo. Before
that, though, I was basically the sole sysadmin on the project. I
also ended up working with Furl longer than anyone else, including
the creator.
When I arrived, LookSmart had just acquired Furl and didn't really
know how to lay it out in terms of what parts to put on what
servers. About a week after I arrived, it had a catastrophic
failure and I had to figure out how to regenerate a bunch of the
data.
Furl was always really hard to optimize, especially since it never
had any money, so we couldn't simply throw hardware at the problem.
It had a number of features that required collating data from every
user, so we couldn't separate users out into groups across different
servers. On top of that, management was basically actively hostile
to doing the things needed to keep it running smoothly. When I left
in March 2009, we had ~1.5TiB of user cache and index data, ~100GiB
of MySQL data, and we were still running MySQL 4.0.24, which had
had active support terminated in September of 2006. I had been
requesting upgrades to 5.0 or better since I arrived in December
2004, but that required developer time, and management never allowed
it.
When I left, however, it was running smoothly, and I take a lot
of the credit for that. I engaged in immense amounts of MySQL
tuning over the years. I implemented our load balancing scheme in
haproxy when our hardware load balancers couldn't play well enough
with Ruby on Rails. I routinely found bad database queries and
forwarded them to the developers, often along with suggestions for
how to fix them.
For the first year or so of my time with LookSmart, Furl was one of
the two biggest offenders for oncall time. By the time I left, Furl
was running so smoothly that on the rare occasions that it did
wake somebody up, they usually had to call me anyways because
something completely bizarre and unique had happened.
If some sysadmin hadn't made Furl their pet project, it would very
quickly have become totally unusable. Furl was my baby, and I'm
proud of how well it ran until they took it away.
Multilingual MUDding
This isn't something I did for money, and it mostly isn't
sysadminning, but it was one hell of a hard problem to solve and I'm
still proud of it, so here it is.
MUDs are text-based virtual worlds; like MMOGs but text-only. I'm
heavily involved with the Lojban project, and wanted to make a MUD
for it. Yes, I'm a giant nerd.
Anyway, MUDs are generally tied tightly to whatever language they
are designed to process (i.e., English, almost always). Like, the
language parsing is implicit throughout the code. Converting to
another language is usually quite tricky. But I decided that wasn't
good enough for me; I wanted a MUD that could do multiple languages
(theoretically, any languages, and any number of them) at the same
time. This means that when a user enters the room, they are
presented with a description in their language, if such a thing
exists.
This is a lot harder than it sounds, because a room consists of a
bunch of objects, any of which can be made by any user on the MUD.
A given user may not realize (or care) that the MUD is multilingual,
and may or may not have the capability to translate into all the
languages commonly used there even if they do. So you can have
cases where for one user you want to display the information from
the object itself, because it's in the best language for them, but
for other users you want to get the attribute from the object's
parent, even though the attribute is defined on the object,
which is a rather substantial violation of the normal OO flow.
Figuring out what to do algorithmically was hard, but I managed it
(with help), and I managed to code it, and it works. I used the mooix
server base, and this makes mooix the only generally multilingual MUD
on the planet. Believe me, I checked.
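To illustrate why this fights the normal OO flow, here is a minimal sketch of one way such a lookup could work. It is not mooix's actual code or algorithm; the Thing class, the lookup function, and the sample descriptions are all hypothetical. The idea it shows is that the user's language preference drives the search, so text on a parent in a better language wins over text on the object itself in a worse one.

```python
class Thing:
    """A MUD object with an optional parent and per-language attributes."""

    def __init__(self, parent=None):
        self.parent = parent
        self.attrs = {}  # maps (attribute name, language code) -> text

    def set(self, name, lang, text):
        self.attrs[(name, lang)] = text


def lookup(obj, name, preferred_langs):
    """Return (language, text) for the best match, or None if there is none."""
    # Language preference is the outer loop: the whole parent chain is
    # searched for the first language before the second is considered.
    for lang in preferred_langs:
        node = obj
        while node is not None:
            if (name, lang) in node.attrs:
                return lang, node.attrs[(name, lang)]
            node = node.parent
    return None


if __name__ == "__main__":
    # A generic parent translated into both languages (placeholder text).
    generic_chair = Thing()
    generic_chair.set("description", "en", "[en] generic chair description")
    generic_chair.set("description", "jbo", "[jbo] generic chair description")

    # A user-made child described in English only.
    my_chair = Thing(parent=generic_chair)
    my_chair.set("description", "en", "[en] my very own chair")

    print(lookup(my_chair, "description", ["en"]))
    print(lookup(my_chair, "description", ["jbo", "en"]))
```

In ordinary inheritance the child's own description would always win; here the Lojban-first user is handed the parent's text even though the child defines the attribute.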
For relevance to my career, mooix actually uses UNIX itself (in this
case Debian Linux) as its infrastructure. That is, users of the MUD
are system users, code is run as actual UNIX processes, etc. Love the
security implications! This led to my first real experience with
virtualization, since I had to figure out how to make a separate
instance of Debian for the MUD to run in all by its lonesome. It
uses Linux VServer.
Imaging In Boxes With dd
That's not a typo; I actually do mean imaging in boxes. We
ended up in a situation at Recourse where it was actually not in our
best interests to take the machines out of their cardboard boxes to
image them. You see, we produced fully-configured honeypot machines
for our customers. Sales would generally give us about a day of
warning to get them shipped, which wasn't even enough time to get
them to the office. We didn't have the office space, or money, to
keep them on hand, and we didn't have room or time to rack them.
It wasn't like we could just load the OS onto them like normal
people, either: for reasons that remain not entirely clear to me,
they had to be Solaris, but we had gone with x86 machines for
cheapness. Better still, they were Dell servers with (for Solaris
x86 anyways) weird hardware. Figuring out how to get the first one
imaged at all actually took me something like a week. Even once
I had it down to a science, you had to swap floppies (yes, floppies;
Solaris x86, at least at the time, was a grotesque abomination that
absolutely required all the driver loading happen from floppies as
far as I ever could determine) something like 4 times.
So eventually, since the drives were always identical, I got into
the routine of simply keeping a copy of a finished drive around and
a copy of tomsrtbt on floppy, and running dd to do the copy. As an
aside, did you know that dd's runtime performance changes
dramatically if you futz with the block size? Try a large copy with
no bs= argument, and then run it again with bs=16M. Anyways, this
method worked fairly well for our purposes.
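For anyone who wants to see the block size effect without a spare disk, here is a minimal sketch that mimics what dd does: unbuffered reads and writes of a fixed size, timed from start to finish. With no bs= argument dd works in 512-byte blocks, so that is the small size used here. The function name, file names, and the 8 MiB test file are hypothetical stand-ins, and because the test file will sit in the page cache, this mostly measures per-call overhead rather than real disk behaviour.

```python
import os
import tempfile
import time


def copy_with_block_size(src, dst, block_size):
    """Copy src to dst in block_size chunks, bypassing Python's buffering.

    Returns (bytes copied, elapsed seconds).
    """
    copied = 0
    start = time.perf_counter()
    with open(src, "rb", buffering=0) as fin, open(dst, "wb", buffering=0) as fout:
        while True:
            chunk = fin.read(block_size)
            if not chunk:
                break
            # Raw writes may be partial, so keep going until the chunk is out.
            view = memoryview(chunk)
            while len(view) > 0:
                written = fout.write(view)
                view = view[written:]
            copied += len(chunk)
    return copied, time.perf_counter() - start


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "src.img")
        dst = os.path.join(tmp, "dst.img")
        with open(src, "wb") as f:
            f.write(os.urandom(8 * 1024 * 1024))  # small stand-in for a drive
        for bs in (512, 16 * 1024 * 1024):
            copied, secs = copy_with_block_size(src, dst, bs)
            print(f"bs={bs}: {copied} bytes in {secs:.3f}s")
```

The numbers will vary from machine to machine; the point is only that the small-block run makes thousands of times more read and write calls to move the same data.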
Then one day 44 servers showed up, and we were told that we needed
to ship them out by close of business that day.
I was unimpressed, but it needed to be done. So I started with the
first box, made 3 drive copies (there were 4 bays in these boxes),
and basically commandeered the rest of the operations team, cracking
open boxes, running power cables to them, slotting in drives, etc.
I'd then go around and get the dd commands started.
It was a hell of a day, but we got them all out of the door. I was
pretty pleased with myself, because there was nothing about our
previous requirements that had led me to set up a one-command
imaging system for these boxes: I had done it solely because it
seemed like the right thing to do.