People monitor their systems for two main reasons: to keep them healthy and to understand their performance. Almost everyone does both wrong, and for the same reason: they monitor so they can react to failures, rather than measuring their workload so that they can predict problems.
On the advice of a former colleague, I recently read Drift into Failure: From Hunting Broken Components to Understanding Complex Systems by Sidney Dekker.
An overview of Drift into Failure
By examining several recent disasters (ranging from the Challenger explosion to the housing market collapse of 2008), Dekker contends that …
I survey some dynamic tracers (e.g. perf, sysdig) available on Linux.
I document some pain points from recent performance investigations and then speculate that such issues are endemic to the Linux community.
Tonight, I sat down and read through every resume in the 2013 SCS senior resume book. Reading resumes for a company is really interesting, because I find myself looking at them very differently. As a student, I didn't really understand which sections of a resume are important. I thought it …