Job Qualifications and Ph.D. Prospects

Being near the end of my graduate studies, I’ve started looking at the jobs that other people have a little more critically.  Whereas I used to think in terms of “oh, that’s interesting,” I now find myself wondering “am I qualified to do that?”  More often than not, the answer is “no,” and my prospects have had me feeling quite depressed as of late.  I’ll have spent five full years in graduate school by the time I get my Ph.D., but aside from the letters after my name, what have I really gotten out of it other than debt?

Let’s take inventory of the qualifications I have developed.

Skill | Qualification | Comments

Computational Research Qualifications

general molecular simulation | intermediate | I’ve spent the better part of a decade doing MD simulations.  I know quite a bit about them, but that means I know enough to realize how much I don’t know.
electronegativity equalization | intermediate/high | I understand how EEM works, which is to say I realize how much nonsense it is as a semi-empirical theory.  I spent a year developing a new EEM-based model and hated every minute of it.
potential development | intermediate/high | I can tune and develop potentials, but this work is extremely tedious and decidedly un-fun.
algorithms | intermediate | I know the basic integrators but don’t fully understand any of the more technical methods (e.g., Nosé-Hoover, Parrinello-Rahman, SHAKE/RATTLE, Ewald, etc.).  I can write a fully functional MD code using basic algorithms (e.g., velocity rescaling, Verlet, and the Berendsen barostat) but have difficulty implementing extended-system-based algorithms.
commercial MD simulation packages | low | I’ve never used any off-the-shelf simulation packages other than LAMMPS and GULP.  Even then, I have not used either very extensively.  My group has always used its own code.
bio/pharma molecular sim | low | I know next to nothing about protein folding, docking, bio-centric models (CHARMM, etc.), bonded interactions, SHAKE, etc.
quantum simulation | low | I know virtually nothing about quantum calculations.  I don’t have any understanding of basis sets, dispersion forces, DFT, Møller-Plesset, path integral, Car-Parrinello, etc.
continuum simulation | low | I know nothing about phase-field methods, finite element/finite difference, etc.

Other Research Qualifications

data analysis | high | I have a strong grasp of many computational tools useful in efficiently analyzing large data sets and correlating data, and I am quite good at leveraging those tools to extract meaningful data and complex relationships.  Some of the tools that I regularly use are Perl, Python, Maple, Mathematica, sh, and awk.
technical writing | intermediate/high | I have a strong grasp of the English language and can put together sensible manuscripts handily.  Of my three first-author papers published to date, none have ever come back from peer review requiring any major revisions.  Roughly 80-90% of the written text in these manuscripts was in my words.
technical presentation | intermediate | I’m not a bad speaker and can assemble presentations that follow a logical path.  I design presentations for specific audiences that aren’t overbearingly technical but at the same time not superficial.  I’ve won a poster award and spoken at several international conferences.
laboratory work | none/low | I’m not very good with my hands, which is why I’ve stayed out of experimental labs.  I know lab safety but am afraid of dangerous machinery (machine shops, furnaces) and chemicals (highly caustic, toxic, etc.) due to lack of experience.

Scientific Knowledge

ceramics | intermediate/high | I know about ceramics, crystal structures, point defects, grain boundaries, processing, microstructure, etc.  I don’t know much about specific technical ceramics.
glasses | intermediate/high | I know a lot about silica.  As you introduce additives and exotic processing, my level of knowledge drops.  I know about its atomic structure, general properties, and mechanical behavior.  I know less about specific modern silicates (mesoporous, etc.).
physics (general) | intermediate/high | I have a pretty strong understanding of general physics and why things happen.  I’ve also got an aptitude for solving analytical problems.  I could teach undergraduate-level physics pretty adeptly.
physics (mechanics) | low/intermediate | I know enough to know that I don’t know very much.  I do not have a strong background in Lagrangian/Hamiltonian formalisms (which is to say nobody has ever taught me of their existence).  I am self-teaching this stuff though.
physics (quantum) | low | I know the basics, but I haven’t solved a differential equation in half a decade.  I have no real experience working in modern physics outside of a classroom.
physics (thermo/stat mech) | intermediate | I’ve taken thermodynamics three or four times and have a fair grasp of it.  My limited knowledge of mathematics prevents me from fully grasping more complicated formalisms (e.g., n-dimensional space).
physics (chemical) | intermediate | I’m familiar with many chemico-physical processes, reaction pathways, energetics, etc.

Technical Computing

architecture | intermediate/high | I have a reasonably good understanding of what makes computers fast.  I understand memory and cache layouts, pipelining, SIMD/vectorization, bandwidth, data locality, out-of-order execution, registers, and how to program efficiently with these features in mind.  I do not know x86 assembly.
programming | intermediate/high | I have a good sense of proper programming, program structure, and good practices.  I have years of experience in C, Fortran 77, Perl, and bash/sh.  I have some experience with Python, awk, C++, and Fortran 90.
SMP parallel programming | low/intermediate | I am reasonably comfortable with OpenMP.  I have basic familiarity with pthreads.  I have never applied either of these to a real project.
distributed parallel programming | low/intermediate | I am familiar with MPI, but I have not used it very extensively.  I am familiar with the concepts and considerations of distributed computing.  I have no experience in fault tolerance or large scaling.
GPGPU programming | low | I am familiar with the basics of CUDA.  I can write basic kernels, but have no experience using CUDA for research.

Systems Administration

Linux administration | intermediate | I run a few general-purpose Linux servers.  I am comfortable compiling from source (e.g., Apache, PHP), implementing basic security measures (firewalls, quotas, IDS), managing user accounts, working with LVM, etc.  I do not have much experience with SAN, clustered systems, advanced networking, automatic deployment, PXE, virtualization, packaging, etc.
UNIX administration | intermediate | I’ve run a lot of Solaris servers and am comfortable with Solaris 10’s way of doing things.  I am experienced with Sun hardware, ZFS, and general administration.  I am unfamiliar with the details of SMF and dtrace.  I have intermediate familiarity with HP-UX 11i, IRIX 6.5, and AIX 5.
cluster administration | low/intermediate | I use clusters and can configure a basic one, but I lack experience in diskless nodes, InfiniBand, and low-level tuning.
hardware | intermediate/high | I have a lot of experience debugging hardware ranging from workstations to enterprise devices.  I have advised purchasing decisions on cluster hardware, assembled clusters, performed upgrades and troubleshooting, and managed inventory.

Granted, the fact that I listed some things as having low qualifications still means I’m probably more qualified than the average Joe off the street who doesn’t even know such things exist.  Furthermore, the fact that I’ve listed it means that I know it’s a shortfall and am willing to bone up on that skill if given the time and opportunity.

With that being said, where do my qualifications leave me?  I’m more equipped to do molecular simulation than most other researchers since I have intimate knowledge of simulation code, algorithms, and theory, but I also know that there’s a lot of detail I don’t understand.  Do most graduate students know this sort of stuff by the time they finish?  My postdoc coworker wrote a Nosé-Hoover + Parrinello-Rahman thermostat-barostat integrator routine for our group’s simulation code back when he was a graduate student.  I’m almost done with my degree and I really have no idea how to do this.  Granted, his undergraduate degree was in physics while mine was in “ceramic engineering.”

This sort of thing makes me feel like my education is holding me back.  I know very few simulations people who are in materials science.  The vast majority of molecular modelers are

  1. in physics and understand the things I wish I understood (e.g., the statistical mechanical implications of various modifications to the Lagrangian)
  2. in chemistry and also understand things I wish I understood (e.g., potentials of mean force, free energy of reactions, quantum chemical aspects)
  3. in biology, and understand ??? (I suspect the bio people are using black-box code and don’t really understand or care about the nitty gritty)

I went to graduate school so I wouldn’t have to get silicosis in some batch house or rebuild furnaces for a living, but I’m just not seeing where to go from here.  I could slide into a vanilla postdoc position and spend the next 2-6 years of my life bouncing around between short-term appointments, pumping out papers about obscure scientific problems about which nobody cares, and floating around technical conferences I hate attending.  That’s a miserable existence, but it seems like it’s the one for which I am most qualified.

Contrary to my thoughts going into this business, professional science isn’t all that great.  There are a few superstar scientists who make clear and evident breakthroughs, and that’s really exciting.  But the majority of scientific progress is in painfully slow, small baby steps.  Publications address some tiny facet of some tiny problem that only a tiny group of other scientists care about, and even then, they rarely stand on their own.  Nobody will believe a result until enough of these tiny findings that don’t contradict each other have piled up.

And when that happens?

There’s still only a dozen people on the planet who care.

There’s no real rewarding fulfillment that comes with publishing these obscure results.  Even to the layperson, it’s not like curing cancer.  Nobody really benefits from most of the science that gets published today.  As I often tell people, being a janitor would be more fulfilling to me.  At least in that case, I’d be able to go home at night knowing that I made the toilets cleaner than they were when I started the day.

Posted in Personal, Research, Science

Revisiting Perl and Python’s Speed

I was really surprised to see the discussion that was generated as the result of my previous post comparing the speed of Python and Perl.  Many people much wiser than me posted a lot of valuable comments and suggestions, and two people were kind enough to post total rewrites of my routines which (to nobody’s surprise) were much faster than the codes I wrote.

A few people (both here on the blog and through other discussion) raised legitimate points:

  1. My Python code was recompiling the regex every loop iteration because I was confused by how regex compilation and regex match objects work.  Fixing this problem alone increased speed by 10%-25%.
  2. The timings I posted were sub-second and someone suggested that startup overhead may have been hurting Python.  To address this, I used a more “real-life” input file that was 3750 MB rather than the 8.588 MB input file I used earlier.
  3. The style of Perl I was using was archaic, and the style of Python I was using wasn’t terribly Pythonic.  I live in a programming bubble; I learned both of these languages from their respective O’Reilly books and that’s it.  I don’t know anyone who knows either Perl or Python in real life, and I have never seen anyone else’s code in either language.  But as it turns out, poorly written Perl and poorly written Python follow the same trends as well-written Perl and Python (see below).
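Point 1 above can be sketched as follows.  This is a minimal illustration in Python 3 syntax, not the code that was benchmarked; `count_species` and the sample lines are made up for the example, and the pattern is the one from the original script.  The regex is compiled once, outside the loop, and the compiled object's own `match` method is called for each line:

```python
import re
from collections import Counter

# Compiled a single time when the module is loaded, not once per line.
RE_LINE = re.compile(
    r'\s*(\d+)\s+([\d\w]+)\s+\d+\s+[\w\.]+\s+[\w\.]+\s+[\w\.]+\s*$')


def count_species(lines):
    """Count how often each species label appears in matching lines."""
    counts = Counter()
    for line in lines:
        # Bound method of the compiled pattern: nothing is recompiled here.
        match = RE_LINE.match(line)
        if match:
            counts[match.group(2)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "1 SiO4 12 0.1 0.2 0.3",
        "1 H2O 13 1.0 2.0 3.0",
        "this line does not match",
        "2 SiO4 14 0.4 0.5 0.6",
    ]
    print(count_species(sample))
```

How much this helps depends on the input and the interpreter version, so it is worth timing on real data rather than assuming a fixed gain.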

So as to be a little more scientific about this (since I am a scientist and all), here are my starting parameters:

  • Software
    • Ubuntu Server 10.04 LTS
    • Python 2.6.5 provided by the distribution
    • Perl 5.10.1 provided by the distribution
    • data resides on an ext4 filesystem on LVM
  • Hardware
    • HP DL360 G7
    • 2x Xeon X5672, 3200 MHz
    • 24GB DDR3 RAM
    • data resides on 6Gbit SAS RAID5
  • Codes

Methodology: I ran each code on the same 3750 MB input file five times in serial succession.  Each execution was timed using the `time` builtin provided by the bash 4.1.5(1) included with Ubuntu 10.04.  stdout was redirected straight to /dev/null.
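The same procedure can be approximated without bash.  The sketch below is an assumption-laden stand-in, not what was actually used for the numbers in this post: `time_command` is a hypothetical helper, it measures wall time around the whole child process (similar in spirit to the "real" figure from `time`), and the command in the demo is only a placeholder.

```python
import subprocess
import sys
import time


def time_command(argv, trials=5):
    """Run argv `trials` times in succession; return wall times in seconds."""
    walltimes = []
    for _ in range(trials):
        start = time.perf_counter()
        # Discard stdout, mirroring the redirection to /dev/null.
        subprocess.run(argv, stdout=subprocess.DEVNULL, check=True)
        walltimes.append(time.perf_counter() - start)
    return walltimes


if __name__ == "__main__":
    # Placeholder command: it only starts the interpreter and exits.
    times = time_command([sys.executable, "-c", "pass"])
    print("trials:", ["%.3f" % t for t in times])
    print("mean:   %.3f s" % (sum(times) / len(times)))
```

Replacing the placeholder with the real script and input file would give one row of a timing table per code.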

Code | Mean walltime (s) | Trial 1 | Trial 2 | Trial 3 | Trial 4 | Trial 5
Old Python | 309.032 | 310.971 | 308.228 | 311.331 | 307.170 | 307.461
New Python | 176.880 | 178.099 | 174.742 | 175.463 | 178.235 | 177.863
Old Perl | 167.051 | 166.916 | 165.911 | 167.361 | 168.735 | 166.333
New Perl | 126.860 | 125.913 | 124.709 | 130.125 | 127.809 | 125.746

So even the cleaner Python takes roughly 40% longer to run than the cleaner Perl, which is not far off from the 50% slowdown I noted with my two crummier versions of the code.  Furthermore, it seems easier for a relative novice like myself to write inefficient Python code than inefficient Perl code.  Of course, it’s also easier to write Perl code that doesn’t do what you expect, and trying to understand someone else’s code is a crapshoot.
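For anyone who wants to check the arithmetic, here is a small sketch that recomputes the means and the relative slowdown from the trial times in the table.  The helper names are made up for the example, and "percent longer" is defined relative to the faster code, which is only one of several reasonable ways to express the gap.

```python
# Trial wall times in seconds, copied from the table above.
TRIALS = {
    "Old Python": [310.971, 308.228, 311.331, 307.170, 307.461],
    "New Python": [178.099, 174.742, 175.463, 178.235, 177.863],
    "Old Perl":   [166.916, 165.911, 167.361, 168.735, 166.333],
    "New Perl":   [125.913, 124.709, 130.125, 127.809, 125.746],
}


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def percent_longer(slow, fast):
    """How much longer `slow` takes than `fast`, as a percentage of `fast`."""
    return 100.0 * (slow - fast) / fast


if __name__ == "__main__":
    means = {name: mean(times) for name, times in TRIALS.items()}
    for name, value in means.items():
        print("%-10s %8.3f s" % (name, value))
    print("new Python vs. new Perl: %.1f%% longer"
          % percent_longer(means["New Python"], means["New Perl"]))
    print("old Python vs. old Perl: %.1f%% longer"
          % percent_longer(means["Old Python"], means["Old Perl"]))
```

By this measure the rewritten Python takes a little over 39% longer than the rewritten Perl on the large file.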

Judging by what others have told me and what some commenters have pointed out, though, Python just isn’t optimized for “practical extraction and reporting.”  Maybe someday I’ll find a use for Python in my work.

In case the links to the codes I used ever go bad, here they are on pastebin:

I’d post the input files I used, but I don’t have anywhere I can anonymously host 3.7 GB (or even 8 MB) files.  If you’re interested in the input data, let me know and I can send a private link.

Posted in Technology

Switching from Perl to Python: Speed

The job listings in scientific computing these days seem to show a mild preference for applicants with backgrounds in Python over Perl. Python has high-profile (or just highly visible?) packages like NumPy and MPI bindings for scientific computing, and some molecular dynamics packages (e.g., LAMMPS) include analysis routines written in Python. Although I’ve invested a few years into Perl, I’ve decided to not pigeonhole myself and start picking up Python. After all, Perl is unintelligible after it’s been written, and it’s sometimes frustrating to deal with its odd quirks.

To this end, I reimplemented one of my most-used Perl analysis routines in Python. Here is my Perl version, written back in 2009:

#!/usr/bin/perl

@show = qw/ Siloxane SiO4 Si3O SiO3 SiO2 SiO1 NBO FreeOH H2O H3O SiOH SiOH2 Si2OH/;

printf("\n%-8.8s ", "ird");
foreach $specie ( @show )
{
  printf("%8.8s ", $specie);
}
print "\n";

$current = 0;
$isave = 0;
while ( $line = <> )
{
  chomp($line);
  $line =~ s/^\s+//g;
  @arg = split(/\s+/, $line);
  next unless $line =~ m/^\d+\s+[\d\w]+\s+\d+\s+[\w\.]+\s+[\w\.]+\s+[\w\.]+\s*$/ ;
  if ( $current == 0 )
  {
    $current = $arg[0];
    $isave = $current;
  }
  if ( $arg[0] != $current )
  {
    &printargs();
    $current = $arg[0];
    $isave++;
  }
  $type{$arg[1]}++;
}
&printargs();

sub printargs( )
{
  printf("%-8s ", $isave);
  foreach $specie ( @show )
  {
    printf("%8d ", $type{$specie});
  }
  print "\n";
  foreach $i ( keys(%type) )
  {
    $type{$i} = 0;
  }
}

And here is the Python version I cooked up today:

#!/usr/bin/env python2

import fileinput
import re

show = [ "Siloxane", "SiO4", "Si3O", "SiO3", \
         "SiO2", "SiO1", "NBO", "FreeOH", \
         "H2O", "H3O", "SiOH", "SiOH2", "Si2OH" ]

def printargs( counts, isave ):
  print "%-8s" % isave,
  for s in show:
    print "%8d" % counts[s],
    counts[s] = 0
  print "\n",

print "%-8s" % "ird",
counts = {};
for s in show:
  counts[s] = 0
  print "%8s" % s,
print "\n",

isave = 0;
current = 0;

RE_LINE = \
  re.compile(r'\s*(\d+)\s+([\d\w]+)\s+\d+\s+[\w\.]+\s+[\w\.]+\s+[\w\.]+\s*$')

# method #1:
# for line in fileinput.input():

# method #2:
# for line in file('coord.out'):

# method #3:
contents = file('coord.out').readlines()
for line in contents:
  match = re.match(RE_LINE, line)
  if not match: continue

  specie = match.group(2)
  icur = int(match.group(1))

  if current == 0:
    current = icur
    isave = current
  elif current != icur:
    printargs(counts, isave)
    current = icur
    isave += 1

  if show.count(specie) > 0:
    counts[specie] += 1;

printargs(counts,isave)

In the Python version, there are several ways to tear through a file, and I tried three of them. Method #1 is closest to the Perl functionality, where I can specify multiple input files on the command line and have all of them parsed sequentially. Method #2 is the method that the Python documentation seems to advocate the most. Method #3 loads the whole file contents into memory and works from there.
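For reference, the three approaches look like this in isolation.  This is a sketch in Python 3 syntax (where `open()` replaces the `file()` builtin used above), the function names are invented for the example, and the lines are collected into lists only so that the three results can be compared.

```python
import fileinput
import os
import tempfile


def read_with_fileinput(paths):
    """Method #1: iterate over one or more files, much like Perl's <>."""
    with fileinput.input(files=paths) as stream:
        return [line for line in stream]


def read_by_iteration(path):
    """Method #2: iterate over the file object one line at a time."""
    with open(path) as handle:
        return [line for line in handle]


def read_all_at_once(path):
    """Method #3: load every line into memory with readlines()."""
    with open(path) as handle:
        return handle.readlines()


if __name__ == "__main__":
    # Throwaway input file so the sketch runs on its own.
    with tempfile.NamedTemporaryFile("w", suffix=".out", delete=False) as tmp:
        tmp.write("1 SiO4 12 0.1 0.2 0.3\n2 H2O 13 1.0 2.0 3.0\n")
    try:
        a = read_with_fileinput([tmp.name])
        b = read_by_iteration(tmp.name)
        c = read_all_at_once(tmp.name)
        print(a == b == c, len(a))
    finally:
        os.remove(tmp.name)
```

In a real parser, methods #1 and #2 would process each line inside the loop instead of building a list, which keeps memory use flat for large inputs.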

Unfortunately, in all three cases, Python seems to be slower than Perl. Average execution times for a typical input file are:

Python Method #1: 0.794 seconds
Python Method #2: 0.692 seconds
Python Method #3: 0.686 seconds
Perl: 0.469 seconds

Maybe there’s something I’m missing in the Python version, but the Perl version isn’t exactly a shining example of simplicity in itself. What gives here? For a language that’s being venerated in the scientific computing world, in the case of basic text parsing of large files, it isn’t shining. At best, it’s almost 50% slower than Perl.

Posted in Computations, Technology

The Sad State of our Storage Situation

Although my research group has been in the field of technical computing for over thirty years now, its level of technological sophistication has been stagnant for somewhere around twenty years.  Take for example our user authentication:

  1. We use NIS for user authentication.  Period.  No Kerberos.
  2. Our NIS server is an SGI Indigo2 that is over fifteen years old now…
  3. …and fifteen years ago, single DES was considered good enough for password hashes

So our entire password database and all password hashes are fully exposed to the network, and the hashes are in an extremely insecure format.  Oh, and did I mention that our sole NIS server is running on a fifteen year old machine that is still using its factory-installed hard drive?  One of these days it’s going to blow out, and we don’t have any sort of drop-in replacement or backups for when that happens.
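To put a rough number on how insecure that format is: the traditional DES-based crypt only looks at the first eight characters of a password, seven bits each, and uses a 12-bit salt.  The sketch below just does the arithmetic; the guess rate in it is a made-up figure for illustration, not a measurement of any real cracking tool.

```python
KEY_BITS = 56        # 8 characters x 7 bits each feed the DES key
SALT_VALUES = 4096   # 12-bit salt


def keyspace(alphabet_size, length=8):
    """Number of distinct passwords of exactly `length` characters."""
    return alphabet_size ** length


def days_to_exhaust(candidates, guesses_per_second):
    """Days needed to try every candidate at a constant guess rate."""
    return candidates / guesses_per_second / 86400.0


if __name__ == "__main__":
    RATE = 1e9  # hypothetical guesses per second, for illustration only
    for label, size in [("lowercase letters", 26),
                        ("letters and digits", 62),
                        ("printable ASCII", 95)]:
        n = keyspace(size)
        print("%-20s %.2e candidates, %.1f days at the assumed rate"
              % (label, n, days_to_exhaust(n, RATE)))
    print("full 56-bit key space: %.2e" % (2 ** KEY_BITS))
    print("distinct salts:        %d" % SALT_VALUES)
```

Since anyone on the network can pull the hashes out of NIS, the only thing protecting an account is the size of these numbers.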

In fact, the matter of not having backups has been a major issue for as long as I’ve worked here.  When I started here as an undergraduate researcher, the entire lab’s backup strategy was to tar up directories and sftp them to a 250GB external USB drive connected to our already-old Power Mac G4.  When that filled up, we started just backing up data to whatever disks had free space.  One of the researchers started copying his data into the /usr partition on the compute nodes of our cluster since they had around 30GB free per node.  Another copied backups to various workstations that weren’t being used at the time.  And the third full-time researcher simply didn’t back up his data at all.  The cluster had automatic tape backup, after all, so why waste the effort?

On the day of my graduation, my department threw a luncheon for the new graduates where faculty could meet the parents of the graduating students.  My research advisor (under whom I am now finishing my Ph.D.) introduced himself to my parents, said a number of congratulatory and flattering things, and finished by turning to me and saying “Oh, and the cluster went down yesterday.  All the data is gone.  When are you going to be back in the lab?”

The following Monday I was back in the lab, and a lot of data was lost.  The cluster did have a tape drive with amanda installed, but nobody knew if the tape backups were ever actually running.  Nobody knew how to examine the contents of the tape, and nobody had rotated tapes recently.  In fact, although the tape drive was sold with two DLT tapes, the second one was never even unwrapped, much less rotated in.  I’d be pretty confident that even if amanda was doing regular tape backups, the tape had more writes on it than was safe for a DLT cartridge.

This story isn’t particularly interesting; the internet is full of similar anecdotal backup horror stories.  But it really sucks when it happens to you, and after returning to my group some time later to do graduate work, I took it upon myself to establish data redundancy and automated backups to make sure I never lost my data again.

Unfortunately, the technological sophistication of my group never went beyond buying an external USB hard drive, plugging it into a Mac, and letting OS X magically set it up so that it can be written to.  And for some reason, purchasing decisions were continually made by those perhaps least qualified to be making them.  The end result was our entire storage infrastructure being plugged into the USB ports of a Power Mac G5.

Our situation remained this way for some years despite the fact that the USB to SATA bridges used in external Seagate disks seem to fail under high throughput, and OS X does not handle failed drives gracefully at all.  I finally got fed up with the constant outages and failures, voided the warranties on our bigger USB disks, ripped them out of their enclosures, and installed them properly into whatever workstations had the drive cage space and SATA channels to support them.

The end result is a bit of a mess:

Figure: Backup and storage layout

Some systems are automatically backed up, some are not.  Some have RAID1, others do not.  And none of the backup disks have any redundancy, so if one goes, the backups on it are gone.  It would please me to no end to replace this mess with a single storage solution; even something simple like a dozen terabytes of NAS would be a huge improvement over the spiderweb of small disks we’re currently using.

Unfortunately, the cost of a semi-serious storage solution (on the order of a few thousand dollars) is a hard sell to my boss.  We have backup disks, they can store data, and nobody’s lost anything important since that cluster failed many years ago.  Something must be going right, so why spend the money on storage when we can burn it on more inkjet printers to replace those whose cartridges have gone empty, or to hire more undergraduates who aren’t qualified to touch a UNIX workstation?

As backup space becomes a little tighter, I am tempted to halt the automated backups of my coworkers’ data and just automatically back up my data.  After all, they have all been told to do their own backups, and I’ve told them not to trust that the automatic backups are actually working.  Yet none of them have done a manual backup in at least a year, and nobody but me has been checking to see that the automated backup system is even working.
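Checking that a backup is actually working does not have to be elaborate.  The following is a minimal sketch, assuming the backups are tar archives sitting on a locally mounted disk; `check_backup` and the 24-hour threshold are hypothetical and are not a description of the scripts running in my lab.

```python
import os
import tarfile
import tempfile
import time


def check_backup(archive_path, max_age_hours=24.0):
    """Return (is_fresh, member_count) for a tar archive.

    Every regular file in the archive is read to the end, so an archive
    that cannot be read back raises an exception instead of passing
    silently.
    """
    age_hours = (time.time() - os.path.getmtime(archive_path)) / 3600.0
    members = 0
    with tarfile.open(archive_path) as archive:
        for member in archive:
            if member.isfile():
                stream = archive.extractfile(member)
                while stream.read(1 << 20):   # read in 1 MiB chunks
                    pass
            members += 1
    return age_hours <= max_age_hours, members


if __name__ == "__main__":
    # Build a tiny archive in a scratch directory so the sketch is runnable.
    with tempfile.TemporaryDirectory() as workdir:
        data = os.path.join(workdir, "results.txt")
        with open(data, "w") as handle:
            handle.write("some simulation output\n")
        archive_path = os.path.join(workdir, "backup.tar.gz")
        with tarfile.open(archive_path, "w:gz") as archive:
            archive.add(data, arcname="results.txt")
        print(check_backup(archive_path))
```

A check like this only shows that an archive is recent and readable; it says nothing about whether the right directories were included in it.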

I struggle to convince my boss to spend more on storage to accommodate my backup needs, so maybe I should just use what I’ve got available to me and let the others worry about their own data.  After all, it isn’t my job to keep their data backed up.  I am not a system administrator; I don’t even have administrator privileges on any of the clusters I am automatically backing up.  I just don’t want to lose any more of my data.

Posted in Technology

Wolf’s approximation of Madelung potentials

Wolf (Wolf, Keblinski, Phillpot, and Eggebrecht, J. Chem. Phys. 110 (1999) 8254) came up with a very clever way to approximate the Madelung potential in infinite solids that is much less expensive to calculate than the traditional reciprocal-space-based Ewald methods.  In his seminal paper, Wolf showed that his approach works wonderfully for both crystalline and amorphous solids like NaCl and MgO.

Implementing this so-called Wolf sum isn’t hard; there are two parameters to choose, and picking the right values is quite straightforward:

Figure: Wolf-approximated Madelung potential for halite as a function of the two empirical parameters rc and β (= 1/α)

In the case of halite, if you want a cutoff of 10 Å, it looks like β = 3.46 Å is a good choice; the Madelung potential appears fully converged, and unlike the β = 2.46 Å case, there is no systematic error due to overdamping. Life is good, right?
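For the curious, the halite case is small enough to reproduce in a few lines.  The sketch below is not my research code; it is a minimal illustration under several assumptions: unit point charges on a rock-salt lattice, reduced units in which the nearest-neighbor distance is 1 (so rc and β here are not in Å), and the damped, charge-neutralized pair sum as I understand it from Wolf’s paper, i.e. φ_i = Σ_j q_j [erfc(r_ij/β)/r_ij − erfc(rc/β)/rc] − q_i [erfc(rc/β)/rc + 2/(β√π)] with the sum running over neighbors closer than rc.

```python
import math
from itertools import product


def wolf_madelung_constant(rc, beta):
    """Wolf-sum estimate of the Madelung constant of rock salt.

    Distances are in units of the nearest-neighbor spacing and the
    charges are +1/-1, so the accepted value is about 1.7476.
    """
    alpha = 1.0 / beta
    shift = math.erfc(alpha * rc) / rc
    n = int(rc)                 # lattice sites with |index| <= n can be in range
    phi = 0.0                   # potential at the +1 ion sitting at the origin
    for i, j, k in product(range(-n, n + 1), repeat=3):
        if i == 0 and j == 0 and k == 0:
            continue            # skip the central ion itself
        r = math.sqrt(i * i + j * j + k * k)
        if r > rc:
            continue            # outside the spherical cutoff
        q = -1.0 if (i + j + k) % 2 else 1.0
        phi += q * (math.erfc(alpha * r) / r - shift)
    # Self term for the central ion (q_i = +1).
    phi -= shift + 2.0 * alpha / math.sqrt(math.pi)
    return -phi


if __name__ == "__main__":
    for rc, beta in [(3.0, 1.0), (4.0, 1.5), (6.0, 2.0)]:
        print("rc = %.1f  beta = %.1f  M = %.5f"
              % (rc, beta, wolf_madelung_constant(rc, beta)))
```

Scanning rc and β with a loop like the one at the bottom is how a convergence map like the one above can be built; for materials such as alumina the lattice and charges would need to be replaced, which this sketch does not attempt.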

As it turns out, applying the Wolf summation method to slightly more complicated crystals isn’t as nice. Take, for example, alumina (Al2O3):

Figure: Wolf-approximated Madelung potential for alumina.

Suddenly the Madelung potential doesn’t oscillate nicely around the true value; rather, the converged value decreases monotonically with increasing damping. This offers no indication of what the true converged Madelung energy for this crystal is. What about other relevant materials that aren’t a simple 1:1 stoichiometry?

Figure: Madelung potential for water assuming partial charges

Water looks a lot like alumina. The Wolf sum isn’t working very well here.

Figure: Madelung potential for water assuming partial charges with diffuse character

Using a more realistic (but still empirical) treatment of the Coulombic nature of water makes things worse.

Figure: Madelung potential for amorphous silica assuming partial charges with diffuse character

…and amorphous silica also doesn’t work.

So what’s going on here? This method seems scientifically sound, and I’m reasonably sure my implementation of it is correct since I can match the results of other codes, but the only systems with which it seems to work reliably are electronically very simple.

What frustrates me about this is that my work for the last five years has been using this Wolf method for both water and silica. The parameters were published before I started, and I had assumed that they were derived using some sort of sensible procedure. Unfortunately, I can’t figure out what that method was because re-parameterizing the Wolf method myself has revealed the process to be nothing but fragile and murky.

I can’t say that this is unexpected given how my research has almost invariably panned out for the last five years, but it is frustrating nonetheless. Consequently, I will probably spend the next few days fighting code and methods rather than doing real science that can contribute to my dissertation. But such is the nature of graduate work.

Posted in Science