I was interested to read Rebecca Murphey’s Baseline for Front End Developers in which the author realizes that she has begun to take for granted a basic set of tools of the trade. She presents “a few things that [she] wants to start expecting people to be familiar with, along with some resources you can use if you feel like you need to get up to speed”. I’d like to pay homage to that article and revise it in light of the tools I believe are essential for an Information Analyst. There is considerable overlap in the tools suggested although the application will differ.
GNU/ UNIX Command Line Tools
These tools are especially powerful because they don’t require you to load the entire file into memory. This makes it possible to work with very large files efficiently.
catetc for looking at files
trfor processing and altering files
joinagain for editing and combining text files
wcfor counting lines/ words/ characters
iconvfor converting between different encodings
Some useful resources include:
- Unix as an IDE
- Simplify data extraction using Linux text utilities
- Text Processing Commands in the Advanced Bash-Scripting Guide
- Replacing returns with commas in Sed
- Oh my zsh - zshell a replacement for bash
Regexs are extremely powerful, if sometimes counter-intuitive. They are vital for defining patterns and replacements in text processing. Although some debate exists a to whether they should be used to parse structured formats like XHTML, they are certainly a vital tool.
Having said which, I find this quote pretty amusing!
Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.
- RegExr - Regex building and testing tool
- Regexper - visualisation tool
- Reference Docs
- An example for validating credit card numbers or validating UK Postcodes
A Programming/ Scripting Language
If nothing else, these provide the glue to connect different sections of your data pipeline. The incredible range of libraries avavailable to most scripting languages will make quick work of many common information processing tasks: parsing, spidering, ETL etc. The choice of language is down to personal preference or compatibility with the team.
It should suffice to say that knowledge of a scripting language is necessary and I don’t feel the need to recommend a particular language. Having said which, I can’t resist saying that I prefer
ruby but have dabbled in
python. I find the following ruby gems indispensible:
CSV in the 1.9 standard library), and
json. I’ve also been using
- Why The Lucky Stiff’s (poignant) Guide to Ruby
- A guide written by Matz (the original author of Ruby)
- Github’s Ruby Style Guide
- Ruby Toolbox - for finding and selecting gems
Text data files
Probably the most basic technology used in information analysis, the flat file is the lowest common denominator. You can almost guarantee that everyone will be able to cope with
csv files (even if they need to be translated before loading into another tool). The expressiveness and rigour of
XML makes it another vital tool although it’s rapidly being superceded in bandwidth-conscious realm of http by the less verbose
Information analysts ought to be able to handle (read, understand, parse and translate) files in these all of formats even if they immediately convert them to a prefered type.
- XSLT for CSV 2 XML
- XPath Syntax
- Improving an XML feed through CSS and XSLT
- JSON reference with links to parsers
Relational Databases, SQL, and NoSQL
This shouldn’t really come as a surprise to anyone. If you’re going to analyse information, you’re going to need to deal with databases. An analyst ought to be able to write SQL, understand normalisation/ denormalisation, and possibly also have some knowledge of ORM. NoSQL databases are also useful when the data structure is better represented by document, graph or key:value structures (and where tables would be very sparse).
- MySQL Cheat Sheet
- database normalisation and denormalisation
- NoSQL Guide
- Mongodb - document database
- Neo4j - graph database
- CouchDB - json + map reduce + http
Unless you really want to code algorithms for every statistical process from scratch, you’re going to want to adopt and familiarise yourself with a statistical library. I would strongly recommend
GNU-R although other options include
Matlab etc. Of course there is probably a library available for your prefered programming language.
numpy are very popular choices for
- The R Project
- R Seek - R-specific search engine
- R tag on Stackoverflow (bonus: Cross Validated - the stats-specific Stack Exchange Q&A site)
- Revolution Analytics Blog
- gretl - Gnu Regression, Econometrics and Time-series Library
Similarly you’ll probably want to use some sort of framework for generating visualisations. I’ve tried many and have found myself coming back to a few:
- ggplot2 - the tool that usually attracts people to R; excellent for quickly sketching out data graphics
Usually you’re going to want to place your analysis in context with some commentary. If you’re going to be repeating a given report - either in part or in full - they you stand to benefit greatly from using some form of template. Again this will depend upon your prefered programming languages. My favoured tools are listed below.
- Sweave - R + Latex and Open Document Format. I’ve not tried it yet but Knitr looks like it could be a useful package).
- HAML - HTML abstraction markup language - very clear markup with indentation (so you don’t have to write out every tag twice), you’re able to use ruby inline.
- Sass - Syntactically Awesome Stylesheets - an extension of CSS3, adding nested rules, variables, mixins, selector inheritance, and more.
- d3 data driven documents - for programmatically creating html from data
Version management is vital for tracking progress, making experimental branches in your work, and coordinating across teams. The clear leader in this regard is git and the amazing project hosting environment github. I’ve got in the habit of versioning everything - not just my code - but my home file configurations too.
And the rest…
This is just a quick list I wrote off the top of my head. What have I missed? What else do you use?