OpenRefine: Cleaning Messy Data

Throughout my career I’ve spent quite a bit of time figuring out how to clean up messy, inconsistent data sets — typically descriptive metadata about digital collections.

In the past, this has usually involved complex sequences of Edit/Replace operations in a text editor, and lots of cool Microsoft Excel functions or Microsoft Access queries — and sometimes even just good, old-fashioned manual editing (sophisticated software products, known as ETL — Extraction, Transformation and Load — tools have existed for a few years now, but they are usually both too expensive and too complex for the types of projects I have worked on).

So I was very excited today when I discovered OpenRefine, a free desktop application that promises to make many of the most common data cleaning operations simple, quick and easy — “a free, open source, power tool for working with messy data”, as it describes itself.

I haven’t played with it yet, but if the videos are anything to go by, it could be a Godsend!

Search Engine Optimization Cheat Sheet

Search Engine Optimization (or “SEO” for short), the tweaking of web content so that it ranks highly in search engine results, can often seem like either a dark art or snake oil.

If you manage a website, you will almost certainly have received emails from various shadowy “SEO Experts,” promising to make your website show up at the top of web searches for your chosen keywords. SEO spam emails can sometimes read exactly like ads for miracle drugs that enhance male bedroom performance!

Some of these so-called SEO “experts” will use dubious “black hat” techniques, such as “spamdexing,” to achieve temporarily high page ranks — although the search engine providers are constantly on the lookout for such nefarious tactics, and sometimes even penalize offending website by artificially lowering their page rank!

The truth is that the best way to optimize your ranking in search engine results is to follow the principles of good web design, as suggested by the search engine providers themselves.

The basic rules are pretty simple:

  • Make sure you pages are valid, properly-structured (X)HTML;
  • Ensure every page on your site has a valid, descriptive <TITLE> tag in the <HEAD> section. Google uses this as the hyperlinked title of the page in search results, so make it good!
  • Use short, descriptive, human-readable domain names and hyperlinks, rather than gobbledygook (like /PageID?123) from content management systems, wherever possible;
  • Use the <META> Description tag to provide a short human-readable description of the page. Search engines, including Google, sometimes display these as the brief description of the page in search results (provided it doesn’t trigger any metatag spam warnings);
  • Use the ALT attribute of the <IMG> tag to provide a text alternative to images;
  • Provide quality content and keep it up to date!

Google WebMaster Tools have produced a handy one-page cheatsheet on SEO basics, Optimize Your Website, which contains this basic advice, plus links to more detailed information.

ISO 8601 Date Format

I’ve been a fan of the ISO 8601 date format ever since I first learned about it at a Dublin Core Metadata Initiative conference in the late nineties: It’s:

  • Logical (the elements are ordered from the most significant to the least from left to right);
  • Y2K-proof, since it uses a 4-digit year (remember this was a big issue in the late nineties!);
  • Internationally unambiguous (unlike the absurd U.S. MM/DD/YYYY format);
  • Extensible to include international time to any level of precision
  • Recommended by both the DCMI, the W3C, AND the web comic XKCD!
XKCD Public Service Announcement Graphic: ISO 8601

XKCD Public Service Announcement: ISO 8601

Plus if you use it for file names, they even sort in the correct order.

You should use it, too!