OpenRefine: Cleaning Messy Data

Throughout my career I’ve spent quite a bit of time figuring out how to clean up messy, inconsistent data sets — typically descriptive metadata about digital collections.

In the past, this has usually involved complex sequences of Edit/Replace operations in a text editor, lots of cool Microsoft Excel functions or Microsoft Access queries, and sometimes even just good, old-fashioned manual editing. Sophisticated software products known as ETL (Extract, Transform, Load) tools have existed for a few years now, but they are usually both too expensive and too complex for the types of projects I have worked on.

So I was very excited today when I discovered OpenRefine, a free desktop application that promises to make many of the most common data cleaning operations simple, quick and easy — “a free, open source, power tool for working with messy data”, as it describes itself.

I haven’t played with it yet, but if the videos are anything to go by, it could be a godsend!