TL;DR: csvkit is a great little library and set of UNIX-style utilities for dealing with CSV data. Here I show how I solved a problem with some CSV Data I had at work and how I solved it with Python and csvkit!
Basically the problem boiled down to the classical GIGO problem where I had a CSV Input Source containing inconsistent data as well as inconsitent formatting (superfluous whitespace).
In fact the CSV Input Source originally came from a Microsoft Excel Spreadsheet which itself came from a MySQL Dump of a Drupal Website :/
Drupal Site -> MySQL Dump -> Excel -> CSV
Sadly this is the no. of steps the data took to get to me where I began the arduous task of cleaning it up and writing an importer for it for our Plone Site.
One problem I had in particular was inserting a new column of slugified
URL(s) based on the
$ pip install csvkit slugify $ csvcut -x -e utf-8 -c 1,3 -S < nccarf.csv > titles.csv $ csvpy -e utf-8 titles.csv Welcome! "titles.csv" has been loaded in a CSVKitReader object named "reader".
- Install csvkit and slugify.
- Extract columns 1 and 3 (Node ID and Title) from the input CSV
- Start a
csvpyinteractive session to inspect the resulting data.
>>> from slugify import slugify >>> xs = list(reader) >>> slugify(xs) 'nccarf-brochure' >>> ys = [xs] + [[x, slugify(x)] for x in xs[1:]] >>> ys [u'Node ID', u'URL'] >>> ys [u'54', 'nccarf-brochure'] >>> ys [u'61', 'an-assessment-of-the-vulnerability-of-australian-forests-to-the-impacts-of-climate-change-two-page-summary-of-key-findings'] >>> ys [u'568', 'nccarf-strategic-plan-a-summary'] >>> ys [u'862', 'climate-change-adaptation-research-in-australia-an-overview-of-research-funded-by-nccarf'] >>> header = ys >>> rows = ys[1:] >>> import csv >>> with open("urls.csv", "wb") as f: ... writer = csv.writer(f, header) ... writer.writerow(header) ... writer.writerows(rows) ... >>> ^D
- Import the
slugifyfunction from the slugify module.
- Convert the
xsso we can manipulate the list.
- Test “slugifying” a sample of the data.
- Generate a list of
yswhere each Title (indexed at 1) is “slugified” using the
- Check the resulting new data in
- Write out a new
urls.csvoutput CSV which we will later join with the original input CSV (
$ csvjoin -c 1,1 -e utf-8 --left nccarf.csv urls.csv | csvcut -e utf-8 -C 18 > nccarf.csv
This is where the final bit of the “magic” happens.
csvjoin from the
csvkit set of tools is a
wonderful thing for creating new CSV based data sets from two or more
pieces of input CSV data.
This basically performs a left inner join on column one of the two input csv files and pipes the output to a new resulting output file.