Dissecting Elevators part 2: EZ Scrape

Intro: This is part two of an eight-part series looking at the Elevator Explorer, a fun data interactive mostly coded between the hours of 10 PM and 2 AM during the week leading up to April Fools’ Day, 2013. I’m going to look at the things I learned, the things I wish I could have done, and the reasoning behind my design choices. The code I’ll be referring to is in this tagged release on GitHub.

Scraping Data

Having worked on many data apps now, I know there’s always that initial excitement of figuring out how to import a dataset. But soon the months (or hours) pass, and you forget how to do the import. Well, that’s no good. Elevators was my latest iteration on doing simple, repeatable data updates. I say latest iteration because, while I’ve managed to improve, there’s plenty left to improve upon. For this post, I’ll cover my process up to the actual data import.

tx_elevators_data.git

[Comic: “Because of Reasons” by Ryan Pequin]

If your source data is in plain text and has line breaks, you should keep it in source control. This is an extra step, but there are reasons:

  • There’s history. I can look at the previous state of the data and see when they fixed a typo.
  • You can quickly see what happened. Browsing the diff gives you insight into how the data changes.
  • It adds portability. If I had to pull 12 different data sources, I can tell a coworker to do one pull from my git remote and get the data in exactly the same state I have it.

To do updates, I use Make. Mike Bostock has a good post on the subject called Why Use Make. To do source control, I use git. You could host the data online at GitHub or Bitbucket, but that feels wrong. What I do instead is make a bare git repo in Dropbox and use it as my remote origin. To share the Makefile, I put a copy of it in the main tx_elevators repo. You can see it at https://github.com/texastribune/tx_elevators/blob/2013-april-fools/data/Makefile.
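If you want to try the Dropbox trick, a minimal sketch looks something like this; the paths are just for illustration, not the ones I actually use:

$ git init --bare ~/Dropbox/remotes/tx_elevators_data.git   # the "server" lives in Dropbox
$ cd tx_elevators_data
$ git remote add origin ~/Dropbox/remotes/tx_elevators_data.git
$ git push -u origin master

Dropbox syncs the bare repo, so anyone sharing that folder can clone and pull from the same path.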

Keeping the Diffs Clean

After my first update, it was apparent that I was getting the data back in no particular order and my diffs were going to be unusable. The first thing I tried was a sort:

mv $@ $@.orig
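# keep the header row (NR==1) where it is; pipe everything after it through a numeric sort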
awk 'NR==1; NR > 1 {print $0 | "sort -n"}' $@.orig > $@

It worked, until I discovered that my data had one record with a line break inside a field, which broke many things. So I turned to csvkit. The current iteration uses csvkit’s csvsort:

mv $@ $@.orig
cat $@.orig | csvsort > $@

It’s significantly slower, but not terrible, and it doesn’t break the data. For reference, the whole command only takes 31 seconds on my old machine:

$ time make clean elevator_data_file.csv
...snip...
real    0m30.641s
user    0m19.841s
sys     0m1.588s
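For context, here’s a rough sketch of how that recipe might sit in the data repo’s Makefile. The download URL is a placeholder, not the real source, and the actual Makefile linked above is more involved:

elevator_data_file.csv:
	curl -o $@ 'http://example.com/elevator_data_file.csv'
	mv $@ $@.orig
	cat $@.orig | csvsort > $@

clean:
	rm -f elevator_data_file.csv elevator_data_file.csv.orig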

Improvements

Maintaining two copies of the data importer’s Makefile is annoying. I’m still trying to figure that one out, but I have an idea. I think I can adjust my make scrape command to assume there’s a ./tx_elevators_data directory alongside tx_elevators, and fall back to tx_elevators/data if it isn’t there. I should also change the CSV filenames, keeping elevator_data_file.csv as the original download and writing the sorted output to a new elevator_data_file.sorted.csv.
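A sketch of what that fallback could look like (the variable name is mine, not something from the repo):

# prefer a sibling checkout of the data repo; fall back to the bundled copy
DATA_DIR = $(firstword $(wildcard ../tx_elevators_data) data)

Any rule that needs the CSV would then refer to $(DATA_DIR)/elevator_data_file.csv.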

Conclusion and Next Time…

Once again, I’ve chosen to use Make to automate things. I was hoping to get into actually importing the data here, but then this post would be way too long. The gist of it is that the tx_elevators Makefile calls the tx_elevators_data Makefile to make sure it has fresh data before importing it. The entire process is condensed to one command: make scrape. In addition to making the process simple and repeatable, it’s also important for it to be easy to understand. Document everything as if you expect contributors, because you’re going to forget how everything worked.
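The real recipe lives in the repo’s Makefile, but the shape of it is roughly this; the import command name is made up here, since the actual import is the subject of the next post:

scrape:
	# refresh the data first by deferring to the data repo's Makefile
	$(MAKE) -C data elevator_data_file.csv
	# then hand the CSV to the importer (hypothetical command name)
	python manage.py import_elevators data/elevator_data_file.csv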
