Category Archives: Case Study


Optimizing Docker Image Size For Real

I’ve come across tips on how to keep Docker images small and Dockerfiles with strange lines that seem to exist only to optimize image size. Well, it turns out they’re all wrong.

They may have an effect with flat Docker images, but for everything else (i.e. 99% of what people do), cleanup steps are just extra steps. When Docker builds an image from a Dockerfile, every step is a checkpoint, and every step is saved. If you add 100 MB in one step, then delete it in the next, that 100 MB still needs to be saved so other Dockerfiles with the same step can reuse it.

Results

REPOSITORY               TAG             IMAGE ID            CREATED             VIRTUAL SIZE
test/baseline            latest          7b590dec9b43        7 hours ago         272.6 MB
test/baseline_lines      latest          e165025980f7        9 minutes ago       272.6 MB
test/baseline_lists      latest          b40f9e108a93        About an hour ago   272.6 MB
test/combo               latest          744b502e0052        2 seconds ago       269.8 MB
test/combo2              latest          be8f1c1de02e        About an hour ago   249.8 MB
test/combo3              latest          da948e2838d9        About an hour ago   249.8 MB
test/install             latest          e7cadcbb5a05        12 hours ago        269.8 MB
test/install_clean       latest          dd1383285e85        12 hours ago        269.8 MB
test/install_lists       latest          e55f6f8ebac8        12 hours ago        269.8 MB
test/purge               latest          ef8c2aa7400b        About an hour ago   273.5 MB
test/remove              latest          75e3e5c4e246        About an hour ago   273.5 MB

Hypothesis: Docker’s base Ubuntu image does not need `apt-get clean`

I did an experiment around Docker 0.6. I think my conclusion was that `apt-get install … && apt-get clean` saved a few megabytes. But I had heard that you didn’t need to do that. If you compare the sizes of “test/install” and “test/install_clean”, you’ll see there is no difference. So you don’t need `apt-get clean`.

Hypothesis: `rm -rf /var/lib/apt/lists/*` saves some space

I’ve been seeing a lot of Dockerfiles lately with this line, including lots of official Docker images. If those guys are all doing it, surely it must have some effect. Nope.

Hypothesis: Combining similar lines saves space

There’s some overhead for each line in a Dockerfile. How significant is it? It turns out it’s not significant at all. What I did find, though, is that combining lines saves a significant amount of time and avoids a lot of disk thrashing. So combining lines does not save space, but it does save time.

Hypothesis: Combining multiple steps saves space

This makes sense. If you skip making checkpoints, you’re not storing intermediate states, and it turns out this is the only way to make a Docker image built from a Dockerfile smaller. But it comes at the cost of readability, and more importantly, at the cost of reduced layer sharing between images.

Hypothesis: `apt-get purge` saves some space

Well this hypothesis seems silly now. But I see it used now and then. Deletions do not save space.

Conclusion

Write your Dockerfiles the same way you run commands. Don’t prematurely optimize by adding extra cruft you saw someone else do. If you’re actually worried about image size, use some sort of automation to rebuild Docker images behind the scenes. Just keep that logic out of the Dockerfile. And always keep on measuring. Know your bottlenecks.


How to Ignore the Needle Docs

At PyCon 2014, I learned about a package called “needle” from Julien Phalip’s talk, Advanced techniques for Web functional testing. When I tried using it with a Django project, I immediately ran into problems:

  1. The needle docs aren’t written for Django, so they don’t explain how to use NeedleTestCase with LiveServerTestCase.
  2. I wasn’t using nose as my test runner, and didn’t want to start using it just to run Needle.

The first problem turned out to be easy; use both:

class SiteTest(NeedleTestCase, LiveServerTestCase):
    pass

The second problem wasn’t that bad either. If you examine the nose plugin that Needle provides, you’ll see it just adds a save_baseline attribute to the test case.

There were a lot of random hacks and tweaks I threw together. I think the best way to show them all is with an annotated example:

import os
import unittest

from django.test import LiveServerTestCase
from needle.cases import NeedleTestCase


# This is a configuration variable for whether to save the baseline screenshot
# or not. You can flip it by manually changing it, with an environment variable
# check, or monkey patching.
SAVE_BASELINE = False


# You should be taking screenshots at multiple widths to capture all your
# responsive breakpoints. Only the width really matters, but I include the
# height for completeness.
SIZES = (
    (1024, 800),  # desktop
    (800, 600),  # tablet
    (320, 540),  # mobile
)


# To keep the test runner from running slow needle tests every time, decorate
# it. In this example, 'RUN_NEEDLE_TESTS' has to exist in your environment for
# these tests to run. So you would run needle tests like:
#
#     RUN_NEEDLE_TESTS=1 python manage.py test python.import.path.to.test_needle
@unittest.skipUnless('RUN_NEEDLE_TESTS' in os.environ, 'expensive tests')
class ScreenshotTest(NeedleTestCase, LiveServerTestCase):
    # You're going to want to make sure your pages look consistent every time.
    fixtures = ['needle.json']

    @classmethod
    def setUpClass(cls):
        """
        Sets `save_baseline`.

        I don't remember why I did it here. Maybe the timing didn't work when
        I put it as an attribute on the test class.
        """
        cls.save_baseline = SAVE_BASELINE
        super(ScreenshotTest, cls).setUpClass()

    def assertResponsive(self, scope, name):
        """Takes a screenshot for every responsive size you set."""
        for width, height in SIZES:
            self.set_viewport_size(width=width, height=height)
            try:
                self.assertScreenshot(
                    scope,
                    # include the name and browser in the filename
                    '{}_{}_firefox'.format(name, width)
                )
            except AssertionError as e:
                print(e)
                # suppress the error so needle keeps making screenshots. Needle
                # is very fickle and we'll have to judge the screenshots by eye
                # anyways instead of relying on needle's pixel perfect
                # judgements.
                pass

    def test_homepage(self):
        urls_to_test = (
            ('/', 'homepage'),
            ('/login/', 'login'),
            ('/hamburger/', 'meat'),
            ('/fries/', 'potatoes'),
            ('/admin/', 'admin'),
        )
        for url, name in urls_to_test:
            self.driver.get(self.live_server_url + url)
            self.assertResponsive(
                # for now, I always want the full page, so I use 'html' as the
                # scope for my screenshots. But as I document more things,
                # that's likely to change.
                'html',
                # passing in a human readable name helps it generate
                # screenshots file names that make more sense to me.
                name,
            )

Well, I hope that made sense.

When you run the tests, the screenshots are saved to the ./screenshots/ directory, which I keep out of source control because storing so many binary files is a heavy burden on git. We experimented with git-annex, but it turned out to be more trouble than it was worth.

My typical workflow goes like this:

  1. Make sure my reference baseline screenshots are up to date: git checkout master && grunt && invoke needle --make
  2. Generate screenshots for my feature branch: git checkout fat-buttons && grunt && invoke needle
  3. Open the screenshots directory and compare screenshots.

In that workflow, grunt is used to generate CSS, invoke is used as my test runner, and --make is a flag I built into the needle invoke task to make baseline screenshots.
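
For reference, here is a minimal sketch of what that invoke task might look like. This is an illustration rather than my actual task: the SAVE_BASELINE environment variable is an assumption, and it targets a recent version of invoke that passes a context as the first argument.

from invoke import task


@task
def needle(c, make=False):
    """Run the needle screenshot tests; --make records fresh baselines."""
    env = {'RUN_NEEDLE_TESTS': '1'}
    if make:
        # hypothetical: the test module would check this to flip SAVE_BASELINE
        env['SAVE_BASELINE'] = '1'
    prefix = ' '.join('{}={}'.format(k, v) for k, v in env.items())
    c.run(prefix + ' python manage.py test python.import.path.to.test_needle')

With that, invoke needle --make records the reference screenshots and a plain invoke needle generates new ones to compare against.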

Now I can quickly see if a change has the desired effect for multiple browser widths faster than it takes to actually resize a browser window. Bonus: I can see if a change has undesired effects on pages that I would have been too lazy to test manually.

Next steps: I still haven’t figured out how to run the same test in multiple browsers.


Boating Lake, aka Photoshop is still useful for the web

So I made a silly thing one evening. I’ve embedded it above, and you can see the source at https://github.com/crccheck/boating-lake for more details about the inspiration and implementation. I thought my two readers might be interested in my process for preparing the graphics.

Finding the source

A quick Internet search revealed a clean source for the comic. I downloaded it and brought it into Photoshop.

Creating a mask from line art

Making cutouts is the infantry of the Photoshop army. There are more tools and techniques for making cutouts than for any other Photoshop task; I thought I’d show my favorite technique for cutting out line art.

[Image: boating-lake-step1-original]

Duplicate the layer to preserve your original. The goal of the next steps is to use the art as its own alpha mask. Select All (Ctrl+A), Copy (Ctrl+C), create a layer mask, Alt+click to enter the layer mask, Paste (Ctrl+V), and then invert the mask (Ctrl+I).

If you exit the mask, you’ll see that you’ve sort of created a mask without using a brush, eraser, lasso, or wand!

[Image: boating-lake-step2-inverted-mask]

Now to fix the alpha, go back and re-enter the layer mask (Alt+click the mask in the Layers palette).

[Image: boating-lake-step3-edit-mask-alpha]

Edit the mask’s levels (Ctrl+L). I’m going to make a mask for the characters in the “Key” area first, so I’ll set the black point and white point accordingly.

[Image: boating-lake-step4-mask-black-white-points]

It’s a good idea to tighten up the black point and white point a bit to make the alpha cleaner.

[Image: boating-lake-step5-mask-black-white-points2]

Now I have a nice mask for the characters.

[Image: boating-lake-step6-masked]

Now I just Copy Merged (Shift+Ctrl+C) and Paste (Ctrl+V) each shape into its own layer. I kept finding mistakes, so I had to go back to fill in the robot’s face and resize the characters to be 15% smaller so they would fit in the boats.

The process is the same for getting the boat and waves, except you pick the water as the black point for the layer mask.

[Image: boating-lake-step7-lake-mask-black-white]

Repeat this process to create a mask for any other element you want to isolate. I also grabbed the title and the entire key.

Automating the export

To export each sprite, make sure its layer is visible, and hide all the rest.

Then go to File -> Scripts -> Export Layers to Files…

[Image: boating-lake-step8-export]

Make sure File Name Prefix is empty, File Type is PNG, Transparency is on, and Trim Layers is on. The process may take some time, but you’ll end up with a nice pile of sprites.

Extra Hints

  • Depending on how clean the source material is, you’ll probably want to run the “Despeckle” de-noising filter to remove jpeg noise.
  • If your alpha isn’t very clean, you can try tightening the black point of the layer mask.
  • I have a layer called “bg fill” that I usually fill with a solid color. I use this to help catch mask spills and refine edges.
  • If you need to do isolated manual cleaning on a layer, create a new document with just that layer. There’s a shortcut for doing this! Layer -> Duplicate Layer…, then set the Destination Document to “New”.
  • You’ll notice the file names look like _0000_boat.png or _0000s_0001_robot.png. The number is the layer number, and multiple numbers mean the layer was in a group. I once wrote a Photoshop script that did the same thing without those numbers, but Photoshop’s built-in script got better and I didn’t feel like upgrading mine, so now I just use the built-in script and manually clean off the numbers. If you have a lot of files to rename, you can use a mass file renamer (for Windows, I recommend ReNamer by den4b, which has a preview and supports regular expressions).
  • To speed up the export, delete as many hidden layers as possible. The script isn’t very smart.

Errata

I ended up exporting the waves separately for some reason, so they’re not documented here. Also, I prefer Photoshop for Windows over Mac.


Dissecting Elevators Part 8: deploying

If you examine the repo, you’ll see it’s a Django app, and the Procfile and requirements.txt would make you think the app itself was deployed on Heroku. But you would be wrong!

You may be surprised to find out that the Elevator Explorer is a static HTML app. Or maybe not, if you read the previous seven parts. This was inspired by NPR Apps’ app-template project and a general malaise with dealing with servers. At The Texas Tribune, we’ve talked about turning our data apps into static files, the general reasons being:

  1. Data apps are generally not dynamic. Making a server render the same output over and over is a waste of resources and takes away resources from rendering truly dynamic content.
  2. Static sites do not go down. When they do go down, you’ve got big problems.
  3. They don’t need fancy infrastructure. No database, no caching, no app server, no media server. They are the easiest kind of website to deploy.

So how do you turn a site served from a fancy framework like Django into a static site? Let’s start by looking at what I actually do in my terminal:

  1. I make sure my local dev server is running. For now that looks like python manage.py runserver 0.0.0.0:8000.
  2. In another window, I run make site; make upload. I can’t do make site upload right now because I need to handle a non-zero exit status in make site.

make site

And now looking at the make site command in my Makefile, here’s the abridged version of that command:

site:
    cd site && wget -r localhost:8000 --force-html -e robots=off -nH -nv --max-redirect 0

And the detailed breakdown of that command:

  • cd site: this is the working directory I want the files downloaded to.
  • wget: this is the command that does the actual work. “GNU Wget is a free utility for non-interactive download of files from the Web.”
  • -r: recursive
  • localhost:8000: the host to download files from
  • --force-html: Without this, wget would not interpret the django trailing-slash urls as html
  • -e robots=off: Not really necessary, but does prevent a 404 request to robots.txt
  • -nH: Disable generation of host-prefixed directories so the file paths are cleaner
  • -nv: Turn off verbose output without being completely quiet, but not so quiet that I wouldn’t see errors
  • --max-redirect 0: Right now I’m using OpenDNS, and when I do hit a 404, this keeps me from downloading OpenDNS’s stupid search page.

If you’re interested in wget, I highly recommend skimming the wget manual to learn about other cool options and checking out commandlinefu’s wget recipes.

[Image: LeVar Burton enjoys reading UNIX man pages, and so should you!]

Out of the box, wget will do a very good job of getting everything needed to build a Django site, as long as you aren’t using AJAX to pull things in. But the Elevator Explorer does use AJAX. To trick wget into downloading those resources, I added hints to the templates:

<script src="{{ STATIC_URL }}tx_elevators/js/search.js"></script>
<a class="prefetch-hint" href="/chart/search/data.json" rel="nofollow" style="display: none;"></a>

In the future, I think I’ll refine this by putting the anchor directly before the script tag and switching to the {% url %} templatetag; then you can get at that url with something like this jQuery pseudocode:

<a href="{% url 'my_data_source' %}"></a>
<script>
  var url = $('script:last').prev().attr('href'); $.ajax(url, ...);
</script>

This will hopefully be a decent solution to the common problem of trying to use named Django urls in JavaScript. The next problem I discovered was that I needed the json I sent to look like a file to wget so it wouldn’t mangle it into index.html (breaking the requests inside the static site). I just changed the url patterns to end in “.json$” instead of “/$” in 36f276.
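
For illustration, the url pattern change looked something like this. This is a sketch, not the actual commit, and the view name is a placeholder:

from django.conf.urls import url

from tx_elevators import chart_views  # the view name below is a placeholder

urlpatterns = [
    # before: wget rewrote the response to chart/elevatorlist/data/index.html
    # url(r'^chart/elevatorlist/data/$', chart_views.elevatorlist_data),
    # after: wget saves the response as chart/elevatorlist/data.json
    url(r'^chart/elevatorlist/data\.json$', chart_views.elevatorlist_data),
]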

Another idea I might try is using the <link> tag, but I’d have to make sure wget still downloaded the resources, and you can only put them in <head>.

make upload

The next part of the process is to upload the site directory someplace. I chose to use S3 and its ability to host a static site. A regular webhost would also work, and might even work better if you had the ability to use rsync instead of what I had to do. But let’s continue and go into what exactly I did, starting with the make command:

upload:
    python $(PROJECT)/manage.py sync_s3 --dir site --gzip

The sync_s3 command here is a basic fork of django-extensions’ sync_media_s3 command. I only made one tweak so it’ll gzip json and html too. If you don’t need that, you can use the original sync_media_s3 command. To set up the credentials for S3, you can either read django-extensions’ docs or just try it; the error messages will tell you exactly what you need to fix. I will point out that the source directory is set by the --dir flag, and the destination S3 bucket is controlled by the general configuration.

Performance

We know the performance of a static website is awesome, but the trade-off is that it’s slow to do the up-front work of generating a static site from a dynamic one. You may have noticed the terrible timing figures documented in the makefile: it takes over half an hour to crawl the site, and over three hours to upload to S3. I fiddled with a few knobs in a futile effort to speed things up, turning off debug and using gunicorn, but they didn’t help. There are some other ideas I pondered for improving performance:

  • It would be possible to write a Django-specific spider instead of using wget, one that could read the url patterns. This is what app-template does for Flask if you look at its fab render task.
  • I could make the process multi-threaded. There are some alternatives to wget, but wget is good at what it does and is ubiquitous, so I’d rather speed up the upload process. GitHub user billyvg did some work on making sync_media_s3 multithreaded in issue #161.

Other ideas that are probably bad ideas or not worth the effort:

  • Combine the download/upload steps
  • Prioritize pages so important pages are updated first

And for comparison, uploading with rsync is so much faster that it seems silly not to use it:

time rsync -avz site/ remote:elevators
sent 45200666 bytes  received 561995 bytes  92356.53 bytes/sec
total size is 123778877  speedup is 2.70

real    8m14.409s

Conclusion

For staging, I did deploy to Heroku, but I didn’t want to pay to keep a database online (this app exceeds the free database limit of 10,000 rows) and I didn’t know how much traffic I could serve from the free tier. The static site worked really well, except for the 4-hour deploys. I think it’s possible to get that down to 2 hours, which is great for a 25,000-page site.


Dissecting Elevators Part 7: Search + D3

With almost 25 thousand buildings, having a way to search the database would be nice. And just like how I did geo searches client side, searching buildings was done client side too. The inspiration for the interface was The Texas Tribune’s Tribpedia landing/search page. The final building search page can be found at elevators.texastribune.org/building/ and the JavaScript source can be found at github.com/texastribune/tx_elevators/blob/2013-april-fools/tx_elevators/static/tx_elevators/js/search.js.

How it’s made

The one lonely chart on the home page isn’t the only place where D3 is used: the building search UI is built with D3 too. “But it looks like HTML, not SVG!” the astute reader might remark, and you would be correct. D3 is not just for SVG; it can be used for HTML too.

The first step was to build the NOSCRIPT version, which was just a list of links. This list also serves as a way to map building ELBI IDs to urls and as a crawlable index for search engines. This was partially because the original API didn’t have a way to return urls, and because I wanted to keep the json light. As a side note: when I moved to putting slugs in the building URLs, I had to go back and modify my geolocation API to return building URLs (and thanks to generators, that was not hard to do).

The first thing I did was bin each building by name. The important thing to note is that I split the work into several steps. The binData function organizes the raw data into bins, which almost parallels the structure of the final HTML. The binned data then gets passed to the prepNameData function, which finesses the data to look exactly like the structure of the HTML. The next block of code, prepBinsHtml, is just basic D3 to map the finessed, binned data to the final HTML. The rest of the file is just jQuery for the user interaction. This demonstrates the power of D3: once I had the code to map one set of raw data to HTML, changing the data, whether by filtering it for search or re-binning it, just magically worked.

Performance details

I didn’t have a lot of time to make tweaks, or even do basic documentation, but I did make one optimization: I kept the number of DOM elements to a minimum. I used divs for items to avoid ul > li nesting, and I didn’t use <a> tags because they would have doubled the number of elements needed for each building. And when you have nearly 25,000 buildings, that matters. The major downside is I had to fake <a> tag behavior, which is still incomplete; for example, you can’t middle-click to open in a new tab, and the hover interaction is only partially there. Much of that optimization was inspired by these slides by Jon Rohan of GitHub. And if you compare the interaction of the Tribpedia landing/search page with the building search page, you’ll see that this approach results in a much faster experience.


Dissecting Elevators Part 6: GeoDjango? We don’t need no stinkin’ GeoDjango

Being able to locate the nearest elevator was a must-have feature for me. Being able to deploy using the free tier on Heroku was a nice-to-have. Luckily, I managed to find a way to do both: do the calculations client side in JavaScript.

Lucky for me, I found a post, Calculate distance, bearing and more between Latitude/Longitude points, with some JavaScript algorithms for getting the distance between two lat/long points. There’s no such thing as a perfect distance algorithm, but if you start making assumptions about the shape of the Earth, you can get a pretty good guess. I implemented four different ways of calculating distance, each a tradeoff between accuracy and speed: the Haversine formula, the spherical law of cosines, the equirectangular projection (Pythagorean), and rectilinear (taxicab) distance. For my own reference, I’m putting a copy of the JavaScript algorithms here:

// distance approximators
/** Converts numeric degrees to radians */
if (!Number.prototype.toRad){
  Number.prototype.toRad = function(){
    return this * Math.PI / 180;
  };
}

var distance = {
  R: 6371,  // kilometers
  haversine: function(lat1, lng1, lat2, lng2){
    var dLat = (lat2-lat1).toRad();
    var dLon = (lng2-lng1).toRad();
    lat1 = lat1.toRad();
    lat2 = lat2.toRad();

    var a = Math.sin(dLat/2) * Math.sin(dLat/2) +
      Math.sin(dLon/2) * Math.sin(dLon/2) * Math.cos(lat1) * Math.cos(lat2);
    var c = 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1-a));
    return distance.R * c;
  },
  spherical: function(lat1, lng1, lat2, lng2){
    lat1 = lat1.toRad();
    lat2 = lat2.toRad();
    lng1 = lng1.toRad();
    lng2 = lng2.toRad();
    return Math.acos(Math.sin(lat1)*Math.sin(lat2) +
      Math.cos(lat1) * Math.cos(lat2) *
      Math.cos(lng2 - lng1)) * distance.R;
  },
  pythagorean: function(lat1, lng1, lat2, lng2){
    lat1 = lat1.toRad();
    lng1 = lng1.toRad();
    lat2 = lat2.toRad();
    lng2 = lng2.toRad();
    var x = (lng2 - lng1) * Math.cos((lat1 + lat2) / 2),
        y = lat2 - lat1;
    return Math.sqrt(x * x + y * y) * distance.R;
  },
  taxicab: function(lat1, lng1, lat2, lng2){
    lat1 = lat1.toRad();
    lng1 = lng1.toRad();
    lat2 = lat2.toRad();
    lng2 = lng2.toRad();
    return (Math.abs(lat1 - lat2) + Math.abs(lng1 - lng2)) * distance.R;
  }
};

For more information about the first three distance metrics, you should check out the original post. You can see that the computational complexity of each algorithm decreases dramatically, from trigonometry plus a square root, to pure trig, to a single trig call and a square root, to basic arithmetic. The code for getting the ten closest buildings turned out pretty simple. Here, _data is a global array of all the buildings:

// Get the closest `Building`s to `lat` and `lng`.
//
// Modifies the global `_data` by storing the distance and also sorts it.
var closestBuildings = function(lat, lng){
  var metric = distance.spherical, x;
  for (var i = 0; i < _data.length; i++){
    x = _data[i];
    x.distance = metric(lat, lng, x.latitude, x.longitude);
  }
  // go ahead and sort in place.
  _data.sort(function(a, b){ return a.distance - b.distance; });
  return _data.slice(0, 10);
};

Evaluating the different distance metrics

Just to see what the difference was between the four distance metrics, I compared the results of the same search. I picked an arbitrary point where there weren’t many elevators nearby, so the search would have to go out a long way before finding 10. I like BBQ, so I picked a place in Lockhart, TX. To my surprise, the Haversine formula, the spherical law of cosines, and the equirectangular projection not only gave the same results in the same order, they also gave the same distances to 4 significant digits (your results on the live site may look different from mine because of differences between the data on my local machine and the live site):

[Screenshot: Closest elevators to Lockhart, TX]

The results of the search using the taxicab metric were not bad either. You can see that the taxicab metric punishes going diagonally:

[Screenshot: Closest elevators to Lockhart using the taxicab metric]

The taxicab results were even closer to the others (obviously) over shorter distances, as you can see in this comparison for downtown Austin:

[Screenshot: Closest elevators, downtown Austin]
[Screenshot: Closest elevators, downtown Austin, using the taxicab metric]

You can see I got the same results in a slightly different order.

Trickery!

If you’re wondering how I got the map pins to go from A to J and match the labels on the legend, wonder no more. The answer is: I cheated. I created the map pins using the old Google charts API in my mapping code:

"http://chart.apis.google.com/chart?chst=d_map_pin_letter&chld=" +
        String.fromCharCode(65 + i) + "|" + pinColor,

If you vary i from 0 to 9, you get the letters A to J. The legend is an ordered list with the CSS list-style: upper-alpha, which handles going from A to J in the map legend. Since both are built from the same source in the same order, they match up.

Conclusion

And there you have it: basic closest-point searches done client side. The nearly fully geocoded dataset of 23,205 buildings takes a megabyte to transfer, but it could be optimized. At launch, I only had a fraction of the buildings geocoded, so it was less than 150 kilobytes then. I may change the default distance metric from spherical to Pythagorean to get some more speed. It would be an interesting exercise to convert to GeoDjango and compare the distance results again.
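
For comparison, the GeoDjango version of that query might look something like this. This is just a sketch against current GeoDjango, assuming a Building model with a PointField named point, which the real project doesn’t have:

from django.contrib.gis.db.models.functions import Distance
from django.contrib.gis.geos import Point

from tx_elevators.models import Building  # assumes a geo-enabled model


def closest_buildings(lat, lng, limit=10):
    # GEOS points take (x, y), i.e. (longitude, latitude)
    here = Point(lng, lat, srid=4326)
    return (Building.objects
            .annotate(distance=Distance('point', here))
            .order_by('distance')[:limit])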


Dissecting Elevators Part 5: nosql circa 1986

Intro: This is part five of an eight part series looking at the Elevator Explorer, a fun data interactive mostly coded between the hours of 10 PM to 2 AM during the week leading up to April Fools’ Day, 2013. I’m going to be looking at the things I learned, things I wish I could have done, and the reasoning behind my design choices. The code I’ll be referring to will be in this tagged release on github.

I knew I wanted to geocode all the addresses for the buildings, but I didn’t quite know how my models would look. I knew from past experience that doing a pass of geocoding, then resetting the database, would mean I would have to start geocoding again from square one. How could I make this better?

If only I had a wrapper around geopy that would persist old queries to disk. So I started writing one. At first, I thought I would need to do this in sqlite, but after searching for “python+key+value+store”, I found anydbm. What is anydbm? It’s a generic interface to any dbm database. What a name. In my case, it was using Berkeley DB. It’s really easy to use: 1) open a file, 2) treat it like a dict. Way easier than trying to get a sqlite database going. But my database kept getting corrupted! I finally figured out that I needed to open and close the file for every transaction. Since the anydbm library is pretty dated and can’t be used as a context manager, I had to close the file manually.
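
The idea, roughly, is a dict-like cache keyed by the query string. Here is a simplified sketch, not the actual geopydb code; the filename is made up, and it’s Python 2 like the original:

import anydbm
import json

from geopy import geocoders

CACHE_PATH = 'geocode_cache.db'  # made-up filename


def cached_geocode(address):
    # open and close the dbm file around every lookup, or it gets corrupted
    db = anydbm.open(CACHE_PATH, 'c')
    try:
        key = address.encode('utf-8')
        try:
            place, lat, lng = json.loads(db[key])
        except KeyError:
            place, (lat, lng) = geocoders.GoogleV3().geocode(address)
            db[key] = json.dumps([place, lat, lng])
    finally:
        db.close()
    return place, (lat, lng)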

My working version of the GoogleV3 geocoder looks like this. I also made a script for dumping my existing geo data back to an anydbm database; that’s viewable here.

So after all that, I ended up with a library that mimicked the GoogleV3 geocoder. To use it, instead of the standard syntax of:

>>> from geopy import geocoders
>>> g = geocoders.GoogleV3()
>>> place, (lat, lng) = g.geocode("10900 Euclid Ave in Cleveland")
>>> print "%s: %.5f, %.5f" % (place, lat, lng)
10900 Euclid Ave, Cleveland, OH 44106, USA: 41.50489, -81.61027

my database cached version of that is:

>>> from geopydb import geocoders
>>> g = geocoders.GoogleV3()
>>> place, (lat, lng) = g.geocode("10900 Euclid Ave in Cleveland")
>>> print "%s: %.5f, %.5f" % (place, lat, lng)
10900 Euclid Ave, Cleveland, OH 44106, USA: 41.50489, -81.61027

Pretty convenient, and made my life easier. You may have noticed I’m not using GeoDjango. That’s because I wanted to deploy to the free tier at Heroku.

Improvements

If I had to write this now, I would switch to using dataset. Dataset came out around the same time as the Elevator Explorer; if it had been out a week earlier, I could have used it.


Dissecting Elevators Part 4: Django Data Driven Documents

Intro: This is part four of an eight part series looking at the Elevator Explorer, a fun data interactive mostly coded between the hours of 10 PM to 2 AM during the week leading up to April Fools’ Day, 2013. I’m going to be looking at the things I learned, things I wish I could have done, and the reasoning behind my design choices. The code I’ll be referring to will be in this tagged release on github.

Hey everybody, come look at the D3

So first off, this is going to go over the sketchiest code in the project. It seemed to work well for Elevators, but I have no idea how well this would translate to other projects.

At the bottom of the landing page, there’s this chart:

[Chart: The distribution of age vs. floors for almost every elevator in Texas.]

If you examine the source, you’ll see that it’s just a DIV with data attributes that the JavaScript uses to create an IFRAME version of the real chart, full of D3 magic.

Responsive Iframes

So why use an iframe? Why not just load the chart directly into the container div? Iframes may cost us a bunch of extra HTTP requests and make it harder to integrate with the page*, but they also give us an instant embeddable. But this is the era of responsive design, and iframes are even harder to make responsive than SVGs. My solution was to insert the iframe with JavaScript. I could have just operated directly on the IFRAME, but adding another layer (hooray, more complexity) also adds the ability to serve different content based on feature detection (noscript, no SVG, mobile) in the future. Why didn’t I use a preexisting library like NPR’s responsiveiframe? They were all too complicated to use (e.g. I couldn’t get them to work in 5 minutes), and I didn’t have to worry about same-origin issues, so I could write simpler JavaScript:

https://github.com/texastribune/tx_elevators/blob/2013-april-fools/tx_elevators/static/tx_elevators/js/tx_elevators.js#L327-L363

With this technique, adding an interactive visualization is as easy as adding an image. It works by putting markup into the page, similar to how you would use a VIDEO tag:

<section>
  <h2>Distribution of Age and Floors of Elevators</h2>
  <div data-src="/chart/elevatorlist/">
    <noscript>
      Missing Figure
    </noscript>
  </div>
</section>

https://github.com/texastribune/tx_elevators/blob/2013-april-fools/tx_elevators/templates/tx_elevators/landing.html#L46-L50

And the JavaScript knows how to interpret that to insert the chart.

*Integration problems with iframes:

  • Tooltips can’t extend outside the bounds of the iframe.
  • Interactivity between the chart and other elements becomes difficult.

One chart, one view, one queryset

Doing iframes for charts also solves another headache: managing all the querysets that power each visualization. Powering the visualizations off a generic API is an admirable goal, but it falls flat in reality. And mixing all the querysets you need into your views makes for massive views and maintenance headaches. Writing simple, single-purpose views keeps the payload small and the code neat. You may end up with a maintenance problem of another kind, but I would rather have 10 small, simple views than one unintelligible mammoth view. You can see what I ended up doing here:

https://github.com/texastribune/tx_elevators/blob/2013-april-fools/tx_elevators/chart_views.py#L9-L45

The same basic Django TemplateView handles both the chart HTML and the data needed for the visualization. Another feature is that the AJAX endpoint for a chart is always at the same relative URL, so for the visualization at /chart/elevatorlist/, the data lives at /chart/elevatorlist/data.json. You don’t have to worry about moving information about your Django urlconf into JavaScript. And I’m foreshadowing here, but it’s at ./data.json and not ./data/ because that’s friendlier for wget. Another tip is to make sure you’re gzipping your json, especially during development. In my urls.py, you can see I’m using the gzip_page decorator, but you can also enable GZipMiddleware. This makes it much easier to know right away if your json is too large. For production, you should think about turning off gzipping in the application server and moving it up the stack.
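
As a rough sketch (not the actual tx_elevators urls.py; the view names are placeholders), the wiring looks something like this:

from django.conf.urls import url
from django.views.decorators.gzip import gzip_page

from tx_elevators import chart_views  # view names below are placeholders

urlpatterns = [
    # the data endpoint always lives at ./data.json relative to its chart,
    # so the JavaScript never needs to know about the Django urlconf
    url(r'^chart/elevatorlist/$', gzip_page(chart_views.elevatorlist_chart)),
    url(r'^chart/elevatorlist/data\.json$', gzip_page(chart_views.elevatorlist_data)),
]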

Next steps

Like I said, this code was the sketchiest. There are many improvements to be made. Some big ones I can think of include:

  1. Making the chart loader JS capable of loading the chart directly into the container div instead of in an iframe.
  2. Adding a URL pattern helper for automatically setting up the routes for the chart template and data in one line.
  3. Making the API return an array of arrays instead of an array of objects to save bandwidth. It’s a little easier to throw into D3 this way too.

Live demo

Conclusion

So the title was a lie. I didn’t go into the D3 at all. Gotcha. If you’re curious, here it is. Next time, I’ll show you how a rarely used database from 1986 helped me geocode.