Tag Archives: python


requirements.txt==2.0

There is a war going on. A war between those that say Python requirements should be explicit and those that say requirements should be implicit. Before I continue, I’m going to be talking about requirements.txt, not setup.py. The difference between explicit and implicit requirements comes down to whether the line says Django==1.9.7 or Django, respectively. Going deeper, you could also say that adding dependencies of dependencies is explicit, and you could loosely pin like Django<1.10.

The advantage of explicit requirements is you get a repeatable environment. Especially if you’re also specifying dependencies of dependencies. The advantages of implicit requirements are readability and automatic security upgrades.

Here at TabbedOut, we’ve developed a technique that works very well, and I’d like to share it: use pip-tools to manage your requirements. You get the best of both worlds, at the expense of some extra boilerplate. Here’s how we do it (a small example follows the steps):

  1. Be in a virtualenv
  2. Use our Makefile boilerplate (see below)
  3. pip install pip-tools
  4. Write a “sloppy” requirements.txt using implicit requirements, but name it requirements.in
  5. Run make requirements.txt
  6. Check all this into your codebase
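To make that concrete, here’s what a hypothetical requirements.in might look like next to an abridged requirements.txt that pip-compile generates from it (package names and versions are only for illustration):

# requirements.in -- loose, human-maintained
Django<1.10
django-extensions
requests

# requirements.txt -- generated by `make requirements.txt`, fully pinned
Django==1.9.7
django-extensions==1.6.7
requests==2.10.0
six==1.10.0               # via django-extensions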

Advantages

  • requirements.in is easy to maintain
  • requirements.txt has pinned versions so your virtualenv matches your collaborators and production
  • You automatically get patches and security fixes when you run make requirements.txt, and there are no surprises because it goes through your code review process

Tips

  • Try to loosely pin requirements in your requirements.in. Though it doesn’t matter that much because you’ll catch it when you see a major version change in requirements.txt.
  • Specifying an exact version in requirements.in is an anti-pattern; if you have to do it, document why. Often it’s because there’s a bug or a backwards-incompatible change.

Makefile boilerplate

Here’s what a Makefile might contain:

help: ## Shows this help
	@echo "$$(grep -h '#\{2\}' $(MAKEFILE_LIST) | sed 's/: #\{2\} /	/' | column -t -s '	')"

install: ## Install requirements
	@[ -n "${VIRTUAL_ENV}" ] || (echo "ERROR: This should be run from a virtualenv" && exit 1)
	pip install -r requirements.txt

.PHONY: requirements.txt
requirements.txt: ## Regenerate requirements.txt
	pip-compile --upgrade --output-file $@ requirements.in
  • help: This is just a fast way of making your Makefile self-documenting.
  • install: Nowadays, you need Python and non-Python requirements. Putting it all in one make target makes it easier for developers to jump into a project.
  • PHONY: When you run make requirements.txt, you want it to run every time, not just when requirements.in changes, because new versions may have been uploaded to PyPI. I always group my PHONY with my target. Even though it adds more lines, your Makefile will be more maintainable because you’re not trying to keep an off-screen list up to date.
  • requirements.txt: Why make requirements.txt over make requirements? Because best practice dictates that if the output of a make target is a file, that file should also be the name of the target. That way, you can use the automatic variable $@ and it’s explicit, even at the cost of needing the PHONY.
  • --upgrade: Without this, pip-tools doesn’t actually upgrade your dependencies.
  • --output-file $@: pip-tools does this by default, but explicit is better than implicit. I would prefer to do pip-compile --upgrade requirements.in > $@ but pip-tools 1.6 does a poor job of dealing with stdout (see below).

Caveats

  • When you change requirements.in, you do have to remember to run make requirements.txt. You could automate that with a git hook or CI process, but in practice we’ve found that remembering to run it manually is fine.
  • pip-tools==1.6 does not work with the latest pip (8.1.2). See #358
  • pip-tools==1.6 has a poor understanding of how stdin and stdout are supposed to work. Hopefully this gets fixed soon, but it’s only a minor annoyance. #362 #360 #353 #104
  • The compilation step can depend on your platform. I’ve only noticed this with ipython, which needs packages for interacting with the terminal like gnureadline. It hasn’t been trouble for us, but it could be for you. A workaround is to run the process in a Docker container.

Sample Scenarios

If you need more convincing, here are some problems this approach solves for us:

I thought I was getting xyzpackage version 3, so why is version 2 getting installed? pip-tools flattens all your requirements and annotates which package specified what. So in requirements.txt, you’ll see xyzpackage==2.4    # via scumbagpackage and see that scumbagpackage was responsible.

What packages am I actually using? In a large project, your requirements.txt will balloon as you run into bugs and start pinning dependencies of dependencies. Then one day, you’ll realize you don’t know what packages you’re actually using. With a much simpler requirements.in, there’s less to sort through and fully pinned packages stick out like sore thumbs.

“It works for me.” Sometimes a project will work only for you. You check your installed versions against requirements.txt and they match. But what you didn’t realize is that a dependency of a dependency broke something. Since pip-tools freezes everything, you’ll have the same version of every package installed. And if something does break, you’ll have history to trace down what changed.


Apache Bench

For years, my tool for simple load tests of HTTP sites has been ApacheBench.

For years, my reference for how to visualize ApacheBench results has been Gnuplot.

For years, my reference for how to use Gnuplot has been http://www.bradlanders.com/2013/04/15/apache-bench-and-gnuplot-youre-probably-doing-it-wrong/

But do you really want to be writing Gnuplot syntax? It turns out that Pandas will give you great graphs pretty much for free:

import pandas as pd

# gaussian.tsv is ApacheBench's gnuplot/TSV output (ab's -g flag)
df = pd.read_table('../gaussian.tsv')

# The raw data as a scatterplot
df.plot(x='seconds', y='wait', kind='scatter')

# The traditional Gnuplot plot
df.plot(y='wait')

# Histogram
df.wait.hist(bins=20)

[Plots: a scatterplot of the raw data, the traditional wait-time line plot, and a histogram of the wait-time distribution]

You can see the full source code at tsv_processing.ipynb

And recreate these yourself by checking out the parent repo: github/crccheck/abba

So now you might be thinking: How do you get a web server that outputs a normal distribution of lag? Well, I wrote one! I made a tiny Express.js server that waits a random amount of time, packaged it in a Docker image, and you can see exactly how I ran these tests by checking out my Makefile.


Django management commands and verbosity


[update: This post has been corrected, thanks to my commenters for your feedback]

Every Django management command gets the verbosity option for free. You may recognize this:

optional arguments:
  -h, --help            show this help message and exit
  -v {0,1,2,3}, --verbosity {0,1,2,3}
                        Verbosity level; 0=minimal output, 1=normal output,
                        2=verbose output, 3=very verbose output

We rarely use it because supporting it usually means scattering if statements through our code. If you’re writing quick n’ dirty code, this may look familiar in your management commands:


if options.get('verbosity') == 3:
    print('hi')

In a recent Django project, I came up with a few lines of boilerplate to support the verbosity option, assuming you’re also using the logging library and not relying on print:


import logging

from django.core.management.base import BaseCommand


class Command(BaseCommand):
    def handle(self, *args, **options):
        verbosity = options.get('verbosity')
        if verbosity == 0:
            logging.getLogger('my_command').setLevel(logging.WARN)
        elif verbosity == 1:  # default
            logging.getLogger('my_command').setLevel(logging.INFO)
        elif verbosity > 1:
            logging.getLogger('my_command').setLevel(logging.DEBUG)
        if verbosity > 2:
            logging.getLogger().setLevel(logging.DEBUG)

github.com/texas/tx_mixed_beverages/blob/master/mixed_beverages/apps/receipts/management/commands/geocode.py

So what does this do?

At the default verbosity, 1, I display INFO logging statements from my command. Increasing verbosity to 2, I also display DEBUG logs from my command. And going all the way to verbosity 3, I also enable all logging statements that reach the root logger.
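The other half is ordinary logging calls inside the command module. Here’s a minimal sketch using the same hypothetical my_command logger name as the boilerplate above; it assumes your LOGGING settings route these loggers to a console handler:

import logging

logger = logging.getLogger('my_command')


def geocode_batch(addresses):
    for address in addresses:
        logger.debug('raw address: %r', address)  # shown at verbosity 2 and up
        # ... the real work happens here ...
        logger.info('geocoded %s', address)       # shown at the default verbosity (1)
    logger.warning('finished %d addresses', len(addresses))  # shown even at verbosity 0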

Go forth and log!


Patterns: don’t mess up the prod db!

With 12factor-style environment configs, it’s very easy to accidentally connect to your production database when you think you’re connecting to dev. Here’s a simple guard you can add to make sure your DATABASE_URL isn’t somehow pointed someplace it’s not supposed to be (assuming you’re using Amazon AWS):

bash:

if [[ ${DATABASE_URL} == *"amazonaws"* ]]; then exit -1; fi

Python:

if 'amazonaws' in os.environ['DATABASE_URL']:
    exit('Cannot be run against a production database')

Django:

if 'amazonaws' in settings.DATABASES['default']['HOST']:
    raise CommandError('Cannot be run against a production database')
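In a management command, the Django version of the guard fits naturally at the top of handle(). A minimal sketch with a hypothetical command:

from django.conf import settings
from django.core.management.base import BaseCommand, CommandError


class Command(BaseCommand):
    help = 'Destructive dev-only task (hypothetical example)'

    def handle(self, *args, **options):
        if 'amazonaws' in settings.DATABASES['default']['HOST']:
            raise CommandError('Cannot be run against a production database')
        # ... the destructive work goes here ...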

(thanks to x110dc for the bash syntax and 12factor link)



Autobuilding Sphinx Docs With inotify

So tonight I decided I’d been putting it off long enough, and started my first Sphinx documentation project. Read The Docs’s Getting Started guide wasn’t bad, but it really bothered me that there wasn’t a good way to autobuild the docs.

I ended up just adding this to the makefile:

# If you don't have inotify-tools, install it first: `apt-get install inotify-tools`
watch:
	@while true; do \
	inotifywait --recursive -e modify -e move -e create -e delete \
	--exclude $(BUILDDIR) . && $(MAKE) --silent html; done

This blog post by JKM is often cited as a way to do it too, but the install_requires section of watchdog’s setup.py was longer than I felt like sitting through.

There’s a promising project, sphinx-autobuild, but the readme currently says “sphinx-autobuild relies on a not-yet-released version of python-livereload” and it also uses watchdog; so forget that. If I end up spending a lot of time writing docs, it’s worth getting this up and running though because it comes with LiveReload.


Dissecting Elevators Part 8: deploying

If you examine the repo, you’ll see it’s a Django app, and the Procfile and requirements.txt would make you think the app itself was deployed on Heroku, but you would be wrong!

You may be surprised to find out that the Elevators Explorer is a static HTML app. Or maybe not if you read the previous seven parts. This was inspired by NPR Apps’s app-templates project and general malaise dealing with servers. At The Texas Tribune, we’ve talked about turning our data apps into static files; the general reasons being:

  1. Data apps are generally not dynamic. Making a server render the same output over and over is a waste of resources and takes away resources from rendering truly dynamic content.
  2. Static sites do not go down. When they do go down, you’ve got big problems.
  3. They don’t need fancy infrastructure. No database, no caching, no app server, no media server. They are the easiest kind of website to deploy.

So how do you turn a site served from a fancy framework like Django into a static site? Let’s start by looking at what I actually do in my terminal:

  1. I make sure my local dev server is running. For now that looks like python manage.py runserver 0.0.0.0:8000.
  2. In another window, I run make site; make upload. I can’t do make site upload right now because I need to handle a non-zero exit status in  make site.

make site

And now looking at the make site command in my Makefile, here’s the abridged version of that command:

site:
	cd site && wget -r localhost:8000 --force-html -e robots=off -nH -nv --max-redirect 0

And the detailed breakdown of that command:

  • cd site: this is the working directory I want the files downloaded to.
  • wget: this is the command that does the actual work. “GNU Wget is a free utility for non-interactive download of files from the Web.”
  • -r: recursive
  • localhost:8000: the host to download files from
  • --force-html: Without this, wget would not interpret the django trailing-slash urls as html
  • -e robots=off: Not really necessary, but does prevent a 404 request to robots.txt
  • -nH: Disable generation of host-prefixed directories so the file paths are cleaner
  • -nv: Turn off verbose output without being completely quiet, but not so quiet that I wouldn’t see errors
  • --max-redirect 0: Right now I’m using OpenDNS, and when I do hit a 404, this keeps me from downloading OpenDNS’s stupid search page.

If you’re interested in wget, I highly recommend skimming the wget manual to learn about other cool options and checking out commandlinefu’s wget recipes.

Levar Burton enjoys reading UNIX Man pages, so should you!

Out of the box, wget will do a very good job of getting everything needed to build a Django site, as long as you aren’t using AJAX to pull things in. But the Elevator Explorer does use AJAX. To trick wget into downloading those resources too, I added hints to the templates:

<script src="{{ STATIC_URL }}tx_elevators/js/search.js"></script>
<a class="prefetch-hint" href="/chart/search/data.json" rel="nofollow" style="display: none;"></a>

In the future, I think I’ll refine this by putting the anchor directly before the script tag and switching to the {% url %} templatetag; then you can get at that url with something like this jQuery pseudocode:

<a href="{% url 'my_data_source' %}"></a>
<script>
  var url = $('script:last').prev().attr('href'); $.ajax(url, ...);
</script>

This will hopefully be a decent solution to the common problem of trying to use named Django urls in JavaScript. The next problem I discovered is that I needed the json I sent to look like a file to wget so it wouldn’t mangle it to index.html (breaking the requests inside the static site). I just changed the url patterns to end in “.json$” instead of “/$” in 36f276.
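For illustration, the url pattern change amounts to something like this (hypothetical view name; the actual diff is in 36f276):

from django.conf.urls import url

from tx_elevators.chart_views import SearchData  # hypothetical view name

urlpatterns = [
    # before: url(r'^chart/search/data/$', SearchData.as_view()),
    url(r'^chart/search/data\.json$', SearchData.as_view(), name='search_data'),
]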

Another idea I might try is using the <link> tag, but I’d have to make sure wget still downloaded the resources, and you can only put them in <head>.

make upload

The next part of the process is to upload the site directory someplace. I chose to use S3 and its ability to host a static site. A regular webhost would also work, and might even work better if you had the ability to use rsync instead of what I had to do. But let’s continue and go into what exactly I did, starting with the make command:

upload:
	python $(PROJECT)/manage.py sync_s3 --dir site --gzip

The sync_s3 command here is a basic fork of django-extensions’s sync_media_s3 command. I only made one tweak to it so it’ll gzip json and html too. If you don’t need that, you can use the original sync_media_s3 command. To set up the credentials for S3, you can either read django-extensions’s docs or just try it; the error messages will tell you exactly what you need in order to fix whatever doesn’t work. I will point out that the source directory is set by the --dir flag, and the destination S3 bucket is controlled by the general configuration.

Performance

We know the performance of a static website is awesome, but the trade-off is that it’s slow to do the up-front work of generating a static website from a dynamic one. You may have noticed the terrible timing figures documented in the makefile. It takes over half an hour to crawl the site, and over three hours to upload to S3. I fiddled with a few knobs in a futile effort to speed things up, turning off debug and using gunicorn, but they didn’t help. There are some other ideas I pondered for improving performance:

  • It would be possible to write a Django-specific spider instead of wget that could read the url patterns (a rough sketch follows this list). This is what app-template does for flask if you look at its fab render task.
  • I could make the process multi-threaded. There are some alternatives to wget, but wget is good at what it does and is ubiquitous. I’d rather speed up the upload process. Github user billyvg did some work on making sync_media_s3 multithreaded in issue #161.
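To make the first idea concrete, here’s a rough sketch of a Django-specific renderer built on the test client (hypothetical paths and settings module; the project doesn’t actually include this):

import os

os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'tx_elevators.settings')  # assumed settings module

import django
django.setup()  # Django 1.7+; earlier versions configure settings on first use
from django.test import Client


def render_to_file(url_path, out_dir='site'):
    """Render one URL with the test client and mirror wget's file layout."""
    response = Client().get(url_path)
    assert response.status_code == 200, '%s returned %s' % (url_path, response.status_code)
    if url_path.endswith('.json'):
        filename = os.path.join(out_dir, url_path.lstrip('/'))
    else:
        filename = os.path.join(out_dir, url_path.lstrip('/'), 'index.html')
    if not os.path.exists(os.path.dirname(filename)):
        os.makedirs(os.path.dirname(filename))
    with open(filename, 'wb') as f:
        f.write(response.content)


# In practice you'd walk the urlconf or the database instead of hardcoding paths
for url_path in ['/', '/chart/search/data.json']:
    render_to_file(url_path)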

Other ideas that are probably bad ideas or not worth the effort:

  • Combine the download/upload steps
  • Prioritize pages so important pages are updated first

And for comparison, using rsync is so much faster it seems silly not to use it:

time rsync -avz site/ remote:elevators
sent 45200666 bytes  received 561995 bytes  92356.53 bytes/sec
total size is 123778877  speedup is 2.70

real    8m14.409s

Conclusion

For staging, I did deploy to Heroku. But I didn’t want to pay to keep a database online (this app exceeds the free database limit of 10,000 rows) and I didn’t know how much traffic I could serve from the free tier. The static site worked really well, except for the 4 hour deploys. I think it’s possible to get it to 2 hours, which is great for a 25,000 page site.


Dissecting Elevators Part 5: nosql circa 1986

Intro: This is part five of an eight part series looking at the Elevator Explorer, a fun data interactive mostly coded between the hours of 10 PM to 2 AM during the week leading up to April Fools’ Day, 2013. I’m going to be looking at the things I learned, things I wish I could have done, and the reasoning behind my design choices. The code I’ll be referring to will be in this tagged release on github.

I knew I wanted to geocode all the addresses for the buildings, but I didn’t quite know how my models would look. I knew from past experience that doing a pass of geocoding, then resetting the database, would mean I would have to start geocoding again from square one. How could I make this better?

If only I had a wrapper around geopy that would persist old queries to disk. So I started writing one. At first, I thought I would need to do this in sqlite, but after doing a search for “python+key+value+store”, I found anydbm. What is anydbm? Anydbm is a generic interface to any dbm database. What a name. In my case, it was using Berkeley DB. It’s really easy to use: 1) open a file, 2) treat it like a dict. Way easier than trying to get a sqlite database going. But my database kept getting corrupted! I finally figured out that I needed to open and close the file for every transaction. Since the anydbm library is pretty dated and I couldn’t use it like a context manager, I had to manually close the file.

My working version of the GoogleV3 geocoder looks like this. I also made a script for dumping my existing geo data back to an anydbm database; that’s viewable here.
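The core of the idea looks something like this rough sketch (not the actual geopydb code, which wraps this in a class with the same interface as geopy’s GoogleV3; Python 2-era names like anydbm):

import anydbm  # Python 2's generic dbm interface; dbm in Python 3
import json

from geopy import geocoders

_geocoder = geocoders.GoogleV3()


def geocode(address, dbfile='geocode_cache.db'):
    # Open and close the file around every transaction -- leaving it open
    # between calls is what kept corrupting the database.
    db = anydbm.open(dbfile, 'c')
    try:
        cached = db[address]
    except KeyError:
        cached = None
    finally:
        db.close()
    if cached is not None:
        place, (lat, lng) = json.loads(cached)
        return place, (lat, lng)
    place, (lat, lng) = _geocoder.geocode(address)
    db = anydbm.open(dbfile, 'c')
    try:
        db[address] = json.dumps([place, [lat, lng]])
    finally:
        db.close()
    return place, (lat, lng)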

So after all that, I ended up with a library that mimicked the GoogleV3 geocoder. To use it, instead of the standard syntax of:

>>> from geopy import geocoders
>>> g = geocoders.GoogleV3()
>>> place, (lat, lng) = g.geocode("10900 Euclid Ave in Cleveland")
>>> print "%s: %.5f, %.5f" % (place, lat, lng)
10900 Euclid Ave, Cleveland, OH 44106, USA: 41.50489, -81.61027

my database-cached version of that is:

>>> from geopydb import geocoders
>>> g = geocoders.GoogleV3()
>>> place, (lat, lng) = g.geocode("10900 Euclid Ave in Cleveland")
>>> print "%s: %.5f, %.5f" % (place, lat, lng)
10900 Euclid Ave, Cleveland, OH 44106, USA: 41.50489, -81.61027

Pretty convenient, and made my life easier. You may have noticed I’m not using GeoDjango. That’s because I wanted to deploy to the free tier at Heroku.

Improvements

If I had to write this now, I would switch to using dataset. Dataset came out around the same time as the Elevator Explorer. If it had been out a week earlier, I could have used it.
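For comparison, a rough sketch of the same cache on top of dataset (hypothetical table and column names):

import dataset
from geopy import geocoders

db = dataset.connect('sqlite:///geocode_cache.db')
table = db['geocode']
_geocoder = geocoders.GoogleV3()


def geocode(address):
    row = table.find_one(address=address)
    if row:
        return row['place'], (row['lat'], row['lng'])
    place, (lat, lng) = _geocoder.geocode(address)
    table.insert(dict(address=address, place=place, lat=lat, lng=lng))
    return place, (lat, lng)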


Dissecting Elevators part 4: Django Data Driven Documents

Intro: This is part four of an eight part series looking at the Elevator Explorer, a fun data interactive mostly coded between the hours of 10 PM to 2 AM during the week leading up to April Fools’ Day, 2013. I’m going to be looking at the things I learned, things I wish I could have done, and the reasoning behind my design choices. The code I’ll be referring to will be in this tagged release on github.

Hey everybody, come look at the D3

So first off, this is going to go over the sketchiest code in the project. It seemed to work well for Elevators, but I have no idea how well this would translate to other projects.

At the bottom of the landing page, there’s this chart:

[Chart: the distribution of age vs floors for almost every elevator in Texas.]

If you examine the source, you’ll see that it’s just a DIV with data attrs that the JavaScript uses to create an IFRAME version of the real chart full of D3 magic.

Responsive Iframes

So why use an iframe? Why not just load the chart directly into the container div? The iframe may cost us a bunch of extra HTTP requests and make it harder to integrate with the page*, but it also gives us an instant embeddable. But this is the era of responsive design, and iframes are even harder to make responsive than SVGs. My solution was to insert the iframe with JavaScript. I could have just operated directly on the IFRAME, but adding another layer (hooray, more complexity) also adds the ability to serve different content based on feature detection (noscript, no svg, mobile) in the future. Why didn’t I use a preexisting library like NPR’s responsiveiframe? They were all too complicated to use (e.g. I couldn’t get them to work in 5 minutes), and I didn’t have to worry about same-origin issues, so I could write simpler JavaScript:

https://github.com/texastribune/tx_elevators/blob/2013-april-fools/tx_elevators/static/tx_elevators/js/tx_elevators.js#L327-L363

With this technique, adding an interactive visualization is as easy as adding an image. How it works is you put markup into the page similar to how you would use a VIDEO tag:

<section>
  <h2>Distribution of Age and Floors of Elevators</h2>
  <div data-src="/chart/elevatorlist/">
    <noscript>
      Missing Figure
    </noscript>
  </div>
</section>

https://github.com/texastribune/tx_elevators/blob/2013-april-fools/tx_elevators/templates/tx_elevators/landing.html#L46-L50

And the JavaScript knows how to interpret that to insert the chart.

*Integration problems with iframes:

  • Tooltips can’t extend outside the bounds of the iframe.
  • Interactivity between the chart and other elements becomes difficult.

One chart, one view, one queryset

Doing iframes for charts also solves another headache: managing all the querysets that power each visualization. Powering the visualizations off a generic api is an admirable goal, but it falls flat in reality. And mixing all the querysets you need into your views makes for some massive views and maintenance headaches. Writing simple single-purpose views keeps the payload small and the code neat. You may end up with a maintenance problem of another kind, but I would rather have 10 simple small views than one unintelligible mammoth view. You can see what I ended up doing here:

https://github.com/texastribune/tx_elevators/blob/2013-april-fools/tx_elevators/chart_views.py#L9-L45

The same basic Django TemplateView handles both the chart HTML and the data needed for the visualization. Another feature is that the ajax endpoint for the chart is always at the same relative URL, so for the visualization at /chart/elevatorlist/, the data lives at /chart/elevatorlist/data.json. You don’t have to worry about moving information about your Django urlconf into JavaScript. And I’m foreshadowing here, but it’s at ./data.json and not ./data/ because it’s friendlier for wget. Another tip is to make sure you’re gzipping your json, especially for development. In my urls.py, you can see I’m using the gzip_page decorator, but you can also enable GZipMiddleware. This makes it much easier to know right away if your json is too large. For production, you should think about turning off gzipping in the application server and moving it up the stack.
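Here’s a rough sketch of that layout (hypothetical names and a simplified data view; the real chart_views.py differs):

import json

from django.conf.urls import url
from django.http import HttpResponse
from django.views.decorators.gzip import gzip_page
from django.views.generic import TemplateView

from tx_elevators.models import Building  # assumed model and field names


def elevatorlist_data(request):
    rows = list(Building.objects.values('year_installed', 'floors'))
    return HttpResponse(json.dumps(rows), content_type='application/json')


urlpatterns = [
    # The chart page and its data live at sibling URLs, so the JavaScript
    # can always find the data at ./data.json relative to the chart.
    url(r'^chart/elevatorlist/$',
        TemplateView.as_view(template_name='tx_elevators/chart/elevatorlist.html')),
    url(r'^chart/elevatorlist/data\.json$', gzip_page(elevatorlist_data)),
]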

Next steps

Like I said, this code was the sketchiest. There are many improvements to be made. Some big ones I can think of include:

  1. Making the chart loader JS capable of loading the chart directly into the container div instead of in an iframe.
  2. Adding a URL pattern helper for automatically setting up the routes for the chart template and data in one line.
  3. Making the API return an array of arrays instead of an array of objects to save bandwidth. It’s a little easier to throw into D3 this way too.

Live demo

Conclusion

So the title was a lie. I didn’t go into the D3 at all. Gotcha. If you’re curious, here it is. Next time, I’ll show you how a rarely used database from 1986 helped me geocode.