Tag Archives: makefiles


Autobuilding Sphinx Docs With inotify

So tonight I decided I’ve been putting it off long enough, and started my first Sphinx documentation project. Read the Docs’s Getting Started guide wasn’t bad, but it really bothered me that there wasn’t a good way to autobuild the docs.

I ended up just adding this to the makefile:

# If you don't have inotify-tools, install it first: `apt-get install inotify-tools`
watch:
    @while true; do \
    inotifywait --recursive -e modify -e move -e create -e delete \
    --exclude $(BUILDDIR) . && $(MAKE) --silent html; done

This blog post by JKM is often cited as a way to do it too, but the install_requires section of watchdog’s setup.py was longer than I felt like sitting through.

There’s a promising project, sphinx-autobuild, but the readme currently says “sphinx-autobuild relies on a not-yet-released version of python-livereload” and it also uses watchdog; so forget that. If I end up spending a lot of time writing docs, it’s worth getting this up and running though because it comes with LiveReload.


Dissecting Elevators Part 8: deploying

If you examine the repo, you’ll see it’s a Django app; and the Procfile and requirements.txt would make you think the app itself was deployed on Heroku, but you would be wrong!

You may be surprised to find out that the Elevators Explorer is a static HTML app. Or maybe not if you read the previous seven parts. This was inspired by NPR Apps’s app-templates project and general malaise dealing with servers. At The Texas Tribune, we’ve talked about turning our data apps into static files; the general reasons being:

  1. Data apps are generally not dynamic. Making a server render the same output over and over is a waste of resources and takes away resources from rendering truly dynamic content.
  2. Static sites do not go down. If they somehow do go down, you’ve got much bigger problems.
  3. They don’t need fancy infrastructure. No database, no caching, no app server, no media server. They are the easiest kind of website to deploy.

So how do you turn a site served from a fancy framework like Django into a static site? Let’s start by looking at what I actually do in my terminal:

  1. I make sure my local dev server is running. For now that looks like python manage.py runserver 0.0.0.0:8000.
  2. In another window, I run make site; make upload. I can’t do make site upload right now because I need to handle a non-zero exit status in  make site.

make site

Now, looking at the make site command in my Makefile, here’s the abridged version:

site:
    cd site && wget -r localhost:8000 --force-html -e robots=off -nH -nv --max-redirect 0

And the detailed breakdown of that command:

  • cd site: this is the working directory I want the files downloaded to.
  • wget: this is the command that does the actual work. “GNU Wget is a free utility for non-interactive download of files from the Web.”
  • -r: recursive
  • localhost:8000: the host to download files from
  • --force-html: Without this, wget would not interpret the Django trailing-slash URLs as HTML
  • -e robots=off: Not really necessary, but does prevent a 404 request to robots.txt
  • -nH: Disable generation of host-prefixed directories so the file paths are cleaner
  • -nv: Turn off verbose output without being completely quiet, so errors still show up
  • --max-redirect 0: Right now I’m using OpenDNS, and when I do hit a 404, this keeps me from downloading OpenDNS’s stupid search page.

If you’re interested in wget, I highly recommend skimming the wget manual to learn about other cool options and checking out commandlinefu’s wget recipes.

LeVar Burton enjoys reading UNIX man pages, and so should you!

Out of the box, wget will do a very good job of getting everything needed to build a Django site, as long as you aren’t using AJAX to pull things in. But the Elevator Explorer does use AJAX. To trick wget into downloading those resources, I added hints to the templates:

<script src="{{ STATIC_URL }}tx_elevators/js/search.js"></script>
<a class="prefetch-hint" href="/chart/search/data.json" rel="nofollow" style="display: none;"></a>

In the future, I think I’ll refine this by putting the anchor directly before the script tag and switching to the {% url %} templatetag; then you can get at that URL with something like this jQuery pseudocode:

<a href="{% url 'my_data_source' %}"></a>
<script>
  // grab the URL from the anchor immediately before this script tag
  var url = $('script').last().prev().attr('href');
  $.ajax(url, { /* success callback, etc. */ });
</script>

This will hopefully be a decent solution to the common problem of trying to use named Django URLs in JavaScript. The next problem I discovered is that I needed the JSON I served to look like a file to wget so it wouldn’t mangle it into index.html (breaking the requests inside the static site). I just changed the URL patterns to end in “.json$” instead of “/$” in 36f276.
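Here’s a minimal sketch of that kind of pattern change (the view and names below are made up; the real patterns live in the tx_elevators repo):

# hypothetical urls.py excerpt; the point is that the pattern ends in ".json"
# rather than a trailing slash, so wget saves it as a .json file
from django.conf.urls import url

from tx_elevators import views  # assumed module layout

urlpatterns = [
    # before: url(r'^chart/search/data/$', views.search_data, name='search_data'),
    url(r'^chart/search/data\.json$', views.search_data, name='search_data'),
]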

Another idea I might try is using the <link> tag, but I’d have to make sure wget still downloaded the resources, and you can only put them in <head>.

make upload

The next part of the process is to upload the site directory someplace. I chose to use S3 and its ability to host a static site. A regular webhost would also work, and might even work better if you had the ability to use rsync instead of what I had to do. But let’s continue and go into what exactly I did, starting with the make command:

upload:
    python $(PROJECT)/manage.py sync_s3 --dir site --gzip

The sync_s3 command here is a basic fork of django-extensions’s sync_media_s3 command. I only made one tweak so it’ll gzip JSON and HTML too. If you don’t need that, you can use the original sync_media_s3 command. To set up the credentials for S3, you can either read django-extensions’s docs or just try it; the error messages will tell you exactly what to fix when something doesn’t work. I will point out that the source directory is set by the --dir flag, while the destination S3 bucket is controlled by the general configuration.
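For reference, that general configuration boils down to a few settings. Here’s a sketch of what it might look like; check django-extensions’s docs for the exact setting names your version expects, and note the bucket name below is made up:

import os

# settings.py (excerpt): roughly the S3 configuration sync_s3 reads
AWS_ACCESS_KEY_ID = os.environ.get('AWS_ACCESS_KEY_ID', '')
AWS_SECRET_ACCESS_KEY = os.environ.get('AWS_SECRET_ACCESS_KEY', '')
AWS_BUCKET_NAME = 'elevators.example.com'  # hypothetical bucket name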

Performance

We know the performance of a static website is awesome, but the trade-off is that it’s slow to do the up-front work of generating a static site from a dynamic one. You may have noticed the terrible timing figures documented in the makefile: it takes over half an hour to crawl the site, and over three hours to upload to S3. I fiddled with a few knobs in a futile effort to speed things up (turning off debug and using gunicorn), but they didn’t help. There are some other ideas I pondered for improving performance:

  • It would be possible to write a Django-specific spider instead of wget that could read the url patterns. This is what app-template does for Flask if you look at its fab render task. A rough sketch of this idea follows this list.
  • I could make the process multi-threaded. There are some alternatives to wget, but wget is good at what it does and is ubiquitous. I’d rather speed up the upload process. Github user billyvg did some work on making sync_media_s3 multithreaded in issue #161.
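Here’s that Django-specific spider idea sketched with Django’s test client, rendering pages straight to disk (this is not the author’s code; the settings module and URL list are placeholders):

# render a known list of URLs through Django's test client and write the
# responses into the site/ directory, skipping wget entirely
import os

import django
from django.test import Client

os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'example_project.settings')  # placeholder
django.setup()  # Django >= 1.7


def render_to_static(urls, outdir='site'):
    client = Client()
    for url in urls:
        response = client.get(url)
        # "/" -> site/index.html, "/building/1/" -> site/building/1/index.html,
        # "/chart/search/data.json" -> site/chart/search/data.json
        rel = url.lstrip('/')
        if not rel or rel.endswith('/'):
            rel += 'index.html'
        path = os.path.join(outdir, rel)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, 'wb') as f:
            f.write(response.content)


if __name__ == '__main__':
    render_to_static(['/', '/chart/search/data.json'])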

Other ideas that are probably bad ideas or not worth the effort:

  • Combine the download/upload steps
  • Prioritize pages so important pages are updated first

And for comparison, using rsync is so much faster it seems silly not to use it:

time rsync -avz site/ remote:elevators
sent 45200666 bytes  received 561995 bytes  92356.53 bytes/sec
total size is 123778877  speedup is 2.70

real    8m14.409s

Conclusion

For staging, I did deploy to Heroku. But I didn’t want to pay to keep a database online (this app exceeds the free database limit of 10,000 rows) and I didn’t know how much traffic I could serve from the free tier. The static site worked really well, except for the 4 hour deploys. I think it’s possible to get it to 2 hours, which is great for a 25,000 page site.


Dissecting Elevators part 2: EZ Scrape

Intro: This is part two of an eight part series looking at the Elevator Explorer, a fun data interactive mostly coded between the hours of 10 PM to 2 AM during the week leading up to April Fools’ Day, 2013. I’m going to be looking at the things I learned, things I wish I could have done, and the reasoning behind my design choices. The code I’ll be referring to will be in this tagged release on github.

Scraping Data

Having worked on many data apps now, I know there’s always that initial excitement of figuring out how to import a dataset. But soon the months (or hours) pass, and you forget how to do the import. Well, that’s no good. Elevators was my latest iteration on doing simple, repeatable data updates. I say latest iteration because, while I’ve managed to improve, there’s plenty left to improve upon. For this post, I’ll cover my process up to the actual data import.

tx_elevators_data.git

Comic: Because of Reasons, by Ryan Pequin

If your source data is in plain text and has line breaks, you should keep it in source control. This is an extra step, but there are reasons:

  • There’s history. I can look at the previous state of the data and see when they fixed a typo.
  • You can quickly see what happened. Browsing the diff gives you insight into how the data changes.
  • Adds portability. If I had to pull 12 different data sources, I can tell a coworker to do one pull from my git remote to get data in exactly the same state I have it.

To do updates, I use Make. Mike Bostock has a good post on why, called Why Use Make. To do source control, I use git. You could host the data online at GitHub or Bitbucket, but that feels wrong. What I do instead is make a bare git repo in Dropbox and use it as my remote origin. To share the makefile, I put a copy of it in the main tx_elevators repo. You can see it at https://github.com/texastribune/tx_elevators/blob/2013-april-fools/data/Makefile.

Keeping the Diffs Clean

After my first update, it was apparent that I was getting the data back in no particular order and my diffs were going to be unusable. The first thing I tried was a sort:

mv $@ $@.orig
# print the header row (NR==1) as-is, pipe the remaining rows through sort
awk 'NR==1; NR > 1 {print $0 | "sort -n"}' $@.orig > $@

It worked, except I discovered that my data had one record with a line break that broke many things. So I turned to csvkit. The current iteration uses csvkit’s csvsort:

mv $@ $@.orig
cat $@.orig | csvsort > $@

It’s significantly slower, but not terrible, and it doesn’t break the data. For reference, the whole command only takes 31 seconds on my old machine:

$ time make clean elevator_data_file.csv
...snip...
real    0m30.641s
user    0m19.841s
sys     0m1.588s
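If you’re wondering why a csv-aware sort doesn’t choke on that embedded line break, here’s a rough pure-Python equivalent of the idea (a sketch, not csvkit’s actual implementation):

import csv

def csv_sort(infile, outfile):
    # the csv module parses quoted fields, so a record containing a newline
    # stays intact; everything after the header row gets sorted
    with open(infile, newline='') as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = sorted(reader)
    with open(outfile, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)

csv_sort('elevator_data_file.csv.orig', 'elevator_data_file.csv')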

Improvements

Maintaining two copies of the data importer’s makefile is annoying. I’m still trying to figure that one out, but I have an idea: I think I can adjust my make scrape command to assume there’s a ./tx_elevators_data directory alongside tx_elevators, and fall back to tx_elevators/data. I should also change the csv filenames, keeping elevator_data_file.csv as the original and making a new elevator_data_file.sorted.csv.

Conclusion and Next Time…

Once again, I’ve chosen to use Make to automate things. I was hoping to get into actually importing the data here, but then this post would be way too long. The gist of it is that tx_elevators’ makefile calls tx_elevators_data’s makefile to make sure it has fresh data before importing it. The entire process is condensed to one command: make scrape. In addition to making the process simple and repeatable, it’s also important for it to be easy to understand. Document everything as if you expect contributors, because you’re going to forget how everything worked.


Dissecting Elevators part 1: rapid models.py development

Intro: This is part one of an eight part series looking at the Elevator Explorer, a fun data interactive mostly coded between the hours of 10 PM to 2 AM during the week leading up to April Fools’ Day, 2013. I’m going to be looking at the things I learned, things I wish I could have done, and the reasoning behind my design choices. The code I’ll be referring to will be in this tagged release on github.

Rapid models.py development

Introduction

Every minute doing boilerplate at the beginning of a project is a minute you’re not developing. And the beginning of a project is when you really need that momentum to code ideas, not fiddle with settings.  This post goes into how I do my database boilerplate.

It is very common to use sqlite in early Django development because it’s so easy to migrate your database after editing your models: all you have to do is delete your old database file and run syncdb again. I even made a generic make command to find and delete all .sqlite files and then run syncdb, so I could reuse the same make resetdb command in every project. But what if you want to use another database like PostgreSQL? You could switch to using dropdb/createdb, but you may find it annoying to set up for every new project. Luckily, there is a generic way, thanks to django-extensions and DJ-Database-URL, both of which I include on every project. In my settings.py, I still use sqlite by default with:

import dj_database_url
# project_dir() is a local helper for building paths inside the project
DATABASES = {'default': dj_database_url.config(default='sqlite:///' +
    project_dir('example_project.sqlite'))}

but in my virtualenv’s postactivate (or in your .env file), I have:

export DATABASE_URL='postgres:///tx_elevators'

so it actually uses PostgreSQL. And by using django-extensions’s reset_db management command, the modifications to the makefile actually end up making things simpler:

# old version:
#   $(foreach db, $(wildcard $(PROJECT)/*.sqlite),\
#      rm $(db);)
#
# new version:
resetdb:
    python $(PROJECT)/manage.py reset_db --router=default --noinput
    python $(PROJECT)/manage.py syncdb --noinput

https://github.com/texastribune/tx_elevators/blob/2013-april-fools/Makefile#L22-L24

Just remember to put django-extensions in your installed apps, and make sure you’re using a recent version (>= 1.1.0), because earlier versions had a bug in reset_db.
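For completeness, that’s just one more entry in settings.py:

# settings.py (excerpt)
INSTALLED_APPS = (
    # ... the rest of your apps ...
    'django_extensions',
)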

Now after every model change, you can still use make resetdb to reset the db, no matter what database engine you use. Well… as long as that database engine is either sqlite, MySQL, or PostgreSQL.


Why not South?

South is a pain. Even if you script away the repetitiveness, it is an overly-complicated, blunt instrument designed for established projects, not for rapidly developing pre-alpha projects. And wiping away the database is a good thing. It means you can’t accumulate baggage in your data that you’ll never be able to recreate. Another benefit is that bootstrapping development on a new machine is a breeze, because you’ve made that process simple and repeatable. And if you do need to do a quick migration, my first choice is django-extensions’s sqldiff command. For example, let’s say I wanted to make the year_installed date null-able to indicate bad data, and to make Building.city a foreign key to a new City model. If I changed my original models.py to do that, the output of ./manage.py sqldiff tx_elevators becomes:

BEGIN;
-- Application: tx_elevators
-- Model: City
-- Table missing: tx_elevators_city
-- Model: Building
ALTER TABLE "tx_elevators_building"
    DROP COLUMN "city";
ALTER TABLE "tx_elevators_building"
    ADD "city_id" integer;
CREATE INDEX "tx_elevators_building_city_id_idx"
    ON "tx_elevators_building" ("city_id");
ALTER TABLE "tx_elevators_building"
    ALTER "city" TYPE integer;
ALTER TABLE "tx_elevators_building"
    ALTER COLUMN "city" SET NOT NULL;
-- Model: Elevator
ALTER TABLE "tx_elevators_elevator"
    ALTER COLUMN "year_installed" DROP NOT NULL;
COMMIT;

Which I can pipe into the database. You do have to know some SQL, because the SQL it produces is not always right, but it does get you 95% of the way there.
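For context, the model change that produces a diff like that looks roughly like this (field names and types are guessed from the SQL above; the real models live in the tx_elevators repo):

from django.db import models


class City(models.Model):
    # hypothetical new model split out of the old Building.city text field
    name = models.CharField(max_length=100)


class Building(models.Model):
    # was a plain CharField; now a foreign key to the new City model
    city = models.ForeignKey(City)  # pre-Django 2.0 style, no on_delete


class Elevator(models.Model):
    building = models.ForeignKey(Building)
    # null-able so bad source data can be stored as "unknown"
    year_installed = models.DateField(null=True, blank=True)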

Afterwards

Once you do a release, you should abandon this approach and switch to using South. You can still use make resetdb to blow everything away, but you should at least be providing migrations once your project is stable.

Next time…

I’ll go over how I made importing data a one-liner.