Category Archives: Best Practices


requirements.txt==2.0

There is a war going on: a war between those who say Python requirements should be explicit and those who say they should be implicit. Before I continue, I’m going to be talking about requirements.txt, not setup.py. The difference between explicit and implicit requirements comes down to whether the line says Django==1.9.7 or Django, respectively. Going deeper, you could also say that pinning dependencies of dependencies is explicit, and that you can loosely pin like Django<1.10.

The advantage of explicit requirements is that you get a repeatable environment, especially if you’re also pinning dependencies of dependencies. The advantages of implicit requirements are readability and automatic security upgrades.

Here at TabbedOut, we’ve developed a technique that works very well, and I’d like to share it: use pip-tools to manage your requirements. You get the best of both worlds, at the expense of some extra boilerplate. Here’s how we do it:

  1. Be in a virtualenv
  2. Use our Makefile boilerplate (see below)
  3. pip install pip-tools
  4. Write a “sloppy” requirements.txt using implicit requirements, but name it requirements.in (there’s an example after this list)
  5. Run make requirements.txt
  6. Check all this into your codebase
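
For example (package names and versions here are illustrative), a requirements.in might say:

celery<4.0
Django<1.10

and the compiled requirements.txt comes out fully pinned and annotated, roughly like:

amqp==1.4.9               # via kombu
anyjson==0.3.3            # via kombu
billiard==3.3.0.23        # via celery
celery==3.1.23
django==1.9.7
kombu==3.0.35             # via celery
pytz==2016.4              # via celery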

Advantages

  • requirements.in is easy to maintain
  • requirements.txt has pinned versions so your virtualenv matches your collaborators and production
  • You automatically get patches and security fixes when you run make requirements.txt, and there are no surprises because it goes through your code review process

Tips

  • Try to loosely pin requirements in your requirements.in, though it doesn’t matter that much: you’ll catch any major version change when you review the new requirements.txt.
  • Specifying an exact version in requirements.in is an anti-pattern, and when you have to do it, document why (see the example below). Often it’s because of a bug or a backwards-incompatible change.
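
When you do have to pin exactly, a comment in requirements.in keeps the reason from getting lost (hypothetical package and issue number):

# FIXME: 2.1 breaks our login flow, see upstream issue #456; unpin when fixed
somepackage==2.0.3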

Makefile boilerplate

Here’s what a Makefile might contain:

help: ## Shows this help
	@echo "$$(grep -h '#\{2\}' $(MAKEFILE_LIST) | sed 's/: #\{2\} /	/' | column -t -s '	')"

install: ## Install requirements
	@[ -n "${VIRTUAL_ENV}" ] || (echo "ERROR: This should be run from a virtualenv" && exit 1)
	pip install -r requirements.txt

.PHONY: requirements.txt
requirements.txt: ## Regenerate requirements.txt
	pip-compile --upgrade --output-file $@ requirements.in
  • help: This is just a fast way of making your Makefile self-documenting.
  • install: Nowadays, you need Python and non-Python requirements. Putting it all in one make target makes it easier for developers to jump into a project.
  • PHONY: When you run make requirements.txt, you want it to run every time, not just when requirements.in changes, because new versions may have been uploaded to PyPI. I always group my .PHONY with its target. Even though it adds more lines, your Makefile will be more maintainable because you’re not trying to keep an off-screen list up to date.
  • requirements.txt: Why make requirements.txt over make requirements? Because best practice dictates that if the output of a make target is a file, that file should also be the name of the target. That way, you can use the automatic variable $@ and it’s explicit, even at the cost of needing the PHONY.
  • --upgrade: Without this, pip-tools doesn’t actually upgrade your dependencies.
  • --output-file $@: pip-tools does this by default, but explicit is better than implicit. I would prefer to do pip-compile --upgrade requirements.in > $@, but pip-tools 1.6 does a poor job of dealing with stdout (see below).

Caveats

  • When you change requirements.in, you do have to remember to re-run make requirements.txt, but you could automate that with a git hook or CI process. In practice, we’ve found that running make requirements.txt by hand is fine.
  • pip-tools==1.6 does not work with the latest pip (8.1.2). See #358
  • pip-tools==1.6 has a poor understanding of how stdin and stdout are supposed to work. Hopefully this gets fixed soon but is only a minor annoyance. #362 #360 #353 #104
  • The compilation step can depend on your platform. I’ve only noticed this with ipython, which needs packages for interacting with the terminal like gnureadline. It hasn’t been trouble for us, but it could be for you. A workaround is to run the process in a Docker container.

Sample Scenarios

If you need more convincing, here are some problems this approach solves for us:

I thought I was getting xyzpackage version 3, why is version 2 getting installed? pip-tools flattens all your requirements and annotates which package specified what. So in requirements.txt, you’ll see xyzpackage==2.4    # via scumbagpackage and know that scumbagpackage was responsible.

What packages am I actually using? In a large project, your requirements.txt will balloon as you run into bugs and start pinning dependencies of dependencies. Then one day, you’ll realize you don’t know what packages you’re actually using. With a much simpler requirements.in, there’s less to sort through and fully pinned packages stick out like sore thumbs.

It works for me. Sometimes a project will work only for you. You check your installed versions against requirements.txt and they match. But what you didn’t realize is that a dependency of a dependency broke something. Since pip-tools freezes everything, you’ll have the same version of every package installed. And if something does break, you’ll have the history to trace down what changed.


Patterns: don’t mess up the prod db!

With 12factor-style environment configs, it’s very easy to accidentally connect to your production database when you think you’re connecting to dev. Here’s a simple guard you can add to make sure your DATABASE_URL isn’t somehow pointed someplace it’s not supposed to be (assuming you’re using Amazon AWS):

bash:

if [[ ${DATABASE_URL} == *"amazonaws"* ]]; then exit 1; fi

Python:

import os

if 'amazonaws' in os.environ['DATABASE_URL']:
    exit('Cannot be run against a production database')

Django:

if 'amazonaws' in settings.DATABASES['default']['HOST']:
    raise CommandError('Cannot be run against a production database')

(thanks to x110dc for the bash syntax and 12factor link)
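
For the Django version, the guard typically lives at the top of a management command’s handle(). Here’s a minimal sketch (the command itself is hypothetical):

from django.conf import settings
from django.core.management.base import BaseCommand, CommandError


class Command(BaseCommand):
    help = 'Wipe and reload the dev database (hypothetical example)'

    def handle(self, *args, **options):
        # Refuse to run against anything that looks like a production RDS host
        if 'amazonaws' in settings.DATABASES['default']['HOST']:
            raise CommandError('Cannot be run against a production database')
        # ...the destructive work would go here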



Optimizing Docker Image Size For Real

I’ve come across tips on how to keep Docker images small and Dockerfiles with strange lines that seem to exist only to optimize image size. Well, it turns out they’re all wrong.

They may have an effect with flat Docker images, but for everything else (i.e. 99% of what people do), cleanup steps are just extra steps. When Docker builds an image from a Dockerfile, every step is a checkpoint, and every step is saved. If you add 100 MB in one step, then delete it in the next, that 100 MB still needs to be saved so other Dockerfiles with the same step can reuse it.
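
A contrived Dockerfile illustrates the point (the file and sizes are made up):

FROM ubuntu:14.04
# This step's layer grows the image by ~100 MB and is saved as a checkpoint...
RUN dd if=/dev/zero of=/tmp/bigfile bs=1M count=100
# ...so deleting the file in a separate step reclaims nothing
RUN rm /tmp/bigfile

Squash both commands into a single RUN and the intermediate 100 MB is never checkpointed, which is exactly the “combining multiple steps” result below.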

Results

REPOSITORY               TAG             IMAGE ID            CREATED             VIRTUAL SIZE
test/baseline            latest          7b590dec9b43        7 hours ago         272.6 MB
test/baseline_lines      latest          e165025980f7        9 minutes ago       272.6 MB
test/baseline_lists      latest          b40f9e108a93        About an hour ago   272.6 MB
test/combo               latest          744b502e0052        2 seconds ago       269.8 MB
test/combo2              latest          be8f1c1de02e        About an hour ago   249.8 MB
test/combo3              latest          da948e2838d9        About an hour ago   249.8 MB
test/install             latest          e7cadcbb5a05        12 hours ago        269.8 MB
test/install_clean       latest          dd1383285e85        12 hours ago        269.8 MB
test/install_lists       latest          e55f6f8ebac8        12 hours ago        269.8 MB
test/purge               latest          ef8c2aa7400b        About an hour ago   273.5 MB
test/remove              latest          75e3e5c4e246        About an hour ago   273.5 MB

Hypothesis: Docker’s base Ubuntu image does not need `apt-get clean`

I did an experiment around Docker 0.6, and I think my conclusion was that `apt-get install … && apt-get clean` saved a few megabytes. But I had heard that you didn’t need to do that. If you compare the “test/install” and “test/install_clean” sizes, you’ll see there is no difference. So you don’t need `apt-get clean`.

Hypothesis: `rm -rf /var/lib/apt/lists/*` saves some space

I’ve been seeing a lot of Dockerfiles lately with this line, including lots of official Docker images. If those guys are all doing it, surely it must have some effect. Nope.

Hypothesis: Combining similar lines saves space

There’s some overhead for each line in a Dockerfile, but how significant is it? It turns out it’s not significant at all. What I did find, though, is that combining lines saves a significant amount of time and a lot of disk thrashing. So combining lines does not save space, but it does save time.

Hypothesis: Combining multiple steps saves space

This makes sense. If you skip making checkpoints, you’re not storing intermediate states. And it turns out this is the only way to get a Docker image made from a Dockerfile smaller. But this is at the cost of readability, and more importantly, at the cost of reduced redundancy between images.

Hypothesis: `apt-get purge` saves some space

This hypothesis seems silly now, but I see it used now and then. Deletions do not save space.

Conclusion

Write your Dockerfiles the same way you run commands. Don’t prematurely optimize by adding extra cruft you saw someone else do. If you’re actually worried about image size, use some sort of automation to rebuild Docker images behind the scenes. Just keep that logic out of the Dockerfile. And always keep on measuring. Know your bottlenecks.


Managing Technical Debt in Django Projects


I’ve been thinking about this subject a lot, and I’ve been meaning to write something. Rather than procrastinate until I have a lot of material, I’m going to just continuously edit this post as I discover things. Many of these principles aren’t specific to Django, but most of this experience comes from dealing with Django.

Some of these tips don’t cost any time, but some involve investing extra time to do things differently. It’s in the name of saving time in the long run. If you’re writing a few lines of JavaScript that’s going to be thrown away in a day, then you shouldn’t waste time building up a castle of tests around it. Managing technical debt is a design tradeoff. You’re sacrificing some agility and features for developer happiness.

Don’t reuse app names and model names

You can have Django apps with the same name, and you can have models with the same name, but your life will be easier if you can reference a model without having to think about which app it came from. This makes it easier to understand the code, easier to use tools like grep to analyze and search, and easier to use shell_plus, which automatically loads every model for you in the shell.

Leave XXX comments

You should leave yourself task comments in your code, and you should have three levels (like critical, error, warning or high, medium, low) for the priority. I commonly use FIXME for problems, TODO for todos, DELETEME for things that really should be deleted, DEPRECATED for things I really ought to look at later, WISHLIST for nice-to-haves, and XXX for documenting code smells and anti-patterns. Some examples:

  • FIXME this will break if user is anonymous
  • TODO make this work in IE9
  • DELETEME This code block is impossible to reach anyways
  • DEPRECATED use new_thing.method instead of this
  • WISHLIST make this work in IE8
  • XXX magic constant
  • FIXME DELETEME this needs csrf

The comment should be on the same line as the tag, so when you grep for TODO, you’ll be able to quickly scan what kind of todos you have. This is what other tools like the Jenkins Task Scanner plugin expect too. Many people say you shouldn’t add TODO comments to your code, you should just do them. In practice, that isn’t realistic, and it leads to huge diffs that are hard to review.
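
Since the tags are all-caps and everything is on one line, a quick git grep gives you a scannable task list (the file names and matches here are made up):

$ git grep -nE 'FIXME|TODO|XXX'
elevators/views.py:42:    # FIXME this will break if user is anonymous
elevators/forms.py:7:    # XXX magic constant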

There is such a thing as too many comments. For example, instead of writing a comment to explain a poorly named variable, you should just rename the variable.

Naming Things

Naming is one of Phil Karlton’s famous two hard things, and finding the right names can make a huge difference. With Django code, I find that I’m happiest when I’m following the same naming conventions as Django core. You should be familiar with the concepts of Hungarian notation and coding without comments (remember the previous paragraph?).

Single-letter variables are almost always a bad idea, except in simple code. They’re un-greppable and, except for the following, have no meaning:

  • i (prefer idx) — a counter
  • x — When iterating through a loop
  • k, v — Key/Value when looping through dict items
  • n — a count/total/summation (think traditional for-loop)
  • a, b — When looking at two elements, like in a reduce function (very rare)

An exception is math, where x and y, and m and n, etc. are commonplace.

If the name of your variable implies a type, it should be that type. You would not believe how often the name of a variable lies.

Code should read close to English. Getter functions should begin with “get_”. Booleans and functions that return booleans commonly begin with “is_” (though anything that is readable and obvious will work).
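
Here’s a quick sketch of what those conventions look like together:

class Invoice(object):
    def __init__(self, total, paid):
        self.total = total    # the name implies a number, so it should always be one
        self.is_paid = paid   # a boolean, named so it reads as a yes/no

    def get_balance(self):
        """Getter functions begin with get_."""
        return 0 if self.is_paid else self.total


invoice = Invoice(total=100, paid=False)
if not invoice.is_paid:  # reads close to English
    print(invoice.get_balance())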

Write tests

My current philosophy is “if you liked it then you should have put a test on it”. The worst part of technical debt is accidentally breaking things. To its credit, Django is the best framework I’ve used for testing. Unit tests are good for TDD, but functional tests are probably better for managing technical debt because they verify the output of your system for various inputs. Doing TDD, getting 100% coverage, and taking into account edge cases… never happens in practice. That does not mean you shouldn’t try. Adding tests to prevent regressions is the second-best thing you can do. And the best thing you can do is to write those tests to begin with.

Get coverage

Running coverage is commonly done at the same time as tests. I skip the HTML reports and use coverage report from the command line to get faster feedback. When you have good coverage, you can have higher confidence in your tests.
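
With coverage.py and a standard Django layout, that workflow is just two commands:

$ coverage run ./manage.py test
$ coverage report   # text summary; skip `coverage html` for faster feedback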

Be cognizant of when you’re creating technical debt

Of course, every line of code is technical debt, but I’ve started adding a “Technical Debt Note” to all pull requests. This is inspired by how legislation gets a fiscal note to assess how much it would cost. Bills can get shot down because they cost too much for what they promise. Features should be the same. Hopefully, you’re already catching these before you even write code, and you’re writing small pull requests. If we find that a pull request increases technical debt to an unreasonable degree, we revise the requirements and the code.

Clean as you cook

Most people dread cooking because of the mountain of mess that has to get cleaned after the meal. But if you can master cleaning as you cook, there’s a much more reasonable and manageable mess. As you experiment with code, don’t leave behind unused code; clean up inconsistencies as you go. Don’t worry about deleting something that might be useful or breaking something obscure; that’s what source control and tests are there for. Plan ahead for the full life cycle. That means if you’re experimenting with a concept, don’t stop when it works: update everything else to be consistent across the project. If it didn’t work, tear it down and kill it. Don’t get into a situation where you have to support two (or three or four or more) different paradigms.

The Boy Scout rule

“Always leave the campground cleaner than you found it.” Feel free to break the paradigm that a pull request should only do one thing. If you happen to clean something up while working on a feature, there’s no shame in saying you took a little diversion.

Make it easy for others to jump in

Projects with a complicated bootstrapping process are probably also difficult to maintain. Wipe your virtualenv once in a while. Wipe your database. If it’s painful and you have to do it often enough, you’ll make it better. Code that doesn’t get touched often has the most technical debt.


Educate your organization that they can’t just ask for a parade of features

This problem fixes itself, one way or another: either you keep building features and technical debt until you’re buried, everything comes to a standstill, and you yell at each other, or you find a way to balance adding features against paying down debt.

Prevent Dependency Spaghetti

Just as you should try to avoid spaghetti code, having a lot of third-party apps that pull in their own dependencies will come back to bite you later on.

  1. Specify requirements with ==, not >=. Not every package uses semantic versioning. And using semver does not guarantee that a minor or bugfix release won’t break something.
  2. Don’t specify requirements of requirements. This is to avoid an explosion of requirements to keep track of.
  3. There’s no easy way to know when it’s safe to delete a requirement. Even if you have good test coverage, your test environment is not the same as production. For example, you can safely delete psycopg2 and still run your tests, but have fun trying to connect to your PostgreSQL database in production.

Don’t support multiple paths

If you’re writing a library consumed by many people, supporting both get_queryset and get_query_set is a good idea. For yourself, only support one thing. If you have an internal library that’s used by multiple parts of the code base and you want to add functionality while preserving backwards compatibility, you can write a compatibility layer, but then you should update everything within the same pull request, or at least create an issue to clean it up. Supporting multiple code paths is technical debt.

Avoid Customizing the Admin

The moment you start writing customizations for the admin, you’ve pinned yourself to whatever the admin happened to be doing in that version. Unless you’re running automated browser tests to verify their functionality, you’re setting yourself up for things to break in the future. The Django admin changes in major ways with every version, and admin customizations always have weak test coverage.

Do Code Review and Pair Programming

Code review makes sure that more than one person’s input goes into a feature, and pair programming takes that even further. It helps make sure that crazy functionality and hard-to-read code don’t get into the main code base that others will then have to maintain. If you’re a team of one, do pull requests anyway. You’ll be amazed at all the mistakes and inconsistencies you’ll find when you view your feature all at once. Even better: sleep on your own pull requests so you can see them in a new light.

Don’t Write Unreadable Code

Code review and pair programming are supposed to keep you from writing unmaintainable code. We’ve embraced linters so that we can write code in the same style, and coverage so we know when we need to write tests, but what if we could automatically know when we were writing complicated code? We can, using radon or PyMetrics to measure McCabe’s cyclomatic complexity.
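
For example, radon can rank every function by McCabe complexity from the command line (the path and output here are illustrative):

$ radon cc -s tx_elevators/views.py
tx_elevators/views.py
    F 12:0 search - C (18)
    F 45:0 detail - A (3)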

Additional Resources

  • http://youtu.be/SaHbtEeu37M Docker and DevOps by Gene Kim, Dockercon ’14
  • Inheriting a Sloppy Codebase by Casey Kinsey, DjangoCon US ’14

Special thanks to Noah S.


Dissecting Elevators part 3: Wherein data is actually imported

Intro: This is part three of an eight-part series looking at the Elevator Explorer, a fun data interactive mostly coded between the hours of 10 PM and 2 AM during the week leading up to April Fools’ Day, 2013. I’m going to be looking at the things I learned, things I wish I could have done, and the reasoning behind my design choices. The code I’ll be referring to is in this tagged release on github.

Consuming Data

[Image: Python consuming data. Also an accurate depiction of the author.]

Now that you’ve got some juicy government data, how about getting it into a database? Having done this quite a few times before, I have some tips and tricks I’d like to share.

Make your models mimic the data

I’ve gone back and forth on this, and I’ve settled on making my Django models look like the source data. The reason most people do this is:

  1. It’s easy.

For Texas higher education data, I tried making the Django models an API to the data, but that proved to be a disaster. Writing the importers took longer, and when collaborators tried incorporating new data sources, they ended up creating their own models instead of using the existing models; and we ended up in a worse state than if we had just made the models mimic the data. The technique I’m trying now is making models that strongly resemble the source data, and then creating additional models to hold denormalized and supplemental data. My co-worker, Noah, has gotten a little farther with this idea and hopefully we’ll be able to unveil it soon. One benefit of this technique is the ability to wipe and re-import data while keeping supplemental data intact. For example, if the source data was dirty and said there was a 2000 floor elevator in Lockhart, but a human corrects it, the curated data is preserved even between import cycles.

It should be obvious that I modeled my models.py off the CSV, because I even documented the source field names:

class Building(models.Model):
    """Some place with an address that holds `Elevator` object(s)."""
    # LICNO
    elbi = models.IntegerField(u'Building Number', unique=True)
    # BNAME1
    name_1 = models.CharField(max_length=100)
    # BNAME2
    name_2 = models.CharField(max_length=100)
    # BADDR1
    address_1 = models.CharField(max_length=100)
    # BADDR2
    address_2 = models.CharField(max_length=100)
    # BCITY
    city = models.CharField(max_length=50)
    # BZIP
    zip_code = models.CharField(max_length=5)
    # BCOUNTY
    county = models.CharField(max_length=20)
    # ONAME1
    owner = models.CharField(max_length=100)
    # CNAME1
    contact = models.CharField(max_length=100)

Import all the data!

Since the models and the data are so tightly coupled, the importer is basically just a simple mapping of the CSV fieldnames to the model fieldnames. There’s one technique I’m playing with that I’m liking a lot. It goes like this:

  1. Iterate over the csv rows.
  2. Separate your row data into fields that are unique, and fields that aren’t.
  3. Put the non-unique data into a default_data dict.
  4. Do a get_or_create(id=row['id'], defaults=default_data).
  5. If the object wasn’t created, update it like obj.__dict__.update(default_data).
  6. Except don’t actually do the update that way: only save the changed fields, and only if you have to. I wrote some helpers to do this:
    def setfield(obj, fieldname, value):
        """Fancy setattr with debugging."""
        old = getattr(obj, fieldname)
        if str(old) != str(value):
            setattr(obj, fieldname, value)
            if not hasattr(obj, '_is_dirty'):
                obj._is_dirty = []
            obj._is_dirty.append("%s %s->%s" % (fieldname, old, value))
    
    def update(obj, data):
        """
        Fancy way to update `obj` with `data` dict.
    
        Returns True if data changed and was saved.
        """
        for key, value in data.items():
            setfield(obj, key, value)
        if getattr(obj, '_is_dirty', None):
            logger.debug(obj._is_dirty)
            obj.save()
            del obj._is_dirty
            return True
    
  7. So instead of obj.__dict__.update(default_data), I do update(obj, default_data) (see the loop sketch after this list).
  8. Django 1.5 supports saving a subset of a model’s fields. It would be trivial to adapt the code to take advantage of this.
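
Putting steps 1 through 7 together, the importer’s inner loop looks roughly like this (a sketch; the import path and the field subset are assumptions based on the model above):

import csv

from tx_elevators.models import Building


def import_buildings(path):
    with open(path) as f:
        for row in csv.DictReader(f):
            # non-unique fields go into defaults
            default_data = {
                'name_1': row['BNAME1'],
                'city': row['BCITY'],
                # ...and so on for the rest of the mapping
            }
            obj, created = Building.objects.get_or_create(
                elbi=row['LICNO'], defaults=default_data)
            if not created:
                # `update` is the helper above; it only saves if a field changed
                update(obj, default_data)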

This technique is slow, but it’s the same speed as a typical Django data import. If speed is an issue, you should truncate your tables and do bulk inserts.

Log all the things!

You may have noticed the log statement in there. Every time I do an import, I get a stream of how the data is changing from my logger:

['last_inspection 2011-11-01->2012-11-27']
['last_inspection 2012-02-29->2013-02-13', u'drive_type UNKNOWN->HYDRAULIC']
['year_installed 2008->2007']
[u'equipment_type ESCALATOR->PASSENGER', 'year_installed 2008->2007']
[u'equipment_type NEW EQUIPMENT->PASSENGER', 'last_inspection None->2013-02-21', 'floors 0->2', u'drive_type ELEVATOR DRIVE TYPE->HYDRAULIC', 'year_installed 0->1982']

And it’s just Python logging: I can change the verbosity and where it gets sent. The best part is that it helps identify dirty data. If you do an import with the same data twice in a row, updates you see in the second pass mean there’s some duplicate data in the source, because the same changes will keep getting made.

The terrible thing about logging is that, by default, it has ugly output. I have a solution for that: in my settings.py logging configuration, I have a custom handler that does pretty colorized output. The handler is based on this gist by Vinay Sajip. I simplified it by stripping out Windows support, and I include it in all my projects now.

Conclusion

The elevator data set was easy to work with; there are fewer than 50,000 entries and only one CSV file. The techniques I’ve learned working with it will definitely translate to bigger projects. The create-or-update code could use a lot of work, but I’ve already used it on more complicated data importers elsewhere, with DateFields and DecimalFields and TextFields (oh my!). And just in case you’re wondering, there is a make command for grabbing the latest data and importing it. Next time, I’m going to go over how I got the data out to do a visualization.


Dissecting Elevators part 2: EZ Scrape

Intro: This is part two of an eight-part series looking at the Elevator Explorer, a fun data interactive mostly coded between the hours of 10 PM and 2 AM during the week leading up to April Fools’ Day, 2013. I’m going to be looking at the things I learned, things I wish I could have done, and the reasoning behind my design choices. The code I’ll be referring to is in this tagged release on github.

Scraping Data

Having worked on many data apps now, I know there’s always that initial excitement of figuring out how to import a dataset. But soon the months (or hours) pass, and you forget how to do the import. Well, that’s no good. Elevators was my latest iteration on doing simple, repeatable data updates. I say latest iteration because, while I’ve managed to improve, there’s plenty left to improve upon. For this post, I’ll cover my process up to the actual data import.

tx_elevators_data.git

[Comic: “Because of Reasons” by Ryan Pequin]

If your source data is in plain text and has line breaks, you should keep it in source control. This is an extra step, but there are reasons:

  • There’s history. I can look at the previous state of the data and see when they fixed a typo.
  • You can quickly see what happened. Browsing the diff gives you insight into how the data changes.
  • It adds portability. If I had to pull 12 different data sources, I could tell a coworker to do one pull from my git remote to get the data in exactly the same state I have it.

To do updates, I use Make. Mike Bostock has a good post on why to use Make, called Why Use Make. To do source control, I use git. You could host the data online at github or bitbucket, but that feels wrong. What I do instead is make a bare git repo in Dropbox as my remote origin. To share the makefile, I put a copy of it in the main tx_elevators repo. You can see it at https://github.com/texastribune/tx_elevators/blob/2013-april-fools/data/Makefile.

Keeping the Diffs Clean

After my first update, it was apparent that I was getting the data back in no particular order, and my diffs were going to be unusable. The first thing I tried was a sort:

mv $@ $@.orig
# pass the header row through untouched; sort everything after it
awk 'NR==1; NR > 1 {print $0 | "sort -n"}' $@.orig > $@

It worked, except I discovered that my data had one record with a line break that broke many things. So I turned to csvkit. The current iteration uses csvkit’s csvsort:

mv $@ $@.orig
cat $@.orig | csvsort > $@

It’s significantly slower, but not terrible, and it doesn’t break the data. For reference, the whole command only takes 31 seconds on my old machine:

$ time make clean elevator_data_file.csv
...snip...
real    0m30.641s
user    0m19.841s
sys     0m1.588s

Improvements

Maintaining two copies of the data importer’s makefile is annoying. I’m still trying to figure that one out, but I have an idea: I think I can adjust my make scrape command to assume there’s a ./tx_elevators_data directory alongside tx_elevators, and fall back to tx_elevators/data. I should also change the CSV filenames, keeping elevator_data_file.csv as the original and making a new elevator_data_file.sorted.csv.

Conclusion and Next Time…

Once again, I’ve chosen to use Make to automate things. I was hoping to go into actually importing the data here, but then this post would be way too long. The gist of it is that tx_elevators’s makefile calls tx_elevators_data’s makefile to make sure it has fresh data before importing it. The entire process is condensed to one command: make scrape. In addition to making the process simple and repeatable, it’s also important for it to be easy to understand. Document everything as if you expect contributors, because you’re going to forget how everything worked.


Dissecting Elevators part 1: rapid models.py development

Intro: This is part one of an eight-part series looking at the Elevator Explorer, a fun data interactive mostly coded between the hours of 10 PM and 2 AM during the week leading up to April Fools’ Day, 2013. I’m going to be looking at the things I learned, things I wish I could have done, and the reasoning behind my design choices. The code I’ll be referring to is in this tagged release on github.

Rapid models.py development

Introduction

Every minute doing boilerplate at the beginning of a project is a minute you’re not developing. And the beginning of a project is when you really need that momentum to code ideas, not fiddle with settings. This post goes into how I do my database boilerplate.

It is very common to use sqlite in early Django development because it’s so easy to migrate your database after editing your models: all you have to do is delete your old database file and run syncdb again. I even made a generic make command to find and delete all .sqlite files and then run syncdb, so I could reuse the same make resetdb command in every project. But what if you want to use another database, like PostgreSQL? You could port to using dropdb/createdb, but you may find it annoying to set up for every new project. Luckily, there is a generic way, thanks to django-extensions and dj-database-url, both of which I include in every project. In my settings.py, I still use sqlite by default with:

DATABASES = {'default': dj_database_url.config(default='sqlite:///' +
    project_dir('example_project.sqlite'))}

but in my virtualenv’s postactivate (or in your .env file), I have:

export DATABASE_URL='postgres:///tx_elevators'

so it actually uses PostgreSQL. And by using django-extensions’s reset_db management command, the modifications to the makefile actually end up making things simpler:

# old version:
#   $(foreach db, $(wildcard $(PROJECT)/*.sqlite),\
#      rm $(db);)
#
# new version:
resetdb:
    python $(PROJECT)/manage.py reset_db --router=default --noinput
    python $(PROJECT)/manage.py syncdb --noinput

https://github.com/texastribune/tx_elevators/blob/2013-april-fools/Makefile#L22-L24

Just remember to put django-extensions in your INSTALLED_APPS, and make sure you’re using a recent version of django-extensions (>= 1.1.0), because there was a bug in reset_db before then.

Now after every model change, you can still use make resetdb to reset the db, no matter what database engine you use. Well… as long as that database engine is either sqlite, MySQL, or PostgreSQL.


Why not South?

South is a pain. Even if you script away the repetitiveness, it is an overly complicated, blunt instrument designed for established projects, not for rapidly developing pre-alpha projects. And wiping away the database is a good thing: it means you can’t accumulate baggage in your data that you’ll never be able to recreate. Another benefit is that bootstrapping development on a new machine is a breeze, because you’ve made that process simple and repeatable. And if you do need to do a quick migration, my first choice is django-extensions’s sqldiff command. For example, let’s say I wanted to make the year_installed date nullable to indicate bad data, and to make Building.city a foreign key to a new City model. If I changed my original models.py to do that, the output of ./manage.py sqldiff tx_elevators becomes:

BEGIN;
-- Application: tx_elevators
-- Model: City
-- Table missing: tx_elevators_city
-- Model: Building
ALTER TABLE "tx_elevators_building"
    DROP COLUMN "city";
ALTER TABLE "tx_elevators_building"
    ADD "city_id" integer;
CREATE INDEX "tx_elevators_building_city_id_idx"
    ON "tx_elevators_building" ("city_id");
ALTER TABLE "tx_elevators_building"
    ALTER "city" TYPE integer;
ALTER TABLE "tx_elevators_building"
    ALTER COLUMN "city" SET NOT NULL;
-- Model: Elevator
ALTER TABLE "tx_elevators_elevator"
    ALTER COLUMN "year_installed" DROP NOT NULL;
COMMIT;

I can pipe that into the database. You do have to know some SQL, because the SQL it produces is not always right, but it gets you 95% of the way there.
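
In practice, “pipe into the database” can be as simple as this (assuming PostgreSQL and the psql client), though per the caveat above you’ll want to read the SQL before running it:

./manage.py sqldiff tx_elevators | psql tx_elevators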

Afterwards

Once you do a release, you should abandon this and switch to using South. You can still use make resetdb and blow everything away, but you should at least be providing migrations once your project is stable.

Next time…

I’ll go over how I made importing data a one-liner.


The MAIN thing about Django templates

For about a year now, I’ve been harping on using {% block main %}, and today I got affirmation that it’s the right thing to do. I saw an article about it in A List Apart’s feed and started investigating.

So it’s looking likely that we’ll have a <main> tag soon. I’ll skip over the specifics because you’re probably already using <div class="main">. It is one of the most common elements, even more so now that .nav, .header, and .footer have become <nav>, <header>, and <footer>.

The important thing to remember is that there can be only one <main> tag on a page. And it just so happens that you can only have one {% block main %} in a Django template (barring some trickery). So how do you use it? The obvious way is something like:

{% extends "site_base.html" %}
{% comment %} Don't actually use main like this {% endcomment %}
{% block main %}
  <main>
    {# Your content #}
  </main>
{% endblock %}

But not so fast! Django templates can be divided into three categories: base/layout templates, detail/leaf templates, and include/shared/component templates (which I’ll ignore for the rest of this post). So here’s what your templates should look like. The base, layouts/base.html:

{% extends "site_base.html" %}
{% block body %}
  {% block main_container %}
    <main role="main">
      {% block main %}{% endblock %}
    </main>
  {% endblock %}
{% endblock %}

And the leaf:

{% extends "layouts/base.html" %}
{% block main %}
  {{ object }}      
{% endblock %}

So why have a block immediately surrounding the <main>? Well, you’re going to find it hard to make variations of base templates without it. To extend an existing base, override {% block main_container %} with the new structure and re-provide {% block main %}, like this:

{% extends "layouts/base.html" %}
{% block body %}
  {% block main_container %}
    <div class="row">
      <div class="span8">
        <main role="main">
          {% block main %}{% endblock %}
        </main>
      </div>
      <div class="span4">
        <aside>
          {% block rail %}{% endblock %}
        </aside>
      </div>
    </div>
  {% endblock %}
{% endblock %}

Now you’re guaranteed to have your main content rendered no matter which base you extend because every base template provides a main block.

And finally, don’t actually use <main> yet, since it’s only a draft spec. Use <div class="main" role="main"> instead.