{"id":291,"date":"2013-04-12T23:47:29","date_gmt":"2013-04-13T05:47:29","guid":{"rendered":"http:\/\/www.crccheck.com\/blog\/?p=291"},"modified":"2015-10-11T22:23:29","modified_gmt":"2015-10-12T04:23:29","slug":"dissecting-elevators-part-3-wherein-data-is-actually-imported","status":"publish","type":"post","link":"https:\/\/www.crccheck.com\/blog\/dissecting-elevators-part-3-wherein-data-is-actually-imported\/","title":{"rendered":"Dissecting Elevators part 3: Wherein data is actually imported"},"content":{"rendered":"<blockquote><p><em>Intro<\/em>: This is part three of an eight part series looking at the <a href=\"http:\/\/elevators.texastribune.org\/\" target=\"_blank\">Elevator Explorer<\/a>, a fun data interactive mostly coded between the hours of 10 PM to 2 AM during the week leading up to April Fools&#8217; Day, 2013. I&#8217;m going to be looking at the things I learned, things I wish I could have done, and the reasoning behind my design choices. The code I&#8217;ll be referring to will be in this <a href=\"https:\/\/github.com\/texastribune\/tx_elevators\/tree\/2013-april-fools\">tagged release on github<\/a>.<\/p><\/blockquote>\n<h1>Consuming Data<\/h1>\n<figure id=\"attachment_429\" aria-describedby=\"caption-attachment-429\" style=\"width: 384px\" class=\"wp-caption aligncenter\"><a href=\"http:\/\/www.crccheck.com\/blog\/wp-content\/uploads\/2013\/04\/Consume-Data.gif\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-429\" alt=\"Consuming Data\" src=\"http:\/\/www.crccheck.com\/blog\/wp-content\/uploads\/2013\/04\/Consume-Data.gif\" width=\"384\" height=\"288\" \/><\/a><figcaption id=\"caption-attachment-429\" class=\"wp-caption-text\">Python consuming data. Also an accurate depiction of the author.<\/figcaption><\/figure>\n<p>Now that you&#8217;ve got some juicy government data. How about getting it into a database? Having done this quite a few times before, there are some tips and tricks I&#8217;d like to share.<\/p>\n<h2>Make your models mimic the data<\/h2>\n<p>I&#8217;ve gone back and forth on this, and I&#8217;ve settled on making my Django models look like the source data. The reason most people do this is:<\/p>\n<ol>\n<li>It&#8217;s easy.<\/li>\n<\/ol>\n<p>For Texas higher education data, I tried making the Django models an API to the data, but that proved to be a disaster. Writing the importers took longer, and when collaborators tried incorporating new data sources, they ended up creating their own models instead of using the existing models; and we ended up in a worse state than if we had just made the models mimic the data. The technique I&#8217;m trying now is making models that strongly resemble the source data, and then creating additional models to hold denormalized and supplemental data. My co-worker, Noah, has gotten a little farther with this idea and hopefully we&#8217;ll be able to unveil it soon. One benefit of this technique is the ability to wipe and re-import data while keeping supplemental data intact. For example, if the source data was dirty and said there was a 2000 floor elevator in Lockhart, but a human corrects it, the curated data is preserved even between import cycles.<\/p>\n<p>It should be obvious that I modeled my <a href=\"https:\/\/github.com\/texastribune\/tx_elevators\/blob\/2013-april-fools\/tx_elevators\/models.py\" target=\"_blank\">models.py<\/a> off the csv because I even documented the source field names:<\/p>\n<pre><code>class Building(models.Model):\r\n    \"\"\"Some place with an address that holds `Elevator` object(s).\"\"\"\r\n    # LICNO\r\n    elbi = models.IntegerField(u'Building Number', unique=True)\r\n    # BNAME1\r\n    name_1 = models.CharField(max_length=100)\r\n    # BNAME2\r\n    name_2 = models.CharField(max_length=100)\r\n    # BADDR1\r\n    address_1 = models.CharField(max_length=100)\r\n    # BADDR2\r\n    address_2 = models.CharField(max_length=100)\r\n    # BCITY\r\n    city = models.CharField(max_length=50)\r\n    # BZIP\r\n    zip_code = models.CharField(max_length=5)\r\n    # BCOUNTY\r\n    county = models.CharField(max_length=20)\r\n    # ONAME1\r\n    owner = models.CharField(max_length=100)\r\n    # CNAME1\r\n    contact = models.CharField(max_length=100)\r\n<\/code><\/pre>\n<h2>Import all the data!<\/h2>\n<p>Since the models and the data are so tightly coupled, the <a href=\"https:\/\/github.com\/texastribune\/tx_elevators\/blob\/2013-april-fools\/tx_elevators\/scripts\/scrape.py\">importer<\/a> is basically just a simple mapping of the CSV fieldnames to the model fieldnames. There&#8217;s one technique I&#8217;m playing with that I&#8217;m liking a lot. It goes like this:<\/p>\n<ol>\n<li>Iterate over the csv rows.<\/li>\n<li>Separate your row data into fields that are unique, and fields that aren&#8217;t.<\/li>\n<li>Put the non-unique data into a <code>default_data<\/code> dict.<\/li>\n<li>Do a <code>get_or_create(id=row['id'], defaults=default_data)<\/code>.<\/li>\n<li>If the object wasn&#8217;t created, update it like <code>obj.__dict__.update(default_data)<\/code>.<\/li>\n<li>Except don&#8217;t do the update that way, only save the changed fields, and only if you have to. I wrote <a href=\"https:\/\/github.com\/texastribune\/tx_elevators\/blob\/2013-april-fools\/tx_elevators\/scripts\/scrape.py#L15-L37\">some helpers<\/a> to do this:\n<pre><code>def setfield(obj, fieldname, value):\r\n    \"\"\"Fancy setattr with debugging.\"\"\"\r\n    old = getattr(obj, fieldname)\r\n    if str(old) != str(value):\r\n        setattr(obj, fieldname, value)\r\n        if not hasattr(obj, '_is_dirty'):\r\n            obj._is_dirty = []\r\n        obj._is_dirty.append(\"%s %s-&gt;%s\" % (fieldname, old, value))\r\n\r\ndef update(obj, data):\r\n    \"\"\"\r\n    Fancy way to update `obj` with `data` dict.\r\n\r\n    Returns True if data changed and was saved.\r\n    \"\"\"\r\n    for key, value in data.items():\r\n        setfield(obj, key, value)\r\n    if getattr(obj, '_is_dirty', None):\r\n        logger.debug(obj._is_dirty)\r\n        obj.save()\r\n        del obj._is_dirty\r\n        return True\r\n<\/code><\/pre>\n<\/li>\n<li>So instead of <code>obj.__dict__.update(default_data)<\/code>, I do <code>update(obj, default_data)<\/code>.<\/li>\n<li>Django 1.5 supports <a href=\"https:\/\/docs.djangoproject.com\/en\/dev\/releases\/1.5\/#support-for-saving-a-subset-of-model-s-fields\">saving a subset of a model&#8217;s fields<\/a>. It would be trivial to adapt the code to take advantage of this.<\/li>\n<\/ol>\n<p>This technique is slow, but the same speed a typical Django data import. If speed is an issue, you should truncate your tables and do bulk inserts.<\/p>\n<h2>Log all the things!<\/h2>\n<p>So you may have noticed the log statement inside there. Every time I do an import, I get a stream of how the data is changing from my logger:<\/p>\n<pre><code>['last_inspection 2011-11-01-&gt;2012-11-27']\r\n['last_inspection 2012-02-29-&gt;2013-02-13', u'drive_type UNKNOWN-&gt;HYDRAULIC']\r\n['year_installed 2008-&gt;2007']\r\n[u'equipment_type ESCALATOR-&gt;PASSENGER', 'year_installed 2008-&gt;2007']\r\n[u'equipment_type NEW EQUIPMENT-&gt;PASSENGER', 'last_inspection None-&gt;2013-02-21', 'floors 0-&gt;2', u'drive_type ELEVATOR DRIVE TYPE-&gt;HYDRAULIC', 'year_installed 0-&gt;1982']\r\n<\/code><\/pre>\n<p>And it&#8217;s just Python logging. I can change the verbosity and where it gets sent. The best part is, <strong>it helps identify dirty data<\/strong>. If you do an import with the same data twice in a row, updates you see in the second pass mean there&#8217;s some duplicate data in the source because the same changes will keep getting made.<\/p>\n<p>So the terrible thing about logging is by default, it has ugly output. I have a solution for that. In my <a href=\"https:\/\/github.com\/texastribune\/tx_elevators\/blob\/2013-april-fools\/example_project\/settings.py#L163-L165\">settings.py logging configuration<\/a>, I have a <a href=\"https:\/\/github.com\/texastribune\/tx_elevators\/blob\/2013-april-fools\/example_project\/logging_handlers.py\">custom handler<\/a> that does pretty colorized output. The handler is based on <a href=\"https:\/\/gist.github.com\/vsajip\/758430\">this gist<\/a> by Vinay Sajip. I simplified it by stripping out Windows support and include it in all my projects now.<\/p>\n<h2>Conclusion<\/h2>\n<p>The elevator data set was easy to work with; there&#8217;s fewer than 50,000 entries and only one CSV file. The techniques I&#8217;ve learned working with it will definitely translate to bigger projects. The create or update code could use a lot of work, but I&#8217;ve already used it on more complicated data importers elsewhere with <code>DateField<\/code>s and <code>DecimalField<\/code>s and\u00c2\u00a0 <code>TextField<\/code>s, (Oh My!). And just in case you&#8217;re wondering, there is a <a href=\"https:\/\/github.com\/texastribune\/tx_elevators\/blob\/2013-april-fools\/Makefile#L27\">make command<\/a> for grabbing the latest data and importing it. Next time, I&#8217;m going to go over how I got the data out to do a visualization.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Intro: This is part three of an eight part series looking at the Elevator Explorer, a fun data interactive mostly coded between the hours of 10 PM to 2 AM during the week leading up to April Fools&#8217; Day, 2013. I&#8217;m going to be looking at the things I learned, things I wish I could&hellip;<\/p>\n <a href=\"https:\/\/www.crccheck.com\/blog\/dissecting-elevators-part-3-wherein-data-is-actually-imported\/\" title=\"Dissecting Elevators part 3: Wherein data is actually imported\" class=\"entry-more-link\"><span>Read More<\/span> <span class=\"screen-reader-text\">Dissecting Elevators part 3: Wherein data is actually imported<\/span><\/a>","protected":false},"author":2,"featured_media":340,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"Layout":"","footnotes":""},"categories":[50,21,4],"tags":[51,48],"class_list":["entry","author-showmewhatyougot","post-291","post","type-post","status-publish","format-standard","has-post-thumbnail","category-best-practices","category-case-study","category-technical","tag-django","tag-python"],"_links":{"self":[{"href":"https:\/\/www.crccheck.com\/blog\/wp-json\/wp\/v2\/posts\/291","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.crccheck.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.crccheck.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.crccheck.com\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.crccheck.com\/blog\/wp-json\/wp\/v2\/comments?post=291"}],"version-history":[{"count":17,"href":"https:\/\/www.crccheck.com\/blog\/wp-json\/wp\/v2\/posts\/291\/revisions"}],"predecessor-version":[{"id":749,"href":"https:\/\/www.crccheck.com\/blog\/wp-json\/wp\/v2\/posts\/291\/revisions\/749"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.crccheck.com\/blog\/wp-json\/wp\/v2\/media\/340"}],"wp:attachment":[{"href":"https:\/\/www.crccheck.com\/blog\/wp-json\/wp\/v2\/media?parent=291"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.crccheck.com\/blog\/wp-json\/wp\/v2\/categories?post=291"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.crccheck.com\/blog\/wp-json\/wp\/v2\/tags?post=291"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}