When you’re building a website what should you do with all that rarely (yet not never) changing seed data? I’m talking about things like Localization strings, choices in some drop down menus and so on. This is a situation I’ve faced a few times and I’ve gone through a variety of solutions to the problem. A couple of projects back, we tried putting everything in flat files - in the case of localization strings, this tends to be encouraged by the framework - but then you end up having to teach people Git just so they can fix a few typos. Then when they’re done (and you’ve cleaned up the mess they made in Git) they complain:
Foreign marketing dude: ‘elp, somezing eez wrong! I’m about to do a demo and I do not see zee changes I made to zee site.
me: Oh, yeah. I was just notified by Jenkins that the last build failed. I think I can fix it quickly, then I’ll redploy.
Foreign marketing dude: Re-de-huh?! And ‘oo eez zis Jenkins?
Too technical. So then we figured everything had to go in the database and we’d make a user-friendly front-end. Every table needed its own CRUD admin pages to maintain the data. Some grew huge and even needed a search box - AJAX naturally. As far as the end-user website was concerned, the project turned out to be a bit of a flop but the admin site rocked! At least I thought it did til one day someone accidentally deleted loads of localization strings. He didn’t even seem that concerned:
Foreign marketing dude: Zis Git, ‘eez taken care of zee backups, no?
me: Erm…
So to my current project and I started pondering what a reasonable amount of development time was to spend on seed data maintenance and in the end, I came to the conclusion: not a lot. At the same time, we’d started using Google Docs for pretty much anything we could so I figured we could use Google Docs spreadsheets to maintain the seed data. They have some convenient properties:
- Anyone you give access to can edit the data. We share the documents “People at Our Organization who have the link can edit”. We’re only four but you can share with individuals if you prefer.
- It has version control. If someone messes it up you can see the whole history, click on individual revisions to see what changed, and revert to a previous version.
- Everyone understands spreadsheets (at least better than git). In fact our previous attempt at making a seed data admin interface began looking more and more like a spreadsheet since all the data’s tabular by nature and it’s convenient seeing all the values in place and being able to edit in situ.
- There’s a well-documented API which provides CSV export. Authenticating is slightly tricky but thereafter it’s just one call to download a CSV-export.
So how does it work?
Each class of data (what would normally be a table in a database) has its own spreadsheet, and a config file in the app declares all the links.

When the app launches, it authenticates to Google Docs using a dedicated apps account and, for each document it hasn’t already cached as a file, it saves the CSV export as a file in the cache directory, and loads the data. The cache means that if the app is restarted and Google Docs is having a bad day, the app can start fine using the previous run’s data. Also, if someone is in the middle of editing when the app happens to restart, you don’t want the half-baked changes appearing.
There’s a single seed data admin page where site admins with links to edit the spreadsheet and reload the data.

The cached file is replaced and the app begins using the new data straight away. Note, when you’re re-reading the data, don’t just empty the data structure holding the data and load it up again because if any requests arrive while that’s happening, they’ll see partially loaded data. Instead, load it into a new structure and make the actual structure point to the new one. This also helps cope with bad data since if the actual import crashes, the app can carry on using the old data.
Details, details, details
Our app is built using Lift so we could use Google’s Java API to access Docs. We also used JavaCSV to parse the CSV data. Of course CSV is quite a trivial format so it’s tempting just to do it yourself, but there are a few corner cases like data containing commas and new-lines which 3rd party libraries handle. Each set of seed data is handled by a model class holding the dataset in an immutable Map.
Each of the model classes extends one of two generalizations. One deals with multi-column data such as in the first screenshot above. In these cases, lookups are typically made for a key to retrieve a class with properties mapped to the columns in the spreadsheet. The other generalization deals with data such as localization where a column is picked first (e.g. the locale) and a key lookup is made to fetch a single element.
Conclusions
So how’s worked out? Pretty good. We’ve used this in production now for six months and so far there haven’t been any hiccoughs. You wouldn’t want to use this for large data sets since everything’s held in memory but in each case we’ve used it, that’s worked to our advantage. Doing round trips to the database to look up localization strings when there could be 50 on a page, for example, would be very expensive.
If you try it yourself, let me know how it goes.
