TL;DR - this was my approach:
- create migrations on the old project to transform the data into the new schema structure
- run these migrations on a dump of the live database
- dump just the data of the tables you care about from the migrated database with `pg_dump`
- import this dump into the new database with `psql`
Once I worked through a few kinks, it was all remarkably straightforward.
Background
- Swing Out London is a CMS for managing swing events
- I’m rebuilding it from scratch as a totally new project
- In particular, the old data model was insane and the new one is turning out much cleaner
- The current live system is Rails 3, hosted on Heroku - so it uses Postgres for the live database, but SQLite for the development database
- The new system is Rails 4 and uses Postgres for the development database
I want to be able to import data from the current live site into my work-in-progress new schema as I go along, for a few reasons:
- Doing the migration incrementally will hopefully be easier than doing it all in one go at the end
- Working with real data will help me make better design decisions earlier
- Being able to see what is missing between the old and new schemas will help to make sure I don’t miss anything
The challenge
Data which in the old app exists in one god object (`event`) needs to be divided between three tables in the new app (`events`, `event_seeds`, `event_generators`).
In particular, in the new app, events with individual dates are modelled by having:
- an `event_generator` with a date, which belongs to…
- an `event_seed` (which holds the main data) and belongs to…
- an `event`, which groups `event_seed`s together (sketched in code below).
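In Ruby terms, I imagine the associations roughly like this - a sketch based on the description above, not the actual models from the new app:

```ruby
# Sketch of the new data model (simplified; not the real models)
class Event < ActiveRecord::Base
  has_many :event_seeds # an event groups event_seeds together
end

class EventSeed < ActiveRecord::Base
  belongs_to :event          # the grouping record
  has_many :event_generators # the seed holds the main data
end

class EventGenerator < ActiveRecord::Base
  belongs_to :event_seed # a generator carries the date of an occurrence
end
```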
In the old app, dates are stored in a `swing_dates` table (which has unique entries) and associated with `events` via a join table: `event_swing_dates`. (Don’t do this. If you can believe it, this was done in an attempt to speed up database queries.)
The thought of transforming this data at the SQL level gives me a headache. Fortunately, in the old app this complexity is hidden under the hood and events just have a `#dates` method which pulls out the relevant list. For the sake of sanity, my import solution needs to take advantage of that method - hence doing the transformation at the application level using migrations.
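For context, I’d guess `#dates` in the old app amounts to something like the sketch below - the associations follow from the tables described above, but the `date` column name on `swing_dates` is my assumption:

```ruby
# Rough sketch of how the old app hides the join-table complexity
class Event < ActiveRecord::Base
  has_many :event_swing_dates                        # the join table
  has_many :swing_dates, through: :event_swing_dates # unique date records

  # Pulls out the list of dates for this event
  def dates
    swing_dates.map(&:date) # assumes a `date` column on swing_dates
  end
end
```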
Step 0: Convert the old app to use Postgres
I had never gotten around to updating the old app to use Postgres in development, so the first step was to sort that out. Luckily this is pretty easy:
- In the `Gemfile`, remove the `sqlite3` gem and move the `pg` gem out of the production group (see the sketch below).
- In `database.yml`, change the SQLite config to a Postgres config.
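For the first point, the relevant bit of the `Gemfile` ends up looking something like this (a sketch - versions and group layout will vary by app); for the second, the key change in `database.yml` is setting `adapter: postgresql` for each environment.

```ruby
# Gemfile (after the change): pg is available in every environment,
# sqlite3 is gone entirely
gem 'pg'

group :production do
  # the pg gem used to live here
end
```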
Step 1: Create migrations
I used a number of migrations to do the data transformation. Since some of them are irreversible, there’s no real reason to separate them out, other than organising the code:
ChangeVenuesToNewFormat
The venues table isn’t changing much, so this just involved renaming a column, adding some not-null constraints (probably not necessary) and dropping the columns which aren’t being imported yet (source).
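The linked migration is the real thing; as a rough sketch of its shape (with made-up column names), it’s along these lines:

```ruby
# Sketch of the venues migration: rename, tighten, drop
# (column names are illustrative, not from the real schema)
class ChangeVenuesToNewFormat < ActiveRecord::Migration
  def up
    rename_column :venues, :title, :name     # hypothetical rename
    change_column_null :venues, :name, false # not-null constraints
    change_column_null :venues, :address, false
    remove_column :venues, :outdated_info    # columns not imported yet
  end

  def down
    raise ActiveRecord::IrreversibleMigration
  end
end
```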
CreateNewEventsTables
This involved copying the section of the schema in the new app which creates the `event_seeds` and `event_generators` tables. This way the tables are exactly the same as in the new schema. (source)
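In outline, that looks something like the following - the column lists here are illustrative placeholders, since the real ones are copied verbatim from the new app’s `schema.rb`:

```ruby
# Sketch: create the new tables so they match the new app's schema
# (columns shown are placeholders, not the full definitions)
class CreateNewEventsTables < ActiveRecord::Migration
  def change
    create_table :event_seeds do |t|
      t.references :event, null: false
      t.references :venue
      t.timestamps
    end

    create_table :event_generators do |t|
      t.references :event_seed, null: false
      t.date :start_date, null: false # hypothetical column name
      t.timestamps
    end
  end
end
```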
MoveDataToNewEventTables
This is where the gnarly stuff happens. First I need a way of putting data into the new tables. I could go and create full models in the app, but it’s possible to just create lightweight classes right in the migration:
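Something along these lines (a sketch of the idea rather than the exact classes from the migration):

```ruby
# Lightweight ActiveRecord classes defined in the migration file itself:
# just enough to write rows into the new tables, with no validations
# (in a Rails 3 app you may also need attr_accessible for mass assignment)
class EventSeed < ActiveRecord::Base
  belongs_to :event
  has_many :event_generators
end

class EventGenerator < ActiveRecord::Base
  belongs_to :event_seed
end
```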
I guess I should maybe also have included all of the validations here, but I’m fairly confident that the validations are basically equivalent between the two apps, and in any case, the relevant fields have `null: false` constraints on them in the database, so I’m not going to sweat it.
Now we can actually create the new records:
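Roughly like this - a sketch rather than the real migration code, and the specific columns being copied (`venue_id`, `start_date`) are assumptions:

```ruby
class MoveDataToNewEventTables < ActiveRecord::Migration
  def up
    Event.find_each do |event|
      # One event_seed per old event, holding the main data
      event_seed = create_event_seed(event)

      # One event_generator per date, via the old app's Event#dates
      event.dates.each do |date|
        create_event_generator(event_seed, date)
      end
    end
  end

  private

  # Convenience wrappers around EventSeed.create and EventGenerator.create
  def create_event_seed(event)
    EventSeed.create(event_id: event.id, venue_id: event.venue_id)
  end

  def create_event_generator(event_seed, date)
    EventGenerator.create(event_seed_id: event_seed.id, start_date: date)
  end
end
```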
Note the use of `event.dates`. `create_event_seed` and `create_event_generator` are just wrappers around `EventSeed.create` and `EventGenerator.create` for convenience.
This migration also dropped most of the fields on `events` but kept the original ids intact - I have a feeling that will turn out to be useful later on. (source)
CreateNonMigrationTables and DropUnusedTables
These migrations create the remaining missing tables (like Users, which we’re not going to import data into) and drop the tables which don’t yet have any counterpart in the new app. (source 1, source 2)
Given that in the end I only migrated data from specific tables, this probably wasn’t necessary, but it’s handy to be able to verify that the schemas between the old and new app exactly match, and to see what hasn’t been migrated yet.
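These are mostly mechanical: more `create_table` blocks copied from the new schema, plus drops along these lines (I’d guess the old `swing_dates` tables end up here, since they have no counterpart in the new app):

```ruby
# Sketch: drop old tables with no counterpart in the new app
class DropUnusedTables < ActiveRecord::Migration
  def up
    drop_table :event_swing_dates
    drop_table :swing_dates
  end

  def down
    raise ActiveRecord::IrreversibleMigration
  end
end
```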
Step 2: Run the migrations on the live database
So we’ve now got a set of migrations which convert the old schema into the latest version of the new schema. Time to test it out on some real data.
Getting a database dump from Heroku is easy. You first need to drop the existing development database with `dropdb soldn1_dev`, and then you can create a new one as a copy of the live database. For me that looks like this:
heroku pg:pull HEROKU_POSTGRESQL_GRAY soldn1_dev
Then it’s just a case of running `rake db:migrate` and crossing your fingers. If it’s successful, the `schema.rb` files should match between the old and new app (perhaps modulo some indexes).
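A quick way to check is to diff the two schema files - here `../old_app` is a placeholder for wherever the old app is checked out:

diff ../old_app/db/schema.rb db/schema.rb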
Step 3: Dump the data
So now that we’ve got our data into the form we want, how do we actually move it into the new database?
Dumping the whole database would mean we’d carry over the `schema_migrations` table as well. `schema_migrations` consists of a list of identifiers which match up to an app’s migration files - it’s what Rails uses to track which migrations have been run, which is important for e.g. `rake db` commands, so we don’t want to modify it.
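If you want to see what’s in it, a quick query does the job (the version numbers are the timestamps from the migration filenames):

psql soldn1_dev -c "SELECT version FROM schema_migrations ORDER BY version;"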
The simplest approach seems to be to dump only the tables which actually contain data which we want to migrate:
pg_dump soldn1_dev -a -t event_generators -t event_seeds -t events -t venues > dump
`pg_dump` actually writes to `stdout`, so that `> dump` creates a file from the output.
The `-a` flag tells `pg_dump` to dump only the data. Although we’ve engineered a state where the migrations on the old site and the new site both result in the same schema, it seems cleaner to only move across the data rather than recreating the tables.
Step 4: Import the data
If there are any records already in the relevant tables with IDs which would clash with data in the import, then this would need to be resolved somehow before importing. I don’t care about any existing data, so I’m happy to resolve it by clearing out any existing records with `rake db:reset`.
To load the data from the dump file:
psql soldn2_dev < dump
And that’s it: the data is now in the new database. Job done.
Thanks to @zeeraw for the suggestion to use Rails migrations.