Data Migrations and The Importance of Centralizing Your Transformation Code

When discussing data migrations, I often talk about the importance of “repairability”, the ability to fix data caused by an error in your data migration, weeks or even months post go live. In fact, some time ago I wrote an article on that exact subject. In that article I briefly mentioned the importance of centralizing your transformation code but didn’t really delve into why. But to fully understand why, we first need some history.

RAD (Rapid Application Development) has been around since at least the late 70’s, but didn’t really take off until 1991 with Microsoft‘s release of Visual Basic in 1991 (now affectionately call VB Classic). It almost seemed as if Visual Basic was designed with RAD in mind, it provided developers with a graphic user interface so that they can quickly prototype out screens (or windows or forms) by dragging objects onto in – show it to the user, iterate, then go back and write the code behind the screen that runs when the user interacts with it.  If this sounds a lot like Salesforce’s “Click not Code”, that’s probably because Salesforce’s “No code” ethos was heavily influenced by this. In fact, Salesforce actually calls itself a RAD Platform!

One of the biggest (or at least in my opinion, most valid) criticisms of Visual Basic, was its tendency to produce “spaghetti code”. Every single object on the screen could have a code that triggers based off and event with the object. So if I have 20 text boxes on a Screen (or window or form), each of those objects could fire custom code based on a multitude of events (OnClick, OnGotFocus, OnChanged,  OnLostFocused, OnDoubleClick, OnDrag, OnDrop ..whatever!). This made it nearly impossible to instinctively know what’s going on, every odd/unexpected behavior needed to be thoroughly traced through all that spaghetti to figure out what odd combination of user actions caused it.

The VB Classic IDE – A command button with an “OnClick” event (and with horrible default names)

Whether or not Salesforce has a spaghetti code problem can be debated, but I don’t think anyone would argue that any issue it has with spaghetti code, is anywhere near VB Classic levels. And in all fairness, this wasn’t an issue that was isolated to VB, many programming languages at the time suffered from similar issues (and some still do, though most people would blame the developer not the programming language).

Anyhow, at the same time VB was growing up, so where ETL and Middleware. And like VB most ETL tools adapted a visual programming style. But because ETL tools did not need to generate a UI, they were much more successful at building tools that don’t produce spaghetti code. Whereas VB wanted to have the IDE mimic the UI, ETL tools aimed to have the IDE mimic a data flow diagram, and in the past 20 years this has hardly changed. If you look at the DTS IDE below, you would hardly know that it’s nearly 20 years old. It looks surprisingly modern as an ETL tool. (In all fairness, you could argue that the VB IDE shown above doesn’t look that different then the current VS WinForms IDE – but not many people write WinForm apps nowadays)

DTS as part of SQL Server 7.0

This happens to be one of my  favorite things about ETL tools, that they tend to be very visual, they truly live up to the dream of visual programming – BUT over time, as they matured and offered more and more transformation tasks (or components), you find yourself needing to chain together too many of these transformation components, often where each acts on a every row in the data set (so it can be slow). This makes it difficult to track down every transformation that happened to a field. Though I wouldn’t call this spaghetti code, its most certainly not intuitive or easy to follow.

When we are migrating data to Salesforce, the first step in the process is to transform our data so that it exactly matches the Salesforce data model, right down to every field to be populated on the target Salesforce object.  For each object, I recommend you have all your transformation code in a single place, that could be a single block of code within your ETL tool, but the end result is that you want to have one piece of code per Salesforce object that handles all the transformations needed for that object. I tend to use SQL views or stored procedures, but a Python script (or whatever your favorite scripting language is) can work just as well. This practice makes it very easy to compare the source data, the data after transformation, and the data as it is in Salesforce after it is loaded (in other words, to QA test). It’s also makes tracking down issues very easy, and generally performs better (is faster).

An additional benefit of using something like SQL or python Is that it makes your transformation code ETL tool agnostic. If a client is requiring you to use a particular tool for your migration (or integration) you can heavily leverage your existing  knowledge to code that migration (or integration) and only leverage the tool for data pumping, and administrative type stuff like error handling, logging and scheduling. You can also then build a reusable transformation code library that is ETL tool agnostic.

As part of your data transformation code, you should rename the columns to match the Salesforce object field API names. This way, your code mirrors both the documentation as well as the mappings done in your ETL tool. Again, this just makes testing very easy. In addition, most Salesforce connectors (as part of your ETL or middleware) have an automapping feature with which, if the field names match, the tool auto-maps the source fields to the targets. This practice eliminates any room for human error. After the automapping is done, you can confirm there is nothing left unmapped; then you’ll know nothing fell through the cracks. If your mapping document is in a spreadsheet, you can use formulas to autogenerate the bulk of your transformation code, because most fields don’t need to be transformed; they just need to be mapped. When done properly, your transformation code should be so clean that someone could reverse-generate the data mapping document from it.

This article is adapted from my book “Developing Data Migrations and Integrations with Salesforce: Patterns and Best Practices” (Apress December 2018) and covers Best Practices # 8 & 9,

Enjoyed this article? Visit my blog at: or subscribe by Email or RSS

Written by
David Masri

David Masri

Director of Professional Services

Let's Get Started