Salesforce MQL campaign attribution with Python (no administrator required!)

11 min readJan 6, 2021

“When you’re a hammer everything looks like a nail.”

We’ve heard this saying before, but I feel like in our typical legacy roles as operators, consultants, or technical specialists focusing on your one ‘specialty’, we get trained as hammers by default.

Need to hang a painting? Great.

Impossible to open pasta jar lid? Technically, it could work, but it’s messy and awkward.

But when we’ve trained ourselves as hammers for years — when we specialize as hammers and we know that’s how you can produce results — it’s tough to turn to a paperclip or a roll of duct tape if it’s not the trusted solution.

This is one of the primary reasons why I wanted to learn Python. I’d been doing consulting that specialized in Salesforce in one form or another for nearly seven years, but time and time again I saw solutions deployed that, while technically possible, were far from ideal and often ended up being more time consuming and less effective at solving the problem at hand.

Problem Statement and My Solution

Problem/gap: Marketing and sales teams are unable to tell what marketing campaigns led a prospect to MQL.

The hammer solution: Ask any Salesforce consultant or administrator and they’ll probably offer a solution that puts the “Campaign at MQL” or something like that on the lead and contact, like so:

Build an “MQL Date” field on the lead and contact object that will store the date that the prospect MQLed, if it doesn’t exist already.
Build a workflow rule or process builder (workflow is definitely better at not erroring at volume) to stamp the MQL date when a prospect’s status changes to MQL. Since you’re doing it for lead and contact, do it twice.
Build a custom field to store “Campaign at MQL” so marketers can see the “source campaign” for MQL.
Now build another process builder to fire whenever a campaign member is marked as “responded.” If the responded date is within 24 hours of the MQL date. Do it twice, again — one for contact, one for lead — or map the field from lead to contact if nobody reMQLs.

It’s an OK solution, but not hugely elegant. You’d have to backfill all of the past data, which would be challenging, or just live without it and start your reporting from the day you deploy the solution. Furthermore if there were two campaign responses right before MQL, there’s no easy way without Apex to figure out the closest ‘source.’

Not to mention, Salesforce people are in high demand, so you also have to get their time in the first place.

I got to thinking: is this something Python and Pandas could do more elegantly? Would it be worth the tradeoff of working in code?

So I fired up Jupyter and I gave it a shot on one of my client’s Salesforce instances, with permission. I’m sharing the code here anonymized — as you’ll see, it can be easily reapplied elsewhere. As I suspected, you can trace and report on campaign attribution for any tracked field history event without ever having to touch Salesforce automation tools, just using Python and Pandas!

Extracting and Assessing Data

First, I imported Pandaforce (a fantastic package for accessing Salesforce data with Python/Pandas), then passed in the security credentials. I made sure to make the password and security token as prompts so nothing sensitive was hard coded.

Next, I wanted to see: are they already tracking the Status change to MQL in lead history?

First I tried SOQL:

SELECT LeadId, Field, OldValue, NewValue FROM LeadHistory WHERE Field=’Status’ AND NewValue=’MQL’

INVALID_FIELD:
LeadHistory WHERE Field=’Status’ AND NewValue=’MQL’
^
ERROR at Row:1:Column:84
field ‘NewValue’ can not be filtered in a query call

OK, so we can’t find it that way. Let’s at least use SOQL to see the values:

SELECT LeadId, Field, OldValue, NewValue FROM LeadHistory WHERE Field='Status'

MALFORMED_QUERY:
WHERE Field=’Status’ GROUP BY NewValue
^
ERROR at Row:1:Column:75
field ‘NewValue’ can not be grouped in a query call

Well then. I guess we’ll do it in Python then. Back to the Jupyter notebook we go!

Now that that’s loaded into memory, let’s take a look at what the lead status values are:

Lead Status # leads
0 Assigned 1565
1 Contacted 1769
2 Converted 351
3 Customer 2264
4 Disqualified 1102
5 Engaging 13
6 MQL 664
7 Nurture 7509
8 Other 4525
9 Qualified 67
10 Raw Lead 4530
11 SQL 10
12 Unqualified 1881
13 Working 6957

(This looks uglier in Medium than I wish it did, but I‘m trying to minimize pasting images of dataframes for accessibility reasons.)

Since this is a client using the measurable lead model, they have some newer data where an MQL is essentially marked as Assigned to a rep, but also some older historical records where their Status used to be marked as ‘MQL.’ We actually cleaned all this data up, but since this is a lead history report, we still have the history of older MQL value changes. No problem.

Digging into Lead History

Now we need to filter down that leadHistory to just those values. We also want to make sure that we only get the first time that this lead switched to an MQL value , so we also want to deduplicate the field history by LeadId, and keep only the first row by CreatedDate. (Since we’re dealing with Lead history, CreatedDate is the date the record MQLed, not the date the lead was created!)

This spits out some lead IDs, the field, the old and new statuses, and the date the lead was first updated to that date. We now have that stored as firstMQL.

The first five results from .head() give us a general sense of the data

Great! Now how do we compare that to the campaign responses and figure out which one caused this MQL?

First, let’s get all campaign members from Salesforce. It’s less expensive to filter them in Pandas now that we have a list of LeadIds that we want to filter down by.

I’ll also do a quick length check to make sure things generally make sense at this point.

len(firstMQL) returns 2205, while len(campMembs) returns 1707.

That’s odd. Why would there be 2,205 instances of a first MQL over the past 18 months that lead history is captured, but only 1,707 campaign responses in that time?

With a few more lines of code, I looked for leads that do not have a campaign member:

leadswithCM=campMembs[‘LeadId’].tolist()
leadswithnoCM=firstMQL[~firstMQL[‘LeadId’].isin(leadswithCM)] 
# That tilde means NOT in the leadswithCM list

From there — and I won’t screenshot this — I confirmed that there are leads in Salesforce with no campaign response because of messy data. Add that to the roadmap to clean up — this will only be as good as the correctly coded campaign responses we have. (I don’t want to complicate things at this stage by including campaign non-responses.)

With that big asterisk aside, back to our data. From here the fun part!

Building our model

We need to compare the data in the campMembs campaign membership table to the leads’ firstMQL data and find the campaigns which the leads responded to just before MQLing. We know from firsthand experience that the marketing automation platform integration can sometimes send this data with a bit of a delay or out of order, so it’s best practice to give this a day or so on either side.

We can find the MQL (or Assigned) date for each lead and loop through them with a for loop. From there, we could determine possible campaign ‘matches’ by finding any campaign responses where the FirstRespondedDate is within 24 hours of the MQL date on either side. Then we can dedupe that table (if there are any dupes) and get that closest campaign response.

First, though, we need to clean up some data types. FirstMQL.info() shows us that we have work to do:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2205 entries, 0 to 2204
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   index              2205 non-null   int64 
 1   LeadId             2205 non-null   object
 2   Field              2205 non-null   object
 3   OldValue           2205 non-null   object
 4   NewValue           2205 non-null   object
 5   LeadUpdatedDate    2205 non-null   object
dtypes: int64(1), object(5)
memory usage: 103.5+ KB

We need to cast our LeadUpdatedDate and FirstRespondedDate to datetime formats for us to be able to perform operations on them. I should have done this when I first brought in the dataframe, but I’m telling a story, OK? 😂

Fortunately, I know from experience that Python is mercifully, gloriously good at reading Salesforce date/time outputs.

I also know thatLeadUpdatedDate got a time zone when we casted it to datetime, probably because it had both a date and a time, but FirstRespondedDate didn’t, since it was actually just a date.

We’ll need to be careful here since the FirstRespondeddDate has no time on it. Someone could have responded to a campaign anytime in a 24-hour period. But I think it’s safe to say that a 24-hour period is enough time to make sure that any marketing attribution comes in. We’re going to have to look at absolute value differences between the MQL date and the FirstRespondedDate since theoretically most MQL dates will be “after” the FirstRespondedDate, if, say, I MQLed at 11AM on Monday but the FirstRespondedDate has Monday at midnight.

Now that our data is set up, onto the heart of the calculation, in which we loop through our leads, find out if there are any campaign responses associated with them, and get that data into output collections. I commented this section pretty heavily, so hopefully it makes sense.

I definitely checked out the leadswithnoCMs output list to see how many MQLs didn’t even have a campaign source within 24 hours, and also confirmed that the results dictionary handled my leads with one or more campaign sources.

Making our calculations readable

From here, what’s left?

Get the results dictionary of leads with campaign sources into a dataframe format looking pretty
Summarize this table to sum up the total MQLs by campaign type and campaign name, which is what most clients want anyway
Get the leadswithnoCMs list of leads with no campaign source also into a nice looking dataframe
Write all this goodness to an Excel doc (or Google Sheet, etc.) for folks to pore over

OK, let’s get to it.

Get the results dictionary of leads with campaign sources into a dataframe format looking pretty

This part wasn’t so bad, but was time consuming: first, it was making a dataframe of the results, and then enriching it with more campaign and lead data.

I will admit that understanding how Pandas indexing works is still something I’m learning about, so I ended up moving around the index in a way that I’m sure is less than maximally elegant while getting the data into a dataframe.

Also, lines 6–9 show where I had to think carefully about deduping so only one campaign response in our 24-hour window gets counted. This was a tricky one, but considering that our MQL date was a datetime coming out of Salesforce but the campaign response was just a date, I had to just pick one. (I’ll talk about that a bit more at the end.)

From there it was just a lot of merging and reordering of columns to make the output data look good.

…at the very end there I also snuck in the extra output dataframe which was just a groupby to save the user having to make a pivot table. Therefore: 2: Summarize this table to sum up the total MQLs by campaign type and campaign name, which is what most clients want anyway also checked off the list. ✅

Next: 3. Get the leadswithnoCMs list of leads with no campaign source also into a nice looking dataframe

This part wasn’t so bad either. It was a little confusing to figure out that I had to merge back in the original MQL date (LeadUpdatedDateas originally exported) rather than the UTC version that we generated for the calculation. Turns out the Excelwriter in the next step doesn’t like timezone aware datetimes, which was a head-scratcher for a minute.

Almost there! Last step!!

4: Write all this goodness to an Excel doc (or Google Sheet, etc.) for folks to pore over ✅

ExcelWriter is awesome. I have always been too lazy to use any of its formatting tricks, but it’s nice to know it’s there.

Phew. The results are all there. In our main table, we have all of our leads and their source campaigns, with IDs and the MQL date / date the lead responded to the campaign on the far right in case you want to check.

Screenshot of the “All leads with source campaigns” output in Excel.

The second tab, our summary totals, gives us a total # of lead MQLs by campaign+type.

Screenshot of the “Campaign source summary totals” output in Excel.

And then for reference, the final tab includes the names, emails, titles, Salesforce ID, and MQL date of the leads where we couldn’t find any campaign source within the 24-hour window of their MQL date. (Not pictured.)

What the future holds

This was a fun project and I look forward to tweaking the code slightly to share with a few orgs that I volunteer for to get them some easy and fast insights into how movement towards sales or prospect lifecycle milestones can be attributed to campaigns. This code says “MQL” all over it, but can be used for anything.

In terms of limitations this code faces:

By far the biggest limitation of this code is that Salesforce field history tracking a) must be opted into, and b) only keeps field history tracking data for up to 24 months. In my expereince 2+ year old MQL data may not be that relevant for fast-growing companies anyway, but it’s still important to know. And if your SFDC administrator hasn’t turned on field history tracking for the key field (lead status = MQL in this case) that you’re trying to track, you’re out of luck.
You have to run it locally on a computer, at least for now; it’s not a report that any user can just run immediately.
This data is not visible at the user level the way that an automation solution as described at the top would be; a user has to run this Python code whenver they want to see this report, rather than operators being able to run their own reports on the data whenver they want. (One could argue that that option’s page layout clutter, chance for user error, and the downfalls of automation failures and inaccuate data backfills would almost balance it out, though.)

Further improvements I’d love to make to this code:

Add the caveats and explanations of how this code works in an explainer “cover sheet” tab at the beginning of the Excel output, if that’s how future viewers want to receive the information. The 24-month window is critical to understand.
Make the field and field values that we’re tracking as the campaign source milestone something that a user can enter, and then have the code look for if there’s enough field history tracking data to make an analysis. For example, rather than just specifying Lead Status is among ['MQL','Assigned'], letting a user specify that they want to track when a lead is converted, or when a Meeting Status field updates to “Complete”.
If we’re thinking to yonder pastures, on the topic of “it’s not just a report that any user can run immediately”, it’d be fun to build this as a web app where users can log in with SSO, enter these inputs from a web interface, and view or download the results file.

That’s all for now. This was a fun learning experience for me and I feel like it’s continuing my thinking outside of the Salesforce administrator / developer box. I hope you find this useful too!