This post was adapted from the Odaseva Data Innovation Forum session, “How to Efficiently Process Data in Salesforce” featuring Sarath Yalavarthi, CTA, Application Owner for Consumer Banking CRM at JPMorgan Chase & Co.
Watch the session replay here and see a clip below.
In today’s dynamic digital landscape, data reigns supreme. Efficiently processing data in Salesforce is crucial. This post reviews Sarath Yalavarthi’s presentation from the Odaseva Data Innovation Forum for Salesforce Architects about the three pivotal operations that are integral to efficient processing of data in Salesforce: data injection, data retrieval, and data transformation.
Below are the considerations and approaches for each:
Data injection is the process of collecting and inserting data from various sources into Salesforce.
It’s important to take into consideration the volume of records: is it a few records, thousands, or millions of records?
Another important consideration is the complexity of the data set: is it a few picklist fields, a handful of files, or a complete data set?
How frequent is the data load? Is it a one-time load, or does it run daily, weekly, or monthly?
Is the data being sourced from an ERP system, a live stream, or spreadsheets?
The approaches to data injection are as follows:
Manual data entry is suitable for limited use cases or instances where data entry is performed directly by end users, particularly in small-scale scenarios.
The Data Import Wizard supports a few objects and can load up to 50,000 records. For example, custom objects, accounts, contacts, leads, and campaign members are supported, but opportunities or opportunity products are not.
The Data Loader allows for data import and export and is efficient when handling datasets below 5 million records. So if the requirement is to upload or download a million records in a single operation, utilizing the Data Loader is recommended.
The Salesforce APIs allow users to interact with Salesforce programmatically.
With the APIs, the user can write code to create or update records, or use the Bulk API to extract huge volumes of data. This is the right approach for large datasets and complex data integrations.
ETL tools are the right choice for processing daily deltas, taking backups, or performing complex integrations.
The considerations for data retrieval are the same as for data injection: volume, complexity, frequency, and source of the data.
The approaches for data retrieval are also similar to those for data injection, with the following exceptions:
If a dataset containing fewer than 2,000 records needs to be exported, it is possible to create a report and download the data.
With the Weekly Data Export, backups can be scheduled and made available for download. The entire Salesforce dataset can be obtained by scheduling downloads, organized into multiple CSV files, each with a file size limit of 512 MB.
Data Loader, API, and ETL Tools approaches are the same as they are for data injection.
The considerations remain the same for data transformation as well:
The approach is determined by the number of records and the use case. Let’s take a look at what those use cases are:
Flows come in different kinds: before-save record-triggered flows, after-save record-triggered flows, record-triggered flows with scheduled paths, and schedule-triggered flows. Flows are a good fit for transformations of medium complexity. For example, a scheduled flow can process up to 250,000 records, so for a smaller use case, a flow is a good option.
For use cases that involve very complex calculations, triggers are used. Triggers are event-driven, so whenever a record is created, updated, or deleted, actions like updating or creating auxiliary records can be performed. Triggers are very efficient and can process complex logic, making them the right tool for real-time data transformation.
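As an illustration, the sketch below shows a minimal before-save trigger performing a real-time transformation; the Normalized_Email__c field is a hypothetical custom field used for this example, not something from the session.

```apex
// Minimal sketch of a record-triggered transformation in Apex.
// Normalized_Email__c is a hypothetical custom field used for illustration.
trigger ContactTransform on Contact (before insert, before update) {
    for (Contact c : Trigger.new) {
        if (c.Email != null) {
            // Real-time transformation: derive a normalized value as the record is saved
            c.Normalized_Email__c = c.Email.trim().toLowerCase();
        }
    }
}
```

Because before-save triggers modify records in memory before they are written, no additional DML is required, which keeps this kind of transformation inexpensive even under load.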
Formulas are easy to use, are calculated at runtime, and do not take up database space.
An Apex batch job is the right tool for running transformations periodically. It is capable of processing up to 50 million records at a time. Since most implementations involve fewer than 50 million records, the Apex batch job is an ideal tool. It can be scheduled to run during non-business hours so that it doesn’t impact work during business hours.
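A minimal Batch Apex sketch under assumed names is shown below; Status__c is a hypothetical custom field, and the query and transformation are placeholders rather than the session’s actual implementation.

```apex
// Illustrative Batch Apex class: periodically transform Account records in chunks.
// Status__c is a hypothetical custom field used only for this sketch.
public class AccountTransformBatch implements Database.Batchable<SObject> {
    public Database.QueryLocator start(Database.BatchableContext bc) {
        // A QueryLocator can iterate over up to 50 million records
        return Database.getQueryLocator(
            'SELECT Id, Status__c FROM Account WHERE Status__c = \'Pending\''
        );
    }
    public void execute(Database.BatchableContext bc, List<SObject> scope) {
        List<Account> accounts = (List<Account>) scope;
        for (Account acc : accounts) {
            acc.Status__c = 'Processed'; // placeholder transformation
        }
        update accounts;
    }
    public void finish(Database.BatchableContext bc) {
        // Chain another batch or send a completion notification here if needed
    }
}
```

The job can be launched with Database.executeBatch(new AccountTransformBatch(), 2000) and run during non-business hours via a Schedulable wrapper and System.schedule.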
For implementations with more than 50 million records, you can scale on-demand with Functions.
Salesforce Data Cloud, a relatively new offering, allows data to be stored and processed outside of the core Salesforce org. Data Cloud can be used to transform and process huge volumes of data without impacting the core platform.
Once data transformation is complete, the records can be seamlessly consumed through currently exposed features, such as Data Cloud enrichment and various others, which became generally available in the latest release. This marks a valuable and noteworthy addition to the functionality.
If there is a need to process a dataset of, say, 200 million records, and an existing ETL solution is in place for performing aggregate calculations, it is recommended to utilize the ETL tool to extract the necessary information, perform all the required data transformations, and load the results into the aggregate objects.
If a particular table or an object has more than 5 million records, then that is generally considered a Large Data Volume (LDV).
LDV can significantly impact reports, queries, and list views: they can time out, or loading times can become very long. It can also cause skews, whether lookup skews or master-detail skews, resulting in substantial delays when updating child records. Roll-up summary fields can become a challenge when there are more than 10,000 or 20,000 child records, leading to record locking. Sharing calculations, especially with large data volumes, can take considerable time, ultimately diminishing overall system performance. Moreover, too much data incurs additional storage costs from Salesforce and may also lead to security issues.
Skinny tables serve as an efficient means to access predefined data. For instance, consider an object such as Account with 200 fields. When a specific report requires only 10 of them, working with Salesforce support can help create a skinny table containing just those 10 fields. Consequently, any report, query, or list view involving these fields will be very swift and efficient. It’s important to note that this feature is not supported on all objects; it is available only for custom objects, accounts, contacts, and a limited set of others.
Standard indexes are pre-configured and enabled out of the box on specific fields, such as ID fields, CreatedDate, and several others. In the absence of a standard index, collaborating with Salesforce support can help create a custom index so that queries against that field are indexed. This proves to be highly efficient, similar to indexes in SQL Server or Oracle.
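As a sketch of what an indexed query looks like from the developer’s side, the SOQL below assumes a hypothetical Region_Code__c custom field for which a custom index has been created by Salesforce support; with a selective filter on the indexed field, the query avoids a full table scan.

```apex
// Illustrative SOQL: a selective filter on an indexed field avoids a full scan.
// Region_Code__c is a hypothetical custom field with a support-created custom index.
String regionCode = 'NA-EAST';
List<Account> regionalAccounts = [
    SELECT Id, Name
    FROM Account
    WHERE Region_Code__c = :regionCode
    LIMIT 10000
];
```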
Divisions can be used to achieve a form of horizontal partitioning of the data, so that the dataset being queried stays small.
Use only the active data in Salesforce and archive the remaining data. If there is a need to access the archived data, external objects can be used in conjunction with the Salesforce Data Cloud.
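For example, archived data surfaced back into Salesforce (typically through Salesforce Connect) appears as external objects with an __x suffix and can be queried much like regular records; the object and field names below are hypothetical.

```apex
// Illustrative sketch: reading archived rows exposed as an external object.
// Archived_Order__x, Order_Number__c, and Amount__c are hypothetical names.
List<Archived_Order__x> archivedOrders = [
    SELECT ExternalId, Order_Number__c, Amount__c
    FROM Archived_Order__x
    WHERE Order_Number__c = '00012345'
];
```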
While processing LDVs, roadblocks may arise such as row locks and governor limits like the maximum number of bulk jobs that can be executed in a single day. Given the volume of bulk jobs, many may still be waiting in the queue even after they are submitted. High database CPU utilization can prompt Salesforce to throttle the entire Org, resulting in frequent timeouts.
To overcome these roadblocks, the following approaches can help chart the path forward:
To ensure the smooth progression and processing of all records, the most straightforward and efficient approach is to leverage the latest Bulk API version. Even for those already utilizing V2, it is crucial to stay on the latest API version to capitalize on the additional capacity and capabilities that have been introduced.
Leveraging Salesforce support is yet another strategy. This requires some advance planning, as it may take a month or two. It involves securing an increase in the Bulk API batch limit as well as the concurrent batch processing limit. Caution is advised: excessive increases may lead to higher database CPU utilization, resulting in throttling. This is why it is important to test thoroughly in a full copy sandbox to determine the optimal concurrent batch limit for the Org’s specific requirements.
Implement a bypass strategy that skips all automation and validations for the ETL user. This is hugely beneficial, as it makes data insertion very fast.
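One common way to implement such a bypass (a sketch under assumed names, not necessarily the session’s approach) is a hierarchy custom setting that is checked at the top of each trigger handler and referenced in validation rules; Automation_Bypass__c and Bypass_All__c are hypothetical names.

```apex
// Illustrative bypass guard: skip custom automation when the running user
// (for example, the ETL integration user) has the bypass flag enabled.
// Automation_Bypass__c is a hypothetical hierarchy custom setting
// with a Bypass_All__c checkbox field.
public class AccountTriggerHandler {
    public static void beforeInsert(List<Account> newAccounts) {
        Automation_Bypass__c settings = Automation_Bypass__c.getInstance(UserInfo.getUserId());
        if (settings != null && settings.Bypass_All__c) {
            return; // Bypass enabled for this user: skip automation so bulk loads run fast
        }
        // ... normal automation logic for interactive users
    }
}
```

The same checkbox can be referenced in validation rules via $Setup.Automation_Bypass__c.Bypass_All__c so that validations are skipped for the ETL user as well.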
For processing LDVs, it is advisable to defer sharing calculations. Running the load during off-business hours ensures that, shortly after it completes, sharing rule recalculation can be kicked off quickly.
Data Cloud is highly beneficial, particularly when dealing with millions or billions of records. Data Cloud enables the user to perform LDV operations more efficiently.
Bulk API V2 efficiently manages batch operations and handles retries, up to 15 times. With Bulk API V2, only ingest operations (inserts and updates) contribute to the batch count; query jobs are not counted. With the original Bulk API, executing a query that requires 5,000 batches counts against the daily limit of 15,000 batches. With V2, the same query consumes no batches at all; instead, there is a limit on the number of query jobs that can be run in a day (approximately 10,000), with no limit on the number of batches consumed.
Additionally, there is a limit on the total results that Bulk API V2 can extract in 24 hours; it is set at 1 TB. This, however, isn’t a strict limit: if heightened data processing is expected during specific periods, this limit can be increased with Salesforce support’s help. Retries and batch management are taken care of automatically, and they are not counted against your limits.
Want to learn about more ideas, opportunities, and strategies to maximize the value of Salesforce data? Watch sessions from the 2023 Data Innovation Forum for Salesforce Architects here.