Given the complexity of a properly configured ETL setup, consider the following tips to improve your ETL performance.
Process Evenly Sized Files with the COPY Command
One of the simplest and most effective ways to improve ETL performance is to split the input data evenly across the slices of your data nodes, each of which has dedicated cores. The goal is to give each slice an equal amount of work: evenly sized files distribute the processing load and prevent an overloaded slice from slowing down the whole job.
The exact number of slices depends on the node type. For example, a DS2.XLARGE compute node has two slices, while a DS2.8XLARGE has 16.
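As an illustration, here is a minimal sketch of a Redshift COPY that loads a set of evenly sized, compressed files; the table name, S3 prefix, and IAM role are assumptions:

```sql
-- Hypothetical load: table, bucket, and role are placeholders.
-- The source data is split into 16 evenly sized files (part_0000 ... part_0015),
-- a multiple of the slice count, so every slice gets the same amount of work.
COPY sales
FROM 's3://my-etl-bucket/sales/part_'      -- common prefix of the split files
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
GZIP
DELIMITER '|';
```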
Use Workload Management
To keep ETL runtimes consistent, define multiple workload management (WLM) queues for your different workloads.
For this purpose, create a queue dedicated to the ETL process and configure it with five or fewer slots; limiting concurrency helps mitigate the cost of excessive COMMITs. Then, when a job needs it, claim the extra memory available in the queue with the “wlm_query_slot_count” parameter. Finally, create a separate queue for reporting queries, and do not forget to set up the dynamic memory parameters.
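For example, a memory-hungry step can temporarily take several slots of its queue. This sketch assumes a five-slot ETL queue and a hypothetical staging table:

```sql
-- Claim 3 of the ETL queue's 5 slots (and their memory) for this session.
SET wlm_query_slot_count TO 3;

-- A memory-hungry maintenance step benefits from the extra slots.
VACUUM staging_sales;

-- Release the slots so other ETL jobs can run concurrently.
SET wlm_query_slot_count TO 1;
```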
Organize Parallel Execution
If you experience sluggish data loading, this advice may be the solution: run ETL jobs in parallel, i.e., execute SQL statements with parallel processes. Parallel execution works for conventional queries and DDL commands.
You can also use this feature for DML statements. For this purpose, enable parallel execution with the “ALTER SESSION ENABLE PARALLEL DML” command. You also need to synchronize the parallel degree across all the tables involved and adapt the database configuration accordingly.
Additionally, you can stop the optimizer from deciding on its own whether to run an operation in parallel. For this purpose, use the PARALLEL and NO_PARALLEL optimizer hints, as in the sketch below.
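Here is a minimal Oracle-flavored sketch, assuming hypothetical staging and target tables and a parallel degree of 8:

```sql
-- Allow this session to run DML statements in parallel.
ALTER SESSION ENABLE PARALLEL DML;

-- Force a parallel insert and a parallel scan, overriding the optimizer:
INSERT /*+ PARALLEL(sales_target, 8) */ INTO sales_target
SELECT /*+ PARALLEL(sales_staging, 8) */ *
FROM sales_staging;

-- Keep a small lookup table serial so it is not needlessly parallelized:
SELECT /*+ NO_PARALLEL(d) */ COUNT(*) FROM dim_date d;

COMMIT;  -- parallel DML must be committed before the table is queried again
```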
Reduce and Optimize Datasets at Early Steps
To push ETL performance even higher, reduce the size of the datasets as early as possible so the optimizer spends less time joining irrelevant rows. This is especially vital for statements with complex WHERE conditions, which can make the optimizer read the largest table too early.
Join all the small tables first, and use the LEADING hint to dictate the correct join order, as in the sketch below.
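A hedged Oracle sketch with hypothetical dimension and fact tables; LEADING forces the small, filtered table to drive the join:

```sql
-- The filtered date dimension is joined first, so only matching rows
-- of the large fact table are ever read.
SELECT /*+ LEADING(d f) */ f.order_id, f.amount
FROM dim_date d
JOIN fact_orders f ON f.date_id = d.date_id
WHERE d.fiscal_year = 2023;
```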
Get Rid of Unnecessary Indexes
The typical issue is that data warehouses are over-indexed; in fact, such databases rarely need indexes at all. Creating yet another index to fix a performance problem does not work for ETL jobs. Quite the opposite: it degrades the ETL’s functioning even further.
This is because additional indexes slow down INSERT/UPDATE/MERGE operations. They also push the optimizer toward a nested loops join, an algorithm that loops over the rows of the first table and probes the second table once per row, which you should likewise avoid for bulk loads. So the best way to improve ETL performance here is to drop the unnecessary indexes.
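An Oracle-flavored sketch of finding and dropping a dead index; the index and table names are assumptions:

```sql
-- Start tracking whether the suspect index is ever used.
ALTER INDEX idx_orders_status MONITORING USAGE;

-- ...after a representative workload window, check the usage flag:
SELECT index_name, used FROM v$object_usage;

-- If it was never used, drop it and stop paying for it on every load.
DROP INDEX idx_orders_status;
```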
Load Data in Bulk
The following tip will help you handle big data better. Since a warehouse can store and process petabyte-scale datasets, you need to ensure they are transferred efficiently.
For this purpose, use a manifest file so that COPY ingests datasets from multiple files in a single load. Also, hold the data in staging tables before the transformation, and use the “alter table append” command to swap data from the staging table to the target one. This way, you load all the datasets at once, as in the sketch below.
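A hedged Redshift sketch of this pattern; the bucket, role, and table names are assumptions. Note that ALTER TABLE APPEND requires a permanent source table, so the staging table here is not declared temporary:

```sql
-- 1. Stage: create an empty staging table with the target's structure.
CREATE TABLE stage_sales (LIKE sales);

-- 2. Bulk load every file listed in the manifest in one COPY.
COPY stage_sales
FROM 's3://my-etl-bucket/manifests/sales.manifest'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
MANIFEST;

-- 3. (Run the transformation steps against stage_sales here.)

-- 4. Move the rows to the target in a fast, metadata-level append.
ALTER TABLE sales APPEND FROM stage_sales;
DROP TABLE stage_sales;
```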
Use Diagnostic Queries Regularly
The core of ETL performance improvement is monitoring your BI tool’s health. Such a strategy helps you identify issues before they have an adverse effect on your cluster. You may use SQL monitoring scripts/diagnostic queries for this purpose. Here are a few examples:
- commit_stats.sql. This script shows detailed commit-queue statistics, so you can see the largest queue length and wait time. If there is an abnormality, the INSERT/UPDATE/COPY/DELETE operations are likely taking more time than they should.
- table_info.sql. This script shows storage, key, and unsorted/statistics information for each table. It is valuable for checking whether any transformation steps are delayed.
- v_get_schema_priv_by_user.sql. This script generates a view of schema privileges, showing which users have access to which data. Use it when reporting users can see unnecessary intermediate tables in their reports.
These are only a small sample of the diagnostic queries useful for ETL performance testing, so it pays to keep a list of them and run each regularly, or as a reactive measure after detecting an issue.
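You can also write similar checks yourself. Here is a minimal sketch against Redshift’s SVV_TABLE_INFO system view; the 20-percent thresholds are assumptions:

```sql
-- Flag tables whose unsorted region or stale statistics exceed 20%,
-- i.e., candidates for VACUUM or ANALYZE.
SELECT "table", tbl_rows, unsorted, stats_off
FROM svv_table_info
WHERE unsorted > 20 OR stats_off > 20
ORDER BY unsorted DESC;
```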
Gather Statistics on Target Tables
Consider this the last step of an ETL job. If you handle a complex staging or intermediate table, missing statistics may lead to poor estimates: several stage tables with outdated information may be joined together, producing severe inaccuracies and, in turn, a flawed execution plan. So, increase ETL performance by refreshing statistics on target tables every time you load them.
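A minimal sketch of that final step; the schema and table names are assumptions:

```sql
-- Amazon Redshift: refresh optimizer statistics on the loaded target table.
ANALYZE sales;

-- Oracle equivalent:
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(ownname => 'ETL_SCHEMA', tabname => 'SALES');
END;
/
```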
Fortunately, a well-designed ETL solution has nearly all of the measures mentioned above already integrated, so the probability of performance issues is much lower.