A Databricks Unit (DBU) is a measure of how much computation a workload consumes; that consumption is billed in per-second increments.
Several factors affect how many DBUs a given enterprise consumes in an hour. The (arguably) most impactful is the sheer volume of data processed. Volume’s impact (at least compared with the other factors) is roughly linear, meaning that processing 20 TB of data will cost about five times as much as processing 4 TB.
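To make that arithmetic concrete, here’s a rough back-of-the-envelope sketch in Python. The DBU rate, throughput, and price per DBU below are invented placeholders rather than real Databricks figures; the point is simply that per-second billing makes cost track runtime, and runtime scales roughly linearly with volume.

```python
# Hypothetical cost model, not an official Databricks calculator.
# Every number here is an assumption chosen for illustration only.

def estimate_cost(dbu_per_hour: float, runtime_seconds: float,
                  price_per_dbu: float) -> float:
    """Cost of a workload billed in per-second increments of DBU usage."""
    return dbu_per_hour * (runtime_seconds / 3600) * price_per_dbu

SECONDS_PER_TB = 900   # assumed throughput: 15 minutes per terabyte
for tb in (4, 20):
    cost = estimate_cost(dbu_per_hour=4,
                         runtime_seconds=tb * SECONDS_PER_TB,
                         price_per_dbu=0.40)
    print(f"{tb} TB -> ${cost:.2f}")
# 4 TB  -> $1.60
# 20 TB -> $8.00  (five times the 4 TB cost)
```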
Beyond volume, DBU consumption is also influenced by data velocity and data complexity.
In this context, the term “data velocity” refers to how frequently a pipeline loads data. Some ETL pipelines run continuously, which, as you might expect, is the most expensive way to operate. On the other hand, pipelines that load data only a few times per day (or even less often) have considerably lower velocity and, as a result, cost much less to run.
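To show what that choice looks like in code, here’s a minimal PySpark Structured Streaming sketch. The rate source and console sink are stand-ins for a real pipeline; what matters for velocity (and therefore DBU spend) is the trigger.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("velocity-sketch").getOrCreate()

# Placeholder source; a real pipeline would read from Kafka, cloud files, etc.
events = spark.readStream.format("rate").load()
writer = events.writeStream.format("console")

CONTINUOUS = False  # flip to compare the two operating models

if CONTINUOUS:
    # Always-on micro-batches: the cluster never spins down, so DBUs
    # accrue around the clock. This is the most expensive way to operate.
    query = writer.trigger(processingTime="1 minute").start()
else:
    # Run-and-stop: drain whatever has arrived, then terminate. Scheduled
    # a few times per day, compute (and DBU spend) runs only in bursts.
    query = writer.trigger(availableNow=True).start()

query.awaitTermination()
```

In practice, the run-and-stop variant would be launched on a schedule, so the cluster can shut down entirely between runs.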
“Data complexity”, in this sense, captures how much work is required to process a particular data set. If a data set must undergo an involved operation, such as deduplication or a table upsert, it is considered far more complex than data that needs no such processing.
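As an illustration of the kind of work meant here, the sketch below deduplicates incoming records and then upserts them into a Delta table with MERGE. The table name, key column, and input path are all hypothetical.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("complexity-sketch").getOrCreate()

# Assumed inputs: an existing Delta table "events" and a batch of fresh
# records; both names are placeholders for this illustration.
updates = spark.read.parquet("/tmp/incoming/")  # placeholder path

# Deduplication: an extra shuffle over the incoming data.
deduped = updates.dropDuplicates(["event_id"])

# Upsert (MERGE): every run joins the new data against the existing table
# to decide what to update and what to insert, which is far more work,
# and far more DBUs, than a plain append.
target = DeltaTable.forName(spark, "events")
(target.alias("t")
       .merge(deduped.alias("u"), "t.event_id = u.event_id")
       .whenMatchedUpdateAll()
       .whenNotMatchedInsertAll()
       .execute())
```

A plain append skips the join entirely, which is why simple loads sit at the cheap end of the complexity spectrum.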
As a result, and as you might expect, small, periodic, and simple data aggregations require the fewest DBUs, while large, constant, and complex aggregations require the most and drive costs up accordingly. Of course, most data sets fall somewhere between these two extremes, which is why estimating DBU consumption is not always as intuitive as you might assume.