Batch Size vs. Parallelization
As engineers, we strive to achieve low latency and high performance in our systems. In today’s world of cloud computing, we have all the resources at our disposal to achieve this goal. As our businesses grow, so does the throughput of requests hitting our systems, and that leads to the very common horizontal vs. vertical scaling discussion.
In this note, I will talk about an adjacent topic: batch size vs. parallelization. In the world of Big Data, Machine Learning, and Artificial Intelligence, we deal with gigantic datasets and humongous models, both on the order of tens to hundreds of billions of entities (records or parameters). Since no single machine can process this sheer volume of data, we rely on distributed systems.
There are two broad ways to go about processing these large datasets. Option 1 splits the data into a few large chunks, while Option 2 splits it into many small chunks. For a fixed dataset size, the number of chunks is simply the dataset size divided by the batch size, so Option 1 gives us fewer chunks and less opportunity for parallelization, while Option 2 gives us more chunks and room for more parallelization. In other words, for a given dataset, batch size and the achievable degree of parallelization are inversely related, as the small sketch below illustrates. Let us now discuss the pros and cons of these two parameters.
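To make the relationship concrete, here is a minimal Python sketch. It assumes an in-memory list of records and a simple CPU-bound per-record transform; the function names and sizes are illustrative, and a real pipeline would of course read from distributed storage rather than a local list. The point is only that the chosen batch size fixes the number of chunks, and therefore how much parallelism a worker pool can actually exploit.

```python
# A minimal sketch, assuming an in-memory dataset and a CPU-bound transform.
from concurrent.futures import ProcessPoolExecutor

def process_batch(batch):
    # Placeholder work: in practice this might featurize, score, or aggregate.
    return [record * 2 for record in batch]

def run(records, batch_size, max_workers):
    # Fewer, larger batches -> fewer parallel tasks; smaller batches -> more tasks.
    batches = [records[i:i + batch_size] for i in range(0, len(records), batch_size)]
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(process_batch, batches))
    return [item for batch in results for item in batch]

if __name__ == "__main__":
    data = list(range(1_000_000))
    # Option 1: large batches -> only 4 chunks, so at most 4 workers are ever busy.
    out_large = run(data, batch_size=250_000, max_workers=8)
    # Option 2: small batches -> 100 chunks, enough work to keep all 8 workers busy.
    out_small = run(data, batch_size=10_000, max_workers=8)
```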
Since this is not a one-size-fits-all problem, we must run experiments to fine-tune our systems. Tuning too aggressively (very large batches or very high parallelization) can strain memory and shared resources, leading to contention and throttling. Tuning too conservatively (small batches with low parallelization) leaves the system's resources under-utilized. A simple benchmarking sketch along these lines follows.
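Here is one way such an experiment could look: a small grid sweep over batch sizes and worker counts, reusing the same chunk-and-pool pattern as above and reporting throughput for each combination. The workload, grid values, and pool sizes are illustrative placeholders, not recommendations; in a real system you would benchmark your own jobs and also watch memory usage and throttling metrics, not just records per second.

```python
# A minimal tuning sketch; the workload and grid values are illustrative only.
import time
from concurrent.futures import ProcessPoolExecutor

def process_batch(batch):
    # Stand-in for real per-batch work (featurizing, scoring, aggregating, ...).
    return sum(x * x for x in batch)

def measure(records, batch_size, workers):
    # Chunk the dataset, process the chunks in parallel, and report throughput.
    batches = [records[i:i + batch_size] for i in range(0, len(records), batch_size)]
    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=workers) as pool:
        list(pool.map(process_batch, batches))
    return len(records) / (time.perf_counter() - start)  # records per second

if __name__ == "__main__":
    data = list(range(200_000))
    results = []
    for batch_size in (5_000, 20_000, 50_000):   # small -> large batches
        for workers in (2, 4, 8):                # low -> high parallelism
            results.append((batch_size, workers, measure(data, batch_size, workers)))
    # Highest throughput wins, but also watch memory and throttling at the aggressive end.
    for batch_size, workers, rps in sorted(results, key=lambda r: r[2], reverse=True):
        print(f"batch_size={batch_size:>6}  workers={workers}  {rps:,.0f} records/s")
```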
In conclusion, it is essential to find the right balance between batch size and parallelization to achieve optimal system performance.