Optimizations: Since we can be certain that the first estimate of pivot points will not be good enough, and since the initial estimate is simply a linear interpolation from the minimum to the maximum, the first pass can be optimized by realizing that all we are really doing is creating a histogram of the distribution. Therefore, in the first pass we do not store the bucket or bucket-index for the elements, and we do not need to do a binary search to find the bucket to increase (see Figure 7). The steps of the bucketsort then become:
• Creating the histogram
• Refining the pivots
• Counting the elements per bucket
• Repositioning the elements

When creating the offsets for each bucket in the final pass, we also take the opportunity to ensure that each bucket starts at a float4-aligned offset, thereby eliminating the need for an extra pass before merge-sorting the lists.

Choice of divisions: As explained above, we must split the list into at least d = 2p parts, where p is the number of available processors. It is easy to see, though, that since each thread will do an atomic increment on one of only d addresses, that choice would lead to a lot of stalls. However, the mergesort algorithm does not impose any upper limit on the number of sublists. In fact, increasing the number of sublists in bucketsort decreases the amount of work that mergesort needs to do. Meanwhile, splitting the list into too many parts would lead to longer binary searches for each bucketsort thread and more traffic between the CPU and GPU.
In our tests, with 32 processors (on the GeForce 8600), splitting the list into 1024 sublists seems to be the best choice. In the histogram pass, however, where no binary search is required, we use several times more buckets to improve the quality of the pivot points if the distribution is expected to be very uneven and spiky.

Random shuffling of input elements: Since the parallel bucketsort relies on atomic increments of each bucket size, nearly sorted lists can cause many serialized accesses to the same counter, which can limit the parallelism and lower the speed. To avoid a significant slow-down, we can therefore add one pass at the beginning of the algorithm, without compromising generality, which randomly shuffles the elements to be sorted. This is done in O(N) time using CUDA and the atomic exchange instruction.
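
The histogram pass and the aligned bucket offsets described under Optimizations can be illustrated with a minimal sequential sketch. It is not the GPU implementation: the function names `histogram_pass` and `aligned_offsets` are hypothetical, a plain increment stands in for the atomic increment, and float4 alignment is assumed to mean that each bucket starts at an element index that is a multiple of 4 (one float per element).

```python
def histogram_pass(values, num_buckets):
    """Count elements per bucket using linearly interpolated pivots.

    Because the pivots are evenly spaced between the minimum and the
    maximum, the bucket index is computed directly and no binary
    search over the pivots is needed.
    """
    lo, hi = min(values), max(values)
    counts = [0] * num_buckets
    if hi == lo:
        # Degenerate case: every element falls into the first bucket.
        counts[0] = len(values)
        return counts
    scale = num_buckets / (hi - lo)
    for v in values:
        # Clamp so that the maximum value lands in the last bucket.
        b = min(int((v - lo) * scale), num_buckets - 1)
        counts[b] += 1  # stands in for an atomic increment on the GPU
    return counts


def aligned_offsets(counts, align=4):
    """Exclusive prefix sum with each bucket start rounded up to `align`.

    Assumption: elements are single floats, so a float4-aligned start
    corresponds to an element index that is a multiple of 4.
    """
    offsets = []
    pos = 0
    for c in counts:
        pos = (pos + align - 1) // align * align  # round up to alignment
        offsets.append(pos)
        pos += c
    return offsets


counts = histogram_pass([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], 4)
offsets = aligned_offsets(counts)
```

In this sketch the padding between buckets is simply left unused; how the padded slots are filled before merge-sorting is not specified here.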