When I started using GPUs for deep learning, my deep learning skills improved quickly. When you can run experiments with different algorithms and parameters and get rapid feedback, you learn much more quickly. After this success I was tempted to use multiple GPUs in order to train deep learning algorithms even faster.
I also took an interest in very large models which do not fit into a single GPU. I thus wanted to build a small GPU cluster and explore the possibilities of speeding up deep learning with multiple nodes, each holding multiple GPUs. When I did my research on which hardware to buy, I soon realized that the main bottleneck would be the network bandwidth, i.e., GPU-to-GPU communication within a computer is fast, but between computers it is slow. This is the reason why many big companies like Google and Microsoft use CPU rather than GPU clusters to train their big neural networks. Luckily, Mellanox and NVIDIA recently came together to work on that problem, and the result is GPUDirect RDMA, a network card driver that can make sense of GPU memory addresses and thus can transfer data directly from GPU to GPU between computers.
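To see why the network is the bottleneck, a rough back-of-the-envelope comparison helps. This is a sketch in Python; the bandwidth figures are theoretical peaks I am assuming (PCIe 3.0 at roughly 985 MB/s per lane, Ethernet at line rate divided by 8), and real-world throughput is lower:

```python
# Theoretical peak bandwidths of the links involved, in GB/s (assumed figures).
links = {
    "PCIe 3.0 x16 (GPU <-> GPU, same node)": 15.75,
    "40 GigE (node <-> node)": 5.0,
    "1 GigE (node <-> node)": 0.125,
}

# The slowest link in the chain dictates how fast gradients can be exchanged.
slowest = min(links, key=links.get)
for name, bw in links.items():
    print(f"{name:40s} {bw:7.2f} GB/s")
print(f"Bottleneck: {slowest}")
```

Even a 40 Gbit/s link between nodes is only about a third of a single x16 PCIe 3.0 slot, which is why avoiding extra host-memory copies on that path (what GPUDirect RDMA does) matters so much for multi-node training.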
There are basically two options for how to do multi-GPU programming.

Q: Sorry if this was answered elsewhere, but why did you opt for multiple systems over one system with 4 GPUs, or multi-GPU cards like the FASTRA II?

A: A system like the FASTRA II is slower than a 4-GPU system for deep learning. This is mainly because a single CPU supports just 40 PCIe lanes, i.e., only x8 per GPU once you install 3 or 4 GPUs.

Q: How do the two nodes communicate with each other?

A: 40GigE is readily available and in moderately wide use.
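The PCIe-lane arithmetic behind the 40-lane limit can be sketched as follows. This is a simplified model: I am assuming PCIe 3.0 at roughly 985 MB/s per lane, and that slots negotiate the standard link widths of x16/x8/x4 rather than arbitrary lane counts:

```python
# Slots negotiate standard PCIe link widths; pick the widest that fits.
def widest_fit(lanes: int) -> int:
    for width in (16, 8, 4, 2, 1):
        if width <= lanes:
            return width
    return 0

LANES_PER_CPU = 40    # one CPU socket, PCIe 3.0
GB_PER_LANE = 0.985   # theoretical peak per lane, per direction (assumed)

for n_gpus in (2, 3, 4):
    width = widest_fit(LANES_PER_CPU // n_gpus)
    bw = width * GB_PER_LANE
    print(f"{n_gpus} GPUs: x{width} each, about {bw:.1f} GB/s per GPU")
```

With 2 GPUs each card can still run at x16, but at 3 or 4 GPUs the lanes only stretch to x8 per card, halving the per-GPU host bandwidth.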
Cost per port is comparable with InfiniBand. 100GigE is also available and in use at a handful of companies, albeit only from Mellanox and as costly as one might expect.

You may want to read my other blog posts about parallelism. Those posts contain in-depth explanations of the parallelism bottlenecks and show that the largest bottleneck is the network connection: a single 1200×1200 gradient matrix, for example, is about 5 MB.
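As a worked example of that bottleneck, here is a rough estimate of what shipping one ~5 MB gradient matrix between nodes costs at theoretical peak bandwidth (assumed figures; protocol overhead and latency come on top):

```python
size_bytes = 5 * 1024**2    # one ~5 MB gradient matrix, as in the text

# Assumed theoretical peak bandwidths, in GB/s.
for name, gb_per_s in [("1 GigE", 0.125), ("40 GigE", 5.0)]:
    ms = size_bytes / (gb_per_s * 1e9) * 1e3
    print(f"{name}: {ms:6.2f} ms per matrix")
```

Over 1 GigE a single matrix already takes tens of milliseconds, and a deep network synchronizes many such matrices every iteration, so the transfer time quickly dominates the computation time of each GPU.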