We optimized the X10 runtime at the transport layer, specifically the x10.util.Team class. X10 supports several network transports, e.g., PAMI, DCMF, and MPI. The Team class provides collective communication routines for exchanging data among all places, e.g., alltoall, scatter, gather, and barrier; however, as of the X10 2.3.1 release, the MPI transport emulates all of these routines using point-to-point communication. We modified the collective communication routines in the Team class to call the native MPI collective routines directly.
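The following is a minimal, self-contained C++ sketch of the idea, not the actual x10rt source; the function names and the int-only payload are illustrative. The emulated path exchanges data with one non-blocking send/receive pair per peer, while the native path hands the whole exchange to MPI_Alltoall in a single call.

#include <mpi.h>
#include <cstdio>
#include <vector>

// Emulated path: a simplified stand-in for the point-to-point pattern
// used by the unmodified Team routines.
static void alltoall_emulated(const int* send, int* recv, int count, int nplaces) {
    std::vector<MPI_Request> reqs(2 * nplaces);
    for (int p = 0; p < nplaces; ++p) {
        MPI_Irecv(recv + p * count, count, MPI_INT, p, 0,
                  MPI_COMM_WORLD, &reqs[2 * p]);
        MPI_Isend(const_cast<int*>(send) + p * count, count, MPI_INT, p, 0,
                  MPI_COMM_WORLD, &reqs[2 * p + 1]);
    }
    MPI_Waitall(2 * nplaces, reqs.data(), MPI_STATUSES_IGNORE);
}

// Native path: delegate the entire exchange to the MPI collective.
static void alltoall_native(const int* send, int* recv, int count) {
    MPI_Alltoall(const_cast<int*>(send), count, MPI_INT,
                 recv, count, MPI_INT, MPI_COMM_WORLD);
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nplaces;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nplaces);

    const int count = 4;                           // elements sent to each peer
    std::vector<int> send(count * nplaces, rank);  // every rank sends its own id
    std::vector<int> recv(count * nplaces, -1);

    alltoall_emulated(send.data(), recv.data(), count, nplaces);
    alltoall_native(send.data(), recv.data(), count);

    if (rank == 0) std::printf("all-to-all of %d ints per peer completed\n", count);
    MPI_Finalize();
    return 0;
}

The expected benefit of the native path is that the MPI library can apply its tuned collective algorithms and the per-message overhead of the emulation layer is removed.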
We evaluated the performance of the collective communication routines of our X10 Team library against the various existing implementations, with and without MPI_THREAD_MULTIPLE (the MPI thread-support level that enables multithreaded MPI calls; see the sketch after the configuration list), in the environment described below.
OS: SUSE Linux Enterprise Server 11 SP1
Machine: HP ProLiant SL390s G7
CPU: Intel Xeon 2.93 GHz (6 cores) x 2 (Hyperthreading enabled)
Main Memory: 54 GB
Network: QDR InfiniBand x 2 (80Gbps)
MPI: MVAPICH2 1.9a2 (for the Team and Emulation configurations), MVAPICH2 1.6 (for the At configuration)
GCC: gcc (SUSE Linux) 4.3.4 [gcc-4_3-branch revision 152973]
Build options for X10: ant -DX10RT_MPI=true -DGCC_SYMBOLS=true -Doptimize=true
Compile options: x10c++ -cxx-prearg -g -x10rt mpi -O -NO_CHECKS -define -NO_BOUNDS_CHECKS source_files
Other Configuration: 2 places per node
For detailed hardware specifications, please refer to the Tsubame hardware architecture page.
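As a reference for the MPI_THREAD_MULTIPLE setting mentioned above, the sketch below shows how a generic MPI program requests that thread-support level at initialization and checks what the library actually grants; it is not X10 runtime code.

#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    int provided = MPI_THREAD_SINGLE;
    // Request full multithreading support; the library reports what it provides.
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    if (provided < MPI_THREAD_MULTIPLE) {
        std::printf("MPI_THREAD_MULTIPLE not available (provided=%d); "
                    "MPI calls must be funneled through a single thread\n",
                    provided);
    }

    MPI_Finalize();
    return 0;
}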