Open-MX

Myrinet Express over Generic Ethernet Hardware

News: Open-MX maintenance ending. (2015/12)

News: Clustermonkey reports Open-MX use in the Limilus Project for Switchless 10G experiments with Bonded Loop. (2013/04/11)

News: The Research, Computing & Engineering website hosts a podcast interview of Open-MX project leader. (2009/03/15)

News: Linux Magazine talks about Open-MX in an article about the Good old Ethernet. (2009/02/25)

News: Clustermonkey published an article about Open-MX and links to the videos of an Open-MX talk that was recently given at the STFC Daresbury Laboratory in the UK. (2009/01/15)

Get the latest Open-MX news by subscribing to the open-mx mailing list. See also the news archive.

Summary

Open-MX is a high-performance implementation of the Myrinet Express message-passing stack over generic Ethernet networks. It provides both application-level and wire-protocol compatibility with the native MXoE (Myrinet Express over Ethernet) stack.
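Because Open-MX exposes the MX programming interface (through the myriexpress.h compatibility header), an application written against MX can run over plain Ethernet without modification. Below is a minimal sketch of the sender side of such a program, assuming the standard MX API; the peer hostname, endpoint id, filter, and match value are placeholders, and error checking is omitted for brevity.

    #include <stdint.h>
    #include "myriexpress.h"   /* MX API header, provided by Open-MX for compatibility */

    #define FILTER 0x12345              /* placeholder connection filter/key */
    #define MATCH  0x1234deadbeefULL    /* placeholder matching information */

    int main(void)
    {
        mx_endpoint_t ep;
        mx_endpoint_addr_t dest;
        uint64_t nic_id;
        mx_segment_t seg;
        mx_request_t req;
        mx_status_t status;
        uint32_t result;
        char peer[] = "peer-hostname";          /* placeholder peer host */
        char buffer[64] = "hello over Open-MX";

        mx_init();
        /* open endpoint 3 on the first interface */
        mx_open_endpoint(0, 3, FILTER, NULL, 0, &ep);

        /* resolve the peer interface and connect to its endpoint 3 */
        mx_hostname_to_nic_id(peer, &nic_id);
        mx_connect(ep, nic_id, 3, FILTER, MX_INFINITE, &dest);

        /* post a non-blocking send and wait for its completion */
        seg.segment_ptr = buffer;
        seg.segment_length = sizeof(buffer);
        mx_isend(ep, &seg, 1, dest, MATCH, NULL, &req);
        mx_wait(ep, &req, MX_INFINITE, &status, &result);

        mx_close_endpoint(ep);
        mx_finalize();
        return 0;
    }

Since Open-MX also preserves the MX ABI, existing binaries linked against the native MX library can likewise run over Open-MX without recompilation.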

Thanks to this ABI and API compatibility, the following middleware are known to work flawlessly on Open-MX through their native MX backends: Open MPI, Argonne's MPICH2/Nemesis, Myricom's MPICH-MX and MPICH2-MX, PVFS2, Intel MPI (using the new TMI interface), Platform MPI (formerly known as HP-MPI), NewMadeleine, and NetPIPE.

The design of Open-MX is described in several papers.

The FAQ contains answers to many questions about Open-MX usage, configuration, and so on.

Features

Open-MX implements:

The following features will be available in the near future:

Requirements:

To get the latest Open-MX news, or for discussion regarding Open-MX development, you should subscribe to the open-mx mailing list. See also the news archive.

Performance

If you need help to tune your installation for Open-MX, please refer to the Performance tuning section of the FAQ.

10G Ethernet Performance

Multiple raw outputs of the Intel MPI Benchmarks (IMB) and NetPIPE, run with Open MPI or MPICH-MX on top of Open-MX, are available below. For comparison purposes, the performance of the TCP BTL component of Open MPI is also given (using the exact same configuration of the host and its 10G interface). For configuration details, see the headers of the corresponding output files. Direct access to all raw performance numbers is available here.

                                  MPICH-MX/Open-MX      Open MPI/Open-MX      Open MPI/TCP
NetPIPE (link width is 9491Mbps)  7.05µs - 9367Mbps     7.22µs - 9106Mbps     15.07µs - 6462Mbps
IMB (link width is 1186MiB/s)     7.02µs - 1160MiB/s    7.21µs - 1124MiB/s    14.54µs - 825MiB/s

The Open-MX latency depends on the processor frequency. For instance, if you replace the 2.33 GHz "Clovertown" Xeon (E5345) in the above tests with a 3.16 GHz "Harpertown" Xeon (X5460), the latency drops to 6.18µs.

Gigabit Ethernet Performance

IMB performance numbers on Gigabit Ethernet interfaces (Broadcom bnx2) are available for Open MPI/Open-MX and Open MPI/TCP.

Intra-node Communication Performance

Since Open-MX also provides an efficient shared-memory communication model, IMB performance results on top of MPICH-MX are also available for intra-node runs.

Bugs and Questions

The FAQ contains answers to many questions about Open-MX usage, configuration, and so on. Bug reports and questions should be posted as GitLab issues or on the open-mx mailing list. See the end of README.md in the source tree for details.

Credits

Open-MX was developed by the Inria Bordeaux Research Centre (formerly the Runtime team-project) in collaboration with Myricom, Inc. The main contributors are Brice Goglin, Nathalie Furmento, and Ludovic Stordeur.

Open-MX development resources are maintained in the Inria GitLab project.

Papers

  1. Brice Goglin. High-Performance Message Passing over generic Ethernet Hardware with Open-MX. In Elsevier Journal of Parallel Computing (PARCO), 37(2):85-100, February 2011. Available here.
    This paper describes the design of the Open-MX stack and of its copy offload mechanism, and how the MX wire protocol and host configuration may be tuned for better performance. If you are looking for general-purpose Open-MX citations, please use this one.
  2. Brice Goglin. NIC-assisted Cache-Efficient Receive Stack for Message Passing over Ethernet. In Concurrency and Computation: Practice and Experience, Euro-Par 2009 best papers issue. Wiley. 2010. Accepted for publication, to appear. Available here.
    Extended revision of the Euro-Par 2009 paper, discussing cache-affinity-related problems in the Open-MX receive stack and improving performance by enhancing the cache-efficiency from the NIC up to the application.
  3. Brice Goglin and Nathalie Furmento. Finding a Tradeoff between Host Interrupt Load and MPI Latency over Ethernet. In Proceedings of the IEEE International Conference on Cluster Computing, New Orleans, LA, September 2009. IEEE Computer Society Press. Available here.
    Achieving low latency with Open-MX usually requires careful tuning of the NIC interrupt coalescing. This paper discusses how to add basic support for Open-MX-aware coalescing in regular NICs so as to achieve optimal latency without disturbing the message rate or increasing the host load.
  4. Brice Goglin. NIC-assisted Cache-Efficient Receive Stack for Message Passing over Ethernet. In Proceedings of the 15th International Euro-Par Conference, Volume 5704 of Lecture Notes in Computer Science, pages 1065-1077, Delft, The Netherlands, August 2009. Springer Verlag. Available here.
    This paper discusses cache-affinity-related problems in the Open-MX receive stack. It shows that adding Open-MX protocol knowledge to the NIC firmware and combining it with multiqueue capabilities improves performance by enhancing cache efficiency from the NIC up to the application.
  5. Brice Goglin. Decoupling Memory Pinning from the Application with Overlapped on-Demand Pinning and MMU Notifiers. In CAC 2009: Workshop on Communication Architecture for Clusters, held in conjunction with IPDPS 2009, Rome, Italy, May 2009. IEEE Computer Society Press. Available here.
    This paper describes an innovative memory pinning optimization in Open-MX based on pinning pages on-demand, overlapping this process with the communication, and decoupling it from user-space so as to implement a safe pinning cache using the kernel MMU-Notifier framework.
  6. Brice Goglin. High Throughput Intra-Node MPI Communication with Open-MX. In Proceedings of the 17th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP2009), Weimar, Germany, February 2009. IEEE Computer Society Press. Available here.
    The Open-MX intra-communication subsystem achieves very high throughput thanks to overlapped memory pinning and I/OAT copy offload. This paper led to the development of the KNEM module which offers similar performance to the generic MPICH2/Nemesis implementation without needing Open-MX, as described here.
  7. Brice Goglin. Improving Message Passing over Ethernet with I/OAT Copy Offload in Open-MX. In Proceedings of the IEEE International Conference on Cluster Computing, pages 223-231, Tsukuba, Japan, September 2008. IEEE Computer Society Press. Available here.
    Open-MX uses I/OAT copy offload on the receive side to work around the inability of generic Ethernet hardware to perform zero-copy receive, enabling high throughput up to the 10G linerate.
  8. Brice Goglin. Design and Implementation of Open-MX: High-Performance Message Passing over generic Ethernet hardware. In CAC 2008: Workshop on Communication Architecture for Clusters, held in conjunction with IPDPS 2008, Miami, FL, April 2008. IEEE Computer Society Press. Available here.
    This paper describes the initial design and performance of the Open-MX stack.

Last updated on 2022/03/03.
