Myrinet Express over Generic Ethernet Hardware
Summary
Open-MX is a high-performance implementation of the Myrinet Express message-passing
stack over generic Ethernet networks.
It provides application-level and wire-protocol compatibility with the native MXoE
(Myrinet Express over Ethernet) stack.
Thanks to this API and ABI compatibility, the following middleware are known to work flawlessly
on Open-MX using their native MX backend (a minimal usage sketch follows the list):
Open MPI,
Argonne's MPICH2/Nemesis,
Myricom's MPICH-MX
and MPICH2-MX,
PVFS2,
Intel MPI (using the new TMI interface),
Platform MPI (formerly known as HP-MPI),
NewMadeleine,
and NetPIPE.
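Because the API is the one defined by MX, existing MX code needs no changes to run over Open-MX. The sketch below is a minimal, untested illustration of the send side of that interface (mx_init, mx_open_endpoint, mx_connect, mx_isend, mx_wait, as defined by the MX 1.2 API); the endpoint key, target hostname and match value are placeholders, a matching receiver is assumed on the peer, and error checking is omitted.
```c
/* Hedged sketch of an MX-API sender running unmodified over Open-MX.
 * Function names follow the public MX 1.2 interface that Open-MX
 * re-implements; constants and peer names below are placeholders. */
#include <stdint.h>
#include <myriexpress.h>   /* header name as in the MX distribution;
                              Open-MX installs a compatible one */

#define FILTER 0x12345     /* hypothetical endpoint key shared by both peers */

int main(void)
{
    mx_endpoint_t ep;
    mx_endpoint_addr_t dest;
    mx_request_t req;
    mx_status_t status;
    uint32_t result;
    uint64_t nic_id;
    char buffer[] = "hello over Open-MX";
    mx_segment_t seg = { buffer, sizeof(buffer) };

    mx_init();                                          /* initialize the library */
    mx_open_endpoint(MX_ANY_NIC, MX_ANY_ENDPOINT, FILTER,
                     NULL, 0, &ep);                     /* open a local endpoint */

    /* Resolve the peer by hostname (Open-MX discovers peers dynamically)
     * and connect to its endpoint 0; "node02" is a placeholder name. */
    mx_hostname_to_nic_id("node02", &nic_id);
    mx_connect(ep, nic_id, 0, FILTER, MX_INFINITE, &dest);

    /* Post a single-segment send with match info 0x42 and wait for completion. */
    mx_isend(ep, &seg, 1, dest, 0x42, NULL, &req);
    mx_wait(ep, &req, MX_INFINITE, &status, &result);

    mx_close_endpoint(ep);
    mx_finalize();
    return 0;
}
```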
The design of Open-MX is described in several
papers.
Features
Open-MX implements:
- Support for Linux kernels from 2.6.15 up to the latest 4.x releases.
- Support for heterogeneous peers with different kernels, endianness, 32/64-bit environments, ...
- Support for all Ethernet hardware that Linux supports
- Can coexist with regular Ethernet traffic (IPv4, IPv6, ICMP, ...) on the same Ethernet interfaces without any disturbance
- Support for heterogeneous Ethernet hardware in the same fabric
- Compatibility with Myrinet Express 1.2.x
- Wire compatibility with MX-over-Ethernet
- API compatibility
- ABI compatibility (starting from MX 0.9)
- Contiguous and vectorial communications (without any additional copy; see the sketch after this list)
- Up to 4 GB messages (32-bit message length)
- Shared communications between all endpoints of all local interfaces (syscall-based with a single copy for all message sizes)
- Self communications (software loopback in the library)
- I/OAT memory copy offload
- Blocking functions with overlapped progression
- Thread safety and concurrency
- Dynamic discovery and hostname resolution of connected peers
- Dynamic process addition/removal/reconnection
- Retransmission and network fault tolerance
- Highly configurable model for improved performance in homogeneous/simple environments
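As an illustration of the vectorial communications and blocking progression listed above, here is a hedged, untested sketch of a two-segment receive using the MX interface that Open-MX implements; the match value, mask and buffer layout are arbitrary placeholders, and the endpoint is assumed to have been opened as in the previous sketch.
```c
/* Hedged sketch of a "vectorial" (multi-segment) receive with the MX API
 * over Open-MX.  Assumes an already-opened endpoint `ep`; the match info
 * and mask below are placeholders. */
#include <stdint.h>
#include <myriexpress.h>

void post_vector_recv(mx_endpoint_t ep, void *header, uint32_t header_len,
                      void *payload, uint32_t payload_len)
{
    mx_segment_t segs[2];
    mx_request_t req;
    mx_status_t status;
    uint32_t result;

    segs[0].segment_ptr = header;       /* first segment: fixed-size header */
    segs[0].segment_length = header_len;
    segs[1].segment_ptr = payload;      /* second segment: bulk payload */
    segs[1].segment_length = payload_len;

    /* Match any sender whose low 32 match bits equal 0x42. */
    mx_irecv(ep, segs, 2, 0x42, 0xffffffffULL, NULL, &req);

    /* Blocking wait: the library keeps progressing other requests
     * while this one completes (overlapped progression). */
    mx_wait(ep, &req, MX_INFINITE, &status, &result);
}
```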
Requirements:
- Linux kernel, with the corresponding headers and kbuild system installed
- Ethernet network (no routers)
- Jumbo frame support (MTU 9000) on all interfaces and switches is recommended for best performance
To get the latest Open-MX news,
or for discussion regarding Open-MX development,
you should subscribe to the
open-mx mailing list.
See also the news archive.
Bugs and Questions
The FAQ contains answers to many questions about Open-MX usage, configuration, and so on.
Bug reports and questions should be posted as
Gitlab Issues
or on the
open-mx mailing list.
See the end of README.md in the source tree for details.
Papers
- Brice Goglin.
High-Performance Message Passing over generic Ethernet Hardware with Open-MX
In Elsevier Journal of Parallel Computing (PARCO), 37(2):85-100, February 2011.
Available here.
This paper describes the design of the Open-MX stack and of its copy offload mechanism,
and how the MX wire protocol and host configuration may be tuned for better performance.
If you are looking for general-purpose Open-MX citations, please use this one.
- Brice Goglin.
NIC-assisted Cache-Efficient Receive Stack for Message Passing over Ethernet.
In Concurrency and Computation: Practice and Experience, Euro-Par 2009 best papers issue.
Wiley, 2010.
Available here.
Extended revision of the Euro-Par 2009 paper, discussing cache-affinity-related problems
in the Open-MX receive stack and improving performance by enhancing the cache-efficiency
from the NIC up to the application.
- Brice Goglin and Nathalie Furmento.
Finding a Tradeoff between Host Interrupt Load and MPI Latency over Ethernet.
In Proceedings of the IEEE International Conference on Cluster Computing,
New Orleans, LA, September 2009.
IEEE Computer Society Press.
Available here.
Achieving low latency with Open-MX usually requires careful tuning of the NIC
interrupt coalescing. This paper discusses how to add basic support for Open-MX-aware
coalescing in regular NICs so as to achieve optimal latency without degrading the
message rate or increasing the host load.
- Brice Goglin.
NIC-assisted Cache-Efficient Receive Stack for Message Passing over Ethernet.
In Proceedings of the 15th International Euro-Par Conference, Volume 5704 of Lecture Notes in Computer Science, pages 1065-1077,
Delft, The Netherlands, August 2009.
Springer Verlag.
Available here.
This paper discusses cache-affinity-related problems in the Open-MX receive stack.
It shows that adding Open-MX protocol knowledge in the NIC firmware and combining
it with multiqueue capabilities improves performance by enhancing the cache-efficiency
from the NIC up to the application.
- Brice Goglin.
Decoupling Memory Pinning from the Application with Overlapped on-Demand Pinning and MMU Notifiers.
In CAC 2009: Workshop on Communication Architecture for Clusters, held in conjunction with IPDPS 2009,
Rome, Italy, May 2009.
IEEE Computer Society Press.
Available here.
This paper describes an innovative memory pinning optimization in Open-MX based on pinning pages on-demand,
overlapping this process with the communication, and decoupling it from user-space so as to implement a
safe pinning cache using the kernel MMU-Notifier framework.
- Brice Goglin.
High Throughput Intra-Node MPI Communication with Open-MX.
In Proceedings of the 17th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP2009),
Weimar, Germany, February 2009.
IEEE Computer Society Press.
Available here.
The Open-MX intra-communication subsystem achieves very high throughput
thanks to overlapped memory pinning and I/OAT copy offload.
This paper led to the development of the KNEM module which offers similar
performance to the generic MPICH2/Nemesis implementation without needing
Open-MX, as described
here.
- Brice Goglin.
Improving Message Passing over Ethernet with I/OAT Copy Offload in Open-MX.
In Proceedings of the IEEE International Conference on Cluster Computing, pages 223-231,
Tsukuba, Japan, September 2008.
IEEE Computer Society Press.
Available here.
Open-MX uses I/OAT copy offload on the receive side to work around the inability
of generic Ethernet hardware to perform zero-copy receive, enabling high throughput
up to the 10G linerate.
- Brice Goglin.
Design and Implementation of Open-MX: High-Performance Message Passing over generic Ethernet hardware.
In CAC 2008: Workshop on Communication Architecture for Clusters, held in conjunction with IPDPS 2008,
Miami, FL, April 2008.
IEEE Computer Society Press.
Available here.
This paper describes the initial design and performance of the Open-MX stack.
Last updated on 2022/03/03.