Open-MX - Frequently Asked Questions

If you do not find your answer here, feel free to contact the open-mx mailing list.

Basics

What is Open-MX?

Open-MX is a software implementation of Myricom's Myrinet Express protocol. It aims at providing high-performance message passing over any generic Ethernet hardware.

Open-MX implements the capabilities of the MX firmware (running in Myri-10G NICs) as a driver in the Linux kernel. A user-space library exposes the MX interface to legacy applications.

How does Open-MX compare to MX?

Open-MX implements the MX programming interface with API and ABI compatibility, and it is also wire-compatible with MX-over-Ethernet. See the Native MX Compatibility section for details.

There are some tiny differences between MX and Open-MX implementations:

Open-MX does not provide a progression thread yet, which means no progression occurs in the background unless an Open-MX function is invoked.
Open-MX does not support limiting the endpoint unexpected queue with PARAM_UNEXP_QUEUE_MAX in open_endpoint.
Some getinfo keys are meaningless in Open-MX, so they will return dummy values such as "N/A (Open-MX)".
Open-MX does not support the deprecated register_unexp_callback() function. Only the modern register_unexp_handler() is supported.
Open-MX is able to perform wait_any(), test_any(), probe or iprobe on arbitrary matching masks even when the matching space has been divided with the endpoint Context Ids parameter.

There are also some tiny differences between the native MX and Open-MX programming interfaces. These differences are hidden by the Open-MX API/ABI compatibility layer, but if you plan to use the Open-MX specific API directly, you might want to know that:

Open-MX send/ssend/recv routines are not vectorial; separate vector-specific routines are provided instead (sendv/ssendv/recvv).
There is no distinction between the routine return type (mx_return_t) and the request status code type (mx_status_code_t); both are the same type in Open-MX (mx_return_t).
The MX status address field source is renamed to addr in Open-MX.
set_error_handler() can be used to set up the global handler or any specific endpoint handler.
Open-MX provides the cancel_notest() routine to cancel a request without freeing it, so that it can be completed with the CANCELLED status later.

Which operating system does Open-MX support?

Open-MX supports Linux on any architecture.

The Open-MX driver works at least on Linux kernels >=2.6.15. Kernels older than 2.6.15 are unlikely to be ever supported due to various important functions being unavailable (especially vm_insert_page).
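
The kernel requirement can be checked before building by comparing the release string against 2.6.15. The sketch below is only an illustration: omx_kernel_supported is a hypothetical helper name (it is not part of Open-MX), and it only looks at the first three numeric components of the release string.

```shell
# Hypothetical helper (not shipped with Open-MX): succeed if the given kernel
# release, or the running one when no argument is given, is at least 2.6.15.
omx_kernel_supported() {
    printf '%s\n' "${1:-$(uname -r)}" |
        awk -F'[.-]' '{
            # Fold major.minor.patch into one comparable integer.
            v = ($1 * 1000 + $2) * 1000 + $3
            exit !(v >= 2006015)
        }'
}

# Example:
#   omx_kernel_supported || echo "kernel older than 2.6.15" >&2
```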

The Open-MX driver is regularly updated for newer kernels, making it likely to work on the latest stable kernel even before it is actually released.

Which hardware and fabric does Open-MX support?

Open-MX works on all Ethernet hardware that the Linux kernel supports. The only requirements are that the MTU is large enough (see Which MTU should my network support for Open-MX?) and that all connected peers are on the same LAN, which means there is no router between them (switches are OK).

Which MTU should my network support for Open-MX?

The Open-MX MTU requirements may be obtained by reading the driver status:

  $ cat /dev/open-mx
  Open-MX 1.0.90
   Driver ABI=0x208
   Configured for 32 endpoints on 32 interfaces with 1024 peers
   WireSpecs: NoWireCompat EtherType=0x86df MTU>=0x9000

The minimal MTU actually depends on the configuration. Open-MX was designed to be compatible with MX wire-specifications. If this compatibility is enabled (by passing --enable-mx-wire to the configure script), 4 kB frames (plus at most 64 bytes of headers) have to be accepted by the network.

If Open-MX is configured in non-MX-wire-compatible mode (default), the minimal required MTU is 9000. But another value may be enforced by configuring Open-MX with --with-mtu=1500 or another non-default value. Packet sizes will be updated accordingly, as shown in driver status. See also How should I tune Open-MX MTU and packet sizes?.

  $ cat /dev/open-mx
  Open-MX 1.0.90
   WireSpecs: NoWireCompat EtherType=0x86df MTU>=0x9000
   MediumMessages: 8192B per fragment
   LargeMessages: 4 requests in parallel, 32 x 8968B pull replies per request
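
To compare the advertised requirement with an interface's current MTU, the WireSpecs line can be extracted from the driver status. This is a minimal sketch: omx_wirespecs_mtu is a hypothetical helper name, and it prints the value exactly as the driver formatted it without interpreting it.

```shell
# Hypothetical helper: print the minimal MTU found on the WireSpecs line of the
# driver status read on stdin, exactly as the driver wrote it.
omx_wirespecs_mtu() {
    sed -n 's/.*WireSpecs:.*MTU>=\([0-9A-Za-z]*\).*/\1/p'
}

# Typical use on a node where the driver is loaded:
#   omx_wirespecs_mtu < /dev/open-mx
#   cat /sys/class/net/eth2/mtu     # current MTU of eth2, from Linux sysfs
```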

Is Open-MX compatible with IP traffic?

Yes. Open-MX talks to the Ethernet layer as IP does, but it does not use the same Ethernet packet type. This means that IP and Open-MX can coexist on the same network and drivers, because the operating system passes each incoming packet to the corresponding receive stack.

Does Open-MX support mixed endianness on the same fabric?

Yes. By default, Open-MX will encode its packet headers in network order, unless --disable-endian has been given to the configure script. Open-MX can thus make big-endian architectures talk to little-endian ones, or 32-bit ones talk to 64-bit ones.

However, it is up to the application to make sure that its data is passed through the network in an endian-independent way.

What if I find a bug?

Bugs should be reported as GitLab issues or sent to the open-mx mailing list. Questions may be asked there too.

A lot of information may be useful when diagnosing a bug. See the details in 'Reporting Bugs' at the end of the GitLab project page or at the end of the README file in the source tree.

Running

In short, how do I test Open-MX?

Assuming you want to connect 2 nodes using their 'eth2' interface:

Build and install Open-MX in /opt/open-mx (see Building and Installing for details).
```
$ ./configure
$ make
$ make install
```
Make sure both interfaces are up with a large MTU.
```
$ ifconfig eth2 up mtu 9000
```
Load the open-mx kernel module and tell it which interface to use (see Kernel Driver and Managing Interfaces for details).
```
$ /path/to/open-mx/sbin/omx_init start ifnames=eth2
```

Wait a couple seconds and run omx_info to check that all peers are seeing each other. See Peer Discovery for details.

$ /path/to/open-mx/bin/omx_info
[...]
Peer table is ready, mapper is 01:02:03:04:05:06
================================================
  0) 01:02:03:04:05:06 node1:0
  1) a0:b0:c0:d0:e0:f0 node2:0

Use omx_perf to test actual communications (see How do I measure performance with omx_perf?).
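
Since peer discovery takes a moment, a script may want to wait until the expected number of peers shows up before starting omx_perf. The sketch below counts peer table entries in the omx_info output shown above; omx_count_peers is a hypothetical helper name, and it assumes entries keep the "index) MAC hostname" layout.

```shell
# Hypothetical helper: count the peer table entries in omx_info output read on
# stdin. Entries look like "  0) 01:02:03:04:05:06 node1:0".
omx_count_peers() {
    grep -c '^ *[0-9][0-9]*) '
}

# Example: wait until both nodes see each other.
#   until [ "$(/path/to/open-mx/bin/omx_info | omx_count_peers)" -ge 2 ]; do
#       sleep 1
#   done
```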

How may I test Open-MX on a single node?

If Open-MX is installed on a node and you want to check that everything looks good without running intensive benchmarks on the network, you may run some local tests.

Load the open-mx kernel module and tell it to use the loopback interface (see Kernel Driver and Managing Interfaces for details).

$ /path/to/open-mx/sbin/omx_init start ifnames=lo

You should now see localhost appear in the peer table.

$ /path/to/open-mx/bin/omx_info
[...]
Peer table is ready, mapper is 00:00:00:00:00:00
================================================
  0) 00:00:00:00:00:00 localhost

Note that localhost is a special hostname that the driver gives to the loopback interface instead of the usual hostname:0 for the first attached interface.

You may then run the local testing suite (which requires that the first attached board is the above localhost peer).

  $ /path/to/open-mx/bin/omx_check

Moreover, you can control the verbosity of the test suite with the OMX_TEST_VERBOSE environment variable. Valid values are 0 (default value), 1 and 2.
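
As an illustration, the test suite can be wrapped so that only the documented verbosity values are accepted. This is a sketch under assumptions: omx_check_verbose is a hypothetical wrapper, and OMX_CHECK is an illustrative override for the omx_check path, not a variable that Open-MX reads.

```shell
# Hypothetical wrapper: run the local test suite with a validated verbosity.
# OMX_CHECK is an illustrative override, not an Open-MX variable.
omx_check_verbose() {
    case "$1" in
        0|1|2) ;;
        *) echo "verbosity must be 0, 1 or 2" >&2; return 1 ;;
    esac
    # OMX_TEST_VERBOSE is only set for the duration of the test run.
    OMX_TEST_VERBOSE="$1" "${OMX_CHECK:-/path/to/open-mx/bin/omx_check}"
}

# Example:
#   omx_check_verbose 2
```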

How does Open-MX work?

Open-MX implements the Myrinet Express (MX) protocol and application interface on top of regular Ethernet hardware. A user-space library manages MPI-like requests and passes them to the Open-MX driver, which maps them directly onto the software Ethernet layer of the Linux kernel. Packets are sent and received through the underlying (unmodified) driver in an MX-like way.

Is Open-MX thread-safe?

The Open-MX driver is always thread-safe. The user-space library is thread-safe by default.

When thread-safety is enabled but no threads are actually used, locking is optimized by using weak fake symbols. However, as soon as libpthread is loaded by the application or MPI implementation, real pthread locks are used to ensure thread-safety. Therefore achieving optimal performance with non-threaded applications and MPI implementations requires that libpthread is not loaded uselessly.

You may pass --disable-threads to the configure script to disable thread safety entirely if needed. Disabling thread safety is only useful for reducing the latency a little bit when all applications either ensure thread safety above Open-MX or never use any thread.

Does Open-MX support communication to the same host or endpoint?

Yes. Open-MX may use a software loopback to send messages from one endpoint to itself (self communications) or to another endpoint of any interface of the same host (shared communications). This loopback is faster than going on the network up to a switch and then coming back. And it is guaranteed to work (while some switches do not send packets back to their sender).

If using a single node, it is possible to only attach the loopback interface (lo) to Open-MX and let the stack switch to optimized self or shared-memory communication.

What happens on error?

If an Open-MX function fails for any reason (resource shortage, invalid parameters given by the application, ...), or if a request completes with an erroneous status code (remote endpoint closed or non-responding, ...), Open-MX will by default abort and display an error message. See How to debug an abort message? to find out where the problem comes from.

This behavior is caused by the default error handler, which may be changed by applications through the omx_set_error_handler function. It is also possible to change it at runtime by setting OMX_FATAL_ERRORS=0 in the environment. All error codes will then be returned to the application instead of aborting from within the Open-MX library.
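
For example, the environment variable can be set for a single run without touching the application. In this sketch, omx_run_nonfatal is a hypothetical helper name and ./my_application is a placeholder.

```shell
# Run a command with Open-MX fatal errors disabled, so that error codes are
# returned to the application instead of aborting inside the library.
# The function name is illustrative only.
omx_run_nonfatal() {
    OMX_FATAL_ERRORS=0 "$@"
}

# Example:
#   omx_run_nonfatal ./my_application
```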

The Open-MX library will also abort under some circumstances, even if fatal errors have been disabled by the user. Apart from internal assertions detecting an implementation bug, the main reason for aborting is when the driver closes an endpoint by force. Fortunately, it only occurs in rare circumstances such as Ethernet hardware failure or the administrator closing an interface.

If you think you found a bug, see What if I find a bug?.

Building and Installing

How do I build and install Open-MX?

$ ./configure
$ make
$ make install

The make and make install steps may each be parallelized with -j.

Also, if building from Git, you will need to generate some files such as the configure script and some .in files first. automake, autoconf, autoheader and libtool 2 are required to do so. The autogen.sh script takes care of running them accordingly:

$ ./autogen.sh
$ ./configure --prefix=...
$ make
$ make install

To display the full build command lines instead of the default short messages, pass V=1 to make.

By default, Open-MX will be installed in /opt/open-mx. Use --prefix on the configure line to change this.

Open-MX provides the omx_init initialization script, which takes care of loading/unloading the driver and managing the peer table.

$ sbin/omx_init start

To choose which interfaces have to be attached, some module parameters may be given on the command line:

$ sbin/omx_init start ifnames=eth1

Can I install Open-MX as RPM packages?

Recent Open-MX tarballs (since 1.2.1) contain an RPM spec file that eases the building of Open-MX as RPM binary packages. To use it, download the tarball and run for instance:

    rpmbuild -tb open-mx-1.2.1.tar.gz

It will produce a single RPM package such as open-mx-1.2.1-0.x86_64.rpm which contains user-space tools and libraries and the kernel module. After installing the RPM package, Open-MX binaries will be installed under /opt/open-mx-<version>/.

The RPM package will also install the init script /etc/init.d/open-mx, the config file /etc/open-mx/open-mx.conf and udev rules in /etc/udev/rules.d/10-open-mx.rules. The administrator may then want to enable automatic launch of the Open-MX init script during the boot.

Note that, if the kernel ever gets upgraded, you might have to rebuild the RPM packages so as to update the kernel module.

May I build a 32bit library? or a 64bit? or both?

By default, Open-MX builds user-space libraries and tools with the default compiler options. You may for instance enforce a 32bit build by passing CC="gcc -m32" on the configure command-line.

It is also possible to build both 32bit and 64bit libraries by passing --enable-multilib to the configure script. The resulting libraries will be installed in lib32/ and lib64/ directories respectively. Note that Open-MX internal tools and test programs will be linked with the default library (the one using the native architecture pointer size).

When linking an MPI layer or other applications over Open-MX, the build will usually look for the lib directory within the Open-MX install tree. To make sure that 64bit libraries are used, you may want to tell the MPI configure script to look in lib64 instead of lib, for instance by using --with-mx=/path/to/open-mx/install --with-mx-libdir=/path/to/open-mx/install/lib64 and by pointing LD_LIBRARY_PATH accordingly. The administrator may also want to add a lib symlink pointing to the preferred library for this environment. Open-MX does not create this symlink automatically since it cannot guess the administrator's preference and also because adding such a symlink in a standard installation path might be a bad idea.
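
A build script can pick the library directory instead of hard-coding it. The following is a minimal sketch: omx_libdir is a hypothetical helper name, and the preference order (lib64, then lib, then lib32) is an assumption that suits a 64bit environment.

```shell
# Hypothetical helper: print the library directory to hand to an MPI configure
# script, preferring lib64, then a lib directory or symlink, then lib32.
omx_libdir() {
    for d in lib64 lib lib32; do
        if [ -d "$1/$d" ]; then
            printf '%s\n' "$1/$d"
            return 0
        fi
    done
    echo "no library directory found under $1" >&2
    return 1
}

# Example with the placeholder path used above:
#   libdir=$(omx_libdir /path/to/open-mx/install) &&
#       ./configure --with-mx=/path/to/open-mx/install --with-mx-libdir="$libdir"
```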

Where should I install Open-MX?

By default, Open-MX will install in /opt/open-mx. It is possible to change this path by passing --prefix=</new/path> to the configure script.

All Open-MX install files should be available to all nodes since the driver and some tools are required on startup. It is thus recommended that you use an NFS-shared directory as the above prefix.

How to setup Open-MX to auto-start at boot?

To simplify Open-MX startup, you might want to install the omx_init script within the startup scripts on each node:

$ sbin/omx_local_install

Open-MX may then be started with:

$ /etc/init.d/open-mx start

You might want to configure your system to auto-load this script at startup.

See Managing interfaces to configure which interfaces have to be attached on startup.

How to configure udev for Open-MX?

Open-MX uses some special device files in /dev for talking to the kernel module (see Which device files does Open-MX use?). On modern installations, udev will take care of creating these device files automatically when the kernel module is loaded.

When installing the Open-MX startup script with omx_local_install (see How to setup Open-MX to auto-start at boot?), the udev installation is checked and an Open-MX-specific udev rule file is installed. The administrator may either tune this file or the Open-MX configure line to change device file names or access permissions (see Which device files does Open-MX use?).

The udev rules file is usually /etc/udev/rules.d/10-open-mx.rules.

How to uninstall Open-MX?

Uninstalling Open-MX files is mainly a matter of removing the entire directory pointed by the prefix of the installation (given with --prefix on the configure line, see Where should I install Open-MX?).

A safer way is to run:

$ make uninstall

in the root of the build tree.

If omx_local_install was used, some system files have been installed outside of the installation prefix directory (see How to setup Open-MX to auto-start at boot?). To erase these files (except those that were modified), you may run:

$ sbin/omx_local_install --uninstall

How to install over NFS?

Most NFS configurations do not allow root on the client to operate as root on the server's files. When running make install as root, you might experience problems because some Makefiles (especially the kernel driver's one) might modify some files before actually installing anything.

To avoid this problem, make install does not try to build what may be missing. You should thus build as a normal user and then run make install as root. This way, it really only installs things without ever trying to modify the build tree as root over NFS.

I changed my Open-MX configuration, should I relink my application?

If the application or the MPI implementation is dynamically linked against Open-MX, there is nothing to do since the Open-MX library ABI (binary interface) is stable and will not change when reconfiguring/recompiling.

However, there is also an internal binary interface between the library and the kernel driver. If you reconfigure Open-MX in a different way, and load the new kernel module, the corresponding new library should be used as well. In case of dynamic linking, it should be transparent assuming the new library replaced the old file. In case of static linking, the above application or MPI implementation should be relinked against the new Open-MX static library.

Kernel Driver

How do I change the target kernel for the driver?

During configure, Open-MX checks the running kernel with 'uname -r' and builds the open-mx module against it, using its headers and build tree in /lib/modules/`uname -r`/{source,build}.

To build for another kernel, you must first pass its release name so that Open-MX knows where to install the kernel module.

$ ./configure --with-linux-release=2.6.x-y

The --with-linux-release option also lets Open-MX find the corresponding kernel header directory from /lib/modules//. It is possible to override this directory in case /lib/modules// does not contain the right symlinks:

$ ./configure --with-linux-release=2.6.x-y --with-linux=/path/to/kernel/headers/

Additionally, if your distribution installs kernel headers and build tools into different directories, you may also need to override the build directory:

$ ./configure --with-linux-release=2.6.16.60-0.34 \
              --with-linux=/usr/src/linux-2.6.16.60-0.34 \
              --with-linux-build=/usr/src/linux-2.6.16.60-0.34-obj/x86_64/smp/

Can DKMS update the Open-MX kernel module automatically?

Since Open-MX 1.3.2, DKMS (Dynamic Kernel Module Support) may be used to automatically rebuild the Open-MX kernel module when a new kernel is installed.

Assuming you want to use Open-MX 1.3.2, you should first unpack the open-mx-1.3.2.tar.gz tarball in /usr/src (required by DKMS). Then run configure inside this source tree, and build and install it as usual. Now, you should tell DKMS to use the source tree for building updated kernel modules:

# dkms add -m open-mx -v 1.3.2

When installing a new kernel, DKMS will reconfigure the source tree with the same options and specify the new kernel as the target. The resulting open-mx module will be installed under /lib/modules/ and may thus be loaded with modprobe. Finally, the omx_init startup script will try to load a DKMS-built kernel module if it cannot find one in the regular Open-MX installation directory.

If you ever need to manually invoke DKMS to rebuild the module against the current kernel (or against a specific kernel by adding the right -k option), do:

# dkms build -m open-mx -v 1.3.2
# dkms install -m open-mx -v 1.3.2

Beware that only one instance of the open-mx module may be available in /lib/modules/ for each kernel. If you ever use multiple Open-MX installations simultaneously, DKMS will not be able to manage all of their modules at the same time.

To uninstall everything DKMS generated from this Open-MX tarball:

# dkms remove -m open-mx -v 1.3.2 --all
# rm -rf /usr/src/open-mx-1.3.2

See this article for more details about DKMS.

How do I change the compiler for the kernel driver?

The kernel module should preferably be compiled with the same compiler as the kernel itself. To change the compiler for the kernel module, pass KCC=<othercompiler> on the configure command line. It is also possible to specify additional command-line options for this kernel compiler by passing them in KCFLAGS=<options...> at configure time. See also How do I change the target architecture of the kernel driver?.

How do I change the target architecture of the kernel driver?

By default the kernel module is built for the currently running architecture. If the target kernel was built for another architecture, the kernel module build should be modified accordingly. For instance, to build for i386 machines, pass KARCH=i386 on the configure command-line. This variable will be passed as ARCH to the kernel build command line. See also How do I change the compiler for the kernel driver?.

Which device files does Open-MX use?

Once the module is loaded, udev creates a /dev/open-mx file which is used by user-space libraries and programs. Additionally, the Open-MX init script will create the device node in case udev was not running (see also How to configure udev for Open-MX?). The --with-device configure option may be used to change the name of this device file, its group or mode. Write access to this file is required when using Open-MX.

There is actually also another /dev/open-mx-raw device file that may be used by the peer discovery process to send/recv raw packets. It may be configured similarly with --with-raw-device.

Managing Interfaces

Which interfaces are attached at startup?

By default, loading the Open-MX driver attaches every network interface in the system, except those that are not Ethernet, are not up, or have a small MTU. At most 32 interfaces are attached by default.

To change the order or select which interfaces to attach, you may use the ifnames module parameter when loading:

$ /path/to/open-mx/sbin/omx_init start ifnames=eth2,eth3
$ insmod lib/modules/.../open-mx.ko ifnames=eth3,eth2

Once Open-MX has been installed with omx_local_install, the /etc/open-mx/open-mx.conf file may be modified to configure which interfaces should be attached at startup (see also What are Open-MX startup-time configuration options?).

How do I see or modify the list of attached interfaces?

The current list of attached interfaces may be observed by reading the /sys/module/open_mx/parameters/ifnames special file. Writing 'foo' or '+foo' in the file will attach interface 'foo'. Writing '-bar' will detach interface 'bar', except if some endpoints are still using it. To force the removal of an interface even if some endpoints are still using it, '--bar' should be written in the special file. Multiple commands may be sent at once by separating them with commas.

Finally, it has to be noted that the dynamic peer discovery cannot discover newly attached or detached local interfaces. As soon as the list of local interfaces changes, the local discovery process should be restarted (see Peer Discovery):

$ omx_init restart-discovery
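
The attach and detach commands described above amount to writing short strings into the special file. The helpers below are a hedged sketch: the function names are hypothetical, and OMX_IFNAMES is an illustrative override (not an Open-MX variable) so that the helpers can be tried against a scratch file.

```shell
# Special file described above; OMX_IFNAMES is an illustrative override only.
OMX_IFNAMES=${OMX_IFNAMES:-/sys/module/open_mx/parameters/ifnames}

# attach interface $1
omx_attach()       { printf '%s\n' "+$1"  > "$OMX_IFNAMES"; }
# detach interface $1 unless endpoints still use it
omx_detach()       { printf '%s\n' "-$1"  > "$OMX_IFNAMES"; }
# detach interface $1 even if endpoints still use it
omx_force_detach() { printf '%s\n' "--$1" > "$OMX_IFNAMES"; }
# show the current list of attached interfaces
omx_attached()     { cat "$OMX_IFNAMES"; }

# Example (as root), followed by the discovery restart mentioned above:
#   omx_attach eth3 && omx_detach eth2 && omx_init restart-discovery
```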

What are the requirements for an interface to work?

These interfaces must be 'up' in order to work.

$ ifconfig eth2 up

However, having an IP address is not required.

Also, the MTU should be large enough for Open-MX packets to transit. 9000 will always be enough. Look in dmesg for the actual minimal MTU size, which may depend on the configuration. A relevant warning will be displayed in dmesg if needed.

$ ifconfig eth2 mtu 9000

If one of the above requirements is not met, a warning should be printed in user-space when opening an endpoint.
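
Both requirements can also be checked from a script before starting Open-MX, using the flags and mtu files that Linux exposes under /sys/class/net. This is a sketch, not an Open-MX tool: omx_iface_ok is a hypothetical name, and the 9000 default is simply the value that the text above says is always enough.

```shell
# Hypothetical pre-flight check: is the interface up, and is its MTU large enough?
# usage: omx_iface_ok <iface> [min_mtu] [sysfs_net_dir]
omx_iface_ok() {
    iface=$1
    min_mtu=${2:-9000}
    sysnet=${3:-/sys/class/net}
    flags=$(cat "$sysnet/$iface/flags" 2>/dev/null) ||
        { echo "$iface: no such interface" >&2; return 2; }
    mtu=$(cat "$sysnet/$iface/mtu") || return 2
    # Bit 0 of the interface flags is IFF_UP.
    if [ $(( $flags & 1 )) -eq 0 ]; then
        echo "$iface is not up" >&2
        return 1
    fi
    if [ "$mtu" -lt "$min_mtu" ]; then
        echo "$iface MTU is $mtu, need at least $min_mtu" >&2
        return 1
    fi
}

# Example:
#   omx_iface_ok eth2 9000 || ifconfig eth2 up mtu 9000
```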

How do I see the interfaces status?

The list of currently open endpoints may be seen with:

$ omx_endpoint_info

The interfaces may also be observed with the omx_info user-space tool.

Do I need to attach an interface if using Open-MX for local communications only?

Yes. Open-MX requires all communication endpoints to be attached to an interface, even if it is not used by actual network traffic underneath. It is fortunately possible to attach the loopback interface (lo) and either use it as a regular interface talking to itself, or bypass it and use the optimized shared communication. The loopback interface is always named localhost by default (see What are the interface and peer names?).

What are the interface and peer names?

Each interface attached to an Open-MX driver in the fabric is identified by a MAC address, an internal peer index, and a convenient Open-MX hostname.

The hostname is set by the local driver when attaching the interface. By default, it is the machine hostname followed by a colon and the index of the interfaces in the list of attached Open-MX interfaces. For instance, when attaching two interfaces to machine node34, they will be named node34:0 and node34:1. If attaching the loopback interface (lo), the driver will automatically name it localhost instead.

Interface names are propagated to other machines in the fabric automatically. The list of all known peers (interfaces attached to any driver in the fabric) and their hostnames may be seen with:

$ omx_info
[...]
Peer table is ready, mapper is 01:02:03:04:05:06
================================================
  0) 01:02:03:04:05:06 node1:0
  1) a0:b0:c0:d0:e0:f0 node2:0

It is possible to rename local interfaces with:

$ omx_hostname -n  -n

If an interface is renamed while its old name has already been propagated to other machines, it is possible to force the update by clearing the list of known remote hostnames (as root) with:

$ omx_hostname -c

Peer Discovery

What is the peer table?

Each Open-MX node has to be aware of the hostnames and MAC addresses of all other peers. Specific information about hostnames may also be found in What are the interface and peer names?.

How can I set up the peer table?

By default, a dynamic peer discovery is performed but it is also possible to enter a static list of peers manually.

The --enable-static-peers option may be used on the configure command line to switch from dynamic to static peer table. It is also possible to switch later by passing --dynamic-peers or --static-peers to the omx_init startup script.

It is possible to restart the peer table management process without restarting the whole Open-MX driver with:

$ omx_init restart-discovery

This is especially important when attaching or detaching interfaces at runtime while using dynamic peer discovery. But it may also for instance be used to switch between static and dynamic peer table.

How do I use a static peer table?

Dynamic discovery may sometimes take several seconds before all nodes become aware of each other. If the fabric is always the same, it is possible to set up a static peer table using a file. To do so, Open-MX should be configured with --enable-static-peers.

A file listing peers must be provided to store the list of hostnames and MAC addresses in the driver. The omx_init_peers tool may be used to set up this list. The omx_init startup script takes care of running omx_init_peers automatically using /etc/open-mx/peers when it exists.

The file contains one line per peer, each with 2 fields (separated by spaces or tabs):

a MAC address (6 colon-separated hexadecimal bytes)
a board hostname (<hostname>:<ifacenumber>)
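
A peers file can be sanity-checked against this format before handing it to omx_init_peers. The validator below is a sketch under assumptions: omx_check_peers is a hypothetical name, it accepts one or two hexadecimal digits per MAC byte, and it flags every line that does not have exactly two fields (whether omx_init_peers tolerates blank lines or comments is not covered here).

```shell
# Hypothetical validator: check that each line is "<MAC> <hostname>:<ifacenumber>".
# Reads the files given as arguments, or stdin when none is given.
omx_check_peers() {
    awk '
        {
            ok = (NF == 2 && split($1, byte, ":") == 6 && $2 ~ /^[^:]+:[0-9]+$/)
            # Each of the 6 MAC components must be 1 or 2 hexadecimal digits.
            for (i = 1; ok && i <= 6; i++)
                if (byte[i] !~ /^[0-9A-Fa-f][0-9A-Fa-f]?$/) ok = 0
            if (!ok) { printf "line %d: malformed entry: %s\n", NR, $0; bad = 1 }
        }
        END { exit bad }
    ' "$@"
}

# Example:
#   omx_check_peers /etc/open-mx/peers && echo "peers file looks well-formed"
```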

To change the location of the peers file, it is possible to use the --with-peers-file=<path> configure option, or the --static-peers=<path> omx_init option.

If Open-MX has been configured for dynamic peer discovery by default, the --static-peers omx_init option may also be used to switch to static peer table.

What is FMA? How do I use it?

FMA is Myricom's fabric management system. It is used in MX by default. If you plan to make MX and Open-MX interoperable, or just want a scalable and powerful peer discovery tool, you may tell Open-MX to use FMA instead of the default omxoed dynamic peer discovery program.

FMA is available from Myricom's FMS page or may be copied from the MX source tree.

To build and use FMA, just unpack the FMA source within the Open-MX source directory (as a fma/ subdirectory), then run configure, build and install.

Which FMA version should I use?

FMA only correctly supported MX-over-Ethernet (or fabrics mixing MX-o-E and MX-over-Myrinet nodes) starting with 1.3.0. So, if running FMA as the peer discovery tool for Open-MX, at least FMA 1.3.0 is needed.

The same FMA version should be running on all nodes. This point is especially important if the fabric mixes MX and Open-MX nodes (see What is MX-wire-compatibility?). There are two easy ways to make sure the same FMA is used on MX and Open-MX nodes:

Build MX with its own FMA, and copy the FMA source in the Open-MX source dir before building it. This requires at least MX 1.2.3 since this is where FMA 1.3.0 was shipped first.
Or download the latest FMA source from Myricom's FMS page and unpack it in both MX and Open-MX before building.

How do I decide between omxoed, FMA and static peer table?

When mixing Open-MX and native MX hosts on the same fabric, the peer discovery processes must be compatible. MX uses FMA by default, so Open-MX should be configured to use FMA in this case. If MX was specifically configured to use mxoed, then Open-MX may keep using its default discovery tool, omxoed, which is compatible with mxoed.

FMA may also be much slower than omxoed on small networks. Since this is Open-MX's main use case, it is recommended to keep using the default configuration (i.e. use omxoed) unless the fabric contains some native MX hosts.

Setting up a static peer table is faster than both FMA and omxoed, but it obviously only works for static fabrics. Note that it is possible to manually add some peers later using the omx_init_peers tool.

Dynamic peer discovery

By default, Open-MX uses the omxoed program to dynamically discover all peers connected to the fabric, including the ones added later. The only requirement is that the omxoed program runs on each peer.

If Myricom's FMA source directory is unpacked within the Open-MX source (as the "fma" subdirectory), Open-MX will automatically switch (at configure time) to using FMA instead of omxoed as a peer discovery program. Using FMA is especially important when talking to native MX hosts since they will use FMA by default as well.

The discovery program is started automatically by the omx_init startup script. If Open-MX has been configured to use a static peer table by default, it is still possible to switch to dynamic discovery by passing --dynamic-peers to omx_init.

It is also possible to switch from fma to omxoed by passing the option --dynamic-peers=omxoed to omx_init.

How many peers may Open-MX talk to?

Open-MX may manage up to 65536 peers on the fabric. However, since such big fabrics are quite unusual, the Open-MX driver only supports 1024 peers by default. This threshold may be increased when loading the driver by passing the module parameter peers=N.

If too many peers are connected and the driver fails to add all of them to the peer table because it is full, a warning will be displayed in the kernel log and in the output of omx_info.

What is the raw interface and how do I use it?

Open-MX exports a message-passing programming interface to applications. It also exports another interface called "raw" used by peer discovery programs to manage the peer table in the driver.

Unless you have a very good reason to not use the existing peer discovery programs or a static peer table, you really do not want to look at the raw interface. The regular message-passing interface should provide everything you need.

What does the message "Discovery exited early" mean?

If the peer discovery program fails during startup for some reason, omx_init will issue the error message:

Starting the dynamic peer discovery (omxoed )
Discovery exited early

The problem is usually explained in the log, either in /var/log/omxoed.log or /var/run/fms/fma.log (depending on the peer-discovery program that you use). A common reason for such a failure is when no interface was attached to Open-MX. /var/log/omxoed.log will report No NICs found in this case.
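
The log lookup can be scripted as follows. This is a hedged sketch: omx_discovery_diag is a hypothetical helper name, it only recognizes the "No NICs found" case described above, and for anything else it just shows the last lines of the log.

```shell
# Hypothetical helper: inspect the discovery logs named above. Flags the common
# "No NICs found" case, otherwise prints the last lines of each readable log.
omx_discovery_diag() {
    # Default to the two log locations mentioned in the text.
    [ $# -gt 0 ] || set -- /var/log/omxoed.log /var/run/fms/fma.log
    for log in "$@"; do
        [ -r "$log" ] || continue
        if grep -q 'No NICs found' "$log"; then
            echo "$log: no interface is attached to Open-MX"
        else
            tail -n 5 "$log"
        fi
    done
}

# Example:
#   omx_discovery_diag
```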

Performance Tuning

How to quickly benchmark Open-MX?

To get best performance for benchmarking purposes between homogeneous hosts, you might want to:

Build Open-MX with --disable-endian and --disable-mx-wire (default).
Make sure no cores are sleeping since they would be slow to process incoming packets. Booting Linux with idle=poll is an easy way to prevent this sleeping. Another one is to have a task using 100% on each core as any real-life application would do.
To reduce cache effects without sharing a single core between bottom halves and the main process, bind the process to one core (close to the network interface, with numactl or taskset), and bind the interrupts of the Ethernet interface to a very close core (see How do I find out and change the binding of an interrupt line?).

Once things are properly built, installed and loaded, you may check performance using omx_perf (see How do I measure performance with omx_perf?).

You may also want to look at some hardware-specific features.

How do I measure performance with omx_perf?

omx_perf measures the latency and throughput of a ping-pong for multiple message lengths between two hosts. Like any ping-pong benchmark, it should not be considered as valuable as benchmarking actual applications, but it may still help in measuring raw network performance and diagnosing performance problems.

omx_perf may be started as a server on the first node by passing no command-line arguments.

node1 $ omx_perf
Successfully attached endpoint #0 on board #0 (hostname 'node1:0', name 'eth2', addr 01:02:03:04:05:06)
Starting receiver...

Then another instance of omx_perf running on a second node may connect to the server:

node2 $ omx_perf -d node1:0
Successfully attached endpoint #0 on board #0 (hostname 'node2:0', name 'eth2', addr a0:b0:c0:d0:e0:f0)
Starting sender to node1:0...

You should get performance numbers such as

length         0:       7.970 us   0.00 MB/s        0.00 MiB/s
length         1:       7.950 us   0.00 MB/s        0.00 MiB/s
[...]
length   4194304:       8388.608 us   500.00 MB/s       476.83 MiB/s

See the omx_perf.1 manpage for more details.
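
The two throughput columns differ only by their unit (10^6 bytes versus 2^20 bytes per second). The following Python sketch shows the conversion, assuming that the reported throughput is simply the message length divided by the reported time, which is consistent with the last sample line above. The function name is hypothetical and is not part of omx_perf.

```python
def throughput(length_bytes, time_us):
    """Return (MB/s, MiB/s) for a message of length_bytes taking time_us.

    Assumption: throughput = length / reported time, as suggested by the
    sample output (4194304 bytes in 8388.608 us gives 500 MB/s).
    """
    if time_us <= 0:
        raise ValueError("time must be positive")
    bytes_per_second = length_bytes / (time_us * 1e-6)
    return bytes_per_second / 1e6, bytes_per_second / 2**20


mb, mib = throughput(4194304, 8388.608)
print(f"{mb:.2f} MB/s  {mib:.2f} MiB/s")
```

This kind of check is handy when comparing omx_perf numbers against tools that only report one of the two units.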

What is the MX wire-compatibility impact on Open-MX performance?

Open-MX supports two types of wire-compatibility: native-MX compatibility (enabled with --enable-mx-wire, disabled by default) and endian-independent compatibility (enabled by default). Disabling them when they are not needed may improve the performance.

If native MX compatibility is not required on the wire, you might want to avoid --enable-mx-wire on the configure command line so that larger packets are used for large messages. See Native MX Compatibility for details about wire compatibility, and MTU support.

If the machines on the network all use the same endian-ness, you might want to pass --disable-endian to the configure command line so that Open-MX does not swap header bits into/from network order. It may reduce the latency very slightly.

How should I tune Open-MX MTU and packet sizes?

Open-MX performance increases with packet sizes, so a large minimum MTU is recommended. For this reason, MX-wire-compat should not be enabled unless needed (see also What is the MX wire-compatibility impact on Open-MX performance?). Similarly, if running on regular Ethernet fabrics, MTU 9000 should be preferred to 1500.

Open-MX uses packets as large as possible to fully benefit from a large MTU. If for some reason (for instance hardware-related preferences) some larger packets decrease performance, it is possible to reduce their size by configuring Open-MX with --with-medium-frag-length (for medium message fragments) and --with-pull-reply-length (for large message frames). But, in most cases, passing --with-mtu according to the NIC and switch configuration should be enough. The corresponding values may be checked in the driver status as explained in Which MTU should my network support for Open-MX?

Is there a registration cache in Open-MX?

Achieving optimal performance requires avoiding memory copies as much as possible. This is done using memory registration, which pins buffers in physical memory. Since this operation is expensive, it is worth doing it only once per buffer when the buffer is used multiple times. To do so, you should set the OMX_RCACHE environment variable to 1.

$ export OMX_RCACHE=1

However, this configuration may be dangerous if the application frees the buffer in the meantime. Since Open-MX has no way to detect this for now, this registration cache should be used with caution.

OpenMPI forces the registration cache to be enabled by default because it is able to detect and support such dangerous events. If for some reason you need to force the disabling of the Open-MX registration cache anyway, you may set OMX_RCACHE to 0 in the environment, or pass --mca mpi_leave_pinned 0 to the OpenMPI process launcher.

What if I do not need shared or self communications?

Open-MX may use a software loopback to send messages from one endpoint to itself (self communications) or to another endpoint of any interface of the same host (shared communications). If these shared/self communications are not needed, the library overhead may be slightly reduced by disabling them at runtime by setting OMX_DISABLE_SELF=1 or OMX_DISABLE_SHARED=1 in the environment. Note that some MPI layers such as OpenMPI already set these environment variables by default.

This is especially the case if there is a single process on each node and it does not talk to itself, or if multiple processes of the same node do not talk to each other.

What is the interrupt coalescing impact on Open-MX' performance?

Most Ethernet drivers use interrupt coalescing to avoid interrupting the host once per incoming packet. While this may be good for the throughput, it increases the latency a lot, up to several dozens of microseconds.

To get the best latency for Open-MX, interrupt coalescing should be reduced. The easiest way to do so is to disable it completely.

$ ethtool -C eth2 rx-usecs 0

Disabling coalescing entirely may be bad for throughput since it increases the host per-packet overhead. However, a very high coalescing delay (several tens of microseconds) is mostly useful for the throughput of unidirectional streams, which is not the case of Open-MX. In most cases, a small but non-zero delay (five to twenty microseconds) should be a good idea to get satisfying Open-MX throughput and latency.

A good compromise is to set the delay close to the best latency so that the observed latency is almost optimal while there is still a bit of coalescing for consecutive packets. So, assuming that you observe a latency of N usecs with Open-MX when interrupt coalescing is disabled, a nice configuration would be to set coalescing to N or N-1 usecs:

$ ethtool -C eth2 rx-usecs <N-1>

Is process and interrupt binding important for Open-MX?

Yes. The Open-MX receive stack is composed of a kernel routine running in the bottom half on any of the machine cores, depending on where the NIC is sending its IRQs. Device drivers usually configure IRQs to be sent to all cores in a round-robin fashion. This behavior distributes the receive workload on all cores, which is good for the vast majority of MPI jobs where each core runs exactly one process.

If you plan to have fewer processes than cores, you might experience some performance degradation caused by idle cores going to sleep and thus taking more time to process incoming IRQs. A dirty way to work around this problem is to prevent cores from sleeping by booting the kernel with the idle=poll parameter. Or you may restrict the IRQs coming from the NIC to the subset of cores that run the Open-MX processes. For instance, if your processes are bound to cores #0-1, the IRQ affinity bitmask should be set to 3 (see How do I find out and change the binding of an interrupt line?).

Under extreme circumstances, for instance for benchmarking purposes, you may want to use a single process per machine and bind it to a different core from the one receiving IRQs. This way, they will not fight for CPU time. However, since cache line sharing is critical, the binding should be done on the very next core so that the cost of cache effects stays very small. For instance, binding IRQs on core #1 and the process on core #0:

$ echo 2 > /proc/irq/<irq>/smp_affinity
$ numactl --physcpubind 0 myprocess

Another way to bind a process is to use the OMX_PROCESS_BINDING environment variable (see What are Open-MX runtime configuration options?).
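
As an illustration of the OMX_PROCESS_BINDING syntax documented in What are Open-MX runtime configuration options?, here is a small Python sketch that resolves which processor a given endpoint would be bound to. It is not the library's actual parser. The function name is hypothetical, and returning None for the file forms or for an endpoint index beyond the end of the list is an assumption made for this example only.

```python
def binding_for_endpoint(value, endpoint):
    """Return the processor index for `endpoint`, or None if undecided here.

    `value` is the content of OMX_PROCESS_BINDING (None or "" if unset).
    """
    if not value:
        return None  # by default, no binding is done
    if value.startswith("all:"):
        return int(value[len("all:"):])  # same processor for every endpoint
    if value == "file" or value.startswith("file:"):
        return None  # bindings are read from a file, not handled in this sketch
    cpus = [int(v) for v in value.split(",")]
    # Assumption: endpoints beyond the end of the list are left unbound.
    return cpus[endpoint] if 0 <= endpoint < len(cpus) else None


print(binding_for_endpoint("2,0,3,4,1,5,7,6", 4))
print(binding_for_endpoint("all:3", 9))
```

With the comma-separated form, the n-th value applies to the process opening endpoint n, so the position in the list matters rather than the order in which processes start.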

Such a configuration may be the best for benchmarking purpose, especially on the latency side. However, under a normal load, having IRQs go to all cores is probably a good idea since most applications will use one process per core. See also How may multiple receive queues help Open-MX?

Note that the core numbering is far from being linear in modern machines. It is likely that cores numbered as #0 and #1 by the software are actually not close to each other in the actual hardware. The numbering is often a round-robin across physical processors to maximize memory bandwidth or so.
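
To see how the software core numbers map to physical processors on a given machine, the Linux sysfs topology files may be inspected. The Python sketch below groups logical CPU numbers by physical package, assuming the usual /sys/devices/system/cpu/cpuN/topology/physical_package_id layout. The function name is hypothetical and the directory is a parameter so that the code can be tried on a copy of the tree.

```python
import os
import re


def cores_by_package(sysfs_cpu_dir="/sys/devices/system/cpu"):
    """Map each physical package id to the sorted list of logical CPU numbers."""
    packages = {}
    for entry in os.listdir(sysfs_cpu_dir):
        match = re.fullmatch(r"cpu(\d+)", entry)
        if not match:
            continue  # skip cpufreq, cpuidle and other non-CPU entries
        path = os.path.join(sysfs_cpu_dir, entry, "topology", "physical_package_id")
        try:
            with open(path) as f:
                package = int(f.read().strip())
        except OSError:
            continue  # offline CPU or no topology information
        packages.setdefault(package, []).append(int(match.group(1)))
    return {pkg: sorted(cpus) for pkg, cpus in packages.items()}
```

Calling cores_by_package() on a two-socket machine with round-robin numbering would typically return even numbers for one package and odd numbers for the other, which is the situation described above.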

Should I avoid some kernels and drivers?

Some old kernels (<2.6.18) have problems with some drivers that receive data in frags (non-linear skbuff). As a workaround, they will linearize these skbuffs unless their target protocol stack explicitly supports non-linear skbuff. This basically adds a memory copy for all packets except IPv4 and IPv6, which would decrease Open-MX performance.

To avoid this, if IPv6 is not in use on the network, you might want to tell Open-MX to use the IPv6 Ethernet type. This way, its skbuffs will not be linearized uselessly. To enable this workaround, you should pass --with-ethertype=0x86DD to the configure command line.

Note that this solution is only required under very special circumstances and should be avoided in most of the cases.

Hardware-Specific Features

Which hardware features may help Open-MX?

Open-MX works on generic hardware but may be enhanced if some specific hardware features are available. Interrupt coalescing, including adaptive coalescing, may help dealing with host interrupt load and latency. See How to use adaptive interrupt coalescing?. Hardware copy offload may significantly reduce the receive copy overhead. See How does I/OAT copy offload help Open-MX?. Multiqueue support may also improve the cache-friendliness of the receive stack. See How may multiple receive queues help Open-MX?.

How to use adaptive interrupt coalescing?

If your driver supports adaptive interrupt coalescing, it may well help Open-MX performance. It automatically disables coalescing (and thus improves latency) when the packet rate is low, and re-enables a high coalescing delay (and thus improves the overall performance) when the packet rate is high (see also What is the interrupt coalescing impact on Open-MX' performance?). Thus, when it is supported, you probably want to try enabling adaptive interrupt coalescing on the receive side:

$ ethtool -C eth2 adaptive-rx on

Then, if you do not observe optimal performance yet, you may want to tune adaptive coalescing so that, for instance, a pingpong-like pattern gets the best latency. Since a 6-microsecond pingpong generates about 83 thousand packets per second, you may for instance tell the driver to disable coalescing entirely when fewer than 150 thousand packets are received per second:

$ ethtool -C eth2 pkt-rate-low 150000
$ ethtool -C eth2 rx-usecs-low 0
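
The packet rate quoted above can be derived from the latency. The following Python sketch assumes that the latency is the half round-trip time of a ping-pong with single-packet messages, so that each host receives one packet per round trip. The function name is hypothetical.

```python
def pingpong_rx_rate(latency_us):
    """Packets received per second by one host during a ping-pong.

    Assumption: one packet per message, and each host receives one packet
    per round trip, that is every 2 * latency_us microseconds.
    """
    if latency_us <= 0:
        raise ValueError("latency must be positive")
    return 1e6 / (2 * latency_us)


rate = pingpong_rx_rate(6)
print(f"{rate:.0f} packets/s, below the 150000 threshold: {rate < 150000}")
```

Repeating the computation with the latency measured on your own fabric gives a starting point for pkt-rate-low, with some margin left above the ping-pong rate.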

Note that some NICs and drivers are slow at adapting the coalescing delay according to traffic pattern changes. In this case, adaptive coalescing may disturb performance by reacting too slowly. A careful review of the applications' behavior and of the performance improvement that it actually brings is thus necessary before enabling adaptive coalescing by default.

How does I/OAT copy offload help Open-MX?

Many modern platforms such as Intel I/OAT-enabled servers provide a hardware DMA engine to offload memory copies. Open-MX performance may increase very significantly thanks to this feature.

The support for dmaengine is automatically built in Open-MX when supported by the kernel and may be configured at runtime through several module parameters. See What are Open-MX startup-time configuration options? for details.

Note that DMA engine hardware may still require the administrator to load the corresponding driver, for instance the 'ioatdma' kernel module. The kernel logs will display the DMA engine status when loading Open-MX or modifying some module parameters.

How may multiple receive queues help Open-MX?

Many modern NICs have the ability to associate one receive queue to each IP connection thanks to packet filtering in the NIC and multiple receive queues. If the NIC supports multiqueues with knowledge of the Open-MX protocol, Open-MX performance may increase due to the receive stack becoming more cache-friendly.

The usual optimization consists in having the bottom half that takes care of an endpoint always run on the same CPU. See How do I add Open-MX multiqueue support to my NIC firmware?. The other optimization is to make sure that this bottom half runs on the same CPU as the process that opened this endpoint. See How do I bind my processes near Open-MX receive multiqueues?.

How do I add Open-MX multiqueue support to my NIC firmware?

As of today, only the myri10ge firmware for Myri-10G boards is known to have built-in Open-MX-aware multiqueue support. It is included in myri10ge firmware since version 1.4.33.

First, you need to make sure your hardware supports multiple Rx queues (receive queues). Then you need to get your firmware sources and be able to rebuild/reflash it. What you need to change is the code that chooses an Rx queue depending on the packet contents. Here is an Open-MX packet description:

Bytes 1-6 and 7-12: destination and source MAC addresses (standard Ethernet header order).
Bytes 13-14: Ethernet type. You have to match 0x86DF for Open-MX there.
Byte 17: Open-MX type.
Byte 18: Destination endpoint number, except if the Open-MX type is 0x2a (and some control packets we do not care about).
Byte 32: Destination endpoint number, only if the Open-MX type is 0x2a.

So you need to get the endpoint number and just hash it so that all packets from the same endpoint go to the same Rx queue. This ensures there will be no cache effects between bottom halves on different CPUs.
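
The selection logic described above may be sketched as follows. Actual firmware would be written in C or similar, so this Python version only illustrates the byte offsets and the hashing. It assumes that the byte numbers in the list are 1-based offsets from the start of the Ethernet frame, and it uses a plain modulo as the hash, which is an assumption. The names are hypothetical.

```python
OMX_ETHERTYPE = 0x86DF
OMX_TYPE_WITH_LATE_ENDPOINT = 0x2A  # hypothetical name for the 0x2a packet type


def rx_queue_for_frame(frame, nqueues):
    """Return the Rx queue index for an Open-MX frame, or None otherwise."""
    if len(frame) < 18:
        return None
    # Bytes 13-14 (1-based): Ethernet type, in network byte order.
    if int.from_bytes(frame[12:14], "big") != OMX_ETHERTYPE:
        return None
    omx_type = frame[16]  # byte 17
    if omx_type == OMX_TYPE_WITH_LATE_ENDPOINT:
        if len(frame) < 32:
            return None
        endpoint = frame[31]  # byte 32
    else:
        endpoint = frame[17]  # byte 18
    # Assumption: a simple modulo keeps each endpoint on a single queue.
    return endpoint % nqueues
```

Non-Open-MX frames return None here so that the caller can fall back to its regular queue selection. The control packets mentioned above, whose byte 18 is not an endpoint number, are not special-cased in this sketch.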

How do I bind my processes near Open-MX receive multiqueues?

This section only matters if the NIC has multiqueue support and if this multiqueue support knows how to filter Open-MX packets.

As explained in What are Open-MX runtime configuration options?, the OMX_PROCESS_BINDING environment variable may be used to bind Open-MX processes depending on their endpoint number. If your NIC is capable of filtering Open-MX packets into multiple queues, you may set up this variable manually. You need to know which interrupt line is used for each endpoint number and where these interrupts are sent (see also How do I find out which interrupt line my interfaces use? and How do I find out and change the binding of an interrupt line? for more information).

But an easier solution consists in having Open-MX gather binding information automatically. If the OMX_PROCESS_BINDING variable is set to file, binding hints will be read from /tmp/open-mx.bindings.dat. To generate this file (once Open-MX is loaded and the interface(s) are attached), run the omx_prepare_binding tool (as root). Again, this is only useful if your NIC multiqueue support knows how to filter Open-MX packets.

$ sudo omx_prepare_binding
Generated bindings in /tmp/open-mx.bindings.dat
$ cat /tmp/open-mx.bindings.dat
board 00:60:dd:47:c4:75 ep 0 irq 1269 mask 00000001
board 00:60:dd:47:c4:75 ep 1 irq 1268 mask 00000002
board 00:60:dd:47:c4:75 ep 2 irq 1267 mask 00000004
board 00:60:dd:47:c4:75 ep 3 irq 1266 mask 00000008
[...]

omx_prepare_binding scans /proc/interrupts to find out which interrupt lines are used for each attached interface. It finds out the corresponding interrupt line numbers and driver queue numbers. It assumes that the driver registered these interrupts in a standard way, as requested by the Linux network stack maintainer. For instance, if the interface is eth2, the driver should register its queue names as eth2...N... where N is the queue number. If there are different queues for sending and receiving, omx_prepare_binding may ignore sending queues if their name contains tx.

There are many endpoints (usually 32 per interface) and only a limited number of Rx queues (usually one per core). Several endpoints may thus actually be bound to the same queue. If the queue numbers are standard (contiguous set of integers from 0 to N-1), omx_prepare_binding assumes that the NIC will actually compute the queue number by applying a modulo to the endpoint number. Otherwise, omx_prepare_binding only assumes that each queue is associated to a single endpoint whose number is the same.

omx_prepare_binding may apply hardware quirks if the NIC uses a complex way to associate endpoint numbers with queue numbers, or if the driver does not use standard interrupt line names. Please report such cases to the open-mx mailing list so that a special case is added.
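
To inspect the generated bindings from a script, the file may be parsed as in the Python sketch below. The line format (board, ep, irq and mask keywords followed by their values) is inferred only from the sample output shown above, so it should be treated as an assumption. The function name is hypothetical.

```python
def parse_bindings(text):
    """Parse bindings lines into {(board, endpoint): (irq, cpus)}.

    `cpus` lists the processor indexes whose bit is set in the hex mask.
    """
    bindings = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 8 or fields[0::2] != ["board", "ep", "irq", "mask"]:
            continue  # ignore anything that does not look like a binding line
        board, ep, irq, mask = fields[1::2]
        bits = int(mask.replace(",", ""), 16)
        cpus = [cpu for cpu in range(bits.bit_length()) if (bits >> cpu) & 1]
        bindings[(board, int(ep))] = (int(irq), cpus)
    return bindings


sample = "board 00:60:dd:47:c4:75 ep 2 irq 1267 mask 00000004\n"
print(parse_bindings(sample))
```

Each entry tells which interrupt line serves an endpoint and which processors receive that interrupt, which is the information needed to choose a process binding by hand.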

How do I find out which interrupt line my interfaces use?

You may need to know which interrupt line number is used by a NIC in order to be able to bind processes near the cores that process incoming interrupts and packets. To do so, look in the /proc/interrupts file and search for the line that corresponds to your NIC. The first column contains interrupt line numbers. It is followed by per-processor counters and some information.

           CPU0       CPU1       CPU2       CPU3
 33:       6644       6519       6762       6623   PCI-MSI-edge      eth0
 36:     894478    1023337     889191    1012491   PCI-MSI-edge      iwlagn

The last column usually contains the interface name (eth0 above, which has interrupt line 33). Sometimes, it contains the corresponding kernel driver name instead (iwlagn above, which has interrupt line 36) or something derived from it. The Open-MX driver displays the driver name of each interface during attach, so it may help finding the IRQ line as well.

[53318.466750] Open-MX: Attaching Ethernet interface 'wlan0' as #1, MTU=1500
[53318.466756] Open-MX:   Interface 'wlan0' is PCI device '0000:02:00.0' managed by driver 'iwlagn'

Finally, since the /proc/interrupts file contains interrupt counters, the interrupt line may be found by comparing counters before and after the run of an Open-MX application.
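
Both approaches (searching the last column, or comparing counters before and after a run) can be scripted. The Python sketch below works on the text content of /proc/interrupts. The function names are hypothetical, and the name search is a plain substring match, which is an assumption that may need refining on machines with many similarly named interfaces.

```python
def parse_interrupts(text):
    """Parse /proc/interrupts content into {irq: (per_cpu_counts, description)}."""
    lines = text.splitlines()
    if not lines:
        return {}
    ncpus = len(lines[0].split())  # the header row lists CPU0, CPU1, ...
    table = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        fields = rest.split()
        counts = fields[:ncpus]
        # Skip lines without one numeric counter per CPU (for instance ERR).
        if not sep or len(counts) < ncpus or not all(c.isdigit() for c in counts):
            continue
        table[label.strip()] = ([int(c) for c in counts], " ".join(fields[ncpus:]))
    return table


def irqs_matching(text, name):
    """Return the IRQ labels whose description mentions an interface or driver."""
    return [irq for irq, (_, desc) in parse_interrupts(text).items() if name in desc]


def busiest_irq(before, after):
    """Return the IRQ whose total counter grew the most between two snapshots."""
    old, new = parse_interrupts(before), parse_interrupts(after)
    growth = {irq: sum(new[irq][0]) - sum(old[irq][0]) for irq in new if irq in old}
    return max(growth, key=growth.get) if growth else None
```

A typical use is to read /proc/interrupts into a string, run an Open-MX benchmark, read the file again and pass both strings to busiest_irq().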

How do I find out and change the binding of an interrupt line?

First, to find out an interrupt line number, see How do I find out which interrupt line my interfaces use?

The binding (or affinity) of an interrupt line (where these interrupts will be processed) may be read from the file /proc/irq/<number>/smp_affinity. It contains a hexadecimal bitmask listing the indexes of processors that will receive interrupts (usually in a round-robin manner).

  # cat /proc/irq/57/smp_affinity
  ffffffff

To force interrupts onto the 11th physical processor, write a bitmask where only bit #10 is set:

  # echo 400 > /proc/irq/57/smp_affinity
  # cat /proc/irq/57/smp_affinity
  00000400

Make sure you always read the file again to check that your request was accepted. Indeed, some hardware and/or kernel constraints may have to be applied in the meantime. Also, be aware that some kernels require careful formatting of the bitmask, with the right number of digits and commas (which depends on the kernel configuration).

  # cat /proc/irq/4334/smp_affinity
  00000000,ffffffff,ffffffff,ffffffff
  # echo 00000000,00000003,00000000,00000000 > /proc/irq/4334/smp_affinity
  # cat /proc/irq/4334/smp_affinity
  00000000,00000003,00000000,00000000

Once the binding is in place, you should see in /proc/interrupts that only the selected processors have increasing counters on the interrupt line. See also How do I find out which interrupt line my interfaces use? for more information about this file.
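
Building these bitmasks by hand is error-prone on large machines. The Python sketch below formats a mask from a list of processor indexes, using comma-separated groups of 8 hexadecimal digits with the most significant group first, as in the examples above. The function name and the ncpus parameter are hypothetical, and the number of groups expected by a given kernel remains an assumption to check against what reading smp_affinity returns.

```python
def affinity_mask(cpus, ncpus=32):
    """Format an smp_affinity bitmask with bits set for the given CPU indexes."""
    bits = 0
    for cpu in cpus:
        if not 0 <= cpu < ncpus:
            raise ValueError(f"cpu {cpu} out of range 0..{ncpus - 1}")
        bits |= 1 << cpu
    ngroups = (ncpus + 31) // 32
    groups = [(bits >> (32 * i)) & 0xFFFFFFFF for i in range(ngroups)]
    # The most significant 32-bit group is written first.
    return ",".join(f"{group:08x}" for group in reversed(groups))


print(affinity_mask([10]))
print(affinity_mask([64, 65], ncpus=128))
```

The resulting string can then be written into /proc/irq/<number>/smp_affinity, and the file read back to check that the request was accepted.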

Native MX Compatibility

What is MX-wire-compatibility?

If you need some Open-MX hosts to talk to some MX hosts, you should enable wire-compatibility (by passing --enable-mx-wire to the configure script). If you only have Open-MX hosts talking on the network, you should keep it disabled to improve performance (see Performance Tuning).

Once Open-MX is configured in wire-compatible mode, you need to make sure that the nodes running in native MX mode are using a recent MX stack (at least 1.2.5 is recommended) configured in Ethernet mode. Once peer tables are set up on both MX and Open-MX nodes, the fabric is ready.

How to use MX-wire-compatibility with a dynamic peer discovery tool?

You have to make sure that the same peer discovery program (or "mapper") is used on both sides. MX uses FMA by default. So the FMA source should be unpacked as a "fma" subdirectory of the Open-MX source so that the configure script enables FMA instead of omxoed for dynamic peer discovery.

Note that not all FMA versions are wire-compatible with each other, even if the underlying MX and/or Open-MX stacks are compatible. See Which FMA version should I use? for details.

Under some circumstances, MX may also rely on mxoed, which is compatible with Open-MX' omxoed.

How to use MX-wire-compatibility with a static peer table?

The peer table should be set up on the Open-MX nodes as usual with omx_init_peers, with a single entry for each Open-MX peer and each MX peer.

On the MX nodes, each Open-MX peer with name "myhostname:0" and mac address 00:11:22:33:44:55 should be added with:

$ mx_init_ether_peer 00:11:22:33:44:55 00:00:00:00:00:00 myhostname:0

Note that MX 1.2.5 is required for mx_init_ether_peer to be available.

Also note that it is possible to let the regular MX dynamic discovery map the MX-only fabric and then manually add the Open-MX peers. To do so, the regular discovery should first be stopped with:

$ /etc/init.d/mx stop-mapper

What MX-API and -ABI compatibility does Open-MX provide?

The Open-MX API is slightly different from that of MX, but Open-MX provides a compatibility layer which enables:

Linking of applications that were compiled against MX
Building of applications that were written for the MX API

This compatibility is enabled by default and has a very low overhead since it only involves going across basic conversion routines.

Additionally, Open-MX also understands the MX-specific environment variables that matter here (unless OMX_IGNORE_MX_ENV=1 is set).

When can I disable the MX-API or -ABI compatibility?

If you do not plan to use any application that has been written for MX, it is possible to disable the API and ABI compatibility altogether by passing --disable-mx-abi to the configure script.

Which MX version is Open-MX compatible with?

Open-MX provides the binary interface of MX 1.2.x, which is also backward compatible with any application built on an older MX (up to 0.9). So if you built your application on top of MX (unless it was 10 years ago), it will work fine with Open-MX.

When passing --enable-mx-wire to the configure script, Open-MX is wire compatible with MX 1.2.x. It means that a host running the native MX stack 1.1 or earlier will not be able to talk with an Open-MX host.

Also, since not all MX versions ship the same FMA version, if you want to use FMA as a peer discovery tool, you might want to look at Which FMA version should I use?.

Advanced Configuration

What are Open-MX build-time configuration options?

The following options may be passed to the configure command line before building:

--enable-multilib: Build both 32bit and 64bit library instead of only the compiler's default one. See also May I build a 32bit library? or a 64bit? or both?.
--enable-debug: Only build a debugging library. Both a debug and a non-debug are built by default, and the non-debug one is used to link all tests/tools programs.
--disable-debug: Only build a non-debugging library.
--disable-endian: Disable variable endian architectures support on the wire. Endian-ness independent wire protocol is enabled by default.
--disable-threads: Disable thread safety in the user-space library. See also Is Open-MX thread-safe?.
--disable-internal-malloc: Disable the internal malloc implementation in the user-space library. Open-MX has to use its own malloc implementation to prevent Open MPI from intercepting it (for regcache purpose) and causing some possible deadlocks. Although this internal implementation should not cause any problem or overhead, it is possible to disable it and revert to the default malloc implementation. However, this should only be considered when the user is sure that nobody is going to intercept malloc in the process (as many MPI layers do).
--disable-valgrind: Disable Valgrind hooks in the debugging library. By default, Valgrind hooks are enabled in the debugging library. They help Valgrind understanding what is going on in the Open-MX library.
--disable-mx-abi: Do not support binary (and API) compatibility with MX. Do not build MX symbols inside the Open-MX library and do not export MX API headers in the installation directory.
--enable-mx-wire: Do not optimize the wire-protocol, maintain wire compatibility with Myrinet Express over Ethernet instead.
--with-mtu=1500: Enable support for 1500-bytes MTU fabric. This may reduce large message throughput.
--with-medium-frags=32: Enforce the maximum number of fragments per medium messages. This consumes a bit more memory but lets you switch later to the rendezvous strategy (using OMX_RNDV_THRESHOLD and OMX_SHARED_RNDV_THRESHOLD environment variables). By default, medium messages may not be larger than 256kB if MTU is 9000 (32 8kB-fragments) and 45kB if MTU is 1500, and the rendezvous strategy usually gets enabled after 32kB. This option lets you increase the maximal number of fragments up to 255 per medium messages, which means almost 2MB per message if MTU is 9000.
--with-medium-frag-length=8192
--with-pull-reply-length=8968: Enforce the maximum size of medium message fragments and of large message pull reply frames, respectively, instead of relying on the MTU and wire-compatibility configuration.
--with-pull-block-replies=32: Enforce the maximum number of pull reply packets to be sent per pull block request instead of relying on wire-compatibility configuration.
--with-shared-ring-entries=1024: Change the number of entries per shared ring. By default, each endpoint uses 1024-entry rings that are shared between user-space and the kernel for sending and receiving. Reducing the number of entries reduces the overall amount of vmalloc'ed memory. See also What if endpoint opening fails with "No resources available in the system"?.
--disable-fma: Enforce disabling of FMA peer discovery even when MX wire compatibility is enabled. By default, if wire compatibility is enabled, FMA should be used. See also How do I decide between omxoed, FMA and static peer table?.
--enable-static-peers: Use a static peer table instead of dynamic peer discovery
--with-peers-file=<file>: Use <file> as a static peer table instead of the default /etc/open-mx/peers.
--with-linux-release=2.6.x-y
--with-linux=/path/to/kernel/headers
--with-linux-build=/path/to/kernel/build: Enforce the target kernel release, header directory and build directory instead of retrieving them from uname -r. Note that --with-linux-release also changes the linux and linux-build directories, and that --with-linux also changes the linux-build directory.
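
The medium message sizes quoted for --with-medium-frags follow from multiplying the number of fragments by the fragment length. The Python sketch below reproduces this arithmetic for MTU 9000 (8kB fragments) and checks a rendezvous threshold against the bounds given for OMX_RNDV_THRESHOLD in What are Open-MX runtime configuration options?. The function names are hypothetical, and treating both bounds as inclusive is an assumption.

```python
MIN_MEDIUM_LENGTH = 128  # bytes, minimal medium message length


def max_medium_length(frags=32, frag_length=8192):
    """Largest medium message, in bytes, for a given fragment configuration."""
    return frags * frag_length


def rndv_threshold_is_valid(threshold, frags=32, frag_length=8192):
    """Check a rendezvous threshold against the documented bounds.

    Assumption: both bounds are inclusive.
    """
    return MIN_MEDIUM_LENGTH <= threshold <= max_medium_length(frags, frag_length)


print(max_medium_length() // 1024, "kB with 32 fragments")
print(max_medium_length(frags=255) // 1024, "kB with 255 fragments")
```

This shows why a threshold above 256kB only becomes acceptable once the library has been configured with a larger --with-medium-frags value.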

What are Open-MX startup-time configuration options?

The following module parameters may be passed to the driver module when loading, either as a parameter to the modprobe command, or through the OMX_MODULE_PARAMS variable for the omx_init or /etc/init.d/open-mx startup script. Some of them may also be modified later by writing into /sys/module/open_mx/parameters/<parameter>.

ifnames="eth2 eth3": Attach interfaces eth2 and eth3 at startup instead of all interfaces. See Managing Interfaces for details.
ifaces=32: Allow a maximum of 32 interfaces to be attached at the same time. Default is 32.
endpoints=32: Allow a maximum of 32 endpoints to be open per interface. Default is 32.
peers=1024: Allow a maximum of 1024 peers to be connected on the network. Default is 1024.
demandpin=1: Defer memory pinning of large region until really needed to enable overlap of pinning with communication (only shared-memory for now). Default is 0 (disabled).
dmaengine=1: Enable DMA engine to offload memory copies, when supported in hardware and in the kernel. Modifying this value will display the DMA engine status in the kernel logs. Default is 0 (disabled).
dmaasyncmin=65536: Offload asynchronous copy on DMA engine hardware only if the whole message length is above this threshold. This is used for large message receive. Even if fragments are large, offloading their copy does not make much sense if there are very few of them. Default is 64 kbytes.
dmaasyncfragmin=1024: Offload asynchronous copy on DMA engine hardware only if the current fragment length is above this threshold. This is used for large message receive. Even if the whole message is big, offloading very small fragment copy does not make much sense if submitting the copy offload request is slower than copying directly. Default is 1024 bytes.
dmasyncmin=2097152: Offload synchronous copy on DMA engine hardware only if the length is above this threshold. This is used for medium message receive, and shared memory communication. Offloading small synchronous copies is not faster than a regular copy when the data is smaller than the cache. Default is 2 Mbytes.
skbfrags=16: Allow a maximum of 16 frags to be attached to socket buffer on the send side. If the underlying driver does not support frags, 0 should be used. The default and maximal value is MAX_SKB_FRAGS (16 on common archs).
skbcopy=0: Copy small buffers into a linear skb instead of attaching pages. If the underlying driver is slow at sending frags, increasing this parameter to copy small frags into a linear skb may be faster than using frags as usual. Default is 0 (never copy, always attach).
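
The interaction between the dmaengine, dmaasyncmin, dmaasyncfragmin and dmasyncmin parameters may be summarized by the Python sketch below. It is only an illustration of the thresholds as described in the list above, not the driver code. The function names are hypothetical, and treating a length equal to the threshold as eligible for offload is an assumption.

```python
def offload_async_copy(message_length, fragment_length, dmaengine,
                       dmaasyncmin=65536, dmaasyncfragmin=1024):
    """Large message receive: offload only big fragments of big messages."""
    return (bool(dmaengine)
            and message_length >= dmaasyncmin
            and fragment_length >= dmaasyncfragmin)


def offload_sync_copy(length, dmaengine, dmasyncmin=2097152):
    """Medium message receive and shared memory: offload only very large copies."""
    return bool(dmaengine) and length >= dmasyncmin


print(offload_async_copy(1 << 20, 8192, dmaengine=1))
print(offload_sync_copy(32768, dmaengine=1))
```

With the default thresholds, a 1MB large message made of 8kB fragments would be offloaded, while a 32kB synchronous copy would still be performed by the processor.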

When starting Open-MX with the omx_init script (or /etc/init.d/open-mx if installed by omx_local_install), it is also possible to tune its startup by modifying the /etc/open-mx/open-mx.conf configuration file with the following variables. It is also possible to overwrite these variables by passing them in the environment when running the startup script.

OMX_IFACES="all": Defines which interfaces to acquire for Open-MX at startup. "all" attaches all available interfaces (default). "eth1,eth3" attaches eth1 and eth3. " " attaches none of them. See also Which interfaces are attached at startup?.
OMX_MODULE_PARAMS=: Pass some parameters to the Open-MX kernel module on load.
OMX_MODULE_DEPENDS=: Define some kernel module dependencies (useful if modinfo is missing).
OMX_FMA_PARAMS=: Pass additional FMA command-line parameters (-D for debug, ...).
OMX_FMA_START_TIMEOUT=5: Define the additional FMA startup timeout in seconds (5 by default).

What are Open-MX runtime configuration options?

The following environment variables may be used to change the library behavior at runtime, when starting a process:

OMX_RCACHE=1: Enable registration cache. The registration cache is disabled by default.
OMX_PRCACHE=1: Enable parallel registration cache, which caches large windows more aggressively than MX can, by supporting multiple large receive and possibly one large send on the same window at the same time. Parallel registration cache is disabled by default.
OMX_DISABLE_SELF=1: Disable software loopback between an endpoint and itself. Self software loopback is enabled by default.
OMX_DISABLE_SHARED=1: Disable software loopback between endpoints of the same node. Shared software loopback is enabled by default.
OMX_RNDV_THRESHOLD=32768: Set the rendezvous threshold for native inter-node communication. Native inter-node networking switches from eager to rendezvous at 32kB (while shared intra-node communication switches at 4kB by default). This threshold may be modified with this environment variable with the following restrictions: It cannot be lower than 128 bytes (minimal medium message length). The threshold may be increased above 32kB, up to the maximal number of fragments per medium message (32 by default unless --with-medium-frags was used at configure time) multiplied by the maximal medium fragment length (8kB if MTU is 9000).
OMX_SHARED_RNDV_THRESHOLD=4096: Set the rendezvous threshold for shared intra-node communication. Native inter-node networking switches from eager to rendezvous at 32kB while shared intra-node communication switches at 4kB by default. Acceptable values for this threshold are the same as those of OMX_RNDV_THRESHOLD.
OMX_PROCESS_BINDING=2,0,3,4,1,5,7,6: Defines where each process has to be bound when it opens an endpoint. By default, no binding is done. If a comma-separated binding is given, the n-th value defines the processor where the process opening endpoint n will be bound. If all:x is given, then processor x will be used whatever the endpoint index is. If file is given in the environment variable, bindings will be read from /tmp/open-mx.bindings.dat. If file:<filename>, they will be read from the specified filename. For more details about process binding, see Is process and interrupt binding important for Open-MX?. See also How do I bind my processes near Open-MX receive multiqueues?.
OMX_CTXIDS=3,7: Enable context-ids splitting of the matching space to reduce matching time. Two comma-separated numeric values have to be given. The first one is the number of multiplexing bits, the second one is their offset in the 64-bit match space. Note that enabling context-ids requires the application to satisfy some constraints such as not using wildcards in the multiplexed bits when posting receives.
OMX_ANY_ENDPOINT=n: Force a specific endpoint index to be used when OMX_ANY_ENDPOINT is given to omx_open_endpoint().
OMX_MEDIUM_SENDQ=1: Use the send queue or not for sending medium messages. If using the send queue (default), data is copied in a static buffer by the user-space library and the driver will attach the corresponding static pages to outgoing socket buffers. If this strategy is slow on your NIC, for instance because it does not like fragmented DMA on the send side, you may want to try setting this variable to 0. It will force the library and driver to use a linear socket buffer where the data is directly copied in.
OMX_WAITSPIN=1: Busy loop instead of sleeping in blocking functions. Blocking functions sleep by default.
OMX_WAITINTR=1: Let sleeping functions be interruptible by signals. Blocking functions go back to sleep on signal by default.
OMX_CONNECT_POLLALL=1: When blocking in mx_connect, poll other endpoints as well. When opening multiple endpoints per process, this may work around some deadlocks that may occur if endpoints are connecting in random order.
OMX_RESENDS_MAX=1000: Try to resend each send request 1000 times before timing out. This is the default value.
OMX_NOTACKED_MAX=4: Allow a maximum of 4 messages not acked per partner. When passing this threshold, an explicit ack is sent immediately if needed. This is equivalent to MX_IMM_ACK and may be used to enforce immediate acking of all incoming messages, see What if a message fails because an endpoint is unreachable?
OMX_ZOMBIE_SEND=512: Tolerate the completion of 512 sends before their actual ack. At most 512 zombies are completed before being acked by default.
OMX_FATAL_ERRORS=0: Disable fatal errors. Instead of having Open-MX abort as soon as a request or function gets an error, let the error be reported to the application.
OMX_ABORT_SLEEPS=0: Sleep before actually aborting on fatal errors. If set to a non-zero value, the Open-MX library will sleep that many seconds and print the process pid before actually aborting.
OMX_DEBUG_REQUESTS=1: Enable checking request queues. Every time the progression loop runs, check that the amount of allocated requests is equal to the amount of currently queued requests. If set to 2, some debugging messages about the number of requests will be displayed. If set to 3, more details about each queue will be added. This feature is disabled by default since it may be time consuming for request-intensive applications.
OMX_DEBUG_CHECKSUM=1: Enable end-to-end checksumming of messages. Compute the checksum of the send buffer and compare it with the checksum of the final receiver buffer. If the message was truncated because the receive buffer was too small, the check is ignored. This feature is disabled by default since it usually slows down communication a lot.
OMX_DEBUG_SIGNAL=1: Enable dumping of the library state when receiving a signal. This feature is only enabled by default in the debug library. It may be disabled all the time by setting the variable to 0. Setting the variable to a positive number will enable the dumping even in the non-debug library. If bigger than 1, the dumping will be more detailed.
OMX_DEBUG_SIGNAL_NUM=<SIGUSR1>: Change the signal to be used to dump the library state. By default, SIGUSR1 is used. The given value has to be numeric.
OMX_VERBOSE=1: Display verbose messages. No verbose messages are displayed by default, except in the debugging library.
OMX_VERBDEBUG=<mask>: Display verbose debugging messages in the debugging library. No verbose debugging messages are displayed by default (mask=0).
OMX_VERBOSE_PREFIX=1: Display information about the process in the prefix of all Open-MX messages. The default prefix is OMX:. If 1 is given, the prefix becomes OMX:%H:%p. Otherwise, any string may be given using special variables %p, %e, %b, %B and %H. %e will be replaced by the endpoint number, %p is the process id, %b is the board number, %B is the Open-MX board hostname, and %H is the machine hostname. If no endpoint is involved, X is used as endpoint or board identifiers. Moreover, if %H or %B is followed by [f-t], the hostname or board hostname is truncated from the character index f to the character index t.
OMX_IGNORE_MX_ENV=1: Ignore the environment variables of the native MX stack. By default, several MX-specific variables such as MX_RCACHE or MX_DISABLE_SELF are translated into Open-MX-specific environment variables. See What MX-API and -ABI compatibility does Open-MX provide?.
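The bounds given above for OMX_RNDV_THRESHOLD can be checked before launching a process. The following is only a sketch: it assumes the default of 32 fragments per medium message, takes "8kB" as 8192 bytes, and uses a hypothetical helper named omx_threshold_ok and a placeholder program ./my_app.

```shell
#!/bin/sh
# Assumed defaults taken from the text: 32 fragments per medium message,
# 8kB (taken here as 8192 bytes) per fragment with a 9000-byte MTU.
max_frags=32
frag_len=8192
min_threshold=128
max_threshold=$((max_frags * frag_len))

# Hypothetical helper: succeed only if $1 is an acceptable threshold.
omx_threshold_ok() {
    [ "$1" -ge "$min_threshold" ] && [ "$1" -le "$max_threshold" ]
}

requested=65536
if omx_threshold_ok "$requested"; then
    echo "OMX_RNDV_THRESHOLD=$requested is within [$min_threshold, $max_threshold]"
    # The launch would then look like (./my_app is a placeholder):
    #   OMX_RNDV_THRESHOLD=$requested ./my_app
else
    echo "OMX_RNDV_THRESHOLD=$requested is out of range" >&2
fi
```

Under these assumptions the upper bound is 262144 bytes (256kB). A build using a different --with-medium-frags value or a different MTU would need the first two variables adjusted.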

What should I know before I build/link my middleware with Open-MX?

If you plan to use Open-MX within a middleware such as an MPI layer, you should read the following configuration advice:

MX ABI/API compatibility: Passing --disable-mx to the Open-MX configure line is only possible if all middleware involved use the native Open-MX API. In most cases, keeping the MX ABI/API compatibility enabled should cause no harm and a very small overhead. It thus is recommended. Once Open-MX is installed, passing its installation path to the middleware configuration system as the MX installation path should do the trick.
Thread-safety: A thread-safe middleware should generally rely on a thread-safe Open-MX. Building Open-MX with --disable-threads may only work if the caller uses neither blocking Open-MX functions nor the unexpected handler, and obviously serializes Open-MX calls.

Debugging

What debugging features does Open-MX offer?

Open-MX provides several debugging features such as verbose messages, additional checks, non-optimized building, valgrind hooks, ... For performance reasons, they are not enabled by default.

By default, Open-MX will build a non-debug library and an optional debug library. The former is installed in $prefix/lib while the latter goes in $prefix/lib/debug. The driver is built without debug by default.

If you think you found a bug, see What if I find a bug?.

How-to enable debugging features by default?

Passing --disable-debug to the configure command line will only disable the build of the debug library. Passing --enable-debug will cause only the debug library to be built and installed in $prefix/lib as usual, and the driver will be debug-enabled.

The build flags may be configured by passing CFLAGS on the configure command line. Additional flags may be passed for the debugging library build with DBGCFLAGS.
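For instance, a build with separate flags for the regular and debug libraries might be configured as in the sketch below. The flag values are only illustrative, and the guard makes the snippet a no-op outside the Open-MX source directory.

```shell
# Illustrative flag values; adjust them to your compiler.
opt_flags="-O2"
dbg_flags="-O0 -g"

if [ -x ./configure ]; then
    # CFLAGS applies to the build, DBGCFLAGS adds flags for the debug library.
    ./configure CFLAGS="$opt_flags" DBGCFLAGS="$dbg_flags"
else
    echo "run this from the Open-MX source directory" >&2
fi
```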

How to debug an abort message?

Open-MX may abort the application under many circumstances. If you wish to attach a gdb to debug the process before it actually aborts, you may pass OMX_ABORT_SLEEPS=30 in the environment so that the actual abort is deferred by 30 seconds. The pid of the process will be displayed in the meantime. See also What happens on error?.
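A possible way to do this from a shell is sketched below. The run_with_abort_delay helper and ./my_app are hypothetical names used for illustration; only OMX_ABORT_SLEEPS comes from Open-MX.

```shell
# Hypothetical helper: run a command with the abort deferred by $1 seconds.
run_with_abort_delay() {
    delay=$1
    shift
    OMX_ABORT_SLEEPS="$delay" "$@"
}

# Usage (./my_app is a placeholder for your Open-MX program):
#   run_with_abort_delay 30 ./my_app
# Then, from another terminal, with the pid printed by Open-MX:
#   gdb -p <pid>
```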

Does Open-MX provide statistics regarding the network traffic?

Yes. Open-MX maintains per-interface statistics at the driver level (even if debugging is disabled). They may be observed with

$ omx_counters

You may pass the -b option to select a single interface. Only the non-null counters are displayed, unless -v is given. These counters may also be cleared with -c.

Open-MX also maintains statistics regarding local communication (shared-memory). They may be observed with

$ omx_counters -s

How may I see the status of all requests in the Open-MX library?

When the SIGUSR1 signal is sent to an Open-MX program, the library will dump its status on the standard output, including all known peers and pending requests.

This feature is enabled by default in the debug library only. It may be enabled at runtime by setting the OMX_DEBUG_SIGNAL environment variable to 1 or more (more means more status details will be displayed). This feature may also be disabled in the debug library if the variable is set to 0. If a numeric value is given in the OMX_DEBUG_SIGNAL_NUM environment variable, it will replace the default signal number (SIGUSR1).
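From a shell, the dump may be requested with the standard kill utility. The sketch below uses a hypothetical omx_dump_state helper and a placeholder program ./my_app. The second argument is only needed if OMX_DEBUG_SIGNAL_NUM was used to select another signal.

```shell
# Hypothetical helper: ask an Open-MX process to dump its state.
# $1 is the pid; $2 is the signal name (USR1 unless another one was configured).
omx_dump_state() {
    kill -s "${2:-USR1}" "$1"
}

# Usage with the non-debug library (./my_app is a placeholder):
#   OMX_DEBUG_SIGNAL=2 ./my_app &
#   omx_dump_state $!
```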

How can I see/check the driver configuration?

The driver configuration depends on many static/dynamic configuration parameters (See Advanced Configuration). To dump this configuration, you may read from the device file:

$ cat /dev/open-mx
Open-MX 0.9.2
 Driver ABI=0x151
 Configured for 32 endpoints on 32 interfaces with 1024 peers
[...]

This output may also be reported by the startup script:

$ omx_init status

What does "Failed to create user region" mean?

 Open-MX: FatalError: Failed to create user region 4, driver replied Bad address
 omx_misc.c:86: omx__ioctl_errno_to_return_checked: Assertion `0' failed.

This fatal error means that the application passed an invalid buffer to Open-MX. So the Open-MX driver failed to pin the buffer in physical memory when starting a large message.

It is very similar to a segmentation fault (an actual access to the buffer would have caused a fault). The application needs to be fixed, and returning an error would not help much, so Open-MX just aborts.

What if endpoint opening fails with "No resources available in the system"?

 mx_open_endpoint failed: No resources available in the system
 vmap allocation for size 16912384 failed: use vmalloc= to increase size.

Open-MX requires a large amount of vmalloc memory (about 16MB per endpoint by default) for maintaining shared rings between user-space and the kernel for sending and receiving. This requirement might be problematic on 32-bit machines, where the vmalloc address space is small.

If performance is not important, passing --with-mtu=1500 to configure will reduce memory requirements (thanks to smaller ring entries). Otherwise, passing --with-shared-ring-entries=128 will divide the number of slots per ring by 8 and thus significantly reduce vmalloc needs, but it may also slightly hurt performance under high packet rates. The last solution consists in increasing the pool of vmalloc'able memory in the kernel with the vmalloc parameter on the kernel boot command line, as shown in the above message.
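To get a rough idea of how much vmalloc memory is needed, the per-endpoint size from the error message can be multiplied by the number of endpoints opened on the node. The sketch below assumes 8 endpoints (an arbitrary illustrative number) and compares the result with what the kernel reports in /proc/meminfo.

```shell
per_endpoint=16912384   # bytes, taken from the error message above
endpoints=8             # hypothetical number of endpoints opened on the node

needed_kb=$((per_endpoint * endpoints / 1024))
echo "Estimated vmalloc need: ${needed_kb} kB"

# Compare with the vmalloc pool reported by the kernel, when available.
if [ -r /proc/meminfo ]; then
    grep -E '^Vmalloc(Total|Used)' /proc/meminfo || true
fi

# If the pool is too small, it may be enlarged on the kernel boot
# command line, for instance with vmalloc=256M (illustrative value).
```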

What if a message fails because an endpoint is unreachable?

 Open-MX: Send request (seqnum 713 sesnum 0) timeout, already sent 41 times, resetting partner status
 Open-MX: Cleaning partner 00:00:00:00:00:00 endpoint 0
 Open-MX: Completing send request: Remote Endpoint Unreachable

Open-MX completes send requests when they are acknowledged by the receiver. If a message is dropped by the fabric or the receive stack for some reason, it will be resent periodically by the sender (every half second by default). Once too many resends have been tried (1000 by default), the sender will consider that the target endpoint is unreachable. All pending messages for this target are thus aborted with an erroneous status.
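With the defaults quoted above, the delay before a peer is declared unreachable can be estimated as follows. This is only a back-of-the-envelope sketch which assumes that resends are evenly spaced. The variable names are illustrative.

```shell
resends_max=1000          # default OMX_RESENDS_MAX
resend_interval_ms=500    # "every half second" per the text

timeout_s=$((resends_max * resend_interval_ms / 1000))
echo "Peer declared unreachable after about ${timeout_s} seconds"
```

Lowering OMX_RESENDS_MAX shortens this delay accordingly, at the cost of giving up earlier on slow or overloaded peers.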

The main reason for getting such an error is a node failure. If the whole machine and system fails, packets cannot be received at all. Note that a process crashing does not cause this problem since its system would still be alive and would thus report Endpoint Closed instead.

Another reason for considering an endpoint as unreachable is when the target process failed to ack in time. It could mean that the machine is severely overloaded and the system did not provide the process with enough CPU time. It could also mean that the application did not poll Open-MX often enough and thus prevented the ack from being sent. A workaround for this problem is to pass the OMX_NOTACKED_MAX=1 environment variable so that acks are sent as soon as possible (see also What are Open-MX runtime configuration options?).

Miscellaneous

How do I run Platform MPI or HP-MPI over Open-MX?

Platform MPI (formerly known as HP-MPI) was successfully used on top of Open-MX thanks to the following changes in the HP-MPI configuration file:

--- /opt/hpmpi/etc/hpmpi.conf.orig    2009-08-06 11:17:59.000000000 +0200
+++ /opt/hpmpi/etc/hpmpi.conf    2009-08-06 11:18:08.000000000 +0200
@@ -105,11 +105,11 @@
 
 # the expected way to get MX
 MPI_ICLIB_MX__MX_MAIN = libmyriexpress.so
-MPI_ICMOD_MX__MX_MAIN = "^mx_driver "
+MPI_ICMOD_MX__MX_MAIN = "^open_mx "
 
 # full path to mx in case ld.so.conf isn't set up
-MPI_ICLIB_MX__MX_PATH = /opt/mx/lib/libmyriexpress.so
-MPI_ICMOD_MX__MX_PATH = "^mx_driver "
+MPI_ICLIB_MX__MX_PATH = /opt/open-mx/lib/libmyriexpress.so
+MPI_ICMOD_MX__MX_PATH = "^open_mx "
 
 # -------- GM ------------------------
