Building the future Internet

So we have now spawned Dwellir, a company founded by a team of dreamers, technowizards, cryptocoiners and myself.

We join ranks with many others building the future of the Internet. We ride the tsunami of blockchain-backed enterprises, and our vision aligns with the idea of a decentralized internet: transparent, free, privacy focused, adventurous and fun.

Some details on Dwellir

The legal entity Dwellir is based in Sweden, a country known for its robust social and legal systems. Sweden has a lagging legal framework for blockchain businesses, but we can deal with all of that – as long as we don’t have to manage corruption, dictators, political revolutions and the mafia.

Our markets, partners and businesses are global, so Dwellir is already operating as such. We are geared and experienced to handle this perfectly.

Dwellir spawned organically from the founders’ long-term commitments to a free and decentralized internet, parties, hacks, adventures and laughs. It became real when we grasped the opportunity to merge a few interesting projects related to the development of the Internet.

We now operate multiple sites with the absolute core functionality needed of a blockchain infrastructure provider. That means a lot of security, availability features, performance, monitoring, development pipelines and optimizations to our operations. It’s the boring stuff that forms the foundation for future projects that need all of this in place.

Worth mentioning is that Dwellir doesn’t run core systems in public clouds. The massive privacy and integrity issues associated with public clouds sit badly with our mission. It’s also what our community, partners and clients expect from us, and something we intend to honor. We do use public clouds for tests and zero-exposure development where that fits well.

Dwellir is 100% financed by the original founders. That gives us a lot of freedom to build smart.

Initial days

The first steps with Dwellir were taken long ago. We spent a long time raising infrastructure, experimenting with technology around blockchains and building up datacenters located in Sweden. Some of them are now used as “contingency plans” to secure operations long term. Later locations were intentionally established with robustness, security and privacy in mind.

Dwellir is in the business of trust so we don’t take shortcuts.

One of Dwellir’s colocation datacenters is located in Pionen, a former nuclear shelter that is not only as cool as it sounds but is also known for its very privacy-oriented operator. This is exactly in line with our ambitions for Dwellir.

We recently managed to secure a place on the Kusama 1KV network. It was an important preparation for the Polkadot validator program, which we are now preparing for. We are ready for that step and excited to get up and running.

All this hard work and preparation allows us to rely entirely on our own capabilities to support the networks going forward.

Next step

Our next projects are directed towards a few different networks, with a focus on stability for our business and operations: Polkadot, Kusama and Ethereum. While there are many other interesting projects, these networks are all very attractive from a Dwellir mission perspective. They all:

  • Share our vision to decentralize the future.
  • Provide a stable enough technology baseline to continue long term survival.
  • Have an interesting enough financial upside for sustainable operations.
  • Have very strong communities which we love to be part of.

This approach should also be interesting to larger institutional players, which is what we aim for in a few years.

Future focus

Resting on stable infrastructure and technology, the startup of RPC nodes for Polkadot, staking nodes for Ethereum, and continued operation of validators on Kusama are examples of work about to be finalised. Our capacity is high, so this is all very achievable, and there is still room for more future-oriented projects which we will be able to reveal in due time.

This development happens at the same time as we shift focus towards the business and finance side of Dwellir in the near future. We are about to secure a few very interesting contracts with like-minded partners, which will give us even more muscle to scale up.

In the very same way as we built our technology, we are dead set on building an equally strong financial and business baseline. Our financial situation allows us to pursue the right partners, which is good for Dwellir long term.

Things are looking promising at the moment, so with hard work and some luck we could see Dwellir rising to become one of the significant players in the deep crypto space.

Transformation in the workings of HPC

Waking up to the world of services and clouds, HPC systems are in the middle of a massive transformative shift. Compute-thirsty AI, data analytics and advancements in physics drive much of this transformation. This article is a personal reflection on that shift, which also intersects with my professional work.

The legacy systems that we are transforming look like this:

  • Can rarely run more than one operating system.
  • Are difficult to re-deploy or rebuild.
  • Are difficult to tune and optimize during the lifetime of their operation.
  • They are difficult to re-purpose for specific or temporary needs.
  • They do not allow for much exploration of new setups or re-configurations.
  • They do not offer good programmatic access, which makes modern automation hard.
  • They are defined by hardware dependencies or requirements, which makes virtualization and containerization difficult objectives to reach.
  • The systems are difficult to secure while at the same time maintaining some flexibility, access and collaboration.
  • The systems are hard to split into smaller resource pools available to separate tenants.
  • They are difficult to integrate with public clouds and their services.
  • Many tools in HPC are not built in a cloud context which makes adoption of modern cloud oriented tools complicated.

You will argue that some advanced HPC systems are already running on OpenStack (CERN, for example) to mitigate some of the problems above. I agree. However, I find these problems coming back higher up in the software layers. That is, deploying any modern software stack on top of an HPC system, getting it up to speed in a short time and performing well, is not trivial even with access to the excellent primitives of an OpenStack IaaS.

In fact, the deployment and operational problem of higher order systems is monumental from a general perspective.

General problems require general solutions.

Partial solutions come from AI and GPU computing, these days often harbored in Docker and Kubernetes, where advanced setups utilize GPUs and high-speed interconnects in very innovative ways. Data analytics derives its problem solving from Spark, Hadoop and the like. I see these areas as drivers to develop traditional HPC systems in a direction that embraces and enables all these software systems in a new generation of HPC systems. But that also requires a proper solution to the general problem of managing the tsunami of software systems and their clouds.

I’ll make an attempt to break out some of my design targets for a next-generation HPC system that addresses the general problems listed above:

  • Built fundamentally as a cloud. Perhaps taking the NIST standard as a reference with an added hardware layer (which is not covered by the standard).
  • Able to deploy advanced software stacks on top of generally usable HPC, specifically to meet needs from GPU computing and I/O intense workloads.
  • Capable of building any software stack fast and tearing it down equally fast; this is a key feature.
  • Able to draw from and integrate with public cloud resources. It must be able to do so, because the wealth of services and solutions in general computer science is never going to be covered by your private clouds, and you can’t solve all problems alone.
  • The technologies must allow for programming the complete infrastructure setup. In practice, this means that deployed services must provide proper API primitives to allow programmers to manipulate and control the infrastructure as a whole. Managing software manually is dead, all hail our robot overlords.
  • Promotes and uses open standards when selecting APIs and data formats to avoid “standards lock-in”.
  • Adopts the “subscriber” model of consumption of resources. Users are not “owners” but “subscribers”. This means that the smallest entity of access and accounting is the “user account”.
  • Packaging of subscriptions into “services”. This is what we can learn from public clouds and a great way to get organized around priorities, costs and revenues.
  • Able to track consumption of resources within the supplied services.
  • It should explicitly focus on collaboration (sharing) across projects and IT environments. Working with distributed, remote and cross-disciplinary teams requires this. Your technology needs to support this way of working and must accelerate it.
  • It should have a security model that enables delegated access but at the same time provides transparent access and tenant isolation per default. Security is for real.
  • All things open source.

Now, while the list above is not complete, a system with the above characteristics probably makes for a successful compute center of the future across many disciplines.

However, change can be destructive. It’s worth a final remark:

The philosopher Alan Watts says that people who feel joy and pleasure in what they do will, as a side effect, become artisans in their profession.

I reflect on this a lot, as change can be hurtful. It is important to me to provide a growing and joyful experience for the people affected by change. Working with modern tools and methods in a computer science context enables professionals to develop excellence and become artisans in their work, but ultimately transformation is a human process.

Technology is secondary.

Tutorial: Relations with Juju

This post teaches you how to build juju charms with relations by studying two example charms made for this purpose. I have worked professionally with juju for many years, and I’ve always thought juju relations deserved more tutorials to help beginner-level programmers get started with this extremely powerful tool. You can find other tutorials I’ve written in the open source community on https://discourse.juju.is

Difficulty: Intermediate

What you will learn

This tutorial will teach you how to implement a simple relation in juju. We will use two existing charms that implement a master-worker pattern and study that code for reference.

Get the code here: git clone https://github.com/erik78se/masterworker

You will:

  • Learn what a relation is and how to use it in your charms.
  • Learn more about hooks and how hook-tools drives relation data exchanges.
  • Learn about relation-hooks and when they run.
  • Learn how to debug/introspect a relation with hook-tools.

Preparations

  • You need a basic understanding of what juju, charms and models are.
  • You should have gone through the getting started chapter of the official juju documentation.
  • You need to be able to deploy charms to a cloud.
  • You have read: the lifecycle of charm relations
  • You have read: charm writing relations.
  • You need beginner level python programming skills.

Refreshing our memory a little

Some key elements of juju are worth mentioning before we dig into the code.

Juju hook-tools

When working with relations and juju in general, what goes on under the hood are calls to juju hook-tools.

You can see what hook-tools are available via:

juju help hook-tools

Two specific hook-tools are of high importance when working with juju relations:

relation-get & relation-set

Those are the primary tools when working with juju relations because they get/set data in relations.

Important: You set data on the local unit, and get data from remote units.
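As an illustration only (the key name my-key is made up for this sketch, and the JUJU_* environment variables used here are described in the next section), a relation hook could exchange data with these tools roughly like this:

#!/bin/bash
# Publish a value on the local unit's side of the relation
relation-set -r "$JUJU_RELATION_ID" my-key="some value"

# Read what the remote unit that triggered this hook has published
remote_value=$(relation-get -r "$JUJU_RELATION_ID" my-key "$JUJU_REMOTE_UNIT")
juju-log "Remote unit $JUJU_REMOTE_UNIT published: $remote_value"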

Hooks, their environment & context

Hook-tools are normally executed from within a ‘hook’, where environment variables are set by the Juju agent depending on the context/hook. These variables are used when writing code for relations.

The example below shows logging of some of those environment variables.

#!/bin/bash
juju-log "This is the relation id: $JUJU_RELATION_ID"
juju-log "This is the remote unit: $JUJU_REMOTE_UNIT"
juju-log "This is the local unit: $JUJU_UNIT_NAME"

Charmhelpers

When building charms with Python, the python package charmhelpers provides a set of functions that wraps the hook-tools. Charmhelpers can be installed with

pip install charmhelpers

Here is the documentation: charmhelpers-docs

Installing charmhelpers for use within your charm could be part of your install hook, or even better, it can be cloned into the “./lib/” directory of your charm, making it part of your charm software.

Cloning charmhelpers into your charm is good practice since it isolates your charm’s software requirements from other charms that may live on the same host.
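A minimal sketch of how that vendoring could be done, assuming pip is available on the machine where you build the charm; your hooks then only need to have "lib" on sys.path before importing charmhelpers:

# From the charm's root directory, install charmhelpers into ./lib/
pip install --target ./lib charmhelpers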

Feeling all refreshed on juju basics, let’s now introduce the “master” and “worker” charms.

Master worker

Clone the “masterworker” repo to your client.

git clone https://github.com/erik78se/masterworker

The repo contains:

├── bundle.yaml         # <--- A bundle with a related master + 2 workers
├── master              # <--- The master charm
├── worker              # <--- The worker charm
├── ./lib/hookenv.py    # <--- Part of charmhelpers

The idea here is that:

  • The master is a single unit, whereas the workers can be many.
  • The master sends some unique information to the individual worker units.
  • The master sends some common information to all worker units.
  • The workers don’t send (relation-set) any information at all.

This pattern is useful in a lot of situations in computer science, such as when implementing client-server solutions.

Let’s deploy the master and two workers so we can see how it looks and how the charms are related.

juju deploy master
juju deploy worker -n 2
juju relate master worker

Note: You could of course deploy the bundle instead:

juju deploy ./bundle.yaml
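For reference, a hedged sketch of what such a bundle could contain (the endpoint names are taken from the metadata shown later; the exact contents of the repo’s bundle.yaml may differ):

applications:
  master:
    charm: ./master
    num_units: 1
  worker:
    charm: ./worker
    num_units: 2
relations:
- - master:master-application
  - worker:master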
(Screenshot: masterworker-deployed.png)

Implementation

So, let’s go through the steps required to produce the relation between these charms.

The first step in implementing the relation between two charms is defining the relation endpoint for each charm and its interface name. This is done in metadata.yaml.

Step 1. Define an endpoint and select an interface name

A starting point when creating a relation charm is to modify the metadata.yaml file. We do this for both master and worker, since they have different roles in the relation.

The endpoints for the master and worker are defined as below.

master/metadata.yaml

provides:                # <--- Role
  master-application:    # <--- Relation Name
    interface: exchange  # <--- Interface name 
    limit: 1

worker/metadata.yaml

requires:                # <--- Role
  master:                # <--- Relation name
    interface: exchange  # <--- Interface name

The interface name must be the same for the master and worker endpoints, or juju will refuse to relate the charms.

Step 2. Decide what data to pass

As described above, the master is the only party in the relation that sets information on our invented exchange interface. It sends two pieces of data to the workers:

  1. worker-key for each unique worker. The worker-key is created by the master.
  2. message from the master to all the workers.

So, this data is all we will “get/set” in the relation.

This is all done as part of the “relation hooks” that we will look into now.

Step 3. Use the relation hooks to set/get data.

Let’s follow the events after the call to juju relate:

juju relate master worker

What happens now is that Juju triggers a specific set of hooks, called “relation hooks”, on all units involved in the relation. The picture below shows how these hooks are called and in what order when a relation is formed.

(Diagram: juju-hook-state-machine.png)

The master sets data in master-application-relation-joined.

The worker gets data in master-relation-changed.

A best practice here is to use relation-joined and/or relation-created to set initial data, and relation-changed to retrieve it, just as we have done in the master and worker charms.

The reason for this is that in relation-created or relation-joined we can’t know whether the other end of the relation has set any relation data yet.

Only a few relation keys (such as the remote unit’s ‘private-address’) are available at these early stages (available in relation-joined), and it’s not until relation-changed that the data set by the other side should be expected to be available.

Apart from these considerations, everything we do to manage data goes via “relation-set” and “relation-get”.

Now, let’s look a bit closer at how the master sends out data that is unique to each worker unit.

Communicating unit unique data

Data exchanged on juju relations is a dictionary.

So, to pass individual data to workers, the master creates a composite dictionary key, made up of the joining remote unit name plus the key name, and relation-sets data for that composite key.

./master/hooks/master-application-relation-joined

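    # Excerpt from the hook. The helpers used below (log, relation_get,
    # relation_set, relation_id) come from charmhelpers' hookenv module
    # (vendored in ./lib/), and generateWorkerKey() is defined elsewhere
    # in the charm.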
    log(" ========= hook: master-application-relation-joined  ========")

    # Generate a worker-key
    workerKey = generateWorkerKey()

    # Get the remote unit name so that we can use that for a composite key.
    remoteUnitName = os.environ.get('JUJU_REMOTE_UNIT', None) # remote_unit()

    # Get the worker remote unit private-address for logging
    workerAddr = relation_get('private-address', unit=remoteUnitName)

    log(f"Joined with WORKER at private-address: {workerAddr}")

    # Assemble the relation data.
    relation_data = { f"{remoteUnitName}-worker-key": workerKey }

    # Set the relation data on the relation.
    relation_set(relation_id(), relation_settings=relation_data )

The worker accesses its individual ‘worker-key’ in the master-relation-changed hook:

./worker/hooks/master-relation-changed

    log(" ========= hook: master-relation-changed  ========")
    
    localunitname = os.environ['JUJU_UNIT_NAME']

    ## If we have data that belong to this unit
    if relation_get(f"{localunitname}-worker-key"):

        # Get the worker-key with our unit name on it, e.g.: 'worker/0-worker-key'
        workerKey = relation_get(f"{localunitname}-worker-key")
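At this point the worker would typically also persist the key to disk; a minimal sketch of that step, assuming the WORKERKEY.file referenced in the departing section below is written right here:

        # Persist the key for the workload on this host
        # (this is the file removed again in master-relation-broken).
        with open("WORKERKEY.file", "w") as keyfile:
            keyfile.write(workerKey)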

Pretty straight forward, right?

Let’s now explore an alternative way to send out a message to the workers from outside the relation hooks.

Triggering a relation-changed hook via a juju action.

Juju takes care of making sure that any change on a relation triggers the hook relation_name-relation-changed on the remote units. We can trigger this from other, non-relation hooks, since we can access the relations by their ids.

Look at the juju action broadcast-message to see how this is achieved:

./master/actions/broadcast-message

# Assume that the first relation_id is the only one and use it.
relation_id = relation_ids('master-application')[0]

# Get the message from the juju function/action
message = function_get('message')

relation_data = { 'message': message }

# ... set the relation data.
relation_set(relation_id, relation_settings=relation_data)

If you run the action ‘broadcast-message’ and watch the “juju debug-log” you will see all units logging the message sent.

juju run-action master/0 broadcast-message message="Hello there"

Look into the relations (debugging)

We will often need to see what goes on in a relation, what data is set, etc. Let’s see how that is done using the hook-tools.

Here we retrieve the relation-ids for the master/0 unit.

juju run --unit master/0 "relation-ids master-application"
master-application:0

Removing and adding back a relation shows how the relation-id changes from master-application:0 to master-application:1

juju remove-relation master worker
juju relate master worker
juju run --unit master/0 'relation-ids master-application'
master-application:1

From the command below, we can see how the worker can access all (-) keys/data on the master/0 unit.

juju run --unit worker/0 'relation-get -r master:1 - master/0'
egress-subnets: 172.31.27.134/32
ingress-address: 172.31.27.134
private-address: 172.31.27.134
worker/0-worker-key: "5914"
worker/1-worker-key: ADA1

From the command below, we can see that on master/0 there is no information from the worker, which is expected. Remember that the workers don’t set any data.

juju run --unit master/0 'relation-get -r master-application:1 - worker/0'
egress-subnets: 172.31.35.128/32
ingress-address: 172.31.35.128
private-address: 172.31.35.128

Individual keys can be retrieved as well with their key names:

juju run --unit master/0 "relation-get -r master-application:1 worker/1-worker-key master/0"
ADA1

Step 4. Departing the relation

The last step when implementing a juju relation is taking care of when a unit departs from the relation. The programmer should:

  1. Remove any relation data associated with the departing unit from the relation dictionary with the relation-set hook tool.
  2. Do whatever is needed to remove a departing unit from the service e.g. perform reconfiguration, removing databases etc.

Let’s walk this through by removing a worker. Follow the events with juju debug-log.

juju remove-unit worker/1

The master (and worker/1) get notified of the event and execute their respective relation-departed hooks.

Departing – as it happens on the master

The master cleans up the relation data associated with the departing (remote) unit.

./master/hooks/master-application-relation-departed

    # Set a None value on the key (removes it from the relation data dictionary)
    relation_data = {f"{remoteUnitName}-worker-key": None}

    # Update the relation data on the relation.
    relation_set(relation_id(), relation_settings=relation_data) 

The master hasn’t done anything else on the host itself, so its duties are complete.

Inspecting the relation will show that the data for worker/1 is gone:

juju run --unit worker/0 'relation-get -r master:1 - master/0'
egress-subnets: 172.31.27.134/32
ingress-address: 172.31.27.134
private-address: 172.31.27.134
worker/0-worker-key: 5914

Departing – as it happens on the worker

On the worker side of the relation, the worker didn’t set any relation data, so it doesn’t have anything to clean up in the relation data.

But, the worker should remove the WORKERKEY.file that it created on the host as part of joining the relation.

This cleanup procedure is placed in the ‘relation-broken’ hook.

./worker/hooks/master-relation-broken

    # Remove the WORKERKEY.file
    os.remove("WORKERKEY.file")

The relation-broken hook is the final state, when the unit is completely cut off from the other side of the relation, as if the relation was never there. It is last in the relation life-cycle and is a good place to do cleanup related to the host or the underlying service deployed by the charm.

If relation-broken is being executed, you can be sure that no remote units are currently known locally. So, on the master, this hook is not run until there are no more workers.

Keep in mind that the “-broken” in the hook name has nothing to do with the relation being bogus or in error. It just means that the relation is “cut”.

Let’s finish up by removing the relation:

juju remove-relation master worker

Inspect the relations and look for the file WORKERKEY.file on the remaining worker units (they are gone!).

You will also see in the juju debug-log that the master has finally run its “relation-broken” hook.

Congratulations, you have completed the tutorial on juju relations!

Running Powerflow on Ubuntu with SLURM and Infiniband

This is a walkthrough of my work on running a proprietary computational fluid dynamics code on the snap version of SLURM over Infiniband. This time, I’ll take you through what it takes to get Powerflow to run on Ubuntu 18.04. If you’d like to try the same thing with STARCCM+, here is a link to a post that takes you through that.

You can use this to perform scaling studies, track down issues and optimize performance, or use it as you like. Much of this will work on other OSes too.

This is the workbench used here:

Hardware: 2 hosts with 2×20 cores 187GB ram.
Infiniband: Mellanox MT28908 Family [ConnectX-6]
OS: Linux 4.15.0-109-generic (x86_64) Ubuntu18.04.4
SLURM 20.04 (https://snapcraft.io/slurm)
OpenMPI: 4.0.4 (ucx, openib)
Powerflow: 6.2019
A Reference model which is small enough for your computers and large enough to run over 2 nodes on your available cores.

I use Juju to deploy my SLURM clusters in any cloud to get up and running. In this case, I use MAAS as the underlying cloud, but this would work on other clouds as well.

Let’s get started.

Modify ulimits on all nodes.

This is done by editing /etc/security/limits.d/30-slurm.conf

* soft nofile  65000
* hard nofile  65000
* soft memlock unlimited
* hard memlock unlimited
* soft stack unlimited
* hard stack unlimited

Modify the slurm systemd unit startup files to make the ulimits permanent for the slurmd processes.

$ sudo systemctl edit snap.slurm.slurmd.service

[Service]
LimitNOFILE=131072
LimitMEMLOCK=infinity
LimitSTACK=infinity

* Restart slurm on all nodes.

$ sudo systemctl restart snap.slurm.slurmd.service

* Make sure login nodes have correct ulimits after a login.

* Validate that all worker nodes also have correct ulimit values when using slurm. For example:

$ srun -N 1 --pty bash
$ ulimit -a

You must have consistent ulimit settings everywhere or things will go sideways. Remember that slurm propagates ulimits from the submitting node, so make sure those are consistent there too.

I’m going to assume you have installed Powerflow on all your nodes at /software/powerflow/6.2019, but you can have it wherever.

Powerflow also needs csh, so install it:

sudo apt install csh

Modify the Powerflow installation

Since Powerflow doesn’t yet support Ubuntu (which is a shame), we need to work around this by fixing a few small bugs to get our simulation running as we want.

Workaround #1 – Incorrect awk path

Powerflow assumes that a few OS commands are located in a fixed location on all OSes, which is of course a bug. The bug is located in the file “/software/powerflow/6.2019/dist/generic/scripts/exawatcher” and causes problems.

To fix this, you either edit the exawatcher script and comment out the references:

#set awk=/bin/awk
#set cp=/bin/cp
#set date=/bin/date
#set rm=/bin/rm
#set sleep=/bin/sleep

… or, as an ugly alternative, you create a symlink to “awk”, which is enough to work around the bug. Hopefully this will be fixed in future versions of Powerflow.

sudo ln -s /usr/bin/awk /bin/awk

This is not needed on OSes such as CentOS 6 and CentOS 7, which have those symlinks in place already.

Workaround #2 – bash is not sh

Powerflow has an incorrect script header, referencing “#!/bin/sh” for code that is in fact bash, which results in a syntax error on Ubuntu.

Replace the #!/bin/sh header with #!/bin/bash in the file: /software/powerflow/6.2019/dist/generic/server/pf_sim_cp.select

That’s really all there is to it. It’s time to run Powerflow through SLURM.

Time to write the job-script

#!/bin/bash
#SBATCH -J powerflow_ubuntu
#SBATCH -A erik_lonroth
#SBATCH -e slurm_errors.%J.log
#SBATCH -o slurm_output.%J.log
#SBATCH -N 2
#SBATCH --ntasks-per-node=40
#SBATCH --exclusive
#SBATCH --partition debug

LC_ALL="en_US.utf8"
RELEASE="6.2019"
hosttype="x86_64-unknown-linux"
INSTALLPATH="/software/powerflow/${RELEASE}"

export PATH="$INSTALLPATH/bin:$PATH"
export LD_LIBRARY_PATH="$INSTALLPATH/dist/x86_linux/lib:$INSTALLPATH/dist/x86_linux/lib64"

# Set a low number of timesteps since we are only here to test
NUM_TIMESTEPS=100

export EXACORP_LICENSE=27007@license.server.com


exaqsub \
-decompose \
-infiniband \
-num_timesteps $NUM_TIMESTEPS \
-foreground \
--slurm \
-simulate \
-nprocs $(expr $SLURM_NPROCS - 1) \
--mme_checkpoint_at_end \
*.cdi

You will probably need to modify the script above for your own environment, but the general things are in there. An important note here is that Powerflow normally needs a separate node to run its “Control Process (CP)” on, with more memory than the “Simulation Processor (SP)” nodes. I’m not taking that into account, since my example job is small and fits in RAM. This is why I also get away with setting:

-nprocs $(expr $SLURM_NPROCS - 1) \

Powerflow will decompose the simulation into N-1 partitions, which, when the simulation starts, leaves 1 CPU free for running the “CP” process. This is suboptimal, but unless we do this, slurm will complain with:

srun: error: Unable to create step for job 197: More processors requested than permitted

There is probably a smart way of telling slurm about a master process which I hope to learn about soon and use to properly run powerflow with a separate “CP” node.

Submit to slurm

Submitting the script is simply:

$ sbatch -p debug ./powerflow-on-ubuntu.sh

You can watch your Infiniband counters to see that a significant amount of traffic is sent over the wire, which indicates that you have succeeded.

watch -d cat /sys/class/infiniband/mlx5_0/ports/1/counters/port_rcv_packets

You can also inspect a status file that Powerflow writes continuously, like this, from the working directory of the simulation:

# Lets look at the status of the simulation
$ cat .exa_jobctl_status
Decomposer: Decomposing scale 6

# ... again
$ cat .exa_jobctl_status
Simulator: Initializing voxels [43% complete [21947662 of 51040996]

# ... and once the simulation is complete.
$ cat .exa_jobctl_status
Simulator: Finished 100 Timesteps

I’ve presented at Ubuntu Masters about the setup I use to work with my systems, which allows me to do things like this easily. Here is a link to that material: https://youtu.be/SGrRqCuiT90

Running STARCCM+ using OpenMPI on Ubuntu with SLURM and Infiniband

This is a walkthrough of my work on running a proprietary computational fluid dynamics code, StarCCM+, on Ubuntu 18.04, using the snap version of SLURM with OpenMPI 4.0.4 over Infiniband.

You can use this to perform scaling studies, track down issues and optimize performance, or use it as you like. Much of this will work on other OSes too.

This is the workbench used here:

Hardware: 2 hosts with 2×20 cores 187GB ram.
Infiniband: Mellanox MT28908 Family [ConnectX-6]
OS: Linux 4.15.0-109-generic (x86_64) Ubuntu18.04.4
SLURM 20.04 (https://snapcraft.io/slurm)
OpenMPI: 4.0.4 (ucx, openib)
StarCCM+: STAR-CCM+14.06.012
A Reference model which is small enough for your computers and large enough to run over 2 nodes.

Let’s get started.

Modify ulimits on all nodes.

This is done by editing /etc/security/limits.d/30-slurm.conf

* soft nofile  65000
* hard nofile  65000
* soft memlock unlimited
* hard memlock unlimited
* soft stack unlimited
* hard stack unlimited

Modify the slurm systemd unit startup files to make the ulimits permanent for the slurmd processes.

$ sudo systemctl edit snap.slurm.slurmd.service

[Service]
LimitNOFILE=131072
LimitMEMLOCK=infinity
LimitSTACK=infinity

* Restart slurm on all nodes.

$ sudo systemctl restart snap.slurm.slurmd.service

* Make sure login nodes have correct ulimits after a login.

* Validate that all worker nodes also have correct ulimit values when using slurm. For example:

$ srun -N 1 --pty bash
$ ulimit -a

You must have consistent ulimit settings everywhere or things will go sideways. Remember that slurm propagates ulimits from the submitting node, so make sure those are consistent there too.

Compile OpenMPI 4.0.4

At the time of writing, this is the latest version. This is my configure, but I think you can compile it differently for your needs.

$ ./configure --without-cm --with-ib --prefix=/opt/openmpi-4.0.4
$ make
$ make install

Validate that openmpi can see the correct mca ucx

What I’m most concerned with in this step is that the ucx pml is available in the MCA for OpenMPI, so after my compilation is done, I check for that and the openib btl.

$ /opt/openmpi-4.0.4/bin/ompi_info  | grep -E 'btl|ucx'

MCA btl: openib (MCA v2.1.0, API v3.1.0, Component v4.0.4)
MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.0.4)
MCA btl: self (MCA v2.1.0, API v3.1.0, Component v4.0.4)
MCA btl: tcp (MCA v2.1.0, API v3.1.0, Component v4.0.4)
MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v4.0.4)
MCA osc: ucx (MCA v2.1.0, API v3.0.0, Component v4.0.4)
MCA pml: ucx (MCA v2.1.0, API v2.0.0, Component v4.0.4)

What we are looking for here is:

* MCA btl: openib (MCA v2.1.0, API v3.1.0, Component v4.0.4)
* MCA pml: ucx (MCA v2.1.0, API v2.0.0, Component v4.0.4)

The rest are not important at this point, but you might know better – if so, please let me know. You can see in the jobscript later where these modules are referenced.

Validate that ucx_info sees your Infiniband device and ib_verbs transports

In my case, I have a Mellanox device (shown with: ibv_devices), so I should see that with ucx_info:

$ ucx_info -d | grep -1 mlx5_0

#
# Memory domain: mlx5_0
#     Component: ib

#   Transport: rc_verbs
#      Device: mlx5_0:1
#

#   Transport: rc_mlx5
#      Device: mlx5_0:1
#

#   Transport: dc_mlx5
#      Device: mlx5_0:1
#

#   Transport: ud_verbs
#      Device: mlx5_0:1
#

#   Transport: ud_mlx5
#      Device: mlx5_0:1
#

Modify the STARCCM+ installation

My version of StarCCM+ uses an old ucx and calls /usr/bin/ucx_info. At some point during startup, it fails when it’s not able to find libibcm.so.1 when using our custom OpenMPI. Perhaps there is a way to force StarCCM+ to look for ucx_info on the system, but I have not found any way to do this.

To have StarCCM+ ignore its own ucx, simply remove the ucx from the installation tree and replace it with an empty directory.

$ sudo rm -rf /opt/STAR-CCM+14.06.012/ucx/1.5.0-cda-001/linux-x86_64*
$ mkdir -p /opt/STAR-CCM+14.06.012/ucx/1.5.0-cda-001/linux-x86_64-2.17/gnu7.1/lib

This is not needed on OSes such as CentOS 6 and CentOS 7, because they provide the deprecated library libibcm.so.1.

Time to write the job-script

#!/bin/bash
#SBATCH -J starccmref
#SBATCH -N 2
#SBATCH -n 80
set -o xtrace
set -e

# StarCCM+
export PATH=$PATH:/opt/STAR-CCM+14.06.012/star/bin

# OpenMPI
export OPENMPI_DIR=/opt/openmpi-4.0.4
export PATH=${OPENMPI_DIR}/bin:$PATH
export LD_LIBRARY_PATH=${OPENMPI_DIR}/lib

# Kill any leftovers from previous runs
kill_starccm+

export CDLMD_LICENSE_FILE="27012@license.server.com"
SIM_FILE=SteadyFlowBackwardFacingStep_final.sim
STAR_CLASS_PATH="/software/Java/poi-3.7-FINAL"
NODE_FILE="nodefile"

# Assemble a nodelist using this python lib
hostListbin=/software/hostlist/python-hostlist-1.18/hostlist
$hostListbin --append=: --append-slurm-tasks=$SLURM_TASKS_PER_NODE -e $SLURM_JOB_NODELIST >  $NODE_FILE
# Start
starccm+ -machinefile ${NODE_FILE} \
         -power \
         -batch ./starccmSim.java \
         -np $SLURM_NTASKS \
         -ldlibpath $LD_LIBRARY_PATH \
         -classpath $STAR_CLASS_PATH \
         -fabricverbose \
         -mpi openmpi \
         -mpiflags "--mca pml ucx --mca btl openib --mca pml_base_verbose 10 --mca mtl_base_verbose 10"  \
         ./SteadyFlowBackwardFacingStep_final.sim
# Kill off any rogue processes
kill_starccm+

You will probably need to modify the script above for your own environment, but the general things are in there.

Submit to slurm

You want this job to run on multiple machines, so I use -n 80 to allocate 2×40 cores, which slurm understands as the two nodes I have used in this example. If you have fewer or more cores than I do, use a 2×N number in your submit.

$ sbatch -p debug -n 80 ./starccmubuntu.sh

You can watch your Infiniband counters to see that a significant amount of traffic is sent over the wire, which indicates that you have succeeded.

watch -d cat /sys/class/infiniband/mlx5_0/ports/1/counters/port_rcv_packets

I’ve presented at Ubuntu Masters about the setup I use to work with my systems, which allows me to do things like this easily. Here is a link to that material: https://youtu.be/SGrRqCuiT90

Here is also another similar walkthrough of doing this with the CFD application Powerflow: https://eriklonroth.com/2020/07/15/running-powerflow-on-ubuntu-with-slurm-and-infiniband/

Workshop open source for the future

Come join the Edgeryders global festival and create an open source future for the coming generations of internet citizens.

Our theory of change is building dense networks around people trying to tackle messy social-ecological, economic and political challenges. Having a dense network around you is one of the hallmarks of social capital, and gives you access to expertise, resources, skill sharing and financing. In a dense network, there are many ways for you to get from one point to another. A key challenge here is for people with aligned interests to find each other, and as a community we’re out to solve just that.

If you, like me, think open source matters – exceedingly much, in a digital world – this is something for you.

An English version of the workshop description is also found here: Teaching Teachers Open Source


https://festival.edgeryders.eu/

When: 28 November, 10:00–17:00
Where: Södra Hamnvägen 9 (Hus Blivande, directions here)
How: Lunch and coffee are included.

You get your ticket like this:

  1. Visit https://tell.edgeryders.eu/festival/ticket and fill in the form.
  2. Create an Edgeryders account and introduce yourself in the forum.
  3. Once you have introduced yourself in the forum, a ticket will be sent to you via email.

Background to the workshop

The Internet, computers and programming immediately become part of our children’s lives today. Coming generations will never have experienced a world without the Internet and computers.

In the education system, they are introduced to tools, ideas and concepts derived from the digital world. It is in school that we shape much of our world view, and today most of it is closed, even though it ought to be open.

What really happens to knowledge in a digital future if a closed view of software and the Internet is left unchallenged in school?

Free and open source software is characterized by transparency, collaboration, freedom and empowerment: established cornerstones of our education system. So why don’t we practice what we preach in this area, which will fundamentally shape our children?

The hypothesis is that the school world lacks depth and knowledge of how free and open source code plays a central role in how knowledge of the digital world works. Educating teachers and leaders in this area is how we help the coming generations shape the next generation of Internet citizens.

If you are already active in the school world, your participation is welcome, and if you come from an open source background and the software world, your knowledge and experience are central to shaping the content.

What will you experience?

You will get the chance to learn about the cornerstones of free & open source software and how they are fundamental to knowledge in a digital world.

You will have the opportunity to share your knowledge of schools, software, or both, and to take part of what others share.

You will get to collaborate with experts and enthusiasts on how we reach out to the school system.

You will meet people who share the same burning interest in shaping coming generations’ Internet, in an exciting setting.

Magic with Juju

I’ve been working on a set of tutorials over the last few months on how to write “juju charms”.

If you haven’t checked out Juju yet – do it now.

Juju is an open source framework to operate small and large software stacks in any cloud environment. It works fairly transparently on well-known clouds like AWS, Google Compute Engine, Oracle Cloud, Joyent, RackSpace, CloudSigma, Azure, vSphere, MAAS, Kubernetes, LXD and even custom clouds!
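As a small taste, here is what a typical session could look like; this is a sketch only, and the cloud and charm names are just examples that assume you have credentials configured:

juju bootstrap aws mycontroller
juju add-model demo
juju deploy wordpress
juju deploy mysql
juju relate wordpress mysql
juju status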

I touched base with Juju some 5 years ago and immediately understood that this was going to make an impact in the computer science domain. To me, the primary reasons were:

  1. Open Source all the way.
  2. No domain specific languages required to work with it. E.g. use your preferred programming language to do your Infrastructure As Code / DevOps.
  3. Smart modular architecture (charms) that allows re-use and evolution of best solutions in the juju ecosystem.
  4. “Fairly” linux agnostic – although the community is heavy on Ubuntu.
  5. Cloud agnostic – you can use your favorite public cloud provider or use your own private cloud, like an OpenStack or even a VMware environment. A huge benefit to anyone looking to move into the cloud.
  6. Robust and maintained codebase backed by Canonical – if enterprise support matters to you. There are other companies, like OmniVector Solutions, also delivering commercial services around Juju if you don’t like Canonical. [Disclaimer: I’m biased in my affiliation with OmniVector]
  7. A healthy, active & friendly open source community – if you, like me, think a living open source community is a key factor to success.

The above factors weigh in heavily for me, and even though there are a few competitors to Juju, I think you can’t really ignore it if you are into DevOps, IaC, CI/CD, clouds etc.

Juju means “magic”, which is an accurate description of what you can do with this tool.

So, here are a few decent starting points from my collection for getting started with juju. Knock yourself out.

The juju community

Getting started with juju.

Tutorial “Getting started with juju hooks”.

Tutorial “Getting started with reactive charms”

Tutorial “Charms with snaps”

Massive open source win!

Now, this is good news!

The OpenChain project (backed by The Linux Foundation) helps companies all over the world to be compliant with good practices for dealing with open source. In effect, it goes to the fundamentals of how they do business together within the software realm.

On December 6, 2018, a large number of major industry players made a massive joint announcement regarding this project!

I’ve been working for over 10 years, slowly pulling my employer Scania towards this path, and now we are there! Have a look at the map and you will also see the global reach. Given the massive support from huge players in many domains, it’s foreseeable that most companies will follow suit, or will need to; I’m helping many more of them through the process.

What does this mean to you?

If you work for any software company and want to sell to or do business with an OpenChain-compliant company in the future, you need to show that your company follows the OpenChain standard, or you won’t be able to do business with us. The only exception will be if none of your code is open source, which in our world is more or less impossible.

OpenChain can be seen as somewhat similar to an ISO certification for software. This change will need a transition period of course, where exceptions will be needed to sort out the proprietary mess, but we are getting there.

This framework will, in time, affect all software businesses, probably all over the world. The initiative comes not from politics, but from a strong and healthy private sector, and it is a massive win for open source across the whole world. As the chairman of open source at Scania, and its representative within the Volkswagen group (640,000 employees), I’m more than happy to announce this message!

Now, go open source! Share this message to everyone that works with software.

Develop or die

I’ve been the technical lead for an HPC center for over a decade now, and for some 15 years I have watched paradigms come and go within the systems engineering domain. This text is about the future development of HPC in relation to the more general software domain. There are some very tough challenges ahead, and I’ll try to explain what is going on.

But first, let me get you in line of thinking with a bit of my own story.

Turning forty this year, exploring and experimenting within computer science has landed me in a very peculiar place. People reach out for advice in very difficult matters around building all sorts of crazy computer systems. Some set out to build successful businesses, others just want to build robots. I help out as much as I can, although my time is normally just gone. It’s a weird feeling to be referenced as senior. Where did those years go?

My experience in building HPC systems started in the early 2000s, and from a computer science perspective I’ve tried to stay aligned with developments in the field. This is never easy in large organizations, but having an employer bold enough to let me develop the field has made it possible not only to deliver enterprise-grade HPC for over a decade, but also to develop some other parts of it in exciting ways.

It didn’t always look like this of course.

I’ve been that person constantly promoting and adopting “open source” wherever I’ve gone. A concept that, at least in automotive (where I roam mostly), was pariah (at best) in my junior years. I came in contact with open source and Linux during my university studies and was convinced open source was going to take over the world. I could see Linux dominating the data centers from the single fact that the licensing model would allow it. There are of course tons of other reasons for that to happen, but the licensing model in itself was decisive.

I’m sorry, but a license model that doesn’t scale better than O(n) is just doomed in the face of O(1).

(I think I’m a nightmare to talk to sometimes.)

Anyway, telling my story of computer science and systems engineering can’t be done without reflecting on some of the technology elements and methodologies that came about during the past decades. After all, important projections on systems engineering can’t be made without referencing technology paradigms.

In the mid-to-late 1990s, “Unix sysadmins” ruled the domain of HPC. Enterprise systems outside of HPC were Microsoft Windows. For heavyweight systems, VMS and UNIX-dialect systems prevailed. Proprietary software was what mattered to management (still is?), and everything was home-rolled. The concept of getting “locked in” by proprietary systems was poorly understood, which was reflected in a poor grasp of real costs and of what the ability to build high-quality systems consisted of. Building HPC systems at this time was an art form. Web front ends didn’t exist. Most people were self-taught and HPC was very different from today. My first HPC cluster (for a large company) lived in wooden shelves from IKEA in a large wardrobe. The CTO of the IT division was a carpenter. I’ll refer to this era as the “Windows/Unix Era”.

The Windows/Unix Era started to morph during the 2000s into the Linux revolution. In just a few years, available software and infrastructure (and not only in HPC) was getting more and more Linux. Other OS dialects such as HP-UX, VMS, AIX, Solaris and mainframe ended up in phase-out, witnesses of an era slowly disappearing. Linux on client desktops entered the stage, and at this point the crafting of HPC systems was becoming an art of automation. Macs got thinner.

Sysadmin work meant, in practice, scripting: bash, ash, ksh etc., plus cfengine, Ansible and Puppet. Computer science was starting to become quasi-magic (which is how I think most people perceive computers in general), and advanced provisioning systems became popular. Automation!

I was myself using Rocks Clusters for my HPC platform, with some Red Hat Satellite assistance to achieve various forms of automation. The still ever-popular DevOps appeared, mainly in the web services domain, and some people started to do Python in favor of bash (OMG!). Everything was now moving towards open source. It became impossible to ignore this development, also at my primary employer. I became the chairman of the open source forum and slowly started to formalize that field in a professional way. A decade behind giants such as Canonical, IBM or Red Hat, but hey – automotive is not IT, right?

However, time and DevOps development stood still in the HPC domain. I blame the performance loss in virtualization layers for that situation, or perhaps HPC being a victim of its own success. While VMware, Xen, KVM etc. filled up the data centers in just a few years, virtualization never took off in HPC. Half a decade passed by like this, and so-called “clouds” started to appear.

I personally hate the term cloud. It blurs and confuses the true essence, which is perhaps better put as: “someone else’s computer”.

Recent developments in the cloud have exposed one of the core problems with cloud resources. I’m not going to make the full case here, but generally speaking, the access, integrity, safety and confidentiality of any data are almost impossible to safeguard if you are using someone else’s computer. There is a fairly easily accessible remedy, which I’ll throw up a fancy quote for:

“Architect distributed systems, keep your data local & always stay in control over your own computing. Encrypt”. 

I think time will prove me right, and I’ve challenged that heavily by taking an active part in a long-running research project, LIM-IT, that sets out to map and understand lock-in effects of various kinds. The results from this project should be mandatory education for any IT manager that wants to stay on top of their game in my industry.

From a technology point of view, I can recommend looking at projects like Nextcloud, OpenStack, BitTorrent, etc. But there are many, many others that operate with this great mindset.

So, where are we now? First of all, HPC is in desperate need of ingesting technology from other domains, especially low-hanging fruit like that from the Big Data domain. Just to take one conceptual example: making use of TCP/IP communication stacks to solve typical HPC problems. One of my research teams explored this two years ago by implementing a pCMC algorithm in Spark as part of a thesis program. The intention was to explore and prove that HPC systems can indeed serve as excellent hybrid solutions to tackle typical data analytics problems, and vice versa. I’m sure MPI doctors will object, but frankly, that’s a lot of “Not Invented Here” syndrome. The results from our research thesis spoke for themselves. Oh, and the code can be downloaded here under an open source license so you can try it yourself. (Credits to Tingwei Huang and Huijie Shen for excellent work on this.)

Now, there is a downside to everything, and so also when opening up for a completely new domain of software stacks. Once you travel down the path of revamping your HPC environment with new technologies from, for example, “cloud” (oh, I hate that word) and going all in on “Everything as a Service”, you are basically faced with an avalanche of technology: Hadoop, Spark, ELK, OpenStack, Jenkins, K8s, Docker, LXC, Travis, etc. All of these stacks require years, if not decades, to learn. There exist few, if any, people who master even a few of these stacks, and their skills are highly desired on a global market for computer wizards. Even more true is that they’re probably not on your payroll, because you just can’t afford them.

So, as a manager in HPC, or in some other sufficiently complex IT environment, you face a Gordian knot of a problem:

“How do I manage and operate an IT-environment of vast complexity PLUS manage to keep my cost under some kind of control?”

Mark Shuttleworth talks brilliantly about this in his keynotes nowadays, by the way, which I’m happy to repeat.

Most managers will fail facing this challenge (unfortunately) and they will fail in three ways:

  1. Managers in IT will fail to adopt the right kind of technologies and get “locked in” – through data lock-in, license lock-in, vendor lock-in, technology lock-in or all of them – ending up in a budget disaster.
  2. Managers in IT will fail to recruit skilled enough people, or too many with the wrong skill set, and they will fail to deliver relevant services to the market.
  3. Managers in IT will do both of the above, delivering sub-performing services too late to market, at extreme costs.

If I’m right in my dystopian projections, we will see quite a few companies within IT go down in the coming years. The market will be brutal to those failing to develop. Computer scientists will be able to ask for fantasy salaries and there will be a latent panic in management.

I’ve spent a significant amount of time researching and working on this very challenging problem, which is general in its nature but definitely applicable to my own expert domain of HPC. In a few days, I’ll fly out and get the opportunity to present a snapshot of my work to one of Europe’s largest HPC centers, HLRS in Germany. It’s happy times indeed.

My native ipv6 – part1

So, I’ve spent about two years arguing with my network provider Fibra to enable my high-speed internet for native ipv6. A struggle that paid off in January 2018. I was kindly offered to be included in their PoC for ipv6, and I was thrilled.

Thank you Fibra for finally coming around on this.

I’m no guru on ipv6, but I feel comfortable navigating the fundamentals and have been running ipv6 tunnels for many years. This is my, hopefully short, blog series on getting my native ipv6 working at home.

My ambition at this point is to have a public ipv6 address assigned from my network provider, along with a routed 48-bit network prefix that will be split into 64-bit prefix delegations for my home network.

This seems like a very basic setup at the time of writing and suits my ambitions for now. Later, I will be running separate ipv6 prefixes on my other virtual environments, while keeping the home network separate.
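For reference, and only as a hedged sketch (my PoC has not reached this step yet, and the interface name and addresses below are OpenWrt defaults that may differ on your router), the LAN side of such a split typically uses the ip6assign option in /etc/config/network to hand a /64 from the delegated prefix to the home network:

config interface 'lan'
 option ifname 'eth0.1'
 option proto 'static'
 option ipaddr '192.168.1.1'
 option netmask '255.255.255.0'
 option ip6assign '64'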

My current WAN access setup is an OpenWrt Chaos Calmer TP-Link router, with the WAN interface obtaining public IPs from Bahnhof in Sweden. Fibra is the network provider that, for the purpose of the PoC, has turned on the switchport for ipv6 traffic for me.

For this to make sense to a reader, here is an extract of the /etc/config/network that serves as a starting point for a native ipv6 setup.

config interface 'wan'
 option ifname 'eth0.2'
 option proto 'dhcp'
 option ipv6 '1'

config interface 'wan6'
 option _orig_ifname 'eth0.2'
 option _orig_bridge 'false'
 option ifname 'eth0.2'
 option proto 'dhcpv6'
 option reqaddress 'try'
 option reqprefix 'auto'

Bringing the interface down and up should leave you with an ipv6 address on the eth0.2 interface. I did that maneuver, but no ipv6 address was handed out to me.

Some basic debugging was needed; you would likely do the same if you get this result.

I started by turning off the firewall (to be sure no traffic is dropped before it hits the TCP/IP stack):

$ /etc/init.d/firewall stop

Then I ran tcpdump for ipv6 ICMP traffic, expecting to see some ICMPv6 traffic, which I did. (In an ipv6 network, ICMP traffic is chatty, which is a bit different from what you might be used to from ipv4 networks.)

$ tcpdump -i eth0.2 icmp6

I could see Router Advertisement messages from an interface not on my devices, so something was working on Fibra’s side. I needed to verify that my router actually sent out dhcpv6 messages and the replies that should follow.

So, I forced the ipv6 dhcp client (odhcp6c) to send router solicitation messages out on the interface I want to handle the native ipv6 stack.

$ odhcp6c -s /lib/netifd/dhcpv6.script -Nforce -P48 -t120 eth0.2

(Screenshot: tcpdumpv6 – the tcpdump output)

Nope, no reply from my network provider.

What should happen here is that the dhcpv6 server on my network provider Fibra’s side responds with a configuration package (Advertise), but as you can see, it’s just spamming RA packets, seemingly ignoring my client’s Router Solicitations. This is not the way it’s supposed to be, and I sent Fibra the information above.

To be continued.

Update 1:

For the curious reader, a correct dhcpv6 exchange (RFC 3315) should look like this:
Client -> Solicit
Server -> Advertise
Client -> Request
Server -> Reply

Update 2:

Also, a note from my ipv6 guru friend Jimmy: if you need to see more details on the ipv6 dhcp messages coming from odhcp6c, you can expand the tcpdump filter like this:

$ tcpdump -i eth0.2 icmp6 or port 546 or 547

You should see packets coming out from your interface like:

IP6 fe80::56e6:fcff:fe9a:246a.dhcpv6-client > ff02::1:2.dhcpv6-server: dhcp6 solicit