Building Hadoop, Spark & Jupyter on macOS

So, my goal is simply this: run a Jupyter notebook with a pyspark kernel on macOS Monterey (on Apple Silicon) and read in/process Zstandard-compressed JSON files in Spark.

Now, Spark apparently doesn’t support Zstandard-encoded files itself (except internally), so one needs to rely on Hadoop’s native libraries to do this.

I’ll attempt to summarize the steps I took to do this. First things first, I searched the interwebs to try to find someone who had done this already. Here’s a couple of links that got me primed –

Setup

The first article, although a bit outdated, seemed the most straightforward, so I started with it. It relies heavily on Homebrew. First, I installed the dependencies

brew install wget gcc autoconf automake libtool cmake snappy gzip bzip2 zlib openssl

Then I built protobuf as it suggested. I didn’t want to install into /usr/local, but rather into /opt/local

wget https://github.com/google/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz
tar -xzf protobuf-2.5.0.tar.gz
cd protobuf-2.5.0

./configure --prefix=/opt/local
make
make check
sudo make install

/opt/local/bin/protoc --version

Configuring

Now, onto Hadoop itself. First I checked out the branch I wanted

git clone https://github.com/apache/hadoop.git hadoop-2.9.1
cd hadoop-2.9.1
git checkout -b branch-2.9.1 origin/branch-2.9.1

Before I could build and compile correctly, I had to fix one file: I needed to add a line to hadoop-common-project/hadoop-common/src/CMakeLists.txt at around line 25. I added the following

# Add this line just after cmake_minimum_required(VERSION 3.1 FATAL_ERROR):
cmake_policy(SET CMP0074 NEW)
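If you’d rather script that edit than do it by hand, something like this works – a sketch, demonstrated on a scratch copy (GNU sed wants the -i backup suffix attached; the stock macOS sed takes it as a separate argument):

```shell
# Scratch stand-in for hadoop-common-project/hadoop-common/src/CMakeLists.txt
cat > /tmp/CMakeLists.txt <<'EOF'
cmake_minimum_required(VERSION 3.1 FATAL_ERROR)
project(hadoop-common)
EOF

# Append the policy line right after cmake_minimum_required(...)
sed -i.bak '/cmake_minimum_required(VERSION 3.1 FATAL_ERROR)/a\
cmake_policy(SET CMP0074 NEW)' /tmp/CMakeLists.txt

cat /tmp/CMakeLists.txt
```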

Building and installing

And now, build everything. This is the magic that worked for me

export OPENSSL_ROOT="/opt/homebrew/Cellar/openssl@1.0/1.0.2u"
env OPENSSL_ROOT_DIR="${OPENSSL_ROOT}/" ZLIB_ROOT="/opt/homebrew/Cellar/zlib/1.2.11" HADOOP_PROTOC_PATH="/opt/local/bin/protoc" mvn package -Pdist,native -DskipTests -Dtar -Drequire.openssl -Drequire.snappy -Drequire.zstd -Dopenssl.prefix="${OPENSSL_ROOT}"

And then just extract the tarball somewhere

cd /app
tar xzf ${SOURCE_DIR}/hadoop-dist/target/hadoop-2.9.1.tar.gz
cd hadoop-2.9.1

Wrangling macOS

I ran into some problems running the typical hadoop checknative command. I’ll spare you the gory details of figuring this out, but it has to do with macOS “System Integrity Protection” (SIP) preventing dynamic native library lookups. This is a good article about it

The TL;DR is that as soon as macOS executes one of its trusted executables, like /bin/sh or /usr/bin/env, it cripples anything you might have done, like setting DYLD_LIBRARY_PATH to dynamic library folders, resulting in a failure to load them.

The quick fix for a test is to use Homebrew’s version of bash

brew install bash
export PATH=/opt/homebrew/bin:$PATH

That should be enough to run hadoop checknative

export DYLD_LIBRARY_PATH=/work/app/hadoop-2.9.1/lib/native:/opt/homebrew/Cellar/snappy/1.1.9/lib/:/opt/homebrew/Cellar/zstd/1.5.2/lib/:/opt/homebrew/Cellar/openssl\@1.0/1.0.2u/lib/
bash bin/hadoop checknative

On to Spark & Jupyter

Ok, now the basic test works. With this new-found knowledge about macOS and SIP, what needs to be done to make Spark and Jupyter load the native libraries? Let’s start with Spark.

Spark

No need to really build Spark, just download a dist without Hadoop.

curl -L -O https://dlcdn.apache.org/spark/spark-3.1.3/spark-3.1.3-bin-without-hadoop.tgz
tar xzf spark-3.1.3-bin-without-hadoop.tgz

Now, if you look at scripts like bin/spark-shell, you’ll notice they all start with

#!/usr/bin/env bash

Now we know that, with SIP enabled, macOS is gonna disable DYLD_LIBRARY_PATH when /usr/bin/env (and subsequently /bin/bash) are run. The strategy is going to be to prevent either of these from running. We’ve already got a solution for bash, but there is no Homebrew version of /usr/bin/env, so we’ll have to build our own.

curl -O https://raw.githubusercontent.com/coreutils/coreutils/master/src/env.c
gcc -o env env.c
mkdir /app/bin
cp env /app/bin

And now, the unfortunate part: I decided to tweak the Spark scripts to run my version of env. OK for now, but I don’t really like doing this sort of thing because inevitably I’ll forget about it when I upgrade a component.

cd spark-3.1.3-bin-without-hadoop/bin
sed -i .bak -e 's=/usr/bin/env=/app/bin/env=' $(grep -l /usr/bin/env *)
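To convince yourself the substitution does what you want, here’s the same sed run against a scratch script (GNU sed spelling of -i; the macOS sed above takes the backup suffix as a separate argument):

```shell
# A scratch script with the stock shebang
printf '#!/usr/bin/env bash\necho hello\n' > /tmp/spark-demo.sh

# Same substitution as above, using = as the delimiter to avoid escaping slashes
sed -i.bak -e 's=/usr/bin/env=/app/bin/env=' /tmp/spark-demo.sh

head -1 /tmp/spark-demo.sh   # -> #!/app/bin/env bash
```

The .bak backup keeps the original around in case the upgrade-and-forget scenario bites later.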

Moment of truth, let’s see if we can start a spark-shell and have it load a Zstandard-compressed file

export PATH=/opt/homebrew/bin:$PATH

export HADOOP_HOME=/app/hadoop-2.9.1/
export SPARK_HOME=/app/spark-3.1.3-bin-without-hadoop/
export SPARK_DIST_CLASSPATH=$(${HADOOP_HOME}/bin/hadoop classpath)

export DYLD_LIBRARY_PATH=${HADOOP_HOME}/lib/native:/opt/homebrew/Cellar/snappy/1.1.9/lib/:/opt/homebrew/Cellar/zstd/1.5.2/lib/:/opt/homebrew/Cellar/openssl\@1.0/1.0.2u/lib/

$SPARK_HOME/bin/spark-shell

Now, let’s try

scala> val df = spark.read.json("/tmp/important-json-log.zst")
df: org.apache.spark.sql.DataFrame = [aid: bigint, app: struct<buid: string> ... 16 more fields]

scala> df.count()
res1: Long = 7023

Success! So, how about Jupyter?

Jupyter

Well, it turns out, with all of what we’ve done so far, we are in good shape to use Homebrew’s version of Jupyter. Its startup script isn’t using the env trickery

#!/opt/homebrew/opt/python@3.9/bin/python3.9

So, using the same environment settings as for Spark should be enough

jupyter notebook --ip=localhost --notebook-dir=jupyter

Taadah!
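As an aside, another common way to wire these two together – not what I did above, just a sketch for reference – is to let pyspark launch the notebook itself via its driver-python environment variables:

```shell
# Have pyspark start Jupyter as its driver front-end
# (assumes SPARK_HOME, HADOOP_HOME, and DYLD_LIBRARY_PATH are set as above)
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --ip=localhost"
${SPARK_HOME}/bin/pyspark
```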

Emacs Tramp on Windows 10

Hot on the heels of figuring out how to configure SSH with Windows 10, I finally figured out how to use the Windows OpenSSH Client with Emacs Tramp, so I can use the ssh method rather than plink.

There are two customizations necessary. The first is to disable ControlMaster features in SSH, since I believe Windows doesn’t support them. There is a tramp option for that.

  (setq tramp-use-ssh-controlmaster-options nil)

This gets you half of the way, but after I made this change, the tramp logs showed a problem with the remote side allocating a tty. You can force SSH to do this with the -tt option; the hard part is adding that to tramp.

The way I did it was to use tramp-connection-properties to override the default settings.

  (add-to-list 'tramp-connection-properties
	       (list (regexp-quote "/ssh:")
		     "login-args"
		     '(("-tt") ("-l" "%u") ("-p" "%p") ("%c")
		       ("-e" "none") ("%h"))))

This adds the -tt argument for any tramp connections that start with /ssh:, which is OK for me. Another option would be to change tramp-methods directly, but sadly my Lisp skills are pretty weak, and manipulating that list is more work than it’s worth.

In my init.el, I actually only set these when Emacs is running on Windows, so I’ve wrapped it all in a conditional clause.

(when (eq window-system 'w32)
  (setq tramp-use-ssh-controlmaster-options nil)
  (add-to-list 'tramp-connection-properties
	       (list (regexp-quote "/ssh:")
		     "login-args"
		     '(("-tt") ("-l" "%u") ("-p" "%p") ("%c")
		       ("-e" "none") ("%h"))))
  )

Quest for a multi-platform SSH Config

Since working at home during the pandemic, I had decided to give developing on Windows 10 another shot. Although it’s a bit more painful to configure the kind of working environment I’m used to on any Unixen system (typically macOS), I’ve managed to configure things adequately.

One thing that has eluded me for a while, and so I’ve just done things a different way, is making the Windows “built-in” OpenSSH Client work properly for me.

The main hitch is that I use ProxyCommand regularly to connect through bastion hosts into instances in “data centers”. A ProxyCommand would typically look like

Host *.d
    ProxyCommand ssh creechy@dmz.ds.net nc `echo %h | sed -e 's/.[^.]*$//'` %p

This will run nc to set up a “tunnel” between me and the target network, through which ssh will burrow to get to my destination when I enter a command like

ssh instance.p
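By the way, the backtick expression in that ProxyCommand just strips the last domain component from %h, so nc dials the bare instance name from the bastion. A quick demo with a stand-in hostname:

```shell
# The sed deletes the trailing ".<component>" from the hostname
echo "instance.p" | sed -e 's/.[^.]*$//'   # -> instance
```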

Well, this doesn’t work for Windows OpenSSH Client. You’ll get an error like

CreateProcessW failed error:2
posix_spawn: No such file or directory

For some time I thought it was a problem executing the nc command, but that’s not it. It’s executing the ssh command itself. Sometimes I forget that, in Windows, executables all end with .exe. So the easy fix is to just change it to

Host *.d
  ProxyCommand ssh.exe creechy@dmz.ds.net nc `echo %h | sed -e 's/.[^.]*$//'` %p

Not so fast! Now this breaks using the config in Unixen (macOS, Linux, WSL).

I like having things that can run on as many platforms as possible, so I finally spent some time to figure it out. The Match directive to the rescue. The idea is to use Match in the ssh config to identify when it’s being executed on Windows, and use a different ProxyCommand there. It took quite a bit of work to find just the right command: I needed one that runs in both a bash shell and cmd.exe, but returns a different status code in each. This appears to be the magic.

Match host *.d exec "exit ${ONWINDOWS:=1}"
  ProxyCommand ssh.exe creechy@dmz.ds.net nc `echo %h | sed -e 's/.[^.]*$//'` %p

Host *.d
  ProxyCommand ssh creechy@dmz.ds.net nc `echo %h | sed -e 's/.[^.]*$//'` %p

Notice the exec directive. I use this to interrogate the underlying execution context. I’m using a little bash syntax to check for a non-existent environment variable and return 1 if it doesn’t exist – ${ONWINDOWS:=1}. By luck, I suppose, that syntax doesn’t produce any errors in cmd.exe and returns a 0. And since both bash and cmd.exe recognize the exit statement, it all works out just right.
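You can watch the Unix half of that trick from any POSIX shell (the cmd.exe half you’d have to verify on Windows):

```shell
# With ONWINDOWS unset, ${ONWINDOWS:=1} assigns and expands to 1,
# so the exec test exits non-zero and the Match block is skipped
unset ONWINDOWS
sh -c 'exit ${ONWINDOWS:=1}'
echo $?   # -> 1
```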

So now, as OpenSSH evaluates the directives on Windows, the Match will succeed and ProxyCommand will be set correctly. Then the Host will be evaluated, but since ProxyCommand is already set, it won’t be overridden.

And then for Unixen, the Match is false, so it “drops through” to the Host, which sets ProxyCommand correctly.

Notes on my Media Server

I recently rebuilt my media server. Previously it was based on an old ReadyNAS NV+ and a Mac Mini. Both have outlasted their lifetimes; they are slow, noisy, and large. So I rebuilt the whole rig.

Primary Uses

I call it a media server, but in reality it’s much more than that. The 1 Gbit broadband service my internet provider gives me affords lots of bandwidth to do interesting things.

  • File Server – photos, among other digital assets, that I want to have available in my home.
  • Media serving – videos that I want to have available to view on my TV and other devices.
  • Web presence – Some personal web sites and web applications I want to be able to access anywhere.

The Rig

The goal here is to strike a balance between simplicity, power, and noise. The primary components I chose are

  • Intel NUC6i7KYK – A very compact all-in-one unit. Slightly loud fan, but quieter than my old rig, and it is generally drowned out by background noise anyways.
  • Seagate 5TB Backup Plus USB 3.0 Disk – These are for the data partition. The old ReadyNAS had 4TB of redundant disk. I got two of these Seagate disks to build similar redundancy.
  • Samsung 960 Evo NVMe 500GB SSD – This will hold the system software. Probably a 250GB SSD is sufficient really.

System Software

I decided to use Ubuntu 18.04. Some of the applications I run do have a GUI, so making the full Ubuntu desktop available through remote access (via VNC) was required.

There was nothing notable for the basic install, however making Ubuntu run in “headless” mode took some work, which I previously wrote about.

I did choose to partition the system disk using LVM. My original intention was to use LVM to build a redundant data partition, but after the basic install was done, I discovered that ZFS has been ported to Linux, which is far easier to work with.

Data Partition

I built the data partition using the two Seagate 5TB drives with really simple mirroring (RAID-1). Using ZFS made this incredibly easy. It has been a while since I last used ZFS, but it came back to me pretty quickly. I started with the Ubuntu ZFS Tutorial and followed that up with a read through the Ubuntu ZFS Reference to figure out how to create file systems, etc.

% zfs list
NAME               USED  AVAIL  REFER  MOUNTPOINT
data-pool         2.85T  1.54T    96K  none
data-pool/backup  63.0G  1.54T  63.0G  /data/backup
data-pool/home    78.9M  1.54T  78.9M  /data/home
data-pool/media   2.57T  1.54T  2.57T  /data/media
data-pool/photos   229G  1.54T   229G  /data/photos
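For reference, getting to a layout like that boils down to just a few commands – a sketch only; the device names here are hypothetical, and for real pools you’d want the stable names from /dev/disk/by-id:

```shell
# Create a mirrored (RAID-1) pool from the two USB disks
sudo zpool create data-pool mirror /dev/sdb /dev/sdc

# Leave the pool root unmounted, then carve out file systems
sudo zfs set mountpoint=none data-pool
sudo zfs create -o mountpoint=/data/media data-pool/media
sudo zfs create -o mountpoint=/data/photos data-pool/photos
```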

File Server

File serving is always an annoying thing. Even for a home network where you might have only a few users with different access permissions, it can be frustrating.

The natural solution for making files available in a heterogeneous environment is to use SAMBA. Here are some notes on configuration. First, some global tweaks I made to /etc/samba/smb.conf

[global]
mangled names=no
veto files = /._*/.DS_Store/
delete veto files = yes
min protocol = SMB2
netbios name = mockhub

This is mainly because I didn’t want to have any of those special OSX files all over the place.

Since I last used SAMBA, I found there is now a net share command. I thought this would make it easy to configure my shares, but it doesn’t really provide the level of control I need. In the end I added my shares to the /etc/samba/smb.conf file. Here’s an example

[media]
path=/data/media
comment=Media Archive
#usershare acl=Everyone:F,
guest ok=yes
force user=media
read only=no

The force user parameter is the key to simplifying permissions. It tells the samba server to always access files as the specified user. So even though a client has to authenticate with their own user/password, all file access is done as that user.

Media Server

I’ve found two great options that provide a DLNA-based media server: Universal Media Server and Serviio. Each provides both a server and a management interface. I have used Universal Media Server for a long time, and both have similar features for my needs.

There’s not much special configuration for either of these tools. Simply point them to your archive of video files and they will discover, catalog, and thumbnail them.

One interesting thing to note regards fast-forward & rewind. Neither of these can apparently provide fast-forward/rewind if they are transcoding on the fly. Transcoding can be necessary for various reasons, but the main two I have seen are when the source format is not compatible with your player (TV) and to provide subtitles.

For example, I have some h.265 encoded files, but my television can’t play them natively so these programs have to transcode them to play them.

Finally, I wrestled with random disconnects from my TV to both UMS & Serviio, tweaking various settings in the apps and the router. The magic so far has been this:

In LAN -> IPTV Tab, enable the following settings:

Web Server

I have both Java web apps and plain HTML sites that I want to serve. The obvious choice for this is Tomcat.

The trick here is that I don’t want to run Tomcat as root, but I want to serve content on the typical ports 80 & 443. I could create a mapping in my router for this, but for now I opted to create some iptables rules to map 80 & 443 to Tomcat’s 8080 & 8443.

% sudo iptables -L -n -t nat
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination         
REDIRECT   tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:80 redir ports 8080
REDIRECT   tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:443 redir ports 8443
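For completeness, rules like those come from commands along these lines (a sketch reconstructed from the listing; they also don’t survive a reboot without something like iptables-persistent):

```shell
# Redirect the privileged ports to Tomcat's unprivileged ones in the nat table
sudo iptables -t nat -A PREROUTING -p tcp --dport 80  -j REDIRECT --to-ports 8080
sudo iptables -t nat -A PREROUTING -p tcp --dport 443 -j REDIRECT --to-ports 8443
```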

I also wanted to use letsencrypt to create my own certificate for Tomcat. Unfortunately, letsencrypt doesn’t natively support Tomcat. The additional wrinkle is that I want to support multiple domains with a single certificate.

I found this excellent article – Configuring Let’s Encrypt with Tomcat 6.x and 7.x – for the basic configuration, and combined it with the article Adding an SAN to an SSL cert (in Java).

Conclusion

Since the original installation, I have been finding more uses for the server. For example, I’ve been trying to decrypt an old iOS backup that I lost the password for, but the tools to do this are generally Windows-based. So, using VirtualBox, I started up a Windows VM, and have set up a process where I automatically suspend and resume the VM during the day to continue searching for the password.

Although it’s taken some time to set up, I have enjoyed the additional flexibility of the fully custom media server.

Headless Ubuntu 18.04 with VNC Access

Ok, real quick before I forget how I did it.

What I wanted to do was to run Ubuntu 18.04 without a monitor and be able to access it via VNC. Here’s the steps:

  1. Install Ubuntu 18.04. I made sure the system was set up to auto-login.
  2. Make sure it’s all patched up to current.
  3. Enable the VNC server (vino), and disable Vino encryption (I haven’t found a Windows or OSX viewer that supports Vino’s encryption)
    http://ubuntuhandbook.org/index.php/2016/07/remote-access-ubuntu-16-04/
  4. Install the “dummy video” Xorg server
    sudo apt install xserver-xorg-video-dummy
  5. And configure it. From this GIST
    https://gist.github.com/mangoliou/27c6c5867a95932f21ae59ad7152aa33
    I just ran
    sudo wget -P /etc/X11 https://gist.githubusercontent.com/mangoliou/ba126832f2fb8f86cc5b956355346038/raw/b6ad063711226fdd6413189ad905943750d64fd8/xorg.conf
  6. Finally switch from Xwayland to Xorg. From stackexchange
    https://askubuntu.com/questions/961304/how-do-you-switch-from-wayland-back-to-xorg-in-ubuntu-17-10
    Just uncomment
    #WaylandEnable=false
    in
    /etc/gdm3/custom.conf
  7. And reboot.
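Step 6, uncommenting the WaylandEnable line, is easy to script too – here it is demonstrated on a scratch copy of custom.conf:

```shell
# Scratch stand-in for /etc/gdm3/custom.conf
printf '[daemon]\n#WaylandEnable=false\n' > /tmp/custom.conf

# Uncomment the line to force Xorg instead of Wayland
sed -i.bak 's/^#WaylandEnable=false/WaylandEnable=false/' /tmp/custom.conf

grep WaylandEnable /tmp/custom.conf   # -> WaylandEnable=false
```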

Mobile Media Server with a Raspberry Pi

Update: 8.Nov.2013 – Tweaked WiFi config to improve network performance.

This Thanksgiving we are planning to drive, for once, to our destination in Southern California. In anticipation of the kids getting bored on the long drive, I started to think about a way to have more videos available than their iPods/iPads can hold, to keep them busy.

The idea I came up with was to see if I could configure a Raspberry Pi as a media server. You can hardly beat the form factor and the simple power requirements – perfect for the car. I figured it would have to do a few things

  1. Act as a WiFi Access Point. The iDevices would be configured to connect to this.
  2. Serve as a simple DNS server for this isolated LAN.
  3. Serve media files in some way that the iDevices could understand.
  4. Handle the load of a few clients accessing it.

After about a day’s work, I’m happy to report that I believe I have been successful. At least my initial configuration and testing so far is looking very promising. Here’s what I have done:

Configuring the WiFi Access Point
I found a great tutorial from Adafruit for Setting up the Raspberry Pi as an Access Point. I followed the instructions pretty much verbatim, and it worked OK, but in later testing I noticed wireless performance was pretty slow. I found the article Raspberry Pi with RT5370 Wireless Adapter, which described a similar problem, but with a different adapter.

Still, his solution was simple, so I gave it a shot, and it worked. What I ended up doing was changing /etc/network/interfaces to add the directive wireless-power off, so the section now looks like

iface wlan0 inet static
  address 192.168.42.1
  netmask 255.255.255.0
  wireless-power off

I was worried that once I disconnected the ethernet side of the Pi, things would go haywire, like maybe the iDevice would no longer want to connect to the Access Point since it could no longer fully access the Internet. But that doesn’t appear to be a problem so far.

Setting up the DNS server
I wasn’t entirely sure I would need this, but thought I’d do it just in case. Found this tutorial for Setting up a simple intranet DNS Server on Linux, using dnsmasq.

I did a couple things differently than the how-to, though. First, I installed the full dnsmasq package (not just dnsmasq-base).

I also used a different location for the config file, to match the newer pattern for where to store dnsmasq config files: I created the file /etc/dnsmasq.d/intranet.conf

In there I put

no-dhcp-interface=

server=/localhost/192.168.42.1
server=8.8.8.8

no-hosts
addn-hosts=/etc/hosts.dns

Note that the server=8.8.8.8 directive should probably be removed since in disconnected mode, there is really no upstream server to consult.

Finally, I went back and tweaked /etc/dhcp/dhcpd.conf to point the domain-name-servers option to 192.168.42.1

Installing a Media Server
My first hope was to just use Samba to provide a CIFS share to the iDevices. I have used an app called File Explorer on the iPad, which works fine talking to beefier hardware, but the Raspberry Pi struggled to serve media this way, so I looked for other solutions.

In my search I found several DLNA clients available for iOS, so I thought I’d give that a shot; so far it has worked well.

I followed the tutorial Mini DLNA on the Raspberry Pi. Pretty simple and straightforward.

I noticed in the minidlna.log it kept complaining that there were not enough max_user_watches, so I followed the instructions to increase the limit.
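The knob in question is the fs.inotify.max_user_watches sysctl; raising it looks something like this (the value is just an example):

```shell
# Raise the inotify watch limit for the running system
sudo sysctl fs.inotify.max_user_watches=65536

# Persist the setting across reboots
echo 'fs.inotify.max_user_watches=65536' | sudo tee -a /etc/sysctl.conf
```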

DLNA Video Players
And finally, I’ve tried out several DLNA-enabled video players, here’s two good ones I’ve found so far

  • MoliPlayer & MoliPlayer HD – Easy-to-use player, with a few optional in-app purchases to enhance viewing.
  • XBMC for iOS – If you happen to have a jailbroken device, this works well and is free.

My Bandwidth

So I’ve been sampling my incoming and outgoing bandwidth at home ever since I switched over to AT&T U-Verse, but haven’t really done much with it. Out of curiosity, I decided to crunch some numbers the other day to see what my typical bandwidth usage has been. Before U-Verse I had AT&T DSL service rated at up to 3.0 Mbps, but it really maxed out at only 1.5 Mbps. With U-Verse, I have a 12 Mbps plan. Clearly I don’t always use all that bandwidth, but it’s nice to have when I need it. So how much have I needed it?

Here’s a chart of my bandwidth usage in September. What I did was put the samples in bandwidth ranges and calculated percentages in each range.

I guess it’s pretty reasonable that 55 percent of the time – a little more than 12 hours a day, including weekends – there is really nothing going on, network-wise. So let’s just drop off that bit and readjust the percentages for the other ranges.

These numbers are looking a little more reasonable, I guess. For a large part of the work and other online time, the traffic is mainly IM and email (IMAP+SMTP). That likely accounts for the 10-25 Kbps range. Lay down occasional bursty web traffic and SSH into the next few ranges, say 25-250 Kbps. More seldom, there’s RDP starting to push above 250 Kbps, and then heavy file transfers pushing 1 Mbps and higher.

One thing is certain: I’ve got a whole lot of bandwidth I need to figure out something to do with.

Pidgin with all the trimmins’ on OSX

So, first off, I really do like Adium on OSX. It’s got a great interface, it’s full featured, etc., except it’s missing one thing: the ability to connect to Microsoft Office Communication Server (OCS). Which, with my current employer, is a necessity.

Those who know me, know I hate running multiple apps to do the same thing. So, although I could run Adium for my non-Work IM and Microsoft Messenger for Work IM, I’d really prefer not to.

Fortunately there is at least one solution, some good folks have put together a plugin for Pidgin called SIPE that allows you to configure OCS accounts and so forth. All is good; I use this combo on my Windows desktops all the time, but sadly, every time I have looked, the plugin has not been ported to Adium.

But all is not lost. Apparently some folks have now ported (GNOME) GTK to the native OSX Quartz interface, which provides some hope for running Pidgin on OSX more natively, instead of in, say, an X11 window.

Well, I am happy to report that I have been successful in doing just that. Here’s the procedure, before I forget.

First off, install MacPorts, just following the regular procedure. Now to the fun stuff.

MacPorts has Portfiles set up for Pidgin and all its prerequisites, so first off, fire up a base install of Pidgin

sudo port install pidgin +quartz +no_x11

Now just walk away for a while, there’s a lot that needs to be downloaded, compiled, and installed.

Once it’s done, you can try to run it, but things will probably be pretty wonky, due to some bugs in the version of pango that gets built. See issue 20924 for more information. The downside is that you need to pull down patch-pango-1.28.1-introspection-revised.diff from the bug and use it to build a new version of pango

cd ~
wget https://svn.macports.org/attachment/ticket/20924/patch-pango-1.28.1-introspection-revised.diff
cd /opt/local/var/macports/sources/
cd rsync.macports.org/release/ports/x11/pango
patch < ~/patch-pango-1.28.1-introspection-revised.diff
sudo port install pango +quartz +no_x11
sudo port -f activate pango @1.28.1_0+no_x11+quartz

Ok, from here, you should now have a working version of Pidgin that you can fire up, and connect to your Yahoo, AIM & Jabber buddies with.

Next stop, installing SIPE. Grab the multi-platform file from http://sourceforge.net/projects/sipe/files/ and extract it.

Now, configure the package, and build it:

export PATH=/opt/local/bin:$PATH
./configure --prefix=/opt/local
make

Now, I had a problem where the build stopped at one point because gcc warnings were being treated as errors. I fixed this by removing the -Werror directives from the QUALITY_FLAGS variable in the Makefiles. The build completed correctly after that.

Finally, install the plugin with

sudo make install

And that is it.

Now, one more thing for bonus points: I try to use Off-The-Record (OTR) messaging when possible, so I like to have the OTR plugin available in Pidgin as well. MacPorts has an answer for that:

sudo port install libotr
sudo port install pidgin-otr

Done.