Speculative Engineering

The world of software engineering comprises a variety of interesting subdisciplines related to shipping and maintaining software. These subdisciplines, ideally practiced with scientific thought and fundamental engineering principles in mind, are applied throughout today’s computing industry to create new and innovative technologies.

By practicing and refining these subdisciplines over many years, software engineering has evolved a [growing] set of software development processes that enable teams to collaborate and take an idea into production. Though software development methodologies may differ, ultimately what matters—and what these methodologies are trying to solve—is simply allowing engineers to get shit done™. However, some engineering mindsets negatively impact productivity even when software development methodologies are in place. An especially dangerous one is something I will call speculative engineering.

Speculative engineering is when you avoid solving a problem by raising other problems, questions, or scenarios without [scientific] proof of their usefulness, relevance, or validity to the actual problem. This mindset also extends to solving a problem through arbitrary tuning. For instance, statements such as the following are indicative of speculative engineering:

  • You shouldn’t do X because it will slow Y down
  • Let’s build X because users will like it
  • Let’s change X variable because it will make Y faster
  • Let’s make X separate from Y because it will be better

Statements like these are not sufficient by themselves and should not define the direction for how a problem should be solved. Without adequate proof or measurements, these statements are just speculation. Rather than pointing out [possibly irrelevant or not yet applicable] edge cases, anecdotal references, or other non-scientific ideas, engineers would be better off iteratively building and measuring against the problem in a scientific manner. On the other hand, while some engineers may say these types of statements are justified by a wealth of experience, this can still be troublesome. Relying on experience alone is not enough unless you are confident that the problem exhibits characteristics that make your experience applicable.

So why do some people practice speculative engineering? It could be a fascination with seeking out edge cases, fear of failure, fear of progress, or complacency, among several other possible reasons. In any case, if speculative engineering emerges and is left unattended, a delayed project, or one that never ships, could quickly become reality. Everything considered, speculative engineering is dangerous to productivity, and teams should proceed with caution when faced with such an engineering mentality.

Cassandra Architecture and Data Modeling

Cassandra solves a variety of problems in the space of realtime distributed systems. My recent work has been around realtime event ingestion, monitoring and log collection, and other timeseries data. These problems exhibit read and write behaviors that line up with some of Cassandra’s natural use cases. In particular, I needed to satisfy heavy write requirements without compromising efficient reads. This is not free, however. Although Cassandra is well suited to scaling out your persistence store with relatively low devops overhead, I found that it comes at the cost of more complicated data modeling and application code. With this in mind, and given that I had a team ready to dive into this set of problems, I gave a talk on Cassandra. It was a fairly technical talk that primarily covered architecture and data modeling.
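
To give a flavor of the data modeling involved, here is a minimal sketch of a timeseries-style table and its access patterns, assuming the DataStax cassandra-driver npm package; the keyspace, table, and field names are all illustrative:

var cassandra = require('cassandra-driver');

// Connection options vary by driver version; these values are illustrative.
var client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],
  localDataCenter: 'datacenter1',
  keyspace: 'metrics'
});

// One partition per (event_type, day) spreads writes across the cluster, while
// clustering on event_time keeps each day's events sorted on disk for cheap reads:
//
//   CREATE TABLE events_by_day (
//     event_type text,
//     day        text,
//     event_time timeuuid,
//     payload    text,
//     PRIMARY KEY ((event_type, day), event_time)
//   ) WITH CLUSTERING ORDER BY (event_time DESC);

// Heavy writes: each insert touches exactly one partition.
client.execute(
  'INSERT INTO events_by_day (event_type, day, event_time, payload) VALUES (?, ?, now(), ?)',
  ['login', '2014-01-15', '{"user":42}'],
  { prepare: true },
  function (err) { if (err) console.error(err); });

// Efficient reads: a single-partition slice, already in reverse time order.
client.execute(
  'SELECT event_time, payload FROM events_by_day WHERE event_type = ? AND day = ? LIMIT 100',
  ['login', '2014-01-15'],
  { prepare: true },
  function (err, result) { if (!err) console.log(result.rows); });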

Modular Node.js App Structure

I wrote a post on a monolithic app structure a while back. However, I never revisited that post to show how I actually organize my node applications. So, at long last, here is that post.

To preface, the apps I work on these days are primarily realtime backend services, interfacing with cassandra, kafka, and redis pub/sub, all over express. Initializing an express app gives you a routes/ directory. Following that same modular pattern, I decided on the following:

app/
  name_of_resource/
    resource_name.js
    views/
models/
config/
  json or yaml configs go here (for environments)
lib/
  code for initializing the configs, also other 3rd party libs
test/

Using this structure, you essentially have small, modular “apps” (resources) that you can move around or even drop into another node backend service. By the way, although app somewhat makes sense, this directory may also simply be called resources. I just like my resources to be at the top of the directory, so app works for me. As a Vim user (with the NERDTree plugin) who likes to leave the file tree open, this layout is also easier on me. A sketch of what one of these resource modules can look like follows the example tree below.

You will notice that models/ is at the same level as app/. This is because a resource may not necessarily be tied to one model and may share models with other resources.
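
For example, in the structure below, both app/foo/foo_resource.js and app/bar/bar_resource.js can pull in the same model without duplicating it:

// inside app/foo/foo_resource.js or app/bar/bar_resource.js
var users = require('../../models/users');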

The config/ directory is used to host configurations for components such as persistence stores, caching, app configs, and others. Each of these files is responsible for specifying their own configs per environment.
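
For example, a config/redis.json (as in the example below) could specify per-environment settings; the keys here are a sketch, not a prescription:

{
  "development": { "host": "127.0.0.1",      "port": 6379 },
  "test":        { "host": "127.0.0.1",      "port": 6379 },
  "production":  { "host": "redis.internal", "port": 6379 }
}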

The lib/ directory is fairly standard, in that it is responsible for initializing the configs and enabling various 3rd party modules.
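
Continuing the redis example, a lib/redis.js might initialize a client from that config. This is a minimal sketch assuming the node_redis package:

// lib/redis.js
var redis = require('redis');

// Pick the config block for the current environment.
var config = require('../config/redis.json')[process.env.NODE_ENV || 'development'];

// Export a single shared client for the rest of the app to require.
module.exports = redis.createClient(config.port, config.host);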

The test/ directory is standard to mocha. Typically, I have a test helper file to enable test environment-specific options.
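
A sketch of what that helper can look like (the contents vary by project):

// test/test_helper.js -- force the test environment before any config is read
process.env.NODE_ENV = 'test';

// Expose common test dependencies.
global.assert = require('assert');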

Here is an example:

app/
  foo/
    foo_resource.js
    views/
      index.jade
      show.jade
  bar/
    bar_resource.js
    views/
      index.jade
      show.jade
models/
  users.js
config/
  redis.json
  cassandra.json
lib/
  redis.js
  cassandra.js
test/
  test_helper.js
  foo_resource_test.js
  bar_resource_test.js
  user_test.js
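
To make the resource modules concrete, here is a sketch of what app/foo/foo_resource.js could look like as a mountable Express sub-app; the routes and handlers are illustrative:

// app/foo/foo_resource.js
var express = require('express');
var app = module.exports = express();

// Each resource carries its own views, so it stays self-contained.
app.set('views', __dirname + '/views');
app.set('view engine', 'jade');

app.get('/', function (req, res) {
  res.render('index');
});

app.get('/:id', function (req, res) {
  res.render('show', { id: req.params.id });
});

The top-level server can then mount the resource with app.use('/foo', require('./app/foo/foo_resource')), which is exactly what makes these resources easy to move between services.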

Of course, no one in the node community has been able to assert one end-all, be-all application structure yet, and that is fine as long as people find what works best for them. The application structure described in this post tries to establish a modular, logical, and clean way to organize a node application.

A Node.js Configuration Management Alternative

I’ve been working in node.js for some time now. In my time thus far, it seems the node.js community highly values explicitness. That is, it’s better to be up-front with your intentions than to always be looking for tricky and clever ways to implement solutions. With this in mind, you will often see explicit require statements at the beginning of files when working in node. I like this, despite this series of requires being blocking, synchronous calls. However, for the sake of making things interesting, I want to present an alternative way of managing these requires. I will present this alternative in the context of managing configs and directory structure in a node.js application.

The following directory structure is something I personally use. The Modules object, discussed later on, is not something I use, but something I thought would be interesting to discuss.

The example in this blog post covers conventions similar to those found in other web frameworks. Let’s first take a look at the directory structure and the important files.

/
  app/
    models/
    resources/
  config/
    application.js
    routes.js
  lib/
  test/
  node_modules/
  playbooks/
    config/
      group_vars/
      roles/
    deploy/
      group_vars/
      roles/
  config.js
  modules.js
  server.js
...
  1. app is where you store your application code. This application in particular has the sole purpose of being an API. The resources directory is exactly what it sounds like. Moreover, this directory caters to percolator, a framework for building APIs.
  2. config is where you store your application configs.
  3. lib is for library files that may be used across services or for handling legacy or miscellaneous components.
  4. test is for mocha tests.
  5. node_modules is not committed to source control. Node looks here first when resolving the binaries and packages to use when running your app.
  6. playbooks is for Ansible, a tool that allows for simple server orchestration and deployment. This app uses ansible to orchestrate different server types (api, web, mail, cache, queue) across multiple data centers using YAML files.

So now that we have the directory structure laid out, what are those files in there? That is, how do we start the node app, and what happens during the boot process? Let’s take a look at the server.js file:

process.title = 'foo-app';

var App = require('./config/application');
var Foo = new App(__dirname);

Foo.start();

The server.js file loads the application.js config file:

Modules = require('../modules'); // intentionally global, so it is accessible app-wide
var Config = require('../config');

function Application (rootDirectory) {
  // ... do stuff to set up the app (load configs, wire up routes, etc.) ...
}

Application.prototype.start = function () {
  // ... bind the server and start accepting requests ...
};

module.exports = Application;

In this example you’ll notice a Modules object. This object is the interesting point of this configuration management pattern. The object is accessible throughout the application and provides a central way of managing the node packages in use:

module.exports = {
  _:            require('underscore'),
  os:           require('os'),
  percolator:   require('Percolator').Percolator,
  mysql:        require('mysql'),
  sequelize:    require('sequelize'),
  redis:        require('redis'),
  kue:          require('kue'),
  cluster:      require('cluster'),
  memcached:    require('memcached'),
  logger:       require('log-driver').logger,
  moment:       require('moment'),
};

The good thing about the above pattern is that it centralizes the node packages in use into one place, so a person new to the codebase can easily see what the application relies on. However, the question is: does this matter? Although it is useful, I think this question is less important than asking “What does component X do?”. In the standard way, using require statements in individual files on an as-needed basis, each file explicitly requires what it uses and nothing else. I personally like this better for precisely that reason. The above pattern falls short in comparison, since every package becomes accessible in every file, which can be both unnecessary and convoluted. The above pattern does, however, process all requires at once on application start, as opposed to processing requires at runtime. Basic benchmarking shows this to be a micro-optimization.
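
To make the contrast concrete, here is the same dependency under each style (a sketch):

// With the global Modules object: the full dependency list lives in modules.js,
// and every file can reach every package.
var stamp = Modules.moment().format();

// With explicit requires: each file declares exactly what it uses.
var moment = require('moment');
var stamp = moment().format();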

In summary, this post entertains an alternative configuration management pattern for node applications, implementing patterns similar to those found in other web frameworks. Compared to standard node.js configuration styles, the performance difference between the two is insignificant, which leads to a fairly inconclusive takeaway: it’s all about preference.

Introducing Alpaca: A Rack Middleware for Blocking IPs

I wrote a gem. Meet Alpaca (outta nowhere). It’s a Rack middleware gem that allows developers to quickly and easily manage whitelists and blacklists. The motivation for this gem revolves around satisfying specific security concerns, such as blocking malicious clients or mitigating denial of service.

Alpaca supports IPv4 and IPv6. It can whitelist or blacklist:

  • Globally across the application
  • All controller actions
  • A subset of controller actions

The above may be whitelisted or blacklisted via a single IP, a range of IPs, or hostnames. Configuration is made easy using YAML in config/alpaca.yml. Here’s an example:

# The defaults below are reserved IPv4 and IPv6 addresses used for testing.
# Replace the IP addresses in this configuration file with your own.

whitelist:
  - 0.0.0.1
  - 198.18.0.0/15
  - "::/128"
blacklist:
  - 0.0.0.1
  - 0.0.0.2
  - "2001:db8::/32"
default: allow

This gem additionally comes packaged with the ability to block IPs at the controller level. This functionality is useful for restricting certain API resources or specific endpoints to particular IP addresses. I wrote Alpaca primarily for this reason: I needed to satisfy some organizational security requirements in a production setting.

For more information about the implementation details, you can read the README.