Links Post

Real-time analytics article

https://adamsinger.substack.com/p/demystifying-real-time-analytics?utm_source=url&s=r

Dan Luu on the issues with purchasing solutions.

https://danluu.com/nothing-works/

Nuclear Stuff 

https://austinvernon.site/blog/nuclear.html

https://whatisnuclear.com/economics.html

Are we really engineers?

https://www.hillelwayne.com/post/are-we-really-engineers/

Configuration – Best Practices 

Static vs Dynamic Configuration

There are two main types of configuration: static and dynamic. The main difference is in how the configuration is consumed by your services. 

Static configuration is immutable for the lifetime of an instance of your service. The service can read it once on startup and safely assume it will never change. 

Dynamic configuration is mutable during the life of a service instance. The service cannot assume the configuration it has in memory is still accurate. It must have a reliable way to get configuration changes. 

Implementations of static configuration 

Typically, static configuration is read on startup of your service and kept in memory for its lifetime. 

Config file in service repository

The easiest way to do static configuration is to include a config file in your service repository. That way your service will always have access to the config file on startup. The config goes through the same code review process as the code. Plus, code and configuration are kept in the same place, making it easy to reference them against each other. 

The file can be anything from JSON to a purpose-built configuration language. 
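Here's a minimal sketch of this pattern, assuming a JSON config file named config.json checked into the repository next to the code:

import json

def load_config(path="config.json"):
    # Read once at startup; static config never changes for this instance.
    with open(path) as f:
        return json.load(f)

CONFIG = load_config()

Because the file ships with the code, there is no failure mode where the service starts without its configuration.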

Config file in config repository 

Here we simply take a config file we could include in the service repository and put it in a different repository. You could have one main config repository for a collection of services or a separate config repository for each service. 

Why would you do it this way? Splitting the config out from the code lets you deploy the config independently of the code. To release a config change you simply update the config repository, then deploy it through your CI/CD pipeline. Finally, you trigger a fleet-wide service restart to activate the change. 

In many setups this will be faster than deploying a config update in your service repository. 

Config file in external storage (S3, DynamoDB, distributed filesystem, SQL database, etc)

Keeping the config file in external storage has more tradeoffs than keeping the config in a repository. Now you have to consider whether the service has access to the external storage on startup. If it doesn't, you may see repeated restarts, none of which succeed. More stable external storage mechanisms are strongly preferred for static configuration. If the external storage goes down you may not be able to scale your system up or deploy new versions of the code. 
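As a sketch of what the startup path looks like with S3, assuming a hypothetical bucket and key that the instance's IAM role can read:

import json

import boto3

# Hypothetical bucket and key; startup fails fast if S3 is unreachable.
s3 = boto3.client("s3")
obj = s3.get_object(Bucket="my-service-config", Key="prod/config.json")
CONFIG = json.loads(obj["Body"].read())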

Implementations of dynamic configuration 

Dynamic configuration is typically implemented via polling and caching with timeouts. Your service might request config updates every five minutes from a centralized configuration server and update its internal state accordingly. Another option is to maintain a config cache and query the config service only when the particular key in question has exceeded its timeout period, say 15 minutes. 

Technically, you could build a push-based dynamic configuration system, but I haven't seen one in the context of web services. As you add more and more services to your system, having a client-side cache helps keep load down, and it's easier to keep things in the cache up to date in a pull-based system. 

Polling for updates 

A typical way to handle dynamic updates is to have your service regularly check with an external service to see if there have been any updates. The results are usually stored in memory. Your service would make a request to its external configuration system on a set schedule, anywhere from every five minutes to every few hours. The schedule is set by your scaling needs and how quickly you need configuration to be updated across your fleet. If you want to roll out updates within ten minutes, a polling interval on the order of minutes is the way to go. 
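Here's a minimal polling sketch, assuming a hypothetical HTTP endpoint that returns the full config as JSON:

import json
import threading
import urllib.request

CONFIG = {}

def poll_config(url="https://config.internal/my-service", interval=300):
    # Refresh the in-memory config every `interval` seconds.
    global CONFIG
    try:
        with urllib.request.urlopen(url) as resp:
            CONFIG = json.load(resp)
    except OSError:
        pass  # Keep serving with the last known good config.
    threading.Timer(interval, poll_config, args=(url, interval)).start()

poll_config()

Note that a failed poll keeps the last known good config, so a brief config service outage doesn't take your service down with it.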

Per-request checking for updates

Another way to handle dynamic configuration is to check with the configuration server on every request. To make this scale you have to include a local cache of some kind. When your service receives a request, it checks its cache, and if the relevant configuration keys have expired it makes a request to the configuration service to get updates. 
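A sketch of that per-key cache, assuming a hypothetical fetch_from_config_service function and the 15-minute timeout from earlier:

import time

TTL_SECONDS = 15 * 60
_cache = {}  # key -> (value, fetched_at)

def get_config(key):
    # Serve from the local cache unless this key's entry has expired.
    value, fetched_at = _cache.get(key, (None, 0.0))
    if time.time() - fetched_at > TTL_SECONDS:
        value = fetch_from_config_service(key)  # hypothetical network call
        _cache[key] = (value, time.time())
    return value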

How often will the configuration change during good times?

The most important consideration for configuration is how often the configuration needs to change. Do you plan on changing it daily? Can it be updated weekly or monthly instead? What events would cause you to need to update it? 

Typically, if a human is generating the change your configuration can update to reflect the new state in hours or days. 

Dig into your use case. 

Are you getting configuration from a partner on a quarterly, monthly or weekly basis? In that case static config is the way to go. 

Are you getting rules updates from the product team on a daily basis?

In that case static config is still the way to go. But this assumes you can deploy static config changes in a matter of hours, reliably. If your team can't deploy a change to static config reliably, this situation might force you to use dynamic configuration. You might want to focus on improving your CI/CD before making the jump to dynamic configuration. 

Are you getting rules or other config changes on an hourly basis from the product team?

This is the inflection point for moving to dynamic configuration. Most teams do not have the ability to deploy changes to static config on an hourly basis. And typically, even if you could pull it off, dynamic config is a better solution here. 

How quickly does the configuration NEED to be changed in emergencies?

If you have an emergency, will you need to update this configuration? Business rules probably do not need to be changed during a production outage. Instead, you are likely to want to turn a feature toggle off, or increase the timeout for a service call. 

How many services will use the configuration?

Having more services involved changes the tradeoffs you need to make when picking a solution. If you have just one service, a static file or DynamoDB table can fulfill all your needs. Scaling that architecture might not work as well. You don't want to end up with dynamic configuration for one hundred services stored in one hundred AWS accounts across one hundred DynamoDB tables. 

In organizations with more services, consistency and automated rules enforcement become more important than convenience. Automatically confirming that dynamic configuration conforms to standards across one hundred unique services, AWS accounts, and DynamoDB table structures is likely impossible. Automatically confirming standards in one shared dynamic configuration service with one hundred client services is going to be a lot easier. 

Just tell me what to use!

If you want a general guideline, my recommendation is to start with static configuration stored in your service repository. Once you have a serious need for rapidly updating configuration, set up a DynamoDB table to store it in. But before you do so, make sure you actually expect to be making multiple configuration updates per day. If the people asking for dynamic configuration can't give you a list of the updates they want and the times they need them to go out, you don't need dynamic configuration. 
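A sketch of what reading that table looks like with boto3, assuming a hypothetical table named service-config keyed by config_key:

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("service-config")  # hypothetical table name

def get_dynamic_config(key):
    # One item per config key; returns None if the key is missing.
    response = table.get_item(Key={"config_key": key})
    return response.get("Item", {}).get("value")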

Finally, if you have more than three services using dynamic configuration, set up an API service as your shared dynamic configuration store. 

Querying complex JSON objects in AWS Athena

AWS Athena is a managed big data query system based on S3 and Presto. It supports a bunch of big data formats like JSON, CSV, Parquet, ION, etc. Schemas are applied at query time via AWS Glue. 

To get started with Athena you define your Glue table in the Athena UI and start writing SQL queries. For my project I've been working with heavily nested JSON. The schema for this data is very involved, so I opted to just use JSON Path and skip defining my schema in AWS Glue. Of course, that was after I wasted two days trying to figure out how to do it. 

Setting up AWS Athena JSON tables

AWS Athena requires that each of your JSON objects be on a single line. That means you can't have pretty-printed JSON in your S3 bucket. 

{
    "key" : "value",
    "hats" : ["sombrero", "bowler", "panama"]
} 

If your JSON files look like the example above they won't work. You have to remove the line breaks so each JSON object takes up a single line.

{"key":"value","hats":["sombrero","bowler","panama"]}

One tripping point is that it's not obvious in the docs how to define your table if you aren't going to use Glue schemas. 

My dataset consists of S3 objects that look like this. 

{
    "data":
    {
        "component1":
        {
            "date": "2020-02-24",
            "qualifier": "Turnips",
            "objects":
            [
                {
                    "type": " bike",
                    "date": "2022-01-14",
                    "xy" : {
                        "xy-01": 1432
                    }
                },
                {
                    "type": "pen",
                    "date": "2021-07-11",
                    "xy" : {
                        "xy-01": 11
                    }
                },
                {
                    "type": "pencil",
                    "date": "2021-07-11",
                    "xy" : {
                        "xy-01": 67
                    }
                }
            ]
        },
        "component2":
        {
            "date": "2021 - 03 - 15",
            "qualifier": "Arnold",
            "objects":
            [
                {
                    "type": "mineral",
                    "date": "2010-01-30",
                    "xy" : {
                        "xy-01": 182
                    }
                }
            ]
        },
        "component3":
        {
            "date": "2020-02-24",
            "qualifier": "Camera",
            "objects":
            [
                {
                    "type": "machine",
                    "date": "2022-09-14",
                    "xy" : {
                        "xy-01": 13
                    }
                },
                {
                    "type": "animal",
                    "date": "2021-07-11",
                    "xy" : {
                        "xy-01": 87
                    }
                }
            ]
        }
    }
}

To use Athena and JSON Path for a nested JSON object like this, you simply define a single column called “data” with a string type, and you are done. 

Setting up an Athena table for complex JSON object queries.
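Here's a minimal sketch of that table definition, assuming the objects live under a hypothetical S3 prefix and using the OpenX JSON SerDe that Athena supports:

CREATE EXTERNAL TABLE athena_blog_post (
    data string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://your-bucket/athena-blog-post/'

Because each object's contents sit under a top-level "data" key, the SerDe maps that key to the single string column, and everything below it stays opaque until query time.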

Then you query using JSON Path like this. 

Example of a simple JSON Path Athena query

SELECT 
    json_extract_scalar(data, '$.component1.date') as service_date, 
    json_extract_scalar(data, '$.component2.objects[0].type') as component2_json,
    json_extract_scalar(data, '$["component3"]["objects"][0]["xy"]["xy-01"]') as xy_value
FROM "athena_blog_post"

Note that the json_extract and json_extract_scalar functions take a string column as their argument.

One thing to watch out for is that the dash character “-” is not allowed in the dot notation form of JSON Path. So you might have to use the bracket notation instead, as the xy_value column above does.

Using the UNNEST operator to pivot and flatten JSON arrays of objects.

Often when working with JSON data you will have arrays of objects. You might want to split each entry in a nested array into its own row. This can be accomplished with the UNNEST operator.

Using the UNNEST operator to pull data from nested arrays into rows.

Here we have used json_extract to retrieve the nested JSON array. Then we cast it to array(json), which the UNNEST operator accepts.

SELECT
    json_extract_scalar(data, '$.component1.date') as service_date,
    json_extract_scalar(components, '$.date') as component_date,
    json_extract_scalar(components, '$.type') as component_type,
    components,
    "$path"
FROM "athena_blog_post"
CROSS JOIN UNNEST(CAST(json_extract(data, '$.component1.objects') as array(json))) as t (components)

What is your design budget?

We all want to have good long-term software architecture. Build it right the first time. But some organizations fall into the trap of trying to get a perfect design before they start building.

They include more stakeholders, try to plan for contingencies, and develop a high availability strategy. 

Those things are all good to have. But you don’t need them before you have users. If you have no users you shouldn’t be thinking about high availability. If your product doesn’t work yet you don’t need scalability.

Users > working product > scaling limitations > high availability

Another problem is having a committee to approve architecture proposals. One or two people are fine. But having three or more people who can block the start of the project is a recipe for pointless delays.

The reason we switched to agile methodologies is that it's hard to know what the difficulties will be before you start building the software.

Be careful to structure your organization such that design doesn’t become more important than delivery.

A working product with customers is always more valuable than a product that doesn’t have customers but is highly available.

A design budget is a way to avoid falling into this trap. Simply budget a week, or a month, or whatever fits, for designing up front. At the end of that time, enforce a hard stop on doing more design. No more committee reviews or deliberation. If the architecture isn't developed enough to start building, do a spike on a smaller scale.