SledgeConf July 15th 2022

4:30PM Pacific Time

https://sledgeconf.dev

Hey, I'm bringing back SledgeConf on July 15th. We have two talks this time: 'Planning your Cloud Migration' and 'Practical Software Estimation'.

“Planning your Cloud Migration”

Jay Manning joins us again to talk about planning and executing cloud migrations. Jay has extensive experience working with AWS partners as they move through their cloud journeys.

“Practical Software Estimation” 

I will be speaking on my research and interviews on practical software estimation: how do we estimate software today, and what are the best practices?

Make sure to join the mailing list to get the invite link and blog updates! See you all in July!

https://sledgeconf.dev

Querying complex JSON objects in AWS Athena

AWS Athena is a managed big data query system based on S3 and Presto. It supports a bunch of big data formats like JSON, CSV, Parquet, ION, etc. Schemas are applied at query time via AWS Glue. 

To get started with Athena you define your Glue table in the Athena UI and start writing SQL queries. For my project I've been working with heavily nested JSON. The schema for this data is very involved, so I opted to just use JSON Path and skip defining my schema in AWS Glue. Of course that was after I wasted two days trying to figure out how to do it.

Setting up AWS Athena JSON tables

AWS Athena requires that each of your JSON objects sit on a single line. That means you can't have pretty-printed JSON in your S3 bucket.

{
    "key" : "value",
    "hats" : ["sombrero", "bowler", "panama"]
} 

If your JSON files look like the example above, they won't work. You have to remove the line breaks so each JSON object takes up a single line.

{"key":"value","hats":["sombrero","bowler","panama"]}

One tripping point is that it's not obvious in the docs how to define your table if you aren't going to use Glue schemas.

My dataset consists of S3 objects that look like this. 

{
    "data":
    {
        "component1":
        {
            "date": "2020-02-24",
            "qualifier": "Turnips",
            "objects":
            [
                {
                    "type": " bike",
                    "date": "2022-01-14",
                    "xy" : {
                        "xy-01": 1432
                    }
                },
                {
                    "type": "pen",
                    "date": "2021-07-11",
                    "xy" : {
                        "xy-01": 11
                    }
                },
                {
                    "type": "pencil",
                    "date": "2021-07-11",
                    "xy" : {
                        "xy-01": 67
                    }
                }
            ]
        },
        "component2":
        {
            "date": "2021 - 03 - 15",
            "qualifier": "Arnold",
            "objects":
            [
                {
                    "type": "mineral",
                    "date": "2010-01-30",
                    "xy" : {
                        "xy-01": 182
                    }
                }
            ]
        },
        "component3":
        {
            "date": "2020-02-24",
            "qualifier": "Camera",
            "objects":
            [
                {
                    "type": "machine",
                    "date": "2022-09-14",
                    "xy" : {
                        "xy-01": 13
                    }
                },
                {
                    "type": "animal",
                    "date": "2021-07-11",
                    "xy" : {
                        "xy-01": 87
                    }
                }
            ]
        }
    }
}

To use Athena and JSON Path for a nested JSON object like this, you simply define a single column called "data" with a string type and you are done.

Setting up an Athena table for complex JSON object queries.
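
For reference, a minimal table definition along those lines might look like the sketch below. The SerDe class is the OpenX JSON SerDe commonly used with Athena; the table name matches the queries later in this post, but the S3 location is a placeholder, so adjust both for your own setup.

CREATE EXTERNAL TABLE athena_blog_post (
    -- single string column holding the serialized value of the top-level "data" key
    data string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://your-bucket/athena-blog-post/';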

Then you query using JSON Path like this. 

Example of a simple JSON Path Athena query

SELECT 
    json_extract_scalar(data, '$.component1.date') as service_date, 
    json_extract_scalar(data, '$.component2.objects[0].type') as component2_json,
    json_extract_scalar(data, '$["component3"]["objects"][0]["xy"]["xy-01"]') as xy_value
FROM "athena_blog_post"

Note that the json_extract and json_extract_scalar functions take a string column as their argument.

One thing to watch out for is that the dash character "-" is not allowed in the dot notation form of JSON Path, so for keys like "xy-01" you have to use the bracket notation instead.
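
As a small illustration of the difference between the two functions, and of the bracket notation, here is a sketch against the same table (the aliases are just illustrative):

SELECT
    -- json_extract returns a JSON value, here the whole nested array
    json_extract(data, '$.component1.objects') as component1_objects,
    -- json_extract_scalar returns a varchar
    json_extract_scalar(data, '$.component1.qualifier') as component1_qualifier,
    -- bracket notation is needed because of the dash in "xy-01"
    json_extract_scalar(data, '$["component1"]["objects"][0]["xy"]["xy-01"]') as first_xy_value
FROM "athena_blog_post"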

Using the UNNEST operator to pivot and flatten JSON arrays of objects.

Often when working with JSON data you will have arrays of objects. You might want to split each entry in a nested array into its own row. This can be accomplished with the UNNEST operator.

Using the UNNEST operator to pull data from nested arrays into rows.

Here we have used json_extract to retrieve the nested JSON array. Then we cast it into an array of json, which the UNNEST operator accepts.

SELECT  json_extract_scalar(data, '$.component1.date') as service_date, 
		json_extract_scalar(components, '$.date') as component_date,
		json_extract_scalar(components, '$.type') as component_type,
		components,
		"$path"
FROM "athena_blog_post"
CROSS JOIN UNNEST(CAST(json_extract(data, '$.component1.objects') as array(json))) as t (components)
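
Once the array entries are rows, you can treat them like any other table. For example, here is a sketch that counts the unnested entries by type, using the same table and column names as above:

SELECT
    json_extract_scalar(components, '$.type') as component_type,
    count(*) as occurrences
FROM "athena_blog_post"
CROSS JOIN UNNEST(CAST(json_extract(data, '$.component1.objects') as array(json))) as t (components)
GROUP BY json_extract_scalar(components, '$.type')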

What is your design budget?

We all want good long-term software architecture. Build it right the first time. But some organizations fall into the trap of trying to get a perfect design before they start building.

They include more stakeholders, try to plan for contingencies, and develop a high-availability strategy.

Those things are all good to have. But you don't need them before you have users. If you have no users, you shouldn't be thinking about high availability. If your product doesn't work yet, you don't need scalability.

Users > working product > scaling limitations > high availability

Another problem is having a committee to approve architecture proposals. One or two people is fine. But having three or more people who can block the start of the project is a recipe for pointless delays.

The reason we switched to agile methodologies is that it's hard to know what the difficulties will be before you start building the software.

Be careful to structure your organization such that design doesn’t become more important than delivery.

A working product with customers is always more valuable than a product that doesn’t have customers but is highly available.

A design budget is a way to avoid falling into this trap. Simply budget a week or a month or whatever fits for up-front design. At the end of that time, enforce a hard stop on further design. No more committee reviews or deliberation. If the architecture isn't developed enough to start building, do a spike on a smaller scale.

Polyglot programming is a bad goal 

People have extended the original goal of microservices, that we aren't locked into a single language, into a new goal of every repository being able to use any language.

It's phrased as 'shouldn't we use the best language for the job?' But the implication is that we should be able to pick a language for this new project independent of the rest of our existing software stack.

The problem is that it's not a free choice. Even if you're in an ideal microservices environment, where everything is API based, logging and metrics are handled by the platform, and shared libraries are banned, there are still costs to having more languages in play.

The most basic cost is expertise splintering. If you only use one language and everyone knows it, then everyone can read and extend all the code. But as you start throwing more languages and paradigms in, that slowly becomes impossible.

You quickly reach a point where transferring people to a new team is expensive. Transferring from one Java project to another in the same stack is easy. Switching from Java to async JavaScript is trickier. Switching from Java to Haskell takes a while.

Polyglot makes away team work more difficult. And it does so slowly, so you won't really notice until it's too late.

Shared libraries are another area where polyglot sucks. You can't always make a library an API.

Whether it's because you can't externalize metrics and logging, or because you have high-performance business logic you need to run locally, oftentimes you will need to duplicate shared libraries across languages.

Polyglot is nice in a lot of ways. I'd rather get to use Pandas for data science if I had the opportunity. We don't really need to use only Java for everything. But when it comes to application code, I'd rather everyone used the same language, framework, and RPC schemas. It doesn't make any sense to write certain microservices in Java, others in Haskell, some in Python, and a few in TypeScript. If they are RPC or event-based services, you are just using four languages to do the same job instead of one language to do one job.

Then go ahead and write sidecars in Go or Rust, frontends in TypeScript, and data science stuff in Python. Those are all different use cases with different trade-offs.

Away team work

Away team work is a critical component of high-performance software organizations. It is a way for high-priority teams to work around other teams' manpower constraints to deliver software. Without a well-established culture of away team project work, your organization will default to a standard of 'shut up and wait while your item is in the backlog'.

What is away team work exactly? Away team work is when your team implements a feature or integration in another team's codebase. Strictly speaking, in away team work the host team does not review or approve pull requests. Your team's senior engineers will get approval at an architectural level and then handle pull request review themselves.

The away team process allows your team to unblock itself when dependencies don't have manpower to spare. The key component is that your team, not the dependency's team, does the work: the implementation, the testing, and the code review. Some effort is obviously required of the host team, but it should be minimized.

Away team work acts as a release valve on the conflict between the host team's priorities and potential clients' goals of delivering features to customers. If the host doesn't have manpower available, your team can provide the engineers to make it happen.

Without away team work your organization will have to make more tradeoffs on the scheduling side. More often than not you'll have to pick a migration over the new feature. But away team work would enable your organization as a whole to deliver both.

What do you need to make away team work happen?

The key thing is that you have to create a formal process that specifies the maximum standards a host team can impose. You can't allow teams to be squeamish about it. If the requirement is that two senior engineers approve, that's fine. Or maybe only senior engineers can do away team work in your company. That's fine too, but you need a formal standard to avoid negotiations happening on a per-project basis.