Querying complex JSON objects in AWS Athena

AWS Athena is a managed big data query system based on S3 and Presto. It supports a bunch of big data formats, like JSON, CSV, Parquet, and ION. Schemas are applied at query time via AWS Glue.

To get started with Athena you define your Glue table in the Athena UI and start writing SQL queries. For my project I’ve been working with heavily nested JSON. The schema for this data is very involved, so I opted to just use JSON Path and skip defining my schema in AWS Glue. Of course, that was after I wasted two days trying to figure out how to do it.

Setting up AWS Athena JSON tables

AWS Athena requires that each JSON object be on a single line. That means you can’t have pretty-printed JSON in your S3 bucket.

{
    "key" : "value",
    "hats" : ["sombrero", "bowler", "panama"]
} 

If your JSON files look like the example above they won’t work. You have to remove the line breaks so each JSON object takes up a single line.

{"key":"value","hats":["sombrero","bowler","panama"]}

One tripping point is that it’s not obvious in the docs how to define your table if you aren’t going to use Glue schemas.

My dataset consists of S3 objects that look like this. 

{
    "data":
    {
        "component1":
        {
            "date": "2020-02-24",
            "qualifier": "Turnips",
            "objects":
            [
                {
                    "type": " bike",
                    "date": "2022-01-14",
                    "xy" : {
                        "xy-01": 1432
                    }
                },
                {
                    "type": "pen",
                    "date": "2021-07-11",
                    "xy" : {
                        "xy-01": 11
                    }
                },
                {
                    "type": "pencil",
                    "date": "2021-07-11",
                    "xy" : {
                        "xy-01": 67
                    }
                }
            ]
        },
        "component2":
        {
            "date": "2021 - 03 - 15",
            "qualifier": "Arnold",
            "objects":
            [
                {
                    "type": "mineral",
                    "date": "2010-01-30",
                    "xy" : {
                        "xy-01": 182
                    }
                }
            ]
        },
        "component3":
        {
            "date": "2020-02-24",
            "qualifier": "Camera",
            "objects":
            [
                {
                    "type": "machine",
                    "date": "2022-09-14",
                    "xy" : {
                        "xy-01": 13
                    }
                },
                {
                    "type": "animal",
                    "date": "2021-07-11",
                    "xy" : {
                        "xy-01": 87
                    }
                }
            ]
        }
    }
}

To use Athena and JSON Path for a nested JSON object like this, you simply define a single column called “data” with a string type and you are done.

Setting up an Athena table for complex JSON object queries.
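If you would rather use DDL than the console wizard, the table definition boils down to something like the sketch below. Treat it as a sketch of my setup rather than the exact definition: the OpenX JSON SerDe and the S3 location are assumptions you will need to adjust. Because the column name matches the top-level “data” key, the nested object comes through as a JSON string, which is why the paths in the queries below start at the components rather than at $.data.

CREATE EXTERNAL TABLE athena_blog_post (
    -- the whole nested object under the top-level "data" key, exposed as a raw JSON string
    data string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
-- assumption: wherever your single-line JSON objects live
LOCATION 's3://my-example-bucket/athena-blog-post/';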

Then you query using JSON Path like this. 

Example of a simple JSON Path Athena query

SELECT 
    json_extract_scalar(data, '$.component1.date') as service_date, 
    json_extract_scalar(data, '$.component2.objects[0].type') as component2_json,
    json_extract_scalar(data, '$["component3"]["objects"][0]["xy"]["xy-01"]') as xy_value
FROM "athena_blog_post"

Note that the json_extract and json_extract_scalar functions take a string column as their first argument.

One thing to watch out for is that the dash character “-” is not allowed in the dot notation form of JSON Path. So for keys like “xy-01” you have to use the bracket notation instead.
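To make the difference between the two functions concrete, here is a sketch against the same table (the column aliases are just illustrative): json_extract returns a JSON value, which is what you want for objects and arrays, while json_extract_scalar only works for scalar values and returns them as varchar. The last column uses the bracket notation required by the dashed “xy-01” key.

SELECT 
    json_extract(data, '$.component1.objects') as objects_json,           -- a JSON array, not a scalar
    json_extract_scalar(data, '$.component1.qualifier') as qualifier,     -- a scalar, returned as varchar
    json_extract_scalar(data, '$["component1"]["objects"][0]["xy"]["xy-01"]') as first_xy
FROM "athena_blog_post"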

Using the UNNEST operator to pivot and flatten JSON arrays of objects.

Often when working with JSON data you will have arrays of objects. You might want to split each entry in a nested array into its own row. This can be accomplished with the UNNEST operator.

Using the UNNEST operator to pull data from nested arrays into rows.

Here we have used json_extract to retrieve the nested “objects” array. Then we cast it to array(json), which the UNNEST operator accepts.

SELECT  json_extract_scalar(data, '$.component1.date') as service_date, 
		json_extract_scalar(components, '$.date') as component_date,
		json_extract_scalar(components, '$.type') as component_type,
		components,
		"$path"
FROM "athena_blog_post"
CROSS JOIN UNNEST(CAST(json_extract(data, '$.component1.objects') as array(json))) as t (components)
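If you also want one row per component, instead of hard coding component1, the same idea works with a map. This is a sketch rather than a query from my project: casting the parsed object to map(varchar, json) lets UNNEST emit the component name and the component body as two separate columns.

SELECT 
    component_name,
    json_extract_scalar(component, '$.qualifier') as qualifier,
    json_extract_scalar(component, '$.date') as component_date
FROM "athena_blog_post"
CROSS JOIN UNNEST(CAST(json_parse(data) as map(varchar, json))) as t (component_name, component)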

What is your design budget?

We all want to have good long-term software architecture. Build it right the first time. But some organizations fall into the trap of trying to get a perfect design before they start building.

They include more stakeholders, try to plan for contingencies, and develop a high availability strategy.

Those things are all good to have. But you don’t need them before you have users. If you have no users you shouldn’t be thinking about high availability. If your product doesn’t work yet you don’t need scalability.

Users > working product > scaling limitations > high availability

Another problem is having a committee to approve architecture proposals. One or two people is fine. But having three or more people who can block the start of the project is a recipe for pointless delays.

The reason we switched to agile methodologies is that it’s hard to know what the difficulties will be before you start building the software.

Be careful to structure your organization such that design doesn’t become more important than delivery.

A working product with customers is always more valuable than a product that doesn’t have customers but is highly available.

A design budget is a way to avoid falling into this trap. Simply budget a week or a month or whatever for designing up front. At the end of that time, enforce a hard stop on doing more design. No more committee reviews or deliberation. If the architecture isn’t developed enough to start building, do a spike on a smaller scale.

Polyglot programming is a bad goal 

People have extended the original goal of microservices, not being locked into a single language, into a new goal of every repository being able to use any language.

It’s phrased as ‘shouldn’t we use the best language for the job?’ But the implication is that we should be able to pick a language for this new project independent of the rest of our existing software stack.

The problem is that it’s not a free choice. Even if you’re in an ideal microservices environment where everything is API based, logging and metrics are handled by the platform, and shared libraries are banned, there are still costs to having more languages in play.

The most basic cost is expertise splintering. If you only use one language and everyone knows it, then everyone can read and extend all the code. But as you start throwing more languages and paradigms in, that slowly becomes impossible.

You quickly reach a point where transferring people to a new team is expensive. Transferring from one Java project to another in the same stack is easy. Switching from Java to async JavaScript is trickier. Switching from Java to Haskell takes a while.

Polyglot makes away team work more difficult. And it does it slowly so you won’t really notice until it’s too late. 

Shared libraries are another area where polyglot sucks. You can’t always make a library an API.

Whether it’s because you can’t externalize metrics and logging or because you have high performance business logic that needs to run locally, you will often need to duplicate shared libraries across languages.

Polyglot is nice in a lot of ways. I’d rather get to use Pandas for data science if I had the opportunity. We don’t really need to use only Java for everything. But when it comes to application code, I’d rather everyone used the same language, framework, and RPC schemas. It doesn’t make any sense to write certain microservices in Java, others in Haskell, some in Python, and a few in TypeScript. If they are RPC or event based services you are just using four languages to do the same job instead of one language to do one job.

Then go ahead and write sidecars in Go or Rust, frontends in TypeScript, and data science stuff in Python. Those are all different use cases with different trade-offs.

Away team work

Away team work is a critical component of high performance software organizations. It is a way for high priority teams to work around other teams’ manpower constraints to deliver software. Without a well-established culture of away team project work, your organization will default to a standard of ‘shut up and wait while your item is in the backlog’.

What is away team work exactly? Away team work is when your team implements a feature or integration in another team’s codebase. Strictly speaking, in away team work the host team does not review or approve pull requests. Your team’s senior engineers get approval at an architectural level and then handle pull request review themselves.

The away team process allows your team to unblock itself when dependencies don’t have manpower to spare. The key component is that your team, not the dependency’s team, does the work: the implementation, the testing, and the code review. Some effort is obviously required of the host team, but it should be minimized.

Away team work acts as a release valve on the conflict between the host team’s priorities and potential clients’ goals of delivering features to customers. If the host doesn’t have manpower available, your team can provide the engineers to make it happen.

Without away team work your organization will have to make more tradeoffs on the scheduling side. More often than not you’ll have to pick a migration over the new feature. But away team work would have enabled your organization as a whole to deliver both.

What do you need to make away team work happen?

The key thing is that you have to make a formal process that specifies the maximum standards a host team can require. You can’t allow teams to be squeamish about it. If the requirement is that two senior engineers approve, that’s fine. Or maybe only senior engineers can do away team work in your company. That’s fine too, but you need a formal standard to avoid negotiations happening on a per-project basis.

The core problem of microservices: how many features can you fit through the pipeline before it breaks down?

At Amazon I had the chance to watch a monolithic service reach the point where it had to be split up into microservices. When I started we had about thirty software engineers contributing to our service. We had the great idea of developing a framework to speed up feature delivery for everyone in Alexa Shopping. The framework ended up working, and the number of people contributing code to our service shot up over the next two years. As of July 2021 we had around 200-300 people in our partner support Slack channel.

What happened is that we gradually spent more and more time supporting tests in our CD pipeline. First we had an oncall who did operational support and pipeline release support. Then, once that person got swamped and complaints about release frequency got louder, we added a second person to operational support. Then we had one person doing releases and outages, with a second doing partner pull requests and office hours.

Growth continued during this time, and we attempted a number of changes to federate features and split out responsibility. We split end to end tests into separate suites so it would be easier to find out who knew what a feature was supposed to be doing. This helped a lot; prior to federating the test suites, our oncalls would spend a lot of time deep diving end to end tests to figure out what was going on. Afterwards it was a lot easier to find out who we could ask for help.

One thing that was a huge failure was expecting other teams to debug and fix their end to end tests in the pipeline. Typically, team A has a launch deadline and is asking us to deploy now. We get their code to the staging environment and we see test failures from teams B, C, and D. Teams B, C, and D do not have a launch deadline, so they are not prioritizing fixing their end to end tests.

Another big failure was splitting each team’s rules into a different repository. We ended up with ten repositories with one relevant file in each. It could have just been a folder in the original project. Plus, it was a lot harder to figure out where to put things across ten different projects. The nail in the coffin, from my perspective, was that the rules were still deployed together.

One final issue was integration tests. We wanted people to write integration tests, but everybody (including myself) avoided them as much as possible. The reality was that the DSL for our end to end tests was significantly better than the one for integration tests. It was hard to reason about how you were testing your specific feature at the API level. But in the end all our feature changes were directly testable at the end to end test level. It was just a lot quicker and easier to write an end to end test than an API level test.

Finally, we reached the point where we had to split up the monolith purely because the pipeline was blocking too many teams. It was a risk because we knew that, in the short term, it would increase operational overhead and delay needed upgrades. Unfortunately, the team lost about 50% of its people, especially the most experienced ones. And I was one of them, so I don’t know how things ended up.