Querying complex JSON objects in AWS Athena

AWS Athena is a managed big data query system built on S3 and Presto. It supports a bunch of big data formats like JSON, CSV, Parquet, and ION. Schemas are applied at query time via AWS Glue.

To get started with Athena, you define your Glue table in the Athena UI and start writing SQL queries. For my project I’ve been working with heavily nested JSON. The schema for this data is very involved, so I opted to just use JSON Path and skip defining my schema in AWS Glue. Of course, that was after I wasted two days trying to figure out how to do it.

Setting up AWS Athena JSON tables

AWS Athena requires that each JSON object be on a single line. That means you can’t have pretty-printed JSON in your S3 bucket.

{
    "key" : "value",
    "hats" : ["sombrero", "bowler", "panama"]
} 

If your JSON files look like the example above, they won’t work. You have to remove the line breaks so each JSON object takes up a single line.

{"key":"value","hats":["sombrero","bowler","panama"]}

One tripping point is that it’s not obvious in the docs how to define your table if you aren’t going to use Glue schemas.

My dataset consists of S3 objects that look like this. 

{
    "data":
    {
        "component1":
        {
            "date": "2020-02-24",
            "qualifier": "Turnips",
            "objects":
            [
                {
                    "type": " bike",
                    "date": "2022-01-14",
                    "xy" : {
                        "xy-01": 1432
                    }
                },
                {
                    "type": "pen",
                    "date": "2021-07-11",
                    "xy" : {
                        "xy-01": 11
                    }
                },
                {
                    "type": "pencil",
                    "date": "2021-07-11",
                    "xy" : {
                        "xy-01": 67
                    }
                }
            ]
        },
        "component2":
        {
            "date": "2021 - 03 - 15",
            "qualifier": "Arnold",
            "objects":
            [
                {
                    "type": "mineral",
                    "date": "2010-01-30",
                    "xy" : {
                        "xy-01": 182
                    }
                }
            ]
        },
        "component3":
        {
            "date": "2020-02-24",
            "qualifier": "Camera",
            "objects":
            [
                {
                    "type": "machine",
                    "date": "2022-09-14",
                    "xy" : {
                        "xy-01": 13
                    }
                },
                {
                    "type": "animal",
                    "date": "2021-07-11",
                    "xy" : {
                        "xy-01": 87
                    }
                }
            ]
        }
    }
}

To use Athena and JSON Path for a nested JSON object like this, you simply define a single column called “data” with a string type and you are done.

Setting up an Athena table for complex JSON object queries.
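
The DDL for that is minimal. Here is a sketch of what the CREATE TABLE can look like, assuming the OpenX JSON SerDe that Athena supports; the table name matches the queries below and the S3 location is a placeholder for your own bucket.

CREATE EXTERNAL TABLE athena_blog_post (
  data string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://your-bucket/your-prefix/'

Because the top-level key in each object is “data”, the SerDe maps it onto the single string column and the whole nested structure comes through as one JSON string for the JSON Path functions to work on.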

Then you query using JSON Path like this. 

Example of a simple JSON Path Athena query

SELECT 
    json_extract_scalar(data, '$.component1.date') as service_date, 
    json_extract_scalar(data, '$.component2.objects[0].type') as component2_json,
    json_extract_scalar(data, '$["component3"]["objects"][0]["xy"]["xy-01"]') as xy_value
FROM "athena_blog_post"

Note that the json_extract and json_extract_scalar functions take a string column as their argument.

One thing to watch out for is that the dash character “-” is banned from the dot notation form of JSON Path. So for keys like “xy-01” you have to use the bracket notation instead, as in the third column above.

Using the UNNEST operator to pivot and flatten JSON arrays of objects.

Often when working with JSON data you will have arrays of objects. You might want to split each entry in a nested array into its own row. This can be accomplished with the UNNEST operator.

Using the UNNEST operator to pull data from nested arrays into rows.

Here we use json_extract to retrieve the nested JSON array. Then we cast it to array(json), which the UNNEST operator accepts. The “$path” pseudo-column shows which S3 object each row came from.

SELECT
    json_extract_scalar(data, '$.component1.date') as service_date,
    json_extract_scalar(components, '$.date') as component_date,
    json_extract_scalar(components, '$.type') as component_type,
    components,
    "$path"
FROM "athena_blog_post"
CROSS JOIN UNNEST(CAST(json_extract(data, '$.component1.objects') as array(json))) as t (components)
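
Once the rows are flattened, you can treat them like any other table. For example, here is a sketch, against the same placeholder table, that counts how many objects of each type appear in component1.

SELECT
    json_extract_scalar(components, '$.type') as component_type,
    count(*) as total_objects
FROM "athena_blog_post"
CROSS JOIN UNNEST(CAST(json_extract(data, '$.component1.objects') as array(json))) as t (components)
GROUP BY 1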

Polyglot programming is a bad goal 

People have taken the original goal of building microservice systems, where we aren’t locked into a single language, and extended it into a new goal: every repository should be able to use any language.

It’s phrased as ‘shouldn’t we use the best language for the job?’ But the implication is that we should be able to pick a language for each new project independently of the rest of our existing software stack.

The problem is that it’s not a free choice. Even if you’re in an ideal microservices environment where everything is API based, logging and metrics are handled by the platform, and shared libraries are banned, there are still costs to having more languages in play.

The most basic cost is expertise splintering. If you only use one language and everyone knows it, then everyone can read and extend all the code. But as you start throwing more languages and paradigms in, that slowly becomes impossible.

You quickly reach a point where transferring people to a new team is expensive. Transferring from one Java project to another in the same stack is easy. Switching from Java to async JavaScript is trickier. Switching from Java to Haskell takes a while.

Polyglot makes away-team work more difficult. And it does it slowly, so you won’t really notice until it’s too late.

Shared libraries are another area where polyglot sucks. You can’t always turn a library into an API.

Whether it’s because you can’t externalize metrics and logging, or because you have high-performance business logic that needs to run locally, you will often need to duplicate shared libraries across languages.

Polyglot is nice in a lot of ways. I’d rather use Pandas for data science when I get the opportunity. We don’t really need to use only Java for everything. But when it comes to application code, I’d rather everyone used the same language, framework and RPC schemas. It doesn’t make any sense to write certain microservices in Java, others in Haskell, some in Python and a few in TypeScript. If they are RPC or event-based services, you are just using four languages to do the same job instead of one language to do one job.

Then go ahead and write sidecars in Go or Rust, frontends in TypeScript and data science stuff in Python. Those are all different use cases with different trade-offs.

So you’re making an NFT….

Hey everyone, I’m working on an NFT project. Originally, I wanted to get programmer NFTs for the team at Guava Labs. I wanted to do something like Coders waiting for Compilers, but that didn’t have enough pizazz. Then I was talking to some NFT artists and seeing if we could sell them a website with web3 components to tie into their NFT projects, but none of them were ready to go. So I ended up getting an idea and making up a mythos.

Compiler Fighters Club

Synopsis

The Compiler Fighters are a group of people dedicated to fighting compilers. This band of software developers, engineers, testers and hackers is taking the fight to all enemies of good code. Against them are arrayed the fearsome forces of disorder: the Infinite Bug Army, the Zero Day League, and even the compilers themselves.

The Compiler Fighters Club is an NFT project exploring the meaning behind ‘fighting the compiler’. We hope to create a fun set of collectible NFTs and identities for all the hackers out there.

Portraits

I am trying to figure out a way to let people have NFT characters created based on them. The really big NFT projects so far have mostly relied on computer-generating thousands of NFTs. Generated art is neat, but at this point do we need more computer-generated projects? Maybe if someone did an ML GAN-style NFT run it would be unique.

Characters 

Compiler Fighters

The Compiler Fighters are a group of misfit web developers, software engineers, software testers, sysadmins, hackers, and anyone with the will to fight the compiler. Most Compiler Fighters align with the software engineer or IT stereotype. But as Compiler Fighters are a misfit group in the real world as well as in the NFTs, we expect to see bodybuilders, metrosexuals and all kinds of crazy characters.

Compiler Fighters should be dressed in attire appropriate for professional programmers at work. Since dress codes are rare in the industry, that means anything from t-shirts and flip flops to suits. Although few hackers would be caught dead in a suit.

Compiler Fighters are typically armed with whatever weapons they can salvage from their offices when the forces of disorder attack. This includes Nerf guns, pool cues, Nerf swords, and anything else hackers are known to carry. Artistic license is encouraged in the armament of Compiler Fighters.

Infinite Bug Army

The Infinite Bug Army is a force of bugs which seek to sow disorder wherever they can. The Infinite Bug Army does not include any humans, and some people even question whether computer bugs are alive at all. Nonetheless, they are fearsome foes.

I’ll make sure to include an NFT of the original bug, the moth that got trapped in a relay of the Harvard Mark II back in the early days of computing.

Bug Army NFTs can be anything from images of bugs to images of code containing common or meme-worthy software bugs.

Zero Day League

The Zero Day League is a collection of software vulnerabilities, bugs and hacks. Examples include Heartbleed (https://heartbleed.com), Shellshock, Stagefright, etc. Thematically, they should all probably be actual zero days in the initial launch, meaning serious bugs that were exploited in the wild before being announced publicly, such that there were zero days of warning before people were hacked.

Compilers of Doom 

The Compilers of Doom are a force of software compilers which have risen up against the forces of order. They seek to overturn the software universe and end the era of human-machine cooperation.

Compilation errors are one area to explore for Compilers of Doom cards. Anyone with coding experience will recognize compiler errors from different programming languages. 

What’s next?

I’m having a great time with the concept. I’ll keep building out the mythos and some character ideas. Then we will work with some artists to do an initial release. I’m hoping to launch with some thematic NFTs and, of course, the Guava Labs team NFTs that were the original inspiration.

Doorkeeper with Rails and Zapier

I set up OAuth2 authentication with Zapier for Seviipay.com recently and figured it was worth a short post on the gotchas I encountered.

The biggest problem I had was with the refresh token configuration. I would do a test connection, validate that I wanted to give Zapier access, send a test notification, and it would work. Then an hour later I would get an email from Zapier that they got a 401 error and were turning off my integration. The issues were that Doorkeeper does not enable refresh tokens by default and that Zapier does not pass the client ID or secret to the refresh API by default.

Below is my doorkeeper.rb with all comments removed. I had to uncomment the line for use_refresh_token.


Doorkeeper.configure do
  orm :active_record

  # Authenticates the end user who is authorizing Zapier
  resource_owner_authenticator do
    current_user || warden.authenticate!(scope: :user)
  end

  # Authenticates admins for Doorkeeper's application management pages
  admin_authenticator do |_routes|
    current_user || warden.authenticate!(scope: :user)
  end

  access_token_expires_in 2.hours

  # Commented out by default; without it Zapier's token refresh fails with a 401
  use_refresh_token
end


Then in Zapier I just had to update my request options to include the client ID and client secret.
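
For reference, the refresh call Zapier makes is a standard OAuth2 refresh_token grant, and Doorkeeper’s token endpoint defaults to /oauth/token. The request needs roughly the shape below; the host is a placeholder and the {{...}} placeholders follow Zapier’s templating style, so treat this as a sketch rather than exact config.

POST https://yourapp.example.com/oauth/token
grant_type=refresh_token
&refresh_token={{bundle.authData.refresh_token}}
&client_id={{process.env.CLIENT_ID}}
&client_secret={{process.env.CLIENT_SECRET}}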

One other issue I ran into is that Doorkeeper requires you to set a scope. I just put the string ‘READ’ in for my scope and it seems to be working.
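
If you want to declare that scope on the Rails side as well, Doorkeeper’s default_scopes setting is the hook for it. A minimal sketch, assuming a scope named read that matches whatever string you configure in Zapier:

Doorkeeper.configure do
  # ... the rest of the configuration shown above ...

  # Scopes granted when a client does not explicitly request any.
  # This needs to line up with the scope string Zapier sends.
  default_scopes :read
end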

Seviipay – cryptocurrency payments for everyone

I’m working on a new project. Seviipay is a software-as-a-service product through which I hope to make blockchain integration easy for everyone. It is in free alpha launch right now and you can check it out at www.seviipay.com.

Why blockchain integration? And why now?

I first got involved with Ethereum back in 2015, before the network went live. It was a fun toy then, and I spent some time getting the test version running on my gaming desktop in college. Then I got bored of it and stopped paying attention until the 2017/18 bubble. The story would have ended there, except one of my friends did an NFT project this spring and I decided to buy one to support him. That led me to the UIs at opensea.io, veefriends.com and other web3 applications. Let me tell you, the user experience has come a long way in the last six years.

Web3 applications have serious advantages over the traditional web. Digital identity is a billion dollar business in the traditional web; in web3, everyone has identity built in. There are no accounts and no signups in web3. Your Ethereum wallet is your account on every web3 site. You don’t need to remember passwords for hundreds of websites; you just use your wallet to verify your identity and you are done.

Payments in web3 are comparable in ease of use to Apple Pay, possibly superior. Your browser wallet can authorize payments instantly. No need to enter your address and worry that you typed the wrong data in. Additionally, you never need to worry about your credit card number being stolen after you place an order at a random website. Making a payment via the Ethereum network does not leak secret payment information.

I’m super excited about the potential of web3 to improve identity and payments on the internet. The idea behind Seviipay is to make top-tier web3 UX easy to access while also providing ways to link actions on the blockchain to the rest of the world.

You can use Seviipay to set up a payment button on your website and then get an email when anyone places an order. I’m working on a Zapier integration so that you can get a Slack message or automatically add someone to a Teachable course when they make a purchase.