
Dude, where's my data?

QA's Principal Technologist gives his experience of the Google Data Engineering course.


Wow!

No, that's too short, right?

As part of our commitment to empowering our customers' Digital Transformation strategies, QA recently announced that we have expanded our Cloud Computing portfolio by introducing authorised Google Cloud course content, making us the first in the UK to deliver this training. As part of acquiring our Premium Partnership status, Google required us to on-board a number of our trainers on the various GCP educational tracks. Due to my background in development, data and the cloud, I was volunteered, I mean I eagerly volunteered, to be our first trainer on the Data Engineering track.

In July 2017, I attended my first ever GCP course, which happened to be the Data Engineering one. And ... Wow!

Prerequisites

Core GCP Knowledge

I had dabbled a little with GCP prior to attending this course, but I really could have done with some guided learning before pitching up at their training centre near Victoria on that sunny summer's day. Some kind of Google Cloud fundamentals course would have made some aspects of the learning a bit more straightforward. Unfortunately, this was the first available course for me. If you are new to GCP or have limited knowledge of the Cloud, I would suggest first attending one of the one-day fundamentals courses: either GCP Fundamentals: Big Data & Machine Learning or GCP Fundamentals: Core Infrastructure.

Cloud Knowledge

My intimate familiarity with the principles of Cloud Computing certainly helped me out. QA's Introduction to Cloud Computing or the platform-specific GCP Fundamentals: Core Infrastructure would help others.

Data Skills

It's possibly redundant to mention this, but we are talking about Data Engineering here, so my SQL and Data Modelling background was very useful, as was my history of extensive work in Extracting, Transforming and Loading (ETL) data. There is, of course, a bit of Pig in the course, but not so much that you'd need intimate knowledge of it up front.

[NOTE: To be honest, when I was doing ETL for a living, I was unaware that it had such a name. I called it "getting data from somewhere, doing stuff to it and putting it somewhere else". And still do.]

Data Science Skills

There are a number of words in the English language that I really struggle to pronounce. One of them is statistics. This is a fact, not a joke: just ask any of your colleagues who have attended one of my courses. If there was any mention of that word, I will have mangled it repeatedly. I have a similar problem with the practice of calculating statistics, or even remembering how to calculate them. Yes, of course I have used the STDEV function in Excel, but remembering what it means is very hard for me. Since attending the course, I have swotted up on the stats, but my lack of ability in this particular field didn't hinder me overmuch.

The same goes for Machine Learning. Something that Google are, of course, very good at, and have, to a large extent, democratised. I am aware of it, but need to invest some serious study time before I'm comfortable with delivering it. My lack of knowledge in this area was not a huge hindrance to my ability to follow the course, but I feel I would have got more out of it had I known more going in. I was still at the "absorbing the possibilities" phase whilst my instructor was showing me how to build my own ML model. Jaw-dropping stuff.

Coding Skills

In a former life, I was a Java dev. Happily, I've forgotten more Java than I ever learned and am now more familiar with Python. In-depth knowledge of either or both of those languages is essential to get the most out of this course. In my opinion, GCP is a very developer-centric platform, with a bias towards those two languages, so they are very useful for GCP as a whole.

Course Structure

As a bleeding-edge computer nerd, I really, really like the flow of the course. The hands-on lab environments last for the whole day, so you get to play around on a "free" GCP account during the breaks and lunch as well as during the lab times. Each slide deck has multiple breaks for labs, so it's very much a case of "here's a few knowledge bombs, go and play for a while; here's a few more, go and play" throughout the course. As an experienced trainer (read that as "Luddite change-resistant moaner"), I'm not sure about the frequent breaks; to me, the labs have always come at the end of the module: "Here are all of the knowledge bombs about this topic, now go and put them together in an hour-long lab." In terms of learning effectiveness, my inner nerd thinks my outer trainer is wrong, and any courses I develop from now on will try to follow the Google approach.

Course Content

NOTE: It seems that the running order for the course has changed in the couple of weeks since I attended it. Welcome to the ever-evolving Cloud! I'll go through it in the new running order.

Working with Unstructured Data

The course outline naturally uses the word "leveraging" here, but as I'm British I'll use real words instead. This first section of the course is kind of "legacy". No, I'm not sure when I started applying the word "legacy" to Cloud services, but yes, it does feel weird!

There's a lot of coverage of Dataproc which, as history buffs will know, is Google's managed Hadoop. Yes, I'm grossly over-simplifying the situation, but if I go into sufficiently technically correct detail, you won't need to come and attend the course!

The course covers the basics of Dataproc, running jobs in Dataproc, customisation options like tweaking the servers it runs on for your specific needs, and integrating it with other GCP features. This section wraps up with a gentle introduction to Google's Machine Learning APIs; their pre-written models, effectively.
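
To give a flavour of what "running jobs in Dataproc" means, here's a minimal PySpark word count of the sort you might submit to a cluster (with, say, gcloud dataproc jobs submit pyspark). The bucket paths are hypothetical, and this is my own sketch rather than anything from the course materials:

    from pyspark import SparkContext

    # Count word occurrences across text files in a (hypothetical) GCS bucket.
    # Dataproc clusters include the GCS connector, so gs:// paths just work.
    sc = SparkContext()
    counts = (sc.textFile('gs://my-bucket/input/*.txt')
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))
    counts.saveAsTextFile('gs://my-bucket/output/word-counts')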

For many people, MapReduce is Big Data. Many users will be migrating "Hadoopy" workflows to the Cloud, so I can see why this section has been brought forward from when I attended. I consider this to be "legacy" these days, as we can now do a lot of Big Data processing without worrying about servers and without using MapReduce. Google agrees with me, I'm pleased to say, so the next section is:

Serverless Data Analysis with BigQuery and Dataflow

NOTE: Serverless doesn't mean "no servers". It means "no servers that you have to worry about".

The recommended approach for new Big Data projects is to use BigQuery and Dataflow. What's BigQuery? It's a thing for doing Big Queries. The course covers the basics of BigQuery (BQ): legacy syntax and the ANSI-compliant standard syntax, getting data into BQ, getting data out of BQ and some very important performance considerations. Also the pricing model; if there are no servers to pay for, how do you pay for it?
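
By way of illustration, here's a minimal sketch of querying BQ from Python with the google-cloud-bigquery client library, against one of Google's public sample datasets. Note the backtick-quoted table name of standard (ANSI-compliant) SQL, versus legacy SQL's [project:dataset.table] brackets. It assumes your credentials and default project are already configured:

    from google.cloud import bigquery

    # Assumes application default credentials and a default project are set up.
    client = bigquery.Client()

    # Standard SQL uses backtick-quoted table names; the legacy equivalent
    # would be [bigquery-public-data:samples.shakespeare].
    sql = """
        SELECT word, SUM(word_count) AS total
        FROM `bigquery-public-data.samples.shakespeare`
        GROUP BY word
        ORDER BY total DESC
        LIMIT 10
    """

    for row in client.query(sql).result():
        print(row.word, row.total)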

It then moves on to Dataflow, whose API is now also known as Apache Beam, which ties in with BQ in terms of being a mechanism by which we can move our data around. It's the ETL piece of the Data Engineering story on GCP and, in my opinion, it's a lot easier to work with using Python, although some features are, at the time of writing, only available in the Java SDK. The course covers batch processing at this stage, but there's a little heads-up that we can use it to perform stream processing as well.
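
To give a feel for the programming model, here's a minimal sketch of a batch pipeline in Beam's Python SDK: another word count, reading from and writing to hypothetical GCS paths. The nice part is that the same code runs locally or on the managed Dataflow service, depending only on the runner you choose:

    import apache_beam as beam

    # Bucket paths are hypothetical. With no extra options this runs on the
    # local DirectRunner; pass --runner=DataflowRunner (plus project and
    # staging options) and the identical pipeline runs on the managed service.
    with beam.Pipeline() as pipeline:
        (pipeline
         | 'Read' >> beam.io.ReadFromText('gs://my-bucket/input/*.txt')
         | 'Split' >> beam.FlatMap(lambda line: line.split())
         | 'Pair' >> beam.Map(lambda word: (word, 1))
         | 'Count' >> beam.CombinePerKey(sum)
         | 'Format' >> beam.Map(lambda kv: '%s,%d' % kv)
         | 'Write' >> beam.io.WriteToText('gs://my-bucket/output/counts'))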

Serverless Machine Learning with TensorFlow

You might already have noticed, but Machine Learning (ML) is kind of a Big Deal at the moment. You can get a doctorate in ML. Clearly, four modules of an ILT course aren't going to get you a doctorate. We went from the basics (what is ML?) to working with TensorFlow, Google's library for building and training ML models, through to deploying models, exposing them as an API and advanced Feature Engineering. All breath-taking stuff, and more than enough to start you thinking about what your Machines can Learn about your business. Just don't expect a PhD at the end of it.
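
As a taste of the "basics" end of that journey, here's a toy low-level TensorFlow sketch in the 1.x style current when I attended: fitting a one-variable linear model by gradient descent. It's my own illustration, not course material:

    import numpy as np
    import tensorflow as tf

    # Toy training data: learn y = 2x + 1.
    xs = np.array([0.0, 1.0, 2.0, 3.0], dtype=np.float32)
    ys = 2.0 * xs + 1.0

    x = tf.placeholder(tf.float32)
    y = tf.placeholder(tf.float32)
    w = tf.Variable(0.0)  # weight, to be learned
    b = tf.Variable(0.0)  # bias, to be learned

    loss = tf.reduce_mean(tf.square(w * x + b - y))
    train = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for _ in range(1000):
            sess.run(train, feed_dict={x: xs, y: ys})
        print(sess.run([w, b]))  # should approach [2.0, 1.0]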

Building Resilient Streaming Systems

Having trailed it earlier, after our segue into the world of ML, we looped back around to talk about stream processing: the challenges of late-arriving messages and of windowing functions. We covered Cloud Pub/Sub (GCP's messaging service), using Dataflow over streaming data, streaming it into BQ, streaming it into Bigtable (Google's NoSQL solution) and building dashboards using Google Data Studio. There was a little bit of Cloud Spanner (Google's RDBMS) in there too.
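
To illustrate, here's a sketch of what a streaming pipeline looks like in Beam's Python SDK: reading from a hypothetical Pub/Sub topic, counting messages in one-minute fixed windows, and streaming the counts into a hypothetical BQ table. Bear in mind my earlier caveat that, at the time of writing, the Java SDK has the fuller streaming support:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions
    from apache_beam.transforms.window import FixedWindows

    # Topic, project and table names are all hypothetical.
    options = PipelineOptions(streaming=True)

    with beam.Pipeline(options=options) as pipeline:
        (pipeline
         | 'Read' >> beam.io.ReadFromPubSub(
               topic='projects/my-project/topics/events')
         | 'Key' >> beam.Map(lambda msg: (msg.decode('utf-8'), 1))
         | 'Window' >> beam.WindowInto(FixedWindows(60))  # one-minute windows
         | 'Count' >> beam.CombinePerKey(sum)
         | 'ToRow' >> beam.Map(
               lambda kv: {'event_type': kv[0], 'event_count': kv[1]})
         | 'Write' >> beam.io.WriteToBigQuery(
               'my-project:my_dataset.event_counts',
               schema='event_type:STRING,event_count:INTEGER'))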

What did I think of it?

I got an awful lot out of the course. Possibly more than the designers intended, as I was basically doing my first bit of "in-anger" work on GCP! So I learnt a lot about the console, which is clearly a good thing as, like all Cloud providers' consoles, there's a lot of it. But that meant that I was still learning fundamentals when I should have been building on them during the labs.

Did the course meet its stated objectives? (Reads the objectives on the Google website). Yes, I learnt about all of those things. I'm now spending time putting them into practice for my own purposes and prepping for the Google Certified Professional Data Engineer certification. Of course I'll blog about that once I've taken it. You're welcome.
