Creating an automatic monitoring application with Node.js, Playwright, and Google Sheets on Google Cloud Platform

Niels Kersic
11 min read · Nov 2, 2020

Do you have a value that you manually need to monitor and log? Automate the whole process with a simple Node.js application on Google Cloud!

Introduction

I have a resource that lives on multiple websites. I want to monitor this resource and log it in Google Sheets. The problem is that there are no APIs available for me to automatically get the statistics that I want.

Enter: Playwright

Playwright is a Node.js library for browser automation created by Microsoft. It can run so-called "headless" versions of the WebKit, Chromium, and Firefox browsers, meaning browsers without the actual GUI. This can be super useful for automated testing, but today, we'll be using it to scrape values from public-facing websites.

You might be familiar with another Node.js browser automation library called Puppeteer. Puppeteer would work just as well for this guide, but I personally wanted to try out Playwright.

Disclaimer: Automatically scraping content from a website is often not allowed. Check the Terms & Conditions of the website that you want to scrape before you decide to implement any system like this.

Using Playwright to fetch values from websites

Installing Playwright is easy enough using npm i playwright. This will install Playwright and all three browsers that Playwright can run, namely WebKit, Chromium, and Firefox. We're not doing any cross-browser testing, so we really only need one browser. That is also easy enough: npm i playwright-webkit installs only the WebKit browser. The same is possible for Chromium and Firefox; the choice is yours.

Below is a simple example in which Playwright goes to my portfolio website, niels.codes, and fetches my name. Handy in case I ever forget that.
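
A minimal sketch of that (the #title selector is an assumption; use whatever selector matches the element you're after):

```js
const { webkit } = require('playwright-webkit');

(async () => {
  // 1. Open a headless WebKit instance
  const browser = await webkit.launch();
  const page = await browser.newPage();

  // 2. Go to the URL
  await page.goto('https://niels.codes');

  // 3. Wait for the title element to appear (selector is an assumption)
  const title = await page.waitForSelector('#title');

  // 4. Get the h1 tag inside the title element
  const h1 = await title.$('h1');

  // 5. Read the inner text from the h1 element
  const name = await h1.innerText();
  console.log(name);

  await browser.close();
})();
```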

Let's go over what it does:

  1. Open an instance of the WebKit browser
  2. Go to the URL https://niels.codes
  3. Wait for the title element to appear on the page, using its CSS selector
  4. Get the title element's h1 tag
  5. Read the inner text from the h1 element

These steps will generally be the same for any value that you might want to fetch from a site. The URL might change and the element whose content you want might change, but the logic will be the same. You just open a page, select an element with its CSS selector, and extract the value.

Tip: If you can't seem to get the desired value, or the content doesn't look like you expect, it might be useful to take a screenshot with Playwright using await page.screenshot({ path: 'example.png' }). This will show you exactly what Playwright is seeing.

Adding the values to Google Sheets

Now that I have retrieved the statistics with Playwright, I want to add them to Google Sheets. For this, we can use the Google Sheets API.

Setting up secure access to Google Sheets

Specifically, we're using V4 (or later) of the Sheets API. With earlier versions, your spreadsheet had to be public, which is usually not desirable. V4 changed this.

So, how do we then allow our application to write to a private Google Sheet? It's actually pretty simple!

Services that run on Google Cloud use so-called Service Accounts. Basically, a Service Account is a service's identity. It says "who" your service is. This way you can assign permissions and control what your service can and cannot do. The interesting part is that Service Accounts always have an email address too. You simply add the email address for the Service Account to your spreadsheet as an editor and voila!

The exact steps will be covered in the section on deploying your application.

Writing the values to Google Sheets

Now that we have established a secure connection to Google Sheets, we can start using the API to actually write our data to the sheet! Here's a simple example of how to append a line to a sheet using the API:
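
The sketch below assumes the official googleapis Node.js client; the spreadsheet ID and range are placeholders to replace with your own:

```js
const { google } = require('googleapis');

async function appendRow(row) {
  // On Cloud Run, Application Default Credentials resolve to the attached
  // Service Account, so no key file is needed
  const auth = new google.auth.GoogleAuth({
    scopes: ['https://www.googleapis.com/auth/spreadsheets'],
  });

  const sheets = google.sheets({ version: 'v4', auth });

  await sheets.spreadsheets.values.append({
    spreadsheetId: 'YOUR_SPREADSHEET_ID', // placeholder: your sheet's ID
    range: 'Sheet1!A:C',                  // placeholder: the tab and columns to append to
    valueInputOption: 'USER_ENTERED',
    requestBody: {
      values: [row], // e.g. [['2020-11-02', 'niels.codes', 42]]
    },
  });
}
```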

It's very easy to append a new row in Google Sheets, because the API offers a built-in method for this. Appending new columns is a lot more difficult. For this reason, I recommend creating a sheet where the first row contains all the headers and all following rows contain raw data.

Tip: You can create graphs in Google Sheets that automatically update based on the data added by your application. In my case, I append all raw data to the first tab and then use 4 other tabs for all of my graphs.

Getting ready for deployment

At this point, we have an application that gets data from websites and writes it to Google Sheets. Now we just need to deploy it to completely automate the process!

For this, we'll be using Google Cloud. Specifically, our application will be running on Google Cloud Run. Cloud Run is one of Google Cloud's serverless solutions and it is pretty awesome! It allows you to deploy any Docker container that serves HTTP requests, and it handles all the provisioning and scaling for you. It scales down to zero, meaning that you don't pay anything when your application isn't running, and the first 2 million requests per month are totally free.

Aside from Cloud Run, we're also going to use Google Cloud Build to get our code onto Cloud Run and Google Cloud Scheduler to trigger our code at defined intervals, but more on that later.

To be able to deploy to Cloud Run, we need to do two things:

  1. We need to make our application listen for traffic on the port defined by the PORT environment variable. It is a best practice to default to port 8080 if the environment variable is not set.
  2. We need to put all of our code in a Docker container.

So, let's start with the first point.

Creating a web server

We'll be using the very popular Node.js framework Express to achieve this. Express makes it really easy to create a web server. The following code sample should give you an idea of how our application will look:
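
A minimal sketch, where fetchValue and appendRow are hypothetical helpers standing in for the Playwright and Sheets snippets above:

```js
const express = require('express');
const app = express();

app.get('/', async (req, res) => {
  try {
    // fetchValue and appendRow wrap the scraping and Sheets logic from earlier
    const value = await fetchValue();
    await appendRow([new Date().toISOString(), value]);
    res.status(200).send('Value logged');
  } catch (err) {
    console.error(err);
    res.status(500).send('Something went wrong');
  }
});

// Cloud Run provides the port via the PORT environment variable;
// default to 8080 when it isn't set (e.g. when running locally)
const port = process.env.PORT || 8080;
app.listen(port, () => {
  console.log(`Listening on port ${port}`);
});
```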

The application will now listen on the port defined in the PORT environment variable set by Cloud Run, or default to port 8080. Any time it receives a GET request on the root endpoint (the app.get('/') part), it will execute our code. After Playwright has fetched the value and it has been written to the sheet, the application sends a response back to acknowledge that the request was received and handled.

Containerizing our application

Now we just need to put our code in a Docker container and we'll be ready to deploy! Although Docker can be quite a daunting concept, it's pretty easy to create a Docker container for this app. In your project, create a file called Dockerfile. Then add the following:
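
Something along these lines should do (the base image tag is an assumption; use one that matches your Playwright version):

```dockerfile
# Microsoft's Playwright base image ships the browser dependencies pre-installed
FROM mcr.microsoft.com/playwright:focal

# Define the working directory and copy over package.json
WORKDIR /app
COPY package*.json ./

# Install the browsers into node_modules for a clean, self-contained install
ENV PLAYWRIGHT_BROWSERS_PATH=0

# Install dependencies
RUN npm ci

# Copy over the rest of the files
COPY . .

# Start the server (use ./dist/app.js if you compile TypeScript)
CMD ["node", "app.js"]
```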

This Dockerfile contains instructions for how to put our application in a Docker container. It does the following:

  1. It uses a Playwright base image from Microsoft. Playwright needs some special binaries to run, and this base image has all of them pre-installed.
  2. It defines the working directory and then copies over our package.json.
  3. It sets an environment variable called PLAYWRIGHT_BROWSERS_PATH.
    This is necessary for a completely clean install of Playwright.
  4. It installs all dependencies using npm ci. This does the same as npm install, but is optimized for automated systems like the one we're using.
  5. It copies over all of our other files.
  6. It starts the server using node followed by the location of the application. This could just be app.js, or ./dist/app.js if you're using something like TypeScript.

With this set up, we can hop into Google Cloud Platform and configure everything!

Deploying our application

To get everything up and running in Google Cloud, we need to do the following:

  1. Create a Google Cloud project.
  2. Enable the Google Sheets API.
  3. Create a Service Account for Cloud Run.
    This is the one that we will add to our Google Sheet as an editor.
  4. Create a Service Account for Cloud Scheduler.
    This will be used to start our application in a secure manner.
  5. Initialize Cloud Run. During this process, we will also set up Cloud Build.
  6. Configure Cloud Scheduler to call our application at predefined times.

Creating a Google Cloud project

Go to the Google Cloud Console. Create a new project and give it a descriptive name. Note that the project name can be changed later, but the project ID cannot.

Enabling the Google Sheets API

Type "Google Sheets API" in the search box at the top of the page. Click on the "Marketplace" result. You should then get this page, where you can enable the API.

The Google Sheets API page on the Google Cloud Marketplace

Creating our service accounts

Navigate to IAM & Admin in the Google Cloud Console and go to Service Accounts.

Let's first create the Service Account for Cloud Scheduler. On the Service Account page, click “Create Service Account” and give the Service Account a name. You can also edit the email address and add a description if you want to. Also, make sure to grant it the Cloud Run Invoker role.

Creating a Service Account with the Cloud Run Invoker role

After this, we repeat the process for our other Service Account. The only difference is that we don't need to assign that one any roles. We'll be adding it as an editor to our Google Sheet, so make sure to give it a descriptive email address.

Next, go to your Google Sheet and click the Share button. Fill in the email address of the Service Account you just created and give it the "Editor" role.

Adding the Service Account to our Google Sheet

The last pop-up will only appear if you're part of a Google Workspace organization.

Deploying our application to Cloud Run

This is probably the most exciting part; we finally get to deploy our new application! Go to Cloud Run via the menu on the left, or using the search bar at the top. Once there, click Create Service.

We'll be using the fully managed version of Cloud Run. I suggest using the default region, although you are free to select any region. Just make sure that Cloud Scheduler is also available in the region that you select. Give your service a descriptive name. This cannot be changed later.

Creating a Cloud Run service

At this point, we have the following two options:

  1. Deploy one revision from an existing container image
  2. Continuously deploy new revisions from a source repository

If you're not using any source control system, like Git, you should choose the first option. However, you will first need to manually submit your build.

We will be choosing the second option. Select "Continuously deploy new revisions from a source repository" and click Set up with Cloud Build.

Setting up a Cloud Build Continuous Deployment pipeline

We can enable the Cloud Build API by clicking the first button. Then select GitHub or Bitbucket and authenticate.

You have to authenticate before you can select your repository

After authenticating with GitHub, you can select the repository with your code. Google Cloud will automatically keep a clone of your repository, so there's a consent box to allow this. Check the box and go to the next step.

Choose your repository

In the next step, you can select which branches will trigger a new build. Next, select Dockerfile as the build type. If the Dockerfile is in the root folder of your project, the default /Dockerfile will be correct.

Save your settings and Google Cloud will automatically do the rest when you create the Cloud Run service.

Lastly, we need to change a few more settings before we can deploy our service. Before going to the next step in the Cloud Run setup, click on "Show advanced settings".

Instead of the default Service Account, we want Cloud Run to use the Service Account that we created. Select the Service Account that you added to your Google Sheets document.

Because Playwright runs a whole browser instance, we need a little bit more than the default 256MiB of memory. Usually, 1GiB should be enough.

If your code also needs to use environment variables, for instance for API keys, you can add them under the variables tab.

Select the right Service Account and increase the memory for the service

For the last step, select "Require authentication". We will use the other Service Account that we created to authenticate when setting up Cloud Scheduler. Now click "create" and your Cloud Run service will be initialized.

Any time you push a new version of your code to your repository, Cloud Build will create a new Docker container and deploy it to Cloud Run!

Scheduling our application

The only thing left to do is to make sure our code runs whenever we want it to run. In Cloud Run, copy the URL of your service.

Copy your Cloud Run service URL

Next, go to Cloud Scheduler, click create job and select the same region you selected for the Cloud Run service.

Select the same region as your Cloud Run service

In the following step, give your scheduler a descriptive name and fill in the Frequency field to set when your application will run. This value should follow the unix-cron format, which is used to schedule jobs on Unix systems. You can use a website like crontab.guru to figure out what your cron string should look like. I want my application to run every day at noon, so my cron string looks like this: 0 12 * * *. After filling in the frequency, choose which timezone you want Scheduler to follow. If you want the system to run at a certain time in your timezone, select that timezone.

For the target, select HTTP, paste your Cloud Run URL and select the GET HTTP method.

Add your Cloud Run URL as your target URL, using the GET method

The only thing left to do is add our Service Account for authentication, which we can do under "Show more". Under Auth header, select "Add OIDC token" and in the Service Account field, add the email address of the Service Account with the Cloud Run Invoker role.

Add your Service Account email address for authentication

Now we can click create and we're done!

We can test the whole system by clicking on run now.

It's magic!

And that's it! The program will now monitor whatever you want, whenever you want, without you having to do anything!

In closing

For the purposes of this guide, I have created a test page on my own website, which shows a different word every time you visit it. I have also created a version of the application that visits this page, scrapes the word, and adds it to Google Sheets.

You can find the code I used to build this system on GitHub here. Feel free to use https://scraping-target.niels.codes for your own testing!
