Google Cloud Storage Data Lake

Google Cloud Storage (GCS) is a popular web service for storing and accessing your data in the Google Cloud Platform infrastructure. It offers state-of-the-art performance and scalability, along with ensuring the security and privacy of your data.

RudderStack supports GCS data lake as a destination to which you can securely send your data.

User permissions for GCS data lake

To set up GCS data lake as a destination in RudderStack, you will need to create a new user role and grant the required permissions.

Creating the role and adding permissions

Role details
  • Fill the details and click on ADD PERMISSIONS.
  • Under Filter permissions by role, select Storage Object Admin and grant the required permissions:
Setting permissions
  • The permission required to successfully use the GCS data lake destination is shown:
storage.objects.create
  • Then, click on CREATE. This will successfully create the role.

Creating the service account and attaching the role

Service account details
  • Then, select the previously created role and click on CONTINUE.
Role details
  • Finally, after completing Steps 1 and 2, click on DONE.

Creating and downloading the JSON key

  • Click on the three dots under Actions in the service account that you just created and select Manage keys, as shown:
Manage keys option
  • Click on ADD KEY, followed by Create new key, as shown:
Creating new key
  • In the resulting pop-up, select JSON and click on CREATE.
Downloading JSON key
  • Finally, download this JSON file. This file is required while setting up the GCS data lake destination in RudderStack.

Setting up the GCS data lake destination

To start sending data to your GCS data lake, you will first need to add it as a destination in RudderStack and connect it to a data source. Once the destination is enabled, the events will automatically start flowing to your data lake.

Follow these steps to configure your GCS data lake as a destination in RudderStack:

  • From your RudderStack dashboard, configure the data source. Then, select GCS Data Lake from the list of destinations.

Refer to the Adding a Source and Destination in RudderStack guide for more information.

  • Assign a name to your destination and click on Next. You should then see the following screen:
GCS data lake destination settings in RudderStack

Connection settings

Add the following credentials in the Connection Settings:

  • Staging GCS Storage Bucket Name: Enter the name of the S3 bucket that will be used to store data before loading it into the S3 data lake.
  • Prefix: If specified, RudderStack will create a folder in the bucket with this prefix and push all the data within that folder.
  • Namespace: If specified, all the data for the destination will be pushed to https://storage.googleapis.com/<bucketName>/<prefix>/rudder-datalake/<namespace>.

If you don't specify a namespace in the settings, RudderStack sets it to the source name by default.

  • Credentials: Enter the content of the downloaded JSON key file in this field.

Finding your data in the GCS data lake

RudderStack converts your events into Parquet files and dumps them into the configured GCS bucket. Before dumping the events, RudderStack groups the files by the event name based on the time (UTC) they were received.

The folder structure is similar to the following format:

https://storage.googleapis.com/<bucketName>/<prefix>/rudder-datalake/<namespace>/<tableName>/YYYY/MM/DD/HH

As specified in the Connnection settings section above:

  • <prefix> is the GCS prefix used while configuring the GCS data lake destination in RudderStack. If not specified, RudderStack will omit this value.
  • <namespace> is the namespace specified in the destination settings. If not specified, RudderStack sets it to the source name.
  • <tableName> is set to the event name.
  • YYYY, MM, DD, and HH are replaced by the actual time values.

A combination of the YYYY, MM, DD, and HH values represents the UTC time.

Use case

Suppose that RudderStack tracks the following two events:

Event NameTimestamp
Product Purchased"2019-10-12T08:40:50.52Z" UTC
Cart Viewed"2019-11-12T09:34:50.52Z" UTC

RudderStack then converts these events into Parquet files and dumps them into the following folders:

Event NameFolder Location
Product Purchasedhttps://storage.googleapis.com/<bucketName>/<prefix>/rudder-datalake/<namespace>/product_purchased/2019/10/12/08
Cart Viewedhttps://storage.googleapis.com/<bucketName>/<prefix>/rudder-datalake/<namespace>/cart_viewed/2019/11/12/09

Contact us

If you come across any issues while setting up using the GCS data lake destination, you can contact us or start a conversation on our Slack channel.