Integrating with ClickHouse

Overview

This guide describes how to connect a ClickHouse data source to Drivetrain. Drivetrain supports two integration methods:

  1. Method 1: Sharing a ClickHouse View (Recommended) Provides a curated, aggregated interface for efficient, low-volume data extraction.

  2. Method 2: Direct Database Access Provides direct querying of tables or custom SQL, with safeguards to control data volume.

Both methods require that the credentials provided have the appropriate permissions to read the specified objects.

Sharing a ClickHouse view allows you to expose only the necessary business logic and pre-aggregated data to Drivetrain. This improves performance and reduces data transfer.

Required details

Provide the following information:

  • Hostname

  • Port

  • Username (A ClickHouse user with read access to the view)

  • Password

  • View name (fully qualified, e.g., <database>.<view_name>)

Permissions

Ensure that the user has:

  • SELECT privileges on the specified view

  • Access to any underlying objects the view depends on

Best practices

  • Use a read-only user scoped to only the required view

  • Expose only the fields needed by Drivetrain

  • Pre-aggregate data where possible to limit data volume

Method 2: Direct Database Access

This method connects directly to the ClickHouse database and pulls data using a specified query. Drivetrain will use the host/port/credentials to run the query.

Required details

Provide the following:

  • Hostname

  • Port

  • Username

  • Password

  • Database name

  • Table name (the target table to extract from)

  • Query to pull the data (A SELECT query that returns only the necessary results)

Query Requirements

  • Return only necessary columns

  • Consider date filters or limits to prevent large data scans

  • Avoid SELECT * on large tables

Permissions

Ensure the user has:

  • SELECT privileges on the target table(s)

  • Appropriate access to any referenced schemas

Best practices

  • Always include filters (e.g., date ranges) to control volume

  • Use views layer if the query is complex or resource-intensive

  • Validate performance and indexing before enabling production sync

Last updated

Was this helpful?