Trust But Verify: Evaluating Security When Choosing Data Infrastructure Tools
A data leader's guide to maintaining security when working with third-party vendors

Data warehouse optimization tools connect to platforms such as Snowflake, Amazon Redshift, and Google BigQuery to analyze usage, improve performance, and reduce cost. Since these tools interface with sensitive enterprise environments, robust security practices are critical.
Below, we’ll dive into security aspects that you should consider when choosing warehouse optimization and observability tools for your organization. More generally, these guidelines can apply to other tools that access your infrastructure or data.
Data Access
How can you be confident in the safety of the tool that's analyzing your warehouse? That's a tough question. By knowing what it can access and by what means, you can begin to form an answer.
Least Privilege
An important aspect of security is the “principle of least privilege”: making sure that different services (especially tools that aren’t developed internally) can only access what they need to be effective. In a data mart setup, that can be a lot of critical information, so it’s important to minimize exposure of anything that doesn’t need to be shared.
Limiting an integration to a dedicated user is a good first step. Giving this user minimal permissions, such as read-only, metadata-only access, also reduces risk, since the user can be revoked with minimal disruption if needed.
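As a rough illustration, here is a minimal sketch of provisioning such a user against a Snowflake warehouse with the snowflake-connector-python package. The role and user names (OPTIMIZER_ROLE, OPTIMIZER_USER) are placeholders, and other platforms have their own equivalent grant mechanisms.

```python
# Sketch: create a dedicated, read-only, metadata-only user for an external tool.
# Assumes Snowflake and the snowflake-connector-python package; names are placeholders.
import snowflake.connector

GRANTS = [
    "CREATE ROLE IF NOT EXISTS OPTIMIZER_ROLE",
    # Metadata only: query history via the shared SNOWFLAKE database,
    # no grants on any databases that hold business data.
    "GRANT IMPORTED PRIVILEGES ON DATABASE SNOWFLAKE TO ROLE OPTIMIZER_ROLE",
    "CREATE USER IF NOT EXISTS OPTIMIZER_USER DEFAULT_ROLE = OPTIMIZER_ROLE",
    "GRANT ROLE OPTIMIZER_ROLE TO USER OPTIMIZER_USER",
]

conn = snowflake.connector.connect(
    account="your_account", user="admin_user", password="...", role="SECURITYADMIN"
)
try:
    cur = conn.cursor()
    for stmt in GRANTS:
        cur.execute(stmt)  # revoking OPTIMIZER_ROLE later disables the tool cleanly
finally:
    conn.close()
```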
For more complex integrations, role-based access control (RBAC) may be necessary to manage who has access to what. For example, the person inspecting metadata may not be the same person setting up billing or data connections, but they may still need to see which connections exist. Some tools go even further and support data-specific privileges, limiting who can access different parts of the (meta)data and the optimization and observability results themselves.
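A generic sketch of that kind of app-level RBAC is below; the role and action names are hypothetical and not tied to any particular product.

```python
# Sketch: map roles to the specific actions they may perform, so a metadata
# viewer cannot edit billing or data connections. Names are hypothetical.
from enum import Enum

class Action(Enum):
    VIEW_METADATA = "view_metadata"
    VIEW_CONNECTIONS = "view_connections"
    EDIT_CONNECTIONS = "edit_connections"
    MANAGE_BILLING = "manage_billing"

ROLE_PERMISSIONS = {
    "analyst": {Action.VIEW_METADATA, Action.VIEW_CONNECTIONS},
    "admin": {Action.VIEW_METADATA, Action.VIEW_CONNECTIONS,
              Action.EDIT_CONNECTIONS, Action.MANAGE_BILLING},
}

def is_allowed(role: str, action: Action) -> bool:
    """Return True only if the role explicitly includes the action."""
    return action in ROLE_PERMISSIONS.get(role, set())

assert is_allowed("analyst", Action.VIEW_CONNECTIONS)
assert not is_allowed("analyst", Action.MANAGE_BILLING)
```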
Twing Data takes this principle seriously: we support secure connections to your warehouses, limit access to read-only metadata by design, and follow best practices internally.
API & Integration Security
Different vendors offer different, and usually multiple, ways to connect to their systems, such as passwords, key-pair authentication, gateways, or dedicated service accounts.
It’s important to remember that encryption is only as strong as its implementation, and to verify that a more secure method actually is more secure. For example, simply switching from HTTP to HTTPS doesn’t guarantee security if certificate validation is improperly handled, allowing for man-in-the-middle attacks. Similarly, using OAuth 2.0 for API authentication is a step up from basic authentication, but failing to properly manage token expiration, refresh, and revocation could expose your integration to token hijacking or replay attacks. Always evaluate the entire security chain, not just the algorithm or protocol. We’ll talk more about encryption below.
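To make the OAuth point concrete, here is a minimal sketch using the `requests` library that keeps certificate verification at its default (enabled) and refreshes the access token before it expires instead of reusing it indefinitely. The token endpoint and client credentials are placeholders.

```python
# Sketch: token refresh before expiry, certificate verification left enabled.
# TOKEN_URL, CLIENT_ID, and CLIENT_SECRET are hypothetical placeholders.
import time
import requests

TOKEN_URL = "https://auth.example.com/oauth/token"
_token = {"access_token": None, "expires_at": 0.0}

def get_access_token() -> str:
    # Refresh a little early so no request goes out with an expired token.
    if time.time() > _token["expires_at"] - 60:
        resp = requests.post(
            TOKEN_URL,
            data={"grant_type": "client_credentials",
                  "client_id": "CLIENT_ID", "client_secret": "CLIENT_SECRET"},
            timeout=10,  # verify=True is the default; never disable it
        )
        resp.raise_for_status()
        payload = resp.json()
        _token["access_token"] = payload["access_token"]
        _token["expires_at"] = time.time() + payload["expires_in"]
    return _token["access_token"]

def call_api(url: str) -> dict:
    headers = {"Authorization": f"Bearer {get_access_token()}"}
    resp = requests.get(url, headers=headers, timeout=10)
    resp.raise_for_status()
    return resp.json()
```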
A quick internet search can also reveal any vulnerabilities a given vendor or product has had exposed in the past.
Data Management

Whether it’s just metadata, like the queries being run, or the data those queries return, whoever is analyzing it needs to ingest and store it somewhere, and possibly surface it back to their users. This section covers the aspects that may require a closer look.
Data Retention
Since the vendor may have access to business-critical data (or metadata), it’s important to consider their data retention policy. What a “good” policy is will vary from business to business and use case to use case, but typically you don’t want someone to hold on to your data “forever” (especially not past the point of canceling the service). Long-term data retention can introduce unnecessary risks, such as unauthorized access, data breaches, or compliance violations. On the flip side, too short a retention period may yield less useful insights.
When vetting vendors, ask about their data retention and deletion policies up front and negotiate shorter retention periods or periodic purges, especially for sensitive information. It’s also a good idea to have an offboarding plan that verifies that the third-party has deleted your data after your integration has ended.
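On the vendor side, enforcing a negotiated retention window can be as simple as a scheduled purge job. The sketch below is a generic illustration; the table and column names are hypothetical, and the same idea applies to whatever store the vendor actually uses.

```python
# Sketch: purge metadata older than the agreed retention window.
# Uses sqlite3 as a stand-in for the vendor's actual database.
from datetime import datetime, timedelta, timezone
import sqlite3

RETENTION_DAYS = 90

def purge_expired_metadata(conn: sqlite3.Connection) -> int:
    cutoff = datetime.now(timezone.utc) - timedelta(days=RETENTION_DAYS)
    cur = conn.execute(
        "DELETE FROM query_metadata WHERE ingested_at < ?", (cutoff.isoformat(),)
    )
    conn.commit()
    return cur.rowcount  # log the count so purges are auditable
```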
While some vendors don’t share this policy, Twing Data retains 90 days of metadata and persists the resulting analysis for the duration of service. This can be adjusted on a company-by-company basis if requested.
Data Isolation
There are also different methods for segmenting or isolating data and access. Separating customers’ data from one another can be done by setting up separate environments for each customer, ensuring that risk from one customer isn’t carried over to another. This can be especially important for services with privileges that extend beyond read-only, such as those that automatically adjust the size, clustering, or resources of your warehouses, or services that rely on data beyond metadata.
Twing Data can safely grant low-level access to our users by only exposing our analyses through company-specific accounts and views, ensuring that users (or threat actors) aren’t able to view anything they’re not supposed to.
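As a simplified illustration of tenant-scoped access (not any particular vendor’s actual implementation), each customer account can be granted a view filtered to its own tenant, never the shared analysis table itself. All identifiers below are hypothetical.

```python
# Sketch: generate per-tenant view DDL so each account only sees its own rows.
# tenant_id is assumed to come from an internal allowlist, not user input.
def tenant_view_ddl(tenant_id: str) -> list[str]:
    view = f"analytics.results_{tenant_id}"
    return [
        f"CREATE VIEW IF NOT EXISTS {view} AS "
        f"SELECT * FROM analytics.results WHERE tenant_id = '{tenant_id}'",
        # The tenant's service account is granted the view only, not the base table.
        f"GRANT SELECT ON {view} TO ROLE TENANT_{tenant_id.upper()}_ROLE",
    ]

print("\n".join(tenant_view_ddl("acme")))
```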
Data Anonymization
Besides the tactics listed above, a vendor can, and should, anonymize sensitive data — even metadata. Even seemingly innocuous queries that may be executed can contain personal information that is not necessary for an external tool to see. Consider a query such as “WHERE username = ‘celebrity@personalemail.com’ AND phonenumber = ‘(555) 867-5309’” being exposed. A leaked query such as this not only exposes a high-profile individual’s email address but also ties it to their phone number.
Twing Data offers to redact query text at ingestion, replacing sensitive values with a constant placeholder token. This way, we can still analyze query structures (e.g. to identify patterns or heavy queries) without storing any actual identifiers or literals. Keep in mind, though, that some analyses may not be as useful if we can’t identify what makes different executions of a query pattern different from one another. We’re always looking to strike the right balance.
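A deliberately simplified sketch of that kind of redaction is below; a production redactor would use a proper SQL parser rather than regular expressions, but the idea is the same: literals are replaced before anything is stored.

```python
# Sketch: strip string and numeric literals from query text at ingestion,
# keeping the structure while dropping identifiers and values.
import re

PLACEHOLDER = "?"

def redact_query(sql: str) -> str:
    redacted = re.sub(r"'(?:[^']|'')*'", PLACEHOLDER, sql)          # string literals
    redacted = re.sub(r"\b\d+(?:\.\d+)?\b", PLACEHOLDER, redacted)  # numeric literals
    return redacted

print(redact_query(
    "SELECT * FROM users WHERE username = 'celebrity@personalemail.com' "
    "AND phonenumber = '(555) 867-5309'"
))
# -> SELECT * FROM users WHERE username = ? AND phonenumber = ?
```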
Encryption (Data at Rest & In Transit)
Nowadays, it’s safe to assume that any enterprise integration is going to use a secure protocol, but it’s still good to make sure. Any person or server that is accessing your data (metadata or otherwise) should be doing so securely. If not — run!
The (meta)data itself should also be stored securely. Twing Data leverages Google BigQuery, which encrypts data at rest and adds additional access controls and auditing.
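“Make sure” can be as simple as checking that an endpoint actually serves a valid certificate over TLS rather than assuming it does. Here is a quick sketch using Python’s standard library; the host is only an example.

```python
# Sketch: confirm a connection negotiates TLS with a valid, hostname-matching cert.
import socket
import ssl

def check_tls(host: str, port: int = 443) -> dict:
    context = ssl.create_default_context()  # validates the cert chain and hostname
    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return {"protocol": tls.version(), "peer": tls.getpeercert()["subject"]}

print(check_tls("bigquery.googleapis.com"))
```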
App Architecture
The “cloud”, as we know, is generally “someone else’s servers”. It enables distributing and scaling applications across the world while optimizing cost and speed.
However, as infrastructure evolves and becomes more complex, there are more points of failure, and more visibility and failsafes are needed. Let’s take a look at what that means for warehouse optimization tools.
Cloud Infrastructure Security
Similar to data isolation, containerization or virtual machine (VM) isolation can ease security concerns, especially for more than read-only access. A setup like this may spin up VMs per tenant, either on-premises or managed by another provider like AWS or GCP.
Some providers, like fly.io (which Twing Data uses to host our application), make it easy to scale globally while maintaining sound security practices. Twing Data’s analyses run on demand and on-premises, interfacing securely with our databases.
Logging and Monitoring
For any organization, it’s important to log events to detect, prevent, or alert on security issues or outages. For big data, an outage can mean inaccurate insights and lost revenue, so alerting and monitoring are crucial. Similarly, monitoring and alerting help catch bugs or security vulnerabilities sooner rather than later (with additional context about what happened and why), making them quicker to resolve.
For example, if an API integration suddenly starts returning a high volume of 500 errors, alerting can notify the team before customers are impacted. In a security context, logs showing a spike in failed authentication attempts could indicate a brute-force attack, allowing for a proactive response. Additionally, monitoring data pipeline performance can catch bottlenecks – like a slow ETL job – that might otherwise cause reporting delays or incomplete/inaccurate analytics.
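The failed-authentication case can be handled with simple threshold-based alerting: count failures per source over a sliding window and alert when a spike appears. The sketch below is generic; the alert sink is a placeholder for whatever paging or notification hook a team already uses.

```python
# Sketch: alert when a single source exceeds a threshold of failed logins
# within a sliding time window.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 300
THRESHOLD = 20

_failures: dict[str, deque] = defaultdict(deque)

def record_failed_login(source_ip: str, now: float | None = None) -> None:
    now = now or time.time()
    window = _failures[source_ip]
    window.append(now)
    # Drop events that have aged out of the window.
    while window and window[0] < now - WINDOW_SECONDS:
        window.popleft()
    if len(window) >= THRESHOLD:
        alert(f"{len(window)} failed logins from {source_ip} in the last 5 minutes")

def alert(message: str) -> None:
    print(f"[ALERT] {message}")  # replace with your team's paging/notification hook
```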
Some services (like fly.io) offer built-in logging, monitoring, and alerting (e.g. Grafana and Sentry integrations) that make it fast, easy, and secure to set up the necessary infrastructure.
Company Practices

Lastly, a vendor’s company practices can give additional insight into what to expect when working with them. Do they take security seriously as part of their culture, or just to check some boxes? If a problem occurs, how likely are they to fix it within a given timeframe?
Compliance & Certification
Some certifications (SOC 2 Type I) evaluate security at a single point in time, and others (SOC 2 Type II) evaluate security over a period of time. These evaluations are based on many factors such as access controls, monitoring & logging, incident response, data encryption & integrity, secure development practices, and others.
Seeing that a vendor has these certifications gives a clear indication of their security practices. It’s something Twing Data is working towards!
Incident Response & Recovery
A company’s processes for detecting, responding to, and recovering from security incidents are just as important as its processes for avoiding and alerting on such incidents. After all, an alert is only valuable if it cuts through the noise and prompts action – otherwise, it’s just another overlooked warning.
SOC 2 certification requires documenting these processes thoroughly and testing them at least annually. Of course, companies without the certification can still have a plan in place.
These processes cover responsibilities and actions that would be performed internally (e.g. patching a vulnerability) and externally (e.g. notifying affected users) with clear steps for identifying, containing, mitigating, and resolving incidents. In some cases, a service level agreement (SLA) between the vendor and customer may dictate time-to-resolution requirements, ensuring accountability.
For example, if a critical data breach occurs, an engineering team might immediately isolate the affected systems while security and legal teams work together to draft and send a breach notification within the SLA-defined timeframe.
Secure Development Practices
Cutting-edge companies embed security throughout the development lifecycle, and security and code quality checks are increasingly happening earlier in it. Tools like static code analysis, linting, code reviews, and automated security scanning help catch issues before they even reach a staging environment.
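For illustration, a minimal pre-merge check might look like the sketch below, assuming the project has adopted ruff (linting), bandit (static security analysis), and pip-audit (dependency vulnerability scanning); swap in whatever tools your team actually uses.

```python
# Sketch: run lint, static security analysis, and a dependency audit before merge.
# Assumes ruff, bandit, and pip-audit are installed; the "src" path is a placeholder.
import subprocess
import sys

CHECKS = [
    ["ruff", "check", "src"],
    ["bandit", "-r", "src", "-q"],
    ["pip-audit"],
]

def main() -> int:
    failed = False
    for cmd in CHECKS:
        print(f"running: {' '.join(cmd)}")
        if subprocess.run(cmd).returncode != 0:
            failed = True
    return 1 if failed else 0

if __name__ == "__main__":
    sys.exit(main())
```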
Security-minded SaaS providers also conduct regular automated and manual penetration testing on their application and infrastructure, and any critical findings are fixed as part of their vulnerability management process.
By embedding security holistically within the development workflow, companies can better safeguard sensitive data and reduce the overall cost and risk of fixing vulnerabilities later.
Third-Party Risk Management
Third-party risks typically encompass technical topics such as security breaches and downtime, but can include more qualitative risks like reputational risk.
Using reputable third parties like GCP and fly.io, which hold their own certifications, is one way to mitigate the risk that comes with outsourcing to external service providers. This is especially relevant for the data and cloud security topics above, the items most commonly outsourced, since maintaining your own hardware comes with its own high costs and risks.
Vendors can also be classified by risk: a database vendor is higher risk than an office supplier, for example, and the lower-risk vendor may not need to be reassessed as frequently.
Choosing the right data vendor for your business
As I was writing and researching this article, the main takeaway became clear: it’s essential to choose a vendor that analyzes your company’s unique and mission-critical data effectively and securely. I hope these findings help you mitigate risk by knowing what to look for (and what to avoid) when digging into vendors’ security policies.