Databricks-Certified-Data-Engineer-Associate Exam Questions - Real Practice Questions for Guaranteed Success

Question 1

Which of the following statements regarding the relationship between Silver tables and Bronze tables is always true?

ASilver tables contain a less refined, less clean view of data than Bronze data.

BSilver tables contain aggregates while Bronze data is unaggregated.

CSilver tables contain more data than Bronze tables.

DSilver tables contain a more refined and cleaner view of data than Bronze tables.

ESilver tables contain less data than Bronze tables.

Correct : D

In a medallion architecture, a common data design pattern for lakehouses, data flows from Bronze to Silver to Gold layer tables, with each layer progressively improving the structure and quality of data. Bronze tables store raw data ingested from various sources, while Silver tables apply minimal transformations and cleansing to create an enterprise view of the data. Silver tables can also join and enrich data from different Bronze tables to provide a more complete and consistent view of the data. Therefore, option D is the correct answer, as Silver tables contain a more refined and cleaner view of data than Bronze tables. Option A is incorrect, as it is the opposite of the correct answer. Option B is incorrect, as Silver tables do not necessarily contain aggregates, but can also store detailed records. Option C is incorrect, as Silver tables may contain less data than Bronze tables, depending on the transformations and cleansing applied. Option E is incorrect, as Silver tables may contain more data than Bronze tables, depending on the joins and enrichments applied.Reference:What is a Medallion Architecture?,Transforming Bronze Tables in Silver Tables,What is the medallion lakehouse architecture?

Options Selected by Other Users:

Mark Question:

Start a Discussions

Submit Your Answer:

ASilver tables contain a less refined, less clean view of data than Bronze data.

BSilver tables contain aggregates while Bronze data is unaggregated.

CSilver tables contain more data than Bronze tables.

DSilver tables contain a more refined and cleaner view of data than Bronze tables.

ESilver tables contain less data than Bronze tables.

0 / 1500

Question 2

A data analyst has developed a query that runs against Delta table. They want help from the data engineering team to implement a series of tests to ensure the data returned by the query is clean. However, the data engineering team uses Python for its tests rather than SQL.

Which of the following operations could the data engineering team use to run the query and operate with the results in PySpark?

ASELECT * FROM sales

Bspark.delta.table

Cspark.sql

DThere is no way to share data between PySpark and SQL.

Espark.table

Correct : C

The spark.sql operation allows the data engineering team to run a SQL query and return the result as a PySpark DataFrame. This way, the data engineering team can use the same query that the data analyst has developed and operate with the results in PySpark. For example, the data engineering team can use spark.sql(''SELECT * FROM sales'') to get a DataFrame of all the records from the sales Delta table, and then apply various tests or transformations using PySpark APIs. The other options are either not valid operations (A, D), not suitable for running a SQL query (B, E), or not returning a DataFrame (A).Reference:Databricks Documentation - Run SQL queries,Databricks Documentation - Spark SQL and DataFrames.

Options Selected by Other Users:

Mark Question:

Start a Discussions

Submit Your Answer:

ASELECT * FROM sales

Bspark.delta.table

Cspark.sql

DThere is no way to share data between PySpark and SQL.

Espark.table

0 / 1500

Question 3

Which of the following describes a scenario in which a data engineer will want to use a single-node cluster?

AWhen they are working interactively with a small amount of data

BWhen they are running automated reports to be refreshed as quickly as possible

CWhen they are working with SQL within Databricks SQL

DWhen they are concerned about the ability to automatically scale with larger data

EWhen they are manually running reports with a large amount of data

Correct : A

The scenario in which a data engineer will want to use a single-node cluster is when they are working interactively with a small amount of data.A single-node cluster is a cluster consisting of an Apache Spark driver and no Spark workers1.A single-node cluster supports Spark jobs and all Spark data sources, including Delta Lake1.A single-node cluster is helpful for single-node machine learning workloads that use Spark to load and save data, and for lightweight exploratory data analysis1.A single-node cluster can run Spark locally, spawn one executor thread per logical core in the cluster, and save all log output in the driver log1.A single-node cluster can be created by selecting the Single Node button when configuring a cluster1.

The other options are not suitable for using a single-node cluster.When running automated reports to be refreshed as quickly as possible, a data engineer will want to use a multi-node cluster that can scale up and down automatically based on the workload demand2.When working with SQL within Databricks SQL, a data engineer will want to use a SQL Endpoint that can execute SQL queries on a serverless pool or an existing cluster3.When concerned about the ability to automatically scale with larger data, a data engineer will want to use a multi-node cluster that can leverage the Databricks Lakehouse Platform and the Delta Engine to handle large-scale data processing efficiently and reliably4. When manually running reports with a large amount of data, a data engineer will want to use a multi-node cluster that can distribute the computation across multiple workers and leverage the Spark UI to monitor the performance and troubleshoot the issues.

1:Single Node clusters | Databricks on AWS

2:Autoscaling | Databricks on AWS

3:SQL Endpoints | Databricks on AWS

4:Databricks Lakehouse Platform | Databricks on AWS

: [Spark UI | Databricks on AWS]

Options Selected by Other Users:

Mark Question:

Start a Discussions

Submit Your Answer:

AWhen they are working interactively with a small amount of data

BWhen they are running automated reports to be refreshed as quickly as possible

CWhen they are working with SQL within Databricks SQL

DWhen they are concerned about the ability to automatically scale with larger data

EWhen they are manually running reports with a large amount of data

0 / 1500

Question 4

A data engineer needs to create a table in Databricks using data from a CSV file at location /path/to/csv.

They run the following command:

Which of the following lines of code fills in the above blank to successfully complete the task?

ANone of these lines of code are needed to successfully complete the task

BUSING CSV

CFROM CSV

DUSING DELTA

EFROM 'path/to/csv'

Correct : E

A data lakehouse is a new paradigm that can be used to simplify and unify siloed data architectures that are specialized for specific use cases. A data lakehouse combines the best of both data lakes and data warehouses, providing a single platform that supports diverse data types, open standards, low-cost storage, high-performance queries, ACID transactions, schema enforcement, and governance. A data lakehouse enables data engineers to build reliable and scalable data pipelines that can serve various downstream applications and users, such as data science, machine learning, analytics, and reporting. A data lakehouse leverages the power of Delta Lake, a storage layer that brings reliability and performance to data lakes.Reference:What is a data lakehouse?,Delta Lake,Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics

Options Selected by Other Users:

Mark Question:

Start a Discussions

Submit Your Answer:

ANone of these lines of code are needed to successfully complete the task

BUSING CSV

CFROM CSV

DUSING DELTA

EFROM 'path/to/csv'

0 / 1500

Question 5

A data engineer wants to create a new table containing the names of customers who live in France.

They have written the following command:

CREATE TABLE customersInFrance

_____ AS

SELECT id,

firstName,

lastName

FROM customerLocations

WHERE country = 'FRANCE';

A senior data engineer mentions that it is organization policy to include a table property indicating that the new table includes personally identifiable information (Pll).

Which line of code fills in the above blank to successfully complete the task?

ACOMMENT 'Contains PIT

B511

C'COMMENT PII'

DTBLPROPERTIES PII

Correct : D

To include a property indicating that a table contains personally identifiable information (PII), the TBLPROPERTIES keyword is used in SQL to add metadata to a table. The correct syntax to define a table property for PII is as follows:

CREATE TABLE customersInFrance

USING DELTA

TBLPROPERTIES ('PII' = 'true')

SELECT id,

firstName,

lastName

FROM customerLocations

WHERE country = 'FRANCE';

The TBLPROPERTIES ('PII' = 'true') line correctly sets a table property that tags the table as containing personally identifiable information. This is in accordance with organizational policies for handling sensitive information.

Reference: Databricks documentation on Delta Lake: Delta Lake on Databricks

Options Selected by Other Users:

Mark Question:

Start a Discussions

Submit Your Answer:

ACOMMENT 'Contains PIT

B511

C'COMMENT PII'

DTBLPROPERTIES PII

0 / 1500

Master Databricks-Certified-Data-Engineer-Associate Exam with Reliable Practice Questions

Options Selected by Other Users:

Options Selected by Other Users:

Options Selected by Other Users:

Options Selected by Other Users:

Options Selected by Other Users: