Which of the following code blocks reads in the two-partition parquet file stored at filePath, making sure all columns are included exactly once even though each partition has a different schema?
Schema of first partition:
root
 |-- transactionId: integer (nullable = true)
 |-- predError: integer (nullable = true)
 |-- value: integer (nullable = true)
 |-- storeId: integer (nullable = true)
 |-- productId: integer (nullable = true)
 |-- f: integer (nullable = true)
Schema of second partition:
root
 |-- transactionId: integer (nullable = true)
 |-- predError: integer (nullable = true)
 |-- value: integer (nullable = true)
 |-- storeId: integer (nullable = true)
 |-- rollId: integer (nullable = true)
 |-- f: integer (nullable = true)
 |-- tax_id: integer (nullable = false)
Correct : B
This is a very tricky question: it requires knowledge both of schema merging and of how schemas are handled when reading parquet files.
spark.read.option('mergeSchema', 'true').parquet(filePath)
Correct. Spark's DataFrameReader mergeSchema option works well here, since columns that appear in both partitions have matching data types. Note that mergeSchema would fail if one or more columns with the same name appeared in both partitions with different data types.
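As a minimal sketch of how mergeSchema behaves (assuming a hypothetical scratch location /tmp/merge_demo and that spark is the active SparkSession, as on Databricks), writing two partitions with different column sets and reading the directory back yields every column exactly once:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
path = '/tmp/merge_demo'  # hypothetical scratch location

# Two partitions with different, but type-compatible, column sets.
spark.createDataFrame([(1, 10)], ['transactionId', 'productId']) \
    .write.mode('overwrite').parquet(path + '/part=1')
spark.createDataFrame([(2, 99)], ['transactionId', 'tax_id']) \
    .write.mode('overwrite').parquet(path + '/part=2')

# mergeSchema unions the column sets; 'part' shows up as an extra column
# because of partition discovery on the part=... directories.
merged = spark.read.option('mergeSchema', 'true').parquet(path)
print(sorted(merged.columns))  # ['part', 'productId', 'tax_id', 'transactionId']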
spark.read.parquet(filePath)
Incorrect. While this would read in data from both partitions, only the schema in the parquet file that is read in first would be considered, so some columns that appear only in the second partition
(e.g. tax_id) would be lost.
nx = 0
for file in dbutils.fs.ls(filePath):
    if not file.name.endswith('.parquet'):
        continue
    df_temp = spark.read.parquet(file.path)
    if nx == 0:
        df = df_temp
    else:
        df = df.union(df_temp)
    nx = nx + 1
df
Wrong. The key idea of this solution is the DataFrame.union() command. While this command appends all rows, it requires that both partitions have the exact same number of columns, matched by position, with matching data types.
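If a manual merge were really needed, a sketch along these lines (toy data, hypothetical names, Spark 3.1+ for allowMissingColumns) would align columns by name instead of by position:
df1 = spark.createDataFrame([(1, 10)], ['transactionId', 'productId'])
df2 = spark.createDataFrame([(2, 99)], ['transactionId', 'tax_id'])

# unionByName aligns columns by name and fills missing ones with nulls,
# unlike union(), which matches columns purely by position.
df1.unionByName(df2, allowMissingColumns=True).show()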
spark.read.parquet(filePath, mergeSchema='y')
False. While using the mergeSchema option is the correct way to solve this problem, and it can even be passed to DataFrameReader.parquet() as in the code block, it accepts the value True as a boolean or as the string 'true'. 'y' is not a valid value.
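For reference, a sketch of how mergeSchema can be passed directly to parquet() with a valid value (assuming filePath as in the question):
# Both of these enable schema merging; 'y' would not be accepted.
spark.read.parquet(filePath, mergeSchema=True)
spark.read.parquet(filePath, mergeSchema='true')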
nx = 0
for file in dbutils.fs.ls(filePath):
    if not file.name.endswith('.parquet'):
        continue
    df_temp = spark.read.parquet(file.path)
    if nx == 0:
        df = df_temp
    else:
        df = df.join(df_temp, how='outer')
    nx = nx + 1
df
No. This provokes a full outer join. While the resulting DataFrame will have all columns of both partitions, columns that appear in both partitions will be duplicated. The question says all columns included in the partitions should appear exactly once.
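A small sketch of the duplication problem (toy data, and a hypothetical explicit join key for illustration): joining two DataFrames that share a non-key column name keeps both copies of that column.
df1 = spark.createDataFrame([(1, 3)], ['transactionId', 'predError'])
df2 = spark.createDataFrame([(1, 5)], ['transactionId', 'predError'])

# The join key is deduplicated, but other shared column names appear twice.
joined = df1.join(df2, on='transactionId', how='outer')
print(joined.columns)  # ['transactionId', 'predError', 'predError']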
More info: Merging different schemas in Apache Spark | by Thiago Cordon | Data Arena | Medium
Static notebook | Dynamic notebook: See test 3, Question 37 (Databricks import instructions)
Which of the following code blocks returns a DataFrame that matches the multi-column DataFrame itemsDf, except that integer column itemId has been converted into a string column?
Correct : B
itemsDf.withColumn('itemId', col('itemId').cast('string'))
Correct. You can convert the data type of a column using the cast method of the Column class. Also note that you have to use the withColumn method on itemsDf to replace the existing itemId column with the new version that contains strings.
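A minimal sketch of this pattern, with toy data standing in for itemsDf and spark assumed to be the active session:
from pyspark.sql.functions import col

# Hypothetical stand-in for itemsDf.
itemsDf = spark.createDataFrame([(1, 'thermos'), (2, 'mug')], ['itemId', 'itemName'])

# cast() is a method of the Column class; withColumn() replaces the existing
# itemId column with its string version.
converted = itemsDf.withColumn('itemId', col('itemId').cast('string'))
converted.printSchema()
# root
#  |-- itemId: string (nullable = true)
#  |-- itemName: string (nullable = true)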
itemsDf.withColumn('itemId', col('itemId').convert('string'))
Incorrect. The Column object that col('itemId') returns does not have a convert method.
itemsDf.withColumn('itemId', convert('itemId', 'string'))
Wrong. The pyspark.sql.functions module does not have a convert method. The question is trying to mislead you by using the word 'converted'. Type conversion is also called 'type casting'. This may help you remember to look for a cast method instead of a convert method (see correct answer).
itemsDf.select(astype('itemId', 'string'))
False. While astype is a method of Column (and an alias of Column.cast), it is not a method of pyspark.sql.functions (which is what the code block implies). In addition, the question asks to return a full DataFrame that matches the multi-column DataFrame itemsDf. Selecting just one column from itemsDf, as in the code block, would return a single-column DataFrame.
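For comparison, astype does work when called on a Column, since it is just an alias of cast (sketch, using the toy itemsDf from above):
from pyspark.sql.functions import col

# Equivalent to col('itemId').cast('string'); astype lives on Column,
# not in pyspark.sql.functions.
itemsDf.withColumn('itemId', col('itemId').astype('string'))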
spark.cast(itemsDf, 'itemId', 'string')
No. The Spark session (referenced by spark) does not have a cast method. You can find a list of all methods available on the Spark session in the documentation linked below.
More info:
- pyspark.sql.Column.cast --- PySpark 3.1.2 documentation
- pyspark.sql.Column.astype --- PySpark 3.1.2 documentation
- pyspark.sql.SparkSession --- PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3, Question 42 (Databricks import instructions)
The code block shown below should return the number of columns in the CSV file stored at location filePath. From the CSV file, only lines should be read that do not start with a # character. Choose
the answer that correctly fills the blanks in the code block to accomplish this.
Code block:
__1__(__2__.__3__.csv(filePath, __4__).__5__)
Correct : E
Correct code block:
len(spark.read.csv(filePath, comment='#').columns)
This is a challenging question with difficulties in an unusual context: the boundary between the DataFrame and the DataFrameReader. It is unlikely that a question of this difficulty level appears in the exam. However, solving it helps you get more comfortable with the DataFrameReader, a subject you will likely have to deal with in the exam.
Before dealing with the inner parentheses, it is easier to figure out the outer parentheses, gaps 1 and 5. Given the code block, the object in gap 5 would have to be evaluated by the object in gap 1, returning the number of columns in the read-in CSV. One answer option includes DataFrame in gap 1 and shape[0] in gap 5. Since DataFrame cannot be used to evaluate shape[0], we can discard this answer option.
Other answer options include size in gap 1. size() is not a built-in Python command, so if we use it, it would have to come from somewhere else. pyspark.sql.functions includes a size() function, but that function only returns the length of an array or map stored within a column (documentation linked below). So, using size() is not an option here. This leaves us with two potentially valid answers.
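To see why size() does not help here, consider this sketch (toy data, spark assumed to be the active session): it counts elements inside an array column rather than the columns of a DataFrame.
from pyspark.sql.functions import size

df = spark.createDataFrame([([1, 2, 3],), ([4],)], ['values'])
df.select(size('values')).show()
# +------------+
# |size(values)|
# +------------+
# |           3|
# |           1|
# +------------+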
We have to pick between gaps 2 and 3 being spark.read or pyspark.DataFrameReader. Looking at the documentation (linked below), the DataFrameReader lives in the pyspark.sql module, which means that we cannot import it as pyspark.DataFrameReader. Moreover, spark.read makes sense because on Databricks, spark references the current Spark session (pyspark.sql.SparkSession), and spark.read therefore returns a DataFrameReader (also see documentation below). That leaves only one correct answer option.
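Putting the pieces together (assuming filePath points to the CSV file and spark is the active session):
reader = spark.read                     # spark.read returns a DataFrameReader
df = reader.csv(filePath, comment='#')  # the comment option skips lines starting with '#'
print(len(df.columns))                  # df.columns is a Python list, so len() counts the columns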
More info:
- pyspark.sql.functions.size --- PySpark 3.1.2 documentation
- pyspark.sql.DataFrameReader.csv --- PySpark 3.1.2 documentation
- pyspark.sql.SparkSession.read --- PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3, Question 50 (Databricks import instructions)
Which of the following code blocks writes DataFrame itemsDf to disk at storage location filePath, making sure to substitute any existing data at that location?
Correct : A
itemsDf.write.mode('overwrite').parquet(filePath)
Correct! itemsDf.write returns a pyspark.sql.DataFrameWriter instance whose overwriting behavior can be set via mode('overwrite') or by passing mode='overwrite' to the parquet() command. Although the question does not prescribe the parquet format, parquet() is a valid method for having Spark write the data to disk.
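As a sketch (assuming itemsDf and filePath as in the question), both of the following forms overwrite any existing data at filePath:
# Setting the mode on the DataFrameWriter...
itemsDf.write.mode('overwrite').parquet(filePath)
# ...or passing it directly to parquet().
itemsDf.write.parquet(filePath, mode='overwrite')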
itemsDf.write.mode('overwrite').path(filePath)
No. A pyspark.sql.DataFrameWriter instance does not have a path() method.
itemsDf.write.option('parquet').mode('overwrite').path(filePath)
Incorrect, see above. In addition, a file format cannot be passed via the option() method.
itemsDf.write(filePath, mode='overwrite')
Wrong. Unfortunately, this is too simple. You need to obtain a DataFrameWriter for the DataFrame by accessing itemsDf.write, upon which you can apply further methods to control how Spark should write the data to disk. You cannot, however, pass arguments to itemsDf.write directly.
itemsDf.write().parquet(filePath, mode='overwrite')
False. See above.
More info: pyspark.sql.DataFrameWriter.parquet --- PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3, Question 56 (Databricks import instructions)
The code block shown below should show information about the data type that column storeId of DataFrame transactionsDf contains. Choose the answer that correctly fills the blanks in the code
block to accomplish this.
Code block:
transactionsDf.__1__(__2__).__3__
Correct : B
Correct code block:
transactionsDf.select('storeId').printSchema()
The difficulty of this question is that it is hard to solve with the stepwise first-to-last-gap approach that has worked well for similar questions, since the answer options are so different from one another. Instead, you might want to eliminate answers by looking for patterns of frequently wrong answers.
A first pattern that you may recognize by now is that column names need to be expressed in quotes. For this reason, the answer that includes an unquoted storeId should be eliminated.
By now, you may have understood that DataFrame.limit() is useful for returning a specified number of rows. It has nothing to do with specific columns. For this reason, the answer that resolves to limit('storeId') can be eliminated.
Given that we are interested in information about the data type, you should question whether the answer that resolves to limit(1).columns provides this information. While DataFrame.columns is a valid call, it only reports column names, not column types. So, you can eliminate this option.
The two remaining options use either the printSchema() or the print_schema() command. You may remember that DataFrame.printSchema() is the only valid command of the two. The select('storeId') part just returns the storeId column of transactionsDf; this works here, since we are only interested in that column's type anyway.
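As a sketch of what this prints (assuming storeId is an integer column, as in the schemas shown earlier in this set):
transactionsDf.select('storeId').printSchema()
# root
#  |-- storeId: integer (nullable = true)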
More info: pyspark.sql.DataFrame.printSchema --- PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3, Question 57 (Databricks import instructions)