
The Databricks Certified Associate Developer for Apache Spark 3.5 exam is a
90-minute, proctored, online assessment with 45 multiple-choice questions. The
exam costs $200, and a passing score of 65% is required. While no prerequisites
are mandatory, hands-on experience with Apache Spark, the DataFrame API, and
Python is highly recommended.
Recommended experience
Experience: 6+ months of hands-on experience with the tasks outlined in the exam
guide
Skills: Understanding of Spark architecture, Spark DataFrame API, and Spark SQL
Recommended: A year or more of hands-on experience with Spark and Python is
suggested for the Python-focused version, according to MeasureUp
Key topics covered
Apache Spark architecture
DataFrame API
Spark SQL
Structured Streaming
Spark Connect
Pandas API on Apache Spark
Preparation resources
Related Training:
Instructor-led or self-paced courses from Databricks Academy are highly
recommended.
Databricks Certified Associate Developer for Apache Spark
The Databricks Certified Associate Developer for Apache Spark certification exam
assesses an understanding of Apache Spark architecture and components and the
ability to apply the Spark DataFrame API to complete basic data manipulation
tasks within a Spark session. These tasks include selecting, renaming, and
manipulating columns; filtering, dropping, sorting, and aggregating rows;
handling missing data; combining, reading, writing, and partitioning DataFrames
with schemas; and working with UDFs and Spark SQL functions. In addition, the
exam assesses the basics of the Spark architecture, such as execution/deployment
modes, the execution hierarchy, fault tolerance, garbage collection, lazy
evaluation, shuffling, the use of actions, and broadcasting, as well as Structured
Streaming, Spark Connect, and common troubleshooting and tuning techniques.
Individuals who pass this certification exam can be expected to complete basic
Spark DataFrame tasks using Python.
This exam covers:
Apache Spark Architecture and Components - 20%
Using Spark SQL - 20%
Developing Apache Spark™ DataFrame/DataSet API Applications - 30%
Troubleshooting and Tuning Apache Spark DataFrame API Applications - 10%
Structured Streaming - 10%
Using Spark Connect to deploy applications - 5%
Using Pandas API on Apache Spark - 5%
Assessment Details
Type: Proctored certification
Total number of questions: 45
Time limit: 90 minutes
Registration fee: $200
Question types: Multiple choice
Test aids: None allowed
Languages: English
Delivery method: Online proctored, OnSite Proctored
Prerequisites: None, but related training highly recommended
Recommended experience: 6+ months of hands-on experience performing the tasks
outlined in the exam guide
Validity period: 2 years
Recertification: Recertification is required every two years to maintain your
certified status. To recertify, you must take the current version of the exam.
Please review the “Getting Ready for the Exam” section below to prepare for your
recertification exam.
Unscored content: Exams may include unscored items to gather statistical
information for future use. These items are not identified on the form and do
not impact your score. Additional time is factored into the exam to account for this content.
QUESTION 1
A data scientist of an e-commerce company is working with user data obtained
from its subscriber
database and has stored the data in a DataFrame df_user. Before further
processing the data, the
data scientist wants to create another DataFrame df_user_non_pii and store only
the non-PII
columns in this DataFrame. The PII columns in df_user are first_name, last_name,
email, and birthdate.
Which code snippet can be used to meet this requirement?
A. df_user_non_pii = df_user.drop("first_name", "last_name", "email", "birthdate")
B. df_user_non_pii = df_user.drop("first_name", "last_name", "email", "birthdate")
C. df_user_non_pii = df_user.dropfields("first_name", "last_name", "email", "birthdate")
D. df_user_non_pii = df_user.dropfields("first_name, last_name, email, birthdate")
Answer: A
Explanation:
To remove specific columns from a PySpark DataFrame, the drop() method is used.
This method
returns a new DataFrame without the specified columns. The correct syntax for
dropping multiple
columns is to pass each column name as a separate argument to the drop() method.
Correct Usage:
df_user_non_pii = df_user.drop("first_name", "last_name", "email", "birthdate")
This line of code will return a new DataFrame df_user_non_pii that excludes the
specified PII columns.
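As a quick sanity check (a minimal sketch using the column names from the question), the resulting
DataFrame should no longer list any of the PII columns:
pii_cols = {"first_name", "last_name", "email", "birthdate"}
df_user_non_pii = df_user.drop(*pii_cols)
assert pii_cols.isdisjoint(df_user_non_pii.columns)   # no PII columns remain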
Explanation of Options:
A . Correct. Uses the drop() method with multiple column names passed as
separate arguments,
which is the standard and correct usage in PySpark.
B. As printed here, this option is identical to Option A and would also work. The
intent is that Option B contains a syntax error (for example, unquoted column
names or incorrect variable names), which would cause it to fail; only the
correctly quoted form in Option A is right.
C. Incorrect. There is no dropfields() method on the DataFrame class in PySpark.
The similarly named Column.dropFields() is used on StructType columns to drop
fields from nested structures, not to drop top-level DataFrame columns.
D. Incorrect. In addition to using the nonexistent dropfields() method, passing a
single comma-separated string of column names is not valid syntax in PySpark.
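For contrast, a minimal sketch of where dropFields() does apply: Column.dropFields() removes
fields from a nested struct column rather than top-level columns. The struct column "profile"
below is hypothetical:
from pyspark.sql.functions import col

# Remove nested fields from a struct column; top-level columns are untouched
df_cleaned = df_user.withColumn("profile", col("profile").dropFields("email", "birthdate"))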
Reference:
PySpark Documentation: DataFrame.drop
Stack Overflow Discussion: How to delete columns in PySpark DataFrame
QUESTION 2
A data engineer is working on a Streaming DataFrame streaming_df with the given
streaming data:
Which operation is supported with streaming_df?
A. streaming_df.select(countDistinct("Name"))
B. streaming_df.groupby("Id").count()
C. streaming_df.orderBy("timestamp").limit(4)
D. streaming_df.filter(col("count") < 30).show()
Answer: B
Explanation:
In Structured Streaming, only a limited subset of operations is supported due to
the nature of
unbounded data. Operations like sorting (orderBy) and global aggregation (countDistinct)
require a
full view of the dataset, which is not possible with streaming data unless
specific watermarks or
windows are defined.
Review of Each Option:
A. select(countDistinct("Name"))
Not allowed. A global distinct aggregation like countDistinct() requires the full
dataset and is not supported directly in streaming without watermark and
windowing logic.
Reference: Databricks Structured Streaming Guide, Unsupported Operations.
B. groupby("Id").count()
Supported. Streaming aggregations over a key (like groupBy("Id")) are supported;
Spark maintains intermediate state for each key.
Reference: Databricks Docs, Aggregations in Structured Streaming
(https://docs.databricks.com/structured-streaming/aggregation.html)
C. orderBy("timestamp").limit(4)
Not allowed. Sorting and limiting require a full view of the stream (which is
unbounded), so this is unsupported on streaming DataFrames.
Reference: Spark Structured Streaming, Unsupported Operations (ordering without a
watermark/window is not allowed).
D. filter(col("count") < 30).show()
Not allowed. While filter() itself is supported, show() is a blocking action used
for debugging batch DataFrames; it cannot be called on a streaming DataFrame.
Reference: Structured Streaming Programming Guide, output operations like show()
are not supported.
Reference extract from the official guide:
"Operations like orderBy, limit, show, and countDistinct are not supported in
Structured Streaming because they require the full dataset to compute a result.
Use groupBy(...).agg(...) instead for incremental aggregations."
- Databricks Structured Streaming Programming Guide
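A minimal sketch of the supported pattern, using the built-in rate source for illustration (the sink
and output mode are placeholder choices):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The rate source emits (timestamp, value) rows and is handy for quick streaming tests
streaming_df = spark.readStream.format("rate").load()

# Keyed streaming aggregation: supported, Spark keeps per-key state
counts = streaming_df.groupBy("value").count()

# Aggregations require the complete or update output mode
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())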
QUESTION 3
An MLOps engineer is building a Pandas UDF that applies a language model that
translates English
strings into Spanish. The initial code is loading the model on every call to the
UDF, which is hurting
the performance of the data pipeline.
The initial code is:
def in_spanish_inner(df: pd.Series) -> pd.Series:
    model = get_translation_model(target_lang='es')
    return df.apply(model)
in_spanish = sf.pandas_udf(in_spanish_inner, StringType())
How can the MLOps engineer change this code to reduce how many times the
language model is loaded?
A. Convert the Pandas UDF to a PySpark UDF
B. Convert the Pandas UDF from a Series → Series UDF to a Series → Scalar
UDF
C. Run the in_spanish_inner() function in a mapInPandas() function call
D. Convert the Pandas UDF from a Series → Series UDF to an Iterator[Series]
→ Iterator[Series] UDF
Answer: D
Explanation:
The provided code defines a Pandas UDF of type Series-to-Series, where a new
instance of the
language model is created on each call, which happens per batch. This is
inefficient and results in
significant overhead due to repeated model initialization.
To reduce the frequency of model loading, the engineer should convert the UDF to
an iterator-based
Pandas UDF (Iterator[pd.Series] -> Iterator[pd.Series]). This allows the model to
be loaded once per UDF invocation (covering all the batches in a partition) and
reused across those batches, rather than once per batch.
From the official Databricks documentation:
"Iterator of Series to Iterator of Series UDFs are useful when the UDF
initialization is expensive... for example, loading an ML model once per executor
rather than once per row/batch."
- Databricks Official Docs: Pandas UDFs
A correct implementation looks like:
from typing import Iterator
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("string")
def translate_udf(batch_iter: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # Load the model once per UDF invocation, not once per batch
    model = get_translation_model(target_lang='es')
    for batch in batch_iter:
        yield batch.apply(model)
This refactor ensures that get_translation_model() is invoked once per UDF
invocation rather than once per batch, significantly improving pipeline
performance.
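As a usage sketch (the column names text_en and text_es are hypothetical), the iterator UDF is
applied like any other column expression:
df_translated = df.withColumn("text_es", translate_udf("text_en"))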
QUESTION 4
A Spark DataFrame df is cached using the MEMORY_AND_DISK storage level, but the
DataFrame is
too large to fit entirely in memory.
What is the likely behavior when Spark runs out of memory to store the DataFrame?
A. Spark duplicates the DataFrame in both memory and disk. If it doesn't fit in
memory, the DataFrame is stored and retrieved from the disk entirely.
B. Spark splits the DataFrame evenly between memory and disk, ensuring balanced
storage utilization.
C. Spark will store as much data as possible in memory and spill the rest to
disk when memory is full, continuing processing with performance overhead.
D. Spark stores the frequently accessed rows in memory and less frequently
accessed rows on disk, utilizing both resources to offer balanced performance.
Answer: C
Explanation:
When using the MEMORY_AND_DISK storage level, Spark attempts to cache as much of
the
DataFrame in memory as possible. If the DataFrame does not fit entirely in
memory, Spark will store
the remaining partitions on disk. This allows processing to continue, albeit
with a performance
overhead due to disk I/O.
As per the Spark documentation:
"MEMORY_AND_DISK: It stores partitions that do not fit in memory on disk and
keeps the rest in
memory. This can be useful when working with datasets that are larger than the
available memory."
- Perficient Blogs: Spark StorageLevel
This behavior ensures that Spark can handle datasets larger than the available
memory by spilling
excess data to disk, thus preventing job failures due to memory constraints.
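A minimal sketch of caching with this storage level (spark.range stands in for a genuinely large
DataFrame):
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(0, 500_000_000)              # placeholder for a DataFrame larger than memory
df.persist(StorageLevel.MEMORY_AND_DISK)      # partitions that do not fit in memory spill to disk
df.count()                                    # the first action materializes the cache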
QUESTION 5
A data engineer is building a Structured Streaming pipeline and wants the
pipeline to recover from
failures or intentional shutdowns by continuing where the pipeline left off.
How can this be achieved?
A. By configuring the option checkpointLocation during readStream
B. By configuring the option recoveryLocation during the SparkSession
initialization
C. By configuring the option recoveryLocation during writeStream
D. By configuring the option checkpointLocation during writeStream
Answer: D
Explanation:
To enable a Structured Streaming query to recover from failures or intentional
shutdowns, it is
essential to specify the checkpointLocation option during the writeStream
operation. This checkpoint
location stores the progress information of the streaming query, allowing it to
resume from where it left off.
According to the Databricks documentation:
"You must specify the checkpointLocation option before you run a streaming
query, as in the following example:
.option("checkpointLocation", "/path/to/checkpoint/dir")
.toTable("catalog.schema.table")
” Databricks Documentation: Structured Streaming checkpoints
By setting the checkpointLocation during writeStream, Spark can maintain state
information and
ensure exactly-once processing semantics, which are crucial for reliable
streaming applications.
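Putting it together, a minimal sketch of a recoverable streaming write (the format, checkpoint path,
and table name are placeholders):
query = (streaming_df.writeStream
         .format("delta")
         .outputMode("append")
         .option("checkpointLocation", "/path/to/checkpoint/dir")
         .toTable("catalog.schema.events"))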
Student Feedback / Reviews / Discussion
Mahrous Mostafa Adel Amin 1 week, 2 days ago - Abuhib, United Arab Emirates
Passed the exam today, Got 98 questions in total, and 2 of them weren’t from
exam topics. Rest of them was exactly the same!
upvoted 4 times
Mbongiseni Dlongolo 2 weeks, 5 days ago - South Africa
Thank you so much, I passed Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 today! 41 questions out of 44 are from
Certkingdom
upvoted 2 times
Kenyon Stefanie 1 month, 1 week ago - Virginia, USA
Thank you so much, huge help! I passed the Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5
exam today! The big majority of questions were from here.
upvoted 2 times
Danny 1 month, 1 week ago - Costa Mesa, USA
Passed the exam today, 100% points. Got 44 questions in total, and 3 of them
weren’t from exam topics. Rest of them was exactly the same!
MENESES RAUL 2 weeks ago - Texas, USA
93% was from this topic! I did buy the contributor access. Thank you certkingdom!
upvoted 4 times
Zemljaric Rok 1 month, 2 weeks ago - Ljubljana Slovenia
Cleared my exam today - Over 80% questions from here, many thanks certkingdom
and everyone for the meaningful discussions.
upvoted 2 times