18 questions from the last 7 days
0 votes · 1 answer · 82 views
Accessing Azure Key Vault Secrets list from Fabric Notebook using Managed Private Endpoint
I'm trying to retrieve the list of secrets from an Azure Key Vault using a Fabric Notebook in PySpark. I have a Managed Private Endpoint configured in the Fabric workspace pointing to the Key Vault:
...
1 vote · 2 answers · 49 views
How can I use a PySpark UDF in a for loop?
I need a PySpark UDF with a for loop to create new columns but with conditions based on the iterator value.
def test_map(col):
    if x == 1:
        if col < 0.55:
            return 1.2
...
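A common pitfall here (a hypothetical plain-Python sketch, since the question's full code is truncated): functions created inside a for loop capture the loop variable by reference, not by value, and the same late-binding rule applies to UDFs built in a loop. Binding the iterator as a default argument fixes it:

```python
# Plain-Python sketch of the loop-variable pitfall that also bites UDFs
# created in a for loop: closures capture the variable itself, not its
# value at definition time.
def make_checkers_buggy():
    checkers = []
    for x in range(3):
        checkers.append(lambda col: col + x)  # every lambda sees the final x
    return checkers

def make_checkers_fixed():
    checkers = []
    for x in range(3):
        checkers.append(lambda col, x=x: col + x)  # bind x per iteration
    return checkers

buggy = [f(10) for f in make_checkers_buggy()]
fixed = [f(10) for f in make_checkers_fixed()]
print(buggy)  # [12, 12, 12] -- all closures use the final x == 2
print(fixed)  # [10, 11, 12]
```

The same `x=x` default-argument trick works when the function is wrapped with `pyspark.sql.functions.udf` inside the loop.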
1 vote · 2 answers · 56 views
Databricks dataframe join - ambiguous columns
I am facing a problem in my Databricks Delta Live Tables (DLT) notebook. I am trying to join two dataframes, one of which is derived from the other, but I keep getting the following error:
...
0 votes · 1 answer · 34 views
Best Practices for Selecting Primary Key Combinations from Multiple Columns
I am working in Azure Databricks with a large PySpark DataFrame that has 170 columns. I need to identify the best possible combination of 2-3 columns to use as the primary key, ensuring:
Uniqueness: ...
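One way to approach this (a plain-Python sketch on hypothetical data; in PySpark the equivalent uniqueness check for a combination is `df.select(*combo).distinct().count() == df.count()`): brute-force the 2-3 column combinations and keep those whose value tuples are unique across all rows.

```python
# Hedged sketch: brute-force candidate primary keys over hypothetical rows.
# With 170 columns you would first shortlist high-cardinality columns,
# since the number of combinations grows quickly.
from itertools import combinations

rows = [
    {"region": "EU", "store": 1, "sku": "A"},
    {"region": "EU", "store": 1, "sku": "B"},
    {"region": "US", "store": 1, "sku": "A"},
]
columns = ["region", "store", "sku"]

def candidate_keys(rows, columns, max_size=3):
    found = []
    for size in range(2, max_size + 1):
        for combo in combinations(columns, size):
            values = [tuple(r[c] for c in combo) for r in rows]
            if len(set(values)) == len(values):  # no duplicate key tuples
                found.append(combo)
    return found

print(candidate_keys(rows, columns))
# [('region', 'sku'), ('region', 'store', 'sku')]
```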
0 votes · 0 answers · 29 views
Allocation of Executors in Callee/Caller Spark Notebooks
I am working in Azure Synapse Analytics and have a wrapper notebook calling two other notebooks, so I am running something like:
mssparkutils.notebook.run('/notebook_1_path', 3600)
mssparkutils.notebook....
1 vote · 0 answers · 39 views
spark connect udf fails with "SparkContext or SparkSession should be created first"
I have a Spark Connect server running. Things are fine when I don't use UDFs (df.show() always works fine), but when I use a UDF, it fails with "SparkContext or SparkSession should be created first". ...
-5 votes · 0 answers · 31 views
TypeError: 'Column' object is not callable in pyspark file reading [closed]
Trying to plot a histogram.
It says the Column is not callable. I am using the Hadoop framework and doing a churn analysis with PySpark, reading the file in PySpark.
I am learning Python as a newcomer and this is my first time ...
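For reference, this error usually means a Column object was used where a function was expected, e.g. calling something like `col("x")(...)` or a misspelled method. A minimal plain-Python sketch of the same TypeError (the `Column` class below is a hypothetical stand-in, not the PySpark one):

```python
# Minimal sketch: any non-callable object raises this TypeError when it is
# called like a function -- in PySpark the object is a Column.
class Column:
    """Stand-in for pyspark.sql.Column (not callable)."""

c = Column()
print(callable(c))  # False
try:
    c("histogram")  # calling a Column as if it were a function
except TypeError as err:
    print("TypeError:", err)
```

The fix is to find where a Column expression is followed by parentheses and replace the call with the intended column access or function.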
0 votes · 0 answers · 31 views
Filtering a dataframe provides a different output every time
I'm having an issue creating this Python script with PySpark. I am pulling old data from two sources, merging them, adding some columns to the dataframe, and then applying a filter on them to ...
0 votes · 0 answers · 20 views
Unable to read messages from kafka broker with PySpark
Problem: I am trying to connect to a Kafka broker using PySpark, but when consuming messages from the test-topic, I receive empty (NULL) values instead of the expected JSON content.
consumer.py code
def ...
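One common cause of NULL values here (an assumption, since the consumer code is truncated) is consuming raw bytes without deserializing them. kafka-python's `KafkaConsumer` accepts a `value_deserializer` callable, whose job is just:

```python
import json

# Hedged sketch: decode a Kafka message payload from bytes to JSON.
# An empty payload (a tombstone) is left as None rather than parsed.
def value_deserializer(raw: bytes):
    if not raw:
        return None
    return json.loads(raw.decode("utf-8"))

print(value_deserializer(b'{"user": "a", "churn": false}'))
# {'user': 'a', 'churn': False}

# Usage sketch (hypothetical topic/bootstrap values):
# consumer = KafkaConsumer("test-topic",
#                          bootstrap_servers="localhost:9092",
#                          value_deserializer=value_deserializer)
```

On the PySpark side the analogous step is `CAST(value AS STRING)` followed by `from_json` with an explicit schema.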
-1 votes · 0 answers · 27 views
Pyspark - spark-submit logging for both driver and executor
New to PySpark. I am using spark-submit to execute the program, and the logging.config package to write the executor logs to a file and to email errors on exceptions. But logging is not working; nothing is ...
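A hedged sketch of driver-side `logging.config` setup (hypothetical logger and file names): note that a config like this only affects the driver process. Executor Python workers are separate processes, so their output goes to the executor stderr/logs visible in the Spark UI, not to the driver's handlers, which is a frequent reason "nothing appears" in the file.

```python
import logging
import logging.config
import os
import tempfile

# Hedged sketch: configure driver-side logging to a file with dictConfig.
log_path = os.path.join(tempfile.mkdtemp(), "driver.log")
logging.config.dictConfig({
    "version": 1,
    "formatters": {
        "plain": {"format": "%(levelname)s %(name)s %(message)s"},
    },
    "handlers": {
        "file": {
            "class": "logging.FileHandler",
            "filename": log_path,
            "formatter": "plain",
        },
    },
    "root": {"level": "INFO", "handlers": ["file"]},
})

logging.getLogger("app").info("driver started")
with open(log_path) as fh:
    print(fh.read().strip())  # INFO app driver started
```

For executor-side messages, logging inside the task functions and reading the per-executor logs (or routing through log4j) is the usual route.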
0 votes · 0 answers · 31 views
How to flatten nested JSON in pyspark
I have a JSON file that looks like this:
[
  { "student_id": 1234,
    "room_id": "abc",
    "enrolled": false
  },
  { "student_id": 4321,
    ...
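A plain-Python sketch of the usual approach (using a hypothetical nested record, since the sample above is flat and truncated): recurse through nested objects and emit dot-separated keys, the same shape PySpark produces when you select a nested field like `col("a.b")` with an alias.

```python
import json

# Hedged sketch: flatten nested JSON objects into dot-separated keys.
# In PySpark you would instead walk df.schema and build a select list of
# col("parent.child").alias("parent.child") expressions.
def flatten(obj, prefix=""):
    out = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            out.update(flatten(value, name + "."))
        else:
            out[name] = value
    return out

record = json.loads('{"student_id": 1234, "room": {"id": "abc", "floor": 2}}')
print(flatten(record))
# {'student_id': 1234, 'room.id': 'abc', 'room.floor': 2}
```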
0 votes · 0 answers · 28 views
Delta Lake Merge Rewrites unchanged files
I want to do a merge on a subset of my delta table partitions to do incremental upserts to keep two tables in sync. I do not use a whenNotMatchedBySource statement to clean up stale rows in my target ...
0 votes · 1 answer · 27 views
Databricks: Generate Multiple Excels for SQL Query
I am getting "OSError: Errno 95: Operation not supported" for the code below. I have openpyxl 3.1.5 installed on the cluster and have imported all required modules. I am sure this is something ...
0 votes · 0 answers · 36 views
How to concatenate JSON key columns using a comma as separator
I have a problem to resolve:
I have a JSON file like:
{
  "database": {
    "table_1": {
      "load_type": "Delta",
      "columns...
0 votes · 0 answers · 25 views
Py4JJavaError : An error occurred while calling o745.save.\n: org.apache.spark.SparkException
I've just started working with Spark. I've built and trained the model, but I'm having trouble saving it.
from pyspark.ml.regression import GBTRegressor
gbt = GBTRegressor(featuresCol="features",...
0 votes · 1 answer · 30 views
DLT pipeline inserts pipeline Update ID into the table name and raises permission denied error on it
I have a DLT pipeline in Databricks that I am trying to execute (it has worked before), but I'm seeing strange behaviour: the pipeline uses the Update ID ...
0 votes · 0 answers · 33 views
Suppress py4j.clientserver logs in pyspark (databricks)
This seems to have been asked a few times, but I am raising this since none of the answers work for me.
This is the problem I have: databricks db article
I have a python whl task in databricks (...
0 votes · 0 answers · 17 views
Using pyspark databricks UDFs with outside function imports
Problem with a minimal example
The minimal example below does not run locally with databricks-connect==15.3, but does run within the Databricks workspace.
main.py
from databricks.connect import ...