18 questions from the last 7 days
0 votes · 1 answer · 82 views
Accessing Azure Key Vault Secrets list from Fabric Notebook using Managed Private Endpoint
I'm trying to retrieve the list of secrets from an Azure Key Vault using a Fabric Notebook in PySpark. I have a Managed Private Endpoint configured in the Fabric workspace pointing to the Key Vault:
...
1 vote · 2 answers · 49 views
How can I use a PySpark UDF in a for loop?
I need a PySpark UDF with a for loop to create new columns but with conditions based on the iterator value.
def test_map(col):
    if x == 1:
        if col < 0.55:
            return 1.2
...
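A common pitfall here (a hypothetical plain-Python sketch, since the question's full code is truncated): functions created inside a for loop capture the loop variable by reference, not by value, and the same late-binding rule applies to UDFs built in a loop. Binding the iterator as a default argument fixes it:

```python
# Plain-Python sketch of the loop-variable pitfall that also bites UDFs
# created in a for loop: closures capture the variable itself, not its
# value at definition time.
def make_checkers_buggy():
    checkers = []
    for x in range(3):
        checkers.append(lambda col: col + x)  # every lambda sees the final x
    return checkers

def make_checkers_fixed():
    checkers = []
    for x in range(3):
        checkers.append(lambda col, x=x: col + x)  # bind x per iteration
    return checkers

buggy = [f(10) for f in make_checkers_buggy()]
fixed = [f(10) for f in make_checkers_fixed()]
print(buggy)  # [12, 12, 12] -- all closures use the final x == 2
print(fixed)  # [10, 11, 12]
```

The same `x=x` default-argument trick works when the function is wrapped with `pyspark.sql.functions.udf` inside the loop.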
1 vote · 2 answers · 56 views
Databricks dataframe join - ambiguous columns
I am facing a problem in my Databricks Delta Live Tables (DLT) notebook. I am trying to join two dataframes, one of which is derived from the other, but I keep getting the following error:
...
0 votes · 1 answer · 34 views
Best Practices for Selecting Primary Key Combinations from Multiple Columns
I am working in Azure Databricks with a large PySpark DataFrame that has 170 columns. I need to identify the best possible combination of 2-3 columns to use as the primary key, ensuring:
Uniqueness: ...
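One way to approach this (a plain-Python sketch on hypothetical data; in PySpark the equivalent uniqueness check for a combination is `df.select(*combo).distinct().count() == df.count()`): brute-force the 2-3 column combinations and keep those whose value tuples are unique across all rows.

```python
# Hedged sketch: brute-force candidate primary keys over hypothetical rows.
# With 170 columns you would first shortlist high-cardinality columns,
# since the number of combinations grows quickly.
from itertools import combinations

rows = [
    {"region": "EU", "store": 1, "sku": "A"},
    {"region": "EU", "store": 1, "sku": "B"},
    {"region": "US", "store": 1, "sku": "A"},
]
columns = ["region", "store", "sku"]

def candidate_keys(rows, columns, max_size=3):
    found = []
    for size in range(2, max_size + 1):
        for combo in combinations(columns, size):
            values = [tuple(r[c] for c in combo) for r in rows]
            if len(set(values)) == len(values):  # no duplicate key tuples
                found.append(combo)
    return found

print(candidate_keys(rows, columns))
# [('region', 'sku'), ('region', 'store', 'sku')]
```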
0 votes · 0 answers · 29 views
Allocation of Executors in Callee/Caller Spark Notebooks
I am working in Azure Synapse Analytics and have a wrapper notebook calling two other notebooks, so I am running something like:
mssparkutils.notebook.run('/notebook_1_path', 3600)
mssparkutils.notebook....
1 vote · 0 answers · 39 views
spark connect udf fails with "SparkContext or SparkSession should be created first"
I have a Spark Connect server running. Things are fine when I don't use UDFs (df.show() always works fine), but when I use a UDF, it fails with "SparkContext or SparkSession should be created first". ...
-5 votes · 0 answers · 31 views
TypeError: 'Column' object is not callable in pyspark file reading [closed]
Trying to plot a histogram.
It says the Column is not callable. I am using the Hadoop framework and doing a churn analysis with PySpark, reading the file in PySpark.
I am learning Python as a newcomer and this is my first time ...
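For reference, this error usually means a Column object was used where a function was expected, e.g. calling something like `col("x")(...)` or a misspelled method. A minimal plain-Python sketch of the same TypeError (the `Column` class below is a hypothetical stand-in, not the PySpark one):

```python
# Minimal sketch: any non-callable object raises this TypeError when it is
# called like a function -- in PySpark the object is a Column.
class Column:
    """Stand-in for pyspark.sql.Column (not callable)."""

c = Column()
print(callable(c))  # False
try:
    c("histogram")  # calling a Column as if it were a function
except TypeError as err:
    print("TypeError:", err)
```

The fix is to find where a Column expression is followed by parentheses and replace the call with the intended column access or function.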
0 votes · 0 answers · 31 views
Filtering a dataframe provides a different output every time
I'm having an issue creating this Python script with PySpark. I am pulling old data from two sources, merging them, adding some columns to the dataframe, and then applying a filter on them to ...
0 votes · 0 answers · 20 views
Unable to read messages from kafka broker with PySpark
Problem: I am trying to connect to a Kafka broker using PySpark, but when consuming messages from the test-topic, I receive empty (NULL) values instead of the expected JSON content.
consumer.py code
def ...
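One common cause of NULL values here (an assumption, since the consumer code is truncated) is consuming raw bytes without deserializing them. kafka-python's `KafkaConsumer` accepts a `value_deserializer` callable, whose job is just:

```python
import json

# Hedged sketch: decode a Kafka message payload from bytes to JSON.
# An empty payload (a tombstone) is left as None rather than parsed.
def value_deserializer(raw: bytes):
    if not raw:
        return None
    return json.loads(raw.decode("utf-8"))

print(value_deserializer(b'{"user": "a", "churn": false}'))
# {'user': 'a', 'churn': False}

# Usage sketch (hypothetical topic/bootstrap values):
# consumer = KafkaConsumer("test-topic",
#                          bootstrap_servers="localhost:9092",
#                          value_deserializer=value_deserializer)
```

On the PySpark side the analogous step is `CAST(value AS STRING)` followed by `from_json` with an explicit schema.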
-1 votes · 0 answers · 27 views
Pyspark - spark-submit logging for both driver and executor
New to PySpark. I am using spark-submit to execute the program, and the logging.config package to write the executor logs to a file and to email errors on exceptions. But logging is not working; nothing is ...
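A hedged sketch of driver-side `logging.config` setup (hypothetical logger and file names): note that a config like this only affects the driver process. Executor Python workers are separate processes, so their output goes to the executor stderr/logs visible in the Spark UI, not to the driver's handlers, which is a frequent reason "nothing appears" in the file.

```python
import logging
import logging.config
import os
import tempfile

# Hedged sketch: configure driver-side logging to a file with dictConfig.
log_path = os.path.join(tempfile.mkdtemp(), "driver.log")
logging.config.dictConfig({
    "version": 1,
    "formatters": {
        "plain": {"format": "%(levelname)s %(name)s %(message)s"},
    },
    "handlers": {
        "file": {
            "class": "logging.FileHandler",
            "filename": log_path,
            "formatter": "plain",
        },
    },
    "root": {"level": "INFO", "handlers": ["file"]},
})

logging.getLogger("app").info("driver started")
with open(log_path) as fh:
    print(fh.read().strip())  # INFO app driver started
```

For executor-side messages, logging inside the task functions and reading the per-executor logs (or routing through log4j) is the usual route.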
0 votes · 0 answers · 31 views
How to flatten nested JSON in pyspark
I have a JSON file that looks like this:
[
  { "student_id": 1234,
    "room_id": "abc",
    "enrolled": false
  },
  { "student_id": 4321,
    ...
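A plain-Python sketch of the usual approach (using a hypothetical nested record, since the sample above is flat and truncated): recurse through nested objects and emit dot-separated keys, the same shape PySpark produces when you select a nested field like `col("a.b")` with an alias.

```python
import json

# Hedged sketch: flatten nested JSON objects into dot-separated keys.
# In PySpark you would instead walk df.schema and build a select list of
# col("parent.child").alias("parent.child") expressions.
def flatten(obj, prefix=""):
    out = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            out.update(flatten(value, name + "."))
        else:
            out[name] = value
    return out

record = json.loads('{"student_id": 1234, "room": {"id": "abc", "floor": 2}}')
print(flatten(record))
# {'student_id': 1234, 'room.id': 'abc', 'room.floor': 2}
```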
0 votes · 0 answers · 28 views
Delta Lake Merge Rewrites unchanged files
I want to do a merge on a subset of my delta table partitions to do incremental upserts to keep two tables in sync. I do not use a whenNotMatchedBySource statement to clean up stale rows in my target ...
0 votes · 1 answer · 27 views
Databricks: Generate Multiple Excels for SQL Query
I am getting "OSError: Errno 95: Operation not supported" for the code below. I have openpyxl 3.1.5 installed on the cluster and have imported all required modules. I am sure this is something ...
0 votes · 0 answers · 36 views
How to concatenate JSON key columns using a comma as separator
I have a problem to resolve:
I have a JSON file like:
{
  "database": {
    "table_1": {
      "load_type": "Delta",
      "columns...
0 votes · 0 answers · 25 views
Py4JJavaError : An error occurred while calling o745.save.\n: org.apache.spark.SparkException
I've just started working with Spark. I've built and trained the model, but I'm having trouble saving it.
from pyspark.ml.regression import GBTRegressor
gbt = GBTRegressor(featuresCol="features",...
0 votes · 1 answer · 30 views
DLT pipeline inserts pipeline Update ID into the table name and raises permission denied error on it
I have a DLT pipeline in Databricks that I am trying to execute (it has worked before), but I'm seeing strange behaviour: the pipeline uses the Update ID ...
0 votes · 0 answers · 33 views
Suppress py4j.clientserver logs in pyspark (databricks)
This seems to have been asked a few times, but I am raising this since none of the answers work for me.
This is the problem I have: databricks db article
I have a python whl task in databricks (...
0 votes · 0 answers · 17 views
Using pyspark databricks UDFs with outside function imports
Problem with a minimal example
The minimal example below does not run locally with databricks-connect==15.3, but does run within the Databricks workspace.
main.py
from databricks.connect import ...