Read CSV file using PySpark

Posted on 30-Sep-2023 by Admin
from pyspark.sql.functions import col

# File location and type
file_location = "/FileStore/tables/sales_data_part1.csv"
file_type = "csv"

# CSV options
infer_schema = "false"
first_row_is_header = "true"
delimiter = ","

# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

display(df)

# Column-renaming methods

# Method-1: withColumnRenamed
# Rename a single column
df1 = df.withColumnRenamed("InvoiceNo", "InvNo")
# Rename multiple columns by chaining
df2 = df.withColumnRenamed("StockCode", "StkCode") \
    .withColumnRenamed("Quantity", "Qty") \
    .withColumnRenamed("InvoiceDate", "InvDate")
df.display()
df1.display()
df2.display()


 
# Method-2: selectExpr. This will actually reduce the number of columns,
# since only the listed columns appear in the result.
df3 = df.selectExpr("InvoiceNo as Inv_no", "StockCode as stk_code", "Description as Desc")
df.display()
df3.display()



# Method-3 for renaming columns. THis will actully reduce the number of columns from select-list.
# # Remember: To use “col” function you need to import it using following
# from pyspark.sql.functions import col


df4 = df.select(col(“InvoiceNo”).alias(“inv”), )
df4.display()
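Method-3 extends naturally to several columns in one select; a minimal sketch (the alias names below are illustrative, not from the original notebook):

df5 = df.select(col("InvoiceNo").alias("inv"),
                col("Quantity").alias("qty"),
                col("InvoiceDate").alias("inv_date"))
df5.display()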


# Create a view or table

temp_table_name = "sales_data_part1_csv"

df.createOrReplaceTempView(temp_table_name)

%sql

/* Query the created temp table in a SQL cell */

select * from sales_data_part1_csv
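The same query can also be run from a Python cell with spark.sql, which returns a DataFrame (a minimal equivalent of the SQL cell above):

result_df = spark.sql("select * from sales_data_part1_csv")
result_df.display()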


# With this registered as a temp view, it will only be available to this particular notebook. If you'd like other users to be able to query this table, you can also create a table from the DataFrame.
# Once saved, this table will persist across cluster restarts and allow various users across different notebooks to query this data.
# To do so, choose your table name and run the line below.

permanent_table_name = "t_sales_data_part1_csv"

df.write.format("parquet").saveAsTable(permanent_table_name)
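Once saved, the table can be read back from any notebook with spark.table; a minimal sketch:

saved_df = spark.table(permanent_table_name)
saved_df.display()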

1. This notebook is generated automatically when you load a CSV file in the "Data" section.

2. Note: hyphens are not allowed in table names, so replace all hyphens with underscores or other characters.

3. Parquet is a compressed, columnar binary format and occupies much less space than CSV; roughly 2 GB of CSV text can shrink to about 200 MB of Parquet.

4. infer_schema = "false" means all columns are loaded as the string data type. With infer_schema = "true", the notebook scans the data, identifies each column's data type, and applies it to the table (see the sketch below for an alternative).
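If you want typed columns without the extra scan that schema inference performs, you can also pass an explicit schema to the reader; a minimal sketch (the column types below are assumptions about the sales data, not taken from the original notebook):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Assumed types for illustration; adjust them to the actual CSV contents.
sales_schema = StructType([
    StructField("InvoiceNo", StringType(), True),
    StructField("StockCode", StringType(), True),
    StructField("Description", StringType(), True),
    StructField("Quantity", IntegerType(), True),
    StructField("InvoiceDate", StringType(), True),
])

df_typed = spark.read.format("csv") \
    .option("header", first_row_is_header) \
    .option("sep", delimiter) \
    .schema(sales_schema) \
    .load(file_location)

df_typed.display()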