pvmehta.com

Read CSV file using PySpark

Posted on 30-Sep-2023 By Admin
from pyspark.sql.functions import col

 

# File location and type

file_location = "/FileStore/tables/sales_data_part1.csv"
file_type = "csv"

# CSV options
infer_schema = "false"
first_row_is_header = "true"
delimiter = ","

# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

display(df)

# Methods for renaming columns.

# Method-1 for renaming columns. withColumnRenamed keeps every column and changes only the named ones.
# Rename a single column
df1 = df.withColumnRenamed("InvoiceNo", "InvNo")
# Rename multiple columns
df2 = df.withColumnRenamed("StockCode", "StkCode").withColumnRenamed("Quantity", "Qty").withColumnRenamed("InvoiceDate", "InvDayte")
df.display()
df1.display()
df2.display()
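
Chaining withColumnRenamed gets long when many columns change. A minimal sketch of the same idea driven by a dictionary follows; the helper name rename_columns is hypothetical and not part of the original notebook, and it only relies on the withColumnRenamed call already used above.

```python
from functools import reduce


def rename_columns(df, mapping):
    """Apply withColumnRenamed once per old -> new pair in mapping.

    Hypothetical helper: df is any DataFrame-like object that exposes
    withColumnRenamed(old, new) and returns a new object.
    """
    return reduce(
        lambda acc, pair: acc.withColumnRenamed(pair[0], pair[1]),
        mapping.items(),
        df,
    )
```

In the notebook this could be called as df2 = rename_columns(df, {"StockCode": "StkCode", "Quantity": "Qty"}). Columns not named in the mapping are left untouched.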


 
# Method-2 for renaming columns. This returns only the columns listed in selectExpr, so any column left out of the list is dropped.
df3 = df.selectExpr("InvoiceNo as Inv_no", "StockCode as stk_code", "Description as Desc")
df.display()
df3.display()
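
The selectExpr strings above can also be generated from a dictionary. The sketch below is an illustration only: build_alias_exprs and select_renamed are hypothetical names, and the backtick quoting is an added precaution for names that contain spaces or clash with SQL keywords.

```python
def build_alias_exprs(mapping):
    """Turn {"old": "new"} into ["`old` as `new`", ...] for selectExpr."""
    return [f"`{old}` as `{new}`" for old, new in mapping.items()]


def select_renamed(df, mapping):
    """Select only the mapped columns, each under its new name."""
    return df.selectExpr(*build_alias_exprs(mapping))
```

For example, select_renamed(df, {"InvoiceNo": "Inv_no", "StockCode": "stk_code"}) would return a two-column DataFrame, just like Method-2.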



# Method-3 for renaming columns. This also returns only the columns listed in select.
# Remember: to use the "col" function you need to import it first:
# from pyspark.sql.functions import col


df4 = df.select(col("InvoiceNo").alias("inv"))
df4.display()


# Create a view or table

temp_table_name = "sales_data_part1_csv"

df.createOrReplaceTempView(temp_table_name)

%sql

/* Query the created temp table in a SQL cell */

select * from sales_data_part1_csv


# Registered as a temp view, the data is only available to this particular notebook. If you'd like other users to be able to query this table, you can also create a table from the DataFrame.
# Once saved, this table will persist across cluster restarts and allow various users across different notebooks to query this data.
# To do so, choose your table name and run the line below.

permanent_table_name = "t_sales_data_part1_csv"

df.write.format("parquet").saveAsTable(permanent_table_name)
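
Because hyphens are not allowed in table names, it can help to derive the name from the file path in one place. This is a rough sketch under assumptions: to_table_name is a hypothetical helper, and replacing every character other than letters, digits and underscores is a conservative choice, not a documented rule.

```python
import re


def to_table_name(file_path):
    """Derive a table-safe name from a file path.

    Takes the last path component, replaces anything that is not a
    letter, digit or underscore with an underscore, and lowercases it.
    """
    base = file_path.rsplit("/", 1)[-1]
    return re.sub(r"[^0-9A-Za-z_]", "_", base).lower()
```

With this helper, a file uploaded as sales-data-part1.csv would map to a name that saveAsTable or createOrReplaceTempView can accept.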

1. This notebook is generated automatically when you load a CSV file in the "DATA" section.

2. Note: Hyphens are not allowed in table names, so replace all hyphens with underscores or other characters.

3. Parquet is a compressed, columnar binary format (not a text format) and occupies much less space than CSV. For example, 2GB of ASCII data shrank to 200MB as Parquet.

4. With infer_schema = "false", every column is read as a string. With infer_schema = "true", Spark scans the data, identifies each column's data type, and uses those types in the table.
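
As a middle ground between reading everything as strings and letting Spark infer types, the reader can be given an explicit schema as a DDL string. The sketch below is an assumption-heavy illustration: the column names are the ones mentioned in this post, the types are guesses, the real file may contain more columns, and the list must match the file's columns in order. The helper names are hypothetical.

```python
# Column names come from the post; the types are assumptions for illustration.
SALES_COLUMNS = [
    ("InvoiceNo", "STRING"),
    ("StockCode", "STRING"),
    ("Description", "STRING"),
    ("Quantity", "INT"),
    ("InvoiceDate", "STRING"),
]


def schema_ddl(columns):
    """Build a DDL schema string such as "InvoiceNo STRING, Quantity INT"."""
    return ", ".join(f"{name} {dtype}" for name, dtype in columns)


def read_csv_with_schema(spark, path, columns=SALES_COLUMNS):
    """Read a headered, comma-separated CSV using an explicit schema."""
    return (
        spark.read.format("csv")
        .option("header", "true")
        .option("sep", ",")
        .schema(schema_ddl(columns))
        .load(path)
    )
```

In a notebook where spark is already defined, read_csv_with_schema(spark, file_location) would return a DataFrame with Quantity typed as an integer without a separate inference pass over the data.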
Python/PySpark

Copyright © 2025 pvmehta.com.
