pvmehta.com

Read CSV file using PySpark

Posted on 30-Sep-2023 By Admin

from pyspark.sql.functions import col

# File location and type
file_location = "/FileStore/tables/sales_data_part1.csv"
file_type = "csv"

# CSV options
infer_schema = "false"
first_row_is_header = "true"
delimiter = ","

# The applied options are for CSV files. For other file types, these will be ignored.
# The surrounding parentheses let the method chain span several lines.
df = (
    spark.read.format(file_type)
    .option("inferSchema", infer_schema)
    .option("header", first_row_is_header)
    .option("sep", delimiter)
    .load(file_location)
)

display(df)

# Methods for renaming columns.

# Method-1: withColumnRenamed keeps every column and renames only the ones you name.
# Rename a single column
df1 = df.withColumnRenamed("InvoiceNo", "InvNo")
# Rename multiple columns by chaining calls
df2 = (
    df.withColumnRenamed("StockCode", "StkCode")
    .withColumnRenamed("Quantity", "Qty")
    .withColumnRenamed("InvoiceDate", "InvDate")
)
df.display()
df1.display()
df2.display()

 
# Method-2: selectExpr with "old_name as new_name". Only the columns listed in the select-list are kept, so this also reduces the number of columns.
df3 = df.selectExpr("InvoiceNo as Inv_no", "StockCode as stk_code", "Description as Desc")
df.display()
df3.display()

# Method-3: select with col(...).alias(...). As with Method-2, only the selected columns are kept.
# Remember: to use the "col" function you need to import it first:
# from pyspark.sql.functions import col

df4 = df.select(col("InvoiceNo").alias("inv"))
df4.display()

# Create a view or table

temp_table_name = "sales_data_part1_csv"

df.createOrReplaceTempView(temp_table_name)

%sql

/* Query the created temp table in a SQL cell */

select * from sales_data_part1_csv

# A temp view is only available to this particular notebook. If you'd like other users to be able to query this data, you can also create a table from the DataFrame.
# Once saved, the table persists across cluster restarts and can be queried by various users across different notebooks.
# To do so, choose your table name and run the last line below.

permanent_table_name = "t_sales_data_part1_csv"

df.write.format("parquet").saveAsTable(permanent_table_name)
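
Since hyphens are not allowed in table names (see note 2 below), a table name derived from a file name may need cleaning before it is passed to saveAsTable. Here is a small sketch of one way to do that in plain Python. The helper name to_table_name is hypothetical, and replacing every character other than letters, digits and underscores is an assumption that goes further than the hyphen rule itself.

```python
import re


def to_table_name(raw_name: str) -> str:
    """Replace hyphens, and any other character that is not a letter,
    digit or underscore, with an underscore."""
    return re.sub(r"[^0-9A-Za-z_]", "_", raw_name)


print(to_table_name("t-sales-data-part1-csv"))  # t_sales_data_part1_csv
```

The result could then be assigned to permanent_table_name before the write.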

1. This notebook is generated automatically when you load a CSV file in the "DATA" section.

2. Note: Hyphens are not allowed in table names, so replace all hyphens with underscores or other permitted characters.

3. Parquet is a compressed, columnar binary format (not a text format) and usually occupies much less space than CSV. For example, 2 GB of ASCII data can shrink to about 200 MB of Parquet.

4. With inferSchema=false, all columns are read as the string data type. With inferSchema=true, Spark examines the data and assigns a data type to each column, which then carries over to the table.

Python/PySpark

Copyright © 2026 pvmehta.com.
