
skiprows not working in the Koalas read_csv method

Open shakirshakeelzargar opened this issue 5 years ago • 4 comments

Here is my code:

import findspark
findspark.init()
import databricks.koalas as ks
from datetime import datetime
import time
x = datetime.utcnow()
df = ks.read_csv("c:/users/file.csv", skiprows=6)

I want to skip the first n rows because the actual data does not start on the first row; there is some junk data in the top few rows. Any suggestions?

shakirshakeelzargar avatar Nov 17 '20 12:11 shakirshakeelzargar

@shakirshakeelzargar Unfortunately, Spark doesn't support skipping rows when reading a CSV, and neither does Koalas, at least so far. If the file is small enough, you can read it with pandas and convert the result to Koalas.

import pandas as pd
df = ks.from_pandas(pd.read_csv("c:/users/file.csv", skiprows=6))

ueshin avatar Nov 17 '20 19:11 ueshin

@ueshin If it's a large file, what is the best way to skip the first n rows using only Koalas?

shakirshakeelzargar avatar Nov 18 '20 12:11 shakirshakeelzargar

Thanks for your interest in Koalas, @shakirshakeelzargar!

Could you show me an example of the "junk data" in the top 6 rows?

itholic avatar Nov 19 '20 05:11 itholic

@ueshin @itholic These are the top 13 rows of my data:

Date Range,5 Jul, 2020 - 4 Aug, 2020,
,
Date Based on,Current portfolio configuration
Data/D,Cataombined
Data Based on,Transaction Date
Data for,Daily
Locale for report,en_IN

col1,col2,col3,col4
xxx,yyy,zzz,zzz
xxx,yyy,zzz,zzz
xxx,yyy,zzz,zzz
xxx,yyy,zzz,zzz

So I want to use the row col1,col2,col3,col4 as the header and skip all the rows above it. In pandas we can easily do this by passing the skiprows parameter to read_csv. How can this be achieved in Koalas?

shakirshakeelzargar avatar Nov 19 '20 08:11 shakirshakeelzargar
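
For a large file, one possible workaround (a sketch, not a Koalas feature) is to stream the file once with plain Python, drop the leading junk lines, and write a cleaned copy that read_csv can then load normally. The helper name `strip_leading_lines` and the file names below are hypothetical, and the count of 8 comes from the sample above, where the header is the 9th line. It assumes none of the skipped rows contains a quoted field with an embedded newline.

```python
import itertools
import os
import tempfile


def strip_leading_lines(src_path, dst_path, skiprows):
    """Copy src_path to dst_path, dropping the first `skiprows` lines.

    Hypothetical helper: it streams line by line, so memory use stays
    flat regardless of file size.
    """
    with open(src_path, "r", newline="") as src, \
            open(dst_path, "w", newline="") as dst:
        # islice skips the first `skiprows` lines without loading the file.
        for line in itertools.islice(src, skiprows, None):
            dst.write(line)


# Demo on a shortened copy of the sample data from this thread.
sample_lines = [
    "Date Range,5 Jul, 2020 - 4 Aug, 2020,",
    ",",
    "Date Based on,Current portfolio configuration",
    "Data/D,Cataombined",
    "Data Based on,Transaction Date",
    "Data for,Daily",
    "Locale for report,en_IN",
    "",
    "col1,col2,col3,col4",
    "xxx,yyy,zzz,zzz",
    "xxx,yyy,zzz,zzz",
]

with tempfile.TemporaryDirectory() as tmp_dir:
    raw_path = os.path.join(tmp_dir, "file.csv")        # hypothetical name
    cleaned_path = os.path.join(tmp_dir, "cleaned.csv")  # hypothetical name

    with open(raw_path, "w", newline="") as f:
        f.write("\n".join(sample_lines) + "\n")

    # The header is the 9th line of the sample, so 8 lines are dropped.
    strip_leading_lines(raw_path, cleaned_path, skiprows=8)

    with open(cleaned_path, "r", newline="") as f:
        cleaned_lines = f.read().splitlines()

print(cleaned_lines[0])
```

The cleaned file can then be read with ks.read_csv(cleaned_path), provided the path is somewhere Spark can reach; on a cluster that would mean writing the cleaned copy to shared storage rather than a local temp directory.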