Skiprows not working with Koalas DataFrame read_csv method
Here is my code:

import findspark
findspark.init()

import databricks.koalas as ks
from datetime import datetime

x = datetime.utcnow()  # timestamp taken before the read, presumably for timing
# Attempt to skip the junk lines at the top of the file, as in pandas
df = ks.read_csv("c:/users/file.csv", skiprows=6)
I want to skip the first n rows because the actual data does not start on the first row; there is some junk data in the top couple of rows. Any suggestions?
@shakirshakeelzargar Unfortunately, Spark doesn't support such operations, and neither does Koalas, at least so far. If the file is small enough, you can use pandas and convert it to Koalas:
import pandas as pd

# pandas supports skiprows; read with pandas first, then convert to Koalas
df = ks.from_pandas(pd.read_csv("c:/users/file.csv", skiprows=6))
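Note that pd.read_csv loads the entire file into the driver's memory, so this only works when the file fits on a single machine.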
@ueshin If it's a large file, what is the best way to skip the first n rows? Using only Koalas?
Thanks for your interest in Koalas, @shakirshakeelzargar!
Could you show me an example of the "junk data" in the top 6 rows?
@ueshin @itholic These are the top 13 rows of my data:
Date Range,5 Jul, 2020 - 4 Aug, 2020,
,
Date Based on,Current portfolio configuration
Data/D,Cataombined
Data Based on,Transaction Date
Data for,Daily
Locale for report,en_IN
col1,col2,col3,col4
xxx,yyy,zzz,zzz
xxx,yyy,zzz,zzz
xxx,yyy,zzz,zzz
xxx,yyy,zzz,zzz
So I want to use this row (col1,col2,col3,col4) for the header names and skip all rows above it. In pandas we can easily do this by passing the skiprows parameter to read_csv, but how can this be achieved in Koalas?
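For files too large for pandas, one possible workaround (a sketch at the Spark level, not a built-in Koalas feature) is to drop the leading lines from the raw text with the RDD API and then let Spark parse the remainder as CSV, treating the first kept line as the header. This assumes a running SparkSession available as spark; the path and the count of 7 metadata lines are taken from the sample above.

from pyspark.sql import SparkSession
import databricks.koalas as ks

spark = SparkSession.builder.getOrCreate()

N = 7  # number of metadata lines before the real header row (see sample above)

# zipWithIndex pairs each line with its position in the file;
# keep only the lines at or after the header row
lines = (
    spark.sparkContext.textFile("c:/users/file.csv")
    .zipWithIndex()
    .filter(lambda pair: pair[1] >= N)
    .map(lambda pair: pair[0])
)

# spark.read.csv accepts an RDD of strings; the first remaining line
# ("col1,col2,col3,col4") becomes the header
sdf = spark.read.csv(lines, header=True, inferSchema=True)

# Convert the Spark DataFrame to a Koalas DataFrame
df = sdf.to_koalas()

Since everything stays in Spark, this should scale to files that do not fit in driver memory, at the cost of an extra pass over the raw text.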