OSCI
Unable to get basic example to run
Hey, folks --
I'm having trouble getting the basic example provided to run. Specifically, the failure I'm encountering is at the daily-osci-rankings stage. I have confirmed that I have a functioning local Hadoop installation. I'm running Ubuntu 20.04 LTS on a fresh VPS install.
I pulled the two most visible errors out of the log below (full log at the bottom of the issue). It's unclear to me whether they are related, though.
Any help pointing me in the right direction would be appreciated!
$ python3 osci-cli.py get-github-daily-push-events -d 2020-01-01
# success
$ python3 osci-cli.py process-github-daily-push-events -d 2020-01-01
# success
$ python3 osci-cli.py daily-osci-rankings -td 2020-01-02
# failure (see full log below)
# ...
[2022-03-22 18:11:11,850] [DEBUG] Answer received: !ysorg.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;\n at org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:200)\n at scala.Option.getOrElse(Option.scala:189)\n at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:200)\n at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408)\n at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)\n at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)\n at scala.Option.getOrElse(Option.scala:189)\n at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)\n at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n at java.lang.reflect.Method.invoke(Method.java:498)\n at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n at py4j.Gateway.invoke(Gateway.java:282)\n at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n at py4j.commands.CallCommand.execute(CallCommand.java:79)\n at py4j.GatewayConnection.run(GatewayConnection.java:238)\n at java.lang.Thread.run(Thread.java:748)\n
<osci.datalake.local.landing.LocalLandingArea object at 0x7fa5e8753f40> /data landing
<osci.datalake.local.staging.LocalStagingArea object at 0x7fa5e87609a0> /data staging
<osci.datalake.local.public.LocalPublicArea object at 0x7fa5e8760940> /data public
<osci.datalake.local.web.LocalWebArea object at 0x7fa5e8760a90> /web data
# ...
[2022-03-22 18:11:11,855] [DEBUG] Answer received: !yv
Traceback (most recent call last):
File "osci-cli.py", line 93, in <module>
cli(standalone_mode=False)
File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/home/ubuntu/OSCI/osci/actions/base.py", line 59, in execute
return self._execute(**self._process_params(kwargs))
File "/home/ubuntu/OSCI/osci/actions/process/generate_daily_osci_rankings.py", line 49, in _execute
commits = osci_ranking_job.extract(to_date=to_day).cache()
File "/home/ubuntu/OSCI/osci/jobs/base.py", line 44, in extract
commits=Session().load_dataframe(paths=self._get_dataset_paths(to_date, from_date))
File "/home/ubuntu/OSCI/osci/jobs/session.py", line 39, in load_dataframe
return self.spark_session.read.load(paths, **options)
File "/home/ubuntu/.local/lib/python3.8/site-packages/pyspark/sql/readwriter.py", line 182, in load
return self._df(self._jreader.load(self._spark._sc._jvm.PythonUtils.toSeq(path)))
File "/home/ubuntu/.local/lib/python3.8/site-packages/py4j/java_gateway.py", line 1304, in __call__
return_value = get_return_value(
File "/home/ubuntu/.local/lib/python3.8/site-packages/pyspark/sql/utils.py", line 134, in deco
raise_from(converted)
File "<string>", line 3, in raise_from
Full Error Log:
[2022-03-22 18:11:05,996] [INFO] ENV: None
[2022-03-22 18:11:05,997] [DEBUG] Check config file for env local exists
[2022-03-22 18:11:05,997] [DEBUG] Read config from /home/ubuntu/OSCI/osci/config/files/local.yml
[2022-03-22 18:11:06,000] [DEBUG] Prod yml load: {'meta': {'config_source': 'yaml'}, 'file_system': {'type': 'local', 'base_path': '/data'}, 'areas': {'landing': {'container': 'landing'}, 'staging': {'container': 'staging'}, 'public': {'container': 'public'}}, 'bq': {'project': '', 'secret': '{}'}, 'web': {'fs': 'local', 'base_path': '/web', 'account_name': '', 'account_key': '', 'container': 'data'}, 'github': {'token': ''}, 'company': {'default': 'EPAM'}}
[2022-03-22 18:11:06,000] [DEBUG] Prod yml res: {'meta': {'config_source': 'yaml'}, 'file_system': {'type': 'local', 'base_path': '/data'}, 'areas': {'landing': {'container': 'landing'}, 'staging': {'container': 'staging'}, 'public': {'container': 'public'}}, 'bq': {'project': '', 'secret': '{}'}, 'web': {'fs': 'local', 'base_path': '/web', 'account_name': '', 'account_key': '', 'container': 'data'}, 'github': {'token': ''}, 'company': {'default': 'EPAM'}}
[2022-03-22 18:11:06,000] [INFO] Full config: {'meta': {'config_source': 'yaml'}, 'file_system': {'type': 'local', 'base_path': '/data'}, 'areas': {'landing': {'container': 'landing'}, 'staging': {'container': 'staging'}, 'public': {'container': 'public'}}, 'bq': {'project': '', 'secret': '{}'}, 'web': {'fs': 'local', 'base_path': '/web', 'account_name': '', 'account_key': '', 'container': 'data'}, 'github': {'token': ''}, 'company': {'default': 'EPAM'}}
[2022-03-22 18:11:06,000] [INFO] Configuration loaded for env: local
[2022-03-22 18:11:06,000] [DEBUG] Create new <class 'osci.config.base.LocalFileSystemConfig'>
[2022-03-22 18:11:06,000] [DEBUG] {'fs': 'local', 'base_path': '/web', 'account_name': '', 'account_key': '', 'container': 'data'}
[2022-03-22 18:11:06,000] [DEBUG] Create new <class 'osci.config.base.Config'>
[2022-03-22 18:11:06,000] [DEBUG] Create new <class 'osci.datalake.datalake.DataLake'>
[2022-03-22 18:11:06,113] [INFO] Execute action `daily-osci-rankings`
[2022-03-22 18:11:06,113] [INFO] Action params `{'to_day': '2020-01-02'}`
[2022-03-22 18:11:06,114] [DEBUG] Create new <class 'osci.datalake.reports.general.osci_ranking.OSCIRankingFactory'>
[2022-03-22 18:11:06,114] [DEBUG] Create new <class 'osci.datalake.reports.general.commits_ranking.OSCICommitsRankingFactory'>
[2022-03-22 18:11:06,114] [DEBUG] Create new <class 'osci.jobs.session.Session'>
[2022-03-22 18:11:06,115] [DEBUG] Loaded paths for (None 2020-01-02 00:00:00) []
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
[2022-03-22 18:11:08,127] [DEBUG] Command to send: A
fb324a0d50b599ec733f3b3b1bc1d7f4d1c894100f14d4ad6f4af9db025d37ea
[2022-03-22 18:11:08,142] [DEBUG] Answer received: !yv
[2022-03-22 18:11:08,142] [DEBUG] Command to send: j
i
rj
org.apache.spark.SparkConf
e
[2022-03-22 18:11:08,143] [DEBUG] Answer received: !yv
[2022-03-22 18:11:08,143] [DEBUG] Command to send: j
i
rj
org.apache.spark.api.java.*
e
[2022-03-22 18:11:08,144] [DEBUG] Answer received: !yv
[2022-03-22 18:11:08,144] [DEBUG] Command to send: j
i
rj
org.apache.spark.api.python.*
e
[2022-03-22 18:11:08,144] [DEBUG] Answer received: !yv
[2022-03-22 18:11:08,144] [DEBUG] Command to send: j
i
rj
org.apache.spark.ml.python.*
e
[2022-03-22 18:11:08,144] [DEBUG] Answer received: !yv
[2022-03-22 18:11:08,144] [DEBUG] Command to send: j
i
rj
org.apache.spark.mllib.api.python.*
e
[2022-03-22 18:11:08,144] [DEBUG] Answer received: !yv
[2022-03-22 18:11:08,144] [DEBUG] Command to send: j
i
rj
org.apache.spark.sql.*
e
[2022-03-22 18:11:08,144] [DEBUG] Answer received: !yv
[2022-03-22 18:11:08,144] [DEBUG] Command to send: j
i
rj
org.apache.spark.sql.api.python.*
e
[2022-03-22 18:11:08,145] [DEBUG] Answer received: !yv
[2022-03-22 18:11:08,145] [DEBUG] Command to send: j
i
rj
org.apache.spark.sql.hive.*
e
[2022-03-22 18:11:08,146] [DEBUG] Answer received: !yv
[2022-03-22 18:11:08,146] [DEBUG] Command to send: j
i
rj
scala.Tuple2
e
[2022-03-22 18:11:08,146] [DEBUG] Answer received: !yv
[2022-03-22 18:11:08,146] [DEBUG] Command to send: r
u
SparkConf
rj
e
[2022-03-22 18:11:08,147] [DEBUG] Answer received: !ycorg.apache.spark.SparkConf
[2022-03-22 18:11:08,148] [DEBUG] Command to send: i
org.apache.spark.SparkConf
bTrue
e
[2022-03-22 18:11:08,154] [DEBUG] Answer received: !yro0
[2022-03-22 18:11:08,154] [DEBUG] Command to send: c
o0
contains
sspark.serializer.objectStreamReset
e
[2022-03-22 18:11:08,158] [DEBUG] Answer received: !ybfalse
[2022-03-22 18:11:08,158] [DEBUG] Command to send: c
o0
set
sspark.serializer.objectStreamReset
s100
e
[2022-03-22 18:11:08,158] [DEBUG] Answer received: !yro1
[2022-03-22 18:11:08,158] [DEBUG] Command to send: m
d
o1
e
[2022-03-22 18:11:08,159] [DEBUG] Answer received: !yv
[2022-03-22 18:11:08,159] [DEBUG] Command to send: c
o0
contains
sspark.rdd.compress
e
[2022-03-22 18:11:08,159] [DEBUG] Answer received: !ybfalse
[2022-03-22 18:11:08,159] [DEBUG] Command to send: c
o0
set
sspark.rdd.compress
sTrue
e
[2022-03-22 18:11:08,159] [DEBUG] Answer received: !yro2
[2022-03-22 18:11:08,159] [DEBUG] Command to send: m
d
o2
e
[2022-03-22 18:11:08,159] [DEBUG] Answer received: !yv
[2022-03-22 18:11:08,160] [DEBUG] Command to send: c
o0
contains
sspark.master
e
[2022-03-22 18:11:08,160] [DEBUG] Answer received: !ybtrue
[2022-03-22 18:11:08,160] [DEBUG] Command to send: c
o0
contains
sspark.app.name
e
[2022-03-22 18:11:08,160] [DEBUG] Answer received: !ybtrue
[2022-03-22 18:11:08,160] [DEBUG] Command to send: c
o0
contains
sspark.master
e
[2022-03-22 18:11:08,160] [DEBUG] Answer received: !ybtrue
[2022-03-22 18:11:08,160] [DEBUG] Command to send: c
o0
get
sspark.master
e
[2022-03-22 18:11:08,161] [DEBUG] Answer received: !yslocal[*]
[2022-03-22 18:11:08,161] [DEBUG] Command to send: c
o0
contains
sspark.app.name
e
[2022-03-22 18:11:08,162] [DEBUG] Answer received: !ybtrue
[2022-03-22 18:11:08,162] [DEBUG] Command to send: c
o0
get
sspark.app.name
e
[2022-03-22 18:11:08,162] [DEBUG] Answer received: !yspyspark-shell
[2022-03-22 18:11:08,162] [DEBUG] Command to send: c
o0
contains
sspark.home
e
[2022-03-22 18:11:08,163] [DEBUG] Answer received: !ybfalse
[2022-03-22 18:11:08,163] [DEBUG] Command to send: c
o0
getAll
e
[2022-03-22 18:11:08,163] [DEBUG] Answer received: !yto3
[2022-03-22 18:11:08,163] [DEBUG] Command to send: a
e
o3
e
[2022-03-22 18:11:08,164] [DEBUG] Answer received: !yi7
[2022-03-22 18:11:08,164] [DEBUG] Command to send: a
g
o3
i0
e
[2022-03-22 18:11:08,164] [DEBUG] Answer received: !yro4
[2022-03-22 18:11:08,164] [DEBUG] Command to send: c
o4
_1
e
[2022-03-22 18:11:08,165] [DEBUG] Answer received: !ysspark.rdd.compress
[2022-03-22 18:11:08,165] [DEBUG] Command to send: c
o4
_2
e
[2022-03-22 18:11:08,165] [DEBUG] Answer received: !ysTrue
[2022-03-22 18:11:08,166] [DEBUG] Command to send: a
e
o3
e
[2022-03-22 18:11:08,166] [DEBUG] Answer received: !yi7
[2022-03-22 18:11:08,166] [DEBUG] Command to send: a
g
o3
i1
e
[2022-03-22 18:11:08,166] [DEBUG] Answer received: !yro5
[2022-03-22 18:11:08,166] [DEBUG] Command to send: c
o5
_1
e
[2022-03-22 18:11:08,166] [DEBUG] Answer received: !ysspark.serializer.objectStreamReset
[2022-03-22 18:11:08,167] [DEBUG] Command to send: c
o5
_2
e
[2022-03-22 18:11:08,167] [DEBUG] Answer received: !ys100
[2022-03-22 18:11:08,167] [DEBUG] Command to send: a
e
o3
e
[2022-03-22 18:11:08,167] [DEBUG] Answer received: !yi7
[2022-03-22 18:11:08,167] [DEBUG] Command to send: a
g
o3
i2
e
[2022-03-22 18:11:08,167] [DEBUG] Answer received: !yro6
[2022-03-22 18:11:08,167] [DEBUG] Command to send: c
o6
_1
e
[2022-03-22 18:11:08,170] [DEBUG] Answer received: !ysspark.master
[2022-03-22 18:11:08,170] [DEBUG] Command to send: c
o6
_2
e
[2022-03-22 18:11:08,171] [DEBUG] Answer received: !yslocal[*]
[2022-03-22 18:11:08,171] [DEBUG] Command to send: a
e
o3
e
[2022-03-22 18:11:08,171] [DEBUG] Answer received: !yi7
[2022-03-22 18:11:08,171] [DEBUG] Command to send: a
g
o3
i3
e
[2022-03-22 18:11:08,171] [DEBUG] Answer received: !yro7
[2022-03-22 18:11:08,171] [DEBUG] Command to send: c
o7
_1
e
[2022-03-22 18:11:08,172] [DEBUG] Answer received: !ysspark.submit.pyFiles
[2022-03-22 18:11:08,172] [DEBUG] Command to send: c
o7
_2
e
[2022-03-22 18:11:08,172] [DEBUG] Answer received: !ys
[2022-03-22 18:11:08,172] [DEBUG] Command to send: a
e
o3
e
[2022-03-22 18:11:08,172] [DEBUG] Answer received: !yi7
[2022-03-22 18:11:08,172] [DEBUG] Command to send: a
g
o3
i4
e
[2022-03-22 18:11:08,173] [DEBUG] Answer received: !yro8
[2022-03-22 18:11:08,173] [DEBUG] Command to send: c
o8
_1
e
[2022-03-22 18:11:08,173] [DEBUG] Answer received: !ysspark.submit.deployMode
[2022-03-22 18:11:08,173] [DEBUG] Command to send: c
o8
_2
e
[2022-03-22 18:11:08,173] [DEBUG] Answer received: !ysclient
[2022-03-22 18:11:08,173] [DEBUG] Command to send: a
e
o3
e
[2022-03-22 18:11:08,173] [DEBUG] Answer received: !yi7
[2022-03-22 18:11:08,173] [DEBUG] Command to send: a
g
o3
i5
e
[2022-03-22 18:11:08,173] [DEBUG] Answer received: !yro9
[2022-03-22 18:11:08,174] [DEBUG] Command to send: c
o9
_1
e
[2022-03-22 18:11:08,174] [DEBUG] Answer received: !ysspark.ui.showConsoleProgress
[2022-03-22 18:11:08,174] [DEBUG] Command to send: c
o9
_2
e
[2022-03-22 18:11:08,174] [DEBUG] Answer received: !ystrue
[2022-03-22 18:11:08,174] [DEBUG] Command to send: a
e
o3
e
[2022-03-22 18:11:08,174] [DEBUG] Answer received: !yi7
[2022-03-22 18:11:08,174] [DEBUG] Command to send: a
g
o3
i6
e
[2022-03-22 18:11:08,174] [DEBUG] Answer received: !yro10
[2022-03-22 18:11:08,175] [DEBUG] Command to send: c
o10
_1
e
[2022-03-22 18:11:08,175] [DEBUG] Answer received: !ysspark.app.name
[2022-03-22 18:11:08,175] [DEBUG] Command to send: c
o10
_2
e
[2022-03-22 18:11:08,175] [DEBUG] Answer received: !yspyspark-shell
[2022-03-22 18:11:08,175] [DEBUG] Command to send: a
e
o3
e
[2022-03-22 18:11:08,175] [DEBUG] Answer received: !yi7
[2022-03-22 18:11:08,175] [DEBUG] Command to send: m
d
o3
e
[2022-03-22 18:11:08,175] [DEBUG] Answer received: !yv
[2022-03-22 18:11:08,175] [DEBUG] Command to send: r
u
JavaSparkContext
rj
e
[2022-03-22 18:11:08,186] [DEBUG] Answer received: !ycorg.apache.spark.api.java.JavaSparkContext
[2022-03-22 18:11:08,186] [DEBUG] Command to send: i
org.apache.spark.api.java.JavaSparkContext
ro0
e
[2022-03-22 18:11:09,483] [DEBUG] Answer received: !yro11
[2022-03-22 18:11:09,483] [DEBUG] Command to send: c
o11
sc
e
[2022-03-22 18:11:09,489] [DEBUG] Answer received: !yro12
[2022-03-22 18:11:09,490] [DEBUG] Command to send: c
o12
conf
e
[2022-03-22 18:11:09,499] [DEBUG] Answer received: !yro13
[2022-03-22 18:11:09,500] [DEBUG] Command to send: r
u
PythonAccumulatorV2
rj
e
[2022-03-22 18:11:09,501] [DEBUG] Answer received: !ycorg.apache.spark.api.python.PythonAccumulatorV2
[2022-03-22 18:11:09,502] [DEBUG] Command to send: i
org.apache.spark.api.python.PythonAccumulatorV2
s127.0.0.1
i45879
sfb324a0d50b599ec733f3b3b1bc1d7f4d1c894100f14d4ad6f4af9db025d37ea
e
[2022-03-22 18:11:09,502] [DEBUG] Answer received: !yro14
[2022-03-22 18:11:09,502] [DEBUG] Command to send: c
o11
sc
e
[2022-03-22 18:11:09,502] [DEBUG] Answer received: !yro15
[2022-03-22 18:11:09,503] [DEBUG] Command to send: c
o15
register
ro14
e
[2022-03-22 18:11:09,505] [DEBUG] Answer received: !yv
[2022-03-22 18:11:09,505] [DEBUG] Command to send: r
u
PythonUtils
rj
e
[2022-03-22 18:11:09,506] [DEBUG] Answer received: !ycorg.apache.spark.api.python.PythonUtils
[2022-03-22 18:11:09,506] [DEBUG] Command to send: r
m
org.apache.spark.api.python.PythonUtils
isEncryptionEnabled
e
[2022-03-22 18:11:09,506] [DEBUG] Answer received: !ym
[2022-03-22 18:11:09,506] [DEBUG] Command to send: c
z:org.apache.spark.api.python.PythonUtils
isEncryptionEnabled
ro11
e
[2022-03-22 18:11:09,507] [DEBUG] Answer received: !ybfalse
[2022-03-22 18:11:09,508] [DEBUG] Command to send: r
u
org
rj
e
[2022-03-22 18:11:09,509] [DEBUG] Answer received: !yp
[2022-03-22 18:11:09,510] [DEBUG] Command to send: r
u
org.apache
rj
e
[2022-03-22 18:11:09,510] [DEBUG] Answer received: !yp
[2022-03-22 18:11:09,510] [DEBUG] Command to send: r
u
org.apache.spark
rj
e
[2022-03-22 18:11:09,510] [DEBUG] Answer received: !yp
[2022-03-22 18:11:09,511] [DEBUG] Command to send: r
u
org.apache.spark.SparkFiles
rj
e
[2022-03-22 18:11:09,511] [DEBUG] Answer received: !ycorg.apache.spark.SparkFiles
[2022-03-22 18:11:09,511] [DEBUG] Command to send: r
m
org.apache.spark.SparkFiles
getRootDirectory
e
[2022-03-22 18:11:09,511] [DEBUG] Answer received: !ym
[2022-03-22 18:11:09,511] [DEBUG] Command to send: c
z:org.apache.spark.SparkFiles
getRootDirectory
e
[2022-03-22 18:11:09,512] [DEBUG] Answer received: !ys/tmp/spark-133764be-4844-4a91-a340-210c1b419fda/userFiles-58b63090-eb7f-4872-8939-2710678287d1
[2022-03-22 18:11:09,512] [DEBUG] Command to send: c
o13
get
sspark.submit.pyFiles
s
e
[2022-03-22 18:11:09,512] [DEBUG] Answer received: !ys
[2022-03-22 18:11:09,513] [DEBUG] Command to send: r
u
org
rj
e
[2022-03-22 18:11:09,514] [DEBUG] Answer received: !yp
[2022-03-22 18:11:09,514] [DEBUG] Command to send: r
u
org.apache
rj
e
[2022-03-22 18:11:09,515] [DEBUG] Answer received: !yp
[2022-03-22 18:11:09,515] [DEBUG] Command to send: r
u
org.apache.spark
rj
e
[2022-03-22 18:11:09,515] [DEBUG] Answer received: !yp
[2022-03-22 18:11:09,515] [DEBUG] Command to send: r
u
org.apache.spark.util
rj
e
[2022-03-22 18:11:09,515] [DEBUG] Answer received: !yp
[2022-03-22 18:11:09,516] [DEBUG] Command to send: r
u
org.apache.spark.util.Utils
rj
e
[2022-03-22 18:11:09,517] [DEBUG] Answer received: !ycorg.apache.spark.util.Utils
[2022-03-22 18:11:09,517] [DEBUG] Command to send: r
m
org.apache.spark.util.Utils
getLocalDir
e
[2022-03-22 18:11:09,519] [DEBUG] Answer received: !ym
[2022-03-22 18:11:09,519] [DEBUG] Command to send: c
o11
sc
e
[2022-03-22 18:11:09,519] [DEBUG] Answer received: !yro16
[2022-03-22 18:11:09,519] [DEBUG] Command to send: c
o16
conf
e
[2022-03-22 18:11:09,520] [DEBUG] Answer received: !yro17
[2022-03-22 18:11:09,520] [DEBUG] Command to send: c
z:org.apache.spark.util.Utils
getLocalDir
ro17
e
[2022-03-22 18:11:09,520] [DEBUG] Answer received: !ys/tmp/spark-133764be-4844-4a91-a340-210c1b419fda
[2022-03-22 18:11:09,520] [DEBUG] Command to send: r
u
org
rj
e
[2022-03-22 18:11:09,521] [DEBUG] Answer received: !yp
[2022-03-22 18:11:09,521] [DEBUG] Command to send: r
u
org.apache
rj
e
[2022-03-22 18:11:09,522] [DEBUG] Answer received: !yp
[2022-03-22 18:11:09,522] [DEBUG] Command to send: r
u
org.apache.spark
rj
e
[2022-03-22 18:11:09,522] [DEBUG] Answer received: !yp
[2022-03-22 18:11:09,522] [DEBUG] Command to send: r
u
org.apache.spark.util
rj
e
[2022-03-22 18:11:09,523] [DEBUG] Answer received: !yp
[2022-03-22 18:11:09,523] [DEBUG] Command to send: r
u
org.apache.spark.util.Utils
rj
e
[2022-03-22 18:11:09,523] [DEBUG] Answer received: !ycorg.apache.spark.util.Utils
[2022-03-22 18:11:09,523] [DEBUG] Command to send: r
m
org.apache.spark.util.Utils
createTempDir
e
[2022-03-22 18:11:09,523] [DEBUG] Answer received: !ym
[2022-03-22 18:11:09,524] [DEBUG] Command to send: c
z:org.apache.spark.util.Utils
createTempDir
s/tmp/spark-133764be-4844-4a91-a340-210c1b419fda
spyspark
e
[2022-03-22 18:11:09,524] [DEBUG] Answer received: !yro18
[2022-03-22 18:11:09,524] [DEBUG] Command to send: c
o18
getAbsolutePath
e
[2022-03-22 18:11:09,525] [DEBUG] Answer received: !ys/tmp/spark-133764be-4844-4a91-a340-210c1b419fda/pyspark-bc66966b-69a0-4a5b-b7ab-b0b7c8e45101
[2022-03-22 18:11:09,525] [DEBUG] Command to send: c
o13
get
sspark.python.profile
sfalse
e
[2022-03-22 18:11:09,525] [DEBUG] Answer received: !ysfalse
[2022-03-22 18:11:09,525] [DEBUG] Command to send: r
u
SparkSession
rj
e
[2022-03-22 18:11:09,544] [DEBUG] Answer received: !ycorg.apache.spark.sql.SparkSession
[2022-03-22 18:11:09,545] [DEBUG] Command to send: r
m
org.apache.spark.sql.SparkSession
getDefaultSession
e
[2022-03-22 18:11:09,567] [DEBUG] Answer received: !ym
[2022-03-22 18:11:09,567] [DEBUG] Command to send: c
z:org.apache.spark.sql.SparkSession
getDefaultSession
e
[2022-03-22 18:11:09,568] [DEBUG] Answer received: !yro19
[2022-03-22 18:11:09,568] [DEBUG] Command to send: c
o19
isDefined
e
[2022-03-22 18:11:09,569] [DEBUG] Answer received: !ybfalse
[2022-03-22 18:11:09,569] [DEBUG] Command to send: r
u
SparkSession
rj
e
[2022-03-22 18:11:09,570] [DEBUG] Answer received: !ycorg.apache.spark.sql.SparkSession
[2022-03-22 18:11:09,570] [DEBUG] Command to send: c
o11
sc
e
[2022-03-22 18:11:09,571] [DEBUG] Answer received: !yro20
[2022-03-22 18:11:09,571] [DEBUG] Command to send: i
org.apache.spark.sql.SparkSession
ro20
e
[2022-03-22 18:11:09,620] [DEBUG] Answer received: !yro21
[2022-03-22 18:11:09,620] [DEBUG] Command to send: c
o21
sqlContext
e
[2022-03-22 18:11:09,621] [DEBUG] Answer received: !yro22
[2022-03-22 18:11:09,621] [DEBUG] Command to send: r
u
SparkSession
rj
e
[2022-03-22 18:11:09,622] [DEBUG] Answer received: !ycorg.apache.spark.sql.SparkSession
[2022-03-22 18:11:09,622] [DEBUG] Command to send: r
m
org.apache.spark.sql.SparkSession
setDefaultSession
e
[2022-03-22 18:11:09,623] [DEBUG] Answer received: !ym
[2022-03-22 18:11:09,623] [DEBUG] Command to send: c
z:org.apache.spark.sql.SparkSession
setDefaultSession
ro21
e
[2022-03-22 18:11:09,623] [DEBUG] Answer received: !yv
[2022-03-22 18:11:09,623] [DEBUG] Command to send: r
u
SparkSession
rj
e
[2022-03-22 18:11:09,624] [DEBUG] Answer received: !ycorg.apache.spark.sql.SparkSession
[2022-03-22 18:11:09,624] [DEBUG] Command to send: r
m
org.apache.spark.sql.SparkSession
setActiveSession
e
[2022-03-22 18:11:09,624] [DEBUG] Answer received: !ym
[2022-03-22 18:11:09,624] [DEBUG] Command to send: c
z:org.apache.spark.sql.SparkSession
setActiveSession
ro21
e
[2022-03-22 18:11:09,625] [DEBUG] Answer received: !yv
[2022-03-22 18:11:09,625] [DEBUG] Command to send: c
o22
read
e
[2022-03-22 18:11:10,432] [DEBUG] Answer received: !yro23
[2022-03-22 18:11:10,432] [DEBUG] Command to send: r
u
PythonUtils
rj
e
[2022-03-22 18:11:10,433] [DEBUG] Answer received: !ycorg.apache.spark.api.python.PythonUtils
[2022-03-22 18:11:10,433] [DEBUG] Command to send: r
m
org.apache.spark.api.python.PythonUtils
toSeq
e
[2022-03-22 18:11:10,433] [DEBUG] Answer received: !ym
[2022-03-22 18:11:10,433] [DEBUG] Command to send: i
java.util.ArrayList
e
[2022-03-22 18:11:10,433] [DEBUG] Answer received: !ylo24
[2022-03-22 18:11:10,434] [DEBUG] Command to send: c
z:org.apache.spark.api.python.PythonUtils
toSeq
ro24
e
[2022-03-22 18:11:10,434] [DEBUG] Answer received: !yro25
[2022-03-22 18:11:10,434] [DEBUG] Command to send: m
d
o24
e
[2022-03-22 18:11:10,435] [DEBUG] Answer received: !yv
[2022-03-22 18:11:10,435] [DEBUG] Command to send: c
o23
load
ro25
e
22/03/22 18:11:10 WARN DataSource: All paths were ignored:
[Stage 0:> (0 + 1) / 1]
[2022-03-22 18:11:11,839] [DEBUG] Answer received: !xro26
[2022-03-22 18:11:11,839] [DEBUG] Command to send: c
o26
toString
e
[2022-03-22 18:11:11,840] [DEBUG] Answer received: !ysorg.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
[2022-03-22 18:11:11,840] [DEBUG] Command to send: c
o26
getCause
e
[2022-03-22 18:11:11,840] [DEBUG] Answer received: !yn
[2022-03-22 18:11:11,840] [DEBUG] Command to send: r
u
org
rj
e
[2022-03-22 18:11:11,842] [DEBUG] Answer received: !yp
[2022-03-22 18:11:11,842] [DEBUG] Command to send: r
u
org.apache
rj
e
[2022-03-22 18:11:11,844] [DEBUG] Answer received: !yp
[2022-03-22 18:11:11,844] [DEBUG] Command to send: r
u
org.apache.spark
rj
e
[2022-03-22 18:11:11,848] [DEBUG] Answer received: !yp
[2022-03-22 18:11:11,848] [DEBUG] Command to send: r
u
org.apache.spark.util
rj
e
[2022-03-22 18:11:11,849] [DEBUG] Answer received: !yp
[2022-03-22 18:11:11,849] [DEBUG] Command to send: r
u
org.apache.spark.util.Utils
rj
e
[2022-03-22 18:11:11,849] [DEBUG] Answer received: !ycorg.apache.spark.util.Utils
[2022-03-22 18:11:11,849] [DEBUG] Command to send: r
m
org.apache.spark.util.Utils
exceptionString
e
[2022-03-22 18:11:11,849] [DEBUG] Answer received: !ym
[2022-03-22 18:11:11,849] [DEBUG] Command to send: c
z:org.apache.spark.util.Utils
exceptionString
ro26
e
[2022-03-22 18:11:11,850] [DEBUG] Answer received: !ysorg.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;\n at org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:200)\n at scala.Option.getOrElse(Option.scala:189)\n at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:200)\n at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408)\n at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)\n at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)\n at scala.Option.getOrElse(Option.scala:189)\n at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)\n at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n at java.lang.reflect.Method.invoke(Method.java:498)\n at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n at py4j.Gateway.invoke(Gateway.java:282)\n at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n at py4j.commands.CallCommand.execute(CallCommand.java:79)\n at py4j.GatewayConnection.run(GatewayConnection.java:238)\n at java.lang.Thread.run(Thread.java:748)\n
<osci.datalake.local.landing.LocalLandingArea object at 0x7fa5e8753f40> /data landing
<osci.datalake.local.staging.LocalStagingArea object at 0x7fa5e87609a0> /data staging
<osci.datalake.local.public.LocalPublicArea object at 0x7fa5e8760940> /data public
<osci.datalake.local.web.LocalWebArea object at 0x7fa5e8760a90> /web data
[2022-03-22 18:11:11,852] [DEBUG] Command to send: m
d
o0
e
[2022-03-22 18:11:11,853] [DEBUG] Answer received: !yv
[2022-03-22 18:11:11,853] [DEBUG] Command to send: m
d
o4
e
[2022-03-22 18:11:11,853] [DEBUG] Answer received: !yv
[2022-03-22 18:11:11,853] [DEBUG] Command to send: m
d
o5
e
[2022-03-22 18:11:11,853] [DEBUG] Answer received: !yv
[2022-03-22 18:11:11,853] [DEBUG] Command to send: m
d
o6
e
[2022-03-22 18:11:11,853] [DEBUG] Answer received: !yv
[2022-03-22 18:11:11,853] [DEBUG] Command to send: m
d
o7
e
[2022-03-22 18:11:11,853] [DEBUG] Answer received: !yv
[2022-03-22 18:11:11,853] [DEBUG] Command to send: m
d
o8
e
[2022-03-22 18:11:11,854] [DEBUG] Answer received: !yv
[2022-03-22 18:11:11,854] [DEBUG] Command to send: m
d
o9
e
[2022-03-22 18:11:11,854] [DEBUG] Answer received: !yv
[2022-03-22 18:11:11,854] [DEBUG] Command to send: m
d
o10
e
[2022-03-22 18:11:11,854] [DEBUG] Answer received: !yv
[2022-03-22 18:11:11,854] [DEBUG] Command to send: m
d
o12
e
[2022-03-22 18:11:11,854] [DEBUG] Answer received: !yv
[2022-03-22 18:11:11,854] [DEBUG] Command to send: m
d
o15
e
[2022-03-22 18:11:11,854] [DEBUG] Answer received: !yv
[2022-03-22 18:11:11,854] [DEBUG] Command to send: m
d
o16
e
[2022-03-22 18:11:11,854] [DEBUG] Answer received: !yv
[2022-03-22 18:11:11,855] [DEBUG] Command to send: m
d
o17
e
[2022-03-22 18:11:11,855] [DEBUG] Answer received: !yv
[2022-03-22 18:11:11,855] [DEBUG] Command to send: m
d
o18
e
[2022-03-22 18:11:11,855] [DEBUG] Answer received: !yv
[2022-03-22 18:11:11,855] [DEBUG] Command to send: m
d
o19
e
[2022-03-22 18:11:11,855] [DEBUG] Answer received: !yv
[2022-03-22 18:11:11,855] [DEBUG] Command to send: m
d
o20
e
[2022-03-22 18:11:11,855] [DEBUG] Answer received: !yv
Traceback (most recent call last):
File "osci-cli.py", line 93, in <module>
cli(standalone_mode=False)
File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/home/ubuntu/OSCI/osci/actions/base.py", line 59, in execute
return self._execute(**self._process_params(kwargs))
File "/home/ubuntu/OSCI/osci/actions/process/generate_daily_osci_rankings.py", line 49, in _execute
commits = osci_ranking_job.extract(to_date=to_day).cache()
File "/home/ubuntu/OSCI/osci/jobs/base.py", line 44, in extract
commits=Session().load_dataframe(paths=self._get_dataset_paths(to_date, from_date))
File "/home/ubuntu/OSCI/osci/jobs/session.py", line 39, in load_dataframe
return self.spark_session.read.load(paths, **options)
File "/home/ubuntu/.local/lib/python3.8/site-packages/pyspark/sql/readwriter.py", line 182, in load
return self._df(self._jreader.load(self._spark._sc._jvm.PythonUtils.toSeq(path)))
File "/home/ubuntu/.local/lib/python3.8/site-packages/py4j/java_gateway.py", line 1304, in __call__
return_value = get_return_value(
File "/home/ubuntu/.local/lib/python3.8/site-packages/pyspark/sql/utils.py", line 134, in deco
raise_from(converted)
File "<string>", line 3, in raise_from
pyspark.sql.utils.AnalysisException[2022-03-22 18:11:11,879] [DEBUG] Command to send: r
u
org
rj
e
[2022-03-22 18:11:11,881] [DEBUG] Answer received: !yp
[2022-03-22 18:11:11,881] [DEBUG] Command to send: r
u
org.apache
rj
e
[2022-03-22 18:11:11,882] [DEBUG] Answer received: !yp
[2022-03-22 18:11:11,882] [DEBUG] Command to send: r
u
org.apache.spark
rj
e
[2022-03-22 18:11:11,882] [DEBUG] Answer received: !yp
[2022-03-22 18:11:11,882] [DEBUG] Command to send: r
u
org.apache.spark.sql
rj
e
[2022-03-22 18:11:11,882] [DEBUG] Answer received: !yp
[2022-03-22 18:11:11,882] [DEBUG] Command to send: r
u
org.apache.spark.sql.internal
rj
e
[2022-03-22 18:11:11,883] [DEBUG] Answer received: !yp
[2022-03-22 18:11:11,883] [DEBUG] Command to send: r
u
org.apache.spark.sql.internal.SQLConf
rj
e
[2022-03-22 18:11:11,883] [DEBUG] Answer received: !ycorg.apache.spark.sql.internal.SQLConf
[2022-03-22 18:11:11,883] [DEBUG] Command to send: r
m
org.apache.spark.sql.internal.SQLConf
get
e
[2022-03-22 18:11:11,885] [DEBUG] Answer received: !ym
[2022-03-22 18:11:11,885] [DEBUG] Command to send: c
z:org.apache.spark.sql.internal.SQLConf
get
e
[2022-03-22 18:11:11,885] [DEBUG] Answer received: !yro27
[2022-03-22 18:11:11,885] [DEBUG] Command to send: c
o27
pysparkJVMStacktraceEnabled
e
[2022-03-22 18:11:11,886] [DEBUG] Answer received: !ybfalse
: Unable to infer schema for Parquet. It must be specified manually.;
[2022-03-22 18:11:11,924] [DEBUG] Command to send: m
d
o27
e
[2022-03-22 18:11:11,927] [DEBUG] Answer received: !yv
[2022-03-22 18:11:11,965] [DEBUG] Command to send: m
d
o26
e
[2022-03-22 18:11:11,966] [DEBUG] Answer received: !yv
[2022-03-22 18:11:11,966] [DEBUG] Command to send: m
d
o25
e
[2022-03-22 18:11:11,966] [DEBUG] Answer received: !yv
[2022-03-22 18:11:11,966] [DEBUG] Command to send: m
d
o23
e
[2022-03-22 18:11:11,966] [DEBUG] Answer received: !yv
@cm-howard any thoughts on this? Alternatively, I'd appreciate anything you could do to point me in the right direction.
@theycallmeswift are there any files in the '/data' dir?
@vlad-isayko yep!
python3 osci-cli.py get-github-daily-push-events -d YYYY-MM-DD
produces YYYY-MM-DD-[0-23].parquet files in /data/landing/github/events/push/YYYY/MM/DD/
and
python3 osci-cli.py process-github-daily-push-events -d YYYY-MM-DD
produces COMPANY-YYYY-MM-DD.parquet files in /data/staging/github/raw-events/push/YYYY/MM/DD/
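In case it's useful, this is the kind of spot check I'd run to confirm the files actually parse (a minimal sketch using the pandas/pyarrow versions pinned in requirements.txt; the sample path is just an example):
import pandas as pd
from pathlib import Path

# Illustrative path: read one staged file and show its shape and column dtypes
sample = next(Path('/data/staging/github/raw-events/push/2020/01/01').glob('*.parquet'))
df = pd.read_parquet(sample)
print(sample, df.shape)
print(df.dtypes)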
@theycallmeswift I have a similar error on Ubuntu 20.04. Did you manage to fix the error locally?
@jerpelea I did not, unfortunately. The docs need a serious overhaul from someone who knows the system better than I do!
@theycallmeswift @jerpelea Hello, the problem really is outdated and incomplete documentation. We will fix this in the coming days. I'll keep you posted.
@vlad-isayko can you share a quick update here before the documentation is updated?
At the moment, this is the current way to start:
1. python3 osci-cli.py get-github-daily-push-events -d 2020-01-01
2. python3 osci-cli.py process-github-daily-push-events -d 2020-01-01
3. python3 osci-cli.py daily-active-repositories -d 2020-01-01
4. python3 osci-cli.py load-repositories -d 2020-01-01
5. python3 osci-cli.py filter-unlicensed -d 2020-01-01
6. python3 osci-cli.py daily-osci-rankings -td 2020-01-01
7. python3 osci-cli.py get-change-report -d 2020-01-01
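For a single date, the whole sequence can be chained; here is a minimal shell sketch (assuming osci-cli.py is in the current directory, as in the README) that stops on the first failing step:
#!/usr/bin/env bash
set -e  # abort on the first failing step
D=2020-01-01
python3 osci-cli.py get-github-daily-push-events -d "$D"
python3 osci-cli.py process-github-daily-push-events -d "$D"
python3 osci-cli.py daily-active-repositories -d "$D"
python3 osci-cli.py load-repositories -d "$D"
python3 osci-cli.py filter-unlicensed -d "$D"
python3 osci-cli.py daily-osci-rankings -td "$D"
python3 osci-cli.py get-change-report -d "$D"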
You can write to me if you have any problems
@vlad-isayko
Thanks for your quick answer
Everything behaved normally until step 6, python3 osci-cli.py daily-osci-rankings -td 2020-01-01.
Attached is the log: log.log
I am running Ubuntu 20.04 with Python 3.8.
@jerpelea can you also share which versions of pyspark and Spark you have?
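(If it helps, pip3 freeze | grep -iE 'pyspark|py4j' prints the installed package versions, and a standalone Spark install reports itself with spark-submit --version.)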
@vlad-isayko
Packages in .local/lib/python3.8/site-packages, installed by pip install -r requirements.txt:
aiohttp-3.8.1.dist-info aiosignal-1.2.0.dist-info async_timeout-4.0.2.dist-info attrs-21.4.0.dist-info azure_common-1.1.25.dist-info azure_core-1.7.0.dist-info azure_functions-1.3.0.dist-info azure_functions_durable-1.1.3.dist-info azure_nspkg-3.0.2.dist-info azure_storage_blob-12.3.2.dist-info azure_storage_common-2.1.0.dist-info azure_storage_nspkg-3.1.0.dist-info cachetools-4.2.4.dist-info charset_normalizer-2.0.12.dist-info click-7.1.2.dist-info deepmerge-0.1.1.dist-info frozenlist-1.3.0.dist-info furl-2.1.3.dist-info google_api_core-1.31.5.dist-info googleapis_common_protos-1.56.1.dist-info google_auth-1.35.0.dist-info google_cloud_bigquery-1.25.0.dist-info google_cloud_core-1.7.2.dist-info google_resumable_media-0.5.1.dist-info iniconfig-1.1.1.dist-info isodate-0.6.1.dist-info Jinja2-2.11.3.dist-info MarkupSafe-2.0.1.dist-info more_itertools-8.13.0.dist-info msrest-0.6.21.dist-info multidict-6.0.2.dist-info numpy-1.19.5.dist-info orderedmultidict-1.0.1.dist-info packaging-21.3.dist-info pandas-1.0.3.dist-info pbr-5.9.0.dist-info pip-22.1.2.dist-info pluggy-0.13.1.dist-info protobuf-4.21.1.dist-info py-1.11.0.dist-info py4j-0.10.9.dist-info pyarrow-0.17.1.dist-info pyasn1-0.4.8.dist-info pyasn1_modules-0.2.8.dist-info pypandoc-1.5.dist-info pyparsing-3.0.9.dist-info pyspark-3.0.1.dist-info pytest-6.0.1.dist-info python_dateutil-2.8.1.dist-info PyYAML-5.4.dist-info requests_oauthlib-1.3.1.dist-info rsa-4.8.dist-info six-1.13.0.dist-info testresources-2.0.1.dist-info toml-0.10.2.dist-info XlsxWriter-1.2.3.dist-inf
@jerpelea maybe there is a problem with a parquet file. We need to check it.
@vlad-isayko what version are you using? Do you have any suggestions on how to check it?
@jerpelea we use the same libraries with the same versions. Can you share some of the files that were generated in the staging area?
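One way to check on your side is to print the parquet footer schema directly (a sketch using the pyarrow version pinned in requirements.txt; the file name below is only an example):
import pyarrow.parquet as pq

# Hypothetical example file: print the schema Spark would have to merge across files
print(pq.ParquetFile('/data/staging/github/events/push/2021/01/01/epam-2021-01-01.parquet').schema)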
@vlad-isayko thanks for your quick answer. Here is the file: repository-2021-01-01.zip
@jerpelea
Are there any files in /staging/github/events/push/2021/01/01/?
Before step 6 there should be files in these directories:
- /staging/github/raw-events/push/2021/01/01/
- /staging/github/repository/2021/01/
- /staging/github/events/push/2021/01/01/
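(A plain find /data/staging/github -type f | sort shows at a glance which of those directories are populated.)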
@vlad-isayko I have:
- /landing/github/events/push/2021/01/01/
- /staging/github/raw-events/push/2021/01/01/
- /staging/github/repository/2021/01/
There is no /staging/github/events/push/2021/01/01/.
Thanks
@jerpelea
Can you rerun step 5, python3 osci-cli.py filter-unlicensed -d 2020-01-01,
and share the logs from that command?
I think there is some problem at this step.
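(Something like python3 osci-cli.py filter-unlicensed -d 2020-01-01 2>&1 | tee filter-unlicensed.log will capture everything to a file.)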
@vlad-isayko attached are the log file and some result files
filter-unlicensed.zip github.zip
thanks
@jerpelea
Ok, it's strange that the repository file in staging is empty...
Does this file exist: /landing/github/repository/2021/01/2021-01-01.csv?
Can you share it?
2021-01-01.zip @vlad-isayko
@jerpelea
So the error occurred at step 4, when getting information about the repositories from the GitHub API.
I ran this step myself with your source file and will check the output.
Could you check your config for a valid GitHub API token?
github:
token: '394***************************************77'
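A quick way to confirm a token is valid is the standard GitHub rate-limit endpoint (nothing OSCI-specific):
curl -H 'Authorization: token YOUR_TOKEN' https://api.github.com/rate_limit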
@vlad-isayko thanks for pointing it out. I think the token setup is a missing step in the README. I added the token in local.yml and restarted step 4.
this is how the logs look now:
[2022-06-13 09:42:38,265] [INFO] Get repository MinCiencia/Datos-COVID19 information
[2022-06-13 09:42:38,265] [DEBUG] Make request to Github API method=GET, url=https://api.github.com/repos/MinCiencia/Datos-COVID19, kwargs={}
[2022-06-13 09:42:38,485] [DEBUG] https://api.github.com:443 "GET /repos/MinCiencia/Datos-COVID19 HTTP/1.1" 200 None
[2022-06-13 09:42:38,486] [DEBUG] Get response[200] from Github API method=GET, url=https://api.github.com/repos/MinCiencia/Datos-COVID19, kwargs={'headers': {'Authorization': 'token gxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxdo'}}
I will keep you updated on the progress. Thanks for the support!
@vlad-isayko new errors at step 6: daily-osci-rankings.zip
@jerpelea
Can you share these files:
- /data/staging/github/events/push/2020/01/01/unity_technologies-2020-01-01.parquet
- /data/staging/github/events/push/2020/01/01/secops_solutions-2020-01-01.parquet
- /data/staging/github/events/push/2020/01/01/luxoft-2020-01-01.parquet
- /data/staging/github/events/push/2020/01/01/lyft-2020-01-01.parquet
- /data/staging/github/events/push/2020/01/01/cloudbees-2020-01-01.parquet
@jerpelea
Ok, there is a bug in saving a pandas dataframe in parquet format: a column in which all values are None is converted to Int32 when stored.
This case is quite rare, which is apparently why we did not catch the bug earlier.
We plan to fix this bug.
For now, you can resave these files with the correct conversion.
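For illustration, the drift shows up in a tiny round trip (a sketch; parquet has no dedicated null type, so an all-None column ends up stored with an INT32 physical type):
import pandas as pd
import pyarrow.parquet as pq

# An all-None 'language' column, as can happen on a day where no repository reports one
df = pd.DataFrame({'language': [None, None], 'commits': [1, 2]})
df.to_parquet('/tmp/all_none.parquet', index=False)

# The footer shows 'language' stored as INT32 (all null) instead of a string column,
# so Spark sees conflicting schemas across the per-company files and refuses to merge them
print(pq.ParquetFile('/tmp/all_none.parquet').schema)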
@vlad-isayko how do I resave them?
@jerpelea
You can run this simple script, or share the files from /data/staging/github/events/push/ so I can do it for you:
import pandas as pd
from pathlib import Path

# Rewrite every staged push-events file, forcing the affected columns back to str
for path in Path('/data/staging/github/events/push/').rglob('*.parquet'):
    pd.read_parquet(path).astype({'language': str, 'org_name': str}).to_parquet(path, index=False)
@vlad-isayko thanks for the fix!
It fixed the issue and step 6 completed.