scancode.io
Output Structure for Licenseclassifier
I'm developing a Python-based package of Google Licenseclassifier for this year's edition of GSoC. Right now, I can output the scan results as a JSON file, and I would like to integrate them with ScanCode.io as a new pipeline. Is there a particular structure that I need to follow?
My current JSON structure is as follows:
@AvishrantsSh yes, you want to use the CodebaseResource model structure for the files.
Based on the data above, you want to feed the data into the following fields: path, copyrights, and licenses.
See https://github.com/nexB/scancode.io/blob/main/scanpipe/pipes/__init__.py#L64 for an example of a pipe that creates a CodebaseResource.
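As a rough illustration of that mapping, here is a minimal sketch that converts one scan entry into values for the three fields named above. The input keys (`file_path`, `license_matches`, `copyright_notices`), the helper name, and the shape of each item in `licenses` and `copyrights` are all hypothetical, since the actual GLC JSON structure is not shown in this thread.

```python
def glc_entry_to_resource_fields(entry):
    """Map one (hypothetical) GLC result entry to CodebaseResource-style fields."""
    return {
        # Path of the scanned file, relative to the codebase root.
        "path": entry["file_path"],
        # One dict per detected license; the "key"/"score" names are assumptions.
        "licenses": [
            {"key": match["name"], "score": match.get("confidence")}
            for match in entry.get("license_matches", [])
        ],
        # One dict per detected copyright statement; "value" is an assumption.
        "copyrights": [
            {"value": notice}
            for notice in entry.get("copyright_notices", [])
        ],
    }


example = {
    "file_path": "src/a.py",
    "license_matches": [{"name": "mit", "confidence": 0.98}],
    "copyright_notices": ["Copyright (c) 2021 Example"],
}
fields = glc_entry_to_resource_fields(example)
```

The resulting dict could then be passed as keyword arguments to whatever creates the resource, once the real key names are adjusted to match your output.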
I was able to make a new pipeline, with minimum functionality, but ran into some problems.
2021-06-17 13:07:39.87 Pipeline [glc_scan] starting
2021-06-17 13:07:39.87 Step [copy_inputs_to_codebase_directory] starting
2021-06-17 13:07:39.90 Step [copy_inputs_to_codebase_directory] completed in 0.03 seconds
2021-06-17 13:07:39.91 Step [run_extractcode] starting
2021-06-17 13:07:40.60 Step [run_extractcode] completed in 0.69 seconds
2021-06-17 13:07:40.60 Step [run_glc] starting
2021-06-17 13:07:52.91 Step [run_glc] completed in 12.30 seconds
2021-06-17 13:07:52.91 Step [build_inventory_from_scan] starting
2021-06-17 13:07:53.52 Pipeline failed
Here is the error log:
An error occurred in the current transaction. You can't execute queries until the end of the 'atomic' block.
Traceback:
  File "/home/avishrant/GitRepo/scancode.io/scanpipe/pipelines/__init__.py", line 96, in execute
    step(self)
  File "/home/avishrant/GitRepo/scancode.io/scanpipe/pipelines/glc_scan.py", line 61, in build_inventory_from_scan
    scancode.create_codebase_resources(project, scanned_codebase)
  File "/home/avishrant/GitRepo/scancode.io/scanpipe/pipes/scancode.py", line 360, in create_codebase_resources
    CodebaseResource.objects.get_or_create(
  File "/home/avishrant/GitRepo/scancode.io/lib/python3.8/site-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/home/avishrant/GitRepo/scancode.io/lib/python3.8/site-packages/django/db/models/query.py", line 588, in get_or_create
    return self.create(**params), True
  File "/home/avishrant/GitRepo/scancode.io/lib/python3.8/site-packages/django/db/models/query.py", line 453, in create
    obj.save(force_insert=True, using=self.db)
  File "/home/avishrant/GitRepo/scancode.io/scanpipe/models.py", line 1029, in save
    super().save(*args, **kwargs)
  File "/home/avishrant/GitRepo/scancode.io/scanpipe/models.py", line 599, in save
    self.add_error(error)
  File "/home/avishrant/GitRepo/scancode.io/scanpipe/models.py", line 629, in add_error
    return self.project.add_error(
  File "/home/avishrant/GitRepo/scancode.io/scanpipe/models.py", line 505, in add_error
    return ProjectError.objects.create(
  File "/home/avishrant/GitRepo/scancode.io/lib/python3.8/site-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/home/avishrant/GitRepo/scancode.io/lib/python3.8/site-packages/django/db/models/query.py", line 453, in create
    obj.save(force_insert=True, using=self.db)
  File "/home/avishrant/GitRepo/scancode.io/lib/python3.8/site-packages/django/db/models/base.py", line 726, in save
    self.save_base(using=using, force_insert=force_insert,
  File "/home/avishrant/GitRepo/scancode.io/lib/python3.8/site-packages/django/db/models/base.py", line 763, in save_base
    updated = self._save_table(
  File "/home/avishrant/GitRepo/scancode.io/lib/python3.8/site-packages/django/db/models/base.py", line 868, in _save_table
    results = self._do_insert(cls._base_manager, using, fields, returning_fields, raw)
  File "/home/avishrant/GitRepo/scancode.io/lib/python3.8/site-packages/django/db/models/base.py", line 906, in _do_insert
    return manager._insert(
  File "/home/avishrant/GitRepo/scancode.io/lib/python3.8/site-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/home/avishrant/GitRepo/scancode.io/lib/python3.8/site-packages/django/db/models/query.py", line 1270, in _insert
    return query.get_compiler(using=using).execute_sql(returning_fields)
  File "/home/avishrant/GitRepo/scancode.io/lib/python3.8/site-packages/django/db/models/sql/compiler.py", line 1416, in execute_sql
    cursor.execute(sql, params)
  File "/home/avishrant/GitRepo/scancode.io/lib/python3.8/site-packages/django/db/backends/utils.py", line 98, in execute
    return super().execute(sql, params)
  File "/home/avishrant/GitRepo/scancode.io/lib/python3.8/site-packages/django/db/backends/utils.py", line 66, in execute
    return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
  File "/home/avishrant/GitRepo/scancode.io/lib/python3.8/site-packages/django/db/backends/utils.py", line 75, in _execute_with_wrappers
    return executor(sql, params, many, context)
  File "/home/avishrant/GitRepo/scancode.io/lib/python3.8/site-packages/django/db/backends/utils.py", line 78, in _execute
    self.db.validate_no_broken_transaction()
  File "/home/avishrant/GitRepo/scancode.io/lib/python3.8/site-packages/django/db/backends/base/base.py", line 447, in validate_no_broken_transaction
    raise TransactionManagementError(
Also, the new pipeline seems to work on smaller codebases. The following image shows the result of scanning a codebase of 2 files with the new pipeline.
Nitpicking: for text, it is usually better to copy and paste the text rather than take a screenshot of it :) Images cannot be searched or copied :)
@AvishrantsSh what is your ScanCode.io setup? Running with Docker or with make run?
Also, can you provide a link to your code so we can have a look?
I'm using make run
You can try to deactivate multiprocessing by setting SCANCODEIO_PROCESSES=-1 in your local .env file, and then re-run the pipeline.
You also want to provide your code in a branch with instructions so we can run it on our side and help with debugging.
Yes, I'm working on a separate branch for this new pipeline. See AvishrantsSh/scancode.io/tree/glc. Nothing special is required right now, I guess. I have added the prototype Python package as a requirement so that you can use make dev.
@AvishrantsSh any success with disabling multiprocessing?
Sadly, no. But I don't think it's a problem with multiprocessing, because the rest of the pipelines are working smoothly, yet the load_inventory pipeline is still unable to load the JSON generated by my package. I am trying to use VirtualCodebase (as in the scancode pipeline) before storing the results in the database. I think the problem lies somewhere around there.
Do I need to manually feed all the data into the CodebaseResource model? Or can I use the existing architecture to do the same?
> But I don't think it's a problem with multiprocessing, because the rest of the pipelines are working smoothly, yet the load_inventory pipeline is still unable to load the JSON generated by my package.
load_inventory expects an output generated by scancode-toolkit as its input. Trying to feed it the output of another tool is likely to fail.
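A quick way to tell the two kinds of output apart before running the pipeline is to look at the top-level keys. The sketch below relies on ScanCode Toolkit JSON having top-level `headers` and `files` lists; it is only a heuristic, not the check that load_inventory itself performs, and the helper name is hypothetical.

```python
import json


def looks_like_scancode_output(data):
    """Heuristic: ScanCode Toolkit JSON has top-level 'headers' and 'files' lists."""
    return (
        isinstance(data, dict)
        and isinstance(data.get("headers"), list)
        and isinstance(data.get("files"), list)
    )


# A toolkit-style document passes; an arbitrary list of results does not.
toolkit_like = json.loads('{"headers": [], "files": [{"path": "a.py"}]}')
other_tool = json.loads('[{"path": "a.py", "licenses": []}]')
```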
> Do I need to manually feed all the data into the CodebaseResource model? Or can I use the existing architecture to do the same?
I would suggest writing a pipe function in your pipes.glc module that parses the output from the previous steps and creates the CodebaseResource objects in the database. The purpose of that function is to map the data from GLC to ScanCode.io; you can call pipes.make_codebase_resource for the actual database insertion.
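A minimal sketch of such a pipe function could look like the following. The function name, the entry keys, and the `insert_resource(project, **fields)` callable are assumptions made for illustration; in a real pipeline the callable would wrap the actual insertion (for example pipes.make_codebase_resource, whose signature is not shown in this thread).

```python
def create_resources_from_glc(project, glc_entries, insert_resource):
    """
    Create one resource per GLC entry and collect failures instead of stopping.

    `insert_resource(project, **fields)` stands in for the real database
    insertion; its signature is an assumption of this sketch.
    """
    created, errors = [], []
    for entry in glc_entries:
        # Map the GLC entry onto the three target fields.
        fields = {
            "path": entry["path"],
            "licenses": entry.get("licenses", []),
            "copyrights": entry.get("copyrights", []),
        }
        try:
            created.append(insert_resource(project, **fields))
        except Exception as error:
            # Record the failing path so the offending entry is easy to find.
            errors.append((fields["path"], str(error)))
    return created, errors


def fake_insert(project, **fields):
    """Stand-in for the database call, failing on one path for demonstration."""
    if fields["path"] == "bad":
        raise ValueError("boom")
    return fields["path"]


created, errors = create_resources_from_glc(
    "proj", [{"path": "a"}, {"path": "bad"}, {"path": "b"}], fake_insert
)
```

Passing the insertion callable in keeps the mapping testable without a database. With Django, each real insertion would probably need its own `transaction.atomic()` block so that one failed query does not leave the surrounding transaction unusable.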
Yes!! This worked for me. But there seems to be a problem that I can't get hold of. Sometimes insertion into the database suddenly fails at get_or_create. In my case, it occurred when a very large copyright expression was encountered.
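One way to narrow this down is to flag unusually large values before they reach the database. The sketch below is only a debugging aid under stated assumptions: the entry keys and the helper name are hypothetical, and the `max_chars` threshold is arbitrary rather than tied to any actual column limit in ScanCode.io.

```python
import json


def find_oversized_entries(entries, max_chars=1000):
    """Return (path, field, size) for each field whose JSON size exceeds max_chars."""
    oversized = []
    for entry in entries:
        for field in ("path", "licenses", "copyrights"):
            # Measure the serialized size, since list fields may hold long strings.
            size = len(json.dumps(entry.get(field, "")))
            if size > max_chars:
                oversized.append((entry.get("path"), field, size))
    return oversized


entries = [
    {"path": "a", "copyrights": ["x" * 2000]},
    {"path": "b", "copyrights": ["ok"]},
]
report = find_oversized_entries(entries)
```

Running this over the scan output before insertion would show whether the failures line up with the large copyright values.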
Pasting the error output here would help to debug.
The error log is the same as the one submitted earlier (see the traceback above).
@AvishrantsSh Are you still having this issue?