Output Structure for Licenseclassifier

Open AvishrantsSh opened this issue 3 years ago • 17 comments

I'm developing a Python-based package of Google Licenseclassifier for this year's edition of GSoC. Right now, I'm able to output the scan results as a JSON file, and I would like to integrate them with ScanCode.io as a new pipeline. Is there any particular structure that I need to follow?

My current JSON structure is as follows:

[Image: screenshot of the current JSON structure]

AvishrantsSh avatar Jun 16 '21 12:06 AvishrantsSh

@AvishrantsSh yes, you want to use the CodebaseResource model structure for the files. Based on the data above, you want to feed the data in the following fields: path, copyrights, and licenses.
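
A minimal sketch of that mapping is shown here. The actual JSON keys are not reproduced in this thread, so the entry keys `file`, `copyrights`, and `licenses` are hypothetical, and the inner shape of the values (lists of small dicts) is an assumption rather than the confirmed CodebaseResource schema:

```python
def to_resource_data(entry):
    """Map one hypothetical GLC result entry to the three target fields."""
    return {
        "path": entry["file"],
        # Default to empty lists so a file without findings still maps cleanly.
        "copyrights": [{"value": text} for text in entry.get("copyrights", [])],
        "licenses": [{"key": key} for key in entry.get("licenses", [])],
    }


example = {
    "file": "src/main.c",
    "copyrights": ["Copyright (c) 2021 Example"],
    "licenses": ["MIT"],
}
print(to_resource_data(example))
```

Keeping the mapping in one small function makes it easy to adjust once the real field shapes are known.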

tdruez avatar Jun 16 '21 12:06 tdruez

See https://github.com/nexB/scancode.io/blob/main/scanpipe/pipes/__init__.py#L64 as a pipe example for creating CodebaseResource.

tdruez avatar Jun 16 '21 13:06 tdruez

I was able to make a new pipeline, with minimum functionality, but ran into some problems.

2021-06-17 13:07:39.87 Pipeline [glc_scan] starting
2021-06-17 13:07:39.87 Step [copy_inputs_to_codebase_directory] starting
2021-06-17 13:07:39.90 Step [copy_inputs_to_codebase_directory] completed in 0.03 seconds
2021-06-17 13:07:39.91 Step [run_extractcode] starting
2021-06-17 13:07:40.60 Step [run_extractcode] completed in 0.69 seconds
2021-06-17 13:07:40.60 Step [run_glc] starting
2021-06-17 13:07:52.91 Step [run_glc] completed in 12.30 seconds
2021-06-17 13:07:52.91 Step [build_inventory_from_scan] starting
2021-06-17 13:07:53.52 Pipeline failed

Here is the error log:

An error occurred in the current transaction. You can't execute queries until the end of the 'atomic' block.

Traceback:
  File "/home/avishrant/GitRepo/scancode.io/scanpipe/pipelines/__init__.py", line 96, in execute
    step(self)
  File "/home/avishrant/GitRepo/scancode.io/scanpipe/pipelines/glc_scan.py", line 61, in build_inventory_from_scan
    scancode.create_codebase_resources(project, scanned_codebase)
  File "/home/avishrant/GitRepo/scancode.io/scanpipe/pipes/scancode.py", line 360, in create_codebase_resources
    CodebaseResource.objects.get_or_create(
  File "/home/avishrant/GitRepo/scancode.io/lib/python3.8/site-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/home/avishrant/GitRepo/scancode.io/lib/python3.8/site-packages/django/db/models/query.py", line 588, in get_or_create
    return self.create(**params), True
  File "/home/avishrant/GitRepo/scancode.io/lib/python3.8/site-packages/django/db/models/query.py", line 453, in create
    obj.save(force_insert=True, using=self.db)
  File "/home/avishrant/GitRepo/scancode.io/scanpipe/models.py", line 1029, in save
    super().save(*args, **kwargs)
  File "/home/avishrant/GitRepo/scancode.io/scanpipe/models.py", line 599, in save
    self.add_error(error)
  File "/home/avishrant/GitRepo/scancode.io/scanpipe/models.py", line 629, in add_error
    return self.project.add_error(
  File "/home/avishrant/GitRepo/scancode.io/scanpipe/models.py", line 505, in add_error
    return ProjectError.objects.create(
  File "/home/avishrant/GitRepo/scancode.io/lib/python3.8/site-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/home/avishrant/GitRepo/scancode.io/lib/python3.8/site-packages/django/db/models/query.py", line 453, in create
    obj.save(force_insert=True, using=self.db)
  File "/home/avishrant/GitRepo/scancode.io/lib/python3.8/site-packages/django/db/models/base.py", line 726, in save
    self.save_base(using=using, force_insert=force_insert,
  File "/home/avishrant/GitRepo/scancode.io/lib/python3.8/site-packages/django/db/models/base.py", line 763, in save_base
    updated = self._save_table(
  File "/home/avishrant/GitRepo/scancode.io/lib/python3.8/site-packages/django/db/models/base.py", line 868, in _save_table
    results = self._do_insert(cls._base_manager, using, fields, returning_fields, raw)
  File "/home/avishrant/GitRepo/scancode.io/lib/python3.8/site-packages/django/db/models/base.py", line 906, in _do_insert
    return manager._insert(
  File "/home/avishrant/GitRepo/scancode.io/lib/python3.8/site-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/home/avishrant/GitRepo/scancode.io/lib/python3.8/site-packages/django/db/models/query.py", line 1270, in _insert
    return query.get_compiler(using=using).execute_sql(returning_fields)
  File "/home/avishrant/GitRepo/scancode.io/lib/python3.8/site-packages/django/db/models/sql/compiler.py", line 1416, in execute_sql
    cursor.execute(sql, params)
  File "/home/avishrant/GitRepo/scancode.io/lib/python3.8/site-packages/django/db/backends/utils.py", line 98, in execute
    return super().execute(sql, params)
  File "/home/avishrant/GitRepo/scancode.io/lib/python3.8/site-packages/django/db/backends/utils.py", line 66, in execute
    return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
  File "/home/avishrant/GitRepo/scancode.io/lib/python3.8/site-packages/django/db/backends/utils.py", line 75, in _execute_with_wrappers
    return executor(sql, params, many, context)
  File "/home/avishrant/GitRepo/scancode.io/lib/python3.8/site-packages/django/db/backends/utils.py", line 78, in _execute
    self.db.validate_no_broken_transaction()
  File "/home/avishrant/GitRepo/scancode.io/lib/python3.8/site-packages/django/db/backends/base/base.py", line 447, in validate_no_broken_transaction
    raise TransactionManagementError(

Also, the new pipeline seems to work on smaller codebases. The following image shows the result of scanning a codebase of 2 files with the new pipeline.

[Image: scan result for the two-file codebase]

AvishrantsSh avatar Jun 17 '21 13:06 AvishrantsSh

Nitpicking: for text, it is usually better to copy/paste it rather than take a screenshot :) Images cannot be searched or copied :)

pombredanne avatar Jun 17 '21 13:06 pombredanne

@AvishrantsSh what is your ScanCode.io setup? Running with Docker or with make run?

tdruez avatar Jun 17 '21 13:06 tdruez

Also, can you provide a link to your code so we can have a look?

tdruez avatar Jun 17 '21 13:06 tdruez

@AvishrantsSh what is your ScanCode.io setup? Running with Docker or with make run?

I'm using make run

AvishrantsSh avatar Jun 17 '21 13:06 AvishrantsSh

You can try to deactivate multiprocessing with SCANCODEIO_PROCESSES=-1 in your local .env file, and re-run the pipeline. You also want to provide your code in a branch with instructions so we can run it on our side and help with debugging.

tdruez avatar Jun 17 '21 13:06 tdruez

You also want to provide your code in a branch with instructions so we can run it on our side and help with debugging.

Yes, I'm working on a different branch for this new pipeline. See AvishrantsSh/scancode.io/tree/glc. Nothing special is required right now, I guess. I have added the prototype Python package as a requirement so that you can use make dev.

AvishrantsSh avatar Jun 17 '21 14:06 AvishrantsSh

@AvishrantsSh any success with disabling multiprocessing?

tdruez avatar Jun 17 '21 14:06 tdruez

@AvishrantsSh any success with disabling multiprocessing?

Sadly no. But I don't think it's a problem with multiprocessing, because the rest of the pipelines work smoothly, yet the load_inventory pipeline is still unable to load the JSON generated by my package. I am trying to use VirtualCodebase (as in the scancode pipeline) before storing the results in the database. I think the problem lies somewhere around here.

AvishrantsSh avatar Jun 17 '21 16:06 AvishrantsSh

Do I need to manually feed all the data into the CodebaseResource model? Or can I use the existing architecture to do the same?

AvishrantsSh avatar Jun 17 '21 16:06 AvishrantsSh

But I don't think it's a problem with multiprocessing, because the rest of the pipelines work smoothly, yet the load_inventory pipeline is still unable to load the JSON generated by my package.

load_inventory expects an output generated by scancode-toolkit as its input. Trying to feed it the output of another tool is likely to fail.

Do I need to manually feed all the data into the CodebaseResource model? Or can I use the existing architecture to do the same?

I would suggest writing a pipe function in your pipes.glc module that parses the output from the previous steps and creates the CodebaseResource objects in the database. The value of that function is in mapping the data from glc to scancode.io; you can call pipes.make_codebase_resource for the actual database insertion.
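
As a rough sketch of such a pipe function, assuming the glc output is a JSON list of per-file entries with hypothetical `file`, `copyrights`, and `licenses` keys. The `create_resource` callable is injected here so the mapping can be exercised without a database; in the pipeline it would wrap whatever call performs the insertion (for instance pipes.make_codebase_resource, whose exact signature is not shown in this thread):

```python
import json


def load_glc_results(json_text):
    """Parse the scan output; assumes a top-level list of per-file entries."""
    return json.loads(json_text)


def create_resources_from_glc(entries, create_resource):
    """
    Map each entry to resource fields and pass it to `create_resource`.

    Returns a list of (path, error message) pairs for entries that failed,
    so one bad entry does not hide which file caused the problem.
    """
    failures = []
    for entry in entries:
        path = entry.get("file", "")
        data = {
            "path": path,
            "copyrights": entry.get("copyrights", []),
            "licenses": entry.get("licenses", []),
        }
        try:
            create_resource(data)
        except Exception as error:  # keep going, but remember what failed
            failures.append((path, str(error)))
    return failures


# Usage with an in-memory list standing in for the database.
created = []
sample = '[{"file": "a.py", "licenses": ["MIT"]}, {"file": "b.py"}]'
failures = create_resources_from_glc(load_glc_results(sample), created.append)
print(len(created), failures)
```

Collecting the failing paths instead of stopping at the first error is a design choice for debugging; it does not by itself address database-level transaction errors.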

tdruez avatar Jun 18 '21 07:06 tdruez

I would suggest writing a pipe function in your pipes.glc module that parses the output from the previous steps and creates the CodebaseResource objects in the database

Yes!! This worked for me. But there seems to be a problem that I can't get hold of. Sometimes insertion into the database suddenly fails at get_or_create. In my case, it occurred when a very large copyright expression was encountered.

AvishrantsSh avatar Jun 18 '21 18:06 AvishrantsSh

But there seems to be a problem that I can't get hold of. Sometimes insertion into the database suddenly fails at get_or_create. In my case, it occurred when a very large copyright expression was encountered.

Pasting the error output here would help to debug.

tdruez avatar Jun 22 '21 08:06 tdruez

Pasting the error output here would help to debug.

The error log is the same as the one submitted earlier: "An error occurred in the current transaction. You can't execute queries until the end of the 'atomic' block.", with an identical traceback ending in TransactionManagementError.

AvishrantsSh avatar Jun 23 '21 04:06 AvishrantsSh

@AvishrantsSh Are you still having this issue?

tdruez avatar Aug 03 '21 08:08 tdruez