
Atlas Bulk upload with batch size throwing error

sbbagal13 opened this issue 2 years ago • 3 comments

I am trying to upload an Excel file with the bulk upload functionality. For a smaller file it works fine, but with more data the server returns an error (504 timeout), so I tried setting the batch size parameter. With it set, the upload throws the error below instead. Any suggestion on how to resolve it?

client = AtlasClient(endpoint_url='http://xxxx.xxx.xx/api/atlas/v2',
                     authentication=auth)
result = client.upload_entities(batch=entities, batch_size=1000)

File "C:\Users\sbagal\PycharmProjects\pythonProject\venv\lib\site-packages\pyapacheatlas\core\client.py", line 1243, in upload_entities
    payload["entities"], batch_size=batch_size)]
File "C:\Users\sbagal\PycharmProjects\pythonProject\venv\lib\site-packages\pyapacheatlas\core\util.py", line 274, in batch_dependent_entities
    if len(candidate_set) > largest_candidate_size:
TypeError: object of type 'NoneType' has no len()

sbbagal13 avatar Aug 03 '22 22:08 sbbagal13

@wjohnson Any suggestion on this? I have an Excel file with around 20K entities, but it is not able to process them all at once, so I tried loading with batch_size=1000. Also, one question along the same lines: do all required relationship entities need to be available in the same batch? For example, if I create a table in the first batch, can I reference it from the second batch for a column relationship, or does it need to be present in the second batch as well?

sbbagal13 avatar Aug 04 '22 16:08 sbbagal13

Hi, @sbbagal13 Thank you for using PyApacheAtlas!

Can you share which version of PyApacheAtlas you're on? The traceback is in a surprising spot. It's as if it couldn't find any of the assets to be grouped together.

Your second question makes me wonder how deeply connected your assets are. For example, are all 20,000 related to each other? The batching algorithm looks for relationships and will create batches that keep all related entities together. If you have a group of related entities that exceeds your batch size, it should throw an exception.
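The grouping behavior described above can be sketched as a connected-components pass: related entities must land in the same batch, so a component larger than the batch size cannot be split. This is a simplified illustration, not the library's actual `batch_dependent_entities` implementation; the entity-id and relationship-pair shapes are assumptions.

```python
from collections import defaultdict

def group_related(entities, relationships, batch_size):
    """Group entity ids into connected components (iterative DFS), then
    pack whole components into batches no larger than batch_size."""
    adj = defaultdict(set)
    for a, b in relationships:
        adj[a].add(b)
        adj[b].add(a)

    seen = set()
    components = []
    for e in entities:
        if e in seen:
            continue
        comp, stack = [], [e]
        seen.add(e)
        while stack:
            cur = stack.pop()
            comp.append(cur)
            for nxt in adj[cur]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        components.append(comp)

    batches, current = [], []
    for comp in components:
        if len(comp) > batch_size:
            # A component of related entities cannot be split across
            # batches, so this situation is unresolvable by batching.
            raise ValueError(
                f"Component of {len(comp)} related entities exceeds "
                f"batch_size={batch_size}; it cannot be split."
            )
        if len(current) + len(comp) > batch_size:
            batches.append(current)
            current = []
        current.extend(comp)
    if current:
        batches.append(current)
    return batches
```

Under this model, a spreadsheet where everything is (transitively) related to one root instance forms a single ~20K-entity component, so no batch_size can break it up.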

wjohnson avatar Aug 09 '22 20:08 wjohnson

Hi @wjohnson, thank you for responding. I have tried with the latest version, 0.13, and get the same exception. As you said, it might be an issue with how the data is organized in the Excel file. Here is how we have the data laid out:

1: rdbms_instance (first/single row)
2: rdbms_db (second row till 1050 rows)
3: rdbms_db (second row till 55 rows)
4: rdbms_columns (row 56 to ~20K rows)

How can I load this data using batches? What would be the best approach for making the batches, and for choosing the batch size? Does each batch need to maintain the hierarchy (rdbms_instance => rdbms_db => rdbms_tables => rdbms_columns)?

sbbagal13 avatar Aug 09 '22 22:08 sbbagal13

Oh yikes, @sbbagal13 - I might batch this up into a few things and take advantage of the AtlasObjectId feature. You can see more about this feature here

I'd follow this order of operations:

  • Upload the rdbms_instance by itself, collect its guid
  • Upload the rdbms_db by themselves and...
    • Add a column called [Relationship] instance
    • Fill that column with AtlasObjectId(guid:123-abc-456) where 123-abc-456 is the guid of the rdbms_instance
    • Perform an upload with a batch size of 100 or 200 - They should all now be independent.
    • Note the guids
  • Upload the rdbms_table and columns and...
    • Add a column called [Relationship] db and [Relationship] table
    • Fill the db column for each table row with AtlasObjectId(guid:123-abc-456) where 123-abc-456 is the guid of the rdbms_db
    • Fill the table column for each rdbms_column row with the FQN of the table that is being uploaded with it.
    • Perform an upload with a batch size of 1000 or 2000 - They should all now be independent.
    • This assumes that no one table has more than 1,000 columns
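The order of operations above can be sketched as plain row data (a minimal illustration: the `AtlasObjectId(guid:...)` string form and the `[Relationship]` column names follow the steps above, while the helper name, qualified names, and guid values are made-up placeholders):

```python
def atlas_object_id(guid):
    """Render a guid in the AtlasObjectId(...) string form used in the
    [Relationship] columns of the Excel template."""
    return f"AtlasObjectId(guid:{guid})"

# Phase 1: upload the single rdbms_instance on its own and note its
# guid (the value below is a placeholder).
instance_guid = "123-abc-456"

# Phase 2: each rdbms_db row points back at the instance by guid, so
# the db rows are independent of each other and batch freely.
db_rows = [
    {
        "typeName": "rdbms_db",
        "qualifiedName": f"mydb_{i}@server",  # hypothetical FQNs
        "[Relationship] instance": atlas_object_id(instance_guid),
    }
    for i in range(3)
]

# Phase 3: tables reference their db by guid; columns reference their
# table by the FQN of a table uploaded in the same batch.
db_guid = "789-def-012"  # noted from the phase-2 upload (placeholder)
table_rows = [
    {
        "typeName": "rdbms_table",
        "qualifiedName": "mydb_0.orders@server",
        "[Relationship] db": atlas_object_id(db_guid),
    }
]
column_rows = [
    {
        "typeName": "rdbms_column",
        "qualifiedName": "mydb_0.orders.order_id@server",
        "[Relationship] table": "mydb_0.orders@server",
    }
]
```

Because each phase only references guids collected from an earlier, already-completed upload, no batch in a later phase depends on another row in the same phase except a column's own table, which travels in the same batch.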

That SHOULD solve the problem of everything being related to the one root rdbms instance. Please let me know if you give it a try!

wjohnson avatar Sep 04 '22 05:09 wjohnson

Closing this for now (due to > 1 month since last update) but @sbbagal13 please feel free to reopen if you are still having an issue.

wjohnson avatar Oct 05 '22 13:10 wjohnson