
dbx sync fails with "Files can be created in parallel, but we limit how many are opened so we don't use memory excessively"

Open andrei-radulescu-banu opened this issue 2 years ago • 0 comments

Expected Behavior

I run $ dbx sync repo --dest-repo <repo_name>. The command should keep running and continuously sync my local repo to the Databricks workspace.

Current Behavior

When editing files in the sandbox, the command crashes every minute or so and displays this stack trace:

(databricks) andrei@alien:~/isee/databricks/datasets$ dbx sync repo --dest-repo datasets
[dbx][2022-12-15 10:07:27.759] Syncing from /home/andrei/isee/databricks/datasets
[dbx][2022-12-15 10:07:27.761] Ignoring patterns from /home/andrei/isee/databricks/datasets/.gitignore
[dbx][2022-12-15 10:07:27.768] Target base path: /Repos/[email protected]/datasets
[dbx][2022-12-15 10:07:27.769] Starting initial copy
[dbx][2022-12-15 10:07:27.769] Restoring sync snapshot from /home/andrei/isee/databricks/datasets/.dbx/sync/datasets-repos-e7e5d267c158d829106d1c80104f2e37
[dbx][2022-12-15 10:07:27.881] Checking if any unmatched files/directories would be deleted
[dbx][2022-12-15 10:07:27.882] Putting /Repos/[email protected]/datasets/common/util/ros.py
[dbx][2022-12-15 10:07:28.711] Done. Watching for changes...
[dbx][2022-12-15 10:08:37.960] Done
[dbx][2022-12-15 10:08:40.368] Putting /Repos/[email protected]/datasets/common/util/ros.py
[dbx][2022-12-15 10:08:40.885] Done
[dbx][2022-12-15 10:10:08.444] Done
[dbx][2022-12-15 10:10:12.831] Putting /Repos/[email protected]/datasets/common/util/ros.py
[dbx][2022-12-15 10:10:13.653] Done
[dbx][2022-12-15 10:10:21.534] Done
[dbx][2022-12-15 10:10:23.148] Putting /Repos/[email protected]/datasets/common/util/ros.py
[dbx][2022-12-15 10:10:24.105] Done
[dbx][2022-12-15 10:10:27.000] Done
[dbx][2022-12-15 10:10:46.899] Putting /Repos/[email protected]/datasets/common/util/ros.py
[dbx][2022-12-15 10:10:47.239] Done
[dbx][2022-12-15 10:11:03.128] Done
[dbx][2022-12-15 10:11:10.249] Putting /Repos/[email protected]/datasets/common/util/ros.py
[dbx][2022-12-15 10:11:10.633] Done
[dbx][2022-12-15 10:11:19.013] Done
[dbx][2022-12-15 10:11:36.656] Putting /Repos/[email protected]/datasets/common/util/#ros.py#
[dbx][2022-12-15 10:11:37.179] HTTP 409: {"message":"Creating file failed. An item with path /Repos/[email protected]/datasets/common/util already exists"}
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/andrei/.venv/databricks/lib/python3.8/site-packages/dbx/commands/sync/sync.py:318 in repo  │
│                                                                                                  │
│   315 │                                                                                          │
│   316 │   client = ReposClient(user=user_name, repo_name=dest_repo, config=config)               │
│   317 │                                                                                          │
│ ❱ 318 │   main_loop(                                                                             │
│   319 │   │   source=source,                                                                     │
│   320 │   │   matcher=matcher,                                                                   │
│   321 │   │   client=client,                                                                     │
│                                                                                                  │
│ /home/andrei/.venv/databricks/lib/python3.8/site-packages/dbx/commands/sync/functions.py:153 in  │
│ main_loop                                                                                        │
│                                                                                                  │
│   150 │   │   │   │   │   time.sleep(sleep_interval)                                             │
│   151 │   │   │   │                                                                              │
│   152 │   │   │   │   # Run incremental copy to sync over changes since the last sync.           │
│ ❱ 153 │   │   │   │   op_count = syncer.incremental_copy()                                       │
│   154 │   │   │   │                                                                              │
│   155 │   │   │   │   # simple way to enable unit testing to break out of loop                   │
│   156 │   │   │   │   if op_count < 0:                                                           │
│                                                                                                  │
│ /home/andrei/.venv/databricks/lib/python3.8/site-packages/dbx/sync/__init__.py:449 in            │
│ incremental_copy                                                                                 │
│                                                                                                  │
│   446 │   │                                                                                      │
│   447 │   │   # Use the diff between current snapshot and previous snapshot to apply the same    │
│   448 │   │   # against the remote location.                                                     │
│ ❱ 449 │   │   op_count = asyncio.run(self._apply_snapshot_diff(diff))                            │
│   450 │   │                                                                                      │
│   451 │   │   self.last_snapshot = snapshot                                                      │
│   452                                                                                            │
│                                                                                                  │
│ /usr/lib/python3.8/asyncio/runners.py:44 in run                                                  │
│                                                                                                  │
│   41 │   │   events.set_event_loop(loop)                                                         │
│   42 │   │   if debug is not None:                                                               │
│   43 │   │   │   loop.set_debug(debug)                                                           │
│ ❱ 44 │   │   return loop.run_until_complete(main)                                                │
│   45 │   finally:                                                                                │
│   46 │   │   try:                                                                                │
│   47 │   │   │   _cancel_all_tasks(loop)                                                         │
│                                                                                                  │
│ /usr/lib/python3.8/asyncio/base_events.py:616 in run_until_complete                              │
│                                                                                                  │
│    613 │   │   if not future.done():                                                             │
│    614 │   │   │   raise RuntimeError('Event loop stopped before Future completed.')             │
│    615 │   │                                                                                     │
│ ❱  616 │   │   return future.result()                                                            │
│    617 │                                                                                         │
│    618 │   def stop(self):                                                                       │
│    619 │   │   """Stop running the event loop.                                                   │
│                                                                                                  │
│ /home/andrei/.venv/databricks/lib/python3.8/site-packages/dbx/sync/__init__.py:241 in            │
│ _apply_snapshot_diff                                                                             │
│                                                                                                  │
│   238 │   │   │                                                                                  │
│   239 │   │   │   op_count += await self._apply_dirs_deleted(diff, session, deleted_dirs)        │
│   240 │   │   │   op_count += await self._apply_dirs_created(diff, session)                      │
│ ❱ 241 │   │   │   op_count += await self._apply_files_created(diff, session)                     │
│   242 │   │   │   op_count += await self._apply_files_deleted(diff, session, deleted_dirs)       │
│   243 │   │   │   op_count += await self._apply_files_modified(diff, session)                    │
│   244                                                                                            │
│                                                                                                  │
│ /home/andrei/.venv/databricks/lib/python3.8/site-packages/dbx/sync/__init__.py:206 in            │
│ _apply_files_created                                                                             │
│                                                                                                  │
│   203 │   │   return op_count                                                                    │
│   204 │                                                                                          │
│   205 │   async def _apply_files_created(self, diff: SnapshotDiff, session: aiohttp.ClientSess   │
│ ❱ 206 │   │   return await self._apply_file_puts(session, diff.files_created, "created")         │
│   207 │                                                                                          │
│   208 │   async def _apply_files_modified(self, diff: SnapshotDiff, session: aiohttp.ClientSes   │
│   209 │   │   return await self._apply_file_puts(session, diff.files_modified, "modified")       │
│                                                                                                  │
│ /home/andrei/.venv/databricks/lib/python3.8/site-packages/dbx/sync/__init__.py:202 in            │
│ _apply_file_puts                                                                                 │
│                                                                                                  │
│   199 │   │   │   else:                                                                          │
│   200 │   │   │   │   dbx_echo(f"(noop) File {msg}: {path}")                                     │
│   201 │   │   if tasks:                                                                          │
│ ❱ 202 │   │   │   await asyncio.gather(*tasks)                                                   │
│   203 │   │   return op_count                                                                    │
│   204 │                                                                                          │
│   205 │   async def _apply_files_created(self, diff: SnapshotDiff, session: aiohttp.ClientSess   │
│                                                                                                  │
│ /home/andrei/.venv/databricks/lib/python3.8/site-packages/dbx/sync/__init__.py:196 in task       │
│                                                                                                  │
│   193 │   │   │   │   │   # Files can be created in parallel, but we limit how many are opened   │
│   194 │   │   │   │   │   # so we don't use memory excessively.                                  │
│   195 │   │   │   │   │   async with sem:  # noqa                                                │
│ ❱ 196 │   │   │   │   │   │   await self.client.put(get_relative_path(self.source, p), p, sess   │
│   197 │   │   │   │                                                                              │
│   198 │   │   │   │   tasks.append(task(path))                                                   │
│   199 │   │   │   else:                                                                          │
│                                                                                                  │
│ /home/andrei/.venv/databricks/lib/python3.8/site-packages/dbx/sync/clients.py:273 in put         │
│                                                                                                  │
│   270 │   │   │   │   │   else:                                                                  │
│   271 │   │   │   │   │   │   txt = await resp.text()                                            │
│   272 │   │   │   │   │   │   dbx_echo(f"HTTP {resp.status}: {txt}")                             │
│ ❱ 273 │   │   │   │   │   │   raise ClientError(resp.status)                                     │
│   274                                                                                            │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ClientError: 409
(databricks) andrei@alien:~/isee/databricks/datasets$ dbx --version
[dbx][2022-12-15 10:15:41.273] 🧱 Databricks eXtensions aka dbx, version ~> 0.8.7

Steps to Reproduce (for bugs)

I'm continuously editing my sandbox in emacs while running dbx in the background.
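
A possible lead, not confirmed against the dbx source: the last "Putting" line before the crash targets common/util/#ros.py#, which matches the naming emacs uses for auto-save files, and the 409 message names the parent directory common/util rather than the file. That would be consistent with the "#" being read as the start of a URL fragment when the path is placed in a request URL without percent-encoding. The sketch below only illustrates that general URL behavior with the Python standard library; the host, endpoint, and user name are hypothetical placeholders and are not taken from dbx.

```python
from urllib.parse import quote, urlsplit

# Hypothetical base URL and user name, for illustration only.
BASE = "https://example.invalid/api/put"
remote_path = "/Repos/user/datasets/common/util/#ros.py#"

# Unencoded: everything after the first "#" is parsed as the fragment,
# so the path that remains ends at the parent directory.
raw = urlsplit(BASE + remote_path)

# Percent-encoded: "#" becomes "%23" and stays part of the path.
encoded = urlsplit(BASE + quote(remote_path))

print("raw path:        ", raw.path)
print("raw fragment:    ", raw.fragment)
print("encoded path:    ", encoded.path)
print("encoded fragment:", repr(encoded.fragment))
```

If this is what happens, the unencoded request would address the existing directory common/util, which would explain a "Creating file failed ... already exists" response. Excluding emacs auto-save files from the sync (for example through the .gitignore that dbx already reads, per the log above) might be a way to avoid the crash in the meantime.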

Context

Your Environment

  • dbx version used: 0.8.7
  • Databricks Runtime version: 11.3
  • Host side: Ubuntu 20.04

andrei-radulescu-banu, Dec 15 '22 15:12