swift-colab icon indicating copy to clipboard operation
swift-colab copied to clipboard

Incorrect rendering of Pandas DataFrame + unexpected LLDB import behavior

Open philipturner opened this issue 2 years ago • 3 comments

I have seen this bug around for quite some time now. IIRC, the bug existed when Google sponsored S4TF. Pandas DataFrames are having their columns rearranged before rendering to the Colab GUI. This happens roughly 50% of the time that you render a DataFrame:

Screen Shot 2022-08-11 at 8 15 36 AM Screen Shot 2022-08-11 at 8 12 47 AM

Code to reproduce:

%install-swiftpm-flags $clear
%install '.package(url: "https://github.com/pvieito/PythonKit.git", .branch("master"))' PythonKit
%install '.package(url: "https://github.com/KarthikRIyer/swiftplot", .branch("master"))' SwiftPlot AGGRenderer
%include "EnableIPythonDisplay.swift"
import Foundation
import PythonKit
import SwiftPlot
import AGGRenderer

let display = Python.import("IPython.display")
let pd = Python.import("pandas")
display.display(pd.DataFrame.from_records([["col 1": 3, "col 2": 5], ["col 1": 8, "col 2": 2]]))

This could be a symptom of feeding the data through LLDB. If it was serialized through JSON with my new inter-process piping mechanism, perhaps it would always render correctly.

philipturner avatar Aug 11 '22 12:08 philipturner

While narrowing down the reproducer, I found another unrelated bug. After restarting the runtime, #if canImport() statements will succeed when the corresponding module can't be loaded.

Screen Shot 2022-08-11 at 8 20 51 AM

Solution: Delete all of the .swiftmodule files from /opt/swift/install-location/modules unless they were recently referenced by a %install command. This requires tracking and storing everything the current Jupyter session has loaded, and saving it after changing the install location with %install-location.

The documented behavior aligns with the Swift-Colab v2.0 instruction that you must call %install before using the Swift module. It differs from Clang modules, which have the opposite convention due to fundamental compiler limitations. They must be visible to LLDB upon launch, so never delete them from /opt/swift/install-location/modules.

That limitation poses another problem: if switching the install location, you can only use Clang modules visible at the last place assigned to %install-location. I could make a global cache of Clang modules across all install locations, although that would become corrupted if one Clang module is malformed. Users might want to use %install-location to containerize code and bypass such corruption. This would be a big problem in Swift-Colab v3.0, where purging the cache means deleting a lot of persistent build products.

Solution: Clearly document that you can call %system rm -rf /opt/swift/install-location/modules. A global module cache could be symlinked to the install location. The URL would not work in local notebooks, which should use a containerized environment without access to /opt. I can document that this path will be different on local notebooks, and tell users where to find it.

Even better idea: Make a /opt/swift/modules directory and don't expose anything symlinked to LLDB, not even the install location. I can now (try to) stop symlinking the install location anywhere. This might significantly diverge from the original design of %install-location, but that design was based on a less flexible build system.

philipturner avatar Aug 11 '22 12:08 philipturner

Narrowed reproducer for this issue's original bug report:

%install-swiftpm-flags $clear
%install '.package(url: "https://github.com/pvieito/PythonKit.git", .branch("master"))' PythonKit
%include "EnableIPythonDisplay.swift"

import PythonKit
let display = Python.import("IPython.display")
let pd = Python.import("pandas")
display.display(pd.DataFrame.from_records([["col 1": 3, "col 2": 5], ["col 1": 8, "col 2": 2]]))

If you restart the runtime, no plots show the very first time you run the cell. This is a problem with the current implementation of KernelCommunicator, where you must finish the first cell that includes EnableIPythonDisplay.swift before receiving display messages. I'm going to reimplement KernelCommunicator soon, which should eliminate that behavior, and hopefully the incorrect DataFrame rendering.

philipturner avatar Aug 11 '22 12:08 philipturner

The problem is a fundamental flaw of PythonKit. When transforming the dictionary literal into a Python object, it first transforms the dictionary into a Swift Dictionary. The hashed storage does not preserve order, so the Python dictionary now has shuffled elements. For example, the code below produces incorrect DataFrame tables:

let array: [PythonObject] = [
  ["col 1": 3, "col 2": 5, "col 3": 4],
  ["col 1": 8, "col 2": 2, "col 3": 4]
]
print(array)
display.display(pd.DataFrame.from_records(array))
Screen Shot 2022-08-11 at 9 11 14 AM

While this code does not:

let array: [PythonObject] = [[:], [:]]
array[0]["col 1"] = 3
array[0]["col 2"] = 5
array[0]["col 3"] = 4
array[1]["col 1"] = 8
array[1]["col 2"] = 2
array[1]["col 3"] = 4
print(array)
display.display(pd.DataFrame.from_records(array))
Screen Shot 2022-08-11 at 9 10 25 AM

philipturner avatar Aug 11 '22 13:08 philipturner