A way to position lights with transforms (maybe a variant of worldToScreen?)
Increasing access
While working on this sketch https://openprocessing.org/sketch/2670771 I was positioning an exit sign that I also wanted to cast light. I was positioning the sign model using translations and rotations. Getting the light to go in the same spot was going to be tough because functions like pointLight don't take those transformations into account, so matching them requires matrix math. That's likely not something everyone is comfortable with, and it may lead people to avoid using lights in their scenes altogether.
Most appropriate sub-area of p5.js?
- [ ] Accessibility
- [ ] Color
- [ ] Core/Environment/Rendering
- [ ] Data
- [ ] DOM
- [ ] Events
- [ ] Image
- [ ] IO
- [ ] Math
- [ ] Typography
- [ ] Utilities
- [x] WebGL
- [ ] Build process
- [ ] Unit testing
- [ ] Internationalization
- [ ] Friendly errors
- [ ] Other (specify if possible)
Feature enhancement details
I'm not sure yet what the best way to deal with this is, but here are some ideas:
- Make lights take transformations into account
  - Benefits: probably the most straightforward conceptually
  - Downsides: to not be a breaking change, this would likely have to be optional, and lighting overloads are already quite complex (see the long list for `spotLight`, for example)
- Add a way to get world-space coordinates from local coordinates
  - Benefits: we have something similar for screen coordinates with `worldToScreen` and `screenToWorld`. This is what I ended up doing in my sketch, defining a similar `worldPoint` method at the top (see the sketch after this list).
  - Downsides:
    - The target space is world space, but I think we maybe named those other methods slightly inaccurately, since they also use "world" but in a less standard definition. They go from a local coordinate to a screen coordinate. In shader hooks, we use the terms object space, world space, and camera space. Based on those, it would more accurately be `objectToScreen`. Not sure how to navigate that naming just yet, open to suggestions!
    - It requires some more steps to use it -- you have to first grab the world coordinate given the current transforms, and then pass it into a lighting function
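For context, here's a rough reconstruction of the kind of `worldPoint` workaround mentioned above. It's a minimal sketch, assuming the sign is placed with a single `translate()` plus `rotateY()`; the helper redoes that math by hand so `pointLight` can land in the same world-space spot (the helper itself is not a p5.js function, and the rotation sign convention may need flipping for a given sketch).

```js
// Hypothetical helper: mirror a translate() + rotateY() by hand so a light
// can be positioned at the same world-space spot as a transformed model.
function worldPoint(lx, ly, lz, angleY, tx, ty, tz) {
  // world = translation applied after a rotation about the y-axis
  const wx = Math.cos(angleY) * lx + Math.sin(angleY) * lz + tx;
  const wz = -Math.sin(angleY) * lx + Math.cos(angleY) * lz + tz;
  return { x: wx, y: ly + ty, z: wz };
}

function setup() {
  createCanvas(400, 400, WEBGL);
}

function draw() {
  background(0);
  const angleY = frameCount * 0.01;

  // Lights ignore the current transform, so give them world-space values.
  const p = worldPoint(0, 0, 20, angleY, 100, -50, 0);
  pointLight(0, 255, 0, p.x, p.y, p.z);

  // The sign itself, positioned with the usual transform functions.
  push();
  translate(100, -50, 0);
  rotateY(angleY);
  box(60, 30, 10); // stand-in for the exit sign model
  pop();
}
```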
Hi @davepagurek!
I’m so glad you posted this issue. It led me to some exciting ideas! I think I have a solution to all the problems you raised. I'll start with the overall concept. Then I'll explain how it solves (a) your light problem, (b) the issue you uncovered with worldToScreen() and screenToWorld(), and (c) other related, longstanding issues. I'd love to hear your feedback.
Transform class
As we’ve discussed, I’m working on a proposal for a Transform class that’s essentially an object-oriented version of the existing Transform features. It provides a user-friendly interface that abstracts away the necessary matrix math. Even for users who know a lot about matrices, putting them behind a user-friendly interface has many advantages:
- It makes user code readable and self-documenting
- It prevents errors
- It enforces conventions
- It even turns some multiline operations into one-liners.
Here are some of the core methods, to give you a sense of it:
```js
// getting/setting reference frame
xAxis()
yAxis()
zAxis()
origin()

// building transforms from standard operations
translate()
scale()
rotate()

// building transforms from general operations
applyTransform()
applyToTransform()
invert()

// using transforms
applyToPoint()
applyToDirection()
applyToNormal()
```
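To make that concrete, here's a rough usage sketch. The exact names and signatures are part of the proposal, not existing p5.js API, and I'm assuming a `createTransform()` constructor and chainable builder methods here:

```js
// Sketch of the proposed Transform API (names are proposed, not yet in p5.js).
const t = createTransform()      // assumed constructor for an identity transform
  .translate(100, -50, 0)        // same operations as the standalone functions
  .rotate(QUARTER_PI);

// Reference frame of the transformed space, expressed in parent coordinates
const o = t.origin();            // where the local origin ends up
const xDir = t.xAxis();          // direction of the local x-axis

// Apply the transform (and the reverse transform) to geometry
const worldPt = t.applyToPoint(createVector(10, 0, 0));
const localPt = t.invert().applyToPoint(worldPt);
```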
With the Transform class in hand, we can easily provide access to the important Transform objects, including the usual model, view, and projection transforms. Access to such features has been requested for a long time, and it'd solve current problems faced by WEBGL users. At the moment, they're resorting to using unstable, undocumented internal features, as you pointed out. That's a desire path! We can pave that path by providing the features outlined below.
Standalone getters
The most basic getter is a function for getting the currently active transform (the one that's set using standalone features like translate() and rotate()). Like much of p5's existing Transform API, this feature has the same name as a feature on the native CanvasRenderingContext2D.
`getTransform([dimension])`
- Description: the model transform
- Conversion: local to world
- Context: works with `P2D` (2D) and `WEBGL` (2D and 3D)
Advanced users may want to convert straight from local coordinates to screen-plus-depth coordinates (i.e. screen space). For them, we can also have a getter for the transform that covers the whole graphics pipeline at once. We can name it after the pipeline to indicate that it's useful for 3D graphics, and it can gracefully degrade to getTransform() in the 2D case.
`getPipeline([dimension])`
- Description: the model-view-projection-viewport transform
- Conversion: local to screen
- Context: works with `P2D` (2D) and `WEBGL` (2D and 3D)
- Note: this provides an equivalent of the misnamed `worldToScreen()`, as well as `screenToWorld()` via `invert()`.
In both cases, for renderers capable of 2D and 3D graphics, dimension may be TWO_D or THREE_D. It defaults to THREE_D. In the THREE_D case, screen space has coordinates $(x, y, d)$, where $d$ indicates the point's depth from the camera before it's projected onto the screen. The depth coordinate falls between 0 (corresponding to the near viewing plane) and 1 (corresponding to the far viewing plane); it allows us to recover a point's original position in world space, if we need it.
Implementation note:
Both getTransform(TWO_D) and getPipeline(TWO_D) would return the 2D model transform (implemented with a 3×3 matrix), which would be extracted from the 3D model transform (implemented with a 4×4 matrix).
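As a hypothetical example of the 3D case (again, this is the proposed API, not something that exists yet), projecting a point through the whole pipeline might look like this:

```js
// Hypothetical use of the proposed getPipeline(): project a point from the
// current local coordinates to screen-plus-depth coordinates.
function setup() {
  createCanvas(400, 400, WEBGL);
}

function draw() {
  background(220);
  rotateY(frameCount * 0.01);
  box(50);

  const toScreen = getPipeline();                             // proposed getter
  const s = toScreen.applyToPoint(createVector(25, -25, 25)); // a box corner

  // s.x and s.y are screen coordinates (the same space as mouseX/mouseY), and
  // s.z is a depth value in [0, 1] that can be compared across points, e.g. to
  // decide which vertex labels to hide.
}
```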
Class-based getters and setters
The following features expand access to all the key matrices in the graphics pipeline. They also include a simple but powerful API for getting composite transforms like model-view, view-projection, and model-view-projection.
`Camera.prototype.getTransform(source, target)`
- Description: `source` and `target` can each take any value in the set {`WORLD`, `EYE`, `CLIP`, `SCREEN`}
- Conversion: source to target

`Camera.prototype.setEyeTransform(transform)`
- Description: sets the transform corresponding to the inverse view matrix (alternatively set via `camera()`)
- Conversion: eye to world
- Note: places the camera's eye in the world; its `origin()` coordinates are the existing `eyeX`, `eyeY`, `eyeZ` properties

`Camera.prototype.setProjectionTransform(transform)`
- Description: sets the transform corresponding to the projection matrix (alternatively set via `ortho()`/`frustum()`/`perspective()`)
- Conversion: eye to clip
Here, CLIP space refers to the cube of points whose three coordinates all fall in the interval [-1, 1]. Although NDC (normalized device coordinates) is perhaps a bit more accurate, CLIP doesn't rely on abbreviations and is sometimes used to refer to this space.
Example: Suppose we have a point in screen-plus-depth space, and we want to know its depth from the camera in world units. We just grab the transform that takes us to eye space, with camera.getTransform(SCREEN, EYE). Once we transform into eye space, the z-value tells us the depth from the camera.
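A hypothetical snippet for that example, using the proposed parameterized getter (only `createCamera()` is existing p5.js API here):

```js
// Hypothetical: recover world-unit depth from a screen-plus-depth point.
const cam = createCamera();                        // existing p5 function
const screenToEye = cam.getTransform(SCREEN, EYE); // proposed parameterized getter

const screenPoint = createVector(200, 150, 0.3);   // (x, y, depth in [0, 1])
const eyePoint = screenToEye.applyToPoint(screenPoint);

// In eye space the camera sits at the origin looking along its z-axis, so the
// magnitude of eyePoint.z is the point's depth from the camera in world units.
```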
Hypothetically, if we eventually add a Model class for 3D objects (built from a geometry/shape and a material), we could also add the following features.
`Model.prototype.getTransform(source, target)`
- Description: `source` and `target` can each take any value in the set {`LOCAL`, `WORLD`, `EYE`, `CLIP`, `SCREEN`}
- Conversion: source to target

`Model.prototype.setLocalTransform(transform)`
- Description: sets the model's local transform, corresponding to the model matrix
- Conversion: local to world
Click to reveal comments on select design considerations (for the curious)
- Benefits of parameterized getter: The class-based `getTransform(source, target)` provides a super easy, economical, discoverable, and flexible way to directly get any of the composite transforms like the model-view transform, as well as their inverses. It also allows us to potentially implement optimizations on behalf of the user.
- Return value of getters (copies not references): The getters would return copies rather than references, to protect internal state. This is what the native `getTransform()` method of `CanvasRenderingContext2D` does, and it's especially important for the composite transforms. Allowing the user to set those would mean we'd need to be able to factor their input into their component pieces and update them accordingly, which would add unnecessary complexity (if it's possible at all).
- Feature selection and naming: The local and eye transforms are both named after their respective source spaces, making their reference frames intuitive. For example, the `origin()` of the eye transform is simply `eyeX`, `eyeY`, `eyeZ`, as noted above. This is one reason to have the eye transform represent the inverse of the view transform. Another reason is that three.js sets a strong precedent for making the inverse-view transform the default feature. That allows us to address an inconsistency in the usual names: the model transform is named after its source space whereas the view transform is named after its target space. Those names would cause confusion if included directly in our API.
- Reasons to include one-shot setters: In the initial feature set, we might consider omitting `setEyeTransform()` and `setProjectionTransform()` for the sake of economy. But there's probably a reasonable case to be made for including them. Intuitively, one of the advantages of having transform objects is reusability and flexibility, so allowing users to directly set an eye or projection transform from a transform object seems to make sense. It'd also allow us to establish a useful naming convention for the inverse-view transform, and it'd let us document the main component transforms more clearly. And, if we eventually develop a `Model` class, then `setLocalTransform()` would be more essential; including that and not including dedicated setters for the eye and projection transforms would introduce an inconsistency that'd reduce predictability.
Small blocker in existing features: Can we fix this?
There's one aspect of the current feature set that would prevent users from working with the active view and projection transforms: there doesn't seem to be a way for users to get the active camera instance, or at least, they can't get it from camera().
If camera() were to work like other joint getters/setters, then users could get the active camera instance with the line let activeCamera = camera(). After that, they could access the active view and projection transforms via activeCamera.getTransform(). Since the standalone getTransform() feature lets users work with the model transform, they could easily access all three of the component transforms in the standard graphics pipeline (as well as combinations of those transforms and their inverses).
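A hypothetical sketch of how that could read in user code (all of this hinges on the proposed getter behavior, so treat it as illustrative only):

```js
// Hypothetical: camera() as a joint getter/setter plus the proposed getters.
let activeCamera = camera();                           // no-argument getter (proposed)
const view = activeCamera.getTransform(WORLD, EYE);    // view transform
const viewProjection = activeCamera.getTransform(WORLD, CLIP);
const model = getTransform();                          // standalone model transform
// Together these cover all three component transforms of the pipeline.
```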
Do you know why camera() doesn't currently return the active camera instance, @davepagurek? Also, somewhat tangentially, do you know if there's a reason why the functionality of setCamera() isn't just implemented as an overload on camera()?
Lights
Now I'll come back to the exit sign that you wanted to light up in your sketch. A user facing the same problem could get the transform used to position and orient the sign, via getTransform(). Then they could apply it to their light's position and direction using applyToPoint() and applyToDirection(), respectively. And they could do all that without knowing any matrix math. That'd solve the problem without a breaking change, even if the solution isn't built in to the light functions themselves. If we think it's a good idea, we can change the light functions themselves in the next major version release.
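Here's a hypothetical version of the exit-sign scenario written against the proposed API. Only the existing calls like `spotLight()`, `translate()`, and `rotateY()` are real today; `getTransform()` returning a Transform object is the proposal:

```js
// Hypothetical: position a sign and its light from one captured transform.
function setup() {
  createCanvas(400, 400, WEBGL);
}

function draw() {
  background(0);
  const angleY = frameCount * 0.01;

  // Build the sign's placement once and capture it.
  push();
  translate(100, -50, 0);
  rotateY(angleY);
  const signTransform = getTransform();            // proposed getter
  pop();

  // Light functions ignore the current transform, so feed them world-space
  // values computed from the captured transform.
  const lightPos = signTransform.applyToPoint(createVector(0, 0, 10));
  const lightDir = signTransform.applyToDirection(createVector(0, 0, 1));
  spotLight(0, 255, 0,
            lightPos.x, lightPos.y, lightPos.z,
            lightDir.x, lightDir.y, lightDir.z);

  // Draw the sign with the same placement.
  push();
  translate(100, -50, 0);
  rotateY(angleY);
  box(60, 30, 10);                                 // stand-in for the sign model
  pop();
}
```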
Cleaning up worldToScreen() and screenToWorld()
Problem
Thanks for pointing out the API issue with the worldToScreen name. That will cause a lot of confusion. Specifically, "world" conveys the opposite of the intended meaning (i.e. global instead of local). So I think we do need to deprecate worldToScreen()/screenToWorld() and introduce an alternative.
To clarify the naming in my proposal, I'll also note that "screen" is problematic in this context. Sheesh! API design is hard. Fundamentally, the issue is this:
- In everyday understanding, screens are two-dimensional.
- As a term of art, screen space is also two-dimensional.
- As a p5.js feature, screens are… three dimensional.
On the surface, referring to clip space as screen space looks like a user-friendly white lie that conveys the core idea. However, the lie turns out to produce considerable complexity. This includes confusing inconsistencies and inaccuracies in the documentation, where screen space is variously described as both two-dimensional and three-dimensional. While I'm all for simplification, I think an important design principle is to simplify as much as possible, but no more. This is especially true for relatively advanced features such as these.
Solution
The features I proposed provide users with the functionality of both worldToScreen() and screenToWorld(), replacing them with `const localToScreen = getPipeline(dimension)` and `const screenToLocal = getPipeline(dimension).invert()`. Here, the dimension is explicit, so users know when screen space includes an extra depth coordinate and when it doesn't. And the new API can be extended in very useful ways. For example, the parameterized getter Camera.prototype.getTransform(source, target) provides a lot of extra features in a very economical and intuitive package.
This is pretty cool, since these kinds of features have been requested for a long time. See, for example, #1553 and #4743; the latter issue lists requests going back to 2014!
Edits
In order to have an organized plan in one place, I'm keeping this comment up to date with improvements based on the subsequent discussion. Since some comments may not account for these changes, the edits are documented here to reduce confusion.
EDIT 1: Revised the design after a bunch more analysis, in order to address some hidden sources of confusion.
EDIT 2: Revised the design again, after a lot more analysis, to resolve a flaw. The result is more intuitive and more powerful.
EDIT 3: Revised the writing and added structure to this comment, to make it easier to parse.
EDIT 4: Added SCREEN space for the viewport transform, equivalent to the internal projectedToScreenMatrix.
EDIT 5: Added dimension parameter to getTransform() and getPipeline().
EDIT 6: Added description of the depth coordinate in SCREEN space.
EDIT 7: Added clarification regarding CLIP space.
I really like the direction of this, especially your API for getting access to camera transforms!
While clip space is an advanced thing, the x/y values of it are something non-advanced users might want to use. "Pipeline" is definitely accurate here, but might not immediately stand out in the reference to users as the way to get a transform into screen space. One thought is to add getScreenTransform, which returns just the x/y of getPipeline?
As for why camera() is different from setCamera(): currently, camera() is used to set position/orientation properties on the active camera object, and not to swap out the active camera object. That distinction is definitely a little confusing. I'm not sure what the best workaround is here, balancing addressing that with minimizing breaking changes. An option I initially considered was making camera(cam) set an active camera and camera() return it, but that might muddy the waters if it still has its position/orientation setting overloads. Another option could be to keep camera() as a position/orientation setter, and introduce activeCamera(cam) to set the active camera and activeCamera() to get it, deprecating setCamera (not immediately deleting it, but marking it for removal in the future)?
I agree that we probably don't want to consider any breaking changes for lights just yet, so giving access to an easy way to transform positions into world space for lights gets the job done for now.
oh also one other question: do you have thoughts on how the inverse direction would look? with the APIs described, the current worldToScreen can be replaced. To do screenToWorld, I guess we'd need the user to invert one of the matrices, or switch to using activeCamera().getTransform(CLIP, WORLD). Is matrix inverse an API we're making public? If so, I think that mostly handles it. (There's the issue again about how going from a 2D screen to a 3D local coordinate is weird because you sorta need to start from a 3D screen space, but I think defaulting z to 0 like we currently do is good enough.)
Thanks Dave!
I'm really glad you like the API for getting camera transforms. I feel like that's one of the most exciting aspects of this design. Discovering it was a real breakthrough.
I'll respond to all your questions below. I'm sorry this response is so long. I do think we can ultimately provide a simple and empowering set of features. But accomplishing that requires a maze of considerations to be carefully navigated. So I'll elaborate on the issues and how I think my design solves them.
Why not "screen transform"?
One thought is to add getScreenTransform, which returns just the x/y of getPipeline?
Yeah, the "screen transform" name is attractive. It seems like maybe you're trying to resolve the problems I indicated by restoring the concept of screens as two-dimensional entities. I like the spirit of this, but I don't think it works, unfortunately. The getTransform() and getPipeline() features return Transform instances, which are designed to represent 2D or 3D transforms. For example, if we have let t = getTransform(), then we can do let q = t.applyToPoint(p) where p is a point. If we're working with a 3D renderer, then p and q will both be 3D vectors. So, an actual transform wouldn't just return two out of three coordinates.
Having said that, I still considered "screen transform" as an alternative name for the pipeline feature, opening my mind to the possibility that we might be able to make sense of a 3D screen. On the surface, it seems like the "screen" concept might convey the intuition we want, and it would go well with the precedent set by Processing's screenX(), screenY(), and screenZ(). Unfortunately, there are deeper costs that lead to a lot of confusion. Perhaps I've already convinced you, but I think it's probably worthwhile to lay out some of the costs explicitly.
The Processing reference already has rather confusing descriptions such as this description of screenZ(): "Takes a three-dimensional X, Y, Z position and returns the Z value for where it will appear on a (two-dimensional) screen." What does it mean for a third dimension to tell us where a point will appear in a two-dimensional coordinate system? It really doesn't tell us that. This concept of a 3D screen just feels fraught. I've only ever seen it in Processing and now p5.js, and in both cases, it has led to documentation that doesn't make sense.
It's still worth looking at the concept more carefully in our current context, but again the costs seem overwhelming. Below, I'll list costs of using a name like getScreenTransform() as an alternative to getPipeline().
- Source vs. target confusion: In the API I proposed, transforms are either named after a space (e.g. eye space) or a process (e.g. projection). More specifically, all transforms named after spaces are named after the source space, rather than the target space. If `screenTransform()` is named after the target space, it breaks internal consistency and predictability. This makes it really hard to remember the meaning of things. For me, this is probably a dealbreaker by itself, but the next point shows additional downstream consequences.
- Reference frame confusion: The reason to name transform objects after their source space is that it makes their `xAxis()`, `yAxis()`, `zAxis()`, and `origin()` properties intuitive. For example, for the eye transform, the origin corresponds to the existing `eyeX`, `eyeY`, and `eyeZ` properties. If we use the name `getScreenTransform()`, users would expect the `xAxis()`, `yAxis()`, and `origin()` properties to represent the screen's axes and origin, which would be incorrect.
- Dimension confusion: There's nothing about the name `getScreenTransform()` that suggests it's only really useful for 3D graphics. Users may wonder, "Why is this 3D only?" We could make `getScreenTransform()` work in 2D just by making it the same as `getTransform()`, or we could throw a warning or an error. But relying on warnings or errors isn't ideal, and making the feature work in 2D would leave room for confusion: in 2D mode, screen space would really be 2D, but in 3D mode it'd be 3D. If a screen transform outputs 2D screen coordinates in 2D mode, users would likely expect it to output 2D coordinates in 3D mode too. After all, screens are two-dimensional in both everyday language and standard graphics terminology.
- Screen vs. clip space confusion: I think we really want clip space, not screen space. For example, the main motivating example for `worldToScreen()` (in the original GitHub discussion) was labelling the vertices of a rotating 3D shape. To keep the labels legible, we want them to stay in the plane of the screen, and we want to hide them when they're on the back of the shape; the depth coordinate of clip space tells us when to hide them. So the depth coordinate is useful. Clip space is also more general: we can get the screen coordinates from clip space, but not vice versa. Lastly, if we have a function that takes three coordinates as input, but it just outputs x and y coordinates, we won't be able to work with its inverse.
- Documentation confusion: The white lie that clip space is the same as screen space isn't harmless, unfortunately. It's not that the name "screen transform" might cause confusion. It's that it already has caused confusion. I already commented on the Processing documentation. The p5.js documentation has similar problems, and it adds new ones. It describes the `worldToScreen()` output as a 2D vector. Meanwhile, its inverse asks for x, y, and z screen coordinates as input without a satisfactory explanation of why the screen is now three-dimensional. There are other aspects of this that I'll comment on near the end.
Why "pipeline"?
I'll break the case down into a few sections, starting with your primary concern.
Discoverability
"Pipeline" is definitely accurate here, but might not immediately stand out in the reference to users as the way to get a transform into screen space.
I actually think that getPipeline() will end up being more discoverable than the existing screenToWorld() and worldToScreen() features.
Although getPipeline() doesn't have "screen" in the name, it will fit well under the Transform section of the reference, where we have all of our standalone transform features. In particular, there will be just two getters in that section: getTransform() and getPipeline(). If users are looking to "get a transform into screen space," this section is where they're most likely to look.
Also, we can make the top-line description that appears under getPipeline() in the Transform section something like this: "Gets the pipeline transform from the current local coordinates to screen-plus-depth coordinates." This avoids the confusion of oversimplification while also being discoverable.
For comparison, the current screenToWorld() and worldToScreen() features are placed under the Environment section, along with a lot of features unrelated to transforms. That makes those features harder to find, and the fact that they were placed there indicates they don't clearly fit under the Transform section. Indeed, they would feel out of place compared to the rest of that API, but getPipeline() fits perfectly. The API is a clean match, and it actually gets a transform.
User context
With the exception of the new screenToWorld() and worldToScreen() features, all the transforms that get a user's visualization onto the screen are applied internally, on the user's behalf. The new features empower users by giving them access to manual coordinate conversions, but performing manual conversions is relatively advanced.
By the time users have needs that are as subtle as keeping 3D vertex labels on a rotating object in the plane of the screen for extra legibility (the motivating use case cited above), they may be more curious than intimidated by a term like "pipeline," or they may recognize the term instantly if they have any other experience in 3D graphics. So in this context, I think it's sensible to provide an honest name, rather than an oversimplified one that produces additional confusion. By choosing a name like "pipeline," we respect the user and potentially introduce them to a valuable concept, if they haven't already learned it.
Problems solved by getPipeline()
To clarify all this, I'll run through a list of problems getPipeline() solves.
- Accuracy: It avoids the not-so-white lie that screen space is the same as clip space.
- Source vs. target space: It avoids the problem of `getScreenTransform()` being named after a target space.
- Spatial context: While it would be nice to keep our vocabulary as small as possible, I think the benefits of adding "pipeline" to our vocabulary are overwhelming. This term is standard and is commonly associated with 3D graphics, so the name provides context about how it's meant to be used; users who are new to the term can learn from the documentation that `getPipeline()` converts from the current 3D coordinates to screen coordinates plus a depth value (another set of 3D coordinates). In contrast, `getScreenTransform()` sounds like something that should be useful in 2D.
- Graceful degradation: A `getPipeline()` feature can be made to degrade to `getTransform()` in the 2D case, since the 2D pipeline just consists of one step, which is the local transform. This is logical, with the 2D pipeline ending in 2D coordinates and the 3D pipeline ending in 3D coordinates.
- Visual metaphor: In everyday language, a pipeline is a conduit that goes all the way from start to end. It takes us from our convenient local coordinates all the way to screen-plus-depth coordinates for display.
- Reference frame: Users won't be tempted to apply a meaningful interpretation to `x/y/zAxis()` and `origin()`, since those concepts don't seem relevant to a pipeline.
- Discoverability: As noted above, `getPipeline()` is arguably more discoverable than `screenToWorld()` and `worldToScreen()`, due to the way that it fits into a cohesive set of transform features.
Getting the active camera
My original thinking was that we could have only camera(). We could deprecate and ultimately remove setCamera(), replacing it with an overload to camera(). So, we'd have the following overloads:
```js
camera([x], [y], [z], [centerX], [centerY], [centerZ], [upX], [upY], [upZ]) // frustum is implicitly the same
camera(cam) // cam is a p5.Camera object
```
The feature would always return the current camera instance. If called without arguments, it acts as a simple getter. If called with arguments, it updates the current camera and then returns the updated camera.
This is at least roughly analogous to fill(), for example, which has overloads including the following:
```js
fill(gray, [alpha]) // some values are implied (e.g. in rgb, this would be gray-gray-gray I guess)
fill(color) // color is a p5.Color object
```
Currently, fill() doesn't have a documented return value, and I think it secretly returns the current p5 instance for chaining, but we should probably start having property setters like this act as getters when they don't receive any arguments.
Inverses
3D Replacement for screenToWorld()
oh also one other question: do you have thoughts on how the inverse direction would look? with the APIs described, the current worldToScreen can be replaced. To do screenToWorld, I guess we'd need the user to invert one of the matrices...
Yep! Users would be able to work with the inverse of a transform object through transform.invert(). For the equivalent of screenToWorld(), we'd do getPipeline().invert().
...or switch to using activeCamera().getTransform(CLIP, WORLD).
The problem here is that screenToWorld() doesn't actually convert from screen to world, or even clip to world. It's really easy to forget, but the name is incorrect, as you pointed out. It actually converts from clip to local space. That's what getPipeline().invert() would give us.
Although we could get the same transform with getTransform(CLIP, LOCAL), that would only work on a speculative Model class like the one I described in my previous post. The camera's getTransform() method shouldn't need to know about a model's local coordinates (in general usage, we could have many models in a scene, each with their own local coordinates).
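For illustration, a round trip with the proposed API might look like this (hypothetical, since none of these getters exist yet):

```js
// Hypothetical: round-trip a point through the proposed pipeline transform.
const localToScreen = getPipeline();          // proposed getter
const screenToLocal = localToScreen.invert(); // the reverse direction

const p = createVector(25, -25, 25);          // a point in local coordinates
const s = localToScreen.applyToPoint(p);      // screen-plus-depth coordinates
const back = screenToLocal.applyToPoint(s);   // approximately p again
```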
Transform vs. Matrix clarification
Is matrix inverse an API we're making public?
The invert() method would be a user-facing feature of the Transform class. But to be clear, we'd describe this as the transform that goes in the reverse direction, not as an inverse matrix. In the design I'll be proposing, Transform is distinct from Matrix in precisely the same way that Color is distinct from Vector:
- The `Transform` class provides special interpretations like `xAxis()`, `yAxis()`, `zAxis()`, and `origin()` for the columns of a small, fixed-size matrix. The `Matrix` class provides general matrices of any shape. There is a small amount of overlap, in that they both support a method for inversion.
- The `Color` class provides special names like `red()`, `green()`, `blue()`, and `alpha()` for components of a small, fixed-size vector (or a similar internal data structure). The `Vector` class provides general vectors of any length. There is a small amount of overlap, in that both have a method for interpolation.
2D replacement for screenToWorld()
There's the issue again about how going from a 2D screen to a 3D local coordinate is weird because you sorta need to start from a 3D screen space, but I think defaulting z to 0 like we currently do is good enough.)
Right, so a 3D version of screenToWorld() is already provided by getPipeline().invert(), as I mentioned above. That just leaves the situation where the user wants to provide 2D coordinates only, without specifying a depth value. In general, there are infinitely many possible 3D outputs for a given 2D input like that, so this only seems to make sense if the user really wants a 2D output. In fact, although the documentation for screenToWorld() says it converts from 2D coordinates to 3D coordinates, it's effectively designed to produce 2D output in this case:
- It outputs a z-coordinate of (approximately) zero—this isn't currently documented, but that's how it works, as you noted.
- Based on the formula it uses, it appears to manage this by assuming that the local x-axis and y-axis are in the plane of the screen, which again would be valid for a 2D sketch.
- The code example on the reference page actually uses `P2D` mode, not `WEBGL` mode.
- The feature's author said in the PR that "Sadly, the function is not too useful for 3D as-is".
So, I think that practically speaking, this feature is really meant for converting from 2D screen space to 2D local space. I don't think we need to think about converting from 2D to 3D space. This leaves us with two scenarios to consider.
Scenario 1: 2D sketch in P2D mode
In this scenario, users could use getTransform().invert(). Pragmatically, maybe we should stop here and call it a day. But I'll sketch out a second scenario just in case.
Scenario 2: 2D sketch in WEBGL mode (or another fundamentally 3D renderer)
Here, I'm referring to a sketch where users specify only x and y coordinates. Under the hood, there are z-coordinates, but they're all zero, and the user never sees them, so they think of the sketch as being entirely 2D.
If users want to handle coordinate conversions manually in this scenario, that means they're venturing into do-it-yourself territory. So, maybe it's not unreasonable to ask that they recognize they're working with WEBGL, and that this means they have to supply clip coordinates. They could do that with getPipeline().invert().
However, users would need to determine the depth coordinate in clip space that would produce a local z-coordinate of zero. I need to look into this a bit more to figure out what they'd need to do. If it's simple enough, perhaps we could just provide a code example on the getPipeline() reference page.
In the future we could possibly read from the depth buffer to supply the depth of whatever's visible on screen in WebGL mode, although that would first require using a main canvas framebuffer to get read access to depth info, so it's probably not something we'd want to do in the short term. I think it makes sense in the mean time to direct users to do their comparisons in screen space for 3D sketches to avoid the possible pitfalls (and maybe having a traditional image-based 3D picking example somewhere?)
If users want to handle coordinate conversions manually in this scenario, that means they're venturing into do-it-yourself territory. So, maybe it's not unreasonable to ask that they recognize they're working with WEBGL, and that this means they have to supply clip coordinates. They could do that with getPipeline().invert().
Ironing out some details here: the end of the full pipeline is a bit ambiguous in WebGL mode. In 2D mode, screen/clip space is kind of the same, being relative to the top left and being in the range [0, width]/[0, height] when the mouse is over the canvas. In WebGL, the matrices map coordinates to [-1, 1] in x and y because that's what WebGL needs. The range of what you can see on screen at z=0 is in the [-width/2, width/2]/[-height/2, height/2] range. The mouse coordinates are still in the same space as in 2D mode.
In the scenario where you're wanting to check the mouse coordinates against a rotated rectangle, to make that work, you'd want to be using a matrix that handles screen/mouse space coordinates, and not the [-1, 1]-ranged one the (P * V * M)⁻¹ matrix gives you. The current worldToScreen/screenToWorld methods do that extra step for you, and I think this makes sense to keep. Does keeping the target space of getPipeline as the [0, width]/[0, height] space make sense to you too?
In this scenario, users could use getTransform().invert(). Pragmatically, maybe we should stop here and call it a day. But I'll sketch out a second scenario just in case.
Ideally I'd like as much as possible from the 2D-sketch-in-2D-mode scenario to translate directly to 2D-sketch-in-WebGL-mode, which screenToWorld/worldToScreen do. getTransform() works in 2D mode but would produce incorrect results for WebGL mode, while it sounds like getPipeline() would work for both if getPipeline is effectively the same as getTransform in 2D mode? How do you feel about directing users to getPipeline even in 2D mode to set them up for an easier transition if they do a WebGL mode sketch?
Hi Dave!
This is exciting. I think you made great points, and this gave me a new idea. I wrote up a draft reply but it's half baked at the moment. I'll post a proper reply when I get a chance (probably tomorrow).
Hi Dave!
Here are my thoughts about the points you raised. I look forward to your feedback.
Picking
I think it makes sense in the mean time to direct users to do their comparisons in screen space for 3D sketches to avoid the possible pitfalls (and maybe having a traditional image-based 3D picking example somewhere?)
Yeah, it could be useful to demonstrate picking (a.k.a. hit-testing) with a color id, e.g. in the Examples section of the website. It’s a helpful technique for both 2D and 3D sketches. It'd also demonstrate a nice use of a Framebuffer in WEBGL mode or a Graphics object in P2D mode. Also, in the 3D case, this wouldn’t require working with both 3D local coordinates and 2D screen coordinates like ray casting would.
Output of getPipeline()
Does keeping the target space of getPipeline as the [0, width]/[0, height] space make sense to you too?
Yep! This detail was in the back of my mind. I’m really glad you brought it to the surface. It leads to a nice improvement, as well as some interesting observations about the power of the API, including how it relates to the depth buffer.
Screen coordinates
In addition to being convenient, returning screen coordinates would also be consistent: the standard graphics pipeline really ends with the screen, so it makes sense for our pipeline transform to end there too. (This makes sense for both the P2D and WEBGL renderers, and the screen’s coordinate system is the same in both cases. So, the fact that they have different origins in world coordinates doesn't cause any real inconsistency.) I can revise my original API description to indicate that getPipeline() applies not just the model-view-projection transform, but actually a model-view-projection-viewport transform.
Depth coordinate
I do think it's worthwhile to consider the full 3D output. Right now, worldToScreen() outputs values in [0, width] x [0, height] x [0, 1], where 0 and 1 correspond to the depth of the near and far viewing planes, respectively. It's not documented that way, but that's how it works (see this demo). I think this is also how WebGL's depth buffer works (see e.g. depthRange()). Basically, after the projection transform and perspective division convert depths to [-1, 1], they're mapped to [0, 1].
It'd make sense for the pipeline transform to output values in this same range, and we could document its output as “screen-plus-depth coordinates.” We can explain that the depth coordinate indicates the depth from which the point originates (relative to the camera), before it's mapped to the screen. Having depth values in [0, 1] may also be convenient in some ways, since 0 and 1 always correspond to the nearest and farthest depths that can be viewed, even if we don't know the camera's settings.
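For reference, the conventional mapping being described here is just the following (standard viewport math, not a new API; the y-flip assumes screen coordinates measured from the top left):

```js
// Conventional mapping from NDC (each coordinate in [-1, 1]) to
// screen-plus-depth coordinates in [0, width] x [0, height] x [0, 1].
function ndcToScreen(ndc, width, height) {
  return {
    x: (ndc.x + 1) / 2 * width,   // [-1, 1] -> [0, width]
    y: (1 - ndc.y) / 2 * height,  // [-1, 1] -> [0, height], y measured from the top
    z: (ndc.z + 1) / 2            // [-1, 1] -> [0, 1], near plane -> far plane
  };
}
```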
However, it's interesting to note that the Unity scripting API outputs depth values in the [near, far] range with its Camera.WorldToScreenPoint(). Then the depth value represents depth from the camera in world units. This may be convenient for game development.
Fortunately, the proposed API is so flexible that it handles that case too! Say we have a point in screen-plus-depth space, and we want to know its depth from the camera in world units. We just grab the transform that takes us to eye space, with camera.getTransform(SCREEN, EYE). Once we transform into eye space, the z-value tells us the depth from the camera. In particular, this would allow us to easily convert values from the depth buffer to world depth. Nice! (We just need to add SCREEN to the list of supported spaces, which is an improvement that I'll elaborate on below.)
Clip space vs. NDC
Your comments brought another detail to the surface. If we were to allow users to supply 4D homogeneous coordinates $(x, y, z, w)$, then the projection transform would end in 4D clip space; however, since we're abstracting that away behind applyToPoint() (and similar methods), we'd apply perspective division on behalf of the user. That'd mean the model-view-projection transform would end in NDC space $[-1, 1]^3$. More precisely, any point $(x, y, z, w)$ with $x$, $y$, and $z$ each in $[-w, w]$ will end up there (other points would ordinarily be clipped).
Based on this, it might be more precise to replace the CLIP parameter with NDC in the proposed API. However, clip space is the usual term for the target space of the projection transform, and these two terms are sometimes used interchangeably (e.g. see this MDN explainer).
Also, for cases where we need to work directly with homogeneous coordinates, we can still do that. For example, we could use projectionTransform.toMatrix().matrixMult() instead of projectionTransform.applyToPoint(). So, at least indirectly, we could use the projection transform to convert directly to clip space, without applying the perspective division that'd convert to NDC space. So I'll keep CLIP as the name for the target space of the projection transform, unless you disagree.
Viewport transform
In light of these points, we can incorporate a viewport transform that would convert from NDC space to screen-plus-depth space. It’d be the equivalent of the internal projectedToScreenMatrix. The proposed API could accommodate this by extending the list of spaces for the general transform getters, so that they include SCREEN.
We do run into a familiar naming problem here, since SCREEN would really refer to screen-plus-depth space. However, the name seems less problematic in this context; of the problems I identified regarding a "screen transform" name, only the "Documentation confusion" problem seems to apply. In this case, that seems solvable. We could document it without referring to a 3D screen, much like Unity does with its Camera.WorldToScreenPoint and Camera.ScreenToWorldPoint features. As I mentioned before, we can explain that the depth value indicates the depth from which a point originated, before it was projected onto the 2D screen.
getTransform() vs. getPipeline()
getTransform() works in 2D mode but would produce incorrect results for WebGL mode, while it sounds like getPipeline() would work for both if getPipeline is effectively the same as getTransform in 2D mode? How do you feel about directing users to getPipeline even in 2D mode to set them up for an easier transition if they do a WebGL mode sketch?
I'm not totally sure if I understand your suggestion here. The getTransform() feature is the only way to access the model transform. So in the 3D case, I think these have distinct, valid uses:
- `getTransform()`: local to world
- `getPipeline()`: local to screen (+ depth)
It's only in the 2D case that these collapse to the same behavior, since the world and screen spaces are identical in that case.
Uniformity across 2D-in-2D and 2D-in-3D
Ideally I'd like as much as possible from the 2D-sketch-in-2D-mode scenario to translate directly to 2D-sketch-in-WebGL-mode, which screenToWorld/worldToScreen do.
I appreciate you pushing for this. I 100% agree that we should aim for uniformity in 2D sketches across renderers, and I really don’t want a UX regression. Fortunately, on further inspection, I think there may be a nice solution! We could have getTransform() and getPipeline() each support dimension arguments TWO_D and THREE_D. These are the same arguments I was planning to propose for createTransform().
These arguments would only be needed if a 3D renderer is in use. Then, if THREE_D is the default, the only change that'd need to be made to convert 2D-in-2D sketch code to 2D-in-3D sketch code (aside from changing the renderer) would be to pass TWO_D to the transform getter. Here, getTransform(TWO_D) and getPipeline(TWO_D) would both return the 2D model transform (implemented with a $3\times3$ matrix), which would be extracted from the 3D model transform (implemented with a $4\times4$ matrix).
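As a rough illustration of that extraction (standard matrix bookkeeping rather than a committed implementation detail, and assuming column-major storage with the sketch confined to the z = 0 plane):

```js
// Extract a 2D affine (3x3) transform from a 4x4 model matrix by keeping the
// x/y rotation-scale block and the x/y translation. m is column-major.
function modelMatrix4To3(m) {
  return [
    m[0],  m[1],  0,  // first column:  x-axis (x and y components)
    m[4],  m[5],  0,  // second column: y-axis (x and y components)
    m[12], m[13], 1   // third column:  translation (x and y components)
  ];
}
```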
We could even make 2D-in-2D sketch code work as-is for a 2D-in-3D sketch, if we make TWO_D the default. Then getTransform() and getPipeline() would return the 2D model transform in both cases. That might be slightly awkward, though, since the user would need to explicitly ask for a 3D transform when they're using a fundamentally 3D renderer.
Thoughts? I'm assuming for now that we aren't planning on having an explicit global 2D/3D mode.
Edit: Added more detail to the section about the output of getPipeline(), and clarified that we can work with the projection transform directly (without perspective division) via toMatrix(), e.g. in case we need to do that internally.
I'm not totally sure if I understand your suggestion here. The `getTransform()` feature is the only way to access the model transform. So in the 3D case, I think these have distinct, valid uses:
Agreed! I'm suggesting that when documenting usage of this in 2D mode, we stick to getPipeline in examples as WebGL has a superset of the possibilities of 2D here. So e.g. for the use case of checking collisions between the mouse and a shape in 2D mode, even though both getTransform() and getPipeline() would work, we would show getPipeline() in examples so that the same mouse collision detection code would continue to work in 3D. I'm imagining then we would document getTransform() as being primarily for WebGL, and showing just the 3D-specific use cases there. I think that resolves most of the similarity I was hoping to get between 2D and 3D!
The TWO_D/THREE_D idea is interesting though, and could be useful for cases where we aren't doing a comparison with the mouse (e.g. if you wanted to place a marker at the end of a chain of joints, whose orientation doesn't change as the joints change.)
Hi Dave!
I think we're on the same page on the broad themes. I'm not totally clear about some of your points, though. We might just have different mental models, so I'll try to clarify a few things.
TWO_D and THREE_D
The TWO_D/THREE_D idea is interesting though, and could be useful for cases where we aren't doing a comparison with the mouse
When I proposed TWO_D/THREE_D as parameters of getTransform() and getPipeline(), I actually had mouse interaction in mind. So we might have some kind of disconnect here. Specifically, let's say we want to allow a 2D-in-P2D sketch to run as a 2D-in-WEBGL sketch. In WEBGL mode, getPipeline() returns a transform object that operates on 3D vectors. It's an instance of the general Transform class, so it wouldn't guess a z-coordinate like screenToWorld(). Instead, a user would leverage getPipeline(TWO_D).invert() to extract a 2D transform from the built-in 3D transform. I might need to work through the details to confirm that what I'm proposing makes sense, but from a high level, I think it'd work.
getTransform() vs. getPipeline() (redux)
I'm imagining then we would document getTransform() as being primarily for WebGL, and showing just the 3D-specific use cases there.
I love how you're thinking ahead about the documentation. Clearly delineating getTransform() and getPipeline() is definitely important. But, I must be thinking about the details differently than you. I think these two functions would both usefully generalize the 2D functionality to 3D sketches, but in different ways. For some 2D use cases, I think we'd generalize with getTransform(), and for other 2D use cases, I think we'd generalize with getPipeline(). So I think both features would have 2D examples, but the use cases would be different.
getTransform()
As an example, suppose we add a feature for interpolating between two given transforms (this is on my list of candidate features). An application would be the animation of an object or camera moving from one pose (position plus orientation) to another. The Godot docs provide a nice animation illustrating this use case. For the sake of the current discussion, let's say we want to draw a shape in fixed start and end poses in every frame, and we also want to draw it in an intermediate pose that moves from the start to the end across frames. I think we'd want to use getTransform() for this, since it'd generalize from the 2D-in-P2D case, to the 2D-in-WEBGL and 3D-in-WEBGL cases.
Overall, my intuition is that getTransform() is the more basic feature, for a few reasons. First, the name suggests it's how we get the transform that's built with the familiar standalone features. Second, the name is actually the same as the corresponding native method on CanvasRenderingContext2D. Third, "pipeline" typically refers to a 3D context, so having getTransform() be the feature that's primarily 3D seems a bit backwards. (Although, I wouldn't really say that getPipeline() is primarily 3D either.)
getPipeline()
When it comes to using screen coordinates to interact with an object in local coordinates—as in the current code example for screenToWorld()— I think getPipeline(TWO_D).invert() would allow a 2D-in-P2D sketch to run as a 2D-in-WEBGL sketch. So for this case, we'd want the 2D code example to go on the getPipeline() reference page, as you suggested.
Next steps
I hope some of that made sense. I think it might be helpful to schedule a real-time discussion so we can iron out what's left (including your chain-of-joints example, and picking / hit-testing). We could post a summary of our conclusions here.