VisionOS: A New and Exciting XR Development Framework

Umar Patel
10 min read · Sep 9, 2024


My experience and takeaways from a year and a half of development with SwiftUI, RealityKit, and MapKit in the visionOS environment.

An immersive historical map experience for visionOS

Introduction

VisionOS is Apple’s operating system built specifically for the Apple Vision Pro, the new headset Apple released earlier this year. Developers, however, have been able to build, test, and experiment with the framework for just over a year, since June 2023, when the beta version of visionOS became available following the headset’s announcement. After developing several applications over the past year at the Human-Computer Interaction Lab and, more recently, the David Rumsey Map Center at Stanford University, I wanted to share my perspective on developing with the new framework, and on the opportunities spatial computing provides to build on existing structures and features in Apple’s developer toolkit, in contexts ranging from furniture placement to historical mapping.

Framework Overview

As a developer and user, it’s clear that the visionOS framework allows for a spatially dynamic user experience that is hard to come by outside of an immersive space. If you’ve tried out some existing visionOS apps, or even some of Apple’s sample projects, one of the first things you might notice (both on the headset and in the simulator) is the flexibility to move window groups around your space for highly customized positioning of app content. We’ve all felt the frustration of having 20+ Chrome windows open at once: on a 2D screen, it becomes difficult to see the different windows simultaneously and a hassle to constantly switch between them. In visionOS, developers and users alike can take advantage of the full 360° environment to position 2D window groups and RealityKit content.

In this article, I wanted to share five key points that defined my development experience on the new platform and how to take advantage of existing and established iOS frameworks that translate well to visionOS. Let’s get into it!

1. VisionOS Is Made for Multimodal Interaction

Last summer, as part of Stanford’s Human-Computer Interaction Lab, I built a furniture placement demo application using the visionOS framework (the beta version at the time). Our goal was to show how leveraging multiple modes of interaction (gaze, hand gestures, and speech) simultaneously could lead to more natural and seamless experiences with the environment. When you consider real-world spatial interactions such as placing and moving furniture, it’s easy to see why this matters. If you are rearranging items in a space, you will more often than not point to the items you are referring to with your hands, or look at the region of the space you want to draw attention to. You might also use deictic words or phrases such as “this”, “that”, or “over there”, which only convey position when coupled with the hand or eye gestures overlaid on top of the dictation.

An experimental furniture demo application analyzing multimodal interaction.

However, it can be difficult to interpret and convert multimodal inputs coming from different actions into executable code (i.e., translating “Move that over there”, where the user looks at a chair and points to a new position in the room, into “update the position of entity A from (a, b, c) to (x, y, z)”). In visionOS, though, you can combine the text a user speaks into the microphone (transcribed with the SFSpeechRecognizer class), the user’s gaze, and spatial tap gestures derived from the user’s interaction with surrounding RealityKit entities. Ultimately, we used an algorithm that captures the timestamp of each segment (or word) of the utterance and matches it to the entities that the user’s gaze and gestures correspond to.

An annotated rotate command which opens a SwiftUI attachment for the corresponding entity.

With this in place, a user can dictate a common utterance such as “Move that over there” while looking at and tapping the entity and position of interest; the corresponding gesture information is then used to annotate the command so that it can be converted into a more conventional machine instruction (such as moving an entity from position a to position b in world space). Ultimately, this lets you leverage the three modalities together to make accurate changes to the environment based on natural, more realistic user commands and gestures.
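To make that matching step concrete, here is a simplified, hypothetical sketch of the idea (not our production code): each transcribed word carries a timestamp from the speech recognizer’s transcription segments, each gaze or tap event carries a timestamp and the entity it landed on, and deictic words are paired with the event nearest in time. The SpokenWord, SpatialEvent, and annotate names are purely illustrative.

```swift
import Foundation

struct SpokenWord {
    let text: String
    let timestamp: TimeInterval   // e.g. from SFTranscriptionSegment.timestamp
}

struct SpatialEvent {
    let entityName: String        // entity the user looked at or tapped
    let timestamp: TimeInterval
}

// Pairs deictic words ("this", "that", "there", ...) with the spatial event
// closest to them in time, producing (word, entity) annotations.
func annotate(words: [SpokenWord], events: [SpatialEvent]) -> [(String, String)] {
    let deictics: Set<String> = ["this", "that", "there", "here"]
    return words
        .filter { deictics.contains($0.text.lowercased()) }
        .compactMap { word -> (String, String)? in
            guard let nearest = events.min(by: {
                abs($0.timestamp - word.timestamp) < abs($1.timestamp - word.timestamp)
            }) else { return nil }
            return (word.text, nearest.entityName)
        }
}
```

A real implementation would also need to disambiguate between gaze and tap targets for the same word and handle commands with no matching event, but the time-alignment idea is the core of it.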

2. 2D Content Retains an Important Place in VisionOS

With the release of the Apple Vision Pro in February, I remember many conversations in the developer space concluding that the best applications would be the ones that minimized 2D content and focused on immersive spaces and interaction with the 3D environment and virtual entities. What I’ve found through my own development experience on the Vision Pro headset, however, is that while there is certainly a lot of potential in working with 3D content in visionOS, fostered primarily through Apple’s RealityKit and ARKit frameworks, window groups are not merely movable SwiftUI views. 2D window groups can be highly integrated with each other and with the immersive spaces the app uses. For instance, you might want a feature that opens an info panel when the user interacts with a RealityKit entity or another window group, or voices a specific command. You might also want to anchor attachments displaying SwiftUI views in RealityKit to certain entities or positions to act as labels, signs, or interactive 2D content in an immersive space.
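As a rough illustration of this pattern (the view name, marker entity, and the “poi-info” window ID are placeholders, not code from the projects shown here), a spatial tap on a RealityKit entity can open a separate window group through the openWindow environment value:

```swift
import SwiftUI
import RealityKit

struct POIMarkerView: View {
    @Environment(\.openWindow) private var openWindow

    var body: some View {
        RealityView { content in
            // A tappable marker entity; it needs collision + input-target
            // components so it can receive spatial gestures.
            let marker = ModelEntity(mesh: .generateSphere(radius: 0.05))
            marker.components.set(InputTargetComponent())
            marker.components.set(
                CollisionComponent(shapes: [.generateSphere(radius: 0.05)]))
            content.add(marker)
        }
        // A spatial tap on the entity opens the (hypothetical) "poi-info" window group.
        .gesture(
            SpatialTapGesture()
                .targetedToAnyEntity()
                .onEnded { _ in
                    openWindow(id: "poi-info")
                }
        )
    }
}
```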

A user’s command (through speech recognition) triggers the corresponding feature on the cockpit to begin flashing.

Additionally, in a map overlay experience filled with point-of-interest annotations, a user can tap one to bring up a window displaying more information about the POI, which can in turn offer more options to explore the point of interest further (such as opening an immersive view or zooming into the specified region of the map). We will discuss maps more in the next section.

Opening 2D windows and immersive spaces in a historical map experience.
A RealityKit attachment displaying a ScrollView of color options when the user requests to change the color of an entity.

Of course, it is still quite early, so we will have to see where the developer space and future applications head. My bet, though, is that 2D and 3D content alike will be used in equally unique and engaging ways in future visionOS applications.

3. Maps (MapKit) Offer Highly Interactive Experiences in XR

There was a pretty cool moment this past summer watching one of my co-workers test an immersive map application we were building: he would throw the window displaying the map view across the room, zoom in with the two-handed pinch-out gesture, walk straight up to the map, and, smiling, begin interacting with the annotations. As a map fanatic growing up, I completely relate to the enjoyment of playing with maps, but it was even more meaningful to see someone else engage with a map in a way that wouldn’t be possible on a normal 2D screen.

Being able to resize and interact with map annotations is in and of itself interesting. But what is also intriguing is the layered content that can make map interactions in an immersive space dynamic. For instance, map overlays that display temporally or contextually different material add an extra dimension of exploration, such as toggling overlays showing different data or zooming into specific regions of interest. And in visionOS, you can add annotations for points of interest that allow you to zoom in further or even enter a fully immersive space displaying content related to that site.

Zooming into an arrondissement of Paris and toggling the historical overlay.

On a more technical note, MapKit’s MKMapView class (a UIKit view) provides an easy-to-use addOverlay method that lets you place image overlays using coordinate span data. You can also configure coordinate-based buttons and add annotations to your map, using them as markers for exploring different regions of your geographical content. You can then integrate SwiftUI elements by opening windows or immersive spaces when the user interacts with those points. My experience implementing the overlay and annotations purely in SwiftUI was more complex, mostly because of the difficulty of mapping screen-space coordinates onto the map view; that approach proved hard to maintain in dynamic cases like updating UI element positions when zooming in or out or changing the window size.
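Here is a hedged sketch of that overlay approach: a custom MKOverlay that covers a coordinate region and a renderer that draws an image into it, which you would add with mapView.addOverlay(_:) and return from the delegate’s rendererFor method. The class names and the idea of deriving the bounding map rect from an MKCoordinateRegion are my own illustration, not the exact code from the project above.

```swift
import MapKit
import UIKit

// A hypothetical image overlay covering a coordinate region (e.g. a historical map scan).
final class HistoricalImageOverlay: NSObject, MKOverlay {
    let coordinate: CLLocationCoordinate2D
    let boundingMapRect: MKMapRect
    let image: UIImage

    init(image: UIImage, region: MKCoordinateRegion) {
        self.image = image
        self.coordinate = region.center
        // Convert the coordinate span into the map rect the overlay should cover.
        let topLeft = MKMapPoint(CLLocationCoordinate2D(
            latitude: region.center.latitude + region.span.latitudeDelta / 2,
            longitude: region.center.longitude - region.span.longitudeDelta / 2))
        let bottomRight = MKMapPoint(CLLocationCoordinate2D(
            latitude: region.center.latitude - region.span.latitudeDelta / 2,
            longitude: region.center.longitude + region.span.longitudeDelta / 2))
        self.boundingMapRect = MKMapRect(
            x: topLeft.x, y: topLeft.y,
            width: bottomRight.x - topLeft.x,
            height: bottomRight.y - topLeft.y)
    }
}

// Renders the overlay's image into its bounding rect.
final class HistoricalImageOverlayRenderer: MKOverlayRenderer {
    override func draw(_ mapRect: MKMapRect, zoomScale: MKZoomScale, in context: CGContext) {
        guard let overlay = overlay as? HistoricalImageOverlay,
              let cgImage = overlay.image.cgImage else { return }
        let drawRect = rect(for: overlay.boundingMapRect)
        // Core Graphics map-space drawing is flipped, so mirror vertically.
        context.scaleBy(x: 1.0, y: -1.0)
        context.translateBy(x: 0.0, y: -drawRect.size.height)
        context.draw(cgImage, in: drawRect)
    }
}

// Usage, inside an MKMapViewDelegate:
// mapView.addOverlay(HistoricalImageOverlay(image: scan, region: parisRegion))
// func mapView(_ mapView: MKMapView, rendererFor overlay: MKOverlay) -> MKOverlayRenderer {
//     overlay is HistoricalImageOverlay
//         ? HistoricalImageOverlayRenderer(overlay: overlay)
//         : MKOverlayRenderer(overlay: overlay)
// }
```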

4. Transitioning Between Views and Spaces Requires Careful State Bookkeeping

VisionOS allows for three different types of environment experiences: an augmented view that lets you place windows (displaying SwiftUI views and elements) in your space; bounded 3D scenes that present RealityKit content alongside those dynamic windows; and a fully immersive space that resembles a traditional virtual reality experience outside of the user’s current surroundings. These are called windows, volumes, and spaces, respectively.

Transitioning between the various experiences requires scrupulous app state tracking. If you’ve programmed in SwiftUI, you have at some point come across the @State and @Binding property wrappers when you need to pass references to variables to other views and pages that use them. In most iOS applications, this is typically done by declaring a state variable in a parent view, then passing its binding as a parameter to a child view, for example through a navigation link.

In visionOS, however, with the addition of window groups and immersive spaces that are initialized when the app loads, the reference-passing structure is a bit different in certain use cases. Variables shared across the various windows or spaces must be passed in to each window group and view that will use them during the app lifecycle.

When working with multiple windows and spaces, it’s very common to set Boolean state variables and bindings that indicate which views are currently open. This is especially important if you have content whose state is conditional on which views or spaces are open, or if a certain action depends on whether a state is active.
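As a minimal sketch of this bookkeeping (the @Observable model, scene IDs, and view names are assumptions of mine, not the app’s actual structure), a shared model can be injected into every scene and updated from onAppear/onDisappear:

```swift
import SwiftUI
import RealityKit

// Shared model tracking which scenes are open (name and fields are placeholders).
@Observable
class AppModel {
    var isImmersiveSpaceOpen = false
}

struct ContentView: View {
    @Environment(AppModel.self) private var appModel
    var body: some View {
        Text(appModel.isImmersiveSpaceOpen ? "Immersive space is open" : "Main window")
    }
}

struct ImmersiveView: View {
    var body: some View { RealityView { _ in } }
}

@main
struct HistoricalMapApp: App {
    @State private var appModel = AppModel()

    var body: some Scene {
        WindowGroup(id: "main") {
            ContentView()
                .environment(appModel)   // same instance shared with every scene
        }

        ImmersiveSpace(id: "map-space") {
            ImmersiveView()
                .environment(appModel)
                // Bookkeeping: record when the space actually appears and disappears.
                .onAppear { appModel.isImmersiveSpaceOpen = true }
                .onDisappear { appModel.isImmersiveSpaceOpen = false }
        }
    }
}
```

Keeping the flags in one shared model, rather than scattered @State variables, makes it easier to reason about which scenes are open when several windows and a space can all affect each other.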

In most cases, SwiftUI’s onAppear and onDisappear closures for views are a great place to set and update these variables. You might also set them when opening or closing windows or immersive spaces through the corresponding environment values, explained more here. But things get trickier when you are in a fully immersive space, as you have to be careful when dismissing a single window that is present in the scene. In some instances, closing the window using the window toolbar will not dismiss the immersive space (even if you set the correct conditions and instructions in the onDisappear closure), and the user might be stuck with no way to leave (a bit scary, don’t you think?). However, I’ve found there are certain safety measures developers can take to ensure the user can still escape, one of which is leveraging RealityKit attachments. These are essentially SwiftUI views that you can anchor in an immersive space and that remain fully interactive. Placing an attachment in your space with a button or link that dismisses the immersive space can prevent a user from accidentally dismissing a window view and having to restart the app if they want to leave.

An attachment anchored in the immersive space that a user can use to return to the main experience (in case they accidentally dismissed the window displaying the close button).
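Below is a hedged sketch of that safety measure: a RealityKit attachment anchored in the space whose button reopens the main window and then dismisses the immersive space. The attachment ID, window ID, and positioning are placeholder choices.

```swift
import SwiftUI
import RealityKit

struct EscapeHatchView: View {
    @Environment(\.dismissImmersiveSpace) private var dismissImmersiveSpace
    @Environment(\.openWindow) private var openWindow

    var body: some View {
        RealityView { content, attachments in
            if let panel = attachments.entity(for: "escape-button") {
                // Anchor the button roughly at eye height, a meter in front of the user.
                panel.position = [0, 1.4, -1.0]
                content.add(panel)
            }
        } attachments: {
            Attachment(id: "escape-button") {
                Button("Back to main experience") {
                    Task {
                        openWindow(id: "main")         // reopen the 2D window first
                        await dismissImmersiveSpace()  // then leave the space
                    }
                }
                .padding()
                .glassBackgroundEffect()
            }
        }
    }
}
```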

In some instances, pressing the Digital Crown might be an option for escaping the immersive view, but in my development experience there is some unpredictability with that action, so it is likely more stable to use a SwiftUI attachment to exit the space. With that said, at this point I believe there is no “right” or “wrong” way to escape an immersive view. The most important thing is to do what makes the most sense for your application and keeps the state variables in the desired state.

5. It Is Imperative as a Developer to Be Comfortable Using the Simulator

In an ideal situation, all developers, beginners and experienced alike, would have full access to an Apple Vision Pro to develop and test their applications on. However, this may not always be the case. That makes it vital for developers to take advantage of the simulator to test their immersive experiences and ensure the functionality is on par with expectations. While this is fairly straightforward for the SwiftUI elements of your visionOS product, it can be tricky to test ARKit without the headset. In particular, important features such as plane detection are not supported on the visionOS simulator at the moment, which can make it difficult to test any collision features you might have in your app.

But this should not discourage you from including such features in your app. One workaround I’ve found that works well is to create plane entities that simulate real-world features like floors, walls, and tables, and to add the corresponding collision or input target components to them. You can also create custom component tags and set them on those entities to distinguish them from other RealityKit entities in your space. In your code, you can then dictate what happens when the different entities interact. While certainly a bit makeshift, this helps with testing RealityKit features in the simulator and keeps the overall development process moving.

A green plane used as a floor entity to test collisions and spatial tap gestures on the visionOS simulator.
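For reference, here is a rough sketch of that workaround (the entity size, color, and custom component name are all illustrative): a hand-placed floor plane carrying collision and input-target components plus a custom marker component so it can be told apart from other entities.

```swift
import RealityKit
import UIKit

// Custom marker component used to distinguish simulated surfaces from other entities.
// Register it once at launch: SimulatedSurfaceComponent.registerComponent()
struct SimulatedSurfaceComponent: Component {
    var kind: String  // e.g. "floor", "wall", "table"
}

// Builds a green plane that stands in for the real floor while testing in the simulator.
func makeSimulatedFloor() -> ModelEntity {
    let floor = ModelEntity(
        mesh: .generatePlane(width: 4, depth: 4),
        materials: [SimpleMaterial(color: .green, isMetallic: false)])
    floor.position = [0, 0, 0]
    // Collision so other entities and gestures can hit the plane.
    floor.components.set(
        CollisionComponent(shapes: [.generateBox(width: 4, height: 0.01, depth: 4)]))
    // Input target so spatial tap gestures register on the plane.
    floor.components.set(InputTargetComponent())
    floor.components.set(SimulatedSurfaceComponent(kind: "floor"))
    return floor
}
```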

Conclusion

Ultimately, what I’ve found is that the visionOS framework provides a tremendous amount of opportunity to develop engaging and immersive content in a wide range of fields. Because it integrates easily with existing iOS frameworks, it allows developers to focus on leveraging the capabilities of more XR-geared tools (such as RealityKit and ARKit) on top of the more established SwiftUI and UIKit features and APIs. At the same time, the structure of window groups and immersive spaces offers new and unique ways for users to interact with 2D content in a dynamic environment. Good luck developing more spatial computing experiences!

Written by Umar Patel

Umar here! I'm a UX & XR developer who graduated with a Master's in CS from Stanford University and am currently working as a Research Developer at Stanford HAI.
