Social Intelligence in Humans and Robots
Workshop RSS 2024 - July 19 (1:45pm - 6pm in Delft, Netherlands)
In-person location: ME Hall D - Watt
Video recordings are available on our YouTube channel: link
Time (Delft Time, Netherlands) | Session |
---|---|
1:45 pm - 2:00 pm | Organizers: Introductory Remarks |
2:00 pm - 2:30 pm | Carolina Parada: Foundation Models for Social Robots |
Abstract: Foundation models have unlocked major advancements in AI. In this talk, I will discuss examples of how foundation models are enabling a step change in progress towards more social robots, including enabling robots to understand, reason in context, hold situated conversations with humans, create expressive robot behaviors, transfer visual and semantic understanding to real-world actions, and learn from humans as they interact with them. It is still early in this research journey, but it is an exciting one: we can be part of this fast-moving and dynamic field of foundation models and not only ride the wave of innovation but help shape it. Foundation models still have significant gaps in human-robot interaction contexts. I will share early insights showing that HRI could be key to evolving the foundation models themselves, enabling even more powerful interactions, and improving robot learning over time.
2:30 pm - 3:00 pm | Yonatan Bisk: Everything Fails, Everything is Ambiguous |
Abstract: It's easiest to assume we live in a world of perfectly executable programs, where the context is clear, modules work, and language is unambiguous. In fact, many of us pretend that is the case, eschewing issues of uncertainty or theory of mind. But in the real world, tools (real and LM) fail, language is ambiguous, and the people around us change the context. So, how do we build agents that account for all of these variables, and what new challenges do they introduce? In this talk, I'll introduce how we've been thinking about the problem, and some preliminary steps.
3:00 pm - 3:30 pm | Andreea Bobu: Aligning Robot and Human Representations |
Abstract: To perform tasks that humans want in the world, robots rely on a representation of salient task features; for example, to hand me a cup of coffee, the robot considers features like efficiency and cup orientation in its behavior. Prior methods try to learn both a representation and a downstream task jointly from data sets of human behavior, but this unfortunately picks up on spurious correlations and results in behaviors that do not generalize. In my view, what’s holding us back from successful human-robot interaction is that human and robot representations are often misaligned: for example, our assistive robot moved a cup inches away from my face -- which is technically collision-free behavior -- because it lacked an understanding of personal space. Instead of treating people as static data sources, my key insight is that robots must engage with humans in an interactive process to find a shared representation that enables more efficient, transparent, and seamless downstream learning. In this talk, I focus on a divide-and-conquer approach: explicitly focus human input on teaching robots good representations before using them for learning downstream tasks. This means that instead of relying on inputs designed to teach the representation implicitly, we have the opportunity to design human input that is explicitly targeted at teaching the representation, and can do so efficiently. I introduce a new type of representation-specific input that lets the human teach new features, I enable robots to reason about the uncertainty in their current representation and automatically detect misalignment, and I propose a novel human behavior model to learn robust behaviors on top of human-aligned representations. By explicitly tackling representation alignment, I believe we can ultimately achieve seamless interaction with humans where each agent truly grasps why the other behaves the way they do.
3:30 pm - 4:00 pm | Break and Poster Session |
4:00 pm - 4:30 pm | Contributed Talks |
4:30 pm - 5:00 pm | Michael Franke: Understanding Language Models: On Japanese Rooms & Minimal World Models |
Abstract: Searle’s famous Chinese Room Argument is an excellent tool for probing our intuitions about why rule-based AI systems are not felt to develop internal understanding in spite of superficially great input-output performance in language use. Building on previous related work (e.g., Bender & Koller’s Octopus Test), I present a thought experiment more closely parallel to Searle’s, which I call the Japanese Room Argument, to serve as a scaffolding for intuitions about whether language models generate a form of language understanding *by necessity* if scaled in training size and model capacity to approximate perfect input-output alignment with humans. To complement the intuitive JRA, I also present a formal Minimal Models Argument, which goes roughly like this: if the world models of humans and LMs are each close to optimal for their respective purposes, the world models of LMs will almost surely differ from those of humans, as human representations are likely to be co-optimized for multiple tasks, including non-linguistic tasks and linguistic tasks that require *normative* social interaction embedded in time.
5:00 pm - 5:30 pm | Séverin Lemaignan: Modelling the Social Sphere in the Age of LLMs |
Abstract: While LLMs and related foundation models are transforming how we interact with robots, most of the research to date focuses on mobile manipulation tasks, usually with little account of social interaction. One key to unlocking this 'social intelligence for robots' is the design of a general model of the social sphere -- relationships between people, their mental states, their semantically rich interactions with their environment -- that would be appropriate for integration with, e.g., LLMs. In this talk, I will present our efforts in this direction, introducing our preliminary work on *social embeddings* and how they can be operationalized in real-world cognitive architectures for interactive robots.
5:30 pm - 6:00 pm | Jacob Andreas: Good Old-fashioned LLMs (or, Autoformalizing the World) |
Abstract: Classical formal approaches to artificial intelligence, based on manipulation of symbolic structures, have a number of appealing properties---they generalize (and fail) in predictable ways, provide interpretable traces of behavior, and can be formally verified or manually audited for correctness. Why are they so rarely used in the modern era? One of the major challenges in the development of symbolic AI systems is what McCarthy called the "frame problem": the impossibility of enumerating a set of symbolic rules that fully characterize the behavior of every system in every circumstance. Modern deep learning approaches avoid this representational challenge, but at the cost of interpretability, robustness, and sample-efficiency. How do we build learning systems that are as flexible as neural models but as understandable and generalizable as symbolic ones? In this talk, I'll describe a recent line of work aimed at automatically building "just-in-time" formal models tailored to be just expressive enough to solve tasks of interest. In this approach, neural sequence models pre-trained on text and code are used to place priors over symbolic model descriptions, which are then verified and refined interactively, yielding symbolic planning representations for sequential decision-making. The resulting systems provide human-interpretable traces of behavior and can leverage human-like common-sense and background knowledge during planning, without requiring human system designers in the loop.
6:00 pm - 6:02 pm | Organizers: Closing Remarks |