Tripton Art – Online Guide to Monetizing Creativity

10 Links to the Hottest Artificial Intelligence News – AI News – AI Blog

Artificial Intelligence News: Make your own mockups online with AI

Staying informed about Artificial Intelligence (AI) and following AI news is essential for several reasons. First, it provides a competitive edge: knowledge of AI trends helps individuals and businesses identify opportunities and stay ahead. Second, staying updated fuels innovation, allowing exploration of new applications. Third, as AI adoption accelerates across industries, staying relevant hinges on understanding its impact. Finally, networking within the AI community fosters collaboration and diverse perspectives. In this cosmic dance of technology, staying informed isn't optional; it's our navigational star toward limitless horizons. To stay up to date with the latest in AI, check out the updates below.

Get the Scoop

  • OpenAI Created Her: The Birth of GPT-4o
    by Steve Digital on May 13, 2024 at 4:48 pm

    OpenAI’s GPT-4o, with its advanced multimodal capabilities, real-time responsiveness, emotional intelligence, and potential for personalized interactions, brings us significantly closer to the human-like AI depicted in the movie “Her.”

  • Q*
    by Steve Digital on November 23, 2023 at 6:18 am

    The maker of ChatGPT had made progress on Q* (pronounced Q-Star), which some internally believe could be a breakthrough in the startup’s search for superintelligence, also known as artificial general intelligence (AGI).

  • The 10 stages of robots becoming our new overlords
    by Steve Digital on May 4, 2023 at 5:34 am

    Today we examine how robots go from friendly, useful, and helpful to evil, sinister (and killer), with an analysis of the transition from each stage to the next.

  • 360° 3D
    by Steve Digital on April 6, 2023 at 5:14 am

    You can now create an interactive 3D landscape using AI. All it takes is a prompt describing the scenery.

  • To D-ID or not to D-ID?
    by Steve Digital on January 23, 2023 at 2:19 pm

    We created this video with D-ID in under 5 minutes … here’s how you can do it too …

Latest News

Artificial Intelligence for Everyone

AI Generated Images

  • Modern AI Image Generation Technologies
    by DDG on May 22, 2024 at 1:09 pm

    A Simple Introduction to AI Image Algorithms and Hardware
    The rapid evolution of AI image generation technologies has dramatically transformed the landscape of visual arts. These technologies leverage advanced machine learning algorithms and powerful hardware to create stunning and innovative artworks. This post will help the technically curious reader gain a general understanding of how […]

  • AI Art and the Concept of Beauty
    by DDG on May 17, 2024 at 11:45 am

    AI Art Aesthetics
    The concept of beauty in art has been a subject of philosophical debate for centuries. The branch of philosophy that strives to understand beauty is called aesthetics. From the classical ideals of harmony and proportion in ancient Greece to the subjective interpretations of beauty in modern art, the definition of what makes […]

  • The Ethics of AI Art
    by DDG on May 16, 2024 at 11:45 am

    Why the Ethics of AI Art Matter
    The intersection of artificial intelligence and art is not just a technical marvel but also a philosophical and ethical frontier. As AI-generated art becomes more prevalent, the questions surrounding its ethical implications grow in complexity and importance. AI art represents a significant evolution in the creative process, merging […]

  • Best AI Dreams of April 2024
    by DDG on May 15, 2024 at 2:44 pm

    April has been a month of vibrant creativity and algorithmic genius on Deep Dream Generator, as AI artists from around the globe have pushed the boundaries of imagination. In this blog post, we celebrate the most popular AI-generated images that have captivated the community. These masterpieces, ranked by likes and comments, not only reflect the […]

  • Elevating Your Prompt Engineering Skills for AI Image Creation
    by DDG on May 15, 2024 at 12:45 pm

    Intermediate Prompt-Writing Techniques
    Obviously, the more AI art you create, the more experienced you become. Most of this expertise is about writing better image prompts. As you transition from a beginner to an intermediate prompt engineer, it’s crucial to refine and expand your techniques to communicate more complex artistic visions to AI systems, such as […]

AI at UC Berkeley

The Berkeley Artificial Intelligence Research (BAIR) Lab at UC Berkeley is a vibrant hub where researchers converge across diverse areas: computer vision, machine learning, natural language processing, planning, control, and robotics. With over 50 faculty members and more than 300 graduate students and postdoctoral researchers, BAIR pioneers fundamental advances in AI. Themes like multi-modal deep learning, human-compatible AI, and the intersection of AI with other scientific disciplines and the humanities fuel their cosmic exploration.

  • TinyAgent: Function Calling at the Edge
    on May 29, 2024 at 11:00 pm

    The ability of LLMs to execute commands through plain language (e.g. English) has enabled agentic systems that can complete a user query by orchestrating the right set of tools (e.g. ToolFormer, Gorilla). This, along with recent multi-modal efforts such as the GPT-4o or Gemini-1.5 models, has expanded the realm of possibilities with AI agents. While this is quite exciting, the large model size and computational requirements of these models often require their inference to be performed on the cloud. This can create several challenges for their widespread adoption. First and foremost, uploading data such as video, audio, or text documents to a third-party vendor on the cloud can result in privacy issues. Second, this requires cloud/Wi-Fi connectivity, which is not always possible; for instance, a robot deployed in the real world may not always have a stable connection. Latency can also be an issue: uploading large amounts of data to the cloud and waiting for the response slows down response time, resulting in an unacceptable time-to-solution. These challenges could be solved if we could deploy the LLM models locally at the edge.

    However, current LLMs like GPT-4o or Gemini-1.5 are too large for local deployment. One contributing factor is that a lot of the model capacity ends up memorizing general information about the world in its parametric memory, which may not be necessary for a specialized downstream application. For instance, if you ask these models a general factual question, such as about a historical event or a well-known figure, they can produce the answer from their parametric memory, even without additional context in their prompt. However, this implicit memorization of training data into the parametric memory seems to be correlated with "emergent" phenomena in LLMs such as in-context learning and complex reasoning, which has been the driving force behind scaling the model size. This leads to an intriguing research question: can a smaller language model with significantly less parametric memory emulate the emergent abilities of these larger language models? Achieving this would significantly reduce the computational footprint of agentic systems and thus enable efficient and privacy-preserving edge deployment. Our study demonstrates that this is feasible for small language models through training with specialized, high-quality data that does not require recalling generic world knowledge.

    Such a system could be particularly useful for semantic systems where the AI agent's role is to understand the user query in natural language and, instead of responding with a ChatGPT-style question-and-answer response, orchestrate the right set of tools and APIs to accomplish the user's command. For example, in a Siri-like application, a user may ask a language model to create a calendar invite with particular attendees. If a predefined script for creating calendar items already exists, the LLM simply needs to learn how to invoke this script with the correct input arguments (such as attendees' email addresses, event title, and time). This process does not require recalling or memorizing world knowledge from sources like Wikipedia, but rather requires reasoning and learning to call the right functions and to correctly orchestrate them. Our goal is to develop Small Language Models (SLMs) that are capable of complex reasoning and could be deployed securely and privately at the edge. Here we will discuss the research directions that we are pursuing to that end.
    First, we discuss how we can enable small open-source models to perform accurate function calling, which is a key component of agentic systems. It turns out that off-the-shelf small models have very low function calling capabilities. We discuss how we address this by systematically curating high-quality data for function calling, using a specialized Mac assistant agent as our driving application. We then show that fine-tuning the model on this high-quality curated dataset can enable SLMs to exceed even GPT-4-Turbo's function calling performance. We then show that this can be further improved and made efficient through a new Tool RAG method. Finally, we show how the final models can be deployed efficiently at the edge with real-time responses.

    Demo of TinyAgent-1B, along with Whisper-v3, deployed locally on a MacBook M3 Pro. The framework is open-sourced and available at https://github.com/SqueezeAILab/TinyAgent

    Teaching LLMs to do Function Calling

    Figure 1: Overview of the LLMCompiler Function Calling Planner. The Planner understands the user query and generates a sequence of tasks with their inter-dependencies. These tasks are then dispatched by the LLMCompiler framework to accomplish the user command. In this example, Tasks $1 and $2 are fetched together to retrieve the email addresses of Sid and Lutfi independently. After each task is performed, the results are forwarded to Task $3, which creates the calendar event. Before executing Task $3, LLMCompiler replaces the placeholder variables (e.g., the variables $1 and $2 in Task $3) with actual values.

    As mentioned above, our main interest is applications where the AI agent translates the user query into a sequence of function calls to complete the tasks. In such applications, the model doesn't need to write the function definition itself, since the functions (or APIs) are mostly pre-defined and already available. Therefore, what the model needs to do is to determine (i) which functions to call, (ii) the corresponding input arguments, and (iii) the right order of calling these functions (i.e. function orchestration), based on the required interdependency across the function calls.

    The first question is to find an effective way to equip SLMs to perform function calling. Large models such as GPT-4 are able to perform function calling, but how can this be achieved with open-source models? LLMCompiler is a recent framework from our group that enables this by instructing the LLM to output a function calling plan that includes the set of functions it needs to call, along with the input arguments and their dependencies (see the example in Figure 1). Once this function calling plan is generated, we can parse it and call each function based on the dependencies. The critical part here is to teach the model to create this function calling plan with the right syntax and dependencies. The original LLMCompiler paper only considered large models, such as LLaMA-2 70B, which have complex reasoning capabilities to create the plan when provided with sufficient instructions in their prompts. However, can smaller models be prompted the same way to output the correct function calling plan? Unfortunately, our experiments showed that off-the-shelf small models such as TinyLLaMA-1.1B (or even the larger Wizard-2-7B model) are not able to output the correct plans. The errors ranged from using the wrong set of functions and hallucinated names to wrong dependencies and inconsistent syntax. This is rather expected, because these small models have been trained on generic datasets and primarily targeted to achieve good accuracy on general benchmarks, which mostly test the model's world knowledge, general reasoning, and basic instruction-following capability. To address this, we explored whether fine-tuning these models on a high-quality dataset specially curated for function calling and planning could improve their accuracy for a targeted task, potentially outperforming larger models. Below, we first discuss how we generated such a dataset, and then describe the fine-tuning approach.
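    To make the planner output concrete, here is a toy, hypothetical sketch of how such a function calling plan could be parsed and executed. The plan format, the stand-in tools, and the sequential execution are simplifications for illustration only; LLMCompiler dispatches independent tasks in parallel, and the real TinyAgent tools wrap predefined AppleScripts.

      import re

      # Stand-in "tools"; the real TinyAgent tools wrap predefined AppleScripts.
      TOOLS = {
          "get_email_address": lambda name: f"{name.lower()}@example.com",
          "create_calendar_event": lambda attendees, title: f"created '{title}' with {attendees}",
      }

      PLAN = '''1. get_email_address(name="Sid")
      2. get_email_address(name="Lutfi")
      3. create_calendar_event(attendees=["$1", "$2"], title="Project sync")'''

      TASK = re.compile(r'(\d+)\.\s*(\w+)\((.*)\)\s*$')

      def run_plan(plan_text):
          results = {}                              # task id -> return value
          for line in plan_text.splitlines():
              task_id, fn, args_src = TASK.match(line.strip()).groups()
              # A "$N" placeholder marks a dependency on task N; substitute its result.
              args_src = re.sub(r'"\$(\d+)"', lambda m: repr(results[m.group(1)]), args_src)
              kwargs = eval(f"dict({args_src})")    # fine for a toy sketch, not for untrusted input
              results[task_id] = TOOLS[fn](**kwargs)
          return results

      print(run_plan(PLAN)["3"])
      # created 'Project sync' with ['sid@example.com', 'lutfi@example.com']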
    Dataset Generation

    Figure 2: TinyAgent is an assistant that can interact with various macOS applications to assist the user. Commands can be given to it either as text through a Spotlight-style input, or through voice.

    As a driving application, we consider a local agentic system for Apple's MacBook that solves the user's day-to-day tasks, as shown in Figure 2. In particular, the agent is equipped with 16 different functions that can interact with different applications on the Mac, including:

      • Email: Compose a new email or reply to/forward emails
      • Contacts: Retrieve phone numbers or email addresses from the contacts database
      • SMS: Send text messages to contact(s)
      • Calendar: Create calendar events with details such as title, time, attendees, etc.
      • Notes: Create, open, or append content to notes in various folders
      • Reminders: Set reminders for various activities and tasks
      • File management: Open, read, or summarize documents in various file paths
      • Zoom meetings: Schedule and organize Zoom meetings

    Predefined AppleScripts exist for each of these functions/tools, and all that the model needs to do is to take advantage of the predefined APIs and determine the right function calling plan to accomplish a given task, such as in Figure 1. But as discussed previously, we need some data for evaluating and training small language models, since their off-the-shelf function calling capability is subpar.

    Creating handcrafted data with diverse function calling plans is both challenging and not scalable. However, we can curate synthetic data using an LLM like GPT-4-Turbo. Such an approach is becoming a common method, where a capable LLM is instructed to generate data similar to a given set of sample examples or templates (see LLM2LLM and Self-Instruct). In our work, we used a similar approach, but instead of providing the LLM with generic user queries as templates, we provide it with various sets of functions and instruct it to generate realistic user queries that require those functions to accomplish the task, along with the associated function calling plan and input arguments, like the example shown in Figure 1. To verify the validity of the generated data, we incorporated sanity checks on the function calling plans to make sure that they form a feasible graph and that the function names and input argument types are correct. With this approach, we created 80K training, 1K validation, and 1K test examples, at a total cost of only ~$500.
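    For illustration, those sanity checks might look something like the sketch below; the plan dictionary format and the registry of valid tools are assumptions for the example, not the project's actual data schema.

      from graphlib import TopologicalSorter, CycleError

      # Hypothetical registry of valid tools and their allowed argument names.
      TOOL_ARGS = {
          "get_email_address": {"name"},
          "create_calendar_event": {"attendees", "title", "time"},
      }

      def check_plan(plan):
          """plan: list of dicts such as
          {"id": 3, "tool": "create_calendar_event",
           "args": {"attendees": ["$1", "$2"], "title": "Sync"}, "depends_on": [1, 2]}"""
          graph = {}
          for task in plan:
              if task["tool"] not in TOOL_ARGS:                      # hallucinated function name
                  return False
              if not set(task["args"]) <= TOOL_ARGS[task["tool"]]:   # unknown argument name
                  return False
              graph[task["id"]] = set(task["depends_on"])            # dependency edges
          try:
              list(TopologicalSorter(graph).static_order())          # raises if dependencies are cyclic
          except CycleError:
              return False
          return True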
    Fine-tuning for Improved Function Calling Reasoning

    Figure 3: Graph isomorphism success rate. The model scores a success rate of 1 only if the DAG of its generated plan is isomorphic to the DAG of the ground-truth plan, and 0 otherwise. In the example above, in the top case, although the order of the get_email_address calls differs from the ground-truth plan (the ground-truth plan gets the email address of Lutfi before Sid, while the generated plan gets the email address of Sid before Lutfi), the two DAGs are isomorphic to each other, so the plan gets a success rate of 1. In the bottom case, the predicted DAG contains a wrong node, corresponding to a wrong function call, so the plan gets a success rate of 0.

    With our dataset in place, we can now proceed to fine-tune off-the-shelf SLMs to enhance their function calling capability. We started with two base small models: TinyLlama-1.1B (instruct-32k version) and Wizard-2-7B. For fine-tuning these models, we first need to define a metric to evaluate their performance. Our objective is for these models to accurately generate the right plan, which involves not only selecting the right set of functions, but also correctly orchestrating them in the right order. Therefore, we define a success rate metric that assigns 1 if both criteria are met, and 0 otherwise. Checking whether the model has selected the right set of function calls is straightforward. To additionally ensure that the orchestration of these functions is correct, we construct a Directed Acyclic Graph (DAG) of the function calls based on the dependencies, as shown in Figure 3, where each node represents a function call and a directed edge from node A to node B represents their interdependency (i.e. function B can only be executed after the execution of function A). Then we compare this DAG against that of the ground-truth plan to verify the accuracy of the dependencies.

    After defining our evaluation metric, we applied LoRA to fine-tune the models for 3 epochs using a learning rate of 7e-5 over the 80K training examples, and selected the best checkpoint based on validation performance. For fine-tuning, our prompt included not only the descriptions of the ground-truth functions (i.e. functions used in the ground-truth plan) but also other irrelevant functions as negative samples. We found the negative samples to be particularly effective for teaching the model how to select appropriate tools for a given query, hence improving the post-training performance. Furthermore, we also include several in-context examples demonstrating how queries are translated into function calling plans. These in-context examples are selected through a Retrieval-Augmented Generation (RAG) process over the training data, based on the user query. Using the above settings, we fine-tuned the TinyLlama-1.1B and Wizard-2-7B models. After fine-tuning, the 1.1B model improved its success rate from 12.71% to 78.89%, and the 7B model's performance improved from 41.25% to 83.09%, which is ~4% higher than GPT-4-Turbo.
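    The graph-isomorphism success check described above can be sketched in a few lines with networkx; the plan tuple format here is a simplifying assumption for illustration, not the exact representation used in the project.

      import networkx as nx
      from networkx.algorithms.isomorphism import categorical_node_match

      def plan_to_dag(plan):
          """plan: list of (task_id, tool_name, depends_on) tuples (an assumed format)."""
          g = nx.DiGraph()
          for task_id, tool, deps in plan:
              g.add_node(task_id, tool=tool)
              g.add_edges_from((d, task_id) for d in deps)   # edge: dependency -> dependent task
          return g

      def success(predicted, ground_truth):
          """1 if the two plans' DAGs are isomorphic (with matching tools on nodes), else 0."""
          same_tool = categorical_node_match("tool", None)
          return int(nx.is_isomorphic(plan_to_dag(predicted), plan_to_dag(ground_truth),
                                      node_match=same_tool))

      # The same plan with the two get_email_address calls swapped still counts as a success.
      truth = [(1, "get_email_address", []), (2, "get_email_address", []),
               (3, "create_calendar_event", [1, 2])]
      pred  = [(1, "get_email_address", []), (2, "get_email_address", []),
               (3, "create_calendar_event", [2, 1])]
      print(success(pred, truth))  # 1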
    Efficient Inference with Tool RAG

    Figure 4: Efficient tool selection based on user input. Not all user inputs require all available tools; hence, it is imperative to select the right set of tools to minimize the prompt size and increase performance. In this case, the LLM only needs the functions that get email addresses and create a calendar event in its prompt to accomplish its task.

    Our primary goal is to be able to deploy the TinyAgent model locally on a MacBook, which has limited computational and memory resources compared to the GPUs that closed-source models like GPT are deployed on. To achieve efficient performance with low latency, we need to ensure not only that the model size is small, but also that the input prompt is as concise as possible. The latter is an important contributor to latency and computational resource consumption, due to the quadratic complexity of attention in sequence length.

    The fine-tuned TinyAgent model discussed previously was trained with the descriptions of all available tools in its prompt. However, this is quite inefficient: we can significantly reduce the prompt size by including only the descriptions of the tools relevant to the user query. For instance, consider the example shown in Figure 4 above, where the user is asking to create a calendar invite with two people. In this case, the LLM only needs the functions that get email addresses and create a calendar event in its prompt. To take advantage of this observation, we need to determine which functions are required to accomplish the user's command, which we refer to as Tool RAG given its similarity to how Retrieval-Augmented Generation (RAG) works.

    However, there is an important subtlety. If we use a basic RAG method, where we compute the embedding of the user query and use it to retrieve the relevant tools, we get very low performance. This is because completing a user's query often requires using several auxiliary tools, which may be missed by a simple RAG method if the embedding of the auxiliary tool is not similar to the user query. For instance, the example shown in Figure 4 requires calling the get_email_address function even though the user query is just asking about creating a calendar invitation. This can be addressed by treating the problem as a classification of which tools are needed. To that end, we fine-tuned a DeBERTa-v3-small model on the training data to perform a 16-way classification, as shown in Figure 5. The user query is given as input to this model, and we then pass the CLS token at the end through a simple fully connected layer of size 768×16 to transform it into a 16-dimensional vector (the total number of our tools). The output of this layer is passed through a sigmoid layer to produce the probability of selecting each tool. During inference, we select the tools whose probability is higher than 50% and include their descriptions in the prompt. On average we noticed that only 3.97 tools are retrieved with a recall of 0.998, whereas the basic RAG requires using the top 6 tools to achieve a tool recall of 0.968.

    Figure 5: Overview of our Tool RAG scheme. We formulate tool retrieval as a multi-label classification problem. The user query is given as input to the fine-tuned DeBERTa-v3-small model, which outputs a 16-dimensional vector indicating tool probabilities. Tools with probabilities higher than 50% are selected, averaging 3.97 tools per query compared to 6 tools in basic RAG.
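    For intuition, a minimal version of such a tool-retrieval classifier could look like the sketch below; the pooling choice, head initialization, and training loop (a standard multi-label binary cross-entropy setup) are assumptions for illustration.

      import torch
      from torch import nn
      from transformers import AutoModel, AutoTokenizer

      NUM_TOOLS = 16

      class ToolRetriever(nn.Module):
          """DeBERTa-v3-small encoder + a 768x16 linear head with sigmoid outputs."""
          def __init__(self, backbone="microsoft/deberta-v3-small"):
              super().__init__()
              self.encoder = AutoModel.from_pretrained(backbone)
              self.head = nn.Linear(self.encoder.config.hidden_size, NUM_TOOLS)

          def forward(self, **inputs):
              hidden = self.encoder(**inputs).last_hidden_state[:, 0]  # [CLS] representation
              return torch.sigmoid(self.head(hidden))                  # per-tool probabilities

      tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-small")
      model = ToolRetriever()

      query = "Create a calendar invite with Sid and Lutfi tomorrow at 2pm"
      probs = model(**tokenizer(query, return_tensors="pt"))[0]
      selected = (probs > 0.5).nonzero().flatten().tolist()   # indices of tools to keep in the prompt
      # Only the descriptions of the selected tools are placed in the planner's prompt.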
    We evaluated the model performance after incorporating Tool RAG. The results are shown in Table 1 below, where we report the performance of the simple RAG system along with the fine-tuned DeBERTa approach. As one can see, the DeBERTa-based Tool RAG method achieves almost perfect recall, improves the baseline accuracy, and reduces the prompt size by ~2x in tokens.

    Table 1: Comparison of TinyAgent performance with DeBERTa to Basic RAG and no RAG settings.

    Tool RAG Method                    | Tool Recall             | Prompt Size (Tokens) | TinyAgent 1.1B Success Rate (%) | TinyAgent 7B Success Rate (%)
    No RAG (all tools in the prompt)   | 1.0                     | 2762                 | 78.89                           | 83.09
    Basic RAG                          | 0.949 (top 3)           | 1674                 | 74.88                           | 78.50
    Fine-tuned DeBERTa-v3-small (Ours) | 0.998 (tools >50% prob) | 1397                 | 80.06                           | 84.95

    Fast Edge Deployment with Quantization

    Deploying models at the edge, such as on consumer MacBooks, can still be challenging even for small models of O(1B) parameters, since loading the model parameters can consume a large portion of the available memory. A solution to these issues is quantization, which allows us to store the model at a reduced bit precision. Quantization not only reduces the storage requirements and model footprint, but also cuts down the time and resources needed to load the model weights into memory, thereby reducing the overall inference latency as well. For more efficient deployment, we quantized the models to 4-bit precision with a group size of 32, which is supported by the llama.cpp framework, using quantization-aware training. As shown in Table 2, the 4-bit models result in 30% better latency, along with a 4x reduction in model size. We also notice a slight accuracy improvement, which is due to the additional fine-tuning with simulated quantization.

    Table 2: Latency, size, and success rate of TinyAgent models before and after quantization. Latency is the end-to-end latency of the function calling planner, including the prompt processing time and generation.

    Model          | Weight Precision | Latency (seconds) | Model Size (GB) | Success Rate (%)
    GPT-3.5        | Unknown          | 3.2               | Unknown         | 65.04
    GPT-4-Turbo    | Unknown          | 3.9               | Unknown         | 79.08
    TinyAgent-1.1B | 16               | 3.9               | 2.2             | 80.06
    TinyAgent-1.1B | 4                | 2.9               | 0.68            | 80.35
    TinyAgent-7B   | 16               | 19.5              | 14.5            | 84.95
    TinyAgent-7B   | 4                | 13.1              | 4.37            | 85.14

    Putting it all together

    Below is the demo of the final TinyAgent-1.1B model deployed on a MacBook Pro M3, which you can actually download, install on your Mac, and test as well. It not only runs all of the model inference locally on your computer, but it also allows you to provide commands through audio. We process the audio locally as well, using the Whisper-v3 model from OpenAI deployed locally with the whisper.cpp framework. The greatest surprise for us was that the accuracy of the 1.1B model exceeds that of GPT-4-Turbo, and it is markedly fast while deployed locally and privately on device.

    To summarize, we introduced TinyAgent and showed that it is indeed possible to train a small language model and use it to power a semantic system that processes user queries. In particular, we considered a Siri-like assistant for Mac as a driving application. The key components for enabling it are to (i) teach off-the-shelf SLMs to perform function calling through the LLMCompiler framework, (ii) curate high-quality function calling data for the task at hand, (iii) fine-tune the off-the-shelf model on the generated data, and (iv) enable efficient deployment by optimizing the prompt size through retrieving only the necessary tools based on the user query, a method we call ToolRAG, as well as quantized model deployment to reduce inference resource consumption. After these steps, our final models achieved success rates of 80.06% and 84.95% for the TinyAgent-1.1B and 7B models respectively, exceeding GPT-4-Turbo's success rate of 79.08% on this task.
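    To make the group-wise 4-bit quantization idea mentioned above concrete, here is a toy numpy illustration with a group size of 32; it mirrors the general idea of per-group scaling, not llama.cpp's actual data format, kernels, or the quantization-aware training used for TinyAgent.

      import numpy as np

      def quantize_groupwise_4bit(weights, group_size=32):
          """Toy group-wise 4-bit quantization: each group of 32 weights shares one
          scale, and values are rounded to integers in [-8, 7]."""
          w = weights.reshape(-1, group_size)
          scale = np.abs(w).max(axis=1, keepdims=True) / 7.0        # per-group scale
          q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)   # 4-bit codes
          return q, scale.astype(np.float16)

      def dequantize(q, scale):
          return (q.astype(np.float32) * scale).reshape(-1)

      w = np.random.randn(1024).astype(np.float32)
      q, s = quantize_groupwise_4bit(w)
      print(np.abs(w - dequantize(q, s)).max())   # reconstruction error stays small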
    Acknowledgements

    We would like to thank Apple for sponsoring this project, as well as support from Microsoft through the Accelerating Foundation Models Research Program. We also thank Sunjin Choi for his insights on the energy costs associated with local and cloud deployment. Our conclusions do not necessarily reflect the position or the policy of our sponsors, and no official endorsement should be inferred.

    BibTeX for this post:

    @misc{tiny-agent,
      title={TinyAgent: Function Calling at the Edge},
      author={Erdogan, Lutfi Eren and Lee, Nicholas and Jha, Siddharth and Kim, Sehoon and Tabrizi, Ryan and Moon, Suhong and Hooper, Coleman and Anumanchipalli, Gopala and Keutzer, Kurt and Gholami, Amir},
      howpublished={\url{https://bair.berkeley.edu/blog/2024/05/29/tiny-agent/}},
      year={2024}
    }

  • Modeling Extremely Large Images with xT
    on March 21, 2024 at 9:00 am

    As computer vision researchers, we believe that every pixel can tell a story. However, there seems to be a writer's block settling into the field when it comes to dealing with large images. Large images are no longer rare: the cameras we carry in our pockets and those orbiting our planet snap pictures so big and detailed that they stretch our current best models and hardware to their breaking points when handling them. Generally, we face a quadratic increase in memory usage as a function of image size.

    Today, we make one of two sub-optimal choices when handling large images: down-sampling or cropping. These two methods incur significant losses in the amount of information and context present in an image. We take another look at these approaches and introduce $x$T, a new framework to model large images end-to-end on contemporary GPUs while effectively aggregating global context with local details.

    Architecture for the $x$T framework.

    Why Bother with Big Images Anyway?

    Why bother handling large images anyway? Picture yourself in front of your TV, watching your favorite football team. The field is dotted with players, with the action occurring on only a small portion of the screen at a time. Would you be satisfied, however, if you could only see a small region around where the ball currently was? Alternatively, would you be satisfied watching the game in low resolution? Every pixel tells a story, no matter how far apart they are. This is true in all domains, from your TV screen to a pathologist viewing a gigapixel slide to diagnose tiny patches of cancer. These images are treasure troves of information. If we can't fully explore the wealth because our tools can't handle the map, what's the point?

    Sports are fun when you know what's going on.

    That's precisely where the frustration lies today. The bigger the image, the more we need to zoom out to see the whole picture and zoom in for the nitty-gritty details, making it a challenge to grasp both the forest and the trees simultaneously. Most current methods force a choice between losing sight of the forest or missing the trees, and neither option is great.

    How $x$T Tries to Fix This

    Imagine trying to solve a massive jigsaw puzzle. Instead of tackling the whole thing at once, which would be overwhelming, you start with smaller sections, get a good look at each piece, and then figure out how they fit into the bigger picture. That's basically what we do with large images with $x$T. $x$T takes these gigantic images and chops them into smaller, more digestible pieces hierarchically. This isn't just about making things smaller, though. It's about understanding each piece in its own right and then, using some clever techniques, figuring out how these pieces connect on a larger scale. It's like having a conversation with each part of the image, learning its story, and then sharing those stories with the other parts to get the full narrative.

    Nested Tokenization

    At the core of $x$T lies the concept of nested tokenization. In simple terms, tokenization in the realm of computer vision is akin to chopping up an image into pieces (tokens) that a model can digest and analyze. However, $x$T takes this a step further by introducing a hierarchy into the process; hence, nested. Imagine you're tasked with analyzing a detailed city map. Instead of trying to take in the entire map at once, you break it down into districts, then neighborhoods within those districts, and finally, streets within those neighborhoods.
    This hierarchical breakdown makes it easier to manage and understand the details of the map while keeping track of where everything fits in the larger picture. That's the essence of nested tokenization: we split an image into regions, each of which can be split into further sub-regions depending on the input size expected by a vision backbone (what we call a region encoder), before being patchified to be processed by that region encoder. This nested approach allows us to extract features at different scales on a local level.

    Coordinating Region and Context Encoders

    Once an image is neatly divided into tokens, $x$T employs two types of encoders to make sense of these pieces: the region encoder and the context encoder. Each plays a distinct role in piecing together the image's full story.

    The region encoder is a standalone "local expert" which converts independent regions into detailed representations. However, since each region is processed in isolation, no information is shared across the image at large. The region encoder can be any state-of-the-art vision backbone. In our experiments we have utilized hierarchical vision transformers such as Swin and Hiera and also CNNs such as ConvNeXt!

    Enter the context encoder, the big-picture guru. Its job is to take the detailed representations from the region encoders and stitch them together, ensuring that the insights from one token are considered in the context of the others. The context encoder is generally a long-sequence model. We experiment with Transformer-XL (and our variant of it called Hyper) and Mamba, though you could use Longformer and other new advances in this area. Even though these long-sequence models are generally made for language, we demonstrate that it is possible to use them effectively for vision tasks.

    The magic of $x$T is in how these components (nested tokenization, region encoders, and context encoders) come together. By first breaking down the image into manageable pieces and then systematically analyzing these pieces both in isolation and in conjunction, $x$T manages to maintain the fidelity of the original image's details while also integrating the overarching long-distance context, all while fitting massive images, end-to-end, on contemporary GPUs.
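    To make the pipeline concrete, here is a toy PyTorch sketch of the idea: a large image is tiled into regions, each region is encoded independently by a small stand-in region encoder, and a stand-in context encoder then mixes information across all region tokens. The module sizes and choices here are illustrative assumptions, not the actual $x$T configuration.

      import torch
      from torch import nn

      class TinyRegionEncoder(nn.Module):
          """Stand-in for a real region encoder such as Swin, Hiera, or ConvNeXt."""
          def __init__(self, dim=128):
              super().__init__()
              self.net = nn.Sequential(
                  nn.Conv2d(3, 32, 4, stride=4), nn.GELU(),
                  nn.Conv2d(32, dim, 4, stride=4), nn.GELU(),
                  nn.AdaptiveAvgPool2d(1), nn.Flatten(),   # one token per region
              )

          def forward(self, regions):                      # (N, 3, H, W) independent regions
              return self.net(regions)                     # (N, dim)

      class TinyXT(nn.Module):
          def __init__(self, region_size=256, dim=128):
              super().__init__()
              self.region_size = region_size
              self.region_encoder = TinyRegionEncoder(dim)
              layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
              self.context_encoder = nn.TransformerEncoder(layer, num_layers=2)  # stand-in for Transformer-XL/Mamba

          def forward(self, image):                        # (3, H, W), H and W divisible by region_size
              r = self.region_size
              c, h, w = image.shape
              regions = (image.unfold(1, r, r).unfold(2, r, r)   # tile into non-overlapping regions
                              .permute(1, 2, 0, 3, 4)
                              .reshape(-1, c, r, r))             # (num_regions, 3, r, r)
              tokens = self.region_encoder(regions).unsqueeze(0) # (1, num_regions, dim)
              return self.context_encoder(tokens)                # region tokens with global context

      out = TinyXT()(torch.randn(3, 1024, 1024))   # 16 regions of 256x256
      print(out.shape)                             # torch.Size([1, 16, 128])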
    Results

    We evaluate $x$T on challenging benchmark tasks that span well-established computer vision baselines to rigorous large image tasks. In particular, we experiment with iNaturalist 2018 for fine-grained species classification, xView3-SAR for context-dependent segmentation, and MS-COCO for detection.

    Powerful vision models used with $x$T set a new frontier on downstream tasks such as fine-grained species classification.

    Our experiments show that $x$T can achieve higher accuracy on all downstream tasks with fewer parameters while using much less memory per region than state-of-the-art baselines*. We are able to model images as large as 29,000 x 25,000 pixels on 40GB A100s, while comparable baselines run out of memory at only 2,800 x 2,800 pixels.

    *Depending on your choice of context model, such as Transformer-XL.

    Why This Matters More Than You Think

    This approach isn't just cool; it's necessary. For scientists tracking climate change or doctors diagnosing diseases, it's a game-changer. It means creating models which understand the full story, not just bits and pieces. In environmental monitoring, for example, being able to see both the broader changes over vast landscapes and the details of specific areas can help in understanding the bigger picture of climate impact. In healthcare, it could mean the difference between catching a disease early or not.

    We are not claiming to have solved all the world's problems in one go. We are hoping that with $x$T we have opened the door to what's possible. We're stepping into a new era where we don't have to compromise on the clarity or breadth of our vision. $x$T is our big leap towards models that can juggle the intricacies of large-scale images without breaking a sweat.

    There's a lot more ground to cover. Research will evolve, and hopefully, so will our ability to process even bigger and more complex images. In fact, we are working on follow-ons to $x$T which will expand this frontier further.

    In Conclusion

    For a complete treatment of this work, please check out the paper on arXiv. The project page contains a link to our released code and weights. If you find the work useful, please cite it as below:

    @article{xTLargeImageModeling,
      title={xT: Nested Tokenization for Larger Context in Large Images},
      author={Gupta, Ritwik and Li, Shufan and Zhu, Tyler and Malik, Jitendra and Darrell, Trevor and Mangalam, Karttikeya},
      journal={arXiv preprint arXiv:2403.01915},
      year={2024}
    }

  • 2024 BAIR Graduate Directory
    on March 11, 2024 at 9:00 am

    Every year, the Berkeley Artificial Intelligence Research (BAIR) Lab graduates some of the most talented and innovative minds in artificial intelligence and machine learning. Our Ph.D. graduates have each expanded the frontiers of AI research and are now ready to embark on new adventures in academia, industry, and beyond. These fantastic individuals bring with them a wealth of knowledge, fresh ideas, and a drive to continue contributing to the advancement of AI. Their work at BAIR, ranging from deep learning, robotics, and natural language processing to computer vision, security, and much more, has contributed significantly to their fields and has had transformative impacts on society.

    This website is dedicated to showcasing our colleagues, making it easier for academic institutions, research organizations, and industry leaders to discover and recruit from the newest generation of AI pioneers. Here, you'll find detailed profiles, research interests, and contact information for each of our graduates. We invite you to explore the potential collaborations and opportunities these graduates present as they seek to apply their expertise and insights in new environments.

    Join us in celebrating the achievements of BAIR's latest PhD graduates. Their journey is just beginning, and the future they will help build is bright! Thank you to our friends at the Stanford AI Lab for this idea!

    Abdus Salam Azad
    Email: salam_azad@berkeley.edu | Website: https://www.azadsalam.org/ | Advisor(s): Ion Stoica
    Research Blurb: My research interest lies broadly in the field of Machine Learning and Artificial Intelligence. During my PhD I have focused on Environment Generation/Curriculum Learning methods for training Autonomous Agents with Reinforcement Learning. Specifically, I work on methods that algorithmically generate diverse training environments (i.e., learning scenarios) for autonomous agents to improve generalization and sample efficiency. Currently, I am working on Large Language Model (LLM) based autonomous agents.
    Jobs Interested In: Research Scientist, ML Engineer

    Alicia Tsai
    Email: aliciatsai@berkeley.edu | Website: https://www.aliciatsai.com/ | Advisor(s): Laurent El Ghaoui
    Research Blurb: My research delves into the theoretical aspects of deep implicit models, beginning with a unified "state-space" representation that simplifies notation. Additionally, my work explores various training challenges associated with deep learning, including problems amenable to convex and non-convex optimization. In addition to theoretical exploration, my research extends the potential applications to various problem domains, including natural language processing and natural science.
    Jobs Interested In: Research Scientist, Applied Scientist, Machine Learning Engineer

    Catherine Weaver
    Email: catherine22@berkeley.edu | Website: https://cwj22.github.io | Advisor(s): Masayoshi Tomizuka, Wei Zhan
    Research Blurb: My research focuses on machine learning and control algorithms for the challenging task of autonomous racing in Gran Turismo Sport. I leverage my background in Mechanical Engineering to discover how machine learning and model-based optimal control can create safe, high-performance control systems for robotics and autonomous systems. A particular emphasis of mine has been how to leverage offline datasets (e.g. human players' racing trajectories) to inform better, more sample-efficient control algorithms.
    Jobs Interested In: Research Scientist and Robotics/Controls Engineer

    Chawin Sitawarin
    Email: chawin.sitawarin@gmail.com | Website: https://chawins.github.io/ | Advisor(s): David Wagner
    Research Blurb: I am broadly interested in the security and safety aspects of machine learning systems. Most of my previous works are in the domain of adversarial machine learning, particularly adversarial examples and robustness of machine learning algorithms. More recently, I am excited about emerging security and privacy risks on large language models.
    Jobs Interested In: Research scientist

    Dhruv Shah
    Email: shah@cs.berkeley.edu | Website: http://cs.berkeley.edu/~shah/ | Advisor(s): Sergey Levine
    Research Blurb: I train big(-ish) models and make robots smarter.
    Jobs Interested In: Research scientist, roboticist

    Eliza Kosoy
    Email: eko@berkeley.edu | Website: https://www.elizakosoy.com/ | Advisor(s): Alison Gopnik
    Research Blurb: Eliza Kosoy works at the intersection of child development and AI with Prof. Alison Gopnik. Her work includes creating evaluative benchmarks for LLMs rooted in child development and studying how children and adults use GenAI models such as ChatGPT/Dalle and form mental models about them. She's an intern at Google working on the AI/UX team and previously with the Empathy Lab. She has published in NeurIPS, ICML, ICLR, CogSci, and Cognition. Her thesis work created a unified virtual environment for testing children and AI models in one place for the purposes of training RL models. She also has experience building startups and STEM hardware coding toys.
    Jobs Interested In: Research Scientist (child development and AI), AI safety (specializing in children), User Experience (UX) Researcher (specializing in mixed methods, youth, AI, LLMs), Education and AI (STEM toys)

    Fangyu Wu
    Email: fangyuwu@berkeley.edu | Website: https://fangyuwu.com/ | Advisor(s): Alexandre Bayen
    Research Blurb: Under the mentorship of Prof. Alexandre Bayen, Fangyu focuses on the application of optimization methods to multi-agent robotic systems, particularly in the planning and control of automated vehicles.
    Jobs Interested In: Faculty, or research scientist in control, optimization, and robotics

    Frances Ding
    Email: frances@berkeley.edu | Website: https://www.francesding.com/ | Advisor(s): Jacob Steinhardt, Moritz Hardt
    Research Blurb: My research focus is in machine learning for protein modeling. I work on improving protein property classification and protein design, as well as understanding what different protein models learn. I have previously worked on sequence models for DNA and RNA, and benchmarks for evaluating the interpretability and fairness of ML models across domains.
    Jobs Interested In: Research scientist

    Jianlan Luo
    Email: jianlanluo@eecs.berkeley.edu | Website: https://people.eecs.berkeley.edu/~jianlanluo/ | Advisor(s): Sergey Levine
    Research Blurb: My research interests are broadly in scalable algorithms and practice of machine learning, robotics, and controls; particularly their intersections.
    Jobs Interested In: Faculty, Research Scientist

    Kathy Jang
    Email: kathyjang@gmail.com | Website: https://kathyjang.com | Advisor(s): Alexandre Bayen
    Research Blurb: My thesis work has specialized in reinforcement learning for autonomous vehicles, focusing on enhancing decision-making and efficiency in applied settings. In future work, I'm eager to apply these principles to broader challenges across domains like natural language processing. With my background, my aim is to see the direct impact of my efforts by contributing to innovative AI research and solutions.
    Jobs Interested In: ML research scientist/engineer

    Kevin Lin
    Email: k-lin@berkeley.edu | Website: https://people.eecs.berkeley.edu/~kevinlin/ | Advisor(s): Dan Klein, Joseph E. Gonzalez
    Research Blurb: My research focuses on understanding and improving how language models use and provide information.
    Jobs Interested In: Research Scientist

    Nikhil Ghosh
    Email: nikhil_ghosh@berkeley.edu | Website: https://nikhil-ghosh-berkeley.github.io/ | Advisor(s): Bin Yu, Song Mei
    Research Blurb: I am interested in developing a better foundational understanding of deep learning and improving practical systems, using both theoretical and empirical methodology. Currently, I am especially interested in improving the efficiency of large models by studying how to properly scale hyperparameters with model size.
    Jobs Interested In: Research Scientist

    Olivia Watkins
    Email: oliviawatkins@berkeley.edu | Website: https://aliengirlliv.github.io/oliviawatkins | Advisor(s): Pieter Abbeel and Trevor Darrell
    Research Blurb: My work involves RL, BC, learning from humans, and using common-sense foundation model reasoning for agent learning. I'm excited about language agent learning, supervision, alignment & robustness.
    Jobs Interested In: Research scientist

    Ruiming Cao
    Email: rcao@berkeley.edu | Website: https://rmcao.net | Advisor(s): Laura Waller
    Research Blurb: My research is on computational imaging, particularly the space-time modeling for dynamic scene recovery and motion estimation. I also work on optical microscopy techniques, optimization-based optical design, event camera processing, and novel view rendering.
    Jobs Interested In: Research scientist, postdoc, faculty

    Ryan Hoque
    Email: ryanhoque@berkeley.edu | Website: https://ryanhoque.github.io | Advisor(s): Ken Goldberg
    Research Blurb: Imitation learning and reinforcement learning algorithms that scale to large robot fleets performing manipulation and other complex tasks.
    Jobs Interested In: Research Scientist

    Sam Toyer
    Email: sdt@berkeley.edu | Website: https://www.qxcv.net/ | Advisor(s): Stuart Russell
    Research Blurb: My research focuses on making language models secure, robust and safe. I also have experience in vision, planning, imitation learning, reinforcement learning, and reward learning.
    Jobs Interested In: Research scientist

    Shishir G. Patil
    Email: shishirpatil2007@gmail.com | Website: https://shishirpatil.github.io/ | Advisor(s): Joseph Gonzalez
    Research Blurb: Gorilla LLM – teaching LLMs to use tools (https://gorilla.cs.berkeley.edu/); LLM Execution Engine: guaranteeing reversibility, robustness, and minimizing blast-radius for LLM agents incorporated into user and enterprise workflows; POET: memory-bound and energy-efficient fine-tuning of LLMs on edge devices such as smartphones and laptops (https://poet.cs.berkeley.edu/).
    Jobs Interested In: Research Scientist

    Suzie Petryk
    Email: spetryk@berkeley.edu | Website: https://suziepetryk.com/ | Advisor(s): Trevor Darrell, Joseph Gonzalez
    Research Blurb: I work on improving the reliability and safety of multimodal models. My focus has been on localizing and reducing hallucinations for vision + language models, along with measuring and using uncertainty and mitigating bias. My interests lie in applying solutions to these challenges in actual production scenarios, rather than solely in academic environments.
    Jobs Interested In: Applied research scientist in generative AI, safety, and/or accessibility

    Xingyu Lin
    Email: xingyu@berkeley.edu | Website: https://xingyu-lin.github.io/ | Advisor(s): Pieter Abbeel
    Research Blurb: My research lies in robotics, machine learning, and computer vision, with the primary goal of learning generalizable robot skills from two angles: (1) learning structured world models with spatial and temporal abstractions; (2) pre-training visual representation and skills to enable knowledge transfer from Internet-scale vision datasets and simulators.
    Jobs Interested In: Faculty, or research scientist

    Yaodong Yu
    Email: yyu@eecs.berkeley.edu | Website: https://yaodongyu.github.io/ | Advisor(s): Michael I. Jordan, Yi Ma
    Research Blurb: My research interests are broadly in theory and practice of trustworthy machine learning, including interpretability, privacy, and robustness.
    Jobs Interested In: Faculty

  • The Shift from Models to Compound AI Systems
    on February 18, 2024 at 9:00 am

    AI caught everyone's attention in 2023 with Large Language Models (LLMs) that can be instructed to perform general tasks, such as translation or coding, just by prompting. This naturally led to an intense focus on models as the primary ingredient in AI application development, with everyone wondering what capabilities new LLMs will bring. As more developers begin to build using LLMs, however, we believe that this focus is rapidly changing: state-of-the-art AI results are increasingly obtained by compound systems with multiple components, not just monolithic models.

    For example, Google's AlphaCode 2 set state-of-the-art results in programming through a carefully engineered system that uses LLMs to generate up to 1 million possible solutions for a task and then filter down the set. AlphaGeometry, likewise, combines an LLM with a traditional symbolic solver to tackle olympiad problems. In enterprises, our colleagues at Databricks found that 60% of LLM applications use some form of retrieval-augmented generation (RAG), and 30% use multi-step chains. Even researchers working on traditional language model tasks, who used to report results from a single LLM call, are now reporting results from increasingly complex inference strategies: Microsoft wrote about a chaining strategy that exceeded GPT-4's accuracy on medical exams by 9%, and Google's Gemini launch post measured its MMLU benchmark results using a new CoT@32 inference strategy that calls the model 32 times, which raised questions about its comparison to just a single call to GPT-4. This shift to compound systems opens many interesting design questions, but it is also exciting, because it means leading AI results can be achieved through clever engineering, not just scaling up training.

    In this post, we analyze the trend toward compound AI systems and what it means for AI developers. Why are developers building compound systems? Is this paradigm here to stay as models improve? And what are the emerging tools for developing and optimizing such systems, an area that has received far less research than model training? We argue that compound AI systems will likely be the best way to maximize AI results in the future, and might be one of the most impactful trends in AI in 2024.

    Increasingly many new AI results are from compound systems.

    Why Use Compound AI Systems?

    We define a Compound AI System as a system that tackles AI tasks using multiple interacting components, including multiple calls to models, retrievers, or external tools. In contrast, an AI Model is simply a statistical model, e.g., a Transformer that predicts the next token in text. Even though AI models are continually getting better, and there is no clear end in sight to their scaling, more and more state-of-the-art results are obtained using compound systems. Why is that? We have seen several distinct reasons:

    Some tasks are easier to improve via system design. While LLMs appear to follow remarkable scaling laws that predictably yield better results with more compute, in many applications, scaling offers lower returns-vs-cost than building a compound system. For example, suppose that the current best LLM can solve coding contest problems 30% of the time, and tripling its training budget would increase this to 35%; this is still not reliable enough to win a coding contest! In contrast, engineering a system that samples from the model multiple times, tests each sample, etc. might increase performance to 80% with today's models, as shown in work like AlphaCode. Even more importantly, iterating on a system design is often much faster than waiting for training runs. We believe that in any high-value application, developers will want to use every tool available to maximize AI quality, so they will use system ideas in addition to scaling. We frequently see this with LLM users, where a good LLM creates a compelling but frustratingly unreliable first demo, and engineering teams then go on to systematically raise quality.
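    As a minimal sketch of this sample-then-test pattern, the snippet below draws N candidate programs and keeps only those that pass the problem's test cases. The call_llm() function is a stand-in for a real model call, and the toy task is an assumption for illustration.

      import random

      def call_llm(prompt, temperature=0.8):
          # Stand-in sampler: pretend the model sometimes writes a correct solve(), sometimes not.
          return random.choice([
              "def solve(xs): return sorted(xs)[-1]",   # correct
              "def solve(xs): return xs[0]",            # buggy
              "def solve(xs): return max(xs[1:])",      # buggy
          ])

      TESTS = [([3, 1, 2], 3), ([5], 5), ([2, 9, 4], 9)]

      def passes_tests(src):
          scope = {}
          try:
              exec(src, scope)
              return all(scope["solve"](list(xs)) == want for xs, want in TESTS)
          except Exception:
              return False

      def sample_and_filter(prompt, n=50):
          candidates = [call_llm(prompt) for _ in range(n)]
          return [c for c in candidates if passes_tests(c)]

      good = sample_and_filter("Write solve(xs) returning the maximum element of xs.", n=50)
      print(f"{len(good)}/50 samples pass all tests")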
    Systems can be dynamic. Machine learning models are inherently limited because they are trained on static datasets, so their "knowledge" is fixed. Therefore, developers need to combine models with other components, such as search and retrieval, to incorporate timely data. In addition, training lets a model "see" the whole training set, so more complex systems are needed to build AI applications with access controls (e.g., answer a user's questions based only on files the user has access to).

    Improving control and trust is easier with systems. Neural network models alone are hard to control: while training will influence them, it is nearly impossible to guarantee that a model will avoid certain behaviors. Using an AI system instead of a model can help developers control behavior more tightly, e.g., by filtering model outputs. Likewise, even the best LLMs still hallucinate, but a system combining, say, LLMs with retrieval can increase user trust by providing citations or automatically verifying facts.

    Performance goals vary widely. Each AI model has a fixed quality level and cost, but applications often need to vary these parameters. In some applications, such as inline code suggestions, the best AI models are too expensive, so tools like GitHub Copilot use carefully tuned smaller models and various search heuristics to provide results. In other applications, even the largest models, like GPT-4, are too cheap! Many users would be willing to pay a few dollars for a correct legal opinion, instead of the few cents it takes to ask GPT-4, but a developer would need to design an AI system to utilize this larger budget.

    The shift to compound systems in Generative AI also matches the industry trends in other AI fields, such as self-driving cars: most of the state-of-the-art implementations are systems with multiple specialized components. For these reasons, we believe compound AI systems will remain a leading paradigm even as models improve.

    Developing Compound AI Systems

    While compound AI systems can offer clear benefits, the art of designing, optimizing, and operating them is still emerging. On the surface, an AI system is a combination of traditional software and AI models, but there are many interesting design questions. For example, should the overall "control logic" be written in traditional code (e.g., Python code that calls an LLM), or should it be driven by an AI model (e.g. LLM agents that call external tools)? Likewise, in a compound system, where should a developer invest resources? In a RAG pipeline, for example, is it better to spend more FLOPS on the retriever or the LLM, or even to call an LLM multiple times? Finally, how can we optimize an AI system with discrete components end-to-end to maximize a metric, the same way we can train a neural network? In this section, we detail a few example AI systems, then discuss these challenges and recent research on them.
    The AI System Design Space

    Below are a few recent compound AI systems that show the breadth of design choices:

    AlphaCode 2
      Components: Fine-tuned LLMs for sampling and scoring programs; code execution module; clustering model
      Design: Generates up to 1 million solutions for a coding problem, then filters and scores them
      Results: Matches the 85th percentile of humans on coding contests

    AlphaGeometry
      Components: Fine-tuned LLM; symbolic math engine
      Design: Iteratively suggests constructions in a geometry problem via the LLM and checks deduced facts produced by the symbolic engine
      Results: Between silver and gold International Math Olympiad medalists on a timed test

    Medprompt
      Components: GPT-4 LLM; nearest-neighbor search in a database of correct examples; LLM-generated chain-of-thought examples; multiple samples and ensembling
      Design: Answers medical questions by searching for similar examples to construct a few-shot prompt, adding model-generated chain-of-thought for each example, and generating and judging up to 11 solutions
      Results: Outperforms specialized medical models like Med-PaLM used with simpler prompting strategies

    Gemini on MMLU
      Components: Gemini LLM; custom inference logic
      Design: Gemini's CoT@32 inference strategy for the MMLU benchmark samples 32 chain-of-thought answers from the model, and returns the top choice if enough of them agree, or uses generation without chain-of-thought if not
      Results: 90.04% on MMLU, compared to 86.4% for GPT-4 with 5-shot prompting or 83.7% for Gemini with 5-shot prompting

    ChatGPT Plus
      Components: LLM; Web Browser plugin for retrieving timely content; Code Interpreter plugin for executing Python; DALL-E image generator
      Design: The ChatGPT Plus offering can call tools such as web browsing to answer questions; the LLM determines when and how to call each tool as it responds
      Results: Popular consumer AI product with millions of paid subscribers

    RAG, ORQA, Bing, Baleen, etc.
      Components: LLM (sometimes called multiple times); retrieval system
      Design: Combine LLMs with retrieval systems in various ways, e.g., asking an LLM to generate a search query, or directly searching for the current context
      Results: Widely used technique in search engines and enterprise apps

    Key Challenges in Compound AI Systems

    Compound AI systems pose new challenges in design, optimization and operation compared to AI models.

    Design Space. The range of possible system designs for a given task is vast. For example, even in the simple case of retrieval-augmented generation (RAG) with a retriever and language model, there are: (i) many retrieval and language models to choose from, (ii) other techniques to improve retrieval quality, such as query expansion or reranking models, and (iii) techniques to improve the LLM's generated output (e.g., running another LLM to check that the output relates to the retrieved passages). Developers have to explore this vast space to find a good design. In addition, developers need to allocate limited resources, like latency and cost budgets, among the system components. For example, if you want to answer RAG questions in 100 milliseconds, should you budget to spend 20 ms on the retriever and 80 on the LLM, or the other way around?

    Optimization. Often in ML, maximizing the quality of a compound system requires co-optimizing the components to work well together. For example, consider a simple RAG application where an LLM sees a user question, generates a search query to send to a retriever, and then generates an answer. Ideally, the LLM would be tuned to generate queries that work well for that particular retriever, and the retriever would be tuned to prefer answers that work well for that LLM.
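    As a minimal sketch of the two-step RAG application just described, the snippet below has the model write a search query, retrieves passages with a toy keyword retriever, and then asks the model to answer using them. The llm() function and the tiny corpus are stand-ins for real components.

      # llm() and the keyword retriever below are stand-ins for a real model API and vector search.
      CORPUS = {
          "doc1": "The xT framework models very large images end-to-end on contemporary GPUs.",
          "doc2": "TinyAgent performs function calling locally on a MacBook.",
      }

      def llm(prompt):
          # Stand-in for a real model call; echoes the last line just to stay runnable.
          return prompt.splitlines()[-1]

      def retrieve(query, k=1):
          # Toy retriever: score documents by word overlap with the query.
          scores = {doc_id: len(set(query.lower().split()) & set(text.lower().split()))
                    for doc_id, text in CORPUS.items()}
          return [CORPUS[d] for d in sorted(scores, key=scores.get, reverse=True)[:k]]

      def rag_answer(question):
          query = llm(f"Write a search query for this question.\n{question}")    # step 1: query generation
          passages = "\n".join(retrieve(query))                                  # step 2: retrieval
          return llm(f"Context:\n{passages}\nAnswer the question: {question}")   # step 3: grounded answer

      print(rag_answer("What does TinyAgent do?"))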
In single-model development à la PyTorch, users can easily optimize a model end-to-end because the whole model is differentiable. However, compound AI systems contain non-differentiable components like search engines or code interpreters, and thus require new methods of optimization. Optimizing these compound AI systems is still a new research area; for example, DSPy offers a general optimizer for pipelines of pretrained LLMs and other components, while other systems, like LaMDA, Toolformer, and AlphaGeometry, use tool calls during model training to optimize models for those tools.

Operation

Machine learning operations (MLOps) become more challenging for compound AI systems. For example, while it is easy to track success rates for a traditional ML model like a spam classifier, how should developers track and debug the performance of an LLM agent for the same task, which might use a variable number of “reflection” steps or external API calls to classify a message? We believe that a new generation of MLOps tools will be developed to tackle these problems. Interesting problems include:

  • Monitoring: How can developers most efficiently log, analyze, and debug traces from complex AI systems?
  • DataOps: Because many AI systems involve data-serving components like vector databases, and their behavior depends on the quality of the data served, any focus on operations for these systems should additionally span data pipelines.
  • Security: Research has shown that compound AI systems, such as an LLM chatbot with a content filter, can create unforeseen security risks compared to individual models. New tools will be required to secure these systems.

Emerging Paradigms

To tackle the challenges of building compound AI systems, multiple new approaches are arising in industry and in research. We highlight a few of the most widely used ones, along with examples from our research on tackling these challenges.

Designing AI Systems: Composition Frameworks and Strategies. Many developers are now using “language model programming” frameworks that let them build applications out of multiple calls to AI models and other components. These include component libraries like LangChain and LlamaIndex that developers call from traditional programs, agent frameworks like AutoGPT and BabyAGI that let an LLM drive the application, and tools for controlling LM outputs, like Guardrails, Outlines, LMQL, and SGLang. In parallel, researchers are developing numerous new inference strategies to generate better outputs using calls to models and tools, such as chain-of-thought, self-consistency, WikiChat, RAG, and others.

Automatically Optimizing Quality: DSPy. Coming from academia, DSPy is the first framework that aims to optimize a system composed of LLM calls and other tools to maximize a target metric. Users write an application as calls to LLMs and other tools, provide a target metric such as accuracy on a validation set, and DSPy then automatically tunes the pipeline by creating prompt instructions, few-shot examples, and other parameter choices for each module to maximize end-to-end performance. The effect is similar to end-to-end optimization of a multi-layer neural network in PyTorch, except that the modules in DSPy are not always differentiable layers.
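As a rough sketch of what this looks like in practice, the fragment below wires up a small retrieval-plus-generation pipeline and an optimizer call in the style of DSPy’s public examples from early 2024 (dspy.Module, dspy.Retrieve, dspy.ChainOfThought, BootstrapFewShot). Exact class names may differ across DSPy versions, and the signature strings such as "context, question -> answer" are explained in the next paragraph.

```python
# A hedged sketch in the style of DSPy's early-2024 examples; names and APIs
# may differ across versions. Assumes dspy.settings.configure(lm=..., rm=...)
# has already been called with a language model and a retrieval model.
import dspy
from dspy.teleprompt import BootstrapFewShot

class SimpleRAG(dspy.Module):
    def __init__(self):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=3)
        # "context, question -> answer" is a natural-language signature
        # (see the explanation in the next paragraph).
        self.generate = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        context = self.retrieve(question).passages
        return self.generate(context=context, question=question)

def exact_match(example, pred, trace=None):
    # Target metric computed on a validation set.
    return example.answer.strip().lower() == pred.answer.strip().lower()

# trainset would be a list of dspy.Example objects with question/answer fields;
# the optimizer then searches over instructions and few-shot demos per module.
# optimizer = BootstrapFewShot(metric=exact_match)
# compiled_rag = optimizer.compile(SimpleRAG(), trainset=trainset)
```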
To do that, DSPy leverages the linguistic abilities of LLMs in a clean way: to specify each module, users write a natural-language signature, such as user_question -> search_query, where the names of the input and output fields are meaningful, and DSPy automatically turns this into suitable prompts with instructions, few-shot examples, or even weight updates to the underlying language models.

Optimizing Cost: FrugalGPT and AI Gateways. The wide range of AI models and services available makes it challenging to pick the right one for an application. Moreover, different models may perform better on different inputs. FrugalGPT is a framework that automatically routes inputs to different AI model cascades to maximize quality subject to a target budget. Based on a small set of examples, it learns a routing strategy that can outperform the best LLM services by up to 4% at the same cost, or reduce cost by up to 90% while matching their quality. FrugalGPT is an example of a broader emerging concept of AI gateways or routers, implemented in software like Databricks AI Gateway, OpenRouter, and Martian, to optimize the performance of each component of an AI application. These systems work even better when an AI task is broken into smaller modular steps in a compound system, so the gateway can optimize routing separately for each step.

Operation: LLMOps and DataOps. AI applications have always required careful monitoring of both model outputs and data pipelines to run reliably. With compound AI systems, however, the behavior of the system on each input can be considerably more complex, so it is important to track all the steps taken by the application and their intermediate outputs. Software like LangSmith, Phoenix Traces, and Databricks Inference Tables can track, visualize, and evaluate these outputs at a fine granularity, in some cases also correlating them with data pipeline quality and downstream metrics. In the research world, DSPy Assertions seeks to feed signals from monitoring checks directly back into AI systems to improve outputs, and AI-based quality evaluation methods like MT-Bench, FAVA, and ARES aim to automate quality monitoring.

Conclusion

Generative AI has excited every developer by unlocking a wide range of capabilities through natural language prompting. As developers aim to move beyond demos and maximize the quality of their AI applications, however, they are increasingly turning to compound AI systems as a natural way to control and enhance the capabilities of LLMs. Figuring out the best practices for developing compound AI systems is still an open question, but there are already exciting approaches to aid with design, end-to-end optimization, and operation. We believe that compound AI systems will remain the best way to maximize the quality and reliability of AI applications going forward, and may be one of the most important trends in AI in 2024.

BibTeX for this post:

@misc{compound-ai-blog,
  title={The Shift from Models to Compound AI Systems},
  author={Matei Zaharia and Omar Khattab and Lingjiao Chen and Jared Quincy Davis and Heather Miller and Chris Potts and James Zou and Michael Carbin and Jonathan Frankle and Naveen Rao and Ali Ghodsi},
  howpublished={\url{https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/}},
  year={2024}
}

  • Ghostbuster: Detecting Text Ghostwritten by Large Language Models
    on November 14, 2023 at 12:30 pm

[Figure: The structure of Ghostbuster, our new state-of-the-art method for detecting AI-generated text.]

Large language models like ChatGPT write impressively well, so well, in fact, that they have become a problem. Students have begun using these models to ghostwrite assignments, leading some schools to ban ChatGPT. These models are also prone to producing text with factual errors, so wary readers may want to know whether generative AI tools have been used to ghostwrite news articles or other sources before trusting them.

What can teachers and consumers do? Existing tools to detect AI-generated text sometimes do poorly on data that differs from what they were trained on. In addition, if these models falsely classify real human writing as AI-generated, they can jeopardize students whose genuine work is called into question.

Our recent paper introduces Ghostbuster, a state-of-the-art method for detecting AI-generated text. Ghostbuster works by finding the probability of generating each token in a document under several weaker language models, then combining functions based on these probabilities as input to a final classifier. Ghostbuster doesn’t need to know what model was used to generate a document, nor the probability of generating the document under that specific model. This property makes Ghostbuster particularly useful for detecting text potentially generated by an unknown or black-box model, such as the popular commercial models ChatGPT and Claude, for which probabilities aren’t available. We’re particularly interested in ensuring that Ghostbuster generalizes well, so we evaluated it across a range of ways that text could be generated, including different domains (using newly collected datasets of essays, news, and stories), language models, and prompts.

[Figure: Examples of human-authored and AI-generated text from our datasets.]

Why this Approach?

Many current AI-generated text detection systems are brittle when classifying different types of text (e.g., different writing styles, or different text generation models or prompts). Simpler models that use perplexity alone typically can’t capture more complex features and do especially poorly on new writing domains. In fact, we found that a perplexity-only baseline was worse than random on some domains, including non-native English speaker data. Meanwhile, classifiers based on large language models like RoBERTa easily capture complex features but overfit to the training data and generalize poorly: we found that a RoBERTa baseline had catastrophic worst-case generalization performance, sometimes even worse than the perplexity-only baseline. Zero-shot methods that classify text without training on labeled data, by calculating the probability that the text was generated by a specific model, also tend to do poorly when a different model was actually used to generate the text.

How Ghostbuster Works

Ghostbuster uses a three-stage training process: computing probabilities, selecting features, and classifier training.

Computing probabilities: We converted each document into a series of vectors by computing the probability of generating each word in the document under a series of weaker language models (a unigram model, a trigram model, and two non-instruction-tuned GPT-3 models, ada and davinci).
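To make the probability-computation stage concrete, here is an illustrative sketch (not the authors’ code) that scores a document’s tokens under a toy unigram model. In Ghostbuster itself, the trigram model and the ada/davinci probabilities come from trained models and an API rather than from local counts like these.

```python
# Illustrative sketch of Ghostbuster's first stage: score each token of a
# document under weaker language models and collect the per-token
# (log-)probabilities as vectors for later feature construction.
from collections import Counter
import math

def unigram_logprobs(tokens, corpus_tokens):
    """Per-token log-probabilities under a unigram model with add-one smoothing."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    vocab = len(counts) + 1
    return [math.log((counts[t] + 1) / (total + vocab)) for t in tokens]

corpus = "the cat sat on the mat the dog sat on the rug".split()
document = "the cat sat on the rug".split()

# One probability vector per weak model; Ghostbuster also uses a trigram model
# and token probabilities from non-instruction-tuned GPT-3 models (ada, davinci).
features_per_model = {"unigram": unigram_logprobs(document, corpus)}
print(features_per_model)
```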
Selecting features: We used a structured search procedure to select features, which works by (1) defining a set of vector and scalar operations that combine the probabilities, and (2) searching for useful combinations of these operations using forward feature selection, repeatedly adding the best remaining feature (a brief sketch of this greedy procedure appears below).

Classifier training: We trained a linear classifier on the best probability-based features and some additional manually selected features.

Results

When trained and tested on the same domain, Ghostbuster achieved 99.0 F1 across all three datasets, outperforming GPTZero by 5.9 F1 and DetectGPT by 41.6 F1. Out of domain, Ghostbuster achieved 97.0 F1 averaged across all conditions, outperforming DetectGPT by 39.6 F1 and GPTZero by 7.5 F1. Our RoBERTa baseline achieved 98.1 F1 when evaluated in-domain on all datasets, but its generalization performance was inconsistent. Ghostbuster outperformed the RoBERTa baseline on all domains except out-of-domain creative writing, and had much better out-of-domain performance than RoBERTa on average (a 13.8 F1 margin).

[Figure: Results on Ghostbuster’s in-domain and out-of-domain performance.]

To ensure that Ghostbuster is robust to the range of ways a user might prompt a model, such as requesting different writing styles or reading levels, we evaluated Ghostbuster’s robustness to several prompt variants. It outperformed all other tested approaches on these prompt variants, with 99.5 F1. To test generalization across models, we evaluated performance on text generated by Claude, where Ghostbuster again outperformed all other tested approaches, with 92.2 F1.

AI-generated text detectors have been fooled by lightly editing the generated text. We examined Ghostbuster’s robustness to edits such as swapping sentences or paragraphs, reordering characters, or replacing words with synonyms. Most changes at the sentence or paragraph level didn’t significantly affect performance, though performance decreased smoothly if the text was edited through repeated paraphrasing, using commercial detection evaders such as Undetectable AI, or making numerous word- or character-level changes. Performance was also best on longer documents.

Since AI-generated text detectors may misclassify non-native English speakers’ text as AI-generated, we evaluated Ghostbuster’s performance on non-native English speakers’ writing. All tested models had over 95% accuracy on two of the three tested datasets but did worse on the third set of shorter essays. However, document length may be the main factor here, since Ghostbuster does nearly as well on these documents (74.7 F1) as it does on other out-of-domain documents of similar length (75.6 to 93.1 F1).

Users who wish to apply Ghostbuster to real-world cases of potentially off-limits text generation (e.g., ChatGPT-written student essays) should note that errors are more likely for shorter text, domains far from those Ghostbuster was trained on (e.g., different varieties of English), text by non-native speakers of English, human-edited model generations, or text generated by prompting an AI model to modify a human-authored input. To avoid perpetuating algorithmic harms, we strongly discourage automatically penalizing alleged use of text generation without human supervision. Instead, we recommend cautious, human-in-the-loop use of Ghostbuster if classifying someone’s writing as AI-generated could harm them.
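As promised above, here is an illustrative sketch of the forward feature selection idea behind the “Selecting features” stage: greedily add whichever candidate feature most improves a validation score. This is not the paper’s code or its exact operation set; the toy scorer stands in for cross-validated accuracy of the downstream linear classifier.

```python
# Illustrative sketch of greedy forward feature selection: repeatedly add the
# candidate feature that most improves the score, stopping when nothing helps.
def forward_select(candidates, score, max_features=5):
    """candidates: dict of feature name -> feature column;
    score: callable that rates a list of selected feature names."""
    selected, best_score = [], float("-inf")
    while len(selected) < max_features:
        best_name, best_gain = None, best_score
        for name in candidates:
            if name in selected:
                continue
            trial = score(selected + [name])
            if trial > best_gain:
                best_name, best_gain = name, trial
        if best_name is None:          # no remaining feature improves the score
            break
        selected.append(best_name)
        best_score = best_gain
    return selected

# Toy scorer: rewards a particular pair of features and lightly penalizes size,
# standing in for the accuracy of a linear classifier on the chosen features.
toy = {"mean_logprob": 0, "max_minus_min": 0, "trigram_ratio": 0, "noise": 0}
fake_score = lambda names: len(set(names) & {"mean_logprob", "trigram_ratio"}) - 0.1 * len(names)
print(forward_select(toy, fake_score))  # -> ['mean_logprob', 'trigram_ratio']
```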
Ghostbuster can also help with a variety of lower-risk applications, including filtering AI-generated text out of language model training data and checking if online sources of information are AI-generated.

Conclusion

Ghostbuster is a state-of-the-art AI-generated text detection model, with 99.0 F1 performance across tested domains, representing substantial progress over existing models. It generalizes well to different domains, prompts, and models, and it’s well-suited to identifying text from black-box or unknown models because it doesn’t require access to probabilities from the specific model used to generate the document. Future directions for Ghostbuster include providing explanations for model decisions and improving robustness to attacks that specifically try to fool detectors. AI-generated text detection approaches can also be used alongside alternatives such as watermarking. We also hope that Ghostbuster can help across a variety of applications, such as filtering language model training data or flagging AI-generated content on the web.

Try Ghostbuster here: ghostbuster.app
Learn more about Ghostbuster here: [ paper ] [ code ]
Try guessing if text is AI-generated yourself here: ghostbuster.app/experiment
