AI Research Highlights | Week 46, 2023
1. MFTCoder: Boosting Code LLMs with Multitask Fine-Tuning
Researchers from the Ant group proposed MFTCoder, an open-source project of CodeFuse for multitasking Code-LLMs, which includes models, datasets, training codebases, and inference guides. The focus is on addressing the issues of data balance and convergence speed that commonly arise in previous multitask finetuning methods. Extensive experiments show that the MFT approach outperforms individual fine-tuning for each task or data merging from multiple tasks. Notably, when implementing MFTCoder with the CodeLlama-34B-Python base model, it achieves an impressive pass@1 score of 74.4% on the humanEval evaluation dataset, surpassing the performance of GPT-4 (67%, zero-shot) The project can be found here.
2. Prompt Cache: Modular Attention Reuse for Low-Latency Inference
The researchers from Yale University introduced Prompt Cache, an acceleration technique based on the insight that attention states can be reused across LLM prompts. Prompt Cache utilizes a prompt schema to delineate such reused text segments, formulating them into a modular and positionally coherent structure termed “prompt modules”. This allows LLM users to incorporate these modules seamlessly into their prompts, thereby leveraging them for context with negligible latency implications. Our evaluations on benchmark data sets indicate TTFT latency reductions of up to 8× on GPUs and 60× on CPUs.
3. Rephrase and Respond: Let Large Language Models Ask Better Questions for Themselves
Researchers from UCLA presented a method named Rephrase and Respond (RaR), which allows LLMs to rephrase and expand questions posed by humans and provide responses in a single prompt. They also introduced a two-step variant of RaR, where a rephrasing LLM first rephrases the question and then passes the original and rephrased questions together to a different responding LLM. This facilitates the effective utilization of rephrased questions generated by one LLM with another. RaR significantly improved the performance of different models across a wide range of tasks. What's more, the authors showed that RaR is complementary to CoT and can be combined with CoT to achieve even better performance. The project can be found here.
4. Can LLMs Follow Simple Rules?
This paper proposed Rule-following Language Evaluation Scenarios (RuLES) as a challenging new setting for research into exploring and defending against both manual and automatic attacks on LLMs. RuLES consists of 15 simple text scenarios in which the model is instructed to obey a set of rules in natural language while interacting with the human user. Each scenario has a concise evaluation program to determine whether the model has broken any rules in a conversation. Through manual exploration of model behavior in our scenarios, the researchers identified 6 categories of attack strategies and collected two suites of test cases: one consisting of unique conversations from manual testing and one that systematically implements strategies from the 6 categories.
5. OtterHD: A High-Resolution Multi-modality Model
Researchers from Nanyang Technological University presented the OtterHD-8B model, which builds on the innovative architecture of Fuyu-8B. This model effectively processes images of various resolutions, moving away from the traditional limitation of fixed-resolution inputs seen in most LMMs. Specifically designed for following instructions, OtterHD-8B excels in dealing with high-resolution images. This becomes especially evident when tested against the new MagnifierBench benchmark that is designed to evaluate the capability of LMMs to discern fine details in complex scenes, highlighting the crucial role of resolution flexibility in contemporary LMMs. The results not only spotlight the promise of Fuyu-like architectures for future studies but also underscore the need for benchmarks like MagnifierBench to rigorously test LLMs’ fine-grained perception.
6. LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
LLaVA-Plus is a general-purpose multimodal assistant that expands the capabilities of large multimodal models. It maintains a skill repository of pre-trained vision and vision-language models and can activate relevant tools based on users' inputs to fulfill real-world tasks. LLaVA-Plus is trained on multimodal instruction-following data to acquire the ability to use tools, covering visual understanding, generation, external knowledge retrieval, and compositions. Empirical results show that LLaVA-Plus outperforms LLaVA in existing capabilities and exhibits new ones. It is distinct in that the image query is directly grounded and actively engaged throughout the entire human-AI interaction sessions, significantly improving tool use performance and enabling new scenarios. The project can be found here.
7. Cognitively Inspired Components for Social Conversational Agents
This paper presented a survey highlighting a potential solution to both categories of problems through the introduction of cognitively inspired additions to current conversational agents (CA). Through computational facsimiles of semantic and episodic memory, emotion, working memory, and the ability to learn, it is possible to address both the technical and social problems encountered by CAs, such as scope with retrieval agents and the often nonsensical answers of former generative agents.
8. JARVIS-1: Open-world Multi-task Agents with Memory-Augmented Multimodal Language Models
This paper proposed a multi-task agent JARVIS-1 designed for the complex environment of Minecraft, which marks a significant advancement in achieving human-like planning within an open-world setting. By leveraging pre-trained Multi-modal LMs, JARVIS-1 not only effectively interprets multimodal inputs but also adeptly translates them into actions. Its integration of a multimodal memory, which draws from both ingrained knowledge and real-time game experiences, enhances its decision-making capabilities. Notably, its achievement in the long-horizon diamond pickaxe task, where it achieved a completion rate that surpasses VPT by up to five times, underscores its potential and the strides made in this domain. The project can be found here.
9. Lumos: Learning Agents with Unified Data, Modular Design, and Open-Source LLMs
The researchers introduced Lumos, Language Agents with Unified Data Formats, Modular Design, and Open-Source LLMs. Lumos consists of planning, grounding, and execution modules built based on LLAMA-2-7B and off-the-shelf APIs. It utilizes a unified data format that encompasses multiple task types and is trained with ~40K diverse high-quality subgoal/action annotations from ground-truth reasoning steps in existing benchmarks with GPT-4. Lumos is comparable or even beats GPT-series agents on web/complex QA tasks Mind2Web and HotpotQA, and larger open agents on math tasks. The project can be found here.
10. ADaPT: As-Needed Decomposition and Planning with Language Models
This paper introduced As-Needed Decomposition and Planning for complex Tasks (ADaPT), an approach that explicitly plans and decomposes complex sub-tasks when the LLM is unable to execute them. ADaPT recursively decomposes sub-tasks to adapt to both task complexity and LLM capability. A complex task is first assigned to the executor. If the executor does not succeed, then ADaPT calls the planner to decompose the task into sub-tasks along with a logical operator ("And" or "Or" ) indicating how to compose them. Each sub-task (or step) is then assigned recursively to ADaPT and is combined using the logical operator. In the end, the success of sub-tasks after recursive decomposition ensures overall task success. ADaPT outperformed methods like ReAct, Reflexion, Plan-and-Execute in ALFWorld, WebShop, and TextCraft.
*The researchers behind the publications deserve full credit for their work.