Session: Machine Learning - Data Handling and Task Learning
Chair: Zeman Li
Talk 1: PiKE: Adaptive Data Mixing for Multi-Task Learning Under Low Gradient Conflicts
Speaker: Zeman Li
Abstract: Modern machine learning models are trained on diverse datasets and tasks to improve generalization. A key challenge in multi-task learning is determining the optimal data mixing and sampling strategy across different data sources. Prior research in this setting has primarily focused on mitigating gradient conflicts between tasks. However, we observe that many real-world multi-task learning scenarios, such as multilingual training and multi-domain learning in large foundation models, exhibit predominantly positive task interactions with minimal or no gradient conflict. Building on this insight, we introduce PiKE (Positive gradient interaction-based K-task weights Estimator), an adaptive data mixing algorithm that dynamically adjusts task contributions throughout training. PiKE optimizes task sampling to minimize the overall loss, effectively leveraging positive gradient interactions with almost no additional computational overhead. We establish theoretical convergence guarantees for PiKE and demonstrate its superiority over static and non-adaptive mixing strategies. Additionally, we extend PiKE to promote fair learning across tasks, ensuring balanced progress and preventing task underrepresentation. Empirical evaluations on large-scale language model pretraining show that PiKE consistently outperforms existing heuristic and static mixing strategies, leading to faster convergence and improved downstream task performance.
Reference: Li, Z., Deng, Y., Zhong, P., Razaviyayn, M., & Mirrokni, V. (2025). PiKE: Adaptive Data Mixing for Multi-Task Learning Under Low Gradient Conflicts. arXiv preprint arXiv:2502.06244.
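Illustrative note: the sketch below shows one way an adaptive data mixer could use gradient-interaction statistics to update task sampling weights. It is a minimal toy, assuming a multiplicative-weights update driven by the alignment of each task gradient with the weighted mixture gradient; the function names and update rule are our own assumptions, not the published PiKE algorithm.
```python
# Toy adaptive data mixing driven by per-task gradient interactions.
# Hypothetical sketch; not the published PiKE algorithm.
import numpy as np

def update_mixing_weights(task_grads, weights, lr=0.1):
    """Shift sampling weight toward tasks whose gradients align with the
    weighted mixture gradient (positive interaction => more shared progress)."""
    avg_grad = sum(w * g for w, g in zip(weights, task_grads))
    # Alignment score: inner product of each task gradient with the mixture.
    scores = np.array([float(np.dot(g, avg_grad)) for g in task_grads])
    # Multiplicative-weights style update, renormalized to the simplex.
    new_w = weights * np.exp(lr * scores)
    return new_w / new_w.sum()

# Toy usage: three tasks with mostly aligned (low-conflict) gradients.
rng = np.random.default_rng(0)
base = rng.normal(size=100)
task_grads = [base + 0.1 * rng.normal(size=100) for _ in range(3)]
weights = np.ones(3) / 3
for _ in range(10):
    weights = update_mixing_weights(task_grads, weights)
print(weights)  # weights drift toward the most aligned task
```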
Talk 2: Sample Reweighting for Large Models by Leveraging Weights and Losses of Smaller Models
Speaker: Mahdi Salmani
Abstract: Sample reweighting is an effective approach for mitigating the impact of noisy data by adjusting the importance of individual samples, enabling the model to focus on informative examples and improve performance. This is especially valuable for Large Language Models (LLMs), which leverage vast datasets to capture complex language patterns and drive advancements in AI applications. However, finding optimal sample weights with conventional methods, such as that of [1], may be infeasible during the pre-training of larger models because of their high computational cost. One potential solution is to reuse the weights learned by a smaller model directly as the data weights for a larger model. However, as we will see in this talk, this may not be effective: the optimal weight distribution for the smaller model can be too distant from that of the larger model, leading to suboptimal results. Other work [2] uses the losses of a small model to prioritize training data. In this talk, we explore using both the weights and the losses of the smaller model together as an alternative for training the larger model.
References:
[1] Ren, M., et al. (2018). Learning to Reweight Examples for Robust Deep Learning. Proceedings of the 35th International Conference on Machine Learning, PMLR 80:4342-4350.
[2] Mindermann, S., et al. (2022). Prioritized Training on Points that are Learnable, Worth Learning, and Not Yet Learnt. Proceedings of the 39th International Conference on Machine Learning, PMLR 162:15520-15542.
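Illustrative note: one concrete way to combine a small model's per-sample weights with its losses is a geometric blend of the two signals, sketched below. The blend rule, the softmax temperature, and all names are assumptions made for exposition; the talk's actual method may differ.
```python
# Hypothetical blend of a small model's learned sample weights and its
# per-sample losses into data weights for training a larger model.
import torch

def blended_sample_weights(small_weights, small_losses, alpha=0.5, temp=1.0):
    """Geometric blend of normalized small-model weights with a loss-based
    priority score (higher small-model loss => not yet learned => higher
    priority, in the spirit of loss-based data selection)."""
    w = small_weights / small_weights.sum()
    p = torch.softmax(small_losses / temp, dim=0)
    blended = w.pow(alpha) * p.pow(1.0 - alpha)
    return blended / blended.sum()

def weighted_loss(per_sample_losses, weights):
    """Weighted training objective for the large model."""
    return (weights * per_sample_losses).sum()

# Toy usage with random stand-ins for real per-sample quantities.
small_w = torch.rand(8)
small_l = torch.rand(8)
large_l = torch.rand(8, requires_grad=True)  # large model's per-sample losses
loss = weighted_loss(large_l, blended_sample_weights(small_w, small_l))
loss.backward()
```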
Talk 3: Learning Optimal Robust Policies under Observational Data with Causal Transport
Speaker: Ruijia Zhang
Abstract: We propose a causal distributionally robust learning framework that accounts for potential distributional shifts in observational data. To hedge against uncertainty, we introduce a novel ambiguity set based on a two-stage nested transport distance, which characterizes the similarity between the empirical distribution of observational data and the true distribution of potential treatment outcomes. It penalizes deviations in covariates and treatment-specific conditional distributions while preserving the underlying causal structure. We derive a dual reformulation and establish conditions under which the robust optimization problem admits a linear programming representation, ensuring computational tractability.
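Illustrative note: for orientation, a generic transport-based distributionally robust objective has the form below; the talk's two-stage nested causal transport distance refines the ambiguity ball while preserving the causal structure. The notation is our own, not the speaker's.
```latex
% Generic transport-based DRO objective (our notation, for orientation only):
%   \widehat{P}_n : empirical distribution of the observational data
%   d_{\mathrm{CT}} : a (causal) transport distance, \rho : ambiguity radius
%   Y(T) : potential outcome under treatment T, \pi : policy in class \Pi
\[
\min_{\pi \in \Pi}\;
\sup_{Q \,:\, d_{\mathrm{CT}}(Q,\,\widehat{P}_n) \le \rho}
\mathbb{E}_{Q}\bigl[\ell(\pi; X, T, Y(T))\bigr]
\]
```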