Proton's biased article on Deepseek

JOMusic@lemmy.ml · 4 months ago

Proton's biased article on Deepseek

lily33@lemm.ee · edit-2 4 months ago

What makes these consumer-oriented models different is that that rather than being trained on raw data, they are trained on synthetic data from pre-existing models. That’s what the “Qwen” or “Llama” parts mean in the name. The 7B model is trained on synthetic data produced by Qwen, so it is effectively a compressed version of Qen. However, neither Qwen nor Llama can “reason,” they do not have an internal monologue.

You got that backwards. They’re other models - qwen or llama - fine-tuned on synthetic data generated by Deepseek-R1. Specifically, reasoning data, so that they can learn some of its reasoning ability.

But the base model - and so the base capability there - is that of the corresponding qwen or llama model. Calling them “Deepseek-R1-something” doesn’t change what they fundamentally are, it’s just marketing.

pcalau12i@lemmygrad.ml · edit-2 4 months ago

There is no “fundamentally” here, you are referring to some abstraction that doesn’t exist. The models are modified during the fine-tuning process, and the process trains them to learn to adopt DeepSeek R1’s reasoning technique. You are acting like there is some “essence” underlying the model which is the same between the original Qwen and this model. There isn’t. It is a hybrid and its own thing. There is no such thing as “base capability,” the model is not two separate pieces that can be judged independently. You can only evaluate the model as a whole. Your comment is just incredibly bizarre to respond to because you are referring to non-existent abstractions and not actually speaking of anything concretely real.

The model is neither Qwen nor DeepSeek R1, it is DeepSeek R1 Qwen Distill as the name says. it would be like saying it’s false advertising to say a mule is a hybrid of a donkey and a horse because the “base capabilities” is a donkey and so it has nothing to do with horses, and it’s really just a donkey at the end of the day. The statement is so bizarre I just do not even know how to address it. It is a hybrid, it’s its own distinct third thing that is a hybrid of them both. The model’s capabilities can only be judged as it exists, and its capabilities differ from Qwen and the original DeepSeek R1 as actually scored by various metrics.

Do you not know what fine-tuning is? It refers to actually adjusting the weights in the model, and it is the weights that define the model. And this fine-tuning is being done alongside DeepSeek R1, meaning it is being adjusted to take on capabilities of R1 within the model. It gains R1 capabilities at the expense of Qwen capabilities as DeepSeek R1 Qwen Distill performs better on reasoning tasks but actually not as well as baseline models on non-reasoning tasks. The weights literally have information both of Qwen and R1 within them at the same time.

Speaking of its “base capabilities” is a meaningless floating abstraction which cannot be empirically measured and doesn’t refer to anything concretely real. It only has its real concrete capabilities, not some hypothetical imagined capabilities. You accuse them of “marketing” even though it is literally free. All DeepSeek sells is compute to run models, but you can pay any company to run these distill models. They have no financial benefit for misleading people about the distill models.

You genuinely are not making any coherent sense at all, you are insisting a hybrid model which is objectively different and objectively scores and performs differently should be given the exact same name, for reasons you cannot seem to actually articulate. It clearly needs a different name, and since it was created utilizing the DeepSeek R1 model’s distillation process to fine-tune it, it seems to make sense to call it DeepSeek R1 Qwen Distill. Yet for some reason you insist this is lying and misrepresenting it and it actually has literally nothing to do with DeepSeek R1 at all and it should just be called Qwen and we should pretend it is literally the same model despite it not being the same model as its training weights are different (you can do a “diff” on the two model files if you don’t believe me!) and it performs differently on the same metrics.

There is simply no rational reason to intentionally want to mislabel the model as just being Qwen and having no relevance to DeepSeek R1. You yourself admitted that the weights are trained on R1 data so they necessarily contain some R1 capabilities. If DeepSeek was lying and trying to hide that the distill models are based on Qwen and Llama, they wouldn’t have literally put that in the name to let everyone know, and released a paper explaining exactly how those were produced.

It is clear to me that you and your other friends here have some sort of alternative agenda that makes you not want to label it correctly. DeepSeek is open about the distill models using Qwen and Llama, but you want them to be closed and not reveal that they also used DeepSeek R1. The current name for it is perfectly fine and pretending it is just a Qwen model (or Llama, for the other distilled versioned) is straight-up misinformation, and anyone who downloads the models and runs them themselves will clearly see immediately that they perform differently. It is a hybrid model correctly called what they are: DeepSeek R1 Qwen Distill and DeepSeek R1 Llama Distill.